Hi,
irvanherz wrote:
Do you think the kernel should have a function to support more than one type of character encoding (ASCII and UTF), or focus on unicode only?
I think the majority of the OS (VFS, file systems, help system, any logging, all programming languages, all APIs for GUI or command line, ...) should support UTF-8 and nothing else. The only exceptions are that a few applications (text editor, web browser) may convert data from other encodings into UTF-8 for compatibility purposes (e.g. in case the user opens a file encoded as UTF-16); and that code that converts strings into pixel data ("font renderer") may internally convert to UTF-32 if it makes things easier, and file formats used for font data may be designed around "UTF-32 indexing".
irvanherz wrote:
What I'm confused about seems like this:
- Implement kprintf (AsciiString * string_obj) and something like kprintf_w (Utf8String * string_obj)
OR
- Just implement kprintf (String * string_obj); let's just say the default character encoding in our kernel is UTF-8
As a micro-kernel fan, I'd only ever have an "append string to kernel log" function, where anything (in user space) can ask to be notified when the kernel log changes, including (e.g.) "kernel log viewer" applications, and including the VFS process (which may write the kernel log to disk).
Functions like "printf()" and "kprintf()" are inefficient (they require run-time parsing of the format string) and considerably complex; they're inferior to "string builder" approaches (e.g. "cout" in C++), where you end up with small/simple functions/methods that convert pieces into sub-strings that are concatenated, and where there's no run-time parsing of a format string.
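A minimal sketch of the "string builder" idea in C (the `StringBuilder` type and the `sb_append_*` names are invented for illustration): each kind of piece gets its own small conversion function, so nothing ever parses a format string at run time.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical string builder: a fixed buffer plus one small
 * conversion function per kind of piece. */
typedef struct {
    char   data[256];
    size_t len;
} StringBuilder;

static void sb_append_str(StringBuilder *sb, const char *s) {
    while (*s && sb->len < sizeof(sb->data) - 1) {
        sb->data[sb->len++] = *s++;
    }
    sb->data[sb->len] = '\0';
}

static void sb_append_hex(StringBuilder *sb, uint32_t value) {
    char tmp[11] = "0x";
    for (int i = 0; i < 8; i++) {
        tmp[2 + i] = "0123456789ABCDEF"[(value >> (28 - i * 4)) & 0xF];
    }
    tmp[10] = '\0';
    sb_append_str(sb, tmp);
}
```

Usage would be along the lines of `sb_append_str(&sb, "address = "); sb_append_hex(&sb, addr);`, with more `sb_append_*` functions (decimal, strings from other sources, etc.) added as needed.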
Note that part of the reason for this is atomicity: the ability to build a temporary string from many pieces, and then do an "atomic append" or "atomic write" of all the pieces at once. In some circumstances this is very important. For example, for the kernel log (where many CPUs might be appending at the same time) you don't want the log to become a jumbled mess (e.g. one CPU writes "foo" while another writes "bar" and you end up with "fboaor"). You also don't want the hassle of explicitly managing a "kernel log lock" (e.g. acquire the lock, print many lines of "memory map" with many newline characters, then release the lock, just to make sure that nothing else adds unrelated lines in the middle of the memory map). And you don't want excessive "kernel log lock contention" (caused by CPUs doing the extra work of converting many pieces while the lock is held, instead of doing that work before the lock is acquired).
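To sketch the atomicity point (all names here are invented; a real kernel's locking would be more involved): the caller builds the whole multi-piece message first, so the critical section is nothing but a length check and a `memcpy`.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical shared kernel log protected by a spinlock. */
static char        kernel_log[4096];
static size_t      kernel_log_len;
static atomic_flag kernel_log_lock = ATOMIC_FLAG_INIT;

/* The caller builds the whole (possibly multi-line) message before
 * calling this, so the lock is only held for one short memcpy and
 * messages from different CPUs can never interleave. */
static void klog_atomic_append(const char *msg, size_t len) {
    while (atomic_flag_test_and_set_explicit(&kernel_log_lock,
                                             memory_order_acquire)) {
        /* spin; a real kernel might pause or yield here */
    }
    if (kernel_log_len + len <= sizeof(kernel_log)) {
        memcpy(kernel_log + kernel_log_len, msg, len);
        kernel_log_len += len;
    }
    atomic_flag_clear_explicit(&kernel_log_lock, memory_order_release);
}
```

Because each CPU does all of the expensive piece-conversion before acquiring the lock, contention stays low and the "fboaor" interleaving can't happen.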
More notes:
- For security purposes, you want to ensure that it's impossible for processes to create file names that can't be typed and/or can't be displayed. This means that you can't do things like UTF-8 normalisation (or "UTF-8 canonicalisation") in user space and then assume that user space isn't malicious (e.g. didn't deliberately do it wrong so that software that does it correctly is unable to construct a matching file name; and didn't deliberately provide a file name consisting of zero-width spaces, control characters or invalid UTF-8 bytes to prevent the file name from being displayed). For this reason I'd suggest that the VFS layer (which naturally must be "trusted" anyway) is the best place to do sanity checks and things like UTF-8 normalisation/canonicalisation.
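As a rough sketch of the VFS-side sanity check (the function name is invented, and a real implementation would also do normalisation): reject structurally invalid UTF-8, overlong encodings, surrogates, control characters, and (as one representative "can't be displayed" character) the zero-width space.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical VFS-side check: reject names the user could never
 * type or see. Validation only; normalisation would come after. */
static bool vfs_name_is_sane(const uint8_t *name, size_t len) {
    size_t i = 0;
    while (i < len) {
        uint32_t cp;
        size_t   extra;
        uint8_t  b = name[i];

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return false;                  /* stray continuation byte, etc. */

        if (i + extra >= len) return false; /* truncated sequence */
        for (size_t j = 1; j <= extra; j++) {
            if ((name[i + j] & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (name[i + j] & 0x3F);
        }
        /* reject overlong encodings, surrogates, out-of-range values */
        static const uint32_t min_cp[4] = { 0, 0x80, 0x800, 0x10000 };
        if (cp < min_cp[extra] || cp > 0x10FFFF) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;
        /* control characters and zero-width space can't be typed/displayed */
        if (cp < 0x20 || cp == 0x7F || cp == 0x200B) return false;

        i += extra + 1;
    }
    return len > 0;
}
```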
- For compatibility purposes, different ("non-native") file systems have different requirements (case sensitivity, allowed/disallowed characters, character encodings, name lengths, ...). This means that for a good/modular approach (where most of a file system's details are abstracted) there needs to be some cooperation between VFS and file system modules, where the file system code hides differences where possible (and does any conversion from UTF-8 to whatever encoding the file system expects) but the VFS has to be informed of differences that the file system code can't reasonably hide. This cooperation is not easy to design.
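One possible shape for that cooperation (everything here is invented for illustration): each file system module exports a small description of the differences it can't hide, and the VFS consults it instead of hard-coding per-file-system knowledge.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-file-system "traits" the module reports to the VFS. */
typedef struct {
    const char *fs_name;
    bool        case_sensitive;
    size_t      max_name_bytes;                  /* in the fs's own encoding */
    bool      (*name_allowed)(const char *name); /* fs-specific character rules */
} FsTraits;

/* Example: a FAT-like file system forbids a handful of ASCII characters. */
static bool fatlike_name_allowed(const char *name) {
    return strpbrk(name, "\\/:*?\"<>|") == NULL;
}

static const FsTraits fatlike_traits = {
    .fs_name        = "fatlike",
    .case_sensitive = false,
    .max_name_bytes = 255,
    .name_allowed   = fatlike_name_allowed,
};

/* VFS-side check before asking the fs module to create a name. */
static bool vfs_name_ok_for_fs(const FsTraits *t, const char *name) {
    return strlen(name) <= t->max_name_bytes && t->name_allowed(name);
}
```

This only covers the easy cases (lengths, forbidden characters); encoding conversion and case sensitivity still need per-file-system code and more VFS involvement, which is where the design gets hard.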
- Case insensitivity is nasty. For example (for compatibility purposes), a different OS that is case sensitive might create files where the only difference between the file names is case (e.g. three files called "FOO", "Foo" and "foo", all in the same directory), and a case insensitive OS will be unable to handle that correctly (it will never be able to access some of those files by name). Also note that case conversion (converting everything to the same case for a case insensitive comparison) is complex and locale dependent (for one example, the result of converting 'i' to upper case depends on whether the locale is Turkish or not), and is something I'd rather avoid dealing with.
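To make the locale problem concrete, here's the kind of naive comparison (names invented) that a case insensitive file system might use, and exactly where it goes wrong: in a Turkish locale the upper case of 'i' is 'İ' (U+0130) and the lower case of 'I' is dotless 'ı' (U+0131), so 'i' and 'I' are not the same letter there and this table-style folding gives the wrong answer.

```c
#include <stdbool.h>

/* Naive ASCII-only case folding. This is precisely the shortcut that
 * breaks for Turkish: toupper('i') should be 'İ' (U+0130) and
 * tolower('I') should be 'ı' (U+0131) in that locale, so treating
 * 'i' and 'I' as equal is wrong there. */
static char ascii_lower(char c) {
    return (c >= 'A' && c <= 'Z') ? (char)(c + ('a' - 'A')) : c;
}

static bool naive_name_equal(const char *a, const char *b) {
    while (*a && *b) {
        if (ascii_lower(*a++) != ascii_lower(*b++)) return false;
    }
    return *a == *b;
}
```

Doing this correctly means full Unicode case folding plus locale awareness inside the file system code, which is a strong argument for staying case sensitive.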
Cheers,
Brendan