Hi,
irvanherz wrote:
Do you think the kernel should have a function to support more than one type of character encoding (ASCII and UTF), or focus on unicode only?
I think the majority of the OS (VFS, file systems, help system, any logging, all programming languages, all APIs for GUI or command line, ...) should support UTF-8 and nothing else. The only exceptions are that a few applications (text editor, web browser) may convert data from other encodings into UTF-8 for compatibility purposes (e.g. in case the user opens a file encoded as UTF-16); and that code that converts strings into pixel data ("font renderer") may internally convert to UTF-32 if it makes things easier, and file formats used for font data may be designed around "UTF-32 indexing".
irvanherz wrote:
What I'm confused about seems like this:
- Implement kprintf (AsciiString * string_obj) and something like kprintf_w (Utf8String * string_obj)
OR
- Just implement kprintf (String * string_obj); let's just say the default character encoding in our kernel is UTF-8
As a micro-kernel fan, I'd only ever have an "append string to kernel log" function, where anything (in user space) can ask to be notified when the kernel log changes, including (e.g.) "kernel log viewer" applications, and including the VFS process (which may write the kernel log to disk).
Functions like "printf()" and "kprintf()" are inefficient (they require run-time parsing of the format string) and considerably complex; they're inferior to "string builder" approaches (e.g. "cout" in C++), where you end up with small/simple functions/methods that convert pieces into sub-strings that are concatenated, and where there's no run-time parsing of a format string.
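A minimal sketch of the "string builder" idea in C (the `StringBuilder` type and the `sb_append_*` names are invented for illustration): each kind of piece gets its own small conversion function, so nothing ever parses a format string at run time.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical string builder: a fixed buffer plus one small
 * conversion function per kind of piece. */
typedef struct {
    char   data[256];
    size_t len;
} StringBuilder;

static void sb_append_str(StringBuilder *sb, const char *s) {
    while (*s && sb->len < sizeof(sb->data) - 1) {
        sb->data[sb->len++] = *s++;
    }
    sb->data[sb->len] = '\0';
}

static void sb_append_hex(StringBuilder *sb, uint32_t value) {
    char tmp[11] = "0x";
    for (int i = 0; i < 8; i++) {
        tmp[2 + i] = "0123456789ABCDEF"[(value >> (28 - i * 4)) & 0xF];
    }
    tmp[10] = '\0';
    sb_append_str(sb, tmp);
}
```

Usage would be along the lines of `sb_append_str(&sb, "address = "); sb_append_hex(&sb, addr);`, with more `sb_append_*` functions (decimal, strings from other sources, etc.) added as needed.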
Note that part of the reason for this is atomicity: the ability to build a temporary string from many pieces, and then do an "atomic append" or "atomic write" of all the pieces at once. In some circumstances this is very important. For example, for the kernel log (where many CPUs might be appending at the same time) you don't want the log to become a jumbled mess (e.g. one CPU writes "foo" while another writes "bar" and you end up with "fboaor"). You also don't want the hassle of explicitly managing a "kernel log lock" (e.g. acquire the lock, print many lines of "memory map" with many newline characters, then release the lock, just to make sure that nothing else adds unrelated lines in the middle of the memory map). And you don't want excessive "kernel log lock contention" (caused by CPUs doing the extra work of converting many pieces while the lock is held, instead of doing that work before the lock is acquired).
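To sketch the atomicity point (all names here are invented; a real kernel's locking would be more involved): the caller builds the whole multi-piece message first, so the critical section is nothing but a length check and a `memcpy`.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical shared kernel log protected by a spinlock. */
static char        kernel_log[4096];
static size_t      kernel_log_len;
static atomic_flag kernel_log_lock = ATOMIC_FLAG_INIT;

/* The caller builds the whole (possibly multi-line) message before
 * calling this, so the lock is only held for one short memcpy and
 * messages from different CPUs can never interleave. */
static void klog_atomic_append(const char *msg, size_t len) {
    while (atomic_flag_test_and_set_explicit(&kernel_log_lock,
                                             memory_order_acquire)) {
        /* spin; a real kernel might pause or yield here */
    }
    if (kernel_log_len + len <= sizeof(kernel_log)) {
        memcpy(kernel_log + kernel_log_len, msg, len);
        kernel_log_len += len;
    }
    atomic_flag_clear_explicit(&kernel_log_lock, memory_order_release);
}
```

Because each CPU does all of the expensive piece-conversion before acquiring the lock, contention stays low and the "fboaor" interleaving can't happen.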
More notes:
- For security purposes, you want to ensure that it's impossible for processes to create file names that can't be typed and/or can't be displayed. This means that you can't do things like UTF-8 normalisation (or "UTF-8 canonicalisation") in user space and then assume that user space isn't malicious (e.g. didn't deliberately do it wrong so that software that does it correctly is unable to construct a matching file name; and didn't deliberately provide a file name consisting of zero-width spaces, control characters or invalid UTF-8 bytes to prevent the file name from being displayed). For this reason I'd suggest that the VFS layer (which naturally must be "trusted" anyway) is the best place to do sanity checks and things like UTF-8 normalisation/canonicalisation.
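As a rough sketch of the VFS-side sanity check (the function name is invented, and a real implementation would also do normalisation): reject structurally invalid UTF-8, overlong encodings, surrogates, control characters, and (as one representative "can't be displayed" character) the zero-width space.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical VFS-side check: reject names the user could never
 * type or see. Validation only; normalisation would come after. */
static bool vfs_name_is_sane(const uint8_t *name, size_t len) {
    size_t i = 0;
    while (i < len) {
        uint32_t cp;
        size_t   extra;
        uint8_t  b = name[i];

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return false;                  /* stray continuation byte, etc. */

        if (i + extra >= len) return false; /* truncated sequence */
        for (size_t j = 1; j <= extra; j++) {
            if ((name[i + j] & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (name[i + j] & 0x3F);
        }
        /* reject overlong encodings, surrogates, out-of-range values */
        static const uint32_t min_cp[4] = { 0, 0x80, 0x800, 0x10000 };
        if (cp < min_cp[extra] || cp > 0x10FFFF) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;
        /* control characters and zero-width space can't be typed/displayed */
        if (cp < 0x20 || cp == 0x7F || cp == 0x200B) return false;

        i += extra + 1;
    }
    return len > 0;
}
```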
- For compatibility purposes, different ("non-native") file systems have different requirements (case sensitivity, allowed/disallowed characters, character encodings, name lengths, ...). This means that for a good/modular approach (where most of a file system's details are abstracted) there needs to be some cooperation between VFS and file system modules, where the file system code hides differences where possible (and does any conversion from UTF-8 to whatever encoding the file system expects) but the VFS has to be informed of differences that the file system code can't reasonably hide. This cooperation is not easy to design.
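One possible shape for that cooperation (everything here is invented for illustration): each file system module exports a small description of the differences it can't hide, and the VFS consults it instead of hard-coding per-file-system knowledge.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-file-system "traits" the module reports to the VFS. */
typedef struct {
    const char *fs_name;
    bool        case_sensitive;
    size_t      max_name_bytes;                  /* in the fs's own encoding */
    bool      (*name_allowed)(const char *name); /* fs-specific character rules */
} FsTraits;

/* Example: a FAT-like file system forbids a handful of ASCII characters. */
static bool fatlike_name_allowed(const char *name) {
    return strpbrk(name, "\\/:*?\"<>|") == NULL;
}

static const FsTraits fatlike_traits = {
    .fs_name        = "fatlike",
    .case_sensitive = false,
    .max_name_bytes = 255,
    .name_allowed   = fatlike_name_allowed,
};

/* VFS-side check before asking the fs module to create a name. */
static bool vfs_name_ok_for_fs(const FsTraits *t, const char *name) {
    return strlen(name) <= t->max_name_bytes && t->name_allowed(name);
}
```

This only covers the easy cases (lengths, forbidden characters); encoding conversion and case sensitivity still need per-file-system code and more VFS involvement, which is where the design gets hard.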
- Case insensitivity is nasty. For example (for compatibility purposes), a different OS that is case sensitive might create files where the only difference between the file names is case (e.g. three files called "FOO", "Foo" and "foo", all in the same directory), and a case insensitive OS will be unable to handle that correctly (it will never be able to access some of those files by name). Also note that case conversion (converting everything to the same case for a case insensitive comparison) is complex and locale dependent (for one example, the result of converting 'i' to upper case depends on whether the locale is Turkish or not), and is something I'd rather avoid dealing with.
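To make the locale problem concrete, here's the kind of naive comparison (names invented) that a case insensitive file system might use, and exactly where it goes wrong: in a Turkish locale the upper case of 'i' is 'İ' (U+0130) and the lower case of 'I' is dotless 'ı' (U+0131), so 'i' and 'I' are not the same letter there and this table-style folding gives the wrong answer.

```c
#include <stdbool.h>

/* Naive ASCII-only case folding. This is precisely the shortcut that
 * breaks for Turkish: toupper('i') should be 'İ' (U+0130) and
 * tolower('I') should be 'ı' (U+0131) in that locale, so treating
 * 'i' and 'I' as equal is wrong there. */
static char ascii_lower(char c) {
    return (c >= 'A' && c <= 'Z') ? (char)(c + ('a' - 'A')) : c;
}

static bool naive_name_equal(const char *a, const char *b) {
    while (*a && *b) {
        if (ascii_lower(*a++) != ascii_lower(*b++)) return false;
    }
    return *a == *b;
}
```

Doing this correctly means full Unicode case folding plus locale awareness inside the file system code, which is a strong argument for staying case sensitive.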
Cheers,
Brendan