Ethin wrote:
You would be right about the flexibility of loadable kernel modules if it weren't for the fact that a loadable kernel module can easily bring down the entire system.
Sure, agreed 100%. My point was that if the microkernel as a whole (including its services) is unreliable (e.g. up and working 99.9% of the time is unreliable to me), even intermittent failures are just as unacceptable as a module crashing the whole system. The tolerance for bugs is very low. In other words, complex systems that partially fail without fully crashing are, to me, generally worse than systems which either work or crash. I'm all for very intensive testing, rather than "exception-driven" systems where errors happen all the time, almost as normal routine.
If you write all your code in a completely unsafe language like C, where a small bug could crash the whole kernel, but you have 100% line coverage and a ton of tests (unit tests, system tests, stress tests, plus static analysis, etc.), why would you care? I'd rather have powerful and exhaustive tests than fewer tests and a claim that my code is "safe". My assumption is that kernel code must not crash. Under that assumption and world-view, the services in a microkernel architecture are mostly overhead.
Ethin wrote:
- Make the service modular using an embeddable scripting language like JS or Lua.
- Use a failover instance while you update the master instance.
I'm very much against using managed languages for kernel code.
The "limitation" of native languages is not a problem for me.
Anyway, I fully recognize the advantage of the failover strategy you mentioned. It's clearly more flexible than what a monolithic kernel can typically do. It's probably worth saying that, theoretically, something like this could be implemented with modules as well: it would require supporting two loaded modules for the same thing, plus an atomic switch from one to the other (see the sketch below). But, yeah, typically monolithic kernels don't support fancy features like that.
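Just to make that concrete, here's a minimal sketch of the "two loaded modules + atomic switch" idea in C. The fs_ops table, active_fs pointer and fs_switch_to() are all made-up names for the sake of the example, not anything from a real kernel:

/* Callers always reach the module through one atomic pointer,
 * so switching implementations is a single pointer flip. */
#include <stdatomic.h>
#include <stddef.h>

struct fs_ops {
    int (*read)(const char *path, void *buf, size_t len);
    int (*write)(const char *path, const void *buf, size_t len);
};

/* Currently active implementation (module A or module B). */
static _Atomic(struct fs_ops *) active_fs;

static inline int fs_read(const char *path, void *buf, size_t len)
{
    struct fs_ops *ops = atomic_load_explicit(&active_fs, memory_order_acquire);
    return ops->read(path, buf, len);
}

/* "Failover": load the new module, then flip the pointer in one step. */
void fs_switch_to(struct fs_ops *new_ops)
{
    atomic_store_explicit(&active_fs, new_ops, memory_order_release);
}

Of course, before unloading the old module you'd still have to drain in-flight calls out of it (refcounting, RCU-style grace periods, etc.), which is exactly the kind of machinery monolithic kernels usually don't bother to provide.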
Ethin wrote:
I think that the overhead of message passing has been somewhat, if not entirely, mitigated due to the advent of things like COW and, of course, the ability to share pages (mapping the same page in multiple address spaces simultaneously, such that reads and writes from both ends are reflected in all processes where its mapped) (I like to call this "page mirroring"). Your "message passing" then boils down to notifying the target process that a message is available for reading. And if you use raw binary representation, and don't bother with "packaging" it up into some weird format, that's one less thing you have to worry about. Of course, locking is an issue, and you'll have to figure out how to signal processes when they can and can't write/read the data, but IPC synchronization mechanisms are not a new concept and exist already. Just my thoughts.
Those are the tricks I was mentioning. Here are the drawbacks:
1. No matter how much memory sharing you do, you cannot avoid the extra context switches, and they are expensive if they happen often. Here again there is a whole set of smart optimizations that could be applied, none of which completely eliminates the problem, while all of them increase complexity. In a monolithic kernel we would have just used simple function calls instead of complex async queue systems (see the sketch after this list).
2. The more memory sharing you do, the further you get from the "true" microkernel architecture with message passing. You also increase the likelihood of a crash affecting multiple services, and therefore the whole system. So memory sharing is an unsafe technique that at least partially breaks the microkernel architecture.
3. If you start using managed languages like Lua, you really have to pay the full price of serializing and deserializing binary data; memory sharing goes off the table. Also, the latency of such languages is significantly higher than that of native languages. I wouldn't use a kernel that's slower than it could be. But forget me: at large scale (the cloud) the economics would favor the more efficient solution. Think about how much it would cost to add +30% overhead to 10 million machines. +30% is not a random number: to the best of my knowledge, it's the typical overhead a good microkernel has over a monolithic one. I believe Prof. Tanenbaum himself mentioned that number. He believes it's totally worth paying that extra price for the increased stability and flexibility. Many people disagree. If it were something like +1%, there probably would have been no discussion.
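To illustrate points 1 and 2 with something concrete, here's a rough C sketch of the shared-page approach you describe. notify_service() and wait_for_reply() are placeholders for whatever wakeup/blocking primitive the kernel would expose, and the struct layouts are invented for the example, so take this as an assumption about the shape of the API rather than any real kernel's interface:

/* One page mapped into both the client and the driver service. */
#include <stdatomic.h>
#include <stdint.h>

#define RING_SLOTS 64

struct request {
    uint32_t op;            /* raw binary payload, no serialization */
    uint32_t len;
    uint8_t  data[504];
};

struct shared_ring {
    _Atomic uint32_t head;  /* written by the producer (client) */
    _Atomic uint32_t tail;  /* written by the consumer (service) */
    struct request   slots[RING_SLOTS];
};

void notify_service(void);  /* hypothetical: wake the service (context switch) */
void wait_for_reply(void);  /* hypothetical: block until the service answers */

/* Microkernel path: copy-free thanks to the shared page, but every
 * request still ends in a notification and (usually) a context switch. */
int send_request(struct shared_ring *ring, const struct request *req)
{
    uint32_t head = atomic_load_explicit(&ring->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&ring->tail, memory_order_acquire);
    if (head - tail == RING_SLOTS)
        return -1;                                   /* ring full */
    ring->slots[head % RING_SLOTS] = *req;
    atomic_store_explicit(&ring->head, head + 1, memory_order_release);
    notify_service();                                /* the cost that never goes away */
    wait_for_reply();
    return 0;
}

/* Monolithic path: the same operation would just be a direct call. */
int driver_handle_request(const struct request *req);

Even with zero copies, every send_request() ends in a notification and usually a context switch, and both sides now trust the same writable page; in a monolithic kernel the whole thing collapses into a direct call to driver_handle_request().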