Micro-kernel Overheads (Was: User Mode Swapping?)

linguofreak · **Joined:** Wed Mar 09, 2011 3:55 am **Posts:** 509

gerryg400 wrote:

At this point the kernel checks whether it has the page or not. If not it modifies the kernel state of the original app so that it appears that the app sent a message to the VFS asking for a page from the file that it is trying to load. It then adds the app to the message queue of the VFS. From the VFS point of view the message is a 'read' message from a file that was opened when mmap was originally called.

This part seems rather kludgy to me, and is why, while I think microkernels are a really great idea in theory, I don't think that they work well on current CPU architectures.

Most of the functions traditionally handled by a monolithic kernel are things that are properly implemented in a library, but for which there is special system-wide data that needs to be maintained. Current architectures make it easy to have one such library with one set of system-wide data (the kernel), but difficult to have multiple libraries with system-wide data that are isolated from each other. Microkernels on such architectures have to kludge around this by turning what should be a function call to a library without changing threads into message passing between different processes.

Brendan · **Posted:** Wed Jul 15, 2015 6:47 am

Hi,

linguofreak wrote:

Most of the functions traditionally handled by a monolithic kernel are things that are properly implemented in a library, but for which there is special system-wide data that needs to be maintained. Current architectures make it easy to have one such library with one set of system-wide data (the kernel), but difficult to have multiple libraries with system-wide data that are isolated from each other. Microkernels on such architectures have to kludge around this by turning what should be a function call to a library without changing threads into message passing between different processes.

The opposite is more true: Most of the functions traditionally handled by a monolithic kernel are things that have no system-wide data (e.g. they only have scheduler-specific state, device-specific state, file-system-specific state, network-stack-specific state, etc.); and smushing it all together and putting all of it into a single place violates the principle of least privilege. Function calls (which only work within the same "protection domain") are inadequate; and current architectures make switching between "protection domains" easy.

Cheers,

Brendan

linguofreak · **Joined:** Wed Mar 09, 2011 3:55 am **Posts:** 509

Brendan wrote:

Hi,

linguofreak wrote:

Most of the functions traditionally handled by a monolithic kernel are things that are properly implemented in a library, but for which there is special system-wide data that needs to be maintained. Current architectures make it easy to have one such library with one set of system-wide data (the kernel), but difficult to have multiple libraries with system-wide data that are isolated from each other. Microkernels on such architectures have to kludge around this by turning what should be a function call to a library without changing threads into message passing between different processes.

The opposite is more true: Most of the functions traditionally handled by a monolithic kernel are things that have no system-wide data (e.g. and only have scheduler specific state, device specific state, file system specific state, network stack specific state, etc); and smushing it all together and putting all of it into a single place violates the principle of least privilege. Function calls (which only work within the same "protection domain") are inadequate; and current architectures make switching between "protection domains" easy.

You misunderstand what I mean by "system-wide". Compare a typical userspace library to a kernel component: The userspace library, when dealing with a call from a given process, only needs access to the data it is keeping for that process. While dealing with that process it can act like no other processes exist. It has no need to protect its data from the process, and operates as part of the process. A kernel component needs to keep data on all processes and have access to it any time the kernel component is called, and cannot have that data be accessible to any process. This is what I mean by "system-wide". Even if kernel components A and B are isolated from each other, component A serving process X has to have access to the same data as component A serving process Y, and likewise for B.

Current hardware typically has a small number of protection domains per address space (typically just two: user and kernel; the most I've ever seen is four, on x86 and, I think, on VAX as well). Kernel components are isolated from processes by putting the kernel components in kernelspace and the processes in userspace. They keep their data system-wide by making kernelspace be the same in every process. Processes are separated from each other by making userspace different for every process. But there's no way to isolate the kernel components from each other on such hardware except by putting them into completely separate processes themselves. The preferable solution would be for the hardware to support a large number of protection domains per address space, so that each kernel component could be put in its own protection domain without having to switch processes.

On current hardware, calling a driver under a microkernel goes like this:

1. Process crafts message
2. Process makes system call to kernel to send message to driver. Hardware switches from user to kernel protection domain.
3. Kernel switches to driver's address space
4. Kernel makes callback to driver. Hardware switches from kernel to user protection domain
5. Driver interprets message
6. Driver calls appropriate internal function to act on message
7. Once it has results, driver crafts message to return them to process.
8. Driver makes system call to send message to process. Hardware switches from user to kernel protection domain.
9. Kernel switches to process's address space
10. Kernel returns to process
11. Process interprets message

If the driver needs to call another driver to fulfill a request, insert steps 1 through 11 between steps 6 and 7.

Under a monolithic kernel, calling a driver goes like this:

1. Process makes system call to kernel. Hardware switches from user to kernel protection domain
2. Kernel interprets system call arguments, passes call to appropriate driver
3. Driver interprets system call arguments, calls appropriate internal function
4. Once it has results, driver returns to kernel system-call processing code
5. Kernel returns to process. Hardware switches from kernel to user protection domain.

If the driver needs to call another driver or kernel component, we basically repeat step 3.

Under a microkernel on microkernel-friendly hardware, calling a driver goes like this:

1. Process makes system call to driver. Hardware switches from user to driver protection domain.
2. Driver interprets system call arguments, passes call to appropriate internal function (depending on hardware and operating system architecture, it's possible that the driver exposes multiple entry points and the process was able to call the appropriate function directly, in which case this step is not necessary).
3. Once it has results, driver returns to process. Hardware switches from driver to user protection domain.

If the driver needs to call another driver or kernel component, repeat steps 1 through 3 (substituting "driver 1" and "driver 2" for "process" and "driver") between steps 2 and 3. This potentially makes the microkernel on friendly hardware a bit slower than a monolithic kernel, but by little enough that the microkernel approach becomes worthwhile.

The microkernel on current hardware makes 4 protection domain switches and 2 address space switches for every driver call, and preserves isolation between drivers.

The monolithic kernel makes 2 protection domain switches for every time a process makes a driver call, no switches when a driver makes a driver call, and does not preserve driver isolation.

The microkernel on friendly hardware makes 2 protection domain switches for every driver call, and preserves driver isolation.
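The three tallies above can be sketched as a small calculation. This is only an illustration of the counts given in this post, not a measurement: the function names and the `SwitchCost` type are made up for the example, and `driver_calls` is assumed to mean the length of a call chain (1 = a process calls a driver, 2 = that driver calls a second driver, and so on).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwitchCost:
    protection: int      # protection domain switches
    address_space: int   # address space switches


def microkernel_current(driver_calls: int) -> SwitchCost:
    # Every call in the chain is a full message round trip through the kernel:
    # 4 protection domain switches and 2 address space switches each.
    return SwitchCost(4 * driver_calls, 2 * driver_calls)


def monolithic_kernel(driver_calls: int) -> SwitchCost:
    # Only the process's system call crosses a boundary (in and out);
    # driver-to-driver calls are plain function calls inside the kernel.
    return SwitchCost(2 if driver_calls > 0 else 0, 0)


def microkernel_friendly(driver_calls: int) -> SwitchCost:
    # Each call enters and leaves one driver's protection domain,
    # without ever changing address space.
    return SwitchCost(2 * driver_calls, 0)


if __name__ == "__main__":
    for depth in (1, 2, 3):
        print(depth,
              microkernel_current(depth),
              monolithic_kernel(depth),
              microkernel_friendly(depth))
```

With a chain of two driver calls this gives 8 protection domain switches plus 4 address space switches on current hardware, 2 protection domain switches for the monolithic kernel, and 4 for the friendly-hardware case.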

Brendan · **Posted:** Thu Jul 16, 2015 5:42 am

Hi,

linguofreak wrote:

Brendan wrote:

linguofreak wrote:

Most of the functions traditionally handled by a monolithic kernel are things that are properly implemented in a library, but for which there is special system-wide data that needs to be maintained. Current architectures make it easy to have one such library with one set of system-wide data (the kernel), but difficult to have multiple libraries with system-wide data that are isolated from each other. Microkernels on such architectures have to kludge around this by turning what should be a function call to a library without changing threads into message passing between different processes.

The opposite is more true: Most of the functions traditionally handled by a monolithic kernel are things that have no system-wide data (e.g. and only have scheduler specific state, device specific state, file system specific state, network stack specific state, etc); and smushing it all together and putting all of it into a single place violates the principle of least privilege. Function calls (which only work within the same "protection domain") are inadequate; and current architectures make switching between "protection domains" easy.

You misunderstand what I mean by "system-wide". Compare a typical userspace library to a kernel component: The userspace library, when dealing with a call from a given process, only needs access to the data it is keeping for that process. While dealing with that process it can act like no other processes exist. It has no need to protect its data from the process, and operates as part of the process. A kernel component needs to keep data on all processes and have access to it any time the kernel component is called, and cannot have that data be accessible to any process. This is what I mean by "system-wide". Even if kernel components A and B are isolated from each other, component A serving process X has to have access to the same data as component A serving process Y, and likewise for B.

Typical user-space libraries have nothing to do with it, and are usually exactly the same regardless of kernel type.

For things that we're actually talking about (things that are in a monolithic kernel and not in a user-space library, that are shifted to user-space for micro-kernels - e.g. drivers); typically there's "module specific data" (e.g. for controlling the device) plus queues of zero or more operations that are waiting for the device to become available, and almost no data from any other process (excluding the data in those "queues of pending operations").

linguofreak wrote:

Current hardware typically has a small number of protection domains per address space (typically just two: user and kernel. The most I've ever seen is four on x86 (I think on Vax as well)). Kernel components are isolated from processes by putting the kernel components in kernelspace and the processes in userspace. They keep their data system-wide by making kernelspace be the same in every process. Processes are separated from each other by making userspace different for every process. But there's no way to isolate the kernel components from each other on such hardware except by putting them into completely separate processes themselves. The preferable solution would be for the hardware to support a large number of protection domains per address space, so that each kernel component could be put in its own protection domain without having to switch processes.

Erm, no. Current hardware typically has facilities that software can use to create an "infinite" number of protection domains using whatever means software likes; whether that's segmentation or paging or software isolation/managed code or virtualisation or anything else; or even a mixture of multiple different techniques. Most OSs use paging to create protection domains (e.g. give each thing its own virtual address space), and in that case the protection domains are normally called processes; but processes are just one type of protection domain.

linguofreak wrote:

Under a microkernel on microkernel-friendly hardware, calling a driver goes like this:

1. Process makes system call to driver. Hardware switches from user to driver protection domain.
2. Driver interprets system call arguments, passes call to appropriate internal function (depending on hardware and operating system architecture, it's possible that the driver exposes multiple entry points and the process was able to call the appropriate function directly, in which case this step is not necessary).
3. Once it has results, driver returns to process. Hardware switches from driver to user protection domain.

32-bit 80x86 has had support for this built into the CPU since 80386; but the hardware task switching mechanism (which can allow one task running at CPL=3 in one protection domain/virtual address space to use a "call gate" that switches to another task running at CPL=3 in another protection domain/virtual address space without passing through the kernel at all) is far worse than going through the kernel.

Note that the reason this is worse is because it's extremely inflexible; for example, you can't postpone the protection domain/virtual address space switch until it's actually needed and avoid overhead, you're limited to whatever permission system the CPU designer felt like providing, it fails completely for multi-CPU, it requires additional synchronisation (even on single-CPU) to ensure a task doesn't try to call a task that's already running, etc. Also note that the supposed "benefit" is negligible (and nowhere near close to making up for the disadvantages); and that switching from one thing's working set to another thing's working set has costs that are impossible to avoid regardless of how protection domains are implemented in software or in hardware.

linguofreak wrote:

The microkernel on current hardware makes 4 protection domain switches and 2 address domain switches for every driver call, and preserves isolation between drivers.

This only applies to "synchronous" designs; where you have 4 protection domain switches (2 lightweight switches where virtual address space is unchanged and 2 heavyweight switches where virtual address space is changed) per request/reply.

For "asynchronous" designs, a process running on one CPU can send 123 requests to a driver running on another CPU without any heavyweight protection domain switches at all; and that driver can send 123 replies back without any heavyweight protection domain switches at all. Alternatively, for single-CPU, the process can send 123 requests, you have one heavyweight protection domain switch from process to driver, then the driver can send 123 replies back, followed by a second heavyweight protection domain switch from driver back to process. In this case it costs 4 lightweight protection domain switches per request/reply plus zero or more heavyweight protection domain switches (up to a worst case maximum of 2 heavyweight protection domain switches per request/reply).

However; it's possible to do "batch kernel API calls". Instead of calling kernel to send each request/reply; you can construct a list of things you want done, then ask kernel to do everything on your list. This means that (e.g.) a process can send 123 requests with only 2 lightweight protection domain switches (instead of 2 switches per request), and driver can do the same. This only works for the "asynchronous" design - for anything synchronous you must do the protection domain switches to get replies/results. In this case; it costs 2 lightweight protection domain switches for any number of requests/replies, plus zero or more heavyweight protection domain switches (up to a worst case maximum of 2 heavyweight protection domain switches per request/reply).

If you think about this, you'll realise that it's possible (in some circumstances) for "asynchronous with batch kernel API calls, where process and driver are running on different CPUs" to have fewer protection domain switches than the "monolithic with no isolation" case despite the isolation.
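To make the comparison concrete, here is a rough tally of the figures given above for 123 requests. It is a sketch under stated assumptions: the function names are invented for the example, the heavyweight counts are the best-case figures (the worst case is up to 2 per request/reply), and the batched case is read as 2 lightweight switches for the process's batch plus 2 for the driver's batch.

```python
def synchronous(n):
    """Return (lightweight, heavyweight) switches for n request/reply pairs."""
    return 2 * n, 2 * n


def asynchronous(n, same_cpu):
    # 4 lightweight switches per request/reply (send and reply each enter
    # and leave the kernel); best case is one heavyweight switch each way
    # on a single CPU, none when process and driver run on different CPUs.
    heavy = 2 if (same_cpu and n > 0) else 0
    return 4 * n, heavy


def asynchronous_batched(n, same_cpu):
    # Assumption: one batched kernel call per side, 2 lightweight switches each.
    light = 4 if n > 0 else 0
    heavy = 2 if (same_cpu and n > 0) else 0
    return light, heavy


def monolithic_calls(n):
    # One system call (in and out of the kernel) per request, no isolation.
    return 2 * n, 0


if __name__ == "__main__":
    n = 123
    print("synchronous        ", synchronous(n))
    print("async, one CPU     ", asynchronous(n, same_cpu=True))
    print("async, two CPUs    ", asynchronous(n, same_cpu=False))
    print("batched, two CPUs  ", asynchronous_batched(n, same_cpu=False))
    print("monolithic         ", monolithic_calls(n))
```

Under these assumptions the batched, two-CPU case needs 4 switches in total for 123 requests, against 246 for the monolithic kernel.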

linguofreak wrote:

The microkernel on friendly hardware makes 2 protection domain switches for every driver call, and preserves driver isolation.

In theory that looks like 2 heavyweight protection domain switches (process -> driver -> process). In practice, somehow the kernel's scheduler has to know what's going on (so it can do thread priorities, preemption, etc. properly) so the driver ends up informing kernel and you end up with 6 protection domain switches (process -> driver->kernel->driver->kernel->driver -> process) where there's 4 lightweight switches and 2 heavyweight switches. Note that this is a little worse than the "synchronous design" and much worse than "asynchronous design".

Cheers,

Brendan

SpyderTL · **Joined:** Sun Sep 19, 2010 10:05 pm **Posts:** 1074

Posting messages has a lot of advantages over making function calls in a multi-threaded environment. The analogy I like to use is the difference between sending an email and making a phone call.

With an email, the recipient doesn't have to stop what they are currently doing and answer the phone. They don't even have to be "present" when the email is sent. And they have the option to respond to emails in order of priority.

On the other hand, making a phone call is quicker, because the recipient can respond immediately. But that only works if there is one caller. As soon as you introduce a second caller, email instantly becomes the better solution.

The same situation applies to device drivers in a multi-threaded environment. Building a hard disk controller driver that is thread safe is harder than building one that simply reads a message queue and executes commands one at a time.
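A minimal sketch of that idea, using Python threads as a stand-in for processes. `DiskDriver`, its `post` method and the message layout are all made up for illustration; the point is only that the device state is touched by a single thread, so the driver needs no locks of its own however many callers there are.

```python
import queue
import threading


class DiskDriver:
    """Hypothetical driver that serves one queued command at a time."""

    def __init__(self):
        self._inbox = queue.Queue()
        self._sectors = {}  # stands in for the device; only the driver thread touches it
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def post(self, op, sector, data=None):
        """Queue a message and return a queue on which the reply will arrive."""
        reply = queue.Queue(maxsize=1)
        self._inbox.put((op, sector, data, reply))
        return reply

    def _run(self):
        while True:
            op, sector, data, reply = self._inbox.get()
            if op == "stop":
                reply.put(None)
                return
            if op == "write":
                self._sectors[sector] = data
                reply.put(True)
            elif op == "read":
                reply.put(self._sectors.get(sector))
            else:
                reply.put(None)  # unknown command


if __name__ == "__main__":
    driver = DiskDriver()
    pending = driver.post("write", 7, b"hello")  # caller is free to do other work here
    pending.get()                                # ... and collects the reply when it wants
    print(driver.post("read", 7).get())
    driver.post("stop", 0).get()
```

The caller decides when to block on the reply queue, which is the "email" half of the analogy; calling `.get()` immediately turns it back into a phone call.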

And the user will definitely notice the difference between an application that is waiting for a message and an application that is waiting for a function to return, even if the message takes a few milliseconds longer.

linguofreak · **Joined:** Wed Mar 09, 2011 3:55 am **Posts:** 509

Brendan wrote:

32-bit 80x86 has had support for this built into the CPU since 80386; but the hardware task switching mechanism (which can allow one task running at CPL=3 in one protection domain/virtual address space to use a "call gate" that switches to another task running at CPL=3 in another protection domain/virtual address space without passing through the kernel at all) is far worse than going through the kernel.

Note that the reason this is worse is because it's extremely inflexible; for example, you can't postpone the protection domain/virtual address space until it's actually needed and avoid overhead, you're limited to whatever permission system the CPU designer felt like providing, it fails completely for multi-CPU, it requires additional synchronisation (even on single-CPU) to ensure a task doesn't try to call a task that's already running, etc. Also note that the supposed "benefit" is negligible (and nowhere near close to making up for the disadvantages); and that switching from one thing's working set to another thing's working set has costs that are impossible to avoid regardless of how protection domains are implemented in software or in hardware.

The x86 task switching architecture tries to support the "microkernel drivers as separate processes" model that I'm arguing against in hardware. It's not the kind of hardware feature that is actually needed for a good microkernel.

linguofreak wrote:

The microkernel on friendly hardware makes 2 protection domain switches for every driver call, and preserves driver isolation.

In theory that looks like 2 heavyweight protection domain switches (process -> driver -> process). In practice, somehow the kernel's scheduler has to know what's going on (so it can do thread priorities, preemption, etc. properly) so the driver ends up informing kernel and you end up with 6 protection domain switches (process -> driver->kernel->driver->kernel->driver -> process) where there's 4 lightweight switches and 2 heavyweight switches. Note that this is a little worse than the "synchronous design" and much worse than "asynchronous design".

It's two *lightweight* protection domain switches. On well designed hardware, the switch to the driver protection domain would be similar to a mode switch to kernel mode on current hardware and would occur without changing address spaces. And, because we're not implementing drivers as separate processes, but more like libraries (as the whole kernel is in the monolithic kernel model), the kernel doesn't need to be informed, because we're not trying to do voodoo with process switching behind the kernel's back.

Brendan · **Posted:** Fri Jul 17, 2015 6:19 am

Hi,

linguofreak wrote:

Brendan wrote:

linguofreak wrote:

The microkernel on friendly hardware makes 2 protection domain switches for every driver call, and preserves driver isolation.

In theory that looks like 2 heavyweight protection domain switches (process -> driver -> process). In practice, somehow the kernel's scheduler has to know what's going on (so it can do thread priorities, preemption, etc. properly) so the driver ends up informing kernel and you end up with 6 protection domain switches (process -> driver->kernel->driver->kernel->driver -> process) where there's 4 lightweight switches and 2 heavyweight switches. Note that this is a little worse than the "synchronous design" and much worse than "asynchronous design".

It's two *lightweight* protection domain switches. On well designed hardware, the switch to the driver protection domain would be similar to a mode switch to kernel mode on current hardware and would occur without changing address spaces. And, because we're not implementing drivers as separate processes, but more like libraries (as the whole kernel is in the monolithic kernel model), the kernel doesn't need to be informed, because we're not trying to do voodoo with process switching behind the kernel's back.

Ah - I think I understand what you want now; and I think what you want is the protected control transfers that the Mill project provides. This actually does look good to me (about as good as "fundamentally flawed because it's synchronous" IPC can get).

Cheers,

Brendan

Combuster · **Posted:** Fri Jul 17, 2015 6:50 am

Quote:

This actually does look good to me (about as good as "fundamentally flawed because it's synchronous" IPC can get).

Calling the kernel to send a message to another process also includes a control transfer, possibly waits until the pointers you're trying to send are back in memory and swaps something else out to make space to save a copy in the message queue, and can return any number of seconds later. Hence every message passing microkernel system is 100% synchronous. :mrgreen:

The only thing they did is make old-fashioned call gates exactly as cheap as a regular function call. You're confusing a control transfer primitive with an asynchronous message passing primitive.

And if you don't trust the server application to get the synchronous-to-asynchronous glue right, you let the kernel generate the message passing glue code for both sides of the fence.

Brendan · **Posted:** Fri Jul 17, 2015 9:18 am

Hi,

Combuster wrote:

Quote:

This actually does look good to me (about as good as "fundamentally flawed because it's synchronous" IPC can get).

Calling the kernel to send a message to another process also includes a control transfer, possibly waits until the pointers you're trying to send are back in memory and swap something else out to make space to save a copy in the message queue, and can return any number of seconds later. Henceforth every message passing microkernel system is 100% synchronous. :mrgreen:

The IPC is asynchronous (even though the kernel API used to send/receive isn't).

Combuster wrote:

The only thing they did is making oldfashioned call gates exactly as cheap as a regular function call. You're confusing a control transfer primitive for an asynchronous message passing primitive.

As far as I know the Mill security video (that I linked to) advocates a micro-kernel-like architecture based on synchronous message passing (using the "call gates exactly as cheap as a regular function call" mechanism); which is what linguofreak seems to want. I'm not confusing this with asynchronous message passing, I'm suggesting that (for device drivers, etc) it's inferior to asynchronous message passing.

In terms of cost; it's mostly about working sets. You have a process running with all caches, etc full of that process' code and data; and it switches to something else where none of the code or data is in any caches. The end result is hundreds of thousands of cycles spent on cache misses. It makes very little difference if the switch itself was "as cheap as a regular function call" or not, the "hundreds of thousands of cycles spent on cache misses" makes it expensive regardless. The only way to mitigate that is to postpone "working set switches" or avoid them completely, which is impossible with "synchronous".

Cheers,

Brendan

Combuster · **Posted:** Fri Jul 17, 2015 12:31 pm

Quote:

It makes very little difference if the switch itself was "as cheap as a regular function call" or not, the "hundreds of thousands of cycles spent on cache misses" makes it expensive regardless. The only way to mitigate that is to postpone "working set switches" or avoid them completely, which is impossible with "synchronous".

<rant>
If you want to argue black and white, I can do that too:

Asynchronous:
Process A runs on CPU 1 and creates 4MB of data and stores that in memory for submission. That costs it 4MB worth of cache misses.
Process B runs on CPU 2 and uses the 4MB of data. That costs it 4MB worth of cache misses because they have to be evicted from CPU 1.

Synchronous:
Process A runs on CPU 1 and creates 4MB of data and stores that in memory for submission. That costs it 4MB worth of cache misses.
Process B runs on CPU 1 and uses the 4MB of data. Now it's still part of the working set.

Contrary to your statement, synchronous wins here by a factor of two. And if you simply repeat the process, this ratio even tends to infinity.
</rant>

----------------

I'm well aware it's never simple. Do you care to give a more balanced view on the matter for all the other readers?

Brendan · **Posted:** Fri Jul 17, 2015 1:41 pm

Hi,

Combuster wrote:

Quote:

It makes very little difference if the switch itself was "as cheap as a regular function call" or not, the "hundreds of thousands of cycles spent on cache misses" makes it expensive regardless. The only way to mitigate that is to postpone "working set switches" or avoid them completely, which is impossible with "synchronous".

<rant>
If you want to argue black and white, I can do that too:

Asynchronous:
Process A runs on CPU 1 and creates 4MB of data and stores that in memory for submission. That costs it 4MB worth of cache misses.
Process B runs on CPU 2 and uses the 4MB of data. That costs it 4MB worth of cache misses because they have to be evicted from CPU 1.

Synchronous:
Process A runs on CPU 1 and creates 4MB of data and stores that in memory for submission. That costs it 4MB worth of cache misses.
Process B runs on CPU 1 and uses the 4MB of data. Now it's still part of the working set.

Contrary to your statement, synchronous wins here by a factor two. And if you simply repeat the process, this ratio even tends to infinity.
</rant>

----------------

I'm well aware it's never simple. Do you care to give a more balanced view on the matter for all the other readers?

If you're deliberately attempting to be as biased and idiotic as possible, then I can play that game too.

Asynchronous:
Process A runs on CPU 1 and creates 64 bytes of data and stores that in memory for submission. That costs 1 cache miss.
Process B runs on CPU 2 and uses the 64 bytes of data (one cache miss), plus 1 MiB of code and 7 MiB of its own data to process that 64 byte message; then send a 64 byte reply back. That costs 2 cache misses.
Process A runs on CPU 1 and receives the reply, and continues whatever it was doing using 2 MiB of code and 6 MiB of data that's still in the cache from before the request was sent. That costs 1 cache miss.

Total cost is 4 cache misses, or (at an average of 150 cycles per cache miss) 600 cycles.

Synchronous:
Process A runs on CPU 1 and creates 64 bytes of data and stores that in memory for submission. That costs 1 cache miss.
Process B runs on CPU 1 and uses the 64 bytes of data (one cache miss), plus 1 MiB of code and 7 MiB of its own data to process that 64 byte message; then send a 64 byte reply back. That costs 131072 cache misses.
Process A runs on CPU 1 and receives the reply, and continues whatever it was doing using 2 MiB of code and 6 MiB of data that got pushed out of the cache by process B. That costs 131072 cache misses.

Total cost is 262144 cache misses, or (at an average of 150 cycles per cache miss) 39321600 cycles.

In this case; synchronous is 65536 times slower from the cache misses alone (ignoring TLB misses; and ignoring the fact that Process A and Process B couldn't be running at the same time in parallel on different CPUs).
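For anyone who wants to check the arithmetic, the figures above can be reproduced as follows. The 64-byte cache line is implied by the post's numbers (8 MiB / 131072 = 64 bytes); the variable names are just for this sketch, and the synchronous total follows the post in counting only the two working-set reloads.

```python
CACHE_LINE = 64          # bytes per cache line (implied by the figures above)
MIB = 1024 * 1024
CYCLES_PER_MISS = 150    # average cost assumed in the post


def lines(nbytes):
    """Number of cache lines needed to hold nbytes."""
    return nbytes // CACHE_LINE


# Asynchronous: only the 64-byte request and reply move between CPUs.
async_misses = 1 + 2 + 1

# Synchronous: each process's 8 MiB working set is reloaded after the other ran.
working_set_b = 1 * MIB + 7 * MIB   # process B: code + data
working_set_a = 2 * MIB + 6 * MIB   # process A: code + data
sync_misses = lines(working_set_b) + lines(working_set_a)

async_cycles = async_misses * CYCLES_PER_MISS
sync_cycles = sync_misses * CYCLES_PER_MISS
ratio = sync_misses // async_misses

if __name__ == "__main__":
    print(async_misses, async_cycles)
    print(sync_misses, sync_cycles)
    print(ratio)
```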

Cheers,

Brendan

linguofreak · **Joined:** Wed Mar 09, 2011 3:55 am **Posts:** 509

Brendan wrote:

Ah - I think I understand what you want now; and I think what you want is the protected control transfers that the Mill project provides.

I'm not sure I'd do it the same way as Mill, or that I want everything it provides, or that it provides everything that I want, but I definitely want at least some of what Mill provides, because I don't think microkernels are truly viable without it.

Quote:

This actually does look good to me (about as good as "fundamentally flawed because it's synchronous" IPC can get).

Ah, but we aren't doing IPC. As I've said, one of the things that I think monolithic kernels get right is that they basically model kernel components/drivers as libraries with access to special data, rather than as separate processes. The problem with current hardware is that it forces all that special data to be in one protection domain if it is to be accessed without switching processes, rather than giving each kernel component its own protection domain. A microkernel on friendly hardware would model drivers as libraries, like a monolithic kernel, but keep them in separate protection domains, like any other microkernel.

linguofreak · **Joined:** Wed Mar 09, 2011 3:55 am **Posts:** 509

Brendan wrote:

Hi,

Combuster wrote:

Quote:

It makes very little difference if the switch itself was "as cheap as a regular function call" or not, the "hundreds of thousands of cycles spent on cache misses" makes it expensive regardless. The only way to mitigate that is to postpone "working set switches" or avoid them completely, which is impossible with "synchronous".

<rant>
If you want to argue black and white, I can do that too:

Asynchronous:
Process A runs on CPU 1 and creates 4MB of data and stores that in memory for submission. That costs it 4MB worth of cache misses.
Process B runs on CPU 2 and uses the 4MB of data. That costs it 4MB worth of cache misses because they have to be evicted from CPU 1.

Synchronous:
Process A runs on CPU 1 and creates 4MB of data and stores that in memory for submission. That costs it 4MB worth of cache misses.
Process B runs on CPU 1 and uses the 4MB of data. Now it's still part of the working set.

Contrary to your statement, synchronous wins here by a factor two. And if you simply repeat the process, this ratio even tends to infinity.
</rant>

----------------

I'm well aware it's never simple. Do you care to give a more balanced view on the matter for all the other readers?

If you're deliberately attempting to be as biased and idiotic as possible, then I can play that game too.

Asynchronous:
Process A runs on CPU 1 and creates 64 bytes of data and stores that in memory for submission. That costs 1 cache miss.
Process B runs on CPU 2 and uses the 64 bytes of data (one cache miss), plus 1 MiB of code and 7 MiB of its own data to process that 64 byte message; then sends a 64 byte reply back. That costs 2 cache misses.
Process A runs on CPU 1 and receives the reply, and continues whatever it was doing using 2 MiB of code and 6 MiB of data that's still in the cache from before the request was sent. That costs 1 cache miss.

Total cost is 4 cache misses, or (at an average of 150 cycles per cache miss) 600 cycles.

Synchronous:
Process A runs on CPU 1 and creates 64 bytes of data and stores that in memory for submission. That costs 1 cache miss.
Process B runs on CPU 1 and uses the 64 bytes of data (one cache miss), plus 1 MiB of code and 7 MiB of its own data to process that 64 byte message; then sends a 64 byte reply back. That costs 131072 cache misses.
Process A runs on CPU 1 and receives the reply, and continues whatever it was doing using 2 MiB of code and 6 MiB of data that got pushed out of the cache by process B. That costs 131072 cache misses.

Total cost is 262144 cache misses, or (at an average of 150 cycles per cache miss) 39321600 cycles.

In this case; synchronous is 65536 times slower from the cache misses alone (ignoring TLB misses; and ignoring the fact that Process A and Process B couldn't be running at the same time in parallel on different CPUs).
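The arithmetic in this example can be checked with a short sketch. It assumes 64-byte cache lines (implied by "64 bytes of data" costing one miss) and, like the totals above, ignores the handful of misses for the message and reply in the synchronous case. The names `CACHE_LINE`, `MISS_CYCLES` and `misses` are illustrative only.

```python
CACHE_LINE = 64      # bytes per cache line (assumed; 64 bytes == 1 miss above)
MISS_CYCLES = 150    # average cycles per cache miss, as stated in the post
MIB = 1024 * 1024


def misses(nbytes):
    """Number of cache lines touched when nbytes of cold data are read."""
    return -(-nbytes // CACHE_LINE)  # ceiling division


# Asynchronous: only the 64-byte message and the 64-byte reply miss,
# once on the writing side and once on the reading side.
async_misses = 4 * misses(64)

# Synchronous: B reloads 1 MiB code + 7 MiB data, then A reloads
# 2 MiB code + 6 MiB data that B pushed out of the cache.
sync_misses = misses(1 * MIB + 7 * MIB) + misses(2 * MIB + 6 * MIB)

async_cycles = async_misses * MISS_CYCLES
sync_cycles = sync_misses * MISS_CYCLES
ratio = sync_cycles // async_cycles

print(async_cycles, sync_cycles, ratio)
```

Under those assumptions the sketch reproduces the figures quoted above: 600 cycles, 39321600 cycles, and a ratio of 65536.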

Cheers,

Brendan

In my model process A is running on CPU 1 the whole time. It starts out in a "user" protection domain in which its own code runs and stores its data. When it needs to make use of Driver X, it makes a call to a library that resides in its own special "Driver X" protection domain, and the hardware switches to that domain. If Driver X needs to make a call to Driver Y to fulfill the request, it makes a call to the Driver Y library, and the hardware switches to the appropriate protection domain. This is the same as on a monolithic kernel, where, during a system call, a process switches from the "user" to the "kernel" protection domain and back, without any process switch occurring. The only difference is that the different libraries that make up the kernel are in different protection domains instead of the same one. The cache overhead is pretty much the same as a monolithic kernel, as the addressing side of things remains pretty much the same; it's just that permission to access different regions of the address space is more granular, rather than being a simple "user/kernel" split.
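As a rough illustration of that call pattern, here is a toy software model. It is only a sketch: real hardware would enforce the checks, and every name here (`Cpu`, `DomainFault`, `driver_x_read`, `driver_y_read`) is hypothetical. The point it tries to show is that the same thread of execution moves through the "user", "Driver X" and "Driver Y" domains and back, and that each domain can only touch its own data.

```python
class DomainFault(Exception):
    """Raised when code touches data outside the current protection domain."""


class Cpu:
    """One hardware thread with a 'current protection domain' register."""

    def __init__(self):
        self.domain = "user"
        self.memory = {"user": {}, "driver_x": {}, "driver_y": {}}
        self.switches = 0  # counts domain switches, not thread switches

    def load(self, domain, key):
        if domain != self.domain:
            raise DomainFault(f"{self.domain} may not read {domain}")
        return self.memory[domain].get(key)

    def store(self, domain, key, value):
        if domain != self.domain:
            raise DomainFault(f"{self.domain} may not write {domain}")
        self.memory[domain][key] = value

    def call(self, target, func, *args):
        """Protected call: switch domain, run func on this thread, switch back."""
        caller = self.domain
        self.domain = target
        self.switches += 1
        try:
            return func(self, *args)
        finally:
            self.domain = caller
            self.switches += 1


def driver_y_read(cpu, block):
    # Runs in the "driver_y" domain and only touches Driver Y's data.
    cpu.store("driver_y", "last_block", block)
    return f"data[{block}]"


def driver_x_read(cpu, block):
    # Runs in the "driver_x" domain, then calls into Driver Y.
    count = cpu.load("driver_x", "requests") or 0
    cpu.store("driver_x", "requests", count + 1)
    return cpu.call("driver_y", driver_y_read, block)


cpu = Cpu()
result = cpu.call("driver_x", driver_x_read, 7)

# Back in the "user" domain, Driver X's data is off limits.
try:
    cpu.load("driver_x", "requests")
    faulted = False
except DomainFault:
    faulted = True
```

In this model the request costs four domain switches (into X, into Y, and back out of each) and no process switch.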

Asynchronous IPC can also be done in this framework: In that case, a process that wishes to allow other processes to communicate with it exposes a library that has access to its heap and exposes functions corresponding to the types of messages it accepts. To send a message, another process calls one of these functions with parameters corresponding to the message content it wishes to send. The library operates within the receiving process's protection domain and the function places a message corresponding to the parameters it was called with into a queue in the receiving process's address space, then returns to the code that called it. When the receiving process receives its next timeslice, it begins processing its message queue.
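A minimal sketch of that scheme, with the protection-domain switch itself left out: the receiver exposes one function per message type, each function only enqueues and returns, and the queue is drained on the receiver's next timeslice. The message types (`post_open`, `post_write`) and the `Receiver` class are made up for illustration.

```python
from collections import deque


class Receiver:
    """A process that accepts messages through an exposed library."""

    def __init__(self):
        self._queue = deque()  # lives in the receiver's address space
        self.handled = []

    # Exposed library functions. They run on the sender's timeslice, but
    # (in the model described above) inside the receiver's protection domain.
    def post_open(self, path):
        self._queue.append(("open", path))

    def post_write(self, handle, data):
        self._queue.append(("write", handle, data))

    def run_timeslice(self):
        """Called when the receiver is next scheduled: drain the queue."""
        while self._queue:
            self.handled.append(self._queue.popleft())


receiver = Receiver()

# The sender calls the exposed functions and continues immediately.
receiver.post_open("/tmp/a")
receiver.post_write(3, b"hi")
pending = len(receiver._queue)

# Later, the receiver gets a timeslice and processes what was queued.
receiver.run_timeslice()
```

Nothing is handled until `run_timeslice()` runs, which is what makes the exchange asynchronous from the sender's point of view.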

Brendan · **Posted:** Sat Jul 18, 2015 1:42 am

Hi,

linguofreak wrote:

Quote:

This actually does look good to me (about as good as "fundamentally flawed because it's synchronous" IPC can get).

Ah, but we aren't doing IPC. As I've said, one of the things that I think monolithic kernels get right is that they basically model kernel components/drivers as libraries with access to special data, rather than as separate processes. The problem with current hardware is that it forces all that special data to be in one protection domain if it is to be accessed without switching processes, rather than giving each kernel component its own protection domain. A microkernel on friendly hardware would model drivers as libraries, like a monolithic kernel, but keep them in separate protection domains, like any other microkernel.

You're talking about 2 extremely different things. For libraries:

It's like a large collection of public functions that processes want to use directly
A library typically has no state of its own
A library typically isn't scheduled independently
A library requires access to (at least some of) the process' data
It makes no sense to have multiple separate instances of the same library

For device drivers:

It's a very small number of "externally accessible" functions (e.g. standardised device driver interface), that processes never use directly (e.g. processes use VFS, network stack, GUI; and never call disk, network card, video card drivers directly)
A driver typically must have state of its own
A driver typically has multiple threads at different priorities that are scheduled independently
A driver does not require access to any process' data. Processes transfer data to (e.g.) VFS, which transfers data to (e.g.) ext2 file system, which transfers data to (e.g.) disk driver; where "transfer" includes a change in ownership of that data.
It makes sense to have multiple separate instances of the same driver (e.g. if you have 2 UHCI controllers, then you want 2 instances of the UHCI driver each with its own separate state)
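The last point, one driver instance per device with its own state, can be sketched as follows. The `UhciDriver` class and the I/O base addresses are hypothetical and only illustrate the idea of separate per-instance state.

```python
class UhciDriver:
    """One instance per controller; state is never shared between instances."""

    def __init__(self, io_base):
        self.io_base = io_base          # hypothetical I/O port base
        self.pending_transfers = []     # per-controller state

    def submit(self, transfer):
        self.pending_transfers.append(transfer)


# Two UHCI controllers, so two driver instances.
controllers = [UhciDriver(0xC000), UhciDriver(0xC020)]
controllers[0].submit("keyboard poll")
```

Submitting a transfer to the first controller leaves the second instance's queue untouched.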

For libraries, making them part of each process' protection domain (or at least, each process that uses the library) makes perfect sense (regardless of whether it's done by dynamic linking or static linking); because that's exactly what libraries are for.

For micro-kernels, making drivers part of each process' protection domain makes no sense whatsoever; and if you give them their own protection domain then they are processes regardless of whether you use different mechanisms to isolate "driver processes" and "application processes", and regardless of whether you use different mechanisms for "communication with drivers" (e.g. an RPC mechanism that looks like function calls) and "communication with applications" (e.g. a message passing mechanism).

Basically what you want is to have 2 different types of processes, with 2 different types of isolation and 2 different types of communication; with no sane reason for any of the pointless duplication. If you're able to isolate drivers using a faster/cheaper method; then you're able to isolate applications using that same faster/cheaper method.

Cheers,

Brendan

willedwards · **Joined:** Sat Mar 15, 2014 3:49 pm **Posts:** 96

Brendan wrote:

what you want is the protected control transfers that the Mill project provides. This actually does look good to me (about as good as "fundamentally flawed because it's synchronous" IPC can get).

(Mill team; particularly involved in this area)

Hardware-managed asynchronous message passing would require messages to be stored and message lifetimes to be managed by hardware. This would push towards small, fixed-size messages and probably fixed-size queues, with mission creep into hardware-managed scheduling and such. It also suggests copying and system-managed heaps and lots of other nasties and inflexibilities. It's a slippery slope!

The Mill does secure message passing synchronously between protection domains, bounded only by available memory.

On top of this synchronous message passing you can build both immutability and asynchronous message passing in a couple of ways, depending on the directions of trust between the components communicating. The classic recipes of using buffered IO or message queues in a kernel will work just fine. You can also put the buffering into the recipient itself if it can be trusted not to DOS the sender, which might be suitable for your GPU driver and other flaky-but-trusted components with uncompromising performance requirements.
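The "message queue in a kernel" recipe might look roughly like the sketch below. This is not Mill code or any Mill API; `BufferedChannel` is a hypothetical trusted intermediary, and ordinary method calls stand in for synchronous protected calls. The sender's call returns as soon as the message is buffered or refused, so an untrusted recipient cannot stall the sender.

```python
from collections import deque


class BufferedChannel:
    """Trusted intermediary holding a bounded queue between two parties."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()

    def send(self, msg):
        """Synchronous call from the sender; never waits on the recipient."""
        if len(self.buffer) >= self.capacity:
            return False  # full: the sender decides whether to retry or drop
        self.buffer.append(msg)
        return True

    def receive(self):
        """Synchronous call from the recipient; None means nothing queued."""
        return self.buffer.popleft() if self.buffer else None


chan = BufferedChannel(capacity=2)
accepted = [chan.send(m) for m in ("a", "b", "c")]
first = chan.receive()
```

With a capacity of two, the third send is refused rather than blocking. Moving the buffer into the recipient, as described above, removes the intermediary at the cost of trusting the recipient to return promptly.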

Mill protection is not just about isolation within the classic 'kernel', i.e. microkernels, but also about isolation and sandboxing within applications.

Sandboxing approaches on classic machines such as containers and NaCl need kernel intermediation and the kernel - a very large attack surface - needs to police all APIs and resources. The classic OSes are slowly getting there with Capsicum but they keep finding checks that they have missed...

On the Mill, the sandbox can be given access only to tailored portals that need not be the kernel. The idea is to make it so you can take a program and split it into subcomponents with isolation and least authority that are assumed to be working to the common good. When the attacker exploits a vulnerability in such a subcomponent they will only be able to DOS the components they can interact with (by not doing any work) and not escalate their privileges.

(Lots of details are not secret, but haven't yet been presented in a talk; we really need a whole new talk on the security aspects of IPC on the Mill, but preparing talks is such a time sink from actually trying to make a chip.)
