OSDev.org

The Place to Start for Operating System Developers
 Post subject: Re: Microkernel: How to deal with recursive messages?
PostPosted: Wed Aug 30, 2017 2:52 pm 
Joined: Thu Aug 13, 2015 4:57 pm
Posts: 384
Korona wrote:
Accounting buffers at the sender is not a universal solution. It just reverses the problem: now, receivers can starve senders. Even if some process is acting as a client, it might sometimes be desirable to act as a receiver.

The problem is commonly (?) solved by accounting the buffers both at the sender and at the receiver and by enforcing per-user limits (in a UNIX-like system; this might differ among different OS designs).


True, but the receiver (service) can always provide bad service anyway, so there's some implicit trust towards services; plus the sender can decide which services to use.

When services are waiting for a client to receive, they can code defensively and possibly cancel long-pending messages if the receiver is not actually receiving.

Mainly I don't want them in the kernel, where the messages are in a kind of "no man's land". If you account for them on both sides, then looking at a "task manager" (or similar) it's very difficult to get an overview of the memory usage.

Right now the Win8.1 task manager says over 4 GiB of memory in use (the column), yet adding up all the individual apps/services gives around 2.5 GiB. I understand why it is so in Win8.1; I'm advocating that it should be very clear what is using memory, otherwise the figure is almost useless.
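
For illustration only (not code from the thread), a minimal sketch of sender-side accounting, where every name and the quota value are hypothetical stand-ins, so a "task manager" could attribute every in-flight buffer to exactly one process:

Code:
    /* Hypothetical sketch: charge in-flight message buffers to the
       sending process, release the charge when the receiver takes the
       message. All names and the quota value are made up. */

    #include <stddef.h>

    #define MSG_QUOTA (64 * 1024)   /* per-process byte limit (arbitrary) */

    struct process {
        size_t msg_bytes_in_flight; /* shows up under this process in "task manager" */
        /* ... */
    };

    int send_message(struct process *sender, struct msg_queue *q,
                     const void *data, size_t len)
    {
        if (sender->msg_bytes_in_flight + len > MSG_QUOTA)
            return -1;                          /* over quota: sender must back off */
        sender->msg_bytes_in_flight += len;     /* memory stays attributed to sender */
        return enqueue(q, sender, data, len);   /* assumed kernel helper */
    }

    void on_receive(struct process *sender, size_t len)
    {
        sender->msg_bytes_in_flight -= len;     /* receiver now owns the data */
    }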


 Post subject: Re: Microkernel: How to deal with recursive messages?
PostPosted: Wed Aug 30, 2017 7:42 pm 
Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

LtG wrote:
Brendan wrote:
For asynchronous interfaces (where threads never block until/unless they have nothing to do - no pending requests they can handle), single threaded processes are probably fine for most of it (the bottleneck in most cases is going to be the storage device hardware itself), and the only reason for any of the processes to be multi-threaded is to benefit from multiple CPUs. However, there's typically not much CPU usage involved and not much reason to want multiple CPUs; unless (maybe) if you have some expensive encryption in the storage device driver (for "whole disk encryption") or (maybe) if the VFS struggles to do path lookup fast enough or (maybe) if there's software RAID between FS and storage devices that's doing some relatively expensive form of RAID (e.g. calculating/checking parity for "RAID 5").

Assuming a conventional x86 PC, multicore suggests that multiple threads can improve performance.

If you have a single thread, then you would either need to check whether that thread is sleeping and migrate it from its core to the current core; or alternatively send an IPI to the other core and let it handle the work, with the associated cache penalties, putting both the requesting thread and the thread that was running on the "VFS core" to sleep. With NUMA the performance penalty might be even greater, though NUMA isn't that common.

In practice I have no idea how much performance impact there actually is, but intuitively, being able to handle the VFS stuff on the core where the requesting thread is should give better performance. Either the data is in the cache and can be returned immediately, or I/O must be performed, and issuing the I/O shouldn't take long, so control returns to the requesting thread immediately. Returning the I/O results might take time, but returning from the VFS shouldn't.

So I guess it would make sense to have at least one VFS thread per core, but given that the general OS design affects this a lot, it's difficult to say one way or the other.


Imagine there's multiple threads (possibly belonging to multiple processes - applications, VFS, file systems, etc) and each thread does something like this:

Code:
    do {
        message = get_message();    // Get next message (may cause thread to block until a message arrives)
        handle_message(message);    // Do whatever work is needed to handle that message
    } while(running);
    exit();


"Perfectly balanced" occurs when there's one CPU per thread and a message arrives immediately before "get_message()" is called. In this case all CPUs remain at 100% load and there's zero task switches; so you achieve "max. throughput" and "min. latency".

If a thread is too slow, messages build up in the thread's message queue. You still achieve "max. throughput" but latency suffers.

If a thread is too fast (has to block until a message arrives) the scheduler has to find something else for the CPU to do, and this typically means doing a task switch when the thread blocks and then doing another task switch when a message arrives and the thread unblocks. When there's a lot of messages and the work done to handle each message is small (which is extremely likely for things like VFS) this can be a performance disaster (spending more time doing task switches than you spend doing useful work) - assuming "immediate pre-emption" you still achieve "min. latency", but throughput suffers.

Of course everything will be fluctuating wildly (messages take different amounts of time to handle, messages aren't sent at a fixed rate, other threads are blocking/unblocking at "random" times, the number of CPUs is almost never equal to the number of threads, etc) and you might never achieve "perfectly balanced" (and if you do it won't last long). A more practical goal would be to try to fluctuate between "too fast (excessive task switches)" and "too slow (excessive amount of messages waiting)", so that while you're not handling messages fast enough the message queue grows, and while you're handling messages too fast there are still messages on your queue (and still no task switches). This is still a small "sweet spot" that you're not going to hit often; but it's a nice compromise even when you miss the "sweet spot".

Now imagine if each process uses "one thread per CPU" (e.g. with 8 CPUs you'd have 8 threads per process). In this case you can guarantee that you'll never get anywhere near "perfectly balanced". What this does is maximise the amount of CPU time wasted doing task switches (e.g. every message sent will unblock a thread and cause a task switch and then that thread will block again causing a second task switch). For things where there's lots of messages and a very small amount of work done to handle each message it would be a disaster (most CPU time spent doing task switches).

Note that for VFS, file systems and storage device drivers; almost all messages involve very little work - it's mostly all "create/update the state used to keep track of something, then forward a message somewhere else (or wait for an IRQ)". With normal processes/applications using 6 CPUs that are pounding the daylights out of VFS, you might need 1 CPU for VFS (and its caches), and a quarter of a CPU for file systems and storage device drivers combined. With normal processes/applications using 12 CPUs and pounding the VFS, the VFS might need a second thread.

Of course the best way to know for sure would be to implement it and measure where the bottlenecks are. This is something that could possibly be done at run-time - e.g. an "adaptive VFS" that spawns and terminates its threads (depending on number of messages waiting on existing thread's message queues).
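
As a sketch of what such an adaptive policy might look like (helper names and thresholds are hypothetical, not from the post):

Code:
    /* Hypothetical "adaptive VFS" supervisor: spawn another worker when
       queues stay deep, retire one when they are fully drained.
       queue_depth(), spawn_worker() and retire_worker() are assumed. */

    #include <stddef.h>

    #define MAX_WORKERS 16
    #define HIGH_WATER  32    /* arbitrary per-thread backlog threshold */

    static int num_workers = 1;

    void adapt_worker_count(void)
    {
        size_t backlog = 0;
        for (int i = 0; i < num_workers; i++)
            backlog += queue_depth(i);       /* messages waiting on thread i */

        if (backlog > (size_t)HIGH_WATER * num_workers && num_workers < MAX_WORKERS)
            spawn_worker(num_workers++);     /* falling behind: add a thread */
        else if (backlog == 0 && num_workers > 1)
            retire_worker(--num_workers);    /* fully drained: shed a thread */
    }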


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject: Re: Microkernel: How to deal with recursive messages?
PostPosted: Wed Aug 30, 2017 7:56 pm 
Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

Brendan wrote:
Of course the best way to know for sure would be to implement it and measure where the bottlenecks are. This is something that could possibly be done at run-time - e.g. an "adaptive VFS" that spawns and terminates its threads (depending on number of messages waiting on existing thread's message queues).


I should've thought about this more!

I was assuming that when a service has multiple threads, messages are spread across all threads evenly. It doesn't have to be like that.

Specifically; if the threads had an order ("thread #1, thread #2, ...") and the kernel put messages on the queue for the first thread that has no messages (or "less than [threshold]" messages); then it'd automatically adapt to changes in load. Under low load conditions "thread #1" would get all the messages, and as load increases "thread #2" would start getting some messages (while "thread #1" is always running), and as load increases more "thread #3" would start getting some messages (while both "thread #1" and "thread #2" are always running), and ....
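
A sketch of that delivery policy (hypothetical kernel-side code; queue_len() is an assumed helper):

Code:
    /* Deliver to the first thread (in order) whose queue is below the
       threshold; under low load thread #1 gets everything, and higher
       threads only start seeing messages as load grows. */

    struct thread *pick_receiver(struct service *svc, size_t threshold)
    {
        for (int i = 0; i < svc->num_threads; i++)
            if (queue_len(svc->thread[i]) < threshold)
                return svc->thread[i];
        return svc->thread[svc->num_threads - 1];  /* all busy: arbitrary fallback */
    }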


Cheers,

Brendan



 Post subject: Re: Microkernel: How to deal with recursive messages?
PostPosted: Thu Aug 31, 2017 9:23 am 
Joined: Thu Aug 13, 2015 4:57 pm
Posts: 384
Brendan wrote:
Now imagine if each process uses "one thread per CPU" (e.g. with 8 CPUs you'd have 8 threads per process). In this case you can guarantee that you'll never get anywhere near "perfectly balanced". What this does is maximise the amount of CPU time wasted doing task switches (e.g. every message sent will unblock a thread and cause a task switch and then that thread will block again causing a second task switch). For things where there's lots of messages and a very small amount of work done to handle each message it would be a disaster (most CPU time spent doing task switches).

I think this is one of the main points we disagree on: your design has "heavy" task switches, therefore you want to minimize switching; I'm going in the opposite direction - extremely lightweight task switching, and using it to my advantage.

I have no idea how my design will perform ultimately, so I can't say if it's a good idea or not. If I switch the task on the same core to do the "very little work" for VFS, then I get better cache locality than if I sent an IPI or used some other way of communicating with the other core(s). And the communication overhead is there in your design too.

So having a single thread vs one thread per core really depends on the overall design.


 Post subject: Re: Microkernel: How to deal with recursive messages?
PostPosted: Thu Aug 31, 2017 9:54 am 
Joined: Wed Mar 09, 2011 3:55 am
Posts: 509
Artlav wrote:
I've been contemplating redesigning the filesystem handling in Aprom to use the messaging system (to get away from the broken call-gates/franken-lib-services system I've asked about before - http://forum.osdev.org/viewtopic.php?f=1&t=32374), and am having trouble coming up with a design that won't get stuck.
Here is a problematic scenario.

The VFS service gets woken up, checks the inbox and sees a request to read some data from a file handle of /cfg/run/2/descr.txt
It resolves the handle, finds that the file belongs to a filesystem driver F, sends the read message to it and goes into a wait state.

The filesystem driver F's service gets woken up, checks the inbox and sees a request to read from a file handle on an instance from /dev/rd1.
It computes the necessary offsets, sends a read message to VFS and goes into a wait state.

The VFS service is, however, still waiting for a reply from the FS driver.

How do you resolve recursive messaging like this?

One way I had "success" solving this issue was to make each service spawn a separate thread to handle each message.
This does make things work, but it's also really, really slow.
Thousands of threads are getting spawned and killed all the time.

I contemplated keeping one handler thread idling, but then the same issue arises - how do you tell whether it's a still-busy thread or a thread that is waiting on a recursive call?

I suspect that I'm just trying to approach this from the wrong angle, so the wider question is - how are recursive messages typically handled or avoided?
Should I just make thread creation much faster somehow, or are there better designs?


Well, in monolithic kernels, you tend to have the whole kernel being, effectively, a massive shared library mapped into the address space of every process, with its own private data area that is not accessible to userspace processes; so, for synchronous calls, instead of a series of recursive messages being passed between processes, you just have a series of calls being made in the context of the thread making the system call. If any of the calls in the chain can't proceed immediately (e.g., waiting for disk hardware), the kernel puts the calling thread to sleep and goes off to the scheduler to find the next runnable thread. When the disk access (or whatever) finishes, the thread is marked as ready again and is soon scheduled again (or maybe not so soon, if it is low priority and there are lots of other threads waiting to run).

For microkernels, there tends to be the paradigm that each service is its own process, running its own thread(s). But, fundamentally, many of the things these services are doing are still rather library-like, except that they require a data address space for bookkeeping that is inaccessible to their clients. Monolithic kernels provide this isolation by making use of the distinction provided between user and kernel pages by the MMU, which isolates kernel data from userspace processes, but doesn't isolate different kernel components' data spaces from each other. One solution you could use to do this on a microkernel would be to divide userspace up into a thread-local region and a process code/data region. You'd then have a call_external() syscall that would switch out the code/data region (possibly keeping a shared memory portion of the data region mapped) and switch in the code/data region for whatever service you were calling, while keeping the thread-local region mapped, then call the requested function in the called service. When finished, the called service would use a return_external() syscall to return to the calling program. In this setup, a service wouldn't necessarily need any threads: the call_external() / return_external() back-and-forth would take place on the thread of the calling program. If a call blocked, the calling thread would go into a wait state until the call could continue.

The problem is that I'm not sure how efficient this would be on typical MMU hardware. It would probably need hardware support for capability-based addressing to be efficient, and most attempts at doing that have been too fine-grained (e.g., iAPX 432), so the concept hasn't really taken off.
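
To make the flow concrete, a sketch of the idea (call_external(), return_external() and every other name here are hypothetical, as in the post above):

Code:
    /* Client side: runs in the application. The kernel swaps in the
       service's code/data region but keeps this thread's stack and
       thread-local region mapped, so the call runs on the caller's thread. */
    long read_file(int fd, void *buf, long len)
    {
        struct read_req req = { fd, buf, len };   /* in the shared data portion */
        return call_external(VFS_SERVICE, VFS_OP_READ, &req);  /* may block */
    }

    /* Service side: entered on the *caller's* thread; the service needs no
       threads of its own, only a private data region for bookkeeping. */
    void vfs_entry(int op, void *arg)
    {
        long result = vfs_dispatch(op, arg);
        return_external(result);   /* kernel swaps the caller's regions back in */
    }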


 Post subject: Re: Microkernel: How to deal with recursive messages?
PostPosted: Thu Aug 31, 2017 10:29 am 
Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

LtG wrote:
Brendan wrote:
Now imagine if each process uses "one thread per CPU" (e.g. with 8 CPUs you'd have 8 threads per process). In this case you can guarantee that you'll never get anywhere near "perfectly balanced". What this does is maximise the amount of CPU time wasted doing task switches (e.g. every message sent will unblock a thread and cause a task switch and then that thread will block again causing a second task switch). For things where there's lots of messages and a very small amount of work done to handle each message it would be a disaster (most CPU time spent doing task switches).

I think this is one of the main points we disagree on, your design has "heavy" task switches therefore you want to minimize switching, I'm going the opposite direction, extremely light weight task switching and using it to my advantage.


I don't know what "light weight" really means - I tend to think of rendezvous messaging (where the "message" is tiny and fits in CPU's registers, and you switch tasks without saving or loading registers), but this is completely broken for multi-CPU.

In any case; you can expect that all of the stuff in the CPU's data caches, instruction caches, TLBs, branch prediction buffers and whatever else will all become worthless (old stuff for the previous task that's useless for the new task); and that adds up to "expensive" (lots of cache and TLB misses, and mispredicted branches, and...) and there's almost nothing you can do about it.

What you can do is try to make the task run on the same CPU it used last time (in the hope that some of its data might still be in that CPU's caches) and not the CPU that unblocked it; but that's a "more heavy weight" option (e.g. sending an IPI to cause a task switch on a different CPU because you don't want to do a task switch on the current CPU) and doesn't work well for synchronous messaging (sender has to block to wait for reply which causes a task switch on one CPU, receiver has to unblock causing a task switch on a different CPU - it adds up to double the task switches).

LtG wrote:
I have no idea how my design will perform ultimately, so I can't say if it's a good idea or not. If I switch the task on the same core to do the "very little work" for VFS, then I get better cache locality than if I sent an IPI or some other way of communicating to the other core(s).


More like the opposite (if that other task's code and data is cached anywhere, then it's probably cached in a different CPU's caches, so you ruin cache locality by running it on the current CPU). Note that this gets worse once you start thinking about NUMA.

LtG wrote:
Also the communication overhead is also there in your design..


Yes, but it's a small price to pay to reduce the number of task switches (which are more expensive, regardless of what you do).

LtG wrote:
So having a single thread vs one thread per core really depends on the overall design.


Sure, if you design the OS so that most CPUs spend most of their time trying to acquire locks/mutexes, then... ;)


Cheers,

Brendan



 Post subject: Re: Microkernel: How to deal with recursive messages?
PostPosted: Thu Aug 31, 2017 11:57 am 
Joined: Thu Aug 13, 2015 4:57 pm
Posts: 384
Brendan wrote:
I don't know what "light weight" really means - I tend to think of rendezvous messaging (where the "message" is tiny and fits in CPU's registers, and you switch tasks without saving or loading registers), but this is completely broken for multi-CPU.

Why is it broken for multi-CPU? Especially if you run the recipient on the same core?

Brendan wrote:
In any case; you can expect that all of the stuff in the CPU's data caches, instruction caches, TLBs, branch prediction buffers and whatever else will all become worthless (old stuff for the previous task that's useless for the new task); and that adds up to "expensive" (lots of cache and TLB misses, and mispredicted branches, and...) and there's almost nothing you can do about it.

Suppose procA sends a message to the VFS and the VFS does very little work and then returns. The data and code caches should be just fine, TLBs (if tagged) are fine, so I would imagine that it's mostly all still valid/useful.

Also remember that quite often the message contains something that was prepared just prior to sending it, so the message content is, to a varying extent, in the sending core's caches; if you have another core pick up the work then it's guaranteed to be in the cache of the wrong core. In addition, your design has to interrupt that other core, or sacrifice prioritization and latency (and possibly affect throughput).

Brendan wrote:
LtG wrote:
I have no idea how my design will perform ultimately, so I can't say if it's a good idea or not. If I switch the task on the same core to do the "very little work" for VFS, then I get better cache locality than if I sent an IPI or some other way of communicating to the other core(s).


More like the opposite (if that other task's code and data is cached anywhere, then it's probably cached in a different CPU's caches, so you ruin cache locality by running it on the current CPU). Note that this gets worse once you start thinking about NUMA.

This should get better with NUMA: again, all the data is being processed on the same core, thus in the same NUMA domain. I would think that it's always beneficial to continue work on the same core, unless there are idle cores, in which case the current core could stay with its "main" task instead of diverging into whatever the message was sent for.

Brendan wrote:
Yes, but it's a small price to pay to reduce the number of task switches (which are more expensive, regardless of what you do).

If the message is in the CPU registers and you use TLB tagging, what's the extra price to do a task switch? Compared to an IPI?

Brendan wrote:
LtG wrote:
So having a single thread vs one thread per core really depends on the overall design.


Sure, if you design the OS so that most CPUs spend most of their time trying to acquire locks/mutexes, then... ;)

If each core is almost entirely independent, then what locks/mutexes? I realize that your design on this aspect is radically different from mine, but I'm not planning on using almost any locks or mutexes as far as the kernel is concerned. A disk cache (VFS) thread would probably have to use a mutex when evicting a page from memory and when marking a page as currently needed, but that can be done on a page-by-page basis, so the likelihood of two cores wanting to do that should be relatively low and the mutex should be cheap (not many cycles).
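
A sketch of that per-page locking (hypothetical types and helpers; not code from the thread):

Code:
    /* One short critical section per page: two cores only contend when
       they touch the same page, so the mutex is rarely contended. */

    #include <stdbool.h>

    struct cache_page {
        mutex_t lock;   /* assumed lightweight mutex type */
        int     pins;   /* >0 while some request needs the page */
        /* ... */
    };

    void mark_needed(struct cache_page *p)
    {
        mutex_lock(&p->lock);
        p->pins++;
        mutex_unlock(&p->lock);
    }

    bool try_evict(struct cache_page *p)
    {
        mutex_lock(&p->lock);
        bool ok = (p->pins == 0);
        if (ok) {
            /* unlink from cache, free the frame, ... */
        }
        mutex_unlock(&p->lock);
        return ok;
    }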

If you try to view everything I've written here through the lens of your OS, then most/all of your criticism might be accurate, but again, my design is quite different from yours. When possible I try to avoid all the problems (locks/mutexes, task switching cost, etc) and hope that eventually it all pays off... Time will tell.


 Post subject: Re: Microkernel: How to deal with recursive messages?
PostPosted: Thu Aug 31, 2017 7:42 pm 
Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

LtG wrote:
Brendan wrote:
I don't know what "light weight" really means - I tend to think of rendezvous messaging (where the "message" is tiny and fits in CPU's registers, and you switch tasks without saving or loading registers), but this is completely broken for multi-CPU.

Why is it broken for multi-CPU? Especially if you run the recipient on the same core?


Rendezvous messaging:
  • Ruins scalability (for all requests the "requester" blocks while waiting for a reply, so the "requester" and "requestee" can't do their work in parallel on different CPUs)
  • Ruins cache locality and NUMA optimisations
  • Ruins the scheduler's ability to effectively schedule work ("what runs where and when" becomes a consequence of the constant blocking/unblocking of threads caused by IPC; and things like scheduling policies and thread priorities have little effect)

For "single-CPU" none of these things matter much, and for early micro-kernels (from back when almost all computers where single CPU, and had less need for CPU caches, etc because the difference between "CPU speed" and "RAM speed" was far less) rendezvous messaging was vaguely beneficial. For modern computers (which are almost all multi-CPU, and need lots of caching and other stuff to overcome the difference between "CPU speed" and "RAM speed") its incredibly bad.

LtG wrote:
Brendan wrote:
In any case; you can expect that all of the stuff in the CPU's data caches, instruction caches, TLBs, branch prediction buffers and whatever else will all become worthless (old stuff for the previous task that's useless for the new task); and that adds up to "expensive" (lots of cache and TLB misses, and mispredicted branches, and...) and there's almost nothing you can do about it.

Suppose procA sends a message to the VFS and the VFS does very little work and then returns. The data and code caches should be just fine, TLBs (if tagged) are fine, so I would imagine that it's mostly all still valid/useful.


Suppose procA sends 20 messages to "VFS running on different CPU", VFS sends a reply for 10 requests (that could be satisfied from file data caches, etc) and forwards 10 requests to three different file systems, and those file systems forward more requests to 5 different storage device drivers. Now we have a situation where all 10 processes can be doing useful work in parallel on different CPUs, potentially with zero task switches.

Suppose procA sends 20 messages to VFS, but it has to do them one at a time because the IPC is crap. It sends the first message (task switch from procA to VFS), VFS does a little work and sends a reply (task switch from VFS back to procA); procA sends the second message (task switch from procA to VFS), VFS does a little work and has to send a request to FS (task switch from VFS to FS) which also does a little work and sends a request to storage device driver (task switch from FS to driver); then driver sends reply (task switch from driver back to FS) and FS sends a reply (task switch from FS back to VFS) and VFS sends a reply (task switch from VFS back to procA).

For the same 20 messages; you're looking at 2 task switches each for ten of the requests and 6 task switches each for the other ten requests.

Instead of "10 processes can be doing useful work in parallel on different CPUs, potentially with zero task switches" you've got "only one process can do useful work at a time, with at least 80 task switches".

Now assume that (on average) your task switches are twice as expensive because the CPU's caches are full of trash (e.g. after each task switch, the new task's code and data is no longer in CPU's L1 caches and probably not in CPU's L2 caches either), and assume that there's 8 CPUs that could've been used but weren't. If task switches were free it'd be (up to) 8 times worse because you're not using all the CPUs properly, but task switches aren't free (and probably cost more than the work each task does) so it's going to be more like 50 times worse.

LtG wrote:
Also remember that quite often the message contains something that was prepared just prior to sending it, so the message content is, to a varying extent, in the sending core's caches; if you have another core pick up the work then it's guaranteed to be in the cache of the wrong core. In addition, your design has to interrupt that other core, or sacrifice prioritization and latency (and possibly affect throughput).


For my case; thread asks kernel to fetch the message from its message queue, kernel tells CPU to prefetch the first part of the message, and CPU prefetches first part of message into cache while CPU switches back to CPL=3 and while the thread figures out who sent the message, etc. The message data may be in cache before it's accessed anyway.

For your case; kernel does a task switch that destroys everything, then returns to CPL=3 causing a cache miss at RSP and a cache miss at RIP (and probably a bunch of TLB misses too). You've already lost before the message data is accessed.
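
For reference, a sketch of the prefetch trick described above (kernel-side; the message structures and dequeue() are assumed):

Code:
    /* Start fetching the first cache line of the message data before
       returning to CPL=3; the prefetch overlaps with the privilege-level
       switch and the thread's "who sent this?" logic. */

    struct message *sys_get_message(struct thread *t)
    {
        struct message *msg = dequeue(&t->msg_queue);  /* assumed helper */
        if (msg != NULL)
            __builtin_prefetch(msg->data, 0);  /* 0 = prefetch for reading */
        return msg;   /* kernel returns to user mode while the line loads */
    }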

LtG wrote:
Brendan wrote:
LtG wrote:
I have no idea how my design will perform ultimately, so I can't say if it's a good idea or not. If I switch the task on the same core to do the "very little work" for VFS, then I get better cache locality than if I sent an IPI or some other way of communicating to the other core(s).


More like the opposite (if that other task's code and data is cached anywhere, then it's probably cached in a different CPU's caches, so you ruin cache locality by running it on the current CPU). Note that this gets worse once you start thinking about NUMA.

This should get better with NUMA: again, all the data is being processed on the same core, thus in the same NUMA domain. I would think that it's always beneficial to continue work on the same core, unless there are idle cores, in which case the current core could stay with its "main" task instead of diverging into whatever the message was sent for.


You'd either pay a "NUMA penalty" when accessing the message data (which is typically small); or you pay a "NUMA penalty" accessing all of the task's code, stack and data (which is typically much larger). If you want to minimise the "NUMA penalties" you pay the small penalty to avoid the large penalty.

LtG wrote:
Brendan wrote:
Yes, but it's a small price to pay to reduce the number of task switches (which are more expensive, regardless of what you do).

If the message is in the CPU registers and you use TLB tagging, what's the extra price to do a task switch? Compared to an IPI?


The extra price is that the CPU has to fetch code from RIP and data from RSP; and you can expect that code will want to access things in ".data" or ".bss" or the heap; and you have poor cache locality for everything that matters more than the message data.

Note that TLB tagging (e.g. address space IDs) isn't as good as you might think. The problem is "multi-CPU TLB shootdown" - without TLB tagging there are multiple cases (e.g. a single-threaded process) where you know that the TLB entry can't be in other CPUs' TLBs and therefore don't need to send an IPI to invalidate the translation in other CPUs' TLBs; with TLB tagging you can't do this (you have to assume that other CPUs might still be caching the translation in their TLBs). This means that (with TLB tagging) you either send a lot more IPIs for "multi-CPU TLB shootdown"; or you keep track of "postponed shootdowns" and do them during task switches (which is much better for scalability, assuming you're not doing an excessive number of task switches because of your IPC).
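
One common shape for "postponed shootdowns" looks roughly like this (an assumption-laden illustration, not a description of any particular kernel; all names are hypothetical):

Code:
    /* Bump a per-address-space generation on unmap instead of sending an
       IPI; each CPU flushes that ASID at its next task switch if it has
       not yet seen the current generation. */

    struct address_space {
        unsigned      asid;
        unsigned long tlb_gen;     /* incremented (atomically) on every unmap */
    };

    static unsigned long seen_gen[MAX_CPUS][MAX_ASIDS];

    void switch_address_space(int cpu, struct address_space *as)
    {
        if (seen_gen[cpu][as->asid] != as->tlb_gen) {
            invalidate_asid(as->asid);            /* flush this tag's TLB entries */
            seen_gen[cpu][as->asid] = as->tlb_gen;
        }
        load_page_tables(as);                     /* assumed helper */
    }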

LtG wrote:
Brendan wrote:
LtG wrote:
So having a single thread vs one thread per core really depends on the overall design.


Sure, if you design the OS so that most CPUs spend most of their time trying to acquire locks/mutexes, then... ;)

If each core is almost entirely independent, then what locks/mutexes? I realize that your design on this aspect is radically different from mine, but I'm not planning on using almost any locks or mutexes as far as the kernel is concerned. A disk cache (VFS) thread would probably have to use a mutex when evicting a page from memory and when marking a page as currently needed, but that can be done on a page-by-page basis, so the likelihood of two cores wanting to do that should be relatively low and the mutex should be cheap (not many cycles).


If VFS, file systems and storage device drivers are all single-threaded; then there's no need to have any locks or mutexes anywhere in any of them. If VFS, file systems and storage device drivers are multi-threaded (e.g. could be running on different CPUs at the same time) then you're going to need locks/mutexes (and/or lock-free and/or block-free algorithms) everywhere.

Note that it's not really lock contention that I'd worry about - acquiring a lock has a cost even when there's no contention (there's always a cache line that ends up bouncing around between multiple CPUs). Even things like tracking statistics can become a surprising scalability problem (e.g. one CPU doing "add [total_bytes_written],rax" compared to multiple CPUs doing "lock add [total_bytes_written],rax") because of "cache line bouncing everywhere".
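
For comparison, the usual way around that particular problem (a sketch, not from the post):

Code:
    /* Per-CPU, cache-line-padded counters: each CPU increments its own
       line with a plain add (no lock prefix, no bouncing); readers sum
       the lines and accept a slightly stale total. */

    #define MAX_CPUS        64
    #define CACHE_LINE_SIZE 64

    struct padded_counter {
        unsigned long value;
        char pad[CACHE_LINE_SIZE - sizeof(unsigned long)];
    };

    static struct padded_counter bytes_written[MAX_CPUS];

    void add_bytes_written(int cpu, unsigned long n)
    {
        bytes_written[cpu].value += n;   /* touches only this CPU's line */
    }

    unsigned long total_bytes_written(void)
    {
        unsigned long sum = 0;
        for (int i = 0; i < MAX_CPUS; i++)
            sum += bytes_written[i].value;
        return sum;                      /* statistics: stale is fine */
    }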

LtG wrote:
If you try to view everything I've written here through the lens of your OS, then most/all of your criticism might be accurate, but again, my design is quite different from yours. When possible I try to avoid all the problems (locks/mutexes, task switching cost, etc) and hope that eventually it all pays off... Time will tell.


I'm not saying these things because my OS does things differently; my OS does things differently because of these things. ;)


Cheers,

Brendan


