IPC by "process hopping"

Demindiro · Post by **Demindiro** » Tue Sep 07, 2021 5:09 am

Hello!

I'd like to talk about a form of IPC which doesn't seem to be used by any OS yet*. I dub it "process hopping". It is aimed at micro- & hybrid kernels.

Instead of having a process with one or more local threads which block/yield on IPC, a thread initiating a request instead gets moved ("hops") to the receiving process. It then runs a handler in that process and "hops" back to the sending process when it has fetched the data it needs.

A (very) simplified visualization of an IPC request

A more complex scenario would be one where data needs to be fetched from an external source, which may take many milliseconds. A thread can exit when there is nothing else to be done. When the data is ready, the device triggers an interrupt and the kernel creates a new thread. This thread is then passed to the server process.

A simplified visualization of an IPC request involving data from an external source

The obvious advantages of this approach are:

Trivial soft real-time communication, since threads immediately begin processing whatever was requested.
Simpler scheduler implementation, since there is no concept of a blocked or sleeping thread. Instead, threads are created on demand.
Lower memory usage, since threads that aren't running simply don't exist. The process still needs to save the registers it uses but it will use stack space that is likely allocated anyways.

Some concerns come to mind though:

Restoring the state of the requesting thread can potentially be complex. I think it can be fairly simple with some kernel support but I'm not certain.
There may be a risk of having too many concurrent threads processing requests since that's the encouraged pattern, degrading performance. Making threads exit when waiting on some external source should prevent this but again, I'm not 100% certain.
This approach will only be efficient if creating threads is cheap. However, I do believe that it is possible to make threads very cheap to create.
Context switching can be very expensive. Then again, AFAICT the majority of modern CPUs has support for ASIDs/PCIDs so hopefully this overhead is minimal.

What are your thoughts? Are there any flaws (or benefits) that I missed?

* Perhaps someone did already implement it but uses a fancy name for it that I didn't think of. My search-fu is failing me.

nexos · Post by **nexos** » Tue Sep 07, 2021 11:59 am

L4 implements something like what you put above, only trimmed down to be very efficient. It is called "timeslice donation". Basically, when a process sends a message, it doesn't do a reschedule, but instead, swaps out the context and runs the message handler in the same timeslice as the caller. Excuse my very crude explanation, google "L4 timeslice donation" for a better explanation.
Your model is very radical to what OSes are typically built on. I'd be interested to see results of such a system. L4 maps better to "typical" OS scheduling models (but still is very different). In my OS, I am taking a lot of influence from L4. My messaging scheme will be based off of L4, with the addition of an asynchronous version of messaging.

Korona · Post by **Korona** » Tue Sep 07, 2021 12:04 pm

What does it mean for a thread to hop into a different process? Will it's stack be copied? Mapped into the other process? What if there is another thread that shares the same stack address? What if the thread wants to take data back to the original process? Where will that data be stored?

thewrongchristian · Post by **thewrongchristian** » Tue Sep 07, 2021 12:33 pm

Korona wrote:What does it mean for a thread to hop into a different process? Will it's stack be copied? Mapped into the other process? What if there is another thread that shares the same stack address? What if the thread wants to take data back to the original process? Where will that data be stored?

I interpret it as basically switching user contexts. It might be that the "server" in this case has no user context, and is an entirely kernel side context.

Given that kernel level threads in this case can die and be recreated, then attached to user contexts to provide a message return value, it looks very much like a N:M threading model, where N user threads are sharing M kernel threads, but the difference being the kernel threads are not tied to a process, but are instead generic kernel threads to be co-opted by any process that needs them, a bit like a thread pool.

In this sense, there is no need to copy the stack (the kernel stack, that is) and the user context will stay with the representation of the user thread that made the request (presumably tied to the requesting process.)

Presumably any mass data transfer can be achieved using standard shared memory mappings built into the IPC mechanism provided. Ie. The request will contain a pointer to some buffer, the kernel will map and pass ownership of that buffer into the server process, switch to the server address space (if required, when the server is a user process.) Then the server process will fill in the buffer as required when handling the IPC, and pass back ownership of the memory to the client as part of the IPC response.

It sounds like this could be implemented with a single kernel thread per CPU core, which would simplify task switching simply to the selection of the next user context to resume, and returning to that user context.

Of course, the slow bit about inter-process IPC is switching the address space of the user context, as that is described by the user process memory mappings, and may involve the disposal of TLB cache state and subsequent page faults etc, so for inter-process user level IPC, it won't be a massive performance gain, and for IPC to a kernel level server, it will have little benefit that I can see over a more traditional kernel threading model (switching kernel threads should be fast if the user level context can be lazily switched).

Octocontrabass · Post by **Octocontrabass** » Tue Sep 07, 2021 4:21 pm

Demindiro wrote:Trivial soft real-time communication, since threads immediately begin processing whatever was requested.

You can have the same thing with a more typical scheduler by spawning a new server thread in response to each client request. Of course, you won't get the other advantages that way.

Demindiro wrote:Lower memory usage, since threads that aren't running simply don't exist. The process still needs to save the registers it uses but it will use stack space that is likely allocated anyways.

Even if the scheduler doesn't see blocked threads, you still need to store the entire thread's context somewhere, so I'm not sure you're lowering memory usage.

thewrongchristian · Post by **thewrongchristian** » Wed Sep 08, 2021 3:25 am

Demindiro wrote:Hello!

I'd like to talk about a form of IPC which doesn't seem to be used by any OS yet*. I dub it "process hopping". It is aimed at micro- & hybrid kernels.

Instead of having a process with one or more local threads which block/yield on IPC, a thread initiating a request instead gets moved ("hops") to the receiving process. It then runs a handler in that process and "hops" back to the sending process when it has fetched the data it needs.

...snip diagrams...

The obvious advantages of this approach are:

Trivial soft real-time communication, since threads immediately begin processing whatever was requested.

Simpler scheduler implementation, since there is no concept of a blocked or sleeping thread. Instead, threads are created on demand.

Lower memory usage, since threads that aren't running simply don't exist. The process still needs to save the registers it uses but it will use stack space that is likely allocated anyways.

Only the latter comes to mind here, due to the reduced stack burden of only needing a single kernel thread per core.

You'll still have to schedule user processes.

Real-time doesn't really apply to IPC. When talking about real time, it's more in the context of responding to events like hardware interrupts. In the case of hard real time, within hard bounded time.

Your single kernel thread per core doesn't make that any easier.

Demindiro wrote:
Some concerns come to mind though:

Restoring the state of the requesting thread can potentially be complex. I think it can be fairly simple with some kernel support but I'm not certain.

Should be simple enough. Switching to a user level context is simple.

Demindiro wrote: [*] There may be a risk of having too many concurrent threads processing requests since that's the encouraged pattern, degrading performance. Making threads exit when waiting on some external source should prevent this but again, I'm not 100% certain.

A thread that is not running, especially if it has no stack, will not use any resources other than what small amount of memory is required to maintain its state.

Demindiro wrote:

[*] This approach will only be efficient if creating threads is cheap. However, I do believe that it is possible to make threads very cheap to create.

Your thread model would be very cheap to create threads, as your kernel threads have very little context.

Demindiro wrote: [*] Context switching can be very expensive. Then again, AFAICT the majority of modern CPUs has support for ASIDs/PCIDs so hopefully this overhead is minimal. [/list]

This is mainly a user level context issue, and is the same for your thread model or kernel thread per user thread model.

Demindiro wrote: What are your thoughts? Are there any flaws (or benefits) that I missed?

Your kernel thread model would depend on basically stackless coroutine like concurrency, reducing concurrency to a collection of state machines that are called in turn.

This is the model used by, for example, protothreads, which is very high performing in low memory situations, but is inherently difficult to use as the complexity of the task the thread has to achieve goes up (you're basically restricted to a single function per thread).

Of course, if you can break down your server tasks into groups of cooperating single function threads, wrapped in syntactic sugar to hide the state machine as done in protothreads, you might be on to a winner.

rdos · Post by **rdos** » Wed Sep 08, 2021 3:59 am

Personally, I don't think there is much to gain with process hopping. You will still need to do a user->kernel switch, reload CR3 and go back to user mode. The only thing you basically gain is the time it takes to put a thread to sleep and wake it up again, which is much faster than ring switches and the effects of TLB flushes. Both methods require saving and reloading of thread context (like registers) and so there is no obvious advantage in that area either.

Demindiro · Post by **Demindiro** » Wed Sep 08, 2021 5:34 am

Korona wrote:What does it mean for a thread to hop into a different process?

In a traditional OS processes have one or more threads. These threads are tied to their respective process.

In the OS I'm designing a process can have zero or more threads. A thread can request the kernel to be moved into a specific process, i.e. the sending process loses a thread and the receiving process gains one.

Korona wrote:Will it's stack be copied? Mapped into the other process? What if there is another thread that shares the same stack address?

A thread will not have any stack when it is moved. Instead, it allocates a new one which can be as small or as big as it needs to be.

Korona wrote:What if the thread wants to take data back to the original process? Where will that data be stored?

Before hopping, the thread specifies either a range of pages to be shared/moved or an address with a small amount of data to be copied. After hopping, it either maps the pages into the process' address space or copies the data to wherever.

Octocontrabass wrote: You can have the same thing with a more typical scheduler by spawning a new server thread in response to each client request. Of course, you won't get the other advantages that way.

I am considering this approach, but the memory overhead can add up quickly if there are many pending requests.

Octocontrabass wrote: Even if the scheduler doesn't see blocked threads, you still need to store the entire thread's context somewhere, so I'm not sure you're lowering memory usage.

I expect memory usage to be lower overall since there is likely stack space that is already allocated anyways & the user knows which registers actually need to be preserved (e.g. the contents of temporary registers may be safely discarded).

thewrongchristian wrote: Real-time doesn't really apply to IPC. When talking about real time, it's more in the context of responding to events like hardware interrupts. In the case of hard real time, within hard bounded time.

Audio servers benefit from real-time scheduling since audio latency is very noticeable.

thewrongchristian wrote: Your single kernel thread per core doesn't make that any easier.

I don't intend to have a single kernel thread per core. Instead, kernel threads are treated as regular threads except that they operate on kernel memory (& they can become user threads when necessary).

thewrongchristian wrote: Should be simple enough. Switching to a user level context is simple.

Indeed. I raised the issue because in a previous design I tried to do state restoration entirely in userspace which could be complex since the process has to check it isn't being spoofed (e.g. a malicious thread could hop into a client pretending to be the response of another process).

thewrongchristian wrote: A thread that is not running, especially if it has no stack, will not use any resources other than what small amount of memory is required to maintain its state.

On 64-bit systems with 32 registers a thread uses at least 256 bytes of memory. While it isn't much by itself, it can add up quickly if there are many blocked or sleeping threads. I do also intend to have many "threads" "running" at the same time since I believe it is the easiest to scale across many cores.

thewrongchristian wrote: This is the model used by, for example, protothreads, which is very high performing in low memory situations, but is inherently difficult to use as the complexity of the task the thread has to achieve goes up (you're basically restricted to a single function per thread).

I'll give it a read. Thanks!

____________

A major reason why I'm considering this approach is so that processes can naturally take advantage of all cores on a system. I could of course create as many threads per process as there are cores, but this may result in significant overhead if there are many cores and many processes. This overhead is avoided by creating or moving a thread on demand.

Octocontrabass · Post by **Octocontrabass** » Wed Sep 08, 2021 8:39 am

Demindiro wrote:
Octocontrabass wrote: Even if the scheduler doesn't see blocked threads, you still need to store the entire thread's context somewhere, so I'm not sure you're lowering memory usage.
I expect memory usage to be lower overall since there is likely stack space that is already allocated anyways & the user knows which registers actually need to be preserved (e.g. the contents of temporary registers may be safely discarded).

So what happens to the thread's stack? Isn't that also part of its context?

Korona · Post by **Korona** » Wed Sep 08, 2021 10:37 am

Mapping a page and unmapping it again is much more expensive than copying a hundred KiB of data or so.

thewrongchristian · Post by **thewrongchristian** » Wed Sep 08, 2021 10:45 am

Demindiro wrote:
thewrongchristian wrote: Real-time doesn't really apply to IPC. When talking about real time, it's more in the context of responding to events like hardware interrupts. In the case of hard real time, within hard bounded time.
Audio servers benefit from real-time scheduling since audio latency is very noticeable.

Indeed, but here, we're bounded by hardware requirements, namely keeping the audio buffer populated. It is inherently a hardware based real time requirement. Missing real time population of the audio buffer results in audio artifacts like audible pops.

If you're referring to latency in initiating playing of an audio clip, for example, then that is more a server priority issue. The IPC latency should be well below anything a human can perceive (we're talking on the order of microseconds for a "slow" IPC mechanism), so the latency is a matter of scheduler priority rather than real time scheduling requirements.

You'd want to split your sound service into a two components. One to handle interaction with the rest of the user system, and one to handle interaction with the hardware. The former would be a high priority user process, the latter arguably having real time scheduling requirements to service hardware buffers.

Demindiro wrote:
thewrongchristian wrote: Your single kernel thread per core doesn't make that any easier.
I don't intend to have a single kernel thread per core. Instead, kernel threads are treated as regular threads except that they operate on kernel memory (& they can become user threads when necessary).

Right, so it is more like a N:M threading model, with M kernel threads servicing N user threads, but in your case, not all N user threads are from the same process.

Demindiro wrote: A major reason why I'm considering this approach is so that processes can naturally take advantage of all cores on a system. I could of course create as many threads per process as there are cores, but this may result in significant overhead if there are many cores and many processes. This overhead is avoided by creating or moving a thread on demand.

Yes, I understand, and multiplexing kernel threads between user threads will certainly help. It is something I've considered myself, but I couldn't get it straight in my head how it would work in practice.

It could only really work for user threads which are CPU bound, and being pre-empted by the scheduler. Any user threads that are instead not being pre-empted would be needing the kernel thread, as the kernel would be doing work on behalf of the user process (such as handling a system call).

I suppose for a microkernel system, the kernel portion is just passing IPC requests back and forth, so there is little to no kernel level state that is particular to a request, which is why I thought you'd be multiplexing a single kernel thread per core.

But for my monolithic kernel, it wouldn't provide the benefit nearly so much.

AndrewAPrice · Post by **AndrewAPrice** » Mon Sep 13, 2021 8:56 am

By processing hopping, it sounds similar to what I was thinking here: viewtopic.php?f=15&t=36680&p=307500#p307500

I abandoned this. In theory you can make thread creation very efficient. We can use object pools (stacks, kernel objects, etc.) In practice it was a bit more difficult because libc does a bunch of work on thread creation/destruction (allocating TLS, calling TLS constructors, allocating a stack, deallocating and calling destructors, etc.)

Once I went down the rabbit hole of thinking about TLS object pools that are already initialized, etc. I basically ended up with what are essentially sleeping threads waiting for an incoming message.

Some learnings that made it into my OS from exploring this:

I came up with my own IDL called Permebufs, e.g. [urlhttps://github.com/AndrewAPrice/Perception/blob/master/Libraries/perception/permebuf/window_manager.permebuf[/url]. I have a Permebuf compiler that generates C++ server and client stubs. Processes can implement named "services", and you can enumerate over services by name, and call methods on them from any process much like a function call. The compiler was a lot of work and the investment paid off by making all RPC standardized and easy.
If a message is small (I call them "mini messages"), I can pass it in registers - the kernel doesn't need to access process memory, even though it does store it in a temporary object which I allocate in an object pool, so it's fast.
If a message is large or dynamically sized, I allocate it in page aligned memory. Then, when you transfer the request/response between the caller/callee, you can simply 'gift' the memory pages to the other process. Expecting most messages to be small, I optimized allocating/deallocating 4 KiB.
Fibers are awesome. They're not too complicated to implement, and require no kernel support. Basically, in user space you have a bunch of "tasks" (registers + stack - all allocated in user space), and anytime you have to sleep (e.g. you synchronously call an RPC, sleep until a timeout, wait for another fiber, etc.) your task (a.k.a. fiber) sleeps, but the thread immediately starts executing the next task. The thread only sleeps if there are no tasks to run. Using user-space object pools, creating and destroying tasks is super efficient.

linguofreak · Post by **linguofreak** » Wed Sep 22, 2021 5:12 pm

Demindiro wrote:Hello!

I'd like to talk about a form of IPC which doesn't seem to be used by any OS yet*. I dub it "process hopping". It is aimed at micro- & hybrid kernels.

I've thought about a model like this a fair bit. The big problem is that to work efficiently, I think it would require a memory management setup that most CPU architectures don't have (the only ISA I've come across that really seems to have appropriate memory management for this scheme is IBM ESA/390 / z/Architecture with its access register mechanism).

To really make this model work well, you need to have the ability to have multiple address spaces available at once (think something like Intel segmentation, except that each "segment" should be its own paged address space instead of a window into a single paged address space), and the set of available address spaces needs to automatically change according to which address space you're running code from (so that, for example, an application can't access the filesystem driver's private data, but once a thread jumps into the filesystem driver, the driver can immediately load the address space for its private data), and what address space you're currently using for thread-local data (before making a call to an untrusted external program, an application might want to switch to a thread-local address space that doesn't have access to load any of the application's data except for what the external program is to operate on). You also probably want some kind of "secure stack" to allow returning to a thread-local address space with broader access from one with narrower access (without giving the one with narrower access unqualified permission to load the address space with broader access).

From what I've been able to determine, ESA/390 has all of these features, but precious little else does. Intel segmentation has the multiple address spaces (implemented suboptimally), but doesn't have the rest (you'd basically need to split the LDT into code and stack parts, and then have a field in each segment descriptor designating a half-LDT. On any CS or SS load, you'd then use that field to load a new half-LDT, and then you'd need the secure stack mechanism described above).

Food for thought: You call this IPC model "process hopping" (I might have used a similar term early on while I was considering similar models), but if we move away from the "flat static address space" memory management model to something that can accommodate this IPC model well, what actually *is* a process? In such a model, we can describe individual address spaces as stacks/threads, or heaps, or (memory mapped) files (including both code files, that is, executables and libraries, and data files). In the traditional flat address space model, a "process" is one or more code files, plus one or more threads, plus zero or more heaps and zero or more data files, all put together in one address space. The flat address space nature of the hardware binds these together much more strictly than they would be otherwise (because of context-switching overhead), and leaves you with the options, if you want to isolate one component from another for security, of either putting the first in the kernel and the second in user-space, or putting them in completely separate address spaces (but the subcomponents of a given component all have to be in the same address space). So a "process" in the flat address space model is basically just the contents of an address space. In the memory management model described above, two different threads might have considerable overlap in what address spaces they have loaded but they may never at any time have the same set of address spaces loaded (and, of course, their stack/thread local address spaces will always be different), so identifying what consitutes a process is trickier.

deadmutex · Post by **deadmutex** » Tue Oct 12, 2021 12:43 pm

This sounds similar to migrating threads as explained in a Mach paper. Instead of a traditional context and state switch, only a subset of the state is switched. So while the address space changes, the registers, thread priority, and run status could remain unchanged. This allows the kernel to perform an upcall into userspace to execute user-mode code. And if the user code later performs a system call to reenter the kernel, another thread migration to userspace could occur.

A more recent paper briefly covers a few of the disadvantages of using this approach instead of the typical method. One is this: how will your system handle the case where a client calls a server calls a server and a protection fault occurs? Another potential issue is handling preemption.

1. Ford, B. and Lepreau, J. "Evolving Mach 3.0 to a Migrating Thread Model." In Proceedings of the Winter 1994 USENIX Technical Conference and Exhibition, pages 97–114, 1994, https://bford.info/pub/os/thread-migrate.pdf.

2. Parmer, G. “The case for thread migration: Predictable IPC in a customizable and reliable OS,” in Proceedings of the Workshop on Operating Systems Platforms for Embedded Real-Time applications (OSPERT 2010), 2010, p. 91, https://www2.seas.gwu.edu/~gparmer/publ ... pert10.pdf.

EDIT: added second paper & disadvantages

linguofreak · Post by **linguofreak** » Wed Oct 13, 2021 10:47 am

deadmutex wrote: A more recent paper briefly covers a few of the disadvantages of using this approach instead of the typical method. One is this: how will your system handle the case where a client calls a server calls a server and a protection fault occurs? Another potential issue is handling preemption.

Those sorts of things are why I favor the hardware-driven non-flat-address-space approach I described above, and the abandonment of the concept of a "process".

OSDev.org

IPC by "process hopping"

IPC by "process hopping"

Re: IPC by "process hopping"

Re: IPC by "process hopping"

Re: IPC by "process hopping"

Re: IPC by "process hopping"

Re: IPC by "process hopping"

Re: IPC by "process hopping"

Re: IPC by "process hopping"

Re: IPC by "process hopping"

Re: IPC by "process hopping"

Re: IPC by "process hopping"

Re: IPC by "process hopping"

Re: IPC by "process hopping"

Re: IPC by "process hopping"

Re: IPC by "process hopping"