OSDev.org

The Place to Start for Operating System Developers
 Post subject: Microkernel driver-hardware interface and IPC
PostPosted: Sat Jul 16, 2016 5:24 am 

Joined: Sun Feb 28, 2016 5:33 pm
Posts: 45
Location: Merced, California
Well,

This post continues in the wake of this previous thread, where the primary topic of discussion was schemes for the driver-hardware interface and the impact those decisions would have on inter-process communication in microkernels. It quickly became an interesting conversation, but I decided to start a new topic that I could begin after sitting down, consolidating my views and objectives, and doing some reading on the subject, rather than being forced to address those matters on-the-fly.

As was established in the previous thread, I'm looking to build a microkernel architecture where security, modularity, and portability of the kernel are not just marketing buzzwords or succinct summaries - they're top-level design criteria. My line of thinking was that I would prefer to abstract driver access to hardware through kernel system call-based I/O utilities specific to drivers, as it produces a number of benefits for the architecture I'm looking to develop:

  • It further separates the kernel and drivers. If common hardware (say, an Ethernet or graphics card) were shared between two different instances of the kernel on different architectures, the driver would simply need recompilation against the new system call format, calling convention, and instruction set, but would not have to change its source, because it does not rely on a particular architecture's I/O protocols.
  • It enables better security and compartmentalization, at least with respect to nefarious or malfunctioning drivers. Because the kernel would mediate driver-hardware access, it could vet each driver request against the role the driver is assigned to perform. The canonical example presented in the previous thread was that of a compromised, keylogging keyboard driver. The driver, upon loading, would announce itself to the kernel as a keyboard driver (my thoughts, at least initially, are something analogous to PCI class codes). For the keylogger to do anything with the intercepted data, it would have to transmit it to the network stack or hard disk interface, and this would be prohibited by the kernel (if implemented correctly!) on the grounds that those categories of access are outside the scope of a keyboard driver (see the sketch after this list).
  • Redirection. It enables the kernel to "virtualize" a device - while most obvious for things such as /dev/null and /dev/random, it would also allow the kernel to stand in for a resource that isn't physically present on the machine at all (such as networked devices or systems like NASes, computing farms, etc.) and present it to the requesting process as though it were local, without requiring the requesting process to be cognizant of that device's actual status.
  • It enables better reliability and development-friendliness, for much the same reasons - a kernel system call gives the kernel visibility into what the drivers are doing and what information they are requesting, which is important for driver development, debugging, and reverse-engineering of proprietary drivers (brings this to mind). It also allows the kernel to take action should a driver exceed a threshold of illegal accesses (or simply malfunction), such as forcibly unloading the module entirely or taking other administrative action against it.
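
To make the vetting idea from the second bullet concrete, here's a rough sketch of what the kernel-side check might look like - everything in it (driver_class, vet_io_request, the violation counter) is a hypothetical name for illustration, not a real API:

Code:
    /* Rough, illustrative sketch only: a driver declares a class when it is
       loaded, and every I/O request it makes through the kernel API is
       checked against that class before the kernel touches the hardware. */

    enum driver_class { DRV_KEYBOARD, DRV_NETWORK, DRV_STORAGE, DRV_DISPLAY };

    struct driver {
        enum driver_class class;    /* declared at load time, like a PCI class code */
        int violations;             /* count of rejected requests */
    };

    /* Called from the I/O-request system call handler. 'target_class' is the
       class of the device the driver is trying to touch. Returns 0 if the
       request is allowed, -1 if it is outside the driver's declared role. */
    int vet_io_request(struct driver *drv, enum driver_class target_class)
    {
        if (drv->class != target_class) {
            /* e.g. a keyboard driver trying to reach the NIC or the disk:
               reject, and count it towards a forced-unload threshold */
            drv->violations++;
            return -1;
        }
        /* further checks (is this port/MMIO range actually assigned to the
           device this driver is bound to?) would go here */
        return 0;
    }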

I'm (still) wanting to avoid permissioning via GDT/TSS/etc., as it would a) give my kernel a pretty strong architectural link to the x86 ecosystem (and one that isn't well-paralleled by most other CPU families), and b) undermine the list of objectives above. Now, one of the key discussion points in the last thread was the ramifications such a communication architecture would have on IPC. After thinking it over a number of different ways, I have to concur - in its current implementation scheme (which is summarized by what I've cooked up in my head, on odd napkins, etc.), it sucks. In fact, it sucks horrendously. Here's a couple of diagrams I put together to describe the conundrum so far:

[diagram: request path from application through driver and kernel to hardware, with the intervening IPC and context switches]

Now, obviously, this involves a number of ugly delay steps that have to be optimized out. Even assuming the IPC is fairly expedient (which I'm not counting on for the first 1,000 builds of my kernel!), this process involves 3 different context switches, simply to get the message to the hardware that "hey, this program wants this". An even better illustration is what occurs when drivers are compounded, such as SATA drivers that abstract over PCIe, or high-level USB device drivers that must interact with the USB stack and bus. The above image gets uglier:

[diagram: the same flow with stacked drivers, e.g. SATA over PCIe, or USB device drivers over the USB stack and bus]

So, I've begun looking into other means of tackling the issue that still retain the core objectives. When I was looking into the ways microkernels handle these issues, I stumbled across the Wikipedia article on GNU Hurd, which is built on the old Mach microkernel; it notes:

Wikipedia wrote:
The servers collectively implement the POSIX API, with each server implementing a part of the interface. For instance, the various filesystem servers each implement the filesystem calls.


This appears to be a potential solution to the IPC and latency issues described above, at the cost of making the interface a little hairier. I'd have to sit down and work out how to route system calls to the right driver in scenarios where multiple drivers of the same type are present (see the rough sketch at the end of this post). For what it's worth, it's interesting to see some of the means by which other microkernels tackle these problems. So, to put the question fairly broad-side-of-barn: does the "system call"-based kernel API scheme still have merits, and can it be optimized to the point of efficiency? I'm not looking for a microkernel that lounges comfortably in pedagogy, another MINIX - I'm looking for one that shows potential, even if it consumes my life force to realize. If so, what are some of your thoughts on how a "better version" of this scheme might pan out?
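
For what it's worth, here's the rough shape of the routing idea mentioned above - purely illustrative, none of these names (class_table, route_request, etc.) are real APIs. The idea is a table of driver instances per device class, with requests addressed by (class, instance) rather than by a fixed system call number:

Code:
    /* Illustrative only: the kernel (or a name server) keeps a table of
       registered driver instances per class, and requests are routed by
       (class, instance). */

    #define DRV_CLASS_COUNT 16
    #define MAX_INSTANCES    8

    struct driver_instance {
        int msg_port;               /* IPC endpoint of the driver process */
        int in_use;
    };

    static struct driver_instance class_table[DRV_CLASS_COUNT][MAX_INSTANCES];

    /* Route a request to the Nth registered driver of a given class (e.g.
       the second keyboard, or the third block device). Returns the IPC port
       to send the request to, or -1 if no such driver is registered. */
    int route_request(int class, int instance)
    {
        if (class < 0 || class >= DRV_CLASS_COUNT ||
            instance < 0 || instance >= MAX_INSTANCES)
            return -1;
        if (!class_table[class][instance].in_use)
            return -1;
        return class_table[class][instance].msg_port;
    }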


 Post subject: Re: Microkernel driver-hardware interface and IPC
PostPosted: Sat Jul 16, 2016 11:27 am 

Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

Most of this (the first half) makes perfect sense.

For the second half, you're confusing IPC with the kernel API. When a driver calls the kernel to make an IO access request, it calls the kernel API (e.g. using something like "syscall" or "int 0x80" or whatever your kernel API uses); this is not IPC (e.g. sending a message) and does not involve a task switch.

You also might be conflating the concept of IPC with task switches; which would be completely correct for some forms of IPC (e.g. rendezvous messaging) but isn't true of other forms of IPC (e.g. asynchronous messaging).

For example; for my kernels, sending a message causes the message to be moved to the receiver's queue and doesn't (directly) cause a task switch; and receiving a message just means moving it from your queue into your address space and doesn't cause a task switch (unless you do "wait for message" and there are none). An application can send 1234 messages, then the scheduler might (or might not) decide to do a task switch to something else (which may or may not be one of the receivers).

For this; the "best case possible" is that the sender and receiver are running on different CPUs; the sender keeps sending request messages while checking for replies (with "check for message" and not with "wait for message") and handling them if they arrive, and never blocks; and the receiver fetches requests, handles them and sends replies, and never blocks (unless it runs out of requests). In this case sender and receiver might transfer 123456 messages between each other, but there are no task switches at all.
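
As a very rough sketch (illustrative only, not my actual kernel API - queue_send, queue_check and QUEUE_SIZE are made-up names), the core of asynchronous messaging is just a pair of queue operations that never force a task switch by themselves:

Code:
    #define QUEUE_SIZE 64

    struct message { int type; int payload[15]; };   /* fixed-size message */

    struct msg_queue {
        struct message buf[QUEUE_SIZE];
        unsigned head, tail;   /* protected by a lock/IRQ disable in a real kernel */
    };

    /* Kernel-side send: append to the *receiver's* queue. Returns 0 on
       success, -1 if the queue is full. The sender keeps running. */
    int queue_send(struct msg_queue *q, const struct message *m)
    {
        if ((q->tail + 1) % QUEUE_SIZE == q->head)
            return -1;
        q->buf[q->tail] = *m;
        q->tail = (q->tail + 1) % QUEUE_SIZE;
        return 0;
    }

    /* "check for message": non-blocking receive from your own queue.
       Returns 0 if a message was copied out, -1 if the queue was empty
       (so the caller can go do other work instead of blocking). */
    int queue_check(struct msg_queue *q, struct message *out)
    {
        if (q->head == q->tail)
            return -1;
        *out = q->buf[q->head];
        q->head = (q->head + 1) % QUEUE_SIZE;
        return 0;
    }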

Essentially there are two main categories of message passing:
  • "Synchronous". Easier for programmers (requests/replies behave a bit like function calls/returns). Impossible to avoid task switches. Micro-kernel developers using this tend to focus on "extremely fast task switches" because they're impossible to avoid, and tend to favour smaller messages (e.g. messages small enough to fit in CPU registers). This is more suited to single-CPU systems (where a request can't be handled in parallel while sender does something else anyway).
  • "Asynchronous". Harder for programmers (requests/replies behave more like events and aren't like function calls/returns at all). Buffers/queues are used to decouple message send/receive from task switching (which adds overhead for buffer/queue management). Micro-kernel developers using this tend to focus on avoiding task switches (and are less concerned with making task switches fast), and tend to favour larger messages (fewer larger messages rather than more smaller messages). This is more suited to multi-CPU systems (where you want to maximise the amount of work done in parallel where possible to make use of available CPUs).

POSIX is mostly designed for monolithic kernels; and is not designed to minimise the extra overhead of IPC/message passing that micro-kernels have. It's always disappointing when I see the inevitable performance comparisons between "micro-kernel using API designed for monolithic kernel" and "monolithic kernel using API designed for monolithic kernel". Sadly I never see equally fair performance comparisons between "micro-kernel using API designed for micro-kernel" and "monolithic kernel using API designed for micro-kernel".

While micro-kernels do sacrifice some performance for other goals (that, unlike performance, are harder to measure/compare/benchmark); years of researchers and their "let's use POSIX" idiocy have led people to believe the performance sacrifice is far larger than it actually needs to be. ;)


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject: Re: Microkernel driver-hardware interface and IPC
PostPosted: Sat Jul 16, 2016 11:40 am 

Joined: Sun Feb 28, 2016 5:33 pm
Posts: 45
Location: Merced, California
Brendan wrote:
For the second half, you're confusing IPC with the kernel API. When a driver calls the kernel to make an IO access request, it calls the kernel API (e.g. using something like "syscall" or "int 0x80" or whatever your kernel API uses); this is not IPC (e.g. sending a message) and does not involve a task switch.

Ah, whoopsed on the IPC vs. system call stuff. Maybe I should've reread some of the Unix books as well.

So in the more correct case, the user application would send a signal or message to the driver, which would then forward the request via syscall to the kernel and receive/forward the hardware response back to the application.

So, in the case of the API stuff, what are the typical design factors that make POSIX more tailored to monolithic kernels? That is, how would I design an API for users to interface to that makes optimal use of the microkernel environment? Could I do away with user-application system calls entirely and build the API simply on IPC/message passing to the appropriate service?


 Post subject: Re: Microkernel driver-hardware interface and IPC
PostPosted: Sat Jul 16, 2016 1:55 pm 

Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

physecfed wrote:
So, in the case of the API stuff, what are the typical design factors that make POSIX more tailored to monolithic kernels? That is, how would I design an API for users to interface to that makes optimal use of the microkernel environment? Could I do away with user-application system calls entirely and build the API simply on IPC/message passing to the appropriate service?


There are many different ways that micro-kernels can be designed (how IPC is implemented, the protocols used, etc); so there's no easy way to say "an API should be more like ..... for all micro-kernels".

For an example of one specific system (mine); let's say you want to read 10 files. The code might look vaguely like:

Code:
    // Send "open and start reading" request for each file

    for(fileID = 0; fileID < 10; fileID++) {
        build_message(OPEN_WITH_READ, fileTable[fileID].fileName, fileID);
        sendMessage(VFS);
    }

    // Main message handling loop

    errors = 0;
    pending = 10;

    do {
        message = getMessage();
        switch(message.type) {
        case OPEN_WITH_READ_REPLY:
        case READ_REPLY:
            fileID = message.fileID;
            if(message.status != OK) {
                fileTable[fileID].status = message.status;
                errors++;
                pending--;
            } else {
                 // Append this chunk to the file's buffer (position tracked per file)
                 currentPos = fileTable[fileID].currentPos;
                 fileTable[fileID].fileSize = message.fileSize;
                 fileTable[fileID].fileBuffer = realloc(fileTable[fileID].fileBuffer, currentPos + message.readBytes);
                 memcpy( &fileTable[fileID].fileBuffer[currentPos], message.data, message.readBytes);
                 fileTable[fileID].currentPos = currentPos + message.readBytes;
                 if(fileTable[fileID].currentPos >= fileTable[fileID].fileSize) {
                     // Got the whole file, so ask the VFS to close it
                     build_message(CLOSE, fileID);
                     sendMessage(VFS);
                 } else {
                     // Didn't read entire file so ask to read next part
                     build_message(READ_NEXT, fileID);
                     sendMessage(VFS);
                 }
            }
            break;
        case CLOSE_REPLY:
            fileID = message.fileID;
            if(message.status != OK) {
                fileTable[fileID].status = message.status;
                errors++;
            }
            pending--;
            break;
        }
    } while(pending > 0);


In this case; it's reading all files in parallel without caring about the order data arrives; the "open request" is combined with the first "read request" to get rid of an extra message per file; both "open" and "close" happen asynchronously (not just "read"); and the VFS is free to re-order requests and do some (e.g. from file cache) before others (e.g. "file cache misses" where VFS has to ask a file system to fetch). For total task switches; with VFS on a different CPU (and if everything is cached by VFS) there might be none. More likely is that "getMessage();" at the start of the loop might block until a message arrives (causing a task switch to something else, and a task switch back later when a message arrives); but in that case when the scheduler gives you time back you might have a queue of (up to) 10 messages to handle.

Now let's take this one step further. Why not have a single "do this list of things" message that you can send to the VFS? In that case you can send a single message to open all 10 files and start reading them. You could also postpone the "close requests" and do them in a single "close all of these" message. Also because the "file numbers" are controlled by the code (and not returned by "open()") these messages could be built at compile time. In some cases (everything already in VFS file cache, everything fits in a single message) VFS can send a single "here's all the replies" message back too.

Next; what if the kernel supports "batch kernel functions" where you mostly do the same thing but for the kernel API? In that case you can combine the first "open all 10 files" with "getMessage();" at the start; then throughout the middle you can combine "sendMessage(VFS);" with "getMessage();". This almost halves the number of kernel calls (and the "CPL=3 -> CPL=0 -> CPL=3" switching).
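
To illustrate (made-up names, not a real API), the batched request might be little more than a struct listing the operations, plus one combined "send and receive" kernel function:

Code:
    /* Illustrative only: one message carries a whole list of VFS operations,
       and one kernel call both queues an outgoing message and returns the
       next incoming one. */

    #include <stddef.h>

    #define BATCH_OP_OPEN_WITH_READ 1
    #define BATCH_OP_CLOSE          2

    struct batch_entry {
        int         op;             /* BATCH_OP_* */
        int         fileID;         /* chosen by the client, not returned by open() */
        const char *fileName;       /* only meaningful for OPEN_WITH_READ */
    };

    struct vfs_batch_request {
        int                count;
        struct batch_entry entry[10];
    };

    /* Hypothetical "batch kernel function": queue 'out' for delivery to
       'dest' and return with the next message from our own queue in 'in',
       all in one CPL=3 -> CPL=0 -> CPL=3 round trip. */
    int send_and_get_message(int dest,
                             const void *out, size_t out_len,
                             void *in, size_t in_len);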

For best case (all of each file's data returned on first "open with read" request); this adds up to:
  • 2 task switches (we'll have to wait after sending the "open and read all these files" for a reply to come back)
  • 3 messages (one "open and read all these files", one containing all the replies from VFS and one "close all these files")
  • 2 kernel API calls (one to send "open and read" and get the message containing the replies; and one to send the "close all the files")

Now think about how you'd do this with POSIX. For a start "open()" can't be done asynchronously, which means that for each file you open you need a kernel API call to send an "open" message, a task switch (because you have to wait for reply), an "open reply" message, and a task switch back (when reply arrives). For 10 files that adds up to:
  • 20 task switches
  • 20 messages
  • 10 kernel API calls
We've already completely obliterated any hope of getting acceptable performance from the micro-kernel, and we haven't read a single byte yet. Isn't POSIX awesome!? 8)
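
To make that concrete, the "open" step alone in the POSIX version looks roughly like this (pseudo-C; fd[] is illustrative, and fileTable is the same hypothetical table as in the earlier sketch):

Code:
    /* Each open() is synchronous: send an "open" message, task switch away,
       wait for the reply, task switch back - ten times over, before a single
       byte of file data has been read. */
    for(i = 0; i < 10; i++) {
        fd[i] = open(fileTable[i].fileName, O_RDONLY);
        if(fd[i] < 0) {
            fileTable[i].status = errno;
        }
    }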


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject: Re: Microkernel driver-hardware interface and IPC
PostPosted: Sat Jul 16, 2016 7:39 pm 

Joined: Thu Mar 25, 2010 11:26 pm
Posts: 1801
Location: Melbourne, Australia
Brendan wrote:
For the second half, you're confusing IPC with kernel API.
This is important. It looks like you are drawing a picture of how you think a microkernel should look and then force-fitting your design onto it.

I encourage you to draw your system architecture a different way. The outstanding feature of microkernels is that they provide IPC. The IPC is the microkernel.

I encourage you to draw your system so that it shows the messages/events/data that flow between processes as arrows directly between those processes, and to begin to think of that part of your kernel as simply the thing that implements the connection. How you actually send and receive the message/event/data then becomes an implementation detail. Think of process A sending a message to process B, or process A receiving an event from process B, or even process A sharing some memory with process B. The big block you have at the bottom of your architecture labelled 'kernel' should basically disappear.

Now some microkernels do supply non-IPC functions. Mine supplies clock and timer functions for example. Yours supplies physical IO. A block for that functionality may be shown in the usual way but perhaps with a different name, perhaps HAL or something. I think if you do this and read Brendan's post it will all make much more sense.

_________________
If a trainstation is where trains stop, what is a workstation ?


 Post subject: Re: Microkernel driver-hardware interface and IPC
PostPosted: Sun Jul 17, 2016 4:52 am 

Joined: Sun Feb 28, 2016 5:33 pm
Posts: 45
Location: Merced, California
gerryg400 wrote:
This is important. It looks like you are drawing a picture of how you think a microkernel should look and then force-fitting your design onto it.


I'm worried that you guys might be reading too much into what (to me) was a much more benign error. In my defense, I had a rough night of sleep and wasn't running at full capacity. That, and the previous post might have put me into "system call frenzy" and stuck that word in my head a little too much.

In fact, once I managed to kick myself into gear, I think I had one of the better realizations I've had in regard to this project.

The mental image of what a microkernel should look like is still hazy to me, but it's beginning to take shape. I'm starting to see the benefits and core principles of a microkernel - that of a kernel that provides hardware abstraction, process management, IPC, and perhaps specialized tasks such as clocking, but that otherwise leaves the applications free to interact amongst themselves to fulfill their needs without the overhead of kernel mediation.

Tell me if I'm aligned in the right direction, but I'm starting to see microkernels as fulfilling the role of a process referee and hardware interface, rather than attempting, in the manner of monolithic kernels, to provide and handle everything but the kitchen sink. I'm finally beginning to realize what the minimalist, microkernel approach actually means.

This explains why IPC is one of the big issues with microkernels - that's one of the core services the kernel is supposed to provide.

Maybe it's a false "aha!", but it certainly feels like the dark clouds of OS development have parted enough for me to get a ray of sunshine on my face. It's realizations like the above that help to remind me why I'm wanting to pursue this.

I'll get back to the both of you on the more technical side of things in a few hours after I've had time to fully recalibrate myself, but for the first time in a long time, I'm excited to go forward.


 Post subject: Re: Microkernel driver-hardware interface and IPC
PostPosted: Sun Jul 17, 2016 11:23 am 

Joined: Wed Jun 03, 2015 5:03 am
Posts: 397
This thread has prompted some thoughts about "what is a kernel".

In fact, I see a kernel as a set of services, no matter what name is used, because it's always more or less about services. We can arrange the kernel types in order of an increasing number of services like this - exo-kernel, nano-kernel, micro-kernel, monolithic-kernel. And there's no essential border between these names, just a gradual increase in the number of services.

Another approach is to describe the kernel types by how easy it is to write an application for an OS with that type of kernel. But here again we see the same services, just from the point of view of the application developer. If we start from a "zero-kernel", which is essentially just bootstrap code, then the application developer has to do almost everything himself, with only some environment details predefined for applications by the zero-kernel. With an exo-kernel we have a few services in addition to the predictable environment. With micro/nano kernels we have a more extensive set of services. And with a monolithic kernel it is easiest to write a simple application, but for a complex, low-level-aware application we again run into trouble. In the latter case the trouble moves from having to write more code to the "not implemented" constraint - we shouldn't expect something that the kernel developers forgot to implement. That last type of constraint is the most frustrating, because it prevents developers from doing something even if they are ready to write a lot of code. That's why I think monolithic kernels are less popular among application developers, but are the ones actually used in many popular OSes (because OS developers have more control over the users of the OS's services, including application developers).

Another possible border case is related to the processor's privilege levels. When something runs at a privileged level it can be considered "kernel land"; when something runs at a less privileged level it can be considered "user land". It's simple, but it tells us nothing about the actual kernel architecture, its capabilities and services. That's why I think the privilege level shouldn't be considered when we talk about kernel design - it's just a technical detail about how kernel developers expect to manage misbehaving applications.

But how to classify the kernels then? I think the classification can be made along loosely defined groups of services. First type - a no-services kernel (zero-kernel). Second type - a basic-services kernel (memory, process, thread, and basic hardware management services). Third type - an application-friendly (but perhaps too rigid) kernel with APIs for doing many things like file or network access, indirect hardware access like playing sounds or viewing movies, and so on.

Linux, Windows, Android and many other OSes then belong to the third type, while many OSdevers' OSes belong to the second type, and bootloaders like GRUB and UEFI belong to the first type.

It looks simple, but what do you guys think about it?

_________________
My previous account (embryo) was accidentally deleted, so I have no chance but to use something new. But may be it was a good lesson about software reliability :)


 Post subject: Re: Microkernel driver-hardware interface and IPC
PostPosted: Sun Aug 14, 2016 10:37 am 

Joined: Thu Aug 13, 2015 4:57 pm
Posts: 384
embryo2 wrote:
With micro/nano kernels we have a more extensive set of services. And with a monolithic kernel it is easiest to write a simple application, but for a complex, low-level-aware application we again run into trouble. In the latter case the trouble moves from having to write more code to the "not implemented" constraint - we shouldn't expect something that the kernel developers forgot to implement. That last type of constraint is the most frustrating, because it prevents developers from doing something even if they are ready to write a lot of code. That's why I think monolithic kernels are less popular among application developers, but are the ones actually used in many popular OSes (because OS developers have more control over the users of the OS's services, including application developers).


I don't think the difference between micro vs mono is in the easiness of writing applications, but rather in how much is provided by the kernel.

Applications aren't developed against a kernel; they're developed against some API, which is either the OS-provided one or some library/framework. So writing applications is just as easy or difficult with both micro and monolithic kernels, assuming both are equally developed/complete.

Also, I've never thought of monolithic kernels as being less popular among app dev's, at least that doesn't apply to me and I haven't noticed anyone complaining.

embryo2 wrote:
But how to classify the kernels then? I think the classification can be made along loosely defined groups of services. First type - a no-services kernel (zero-kernel). Second type - a basic-services kernel (memory, process, thread, and basic hardware management services). Third type - an application-friendly (but perhaps too rigid) kernel with APIs for doing many things like file or network access, indirect hardware access like playing sounds or viewing movies, and so on.

Linux, Windows, Android and many other OSes then belong to the third type, while many OSdevers' OSes belong to the second type, and bootloaders like GRUB and UEFI belong to the first type.


Umm, isn't that what the micro vs mono classification has always been? A micro kernel is just a mono kernel with some (a majority?) of the services a mono provides outsourced to userland processes. For applications all the same services exist; they're just not provided by the kernel but by the rest of the OS (services).

I'm guessing but I can think of three reasons to design a monolithic kernel (using Linux as an example):
  1. Organic growth; people just keep adding code to the existing code base and rarely refactor, thus you get a big blob of code.
  2. Performance; People seem to think that monolithic is faster, though I'm not sure how accurate that is these days. I haven't seen any good comparisons of the two.
  3. Easiness; At least initially monolithic is probably easier, however I would argue that once it grows it becomes much more difficult to maintain.

I think it's also worth noting that Linux is moving towards micro, at least to some extent.

