Here are my random thoughts on interrupt latency... First of all, the point you bring up about priority is absolutely valid; in fact it's the crux of my argument: some devices need to be serviced in a more time-critical manner than others, so prioritized servicing makes sense to me. I don't much care whether the printer is ready for the next page if there's a Gigabit Ethernet card craving attention (for example). As for the overhead of making nested interrupts possible, I guess the only way to be sure is to try both approaches and do some benchmarking.
I can imagine a situation as follows (here we go again...). Let's say you have N devices, numbered 0 through N-1, whose ISRs run at increasingly higher priorities (device 0 is the lowest priority, device 1 the next higher, and device N-1 the highest).
Now imagine the nested-interrupt case in the following situation: device 0 interrupts the CPU, its ISR runs most of the way through, and it is about to queue up some work to send a message to its corresponding driver thread. Before it gets the chance, device 1 interrupts, goes through the same steps, and is in turn interrupted by device 2, and so on. The end result is that device N-1's ISR runs to completion after a fairly short wait, and queues up a message for its driver thread.
Now everything unwinds, and assuming the driver thread priorities mirror their corresponding ISR priorities, the kernel switches to device N-1's driver thread and delivers its message. All the other messages stay queued.
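Just to make that sequence concrete, here's a tiny user-space sketch of the nested case. It only models the control flow; nothing here is a real ISR or any particular kernel's API, and all the names (isr, queue_driver_message, NDEV) are invented for illustration. A recursive call stands in for the higher-priority device preempting the current ISR right before it queues its message, and after the unwind the highest-priority pending driver thread is picked.

#include <stdio.h>

#define NDEV 4                    /* N devices, 0 = lowest priority      */

static int queued[NDEV];          /* 1 once a message is queued for the
                                     device's driver thread              */

static void queue_driver_message(int dev)
{
    queued[dev] = 1;
    printf("ISR %d: message queued for driver thread %d\n", dev, dev);
}

/* The recursive call stands in for device dev+1 preempting this ISR
 * just before it gets to queue its message.                            */
static void isr(int dev)
{
    printf("ISR %d: device-specific work done\n", dev);
    if (dev + 1 < NDEV)
        isr(dev + 1);             /* higher-priority interrupt nests     */
    queue_driver_message(dev);    /* resumes only after nested ISRs end  */
}

int main(void)
{
    isr(0);                       /* device 0 interrupts first           */

    /* Once everything has unwound, pick the highest-priority driver
     * thread with a pending message (device NDEV-1 here).               */
    for (int dev = NDEV - 1; dev >= 0; dev--) {
        if (queued[dev]) {
            printf("kernel: switch to driver thread %d first\n", dev);
            break;
        }
    }
    return 0;
}

Running it prints the device-specific work for ISRs 0 through NDEV-1 in order, then the message-queueing in reverse order as the nesting unwinds, and finally the switch to driver thread NDEV-1, which is exactly the ordering described above.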
What if non-nested interrupts were being used? In this case, assuming the interrupts were triggered at roughly the same times as before, device 0's ISR would run to completion, resulting in a context switch to device 0's driver thread, and potentially message delivery (depending on how preemptible the implementation is), before device N-1's ISR even gets a chance to run. If the other N-2 devices' interrupts also get through first, multiply that context switch (and possible message-pass) by N-2 before device N-1's ISR finally runs.
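To put rough numbers on that comparison, the sketch below does the back-of-envelope arithmetic for how much work can sit between device N-1's interrupt being raised and (a) its ISR running, (b) its driver thread getting the message, under each scheme. All the per-operation costs are invented placeholders; only the way the totals scale with N is the point.

#include <stdio.h>

int main(void)
{
    const int    N      = 8;      /* number of devices                   */
    const double t_isr  = 2.0;    /* assumed full ISR body time (us)     */
    const double t_tail = 0.2;    /* assumed leftover tail of an ISR     */
                                  /* preempted near its end              */
    const double t_cs   = 5.0;    /* assumed context-switch time (us)    */
    const double t_msg  = 3.0;    /* assumed message-delivery time (us)  */
    const double t_nest = 0.5;    /* assumed nested-interrupt entry/exit */

    /* Nested: device N-1 preempts immediately, so its ISR latency is
     * basically the interrupt entry cost.  Its driver thread then waits
     * for its own ISR, the unwind of the N-1 preempted ISR tails, and
     * one context switch plus message delivery.                         */
    double nested_isr = t_nest;
    double nested_drv = nested_isr + t_isr + (N - 1) * (t_tail + t_nest)
                      + t_cs + t_msg;

    /* Non-nested (worst case): each of the N-1 lower-priority ISRs runs
     * to completion, each followed by a context switch and possibly a
     * message-pass, before device N-1's ISR even starts.                */
    double flat_isr = (N - 1) * (t_isr + t_cs + t_msg);
    double flat_drv = flat_isr + t_isr + t_cs + t_msg;

    printf("nested    : ISR after ~%4.1f us, driver thread after ~%4.1f us\n",
           nested_isr, nested_drv);
    printf("non-nested: ISR after ~%4.1f us, driver thread after ~%4.1f us\n",
           flat_isr, flat_drv);
    return 0;
}

With those made-up numbers the non-nested case is dominated by the N-1 context switches and message-passes standing in front of the high-priority ISR, while the nested case mostly pays the per-level nesting bookkeeping and the unwind, which is the trade-off I'm trying to describe.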
I guess my point is that there's overhead in either case. It probably comes down to the frequency and distribution of interrupts during "typical usage" (whatever that means) and under system stress. I'll have to experiment, once I have enough code to experiment with. ;D