OSDev.org

The Place to Start for Operating System Developers
It is currently Thu Mar 28, 2024 7:41 am

All times are UTC - 6 hours




 Post subject: IRQ balancing
PostPosted: Sun Oct 04, 2015 4:06 pm 

Joined: Thu Aug 29, 2013 4:10 pm
Posts: 15
Can anyone point me to any materials regarding IRQ balancing on (modern, boring, shared-memory, mostly uniform and cache-coherent) multiprocessors, or express their own thoughts on the matter? It seems like a problem that should receive almost as much academic attention as process scheduling, but I can’t in fact find anything apart from the documentation section on the irqbalance website, which only briefly describes a rather ad-hoc approach.

Rationale, in case I don’t get around to implementing things and it might still be useful to somebody else: It seems that this is one of the few policy-related things the resident part of a kernel/monitor has to be concerned with: scheduling decisions might (or might not, I’m not yet sure) be postponable to some sort of “timer driver”, lightweight intra-CPU synchronous IPC à la Liedtke is all coding and no policy, but something has to decide, centrally, where to deliver a particular IRQ. (Conveniently, deciding where to deliver inter-CPU synchronous IPC is exactly the same question.)


 Post subject: Re: IRQ balancing
PostPosted: Sun Oct 25, 2015 1:39 pm 

Joined: Fri Feb 15, 2013 9:29 pm
Posts: 35
In the past I have dealt with this in several ways. The way I like (I am not going to say that it is the best) is to have my task scheduler give a higher priority to tasks that have more external interrupts to handle. When an external interrupt is raised, the IRQ handling code runs first. The task(s) registered as using that IRQ are then put in a high priority queue, where they are most likely to run sooner and deal with the end result of the IRQ. This helps in cases like network data having been received and the buffers needing to be emptied.
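
A minimal sketch of the scheme described above, in C. The names (task_create, irq_boost_tasks, pick_next), the two-queue layout and the "boost lasts for one scheduling decision" rule are assumptions made for illustration, not details given in the post.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TASKS 8

enum run_queue { QUEUE_NORMAL, QUEUE_HIGH };

struct task {
    int id;
    enum run_queue queue;   /* run queue the task currently sits in */
    unsigned irq_mask;      /* bit n set: task is registered for IRQ n */
};

static struct task tasks[MAX_TASKS];
static size_t task_count;

/* Add a task that consumes the IRQs named in irq_mask (0 for none). */
struct task *task_create(int id, unsigned irq_mask)
{
    assert(task_count < MAX_TASKS);
    struct task *t = &tasks[task_count++];
    t->id = id;
    t->queue = QUEUE_NORMAL;
    t->irq_mask = irq_mask;
    return t;
}

/* Called by the low-level IRQ handler: move every task registered for
 * this IRQ to the high priority queue. Returns how many were boosted. */
size_t irq_boost_tasks(unsigned irq)
{
    size_t boosted = 0;
    assert(irq < 32);
    for (size_t i = 0; i < task_count; i++) {
        if ((tasks[i].irq_mask & (1u << irq)) && tasks[i].queue != QUEUE_HIGH) {
            tasks[i].queue = QUEUE_HIGH;
            boosted++;
        }
    }
    return boosted;
}

/* Pick the next task: anything in the high priority queue wins over the
 * normal queue. Picking a boosted task drops it back to normal, so the
 * boost lasts for one scheduling decision. Returns NULL if no tasks exist. */
struct task *pick_next(void)
{
    for (size_t i = 0; i < task_count; i++) {
        if (tasks[i].queue == QUEUE_HIGH) {
            tasks[i].queue = QUEUE_NORMAL;
            return &tasks[i];
        }
    }
    return task_count ? &tasks[0] : NULL;
}
```

Note that this only decides which task runs after the IRQ has arrived; it says nothing about which CPU the IRQ is delivered to.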

_________________
Programming is like fishing, you must be very patient if you want to succeed.


 Post subject: Re: IRQ balancing
PostPosted: Mon Oct 26, 2015 12:41 am 

Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

Let's start by assuming it's a NUMA system with 4 NUMA domains, like this:

Code:
             ________
            |        |
            | IO Hub |
            |________|
    _____    ___:____        ________    _____
   |     |  |        |      |        |  |     |
   | RAM |--| CPUs   |------| CPUs   |--| RAM |
   |_____|  | 0 to 1 |      | 2 to 3 |  |_____|
            |________|      |________|
                :          /    :
                :         /     :
                :        /      :
                :       /       :
                :      /        :
                :     /         :
    _____    ___:____/       ___:____    _____
   |     |  |        |      |        |  |     |
   | RAM |--| CPUs   |------| CPUs   |--| RAM |
   |_____|  | 4 to 5 |      | 6 to 7 |  |_____|
            |________|      |________|
                             ___:____
                            |        |
                            | IO Hub |
                            |________|


Each IO Hub connects to PCI devices; which means there's PCI devices in NUMA domain #0 (top left) and more PCI devices in NUMA domain #3 (bottom right). Obviously you'd want the device drivers for devices connected to NUMA domain #0 to be running on CPUs that are in NUMA domain #0, which are CPUs 0 and 1; and devices connected to NUMA domain #3 to be using CPUs 6 and 7.

Let's also assume that the OS doesn't have a single global IDT, but has a different IDT for each NUMA domain. This means that (using MSI) you can have 150 interrupt vectors for devices in NUMA domain #0 and another 150 interrupt vectors for devices in NUMA domain #3, and you're not limited to a global max. of about 200 interrupt vectors.
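
To make the per-domain IDT idea concrete, here is a minimal sketch of a vector allocator that keeps one "in use" bitmap per NUMA domain. The function names and the choice to reserve 0xF0-0xFF for IPIs, the timer and the spurious vector are assumptions for illustration; the only hard fact is that vectors 0x00-0x1F are reserved for exceptions.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_DOMAINS      4
#define FIRST_IRQ_VECTOR 0x20  /* 0x00-0x1F are reserved for CPU exceptions */
#define LAST_IRQ_VECTOR  0xEF  /* assumption: 0xF0-0xFF kept for IPIs, timer, spurious */

/* One "vector in use" bitmap per NUMA domain, mirroring one IDT per domain. */
static uint8_t used[NUM_DOMAINS][256 / 8];

/* Allocate a free vector in the given domain's IDT; returns -1 if it is full. */
int vector_alloc(unsigned domain)
{
    assert(domain < NUM_DOMAINS);
    for (int v = FIRST_IRQ_VECTOR; v <= LAST_IRQ_VECTOR; v++) {
        if (!(used[domain][v / 8] & (1u << (v % 8)))) {
            used[domain][v / 8] |= (uint8_t)(1u << (v % 8));
            return v;
        }
    }
    return -1;
}

/* Give a vector back to the given domain. */
void vector_free(unsigned domain, int vector)
{
    assert(domain < NUM_DOMAINS);
    assert(vector >= FIRST_IRQ_VECTOR && vector <= LAST_IRQ_VECTOR);
    used[domain][vector / 8] &= (uint8_t)~(1u << (vector % 8));
}

/* Count the vectors still free in the given domain. */
int vectors_available(unsigned domain)
{
    int n = 0;
    assert(domain < NUM_DOMAINS);
    for (int v = FIRST_IRQ_VECTOR; v <= LAST_IRQ_VECTOR; v++)
        if (!(used[domain][v / 8] & (1u << (v % 8))))
            n++;
    return n;
}
```

With these assumed limits each domain gets its own pool of 208 vectors, so the same vector number can be handed to one device in domain #0 and to a different device in domain #3.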

Now...

When a device in NUMA domain #3 sends an IRQ you want it to go to CPU 6 or 7. If CPU 6 is in a low power mode you don't want to wake it up (which would be bad for power consumption and bad for latency because waking CPUs up takes time). If CPU 6 is running a high priority task and CPU 7 is running a low priority task, then you don't want to interrupt the high priority task. Finally; if neither CPU is in a low power mode and both are running similar priority tasks, then you want the IRQs to be balanced reasonably evenly (e.g. about half to CPU 6 and half to CPU 7).

If you look into the way IRQ priorities interact with APICs, you'll notice there's a "send to lowest priority CPU" mode and a CPU's priority is determined (in part) by a "task priority register" in the CPU's local APIC. If the task priority register is higher than the IRQ's priority then the CPU won't accept it, which would be bad - when all CPUs are running at "too high priority" none accept the IRQ. However, if the task priority register is kept within the range that corresponds to exception handlers anyway (vectors 0x00 to 0x1F, which device IRQs never use) this won't happen. This means you can adjust the task priority register during task switches and when putting a CPU to sleep and waking it up; so that IRQs are automatically sent to the "best" CPU by hardware.
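
A minimal sketch of how the task priority register value might be chosen, assuming a hypothetical scheduler with 16 task priority levels and device vectors starting at 0x20. The mapping itself (priority times two, 0x1F for a sleeping CPU) is an illustration only; the acceptance check follows the usual rule that a vector is taken only if its priority class (bits 7:4) is above the class in the task priority register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TPR_SAFE_MAX     0x1F  /* class 1: still below every device vector (>= 0x20) */
#define TASK_PRIO_LEVELS 16    /* hypothetical scheduler: 0 = lowest, 15 = highest */

/* TPR value to load on a task switch, or just before putting the CPU to sleep.
 * A sleeping CPU gets the highest safe value so it is the last choice. */
uint8_t tpr_for_cpu(bool asleep, unsigned task_prio)
{
    assert(task_prio < TASK_PRIO_LEVELS);
    if (asleep)
        return TPR_SAFE_MAX;
    return (uint8_t)(task_prio * 2);   /* 0x00 .. 0x1E */
}

/* Would a CPU with this TPR still take the vector? Only the priority
 * class (upper four bits) takes part in this comparison. */
bool tpr_admits_vector(uint8_t tpr, uint8_t vector)
{
    return (vector >> 4) > (tpr >> 4);
}

/* Where the value would be written (not done here, this sketch runs in user space):
 *   xAPIC:  offset 0x80 of the local APIC's memory-mapped register page
 *   x2APIC: MSR 0x808
 */
```

Because every value returned stays at or below 0x1F, no CPU ever masks a device vector; the value only influences which CPU wins lowest priority arbitration.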

Now let's think about CPUs 2, 3, 4 and 5. What if CPUs 6 and 7 are both running very high priority tasks and/or in a power saving state? Maybe we want CPUs 4 and 5 to help handle IRQs from NUMA domain #3 because it's worth paying the "wrong NUMA domain" penalty in that case. To do this, you can set the task priority registers for CPUs 4 and 5 to values in the "high to very high priority" range (depending on what they're doing); and that way if (e.g.) CPU 4 is running a low priority task and CPUs 6 and 7 are currently running very high priority tasks (or asleep), then the IRQ would automatically get sent to CPU 4.
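
The helper-CPU idea can be modelled in a few lines of C. This is a software model of "lowest priority wins" under made-up assumptions, not a description of how any particular chipset arbitrates: home CPUs use the range 0x00-0x1E, helper CPUs in a neighbouring domain are offset into 0x10-0x1F, a sleeping CPU uses 0x1F, and ties go to the lowest CPU number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_CPUS 8

struct cpu {
    bool helper;         /* true: CPU is in a neighbouring NUMA domain */
    bool asleep;         /* true: CPU is in a low power state */
    unsigned task_prio;  /* hypothetical scheduler priority, 0..15 */
};

/* Assumed mapping from CPU state to task priority register value. */
uint8_t tpr_value(const struct cpu *c)
{
    assert(c->task_prio <= 15);
    if (c->asleep)
        return 0x1F;                          /* last resort: avoid waking it */
    if (c->helper)
        return (uint8_t)(0x10 + c->task_prio); /* 0x10 .. 0x1F */
    return (uint8_t)(c->task_prio * 2);        /* 0x00 .. 0x1E */
}

/* Model of lowest priority delivery: among the CPUs named in dest_mask,
 * the one with the lowest TPR value receives the IRQ. Returns the CPU
 * number, or -1 if dest_mask selects no CPU. */
int lowest_priority_target(const struct cpu cpus[NUM_CPUS], unsigned dest_mask)
{
    int best = -1;
    uint8_t best_tpr = 0;
    for (int i = 0; i < NUM_CPUS; i++) {
        if (!(dest_mask & (1u << i)))
            continue;
        uint8_t tpr = tpr_value(&cpus[i]);
        if (best < 0 || tpr < best_tpr) {
            best = i;
            best_tpr = tpr;
        }
    }
    return best;
}
```

With CPU 6 running a priority 15 task, CPU 7 asleep and CPU 4 (a helper) running a priority 0 task, a destination mask covering CPUs 4 to 7 selects CPU 4; once CPU 7 wakes up and runs a low priority task, it wins instead.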

Now...let's look at the boring/simple computer that looks like this:

Code:
             ________
            |        |
            | IO Hub |
            |________|
    _____    ___:____
   |     |  |        |
   | RAM |--| CPUs   |
   |_____|  | 0 to 1 |
            |________|


Is this "no NUMA", or is it "NUMA with only one NUMA domain"? There's no difference. ;)

All the same shenanigans we were doing for the complicated NUMA systems end up being perfectly fine for the much simpler "NUMA with only one NUMA domain" case.

Note that none of the above is really that easy. There's things like the APIC logical destination register that come into it, and differences between xAPIC and x2APIC, and things like "directed EOI" to take into account, and dodgy hardware, and... What I'm trying to do is describe an "ideal framework" without the complications/distractions.


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject: Re: IRQ balancing
PostPosted: Fri Jan 01, 2016 1:48 pm 

Joined: Wed Nov 18, 2015 3:04 pm
Posts: 396
Location: San Jose San Francisco Bay Area
Never dealt with it before, but I've encountered the term more than once. Looks like my guess for what it means was completely wrong:
I was thinking IRQ balancing might have something to do with PCIe interrupts being evenly distributed across the 4 lines INTA#, INTB#, INTC# and INTD#, which many PCIe devices share.

_________________
key takeaway after spending yrs on sw industry: big issue small because everyone jumps on it and fixes it. small issue is big since everyone ignores and it causes catastrophy later. #devilisinthedetails


 Post subject: Re: IRQ balancing
PostPosted: Mon Jan 04, 2016 3:47 am 

Joined: Tue Oct 17, 2006 11:33 pm
Posts: 3882
Location: Eindhoven
INTA-INTD was a backward-compatible hack to make PCI Express look like PCI, where it was in turn a backward-compatible hack to fit on top of ISA interrupt routing (PIC), with some of the lines multiplexed and rotated around between slots. Remember that all PCI Express interrupts are MSI, so there's no reason not to have 256 different ones.
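
For reference, here is a minimal sketch of how an x86 MSI message is put together, since that is where the vector and the target CPU(s) are actually chosen. The bit layout (0xFEE in address bits 31:20, destination ID in bits 19:12, redirection hint in bit 3, destination mode in bit 2, vector in data bits 7:0, delivery mode in data bits 10:8) is the standard x86 one; the function name and the use of xAPIC flat logical IDs (one bit per CPU, at most 8 CPUs) are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MSI_ADDR_BASE         0xFEE00000u
#define MSI_ADDR_REDIR_HINT   (1u << 3)  /* RH: let hardware pick the lowest priority CPU */
#define MSI_ADDR_DEST_LOGICAL (1u << 2)  /* DM: destination ID is a logical ID */
#define MSI_DATA_LOWEST_PRIO  (1u << 8)  /* delivery mode 001b; 000b is "fixed" */

struct msi_message {
    uint32_t address;  /* goes into the device's MSI address register */
    uint16_t data;     /* goes into the device's MSI data register */
};

/* Build an edge-triggered MSI message. With lowest_priority == 0, dest_id
 * is the physical APIC ID of one CPU. With lowest_priority != 0, dest_id is
 * a logical destination (assumed flat model: bit n selects CPU n) and the
 * hardware chooses among the selected CPUs. */
struct msi_message msi_compose(uint8_t dest_id, uint8_t vector, int lowest_priority)
{
    struct msi_message m;
    assert(vector >= 0x20);  /* 0x00-0x1F are reserved for exceptions */
    m.address = MSI_ADDR_BASE | ((uint32_t)dest_id << 12);
    m.data = vector;
    if (lowest_priority) {
        m.address |= MSI_ADDR_REDIR_HINT | MSI_ADDR_DEST_LOGICAL;
        m.data = (uint16_t)(vector | MSI_DATA_LOWEST_PRIO);
    }
    return m;
}
```

For example, msi_compose(0xC0, 0x41, 1) targets "whichever of CPUs 6 and 7 currently has the lowest priority" with vector 0x41, under the flat logical ID assumption above.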


 Post subject: Re: IRQ balancing
PostPosted: Mon Apr 04, 2016 4:05 pm 

Joined: Thu Aug 29, 2013 4:10 pm
Posts: 15
Brendan: thanks for the detailed explanation (with pictures, even!). However, although of course
Brendan wrote:
all the same shenanigans we were doing for the complicated NUMA systems end up being perfectly fine for the much simpler "NUMA with only one NUMA domain" case,

I didn’t dismiss (device communication latency in) NUMA because I think it’s irrelevant or that the simple case can’t be handled as a degenerate case of the complex one: I just thought the problem was complicated enough already. In fact, I was actually thinking about another part of the problems NUMA causes: shared caches. To wit, the CPU of the laptop that I’m writing this on has three levels of memory caches and two levels of TLBs shared to various degrees between four logical cores. Especially after all the discussion about the costs of hardware cache coherency (ask the lockless people), it looks like thread migration is a really bad idea most of the time, and the irqbalance text agrees. Thus the tradeoff I was talking about was not “subpar device communication speeds vs. CPU power and thread priority”, it was “cache thrashing vs. the same”. Sorry about not being clear about it the first time, and—is anybody aware of any sound approach to this problem?


 Post subject: Re: IRQ balancing
PostPosted: Mon Apr 04, 2016 4:13 pm 

Joined: Thu Aug 29, 2013 4:10 pm
Posts: 15
tlf30 wrote:
This way when an external interrupt is thrown, the IRQ handling code is run.

The silent assumption in this sentence is that we know which CPU (on a multiprocessor) the IRQ is to be delivered to. This is the unavoidable part of the problem I was talking about. Whether or not to
tlf30 wrote:
have my task scheduler give a higher priority to tasks that have [...] external interrupts

or to handle interrupts in the kernel or to not have priorities at all, on the other hand, is an unrelated design decision.

