Hi,
devsau wrote:
When I first designed my thread scheduler, I remember choosing between setting up the HPET to send a lowest priority MSI (send the interrupt to the CPU with the lowest PPR), or just sending it to the BSP only, having the BSP update timer based structures, and then sending a broadcast IPI off to the other APs.
The reason I did this was because at the time it just seemed like a simpler approach, without having to introduce more interrupt handlers in case another AP got the actual HPET interrupt. Looking back on this, and also having recently taken a look at the Windows kernel, I see it is also done this way. Hence my question: other than introducing extra complexity, are there other solid reasons to do this?
For a thread scheduler, every sane OS currently uses the local APIC timer in some way.
The problem is that often the same local APIC timer is used for other things (as part of a general purpose "high precision timer" abstraction that's used for everything - networking time-outs, device driver delays, etc). Unfortunately, for a lot of CPUs, the local APIC timer stops working when the CPU is put into a "very low power consumption" state to save power (because it's been idle for long enough). In this case it's reasonable to shift all of the remaining "timer events" (that are likely to have nothing to do with thread scheduling, because the CPU is idle) over to a different timer, like HPET, until the CPU has work to do and is brought back to a "higher power consumption" state (and its local APIC timer starts working again).
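As a rough illustration of the "shift the remaining timer events to a different timer" step, here's a minimal sketch. Everything in it is hypothetical (the `timer_event` and `timer_queue` types, the fixed-size sorted array, the function names), and it ignores the locking a shared fallback queue would need in a real kernel; it only shows the hand-over that would happen just before a CPU enters a deep sleep state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_EVENTS 16

/* Hypothetical timer event: an absolute deadline plus the CPU that owns it. */
struct timer_event {
    uint64_t deadline_ns;
    int owner_cpu;
};

/* Tiny fixed-size queue kept sorted by deadline (earliest first). */
struct timer_queue {
    struct timer_event ev[MAX_EVENTS];
    size_t count;
};

/* Insert while keeping the queue sorted; returns 0 on success, -1 if full. */
static int queue_insert(struct timer_queue *q, struct timer_event e)
{
    assert(q != NULL);
    if (q->count == MAX_EVENTS)
        return -1;
    size_t i = q->count;
    while (i > 0 && q->ev[i - 1].deadline_ns > e.deadline_ns) {
        q->ev[i] = q->ev[i - 1];
        i--;
    }
    q->ev[i] = e;
    q->count++;
    return 0;
}

/*
 * Called just before a CPU enters a deep sleep state: move its pending
 * events from the local (local APIC timer) queue to the fallback (HPET)
 * queue. Returns the number of events moved. A real kernel would hold a
 * lock on the shared fallback queue here; that is omitted in this sketch.
 */
static size_t migrate_to_fallback(struct timer_queue *local,
                                  struct timer_queue *fallback)
{
    size_t moved = 0;
    while (moved < local->count &&
           queue_insert(fallback, local->ev[moved]) == 0)
        moved++;

    /* Keep anything that did not fit (only happens if fallback is full). */
    size_t remaining = local->count - moved;
    for (size_t i = 0; i < remaining; i++)
        local->ev[i] = local->ev[moved + i];
    local->count = remaining;
    return moved;
}
```

The reverse direction (taking a CPU's events back when it wakes up and its local APIC timer works again) would filter the fallback queue by `owner_cpu` in the same way.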
Note: Using the local APIC timers like this is mostly done for scalability - you do not want all CPUs trying to use the same shared resource (e.g. the same HPET timer) because that means you need locks, etc (and get lock contention and other problems that ruin performance for "many CPUs"). To avoid that you use an "each CPU does its own timing independently (where possible)" approach, which leads to using local APIC timers for everything.
If HPET is only used when CPUs are in a "very low power consumption" state, it doesn't make sense to broadcast HPET's IRQ/s to all CPUs because that will wake all CPUs from their "very low power consumption" state and ruin the power savings. Ideally you want to ensure that only one CPU (that is not in some kind of power saving state) receives HPET's IRQ/s (and ensure that only one CPU handles the "timer events" on behalf of any/all CPUs that are currently in a "very low power consumption" state).
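To make the "only one CPU that isn't sleeping receives the IRQ" idea concrete, here's a small sketch of how a target could be chosen. The two-state `cpu_power_state` enum, the function name and the "stick with the current target if it's still running" policy are all assumptions for illustration; actually routing the IRQ (via the IO APIC or MSI) is hardware-specific and not shown.

```c
#include <assert.h>

/* Simplified, hypothetical view of each CPU's power state. */
enum cpu_power_state { CPU_RUNNING, CPU_DEEP_SLEEP };

/*
 * Pick the single CPU that should receive HPET's IRQ.
 * Keeps the current target while it is still running (to avoid
 * re-routing the IRQ needlessly), otherwise picks the first running CPU.
 * Returns -1 if every CPU is in deep sleep, in which case HPET's IRQ
 * has to wake one of them anyway.
 */
static int pick_hpet_irq_target(const enum cpu_power_state *state,
                                int ncpus, int current_target)
{
    assert(state != 0 && ncpus > 0);

    if (current_target >= 0 && current_target < ncpus &&
        state[current_target] == CPU_RUNNING)
        return current_target;

    for (int cpu = 0; cpu < ncpus; cpu++) {
        if (state[cpu] == CPU_RUNNING)
            return cpu;
    }
    return -1;
}
```

The function would be called whenever the current target is about to enter a "very low power consumption" state, so the IRQ gets moved before that CPU stops being a sensible destination.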
Also note that without power management (and without any "very low power consumption" states), the HPET might also be used to keep everything in sync with "wall clock time" as part of a tiered approach - e.g. NTP used to keep HPET in sync with "wall clock time", then HPET used to keep local APIC timers and TSC in sync with "wall clock time". The HPET IRQ/s are not needed for this; it can be done with HPET's "main counter" without using any of HPET's comparators. This means that HPET's IRQ can be enabled on demand - e.g. HPET's IRQ could be enabled (if it's not enabled already) when a CPU enters a "very low power consumption" state and its "timer events" are shifted to HPET, and then the IRQ can be disabled when there are no more "timer events" left for HPET to handle.
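The "enable on demand" part can be sketched as simple bookkeeping. This is only an illustration: `struct hpet_fallback` and both functions are made-up names, and `irq_enabled` is a plain flag standing in for whatever the real code would write to the HPET comparator's configuration to turn its interrupt on or off.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated state; real code would program an HPET comparator instead. */
struct hpet_fallback {
    unsigned pending_events;  /* "timer events" currently parked on HPET */
    bool irq_enabled;         /* stand-in for the hardware interrupt enable */
};

/* A sleeping CPU's "timer event" was shifted over to HPET. */
static void hpet_fallback_add_event(struct hpet_fallback *h)
{
    if (h->pending_events++ == 0)
        h->irq_enabled = true;   /* first event: enable the IRQ */
}

/* An HPET-hosted event fired, or was taken back by a CPU that woke up. */
static void hpet_fallback_remove_event(struct hpet_fallback *h)
{
    assert(h->pending_events > 0);
    if (--h->pending_events == 0)
        h->irq_enabled = false;  /* nothing left: disable the IRQ */
}
```

With this kind of counting the IRQ is only ever enabled while HPET actually has something to do, and reading the "main counter" for synchronisation purposes is unaffected either way.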
Cheers,
Brendan