Hi,
ShukantPal wrote:
On the IOAPIC page in the OSDev wiki, it states that bits 56-59 of the destination field are usable in physical destination mode. The Intel IOAPIC preliminary documentation says the same. Maybe that is for sending interrupts to another IOAPIC (right/wrong?). But for sending to a CPU (local APIC), the full 8 bits are usable.
Ah - for old computers/chipsets it was 4 bits (and could only support up to 16 APIC IDs/CPUs in physical destination mode); for newer computers/chipsets it was increased to 8 bits (and then extended to 16 bits later). I'm not entirely sure when this happened, but I suspect that the increase to 8 bits happened when MSI was introduced (and the extension to 16 bits was part of x2APIC).
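For example, a minimal sketch of programming a full 8-bit physical destination (the ioapic_write_redirection() helper is my assumption; the destination field does occupy bits 56-63 of the 64-bit redirection entry):
Code:
#include <stdint.h>

/* Assumed helper, not real code: writes one 64-bit redirection entry
   as two 32-bit IO APIC registers. */
extern void ioapic_write_redirection(int irq, uint64_t entry);

void route_irq_to_cpu(int irq, uint8_t vector, uint8_t apic_id) {
    uint64_t entry = vector;              /* fixed delivery, physical mode */
    entry |= (uint64_t)apic_id << 56;     /* destination: bits 56-63 */
    ioapic_write_redirection(irq, entry);
}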
ShukantPal wrote:
But what about x2APIC mode, where more than 255 CPUs are present?
For x2APIC it's a confusing mess. The "destination" in the IO APIC (and in MSI, etc) is used as an index into an "interrupt remapping table" to find the real (32-bit) destination APIC ID. I suspect that this is because Intel were too lazy to make the IO APIC sane and just recycled something intended for a completely different purpose (the "interrupt remapping table" that was originally intended for virtualisation).
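As a heavily simplified sketch of that indirection (not the real VT-d IRTE format, which has more fields and a different encoding):
Code:
#include <stdint.h>

/* Simplified "interrupt remapping table" entry; real entries also hold
   the vector, trigger mode, source validation fields, etc. */
struct irte {
    uint32_t dest_apic_id;    /* the real (32-bit) destination APIC ID */
};

static struct irte remap_table[65536];

/* The "destination" programmed into the IO APIC/MSI acts as an index. */
uint32_t resolve_destination(uint16_t index) {
    return remap_table[index].dest_apic_id;
}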
ShukantPal wrote:
For that I have thought of masking interrupt vectors and using Logical Destination Mode. For example, CPUs 0, 8, 16, ... will all have logical ID = 1; but CPU 0 will handle vector 32, CPU 8 will handle vector 40 (masking off 32 & 48), while CPU 16 will handle vector 48 (masking off 32 & 40). Any other ideas? My idea doesn't fit for "hot-pluggable" CPUs.
For logical destinations, Intel suggests a "1 bit per CPU" format (which breaks as soon as you have more than 8 CPUs). Instead, I like to use one bit for "first CPU within core", one bit for "last CPU within chip", and one bit for "first CPU within NUMA domain"; so I can broadcast an IPI to one CPU in each core, to one CPU in each physical chip, or to one CPU in each NUMA domain (which can be useful for things like power management, etc.). For the remaining 5 bits, over the years I've thought of several schemes, including using one bit for each IO hub ("send IRQ to the lowest priority CPU that is close to this device") and using the bits to reduce multi-CPU TLB shootdown costs ("broadcast IPI to CPUs that are executing a process where the lowest bits of the process ID match").
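For flat logical destination mode (xAPIC), a minimal sketch of assigning such a logical ID; the register offsets are standard, but the bit names and the lapic_write() helper are my assumptions:
Code:
#include <stdint.h>

/* Assumed helper: write a 32-bit memory-mapped local APIC register. */
extern void lapic_write(uint32_t reg, uint32_t value);

#define LAPIC_DFR 0x0E0                     /* Destination Format Register */
#define LAPIC_LDR 0x0D0                     /* Logical Destination Register */

#define LOG_FIRST_CPU_IN_CORE   (1u << 0)   /* names are mine */
#define LOG_LAST_CPU_IN_CHIP    (1u << 1)
#define LOG_FIRST_CPU_IN_NUMA   (1u << 2)
/* remaining 5 bits free for IO hub proximity, process ID hash, ... */

void set_logical_id(uint8_t logical_id) {
    lapic_write(LAPIC_DFR, 0xFFFFFFFF);                  /* flat model */
    lapic_write(LAPIC_LDR, (uint32_t)logical_id << 24);  /* LDR bits 24-31 */
}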
Unfortunately, for x2APIC it all changed - the logical destination became hard-wired (you can't control how logical destinations are configured) and you have to use "cluster mode", where the highest 16 bits determine the cluster (NUMA domain) and the lowest 16 bits are in "1 bit per CPU within the cluster" format; which mostly ruins any clever trickery.
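The hard-wired derivation (from Intel's SDM; the function wrapper is just for illustration) looks like this:
Code:
#include <stdint.h>

/* In x2APIC mode the (read-only) logical ID is derived from the x2APIC ID:
   cluster in the high 16 bits, "1 bit per CPU" in the low 16 bits. */
uint32_t x2apic_logical_id(uint32_t x2apic_id) {
    uint32_t cluster = x2apic_id >> 4;            /* 16 CPUs per cluster */
    uint32_t bit     = 1u << (x2apic_id & 0xF);   /* position within cluster */
    return (cluster << 16) | bit;
}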
ShukantPal wrote:
Continuing your example of a task on CPU #3 creating an event on CPU #3: what if the runqueue balancer moves it to another CPU? Should I bother moving the event to the other CPU, or just send an IPI (I built an inter-processor request mechanism which also allows sending "messages" to other CPUs) and give it a reference to the event?
I'd leave the timer event where it is and send an IPI when the event expires; partly because (in theory) it could be moved many times before the timer event expires.
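Something like this (a sketch only; the types and helper names are my assumptions based on what you described):
Code:
struct task;                                    /* opaque here */
struct timer_event { struct task *task; /* expiry, callback, etc. */ };

/* Assumed kernel helpers (names invented for this sketch): */
extern int  task_current_cpu(struct task *t);
extern int  this_cpu_id(void);
extern void deliver_event_locally(struct timer_event *ev);
extern void send_ipi_message(int cpu, int msg_type, void *payload);
#define EVENT_EXPIRED 1

/* Runs on the CPU that owns the timer event, when it expires. */
void on_timer_expiry(struct timer_event *ev) {
    int cpu = task_current_cpu(ev->task);       /* may have moved many times */
    if (cpu == this_cpu_id()) {
        deliver_event_locally(ev);
    } else {
        send_ipi_message(cpu, EVENT_EXPIRED, ev);   /* hand over a reference */
    }
}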
ShukantPal wrote:
Also, how do I "free" memory for the events? Wait until all handlers have called an event_finish() function, until all events in the array are finished?
I'm not sure which algorithm you're thinking of. For "rotating buckets" (or the "timer wheels" in Linux), where you do something like "bucket_number = (expiry_time / time_per_bucket) % number_of_buckets", you just continually recycle the same array/bucket (and an array/bucket might never be empty - e.g. if an event expires in 1234 years' time then it would remain in its bucket for 1234 years).
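A minimal sketch of that scheme (the structures and constants are illustrative assumptions, not anyone's actual code):
Code:
#include <stdint.h>

#define NUMBER_OF_BUCKETS  256
#define TIME_PER_BUCKET    1000000ULL            /* 1 ms per bucket, in ns */

struct timer_event {
    uint64_t expiry_time;                        /* nanoseconds since epoch */
    struct timer_event *next;
};

static struct timer_event *bucket[NUMBER_OF_BUCKETS];

void insert_event(struct timer_event *ev) {
    unsigned b = (ev->expiry_time / TIME_PER_BUCKET) % NUMBER_OF_BUCKETS;
    ev->next = bucket[b];                        /* stays here until it expires */
    bucket[b] = ev;
}
When the current time reaches bucket b you walk its list and fire only the events whose expiry_time has actually passed; a far-future event just stays in its bucket for many rotations (the "1234 years" case above).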
ShukantPal wrote:
By using the TSC, I don't need to use the HPET for interrupts, right? I could just keep its counter increasing and increasing, and disable all comparators, right?
The TSC alone can't generate an IRQ and is therefore not very useful for timer events (where you need an IRQ). For the local APIC's timer (regardless of whether you're using the newer "TSC deadline mode" or the older "one shot mode"): does it keep working in all of the CPU's power saving/sleep states? If the local APIC's timer doesn't work when CPUs are in (deeper) sleep states, then what happens to your events while the local APIC timer isn't working?
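On Intel CPUs you can at least detect this up front: CPUID leaf 6 reports an "ARAT" (Always Running APIC Timer) flag in EAX bit 2 when the local APIC timer keeps counting in deep C-states. A minimal sketch using GCC's cpuid.h:
Code:
#include <cpuid.h>
#include <stdbool.h>

bool apic_timer_survives_deep_sleep(void) {
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(6, &eax, &ebx, &ecx, &edx))
        return false;                /* leaf 6 not supported; assume the worst */
    return (eax & (1u << 2)) != 0;   /* EAX bit 2 = ARAT */
}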
ShukantPal wrote:
Also, at boot time (for non-constant TSC) all CPUs need to synchronize TSCs. The boot CPU synchronizes with the HPET? How do application CPUs synchronize with the boot CPU?
At regular intervals (including during boot) CPUs need to synchronise their TSC with something else (e.g. RTC).
Let's work backwards. I'm in favour of having a single representation of time (e.g. "nanoseconds" for everything) instead of having multiple different representations of time (e.g. "clock_t" for some things and "time_t" for other things, with awkward conversions between them); and (for both security and compatibility reasons) I won't allow user-space code to access the TSC itself.
To handle CPUs where TSC isn't constant; I'd have a function in the kernel that does something like:
Code:
/* Per-CPU state; RDTSC() is an assumed wrapper around the RDTSC instruction */
static uint64_t last_time_for_this_CPU;      /* nanoseconds since epoch */
static uint64_t TSC_last_time_for_this_CPU;  /* TSC value at that time */
static uint64_t TSC_speed_for_this_CPU;      /* ns per TSC tick (fixed point in practice) */

uint64_t get_nanoseconds_since_epoch(void) {
    uint64_t TSC_now = RDTSC();
    uint64_t time_now = last_time_for_this_CPU
        + (TSC_now - TSC_last_time_for_this_CPU) * TSC_speed_for_this_CPU;
    last_time_for_this_CPU = time_now;
    TSC_last_time_for_this_CPU = TSC_now;
    return time_now;
}
When a CPU's TSC is being synchronised (at regular intervals, and during boot) I'd adjust "last_time_for_this_CPU" and "TSC_last_time_for_this_CPU". When the speed of the CPU's TSC isn't known (during boot) I'd measure it and set "TSC_speed_for_this_CPU"; and if the speed of the CPU's TSC isn't constant I'd update "TSC_speed_for_this_CPU" whenever the CPU's speed changes.
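For example (a sketch with assumed names, continuing the function above):
Code:
/* Resynchronise this CPU's TSC-based time against a reference clock
   (e.g. HPET or RTC); called at regular intervals and during boot. */
void synchronise_tsc_for_this_CPU(uint64_t reference_time_now) {
    TSC_last_time_for_this_CPU = RDTSC();
    last_time_for_this_CPU = reference_time_now;
}

/* Called when the CPU's speed changes (only needed for non-constant TSC). */
void set_tsc_speed_for_this_CPU(uint64_t new_nanoseconds_per_tick) {
    get_nanoseconds_since_epoch();    /* account for time at the old speed */
    TSC_speed_for_this_CPU = new_nanoseconds_per_tick;
}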
For CPUs where TSC is constant, "TSC_speed_for_this_CPU" would only be changed rarely (e.g. set during boot, and after that only changed to compensate for any minor drift and not because the CPU's speed changed). Apart from that, it would make no difference if TSC is constant or not.
Of course there's also no real need to synchronise all CPUs at the same time. If a CPU is doing nothing and/or is sleeping, you need not bother (leave it until later). For "hot-plug CPU" you'd synchronise one CPU (alone) when it goes from "offline" to "online"; and during boot most CPUs could be left "offline" to save some time (e.g. bring them "online" if/when they're needed). If someone asks for an extremely precise time event, then maybe you could synchronise that CPU earlier than normal. Maybe you could do something like "if( how_much_I_care > how_much_it_might_need_synchronising ) { synchronise_it(); }", where each CPU synchronises itself whenever it feels like it (and avoids the overhead of synchronising when it's not really needed).
Cheers,
Brendan