Hi,
ShukantPal wrote:
On the IOAPIC page in the OSDev wiki, it states that bits 56-59 of the destination field are usable in physical destination mode. The Intel IOAPIC preliminary documentation says the same. Maybe that is for sending interrupts to another IOAPIC (right/wrong?). But for sending to a CPU (local APIC), the full 8 bits are usable.
Ah - for old computers/chipsets it was 4 bits (and could only support up to 16 APIC IDs/CPUs in physical destination mode); for newer computers/chipsets it was increased to 8 bits (and then extended to 16 bits later). I'm not entirely sure when this happened, but I suspect that the increase to 8 bits happened when MSI was introduced (and the extension to 16 bits was part of x2APIC).
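For example, a minimal sketch of programming a full 8-bit physical destination (the ioapic_write_redirection() helper is my assumption; the destination field does occupy bits 56-63 of the 64-bit redirection entry):
Code:
#include <stdint.h>

/* Assumed helper, not real code: writes one 64-bit redirection entry
   as two 32-bit IO APIC registers. */
extern void ioapic_write_redirection(int irq, uint64_t entry);

void route_irq_to_cpu(int irq, uint8_t vector, uint8_t apic_id) {
    uint64_t entry = vector;              /* fixed delivery, physical mode */
    entry |= (uint64_t)apic_id << 56;     /* destination: bits 56-63 */
    ioapic_write_redirection(irq, entry);
}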
ShukantPal wrote:
But what about x2APIC mode, where more than 255 CPUs are present?
For x2APIC it's a confusing mess. The "destination" in the IO APIC (and in MSI, etc) is used as an index into an "interrupt remapping table" to find the real (32-bit) destination APIC ID. I suspect that this is because Intel were too lazy to make the IO APIC sane and just recycled something intended for a completely different purpose (the "interrupt remapping table" that was originally intended for virtualisation).
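As a heavily simplified sketch of that indirection (not the real VT-d IRTE format, which has more fields and a different encoding):
Code:
#include <stdint.h>

/* Simplified "interrupt remapping table" entry; real entries also hold
   the vector, trigger mode, source validation fields, etc. */
struct irte {
    uint32_t dest_apic_id;    /* the real (32-bit) destination APIC ID */
};

static struct irte remap_table[65536];

/* The "destination" programmed into the IO APIC/MSI acts as an index. */
uint32_t resolve_destination(uint16_t index) {
    return remap_table[index].dest_apic_id;
}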
ShukantPal wrote:
For that I have thought of masking interrupt vectors and using Logical Destination Mode. For example, CPUs 0, 8, 16, ... will all have logical ID = 1; but CPU 0 will handle vector 32, CPU 8 will handle vector 40 (masking off 32 & 48), while CPU 16 will handle vector 48 (masking off 32 & 40). Any other ideas? My idea doesn't fit for "hot-pluggable" CPUs.
For logical destinations, Intel suggests a "1 bit per CPU" format (which breaks as soon as you have more than 8 CPUs). Instead, I like to use one bit for "first CPU within core", one bit for "last CPU within chip", and one bit for "first CPU within NUMA domain"; so I can broadcast an IPI to one CPU in each core, to one CPU in each physical chip, or to one CPU in each NUMA domain (which can be useful for things like power management, etc.). For the remaining 5 bits, over the years I've thought of several schemes, including using one bit for each IO hub ("send IRQ to the lowest priority CPU that is close to this device") and using the bits to reduce multi-CPU TLB shootdown costs ("broadcast IPI to CPUs that are executing a process where the lowest bits of the process ID match").
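For flat logical destination mode (xAPIC), a minimal sketch of assigning such a logical ID; the register offsets are standard, but the bit names and the lapic_write() helper are my assumptions:
Code:
#include <stdint.h>

/* Assumed helper: write a 32-bit memory-mapped local APIC register. */
extern void lapic_write(uint32_t reg, uint32_t value);

#define LAPIC_DFR 0x0E0                     /* Destination Format Register */
#define LAPIC_LDR 0x0D0                     /* Logical Destination Register */

#define LOG_FIRST_CPU_IN_CORE   (1u << 0)   /* names are mine */
#define LOG_LAST_CPU_IN_CHIP    (1u << 1)
#define LOG_FIRST_CPU_IN_NUMA   (1u << 2)
/* remaining 5 bits free for IO hub proximity, process ID hash, ... */

void set_logical_id(uint8_t logical_id) {
    lapic_write(LAPIC_DFR, 0xFFFFFFFF);                  /* flat model */
    lapic_write(LAPIC_LDR, (uint32_t)logical_id << 24);  /* LDR bits 24-31 */
}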
Unfortunately, for x2APIC it all changed - the logical destination became hard-wired (you can't control how logical destinations are configured) and you have to use "cluster mode", where the highest 16 bits determine the cluster (NUMA domain) and the lowest 16 bits are in "1 bit per CPU within the cluster" format; which mostly ruins any clever trickery.
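The hard-wired derivation (from Intel's SDM; the function wrapper is just for illustration) looks like this:
Code:
#include <stdint.h>

/* In x2APIC mode the (read-only) logical ID is derived from the x2APIC ID:
   cluster in the high 16 bits, "1 bit per CPU" in the low 16 bits. */
uint32_t x2apic_logical_id(uint32_t x2apic_id) {
    uint32_t cluster = x2apic_id >> 4;            /* 16 CPUs per cluster */
    uint32_t bit     = 1u << (x2apic_id & 0xF);   /* position within cluster */
    return (cluster << 16) | bit;
}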
ShukantPal wrote:
Continuing your example of a task on CPU #3 creating an event on CPU #3: what if the runqueue balancer moves it to another CPU? Should I bother moving the event to the other CPU, or just send an IPI (I built an inter-processor request mechanism which also allows sending "messages" to other CPUs) and give it a reference to the event?
I'd leave the timer event where it is and send an IPI when the event expires; partly because (in theory) it could be moved many times before the timer event expires.
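Something like this (a sketch only; the types and helper names are my assumptions based on what you described):
Code:
struct task;                                    /* opaque here */
struct timer_event { struct task *task; /* expiry, callback, etc. */ };

/* Assumed kernel helpers (names invented for this sketch): */
extern int  task_current_cpu(struct task *t);
extern int  this_cpu_id(void);
extern void deliver_event_locally(struct timer_event *ev);
extern void send_ipi_message(int cpu, int msg_type, void *payload);
#define EVENT_EXPIRED 1

/* Runs on the CPU that owns the timer event, when it expires. */
void on_timer_expiry(struct timer_event *ev) {
    int cpu = task_current_cpu(ev->task);       /* may have moved many times */
    if (cpu == this_cpu_id()) {
        deliver_event_locally(ev);
    } else {
        send_ipi_message(cpu, EVENT_EXPIRED, ev);   /* hand over a reference */
    }
}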
ShukantPal wrote:
Also, how do I "free" memory for the events? Wait until all handlers have called an event_finish() function, until all events in the array are finished?
I'm not sure which algorithm you're thinking of. For "rotating buckets" (or the "timer wheels" in Linux), where you do something like "bucket_number = (expiry_time / time_per_bucket) % number_of_buckets", you just continually recycle the same array/bucket (and an array/bucket might never be empty - e.g. if an event expires in 1234 years' time then it would remain in its bucket for 1234 years).
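A minimal sketch of that scheme (the structures and constants are illustrative assumptions, not anyone's actual code):
Code:
#include <stdint.h>

#define NUMBER_OF_BUCKETS  256
#define TIME_PER_BUCKET    1000000ULL            /* 1 ms per bucket, in ns */

struct timer_event {
    uint64_t expiry_time;                        /* nanoseconds since epoch */
    struct timer_event *next;
};

static struct timer_event *bucket[NUMBER_OF_BUCKETS];

void insert_event(struct timer_event *ev) {
    unsigned b = (ev->expiry_time / TIME_PER_BUCKET) % NUMBER_OF_BUCKETS;
    ev->next = bucket[b];                        /* stays here until it expires */
    bucket[b] = ev;
}
When the current time reaches bucket b you walk its list and fire only the events whose expiry_time has actually passed; a far-future event just stays in its bucket for many rotations (the "1234 years" case above).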
ShukantPal wrote:
By using the TSC, I don't need to use the HPET for interrupts, right? I could just keep its counter increasing and increasing, and disable all comparators, right?
The TSC alone can't generate an IRQ and is therefore not very useful for timer events (where you need an IRQ). For the local APIC's timer (regardless of whether you're using the newer "TSC deadline mode" or the older "one shot mode"): does it keep working in all of the CPU's power saving/sleep states? If the local APIC's timer doesn't work when CPUs are in (deeper) sleep states, then what happens to your events while the local APIC timer isn't working?
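On Intel CPUs you can at least detect this up front: CPUID leaf 6 reports an "ARAT" (Always Running APIC Timer) flag in EAX bit 2 when the local APIC timer keeps counting in deep C-states. A minimal sketch using GCC's cpuid.h:
Code:
#include <cpuid.h>
#include <stdbool.h>

bool apic_timer_survives_deep_sleep(void) {
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(6, &eax, &ebx, &ecx, &edx))
        return false;                /* leaf 6 not supported; assume the worst */
    return (eax & (1u << 2)) != 0;   /* EAX bit 2 = ARAT */
}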
ShukantPal wrote:
Also, at boot time (for non-constant TSC) all CPUs need to synchronize TSCs. The boot CPU synchronizes with the HPET? How do application CPUs synchronize with the boot CPU?
At regular intervals (including during boot) CPUs need to synchronise their TSC with something else (e.g. RTC).
Let's work backwards. I'm in favour of having a single representation of time (e.g. "nanoseconds" for everything) instead of having multiple different representations of time (e.g. "clock_t" for some things and "time_t" for other things, with awkward conversions between them); and (for both security and compatibility reasons) I won't allow user-space code to access the TSC itself.
To handle CPUs where TSC isn't constant; I'd have a function in the kernel that does something like:
Code:
/* Per-CPU state; RDTSC() is an assumed wrapper around the RDTSC instruction */
static uint64_t last_time_for_this_CPU;      /* nanoseconds since epoch */
static uint64_t TSC_last_time_for_this_CPU;  /* TSC value at that time */
static uint64_t TSC_speed_for_this_CPU;      /* ns per TSC tick (fixed point in practice) */

uint64_t get_nanoseconds_since_epoch(void) {
    uint64_t TSC_now = RDTSC();
    uint64_t time_now = last_time_for_this_CPU
        + (TSC_now - TSC_last_time_for_this_CPU) * TSC_speed_for_this_CPU;
    last_time_for_this_CPU = time_now;
    TSC_last_time_for_this_CPU = TSC_now;
    return time_now;
}
When a CPU's TSC is being synchronised (at regular intervals, and during boot) I'd adjust "last_time_for_this_CPU" and "TSC_last_time_for_this_CPU". When the speed of the CPU's TSC isn't known (during boot) I'd measure it and set "TSC_speed_for_this_CPU"; and if the speed of the CPU's TSC isn't constant I'd update "TSC_speed_for_this_CPU" whenever the CPU's speed changes.
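For example (a sketch with assumed names, continuing the function above):
Code:
/* Resynchronise this CPU's TSC-based time against a reference clock
   (e.g. HPET or RTC); called at regular intervals and during boot. */
void synchronise_tsc_for_this_CPU(uint64_t reference_time_now) {
    TSC_last_time_for_this_CPU = RDTSC();
    last_time_for_this_CPU = reference_time_now;
}

/* Called when the CPU's speed changes (only needed for non-constant TSC). */
void set_tsc_speed_for_this_CPU(uint64_t new_nanoseconds_per_tick) {
    get_nanoseconds_since_epoch();    /* account for time at the old speed */
    TSC_speed_for_this_CPU = new_nanoseconds_per_tick;
}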
For CPUs where TSC is constant, "TSC_speed_for_this_CPU" would only be changed rarely (e.g. set during boot, and after that only changed to compensate for any minor drift and not because the CPU's speed changed). Apart from that, it would make no difference if TSC is constant or not.
Of course there's also no real need to synchronise all CPUs at the same time. If a CPU is doing nothing and/or is sleeping, you need not bother (leave it until later). For "hot-plug CPU" you'd synchronise one CPU (alone) when it goes from "offline" to "online"; and during boot most CPUs could be left "offline" to save some time (e.g. bring them "online" if/when they're needed). If someone asks for an extremely precise time event, then maybe you could synchronise that CPU earlier than normal. Maybe you could do something like "if( how_much_I_care > how_much_it_might_need_synchronising ) { synchronise_it(); }", where each CPU synchronises itself whenever it feels like it (and avoids the overhead of synchronising when it's not really needed).
Cheers,
Brendan