Scheduling algorithm

LegendDairy · **Posted:** Tue Aug 23, 2016 3:32 am

Hi,

I'm trying to rewrite my scheduler, I worked out an algorithm i like and i would like some feedback on it:

The scheduling algorithm is some sort of a multilevel queue and each queue uses Round Robin to select the next thread in that specific queue. There are multiple queues with different priorities. Queues with higher priority are used more often than lower ones. For example imagine 3 threads A,B and C and 3 different queues (high, medium and low priority). Thread A has a high priority, B has a medium priority and C a low priority. Then these threads will be executed in the following order:
`A-B-A-B-A-C`
So every odd quantum a highest priority thread will be executed and every second and 4th quantum a medium priority and finally every 6th quantum a lowest priority thread. For now thread priority is determined by the kernel. In the future I might add some type of good-boy-points system where a thread can rise to a higher priority or be punished and descent to lower one.

This is what I have implemented right now, however it scales horribly on smp because there are only 3 locks (so a high chance of lock contention on big systems), and there is no 'pinning' of threads to a specific cpu. In order to optimize this for SMP systems, I was thinking about using the 1 scheduler per logical cpu method, with each scheduler having its own set of queues and locks. There would then be one thread in charge of load balancing the threads across the different schedulers.The schedulers can only have lock contention when a thread is being added to or removed from one of its queues that he is trying to access, because of load balancing or because a thread was started, blocked... When a thread is blocked for some reason, it would be put on the blocked list of the scheduler it was using. When it's time to start the thread again, it will be awoken on the same cpu. The load balancing thread might later try to reappoint it to an other scheduler.

I could take it a step further and remove locks completely. When a thread is to be added or removed from a scheduler, an IPI is send to the logical cpu that holds that thread. It then disables interrupts to make sure its queues wont be accessed and then adds or removes the threads. However IPIs have quiet a big overhead (e.g. pipeline flush), but locking the bus every time the scheduler is called isn't great for performance either especially on large systems. Also note that if a thread that is being executed on a cpu x, tries to stop/start a thread that is also on cpu x, it wont have to send an IPI, just temporarily disable interrupts. So a thread blocking itself won't cause a performance hit.

willedwards · **Joined:** Sat Mar 15, 2014 3:49 pm **Posts:** 96

Having per-core run lists is great for scalability.

Ideally your queues are thread-safe and lock-free. You simply don't need locks on modern processor memory models. This is a nice soft intro: http://preshing.com/20120612/an-introdu ... ogramming/

Rather than having a dedicated balancing thread that runs periodically you could have the scheduler running on each core do admin when it has too little to do or too much to do, e.g. you could have 'work stealing' so a core with nothing to do will do the admin of looking in the other cores' run lists for work it can take, and you can also have a core's scheduler judge how 'full' its run list is periodically and actively try to migrate work away occasionally.

When sleeping threads die you don't have to delete them immediately, only mark them as both runnable and dead. The scheduler on that core will soon enough try to run them, and notice they are dead. This is called 'tombstoning' and is common-enough in lock-free data-structures.

simeonz · **Joined:** Fri Aug 19, 2016 10:28 pm **Posts:** 360

The approach that you describe is obviously very efficient. One downside I can see here is that the amount of time that a thread gets depends on the number of contenders for this priority level. You may compensate by dynamically adjusting frequencies, but I am not sure what other factors might interact with that.

As far as I know, Windows and the Linux O(1) scheduler use queues. However, they don't interleave exactly. Linux O(1) uses the time slice to account for the thread priority and Windows executes the higher priority threads ad infinitum, but uses task aging to promote lower priority threads on regular intervals. In a sense Windows has interleaving similar to your design, but the number of threads in a queue do not impact the fairness between queues. The new Linux completely fair scheduler (CFS) uses balanced tree to store the threads sorted according to their scheduler deadline or some similar metric. As a ready thread's execution is profusely stalled on the run queue, it automatically advances to the front where the scheduler picks it up eventually.

Regarding the locking performance, OSes use per-cpu structures for everything - scheduling, allocators, deferred work. As willedwards said already, you can schedule work using work stealing. The idle processors take it onto themselves to find overburdened peers and lift off their work into their queues. So, the starved CPU should assume the role of the rebalancer. If all CPUs are frequently rebalancing into their queues, this means that they are frequently idle. Thus the trashing is not bottleneck for performance. Again, as willedwards said, you don't always need locks and you don't always need even interlocked operations. For example, you can poll other CPU's using aligned reads, which will be stale data, but will be coherent (will not tear.) Whether you use IPI - I think it is used in OS schedulers for this purpose, so you may be on the right track.

The biggest issue IMHO is the schedule's quality. You already mentioned that you plan to work on that. It is important, how you plan to handle threads that are stalled on other threads or I/O. Windows and Linux O(1) use priority boosting (in Windows it is under I/O driver's control), whereas Linux CFS simply gets the boost effect from the fact that blocked threads are far behind their deadline when they wake up. The latter however can also have negative effect on performance as well, as the unblocking thread may become immediately interrupted by the unblocked thread. This can result in tearing of batches of incoming work requests from one thread to another, or similar anomalies.

Sorry for the long rant

sleephacker · **Posted:** Tue Aug 23, 2016 7:51 am

LegendDairy wrote:

So every odd quantum a highest priority thread will be executed and every second and 4th quantum a medium priority and finally every 6th quantum a lowest priority thread.

You might want to take the number of threads in each queue into consideration as well. For example, if there are 6 high priority threads and only one low priority thread, each high priority thread will only run once every 12 quanta while the one low priority thread runs twice as often, every 6 quanta.

Ycep · **Joined:** Mon Dec 28, 2015 11:11 am **Posts:** 401

Maybe this could be implemented by something like this (IDK if anyone dislikes it, that's just my first solution):
We maybe put, for example an variable which resets to zero always when it gets to 3.
Every second we increase it by 1.
In each second:
If is that variable equal to zero, then we finish tasks with high priority.
If is that variable equal to 1, then we take 500ms for high priority tasks and other 500ms for medium priority tasks.
If is that variable equal to 2, then we take another 500ms for medium priority tasks and 500ms for low priority tasks.
So basically:
Every 3 second:
1.5 sec is used for high priority tasks, if there's any
1 sec is used for medium priority tasks, if there's any
0.5 sec is used for low priority tasks. If there any
Of course, if there isn't any low priority tasks, their time is taken by other priorities. If there isn't medium priority tasks, do high priority tasks within their time.
Personally I think this is bad idea, but it's first fit.

simeonz · **Joined:** Fri Aug 19, 2016 10:28 pm **Posts:** 360

Quote:

very 3 second:
1.5 sec is used for high priority tasks, if there's any
1 sec is used for medium priority tasks, if there's any
0.5 sec is used for low priority tasks. If there any

As sleephacker already mentioned, assigning time per priority level (rather then per thread) voids the benefits of prioritization. If 100 threads use the high priority level and only 1 the low priority level, the high priority threads will be starved for cpu. Such scheduling may be used in embedded systems, because of its low overhead and predictable real-time behavior, but those systems are designed holistically and imbalances can be dealt with on an individual basis.

Looking at basic forms of scheduling, you can instead:

Adjust the time slice of the high priority threads to be longer and execute all threads once per time interval.
Interleave high and low priority threads, but make sure that high priority threads are scheduled with higher frequency, such that they get more quanta per time interval.

Brendan · **Posted:** Fri Aug 26, 2016 3:12 pm

Hi,

LegendDairy wrote:

I'm trying to rewrite my scheduler, I worked out an algorithm i like and i would like some feedback on it:

If it's only one algorithm, then it's bad.

Different threads have different characteristics and different requirements. You're going to have threads for things like handling network packets and user interfaces than need low latency (how quickly they get CPU time and can respond when a network packet arrives or when user input arrives) but don't use very much CPU time. You're also going to have threads that do heavy processing that use lots of CPU time but don't care about latency at all. You might also have (soft or hard) real-time threads, or (if it's a microkernel) driver threads that need extremely low latency for IRQ handlers. You might also have "idle threads" that are used for doing things in the background (e.g. optimising cached data, de-fragmenting disks, protein folding, etc), which should never get CPU time unless there's nothing else to do. Maybe you have other kinds of threads.

In any case, it's impossible to have a single algorithm that is acceptable.

Instead, create "scheduling policies" where each scheduling policy is designed for a different type of threads (with different characteristics and different requirements). Maybe your "policy 0" is intended for real-time threads and uses "earliest deadline first" scheduling algorithm. Maybe your "policy 4" is intended for those "idle threads" and uses "first come first served" scheduling algorithm. Maybe "policy 1" is intended for threads that need low latency and uses "variable frequency".

This involves a kind of "meta scheduler" that decides which policy to give CPU time (and then that policy determines which thread using that policy is given CPU time). Typically this is extremely simple; like "if there are any threads in policy 0 that can run choose a thread from policy 0; else if there are any threads in policy 1 that can run choose a thread from policy 1; else ...". Note that a thread can have a priority within its policy (if that makes sense for the scheduling algorithm that the policy uses).

Finally; define pre-emption rules. For example, maybe when a policy 0 thread is unblocked it immediately preempts the currently running thread if the currently running thread is policy 0 and has a lower priority or if the currently running thread belongs to any other priority. Maybe when a policy 3 thread is unblocked it never preempts the currently running thread.

Note that if you do it right, you can have (e.g.) a GUI that always responds within 1ms, regardless of whether there's nothing else for CPU to do or if the CPU is struggling/failing to handle the load of 1234567 other threads. Note that "very important tasks preempt immediately" is a minimum requirement for anything involving (soft or hard) real-time threads and anything where drivers are in user-space; and (while not a strict requirement) is extremely desirable for anything involving networking or user interfaces.

LegendDairy wrote:

The scheduling algorithm is some sort of a multilevel queue and each queue uses Round Robin to select the next thread in that specific queue. There are multiple queues with different priorities. Queues with higher priority are used more often than lower ones. For example imagine 3 threads A,B and C and 3 different queues (high, medium and low priority). Thread A has a high priority, B has a medium priority and C a low priority. Then these threads will be executed in the following order:
`A-B-A-B-A-C`
So every odd quantum a highest priority thread will be executed and every second and 4th quantum a medium priority and finally every 6th quantum a lowest priority thread. For now thread priority is determined by the kernel. In the future I might add some type of good-boy-points system where a thread can rise to a higher priority or be punished and descent to lower one.

This sounds like you want "variable frequency", where you have a fixed length time slice (e.g. 1 ms) and higher priority threads get more time slices.

To implement "variable frequency"; the basic idea is to give each thread a "current count" and a "max. count" (where "max. count" is determined by the thread's priority). When choosing a thread, decrease the thread's "current count" and if it reaches zero do "thread->current_count = thread->max_count" and give that thread CPU time; otherwise move on to the next thread (and wrap around) - eventually a thread will get CPU time. In big-O notation, the basic algorithm is known as "O(my god!)" - it's not fast.

It can be implemented as "O(1)". To do this you need a list for each "current count"; a circular buffer of "list heads", and a "current position in circular buffer". When the list at "current position in circular buffer" isn't empty, you remove the first thread from that list and give it CPU time. Otherwise (when the list at "current position in circular buffer" is empty) there are no threads with "count = 1" and you move the "current position in circular buffer" to the next slot in the circular buffer (effectively, decrementing the count for all threads simultaneously).

LegendDairy wrote:

This is what I have implemented right now, however it scales horribly on smp because there are only 3 locks (so a high chance of lock contention on big systems), and there is no 'pinning' of threads to a specific cpu. In order to optimize this for SMP systems, I was thinking about using the 1 scheduler per logical cpu method, with each scheduler having its own set of queues and locks. There would then be one thread in charge of load balancing the threads across the different schedulers.The schedulers can only have lock contention when a thread is being added to or removed from one of its queues that he is trying to access, because of load balancing or because a thread was started, blocked... When a thread is blocked for some reason, it would be put on the blocked list of the scheduler it was using. When it's time to start the thread again, it will be awoken on the same cpu. The load balancing thread might later try to reappoint it to an other scheduler.

Having "per CPU" scheduler data is a good idea.

For load balancing, how my schedulers work is that when a thread is given CPU time it's moved from the "ready to run" state into the "running state", and removed from the scheduler's queues/data structures (the scheduler only ever keeps track of "ready to run" threads and not "running" threads). When a thread goes into the "ready to run" state for any reason (it was running and got preempted, it was spawned, it unblocked) the kernel decides which CPU it should use, and puts the thread onto that CPU's scheduler data.

This tends to keep high priority tasks very well balanced (because they tend to move between "running" and "ready to run" states quickly); and also avoids "excessive re-balancing" of low priority tasks that don't matter.

The problem with a "load balancer thread" is that you have to compromise between "high priority load balancer thread that gobbles far too much CPU time" and "low priority load balancer thread that takes far too long to respond to changes in load"; and there's no good compromise (partly because there's thread switching involved so "frequent rebalancing" adds up to "frequent thread switches").

Cheers,

Brendan

LegendDairy · **Posted:** Fri Aug 26, 2016 3:23 pm

Thanks for all the feedback! I really appreciate it

sleephacker wrote:

You might want to take the number of threads in each queue into consideration as well. For example, if there are 6 high priority threads and only one low priority thread, each high priority thread will only run once every 12 quanta while the one low priority thread runs twice as often, every 6 quanta.

I slightly tweaked my algorithm to avoid this problem. This is what I have so far.

Before I just selected one thread from a specific queue every 'tick', now I run an entire queue every tick.
Note that the notion of 'tick' changes: before a 'tick' happened every time the scheduler is called, now it happens every time time an entire queue is executed:

Instead of scheduling 1 thread every tick, I schedule an entire queue every tick:
* Every odd tick the entire high priority queue is scheduled.
* Every sixth tick the entire lowest priority queue is scheduled
* Every second and fourth the entire medium priority queue is scheduled

So we get: high_queue - medium queue - High_queue - medium queue - High queue - Low queue

An example:
Imagine 3 queues, and a couple of threads:
high priority: A, B, C
medium priority: D
low priority we have E and F

The new algorithm will look like: A-B-C-D-A-B-C-D-A-B-C-E-F
the old: A-D-B-D-C-E-A-D-B-D-F-C-D-A-D-B-E

In the new algorithm it the time till a thread is executed again is: (3*#highprioritythreads + 2*#medprioritythreads + #lowprioritythreads) / (1,2 or 3)
Where '#'='number of'. Where we divide by 1 for low, by 2 for med and by 3 for high priority

In the old algorithm it was: (3 + 2 + 1)*#threadsinhispriority

Because in new algorithm the number of threads in all the different queues factor in, the time it takes for a thread to be rescheduled is distributed more uniformly across the different levels. The time depends on the distribution of threads as a whole multiplied by a factor, so the frequency of two threads only differs by a factor thats depends on their priority. Personally I would say that this new algorithm is more fair. Now i just have to think how I will determine priority levels and/or how I will change priorities dynamically to imrpove performance, I'll probably implement a the points system where threads can be punished or rewarded, but I first want to put some more thought into that.

LegendDairy · **Posted:** Fri Aug 26, 2016 4:13 pm

Brendan wrote:

Hi,
If it's only one algorithm, then it's bad.

Different threads have different characteristics and different requirements. You're going to have threads for things like handling network packets and user interfaces than need low latency (how quickly they get CPU time and can respond when a network packet arrives or when user input arrives) but don't use very much CPU time. You're also going to have threads that do heavy processing that use lots of CPU time but don't care about latency at all. You might also have (soft or hard) real-time threads, or (if it's a microkernel) driver threads that need extremely low latency for IRQ handlers. You might also have "idle threads" that are used for doing things in the background (e.g. optimising cached data, de-fragmenting disks, protein folding, etc), which should never get CPU time unless there's nothing else to do. Maybe you have other kinds of threads.
.

I was planning on adding more features as I progress, for example I read your idea somewhere on the forum about infinity and minus infinity threads, I was planning on implementing something similar to this 'minus infinity'/ 'idle thread'. For extremely low latency I was thinking of 'instant-waking', basically when an irq handler was woken up it might get a God-priority where it will be scheduled as the next thread no matter what.
The thing is that I first wanted to build something simple that I could expand later on.

Brendan wrote:

In any case, it's impossible to have a single algorithm that is acceptable.
Instead, create "scheduling policies" where each scheduling policy is designed for a different type of threads (with different characteristics and different requirements). Maybe your "policy 0" is intended for real-time threads and uses "earliest deadline first" scheduling algorithm. Maybe your "policy 4" is intended for those "idle threads" and uses "first come first served" scheduling algorithm. Maybe "policy 1" is intended for threads that need low latency and uses "variable frequency".
This involves a kind of "meta scheduler" that decides which policy to give CPU time (and then that policy determines which thread using that policy is given CPU time). Typically this is extremely simple; like "if there are any threads in policy 0 that can run choose a thread from policy 0; else if there are any threads in policy 1 that can run choose a thread from policy 1; else ...". Note that a thread can have a priority within its policy (if that makes sense for the scheduling algorithm that the policy uses).
Finally; define pre-emption rules. For example, maybe when a policy 0 thread is unblocked it immediately preempts the currently running thread if the currently running thread is policy 0 and has a lower priority or if the currently running thread belongs to any other priority. Maybe when a policy 3 thread is unblocked it never preempts the currently running thread.
Note that if you do it right, you can have (e.g.) a GUI that always responds within 1ms, regardless of whether there's nothing else for CPU to do or if the CPU is struggling/failing to handle the load of 1234567 other threads. Note that "very important tasks preempt immediately" is a minimum requirement for anything involving (soft or hard) real-time threads and anything where drivers are in user-space; and (while not a strict requirement) is extremely desirable for anything involving networking or user interfaces.

But I didn't really think about all of this. My idea was indeed just to expand one scheduling policy with extra features, but i guess that you're right. Back to the drawing board!

, however to be honest I don't think I'll ever have time to implement something like this so I might actually stick with something a lot simpler.

LegendDairy wrote:

The scheduling algorithm is some sort of a multilevel queue and each queue uses Round Robin to select the next thread in that specific queue. There are multiple queues with different priorities. Queues with higher priority are used more often than lower ones. For example imagine 3 threads A,B and C and 3 different queues (high, medium and low priority). Thread A has a high priority, B has a medium priority and C a low priority. Then these threads will be executed in the following order:
`A-B-A-B-A-C`
So every odd quantum a highest priority thread will be executed and every second and 4th quantum a medium priority and finally every 6th quantum a lowest priority thread. For now thread priority is determined by the kernel. In the future I might add some type of good-boy-points system where a thread can rise to a higher priority or be punished and descent to lower one.

Brendan wrote:

This sounds like you want "variable frequency", where you have a fixed length time slice (e.g. 1 ms) and higher priority threads get more time slices.
...
It can be implemented as "O(1)". To do this you need a list for each "current count"; a circular buffer of "list heads", and a "current position in circular buffer". When the list at "current position in circular buffer" isn't empty, you remove the first thread from that list and give it CPU time. Otherwise (when the list at "current position in circular buffer" is empty) there are no threads with "count = 1" and you move the "current position in circular buffer" to the next slot in the circular buffer (effectively, decrementing the count for all threads simultaneously).

Yes, that is sort of how I implemented it, the different 'queues' I talked about are lists/'feedback-queues', and there is one for each frequency/"current count", each with a beginning and end pointer. The difference in frequency is achieved by picking the correct list where some lists (higher priority onces) are chosen multiple times in one scheduling-period.

Brendan wrote:

For load balancing, how my schedulers work is that when a thread is given CPU time it's moved from the "ready to run" state into the "running state", and removed from the scheduler's queues/data structures (the scheduler only ever keeps track of "ready to run" threads and not "running" threads). When a thread goes into the "ready to run" state for any reason (it was running and got preempted, it was spawned, it unblocked) the kernel decides which CPU it should use, and puts the thread onto that CPU's scheduler data.

Yea, I think I will step away from the thread idea, it was more an idea to keep my code a bit cleaner and sort separate it out.
How do you make sure a thread isn't hopping to often from one CPU to another? My idea was to always try put a thread back on the cpu it came from unless the load on that cpu is significantly bigger than the other CPUs. The question than becomes how do I properly define 'the load' of a cpu.

Thank you for your feedback!

(Sorry for the double post, I read Brendans post after I posted the previous one and wanted to reply but mixing those two posts seemed like a bad idea)

Ycep · **Joined:** Mon Dec 28, 2015 11:11 am **Posts:** 401

simeonz wrote:

Quote:

very 3 second:
1.5 sec is used for high priority tasks, if there's any
1 sec is used for medium priority tasks, if there's any
0.5 sec is used for low priority tasks. If there any

As sleephacker already mentioned, assigning time per priority level (rather then per thread) voids the benefits of prioritization. If 100 threads use the high priority level and only 1 the low priority level, the high priority threads will be starved for cpu. Such scheduling may be used in embedded systems, because of its low overhead and predictable real-time behavior, but those systems are designed holistically and imbalances can be dealt with on an individual basis.

Looking at basic forms of scheduling, you can instead:

Adjust the time slice of the high priority threads to be longer and execute all threads once per time interval.
Interleave high and low priority threads, but make sure that high priority threads are scheduled with higher frequency, such that they get more quanta per time interval.

I know it's bad idea, it's just first idea I made in my https://en.wikipedia.org/wiki/Inferior_frontal_gyrus

.

OSDev.org

Scheduling algorithm

Who is online