I'm trying to rewrite my scheduler. I've worked out an algorithm I like, and I would like some feedback on it:
If it's only one algorithm, then it's bad.
Different threads have different characteristics and different requirements. You're going to have threads for things like handling network packets and user interfaces that need low latency (how quickly they get CPU time and can respond when a network packet or user input arrives) but don't use very much CPU time. You're also going to have threads that do heavy processing, which use lots of CPU time but don't care about latency at all. You might also have (soft or hard) real-time threads, or (if it's a microkernel) driver threads that need extremely low latency for IRQ handlers. You might also have "idle threads" that are used for doing things in the background (e.g. optimising cached data, de-fragmenting disks, protein folding, etc.), which should never get CPU time unless there's nothing else to do. Maybe you have other kinds of threads.
In any case, it's impossible for a single algorithm to handle all of these acceptably.
Instead, create "scheduling policies", where each scheduling policy is designed for a different type of thread (with different characteristics and different requirements). Maybe your "policy 0" is intended for real-time threads and uses an "earliest deadline first" scheduling algorithm. Maybe your "policy 4" is intended for those "idle threads" and uses a "first come first served" scheduling algorithm. Maybe "policy 1" is intended for threads that need low latency and uses "variable frequency".
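For example (just a sketch in C; the policy numbers, names and the two middle policies are placeholders, not a recommendation):

    /* Illustrative policy table. Lower number = more important. */
    enum sched_policy {
        POLICY_REALTIME    = 0,   /* earliest deadline first */
        POLICY_LOW_LATENCY = 1,   /* "variable frequency" */
        POLICY_NORMAL      = 2,   /* ordinary threads */
        POLICY_BATCH       = 3,   /* CPU-heavy work that doesn't care about latency */
        POLICY_IDLE        = 4,   /* first come first served, only when nothing else can run */
        POLICY_COUNT
    };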
This involves a kind of "meta scheduler" that decides which policy to give CPU time (and then that policy determines which thread using that policy is given CPU time). Typically this is extremely simple; like "if there are any threads in policy 0 that can run choose a thread from policy 0; else if there are any threads in policy 1 that can run choose a thread from policy 1; else ...". Note that a thread can have a priority within its policy (if that makes sense for the scheduling algorithm that the policy uses).
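In C the meta scheduler part can be as small as this (a sketch, using POLICY_COUNT from the table above; policy_pick_next() and idle_thread() are assumed helpers that each policy implementation would provide):

    struct thread;                                 /* defined elsewhere in the kernel */
    struct thread *policy_pick_next(int policy);   /* assumed: each policy's own algorithm */
    struct thread *idle_thread(void);              /* assumed: per-CPU idle thread */

    struct thread *meta_pick_next(void)
    {
        /* "if policy 0 has a runnable thread, take it; else try policy 1; ..." */
        for (int policy = 0; policy < POLICY_COUNT; policy++) {
            struct thread *t = policy_pick_next(policy);
            if (t != NULL)
                return t;
        }
        return idle_thread();                      /* nothing runnable at all */
    }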
Finally, define pre-emption rules. For example, maybe when a policy 0 thread is unblocked it immediately preempts the currently running thread, if the currently running thread is also policy 0 with a lower priority or belongs to any other policy. Maybe when a policy 3 thread is unblocked it never preempts the currently running thread.
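As a sketch (one possible rule set, not the only one; it reuses the hypothetical policy numbers from above, and the thread fields are only stand-ins):

    #include <stdbool.h>

    struct thread { int policy; int priority; };   /* minimal stand-in, just the fields this check needs */

    bool should_preempt(const struct thread *woken, const struct thread *running)
    {
        if (woken->policy == POLICY_BATCH)         /* example: this policy never preempts anyone */
            return false;
        if (woken->policy < running->policy)       /* a more important policy always preempts */
            return true;
        if (woken->policy == running->policy)      /* same policy: compare priorities */
            return woken->priority > running->priority;
        return false;                              /* never preempt a more important policy */
    }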
Note that if you do it right, you can have (e.g.) a GUI that always responds within 1 ms, regardless of whether there's nothing else for the CPU to do or the CPU is struggling/failing to handle the load of 1234567 other threads. Note that "very important tasks preempt immediately" is a minimum requirement for anything involving (soft or hard) real-time threads and anything where drivers are in user-space; and (while not a strict requirement) is extremely desirable for anything involving networking or user interfaces.
The scheduling algorithm is some sort of multilevel queue, and each queue uses Round Robin to select the next thread in that specific queue. There are multiple queues with different priorities, and queues with higher priority are used more often than lower ones. For example, imagine 3 threads A, B and C and 3 different queues (high, medium and low priority). Thread A has a high priority, B a medium priority and C a low priority. These threads will then be executed in the following order: A, B, A, B, A, C, and then the pattern repeats.
So in every odd quantum the highest priority thread is executed, in every 2nd and 4th quantum a medium priority thread, and in every 6th quantum a lowest priority thread. For now thread priority is determined by the kernel. In the future I might add some type of good-boy-points system where a thread can rise to a higher priority or be punished and descend to a lower one.
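Roughly what I mean, in simplified C (no locks or SMP shown; queue_pop() is a helper that pops the head of that queue's round-robin list):

    enum { Q_HIGH, Q_MED, Q_LOW, Q_COUNT };

    struct thread;
    struct thread *queue_pop(int q);       /* assumed: pops the head of that queue, or NULL */

    /* quantum 1..6 -> high, med, high, med, high, low, then repeat */
    static const int pattern[6] = { Q_HIGH, Q_MED, Q_HIGH, Q_MED, Q_HIGH, Q_LOW };
    static unsigned slot;

    struct thread *pick_next(void)
    {
        int q = pattern[slot];
        slot = (slot + 1) % 6;

        /* if the chosen queue is empty, fall back to the next queue so the
           CPU isn't left idle while other queues still have work */
        for (int i = 0; i < Q_COUNT; i++) {
            struct thread *t = queue_pop(q);
            if (t != NULL)
                return t;
            q = (q + 1) % Q_COUNT;
        }
        return NULL;                        /* nothing runnable */
    }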
This sounds like you want "variable frequency", where you have a fixed length time slice (e.g. 1 ms) and higher priority threads get more time slices.
To implement "variable frequency"; the basic idea is to give each thread a "current count" and a "max. count" (where "max. count" is determined by the thread's priority). When choosing a thread, decrease the thread's "current count" and if it reaches zero do "thread->current_count = thread->max_count" and give that thread CPU time; otherwise move on to the next thread (and wrap around) - eventually a thread will get CPU time. In big-O notation, the basic algorithm is known as "O(my god!)" - it's not fast.
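A sketch of that basic version (field and function names are only illustrative):

    /* One entry per ready thread, kept on a circular list. Higher priority
       means a smaller max_count, so the count hits zero more often and the
       thread gets more time slices. */
    struct vf_thread {
        struct vf_thread *next;
        int current_count;
        int max_count;              /* determined by the thread's priority */
    };

    /* Assumes at least one ready thread, so the loop always terminates. */
    struct vf_thread *vf_pick_next(struct vf_thread *cursor)
    {
        for (;;) {
            if (--cursor->current_count == 0) {
                cursor->current_count = cursor->max_count;  /* reload */
                return cursor;                              /* gets the next time slice */
            }
            cursor = cursor->next;                          /* move on, wrapping around the list */
        }
    }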
It can be implemented as "O(1)". To do this you need a list for each "current count"; a circular buffer of "list heads", and a "current position in circular buffer". When the list at "current position in circular buffer" isn't empty, you remove the first thread from that list and give it CPU time. Otherwise (when the list at "current position in circular buffer" is empty) there are no threads with "count = 1" and you move the "current position in circular buffer" to the next slot in the circular buffer (effectively, decrementing the count for all threads simultaneously).
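A sketch of the O(1) version; note that the insertion rule (put a thread "max_count" slots ahead of the current position when it becomes ready) is my assumption of how the slots get filled, since it isn't spelled out above:

    #define SLOTS 256                      /* must be larger than the largest max_count */

    struct rq_thread {
        struct rq_thread *next;
        int max_count;                     /* from priority; smaller = runs more often */
    };

    static struct rq_thread *slot_head[SLOTS];   /* one list of threads per "count" */
    static unsigned current_slot;

    void vf_enqueue(struct rq_thread *t)
    {
        unsigned s = (current_slot + t->max_count) % SLOTS;
        t->next = slot_head[s];            /* push onto that slot's list */
        slot_head[s] = t;
    }

    /* Assumes at least one ready thread somewhere in the buffer. */
    struct rq_thread *vf_pick_next(void)
    {
        /* Skipping an empty slot effectively decrements every waiting
           thread's count at once - that's what keeps this O(1)-ish. */
        while (slot_head[current_slot] == NULL)
            current_slot = (current_slot + 1) % SLOTS;

        struct rq_thread *t = slot_head[current_slot];
        slot_head[current_slot] = t->next; /* remove the first thread from that list */
        return t;
    }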
This is what I have implemented right now; however, it scales horribly on SMP because there are only 3 locks (so a high chance of lock contention on big systems), and there is no 'pinning' of threads to a specific CPU. In order to optimize this for SMP systems, I was thinking about using the "1 scheduler per logical CPU" method, with each scheduler having its own set of queues and locks. There would then be one thread in charge of load balancing the threads across the different schedulers. A scheduler can only have lock contention when a thread is being added to or removed from one of the queues it is trying to access, because of load balancing or because a thread was started, blocked, etc. When a thread is blocked for some reason, it would be put on the blocked list of the scheduler it was using. When it's time to start the thread again, it will be awoken on the same CPU. The load balancing thread might later try to reassign it to another scheduler.
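The per-CPU data would look something like this (just a rough sketch; the names and the lock type are placeholders):

    #define MAX_CPUS 64

    struct thread;
    typedef struct { volatile int locked; } spinlock_t;   /* stand-in for a real kernel lock */

    struct per_cpu_sched {
        spinlock_t     lock;               /* only contended when another CPU adds/removes a thread */
        struct thread *ready[3];           /* the high/medium/low queues from above */
        struct thread *blocked;            /* threads that blocked while running on this CPU */
    };

    static struct per_cpu_sched cpu_sched[MAX_CPUS];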
Having "per CPU" scheduler data is a good idea.
For load balancing, the way my schedulers work is that when a thread is given CPU time it's moved from the "ready to run" state into the "running" state, and removed from the scheduler's queues/data structures (the scheduler only ever keeps track of "ready to run" threads and not "running" threads). When a thread goes into the "ready to run" state for any reason (it was running and got preempted, it was spawned, it unblocked), the kernel decides which CPU it should use, and puts the thread onto that CPU's scheduler data.
This tends to keep high priority tasks very well balanced (because they tend to move between "running" and "ready to run" states quickly); and also avoids "excessive re-balancing" of low priority tasks that don't matter.
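As a sketch (the "least loaded CPU, with a bias toward the thread's previous CPU for cache warmth" heuristic is only an illustration, and enqueue_on_cpu() is an assumed helper):

    #define MAX_CPUS 64

    struct thread { int last_cpu; /* ... */ };

    static unsigned runnable_count[MAX_CPUS];   /* crude load metric: ready threads per CPU */
    static int nr_cpus;

    void enqueue_on_cpu(int cpu, struct thread *t);   /* assumed: adds t to that CPU's scheduler data */

    void make_ready(struct thread *t)
    {
        int best = t->last_cpu;
        unsigned best_load = runnable_count[best] / 2;  /* bias toward the previous CPU */

        for (int cpu = 0; cpu < nr_cpus; cpu++) {
            if (runnable_count[cpu] < best_load) {
                best = cpu;
                best_load = runnable_count[cpu];
            }
        }
        runnable_count[best]++;
        t->last_cpu = best;
        enqueue_on_cpu(best, t);
    }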
The problem with a "load balancer thread" is that you have to compromise between "high priority load balancer thread that gobbles far too much CPU time" and "low priority load balancer thread that takes far too long to respond to changes in load"; and there's no good compromise (partly because there's thread switching involved so "frequent rebalancing" adds up to "frequent thread switches").