Optimal multi-logical-processor scheduling

Virtlink · **Posted:** Thu Sep 25, 2008 5:24 am

Today's and future (desktop) systems are getting more and more logical CPUs. I read that Intel's Nehalem will get eight cores, each with hyperthreading to support up to 16 logical processors. It will be more when supporting NUMA, and it will be even more in the future. (A millidicentitetraicosa core processor?)

What is the most ideal way to schedule the running applications among these processors?
My initial thoughts are to assign one or more address spaces to each CPU. Each address space contains one or more processes, each with one or more threads. Small processes don't need an entire address space, so the number of address space switches is reduced. However, this doesn't work very well with multiple threads, because they run sequentially on the same CPU. And it has no provisions for smaller-scale parallelism, such as in loops.

To what scale should the scheduler manage?
For my system, the scheduler will be able to manage the applications to the machine-code level, so it has the ability to execute loops in parallel. Is this desired? How must these low-level parallel things be divided among the CPU's? At this scale, all CPU's execute a single address space, process, thread but different loop parts, I think. And what about loops which have less parallel loops than there are CPU's: what do the rest of the CPU's do?

Please share your thoughts...

Love4Boobies · **Posted:** Thu Sep 25, 2008 10:20 am

There is still a lot of research in this field. Microsoft and Intel made a contest where several universities subscribed and the top 2 would get their project sponsored for further research. Berkley was the first, can't rememeber the other. But keep in mind that we don't even have an "optimal" way to schedule page replacement so... :!:

Brendan · **Posted:** Thu Sep 25, 2008 11:57 am

Hi,

Virtlink wrote:

To what scale should the scheduler manage?
For my system, the scheduler will be able to manage the applications to the machine-code level, so it has the ability to execute loops in parallel. Is this desired? How must these low-level parallel things be divided among the CPU's? At this scale, all CPU's execute a single address space, process, thread but different loop parts, I think. And what about loops which have less parallel loops than there are CPU's: what do the rest of the CPU's do?

An OS should never need to manage applications at the machine-code level. The OS's kernel should only need to care about processes and "kernel level threads" (if supported). Things like automatically doing loops in parallel and "user level threads" should be handled by languages/compilers, and mapped onto whatever the kernel supports by the language/compiler.

Virtlink wrote:

What is the most ideal way to schedule the running applications among these processors?
My initial thoughts are to assign one or more address spaces to each CPU. Each address space contains one or more processes, each with one or more threads. Small processes don't need an entire address space, so the number of address space switches is reduced. However, this doesn't work very well with multiple threads, because they run sequentially on the same CPU. And it has no provisions for smaller-scale parallelism, such as in loops.

If you're designing a scheduler, then don't design a linear memory manager. The scheduler shouldn't care how the linear memory manager allocates address spaces to processes and threads. Instead the scheduler should just call a function in the linear memory manager during process/task/thread switches, where the function in the linear memory manager takes care of switching address spaces or whatever, if/when necessary. The scheduler manages CPU time usage, nothing else.

Multiple threads should not run sequentially on the same CPU. If a process has 4 threads then all of those 4 threads should be able to run at the same time on different CPUs (assuming there's enough CPUs available).

Ok, now that the easy stuff is out of the way...

Virtlink wrote:

What is the most ideal way to schedule the running applications among these processors?

Welcome to hell.

Start with "CPU topology". You've got 1 or more logical CPUs that belong to one or more cores that belong to one or more chips that belong to one or more NUMA domains (that belong to one or more computers, if you're considering a distributed system). Some computers don't have some of these levels, but this doesn't matter - you can pretend that all computers have all these levels. For example, an "old fashioned" single CPU 80486 system can be considered as a computer with 1 NUMA domain, 1 chip, 1 core and 1 logical CPU, even though it doesn't actually support NUMA, multi-chip, multi-core or hyper-threading.

Next, what are the scheduler's goals? Well, here's a quick list:

To allow process/task/thread creation and termination
To support some sort of process/task/thread priority scheme/s
To manage CPU load balancing (for performance reasons)
To manage CPU load balancing (for power management)
Scalability (no point having 306 CPUs all waiting to acquire a spinlock deep inside the scheduler)

Out of these things, process/task/thread creation and termination goes without saying. The process/task/thread priority scheme/s is something you'd choose to suit your OS - there isn't a right answer (just lots of wrong answers, where it's your job to choose the "least wrong" answer

).

CPU load balancing (for both performance and power management) and scalability are the interesting/hard parts...

Cheers,

Brendan

bewing · **Posted:** Thu Sep 25, 2008 1:14 pm

Virtlink wrote:

My initial thoughts are to assign one or more address spaces to each CPU. Each address space contains one or more processes, each with one or more threads.

I prefer your initial thought.

Quote:

Small processes don't need an entire address space, so the number of address space switches is reduced.

meh. Then you need to relocate them, one way or another, if they are sharing an address space. And there is no inter-process memory protection. I don't think it's worth it, just to save an address space switch.

Quote:

However, this doesn't work very well with multiple threads, because they run sequentially on the same CPU.

Important disagreement #1: I think it works quite excellently with multiple threads, for precisely that reason. There should always be enough processes/address spaces running on the machine to saturate the total number of cores. There is no reason to assume that any one process will ever have such a high priority that it will ever need to be granted exclusive use of a core, plus exclusive use of that core's hyperthread, PLUS MORE. To postulate a situation where one process is requiring that much CPU time, and all other processes are completely idle is a) ridiculous, and b) a situation where it is quite acceptable for an OS to provide less than 100% performance from the hardware. (However, if it is vital to create such a gluttonous app, then the programmer should create it as multiple processes in idependent address spaces, communicating via IPC or the filesystem, if necessary.)

Some logical CPUs share all or most of their caches. For the benefits of "locality" you want only the hardware that shares caches to be sharing address spaces. Period.

You never want to extend a process beyond the shared caches of the logical CPUs. If you do, you get RFOs, which are guaranteed to slow your system to a crawl -- and it will crawl EVEN SLOWER, on a faster machine with thousands of cores, than it will on a system with two. This is negative scalability.

You can guarantee that a hyperthread shares ALL the caches of its associated CPU. So that is the obvious candidate for sharing address spaces between logical CPUs. Beyond that, I think it is a bad idea -- unless you get really clever about writing the code, and about compiling it -- and you do not have that degree of control over user apps.

Quote:

And it has no provisions for smaller-scale parallelism, such as in loops.

I agree with Brendan. Each process (as said above) has (at most) two logical CPUs assigned to it. Managing small-scale parallelism between the two logical CPUs is the province of the app itself -- so it is a compiler issue.

Quote:

what do the rest of the CPU's do?

One runs the kernel. One runs each manager/executive/system service. One runs each interrupt. One runs each other process on the machine. Until you fill them all up. Then you start scheduling.

Brendan · **Posted:** Fri Sep 26, 2008 3:35 am

Hi,

bewing wrote:

Quote:

Small processes don't need an entire address space, so the number of address space switches is reduced.

meh. Then you need to relocate them, one way or another, if they are sharing an address space. And there is no inter-process memory protection. I don't think it's worth it, just to save an address space switch.

A 32-bit OS can use segmentation to implement "small address spaces"; where segmentation provides the relocation (segment base) and protection (segment limit). In this case you'd need to mark GDT/LDT descriptors that are used by other processes as "not present" so that CPL=3 code can't change segment registers and access data that belongs to another process (which in turn might mean using a different GDT for each CPU), but apart from that it's easy enough to do. AFAIK the L4 kernel supports this.

For a 64-bit OS running 32-bit (or 16-bit) processes the same principle applies. For a 64-bit OS running 64-bit processes it won't work though - you can use RIP relative addressing for relocation, but you can't provide the protection without messing with paging structures (e.g. marking page directories as "not present" to prevent access) which involves TLB flushing and defeats the purpose.

bewing wrote:

Quote:

However, this doesn't work very well with multiple threads, because they run sequentially on the same CPU.

Important disagreement #1: I think it works quite excellently with multiple threads, for precisely that reason. There should always be enough processes/address spaces running on the machine to saturate the total number of cores. There is no reason to assume that any one process will ever have such a high priority that it will ever need to be granted exclusive use of a core, plus exclusive use of that core's hyperthread, PLUS MORE.

Huh? Firstly this has nothing to do with priority - a process with n low priority threads should still be able to use n CPUs (if present). Otherwise you end up with entire CPUs doing nothing while one CPU does everything.

bewing wrote:

To postulate a situation where one process is requiring that much CPU time, and all other processes are completely idle is a) ridiculous, and b) a situation where it is quite acceptable for an OS to provide less than 100% performance from the hardware. (However, if it is vital to create such a gluttonous app, then the programmer should create it as multiple processes in idependent address spaces, communicating via IPC or the filesystem, if necessary.)

Huh? A situation where one process is requiring that much CPU time, and all other processes are completely idle is a) extremely likely. Think of something like a ray-tracer, where the frame being generated is divided into n areas for n CPUs. You could try to force something like this to use multiple processes instead of just using multiple threads, but in this case it'd just end up using shared memory mappings which ends up being almost identical to multi-threaded anyway (and if your OS doesn't support shared memory mappings, then they'll find a different OS that does let them use all the CPUs).

bewing wrote:

Some logical CPUs share all or most of their caches. For the benefits of "locality" you want only the hardware that shares caches to be sharing address spaces. Period.

You never want to extend a process beyond the shared caches of the logical CPUs. If you do, you get RFOs, which are guaranteed to slow your system to a crawl -- and it will crawl EVEN SLOWER, on a faster machine with thousands of cores, than it will on a system with two. This is negative scalability.

It's not so simple. For example, for multi-core CPUs often the L2 cache is shared between 2 cores.

If there's 8 logical CPUs (e.g. a 4 core CPU with hyper-threading) and a process is only allowed to use one pair of logical CPUs, then that process will end up running about 125% faster than it would if it only ran on one CPU (because the speedup from hyperthreading is nothing like the speedup from multi-CPU). If the same process was allowed to use all logical CPUs but suffered a 10% performance penalty because of things like cache line sharing, etc, then it'd end up running about 430% faster than it would if it only ran on one CPU. IMHO 430% is greater than 125% by a significant margin...

bewing wrote:

You can guarantee that a hyperthread shares ALL the caches of its associated CPU. So that is the obvious candidate for sharing address spaces between logical CPUs. Beyond that, I think it is a bad idea -- unless you get really clever about writing the code, and about compiling it -- and you do not have that degree of control over user apps.

You don't need to be very clever about writing code - it's relatively easy to design a process where one thread does one type of work and another thread does another type of work or where threads do the same type of work on different data (with very little data sharing); and it's also relatively easy to design a process consisting of many I/O bound threads (where caches are cold by the time any thread gets CPU time again anyway). Memory that's never modified can happily be in many CPU's caches at the same time (this can actually improve performance because it's faster for one CPU to get data from another CPU's cache than it is to read it from RAM).

It's only when data that is used by different CPUs is modified that you get potential performance problems. However, if a piece of data needs to be shared then a piece of data needs to be shared, and an RFO is much better than forcing programmers to rely on messaging, pipes or file I/O (or any other form of inter-process communication that doesn't involve shared memory mappings and RFOs).

Basically what you're doing is assuming you know more about the process than the person writing the process, even though you don't even know what that process is. If the programmer writing the process wants to be limited to a subset of available CPUs, then let them use CPU affinity to reduce the number of CPUs that the process can use; or alternatively, if the programmer writing the process doesn't want to be limited to a subset of available CPUs, then let them use CPU affinity to increase the number of CPUs the process can use. In this case you only need to worry about the most sensible "default" behavior; and IMHO the most sensible default behavior is to let a process use all CPUs within a NUMA domain.

bewing wrote:

Quote:

what do the rest of the CPU's do?

One runs the kernel. One runs each manager/executive/system service. One runs each interrupt. One runs each other process on the machine. Until you fill them all up. Then you start scheduling.

I wouldn't necessarily do that either. Having one CPU that's idle 20% of the time and 7 more CPUs in a low power sleep state is much better than having 8 CPUs that are idle 90% of the time (that are constantly being taken out of any low power state they manage to get into). Even better might be to have 2 CPUs going at 80% load using software controlled throttling (e.g. CPU clocks halved) with the other 6 CPUs in a low power sleep state, where every five seconds or so the CPUs are rotated so that the heat is spread across all available heatsinks (and so that no single chip generates all the heat, so that all CPUs are more able to sustain a longer period of "full throttle" activity before becoming hot).

My suggestion is to use some sort of calculation to give each potential CPU a score and then make the thread run on the CPU with the best score. That way you can dramatically change scheduling characteristics by changing the function that calculates the score. I'd start by using something like this to calculate the score:

Code:

int find_best_CPU_to_use_for_thread(THREAD_ID thread) {
    int best_CPU = -1;
    int best_score = -1;

    for(potential_CPU = 0; potential_CPU < max_CPU; potential_CPU++) {
        if(potential_CPU_is_allowed_in_CPU_affinity_mask) {
            score = 100;
            if(thread_was_running_recently) { // Previously used CPU's caches might still be warm
                if(potential_CPU = previously_used_CPU) score += 20;
                else if(potential_CPU_shares_L1_cache_with_previously_used_CPU) score += 10;
                else if(potential_CPU_shares_L2_cache_with_previously_used_CPU) score += 5;
            }
            if(process_has_a_default_NUMA_domain) {
                score -= NUMA_distance_table[potential_CPU_NUMA_domain][default_NUMA_domain];
            }
            if(potential_CPU_temperature == HOT) score -= 25;
            if(potential_CPU_temperature > COLD) {
                if(potential_CPU_state == sleeping) score -= 50;
            }
            score -= potential_CPU_current_load;
            if(bestScore < score) {
                best_score = score;
                best_CPU = potential_CPU;
            }
        }
    }
    return best_CPU;
}

Of course you'd want to do some intensive performance tuning on this...

Cheers,

Brendan

Dario · **Joined:** Sun Aug 31, 2008 12:39 pm **Posts:** 117

...I think that Sun has released CPU with 64 cores(UltraSPARC T2) and Intel with 80: http://www.intel.com/pressroom/archive/releases/20070204comp.htm

I heard that each core has it's own little router to route instruction to different cores..

Brendan · **Posted:** Fri Sep 26, 2008 6:09 am

Hi,

Dario wrote:

...I think that Sun has released CPU with 64 cores(UltraSPARC T2) and Intel with 80: http://www.intel.com/pressroom/archive/releases/20070204comp.htm

I heard that each core has it's own little router to route instruction to different cores..

Intel's chip is a research project only - from your own link:

Intel wrote:

Intel has no plans to bring this exact chip designed with floating point cores to market.

IMHO it's an attempt to maximize FLOPS without caring about "real world" performance at all.

It'd be much more likely to see something like a 32 core chip, consisting of 4 normal/complex cores glued onto 28 smaller/simpler cores (like Atom but with much wider SIMD, and marketed as a GPU). Of course this isn't just speculation....

Cheers,

Brendan

Dario · **Joined:** Sun Aug 31, 2008 12:39 pm **Posts:** 117

Brendan wrote:

Hi,

Dario wrote:

...I think that Sun has released CPU with 64 cores(UltraSPARC T2) and Intel with 80: http://www.intel.com/pressroom/archive/releases/20070204comp.htm

I heard that each core has it's own little router to route instruction to different cores..

Intel's chip is a research project only - from your own link:

Brendan

"SANTA CLARA, Calif., Feb. 11, 2007 - Intel Corporation researchers have developed the world's first programmable processor that delivers supercomputer-like performance from a single, 80-core chip not much larger than the size of a finger nail while using less electricity than most of today's home appliances."ou

:wink:

So it's already here...

YouTube link: http://www.youtube.com/watch?v=We_PRtRfiNs

But I was wrong about the SPARC processor...it has 8 cores, and 64 threads per chip( 8 threads per core). It's called "system on the chip" because it has:
-Integrated multithreaded 10 Gb Ethernet networking
-Integrated PCI Express I/O expansion
-Integrated floating point and cryptographic processing units per core

Here is the presentaton:
http://www.sun.com/processors/UltraSPARC-T2/

EDIT: oooh yes, you're right..it's research project only :oops:

bewing · **Posted:** Fri Sep 26, 2008 9:02 am

As I'm sure everyone can tell, I disagree with Brendan on the RFO issue. :wink:

If you have n independent processors (with separate caches) asynchronously running on the same data in the same address space, you are almost certainly going to need synchronization. There is no good way to do that without RFOs. If a programmer accidentally (or foolishly) writes stupid code that modifies shared memory often, that generates RFOs. The performance penalty of RFOs goes with the factorial or something of the number of CPUs involved -- so one stupid programmer with one stupid app can kill your entire machine with RFOs, if you allow it to happen. And I am surprised at Brendan in this case -- usually he is very concerned about not trusting programmers, when it comes to doing foolish things.

The performance penalty of RFOs may be 10% for a well-designed app, on a 4 core machine -- but it'll get a lot worse than that in a few years. I deny that Brendan's method is truly scalable.

On the other hand, Brendan is also always very concerned about achieving perfection, while I am much more concerned with creating something extremely simple and robust. But in this case, I think my way is better.

Colonel Kernel · **Posted:** Fri Sep 26, 2008 10:37 am

@Brendan & bewing:

The kernel-level issues you guys highlight are a major motivation behind some important ideals for future programming language design:

Processes should be lightweight first-class constructs in the language.
Shared memory should only be used if it is immutable. This fits nicely with the semantics of functional languages.
Message passing should be used instead of mutable shared memory for IPC.

These restrictions limit the "stupid app killing your machine with RFOs" problem, and a host of other gnarly problems related to concurrency and synchronization. The only usages of mutable shared memory would be in the kernel and language run-time, which hopefully excludes a lot of the "stupid" programmers bewing is justifiably concerned about.

I don't think it will be possible to design a system for massively parallel machines that is tractable and performs well if it must run code written in C-like languages.

bewing · **Posted:** Fri Sep 26, 2008 1:08 pm

Those restrictions would certainly solve the problem, but they look like major overkill to me. I'd be willing to share mutable memory between two (or perhaps 4 at most) unrelated CPUs. The factorial growth problem isn't bad at tiny numbers. I think an OS could adapt itself to setting somewhat flexible (but very small) limits, rather than jsut saying "no". It's just that I also object to the opposite extreme of "let's run all threads on all possible CPUs all at the same time!" model. It won't work.

And I program in ASM -- I really don't approve of language restrictions -- who would want to program apps for an OS, unless they could do it in a language they are familiar with?

Brendan · **Posted:** Sat Sep 27, 2008 1:10 am

Hi,

From my point of view, attempts to limit RFOs make things worse not better.

If one process can't use all CPUs then a programmer that wants to use all CPUs will just use many processes and read/write shared memory areas, which isn't really any different to having a single process with one thread per CPU. If you go the extra step and don't allow read/write shared memory areas either, then the programmer will use many processes with IPC instead, but IPC doesn't avoid RFOs (I guarantee you that somewhere deep inside the kernel there's at least one RFO per IPC transaction). IPC just pushes RFOs into the kernel and buries them under more overhead.

Think of any situation where the work done by multiple CPUs needs to be co-ordinated, and see if you can describe a way to acheive this co-ordination without RFOs somewhere. I can't think of a way that doesn't make things worse, and I doubt anyone else can either.

Bewing's solution is to solve scalability problems by not allowing scalability. After he realizes that people will work around pointless limitations by using something worse maybe he'll "solve" the scalability problems by replacing all multi-CPU systems with single-CPU systems that have no networking...

Cheers,

Brendan

Colonel Kernel · **Posted:** Sat Sep 27, 2008 1:45 am

Brendan wrote:

If you go the extra step and don't allow read/write shared memory areas either, then the programmer will use many processes with IPC instead, but IPC doesn't avoid RFOs (I guarantee you that somewhere deep inside the kernel there's at least one RFO per IPC transaction). IPC just pushes RFOs into the kernel and buries them under more overhead.

To look at it another way, it centralizes RFOs in the kernel so that IPC can be optimized as much as possible, rather than force every app developer to invent their own sub-optimal, unmaintainable, and probably broken shared-memory IPC scheme in user space.

Also, with clever language and compiler techniques IPC overhead can be virtually eliminated.

Brendan · **Posted:** Sat Sep 27, 2008 2:23 am

Hi,

Colonel Kernel wrote:

Brendan wrote:

If you go the extra step and don't allow read/write shared memory areas either, then the programmer will use many processes with IPC instead, but IPC doesn't avoid RFOs (I guarantee you that somewhere deep inside the kernel there's at least one RFO per IPC transaction). IPC just pushes RFOs into the kernel and buries them under more overhead.

To look at it another way, it centralizes RFOs in the kernel so that IPC can be optimized as much as possible, rather than force every app developer to invent their own sub-optimal, unmaintainable, and probably broken shared-memory IPC scheme in user space.

Ok, here's some example code:

Code:

    remainingWorkerThreads = neededThreads;
    for(threads = 1; threads < neededThreads; threads++) spawnThread(workerThread);

workerThread:
    doSomeWork();
    lock sub [remainingWorkerThreads],1                   ;RFO here!
    je .lastThread
    terminateThisThread();

.lastThread:
   combineResultsFromAllWorkerThreads();

Show me how this can be improved by "centralizing the RFO in the kernel"....

Cheers,

Brendan

Colonel Kernel · **Posted:** Sat Sep 27, 2008 11:39 am

Brendan wrote:

Show me how this can be improved by "centralizing the RFO in the kernel"....

That's easy. By putting it in the kernel, you don't have to entrust programmers who are far less talented and experienced than you to write that kind of code ever again. Therefore, it is improved.

A library would work too... As long as you're providing some kind of abstraction that doesn't force app developers to twiddle with counters in shared memory. I guarantee you that a significant portion of them will mess it up.

OSDev.org

Optimal multi-logical-processor scheduling

Who is online