OSDev.org

The Place to Start for Operating System Developers
 Post subject: Re: OdinOS: I'd love some design feedback
PostPosted: Fri Nov 22, 2013 6:26 am 
Joined: Fri Jul 03, 2009 6:21 am
Posts: 359
Brendan,

It looks like you have a need to be "right" on this question. It seems that you're emotionally attached to your statements, so instead of seeking the truth, you're trying to force an erroneous world-view onto those who have empirical evidence that runs contrary to your statements. I think your situation would improve if you opened your eyes to the difference between reality and your perception of it. You don't have to twist and turn; you can say "I was wrong" instead. Nobody would think any less of you. You are human, and your beliefs are defeasible. At some point you're going to be forced to accept this.

p.s. You read a paper on combining OSE with Linux, but OSE is a very old OS which can be used on its own and has been used in many mobile phones and telecom infrastructure products. I've developed with it since 1998 and have only combined it with Linux once. It doesn't need another OS; it only needs a boot loader.

_________________
Every universe of discourse has its logical structure --- S. K. Langer.


 Post subject: Re: OdinOS: I'd love some design feedback
PostPosted: Fri Nov 22, 2013 7:13 am 
Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

Kevin wrote:
You're starting again to change the requirements you made...


I was sure that was only an example (to illustrate the "VMs aren't ideal" point).

The main point remains - in order to get the best possible performance where it matters, you have to ensure that all resources (CPUs, disk IO, networking, whatever) are being used for the most important thing at any point in time; which means you need enough information to determine what the most important thing is at any point in time, and you need to use that information to dynamically adapt to changes as quickly as possible. If neither VM nor host has the information needed to make the best possible decisions, you can't get the best possible performance where it matters.

Of course this doesn't just apply to running 2 or more virtual machines on the same physical hardware - it can be applied to managing "resource contention" in any situation.

For a random/silly example; you could say that "free thermal headroom" is a resource that many device drivers compete for, and then use whatever information you have (CPU load, temperature sensors, etc) to manage that "free thermal headroom" resource; resulting in a system that dynamically shuffles device drivers between CPUs with spare thermal headroom, so that you don't end up with crippled performance because all your device drivers are running on a CPU that had to be throttled to avoid over-temperature.
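
(For illustration only: a minimal sketch of that kind of policy in C, where cpu_count(), cpu_temperature() and the throttle threshold are hypothetical interfaces, not taken from any real kernel.)

Code:
#define TEMP_LIMIT_C 90                 /* assumed throttle threshold */

extern int cpu_count(void);
extern int cpu_temperature(int cpu);    /* from temperature sensors */

/* Pick the CPU with the most free thermal headroom (the largest
 * margin below its throttle point) as a target for device drivers. */
static int cpu_with_most_headroom(void)
{
    int best = 0, best_headroom = -1000;
    for (int cpu = 0; cpu < cpu_count(); cpu++) {
        int headroom = TEMP_LIMIT_C - cpu_temperature(cpu);
        if (headroom > best_headroom) {
            best_headroom = headroom;
            best = cpu;
        }
    }
    return best;
}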


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject: Re: OdinOS: I'd love some design feedback
PostPosted: Fri Nov 22, 2013 7:35 am 
Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

bwat wrote:
It looks like you have a need to be "right" on this question. It seems that you're emotionally attached to your statements, so instead of seeking the truth, you're trying to force an erroneous world-view onto those who have empirical evidence that runs contrary to your statements.


What exactly might this "empirical evidence" be? None has been presented so far; and I suspect that if you actually had any you would've given some hint as to what it might be instead of ignoring the issues and resorting to pathetic emotional attacks in a shitty attempt to pretend that something you've said was right.


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject: Re: OdinOS: I'd love some design feedback
PostPosted: Fri Nov 22, 2013 7:50 am 
Joined: Sun Feb 01, 2009 6:11 am
Posts: 1070
Location: Germany
Brendan wrote:
The main point remains - in order to get the best possible performance where it matters, you have to ensure that all resources (CPUs, disk IO, networking, whatever) are being used for the most important thing at any point in time; which means you need enough information to determine what the most important thing is at any point in time, and you need to use that information to dynamically adapt to changes as quickly as possible.

Right, having the information is the crucial part. Which is completely orthogonal to whether you run the code in a VM or not. If the OS doesn't know what the HTTP server is going to do with the current read request, it can't say if it's more important than the read request of the FTP server (whose purpose is unknown as well). If you take the normal case in practice today, this information is not available to the OS, whether you use a VM or not. You could in theory add interfaces to expose the purpose of the requests so that the OS can optimise the priorities, but you can do that in either case.

In your "free thermal headroom" example, the information doesn't come from the application, but from the physical hardware. This means that the information is practically available for OS and that it can use it to assign resources. But again, it can do that both for VMs and other applications.

_________________
Developer of tyndur - community OS of Lowlevel (German)


 Post subject: Re: OdinOS: I'd love some design feedback
PostPosted: Fri Nov 22, 2013 8:21 am 
Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

Kevin wrote:
Brendan wrote:
The main point remains - in order to get the best possible performance where it matters, you have to ensure that all resources (CPUs, disk IO, networking, whatever) are being used for the most important thing at any point in time; which means you need enough information to determine what the most important thing is at any point in time, and you need to use that information to dynamically adapt to changes as quickly as possible.

Right, having the information is the crucial part. Which is completely orthogonal to whether you run the code in a VM or not. If the OS doesn't know what the HTTP server is going to do with the current read request, it can't say if it's more important than the read request of the FTP server (whose purpose is unknown as well). If you take the normal case in practice today, this information is not available to the OS, whether you use a VM or not. You could in theory add interfaces to expose the purpose of the requests so that the OS can optimise the priorities, but you can do that in either case.


If no VM is involved (e.g. application running directly on the OS); most OSs just add a "priority" tag to the IO request (and let the corresponding device driver/s figure out which requests are more important at any given point in time). If a VM is involved, the VM and host OS receive IO requests in "prioritised by guest" order (so that still works); but the priority information is lost so the host can't determine if one guest's requests are more or less important than its own requests or requests from other guests.

To pass the priority information from guest to host you'd need to add support to "virtual hardware"; but virtual devices typically mimic real hardware so that can't happen.
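
(A minimal sketch of the "priority tag" idea; the struct layout is illustrative, not from any particular kernel. The point is that a virtual device mimicking a real IDE/AHCI disk has no register or command field to carry the last member across the guest/host boundary.)

Code:
#include <stdint.h>

struct io_request {
    uint64_t lba;        /* starting block */
    uint32_t count;      /* blocks to transfer */
    void    *buffer;
    uint8_t  priority;   /* 0 = idle .. 255 = critical; set by the OS,
                            used by the driver to order its queue. A
                            virtual device that mimics real disk
                            hardware has nowhere to put this field. */
};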

Kevin wrote:
In your "free thermal headroom" example, the information doesn't come from the application, but from the physical hardware. This means that the information is practically available for OS and that it can use it to assign resources. But again, it can do that both for VMs and other applications.


Yes (mostly). It gets tricky when the information used includes pending load (e.g. how full a device driver's "request queue" is). ;)


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject: Re: OdinOS: I'd love some design feedback
PostPosted: Fri Nov 22, 2013 8:23 am 
Joined: Wed Oct 01, 2008 1:55 pm
Posts: 3192
mrstobbe wrote:
This is why the novel idea here is to stick to one multi-purpose CPU (CPU0)... again, it's a risk and certainly might not pan out. Any experience with this? Have lessons learned you'd like to share? Best case is that everything is perfectly balanced (incredibly unlikely). Worst case is that CPU0 is a bottleneck (at which point, the idea goes out the window in its current state).


At one time I planned to add hard real-time extensions. My idea was to reserve a single core for real-time tasks that would not use preemption, and that would receive no hardware IRQs. In my OS, I use IRQ balancing to even out load, so I regularly change IRQ assignments between cores to achieve even load. This works because many IRQs will schedule a thread on the core the IRQ executes on. Threads are mostly sticky, so if they start executing on one core, they stay on that core, unless the load balancer moves them to the global thread queue where any core can pick them up. The load balancer works on a time scale of 100s of milliseconds, so moving threads has little effect on performance.
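
(An illustrative sketch of that balancing scheme in C; every name here is hypothetical.)

Code:
struct thread;

extern int  find_busiest_core(void);
extern int  find_idlest_core(void);
extern int  core_load(int core);                       /* 0..100 */
extern struct thread *pick_migratable_thread(int core);
extern void push_global_queue(struct thread *t);       /* any core may pull */

#define IMBALANCE_THRESHOLD 25

/* Called on the order of every few hundred milliseconds. Threads stay
 * sticky to their core unless the imbalance justifies moving one to
 * the global queue, so migrations are rare and cheap. */
void balance_tick(void)
{
    int busiest = find_busiest_core();
    int idlest  = find_idlest_core();

    if (core_load(busiest) - core_load(idlest) > IMBALANCE_THRESHOLD) {
        struct thread *t = pick_migratable_thread(busiest);
        if (t)
            push_global_queue(t);
    }
}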

mrstobbe wrote:
I agree about the whole PML4 thing. I'm planning on only using the 4-step paging setup if it's necessary (the hardware actually has that much RAM), and making it controllable as a kernel arg. The idea is to keep the high bits 0 so that any paging mode makes sense no matter what the compiler did. Basically, the smallest table chain possible is planned to be the default. I say planning, because I just started working on getting the paging setup a couple days ago, so... there's a lot of work to even begin to get dynamic about it (and I have a full time job :)). Paging still sucks in terms of TLB, but at least most cores wouldn't be missing constantly.


Unfortunately, this is tied to the operating model. If you want to use x86-64, you must use the 4-level thing. You can get down to 2-levels by using 32-bit protected mode instead, but then you cannot use more than 4G linear memory. In order to operate without paging, you need to use 4G physical addresses directly, which is kind of awkward.
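
(For reference, the 4-level split that x86-64 mandates: four 9-bit table indices plus a 12-bit page offset, per the architecture manuals.)

Code:
#include <stdint.h>

/* x86-64 4-level translation of a 48-bit virtual address. */
static inline unsigned pml4_index(uint64_t va)  { return (va >> 39) & 0x1FF; }
static inline unsigned pdpt_index(uint64_t va)  { return (va >> 30) & 0x1FF; }
static inline unsigned pd_index(uint64_t va)    { return (va >> 21) & 0x1FF; }
static inline unsigned pt_index(uint64_t va)    { return (va >> 12) & 0x1FF; }
static inline unsigned page_offset(uint64_t va) { return va & 0xFFF; }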


 Post subject: Re: OdinOS: I'd love some design feedback
PostPosted: Fri Nov 22, 2013 9:17 am 
Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

mrstobbe wrote:
This is why the novel idea here is to stick to one multi-purpose CPU (CPU0)... again, it's a risk and certainly might not pan out. Any experience with this? Have lessons learned you'd like to share? Best case is that everything is perfectly balanced (incredibly unlikely). Worst case is that CPU0 is a bottleneck (at which point, the idea goes out the window in its current state).


For the "CPU0 is a bottleneck" worst case, the idea only needs minor changes (and doesn't need to go out the window).

For modern 80x86: with hyper-threading you'll find that logical CPUs in the same core share all caches (potentially including TLBs under certain conditions). You'll also find that often pairs of cores share some caches; and often all cores in a chip share the same last level cache (e.g. a 4-core CPU might have one pair of cores sharing an L2 data cache and another pair of cores sharing a different L2 data cache; and all cores might share the same L3 data cache). What this means is that by intelligently deciding which CPUs to use for the device drivers, you can minimise (or eliminate) cache misses that you wouldn't have had if you only used one CPU (while spreading the device driver load across 2 or more CPUs).

CPU load may not be the only thing to balance though. You might find that the CPU load caused by device drivers is not the problem, but cache pressure is the problem (e.g. all device drivers combined using more data than a single CPU can cache). In this case you can split the load and find CPUs that don't share caches to effectively double (or triple, or..) the amount of cache that device drivers can use. For example, you might put all "file IO related" drivers on CPU0 and all "network IO related" device drivers on CPU7, and end up using twice as much cache without causing any cache locality problems.

The other thing you might need to balance is power/temperature (e.g. if a "device driver CPU" is getting hot, shift all device drivers to a different CPU). Worst case (if you don't do this) is that the CPU may need to be throttled to keep cool (e.g. dropped back to 12.5% of its normal performance).

mrstobbe wrote:
I agree about the whole PML4 thing. I'm planning on only using the 4-step paging setup if it's necessary (the hardware actually has that much RAM), and making it controllable as a kernel arg. The idea is to keep the high bits 0 so that any paging mode makes sense no matter what the compiler did. Basically, the smallest table chain possible is planned to be the default. I say planning, because I just started working on getting the paging setup a couple days ago, so... there's a lot of work to even begin to get dynamic about it (and I have a full time job :)). Paging still sucks in terms of TLB, but at least most cores wouldn't be missing constantly.


There are some things you can do to minimise TLB misses:
  • Support/use "large pages" (especially for kernel)
  • Keep things close together (e.g. don't have 20 pieces of data spread out all over the virtual address space, but put them all close to each other to minimise the number of higher level paging structures used). Note that modern CPUs do cache higher level paging structures to avoid doing a full "4 level lookup" (e.g. so they can only do "2 level lookup from cached PDPT entry").
  • Mark kernel pages as "global" (this means the kernel's TLB entries don't get invalidated when you change virtual address spaces)
  • Use INVLPG to invalidate specific TLB entries (don't reload CR3 and invalidate far too much if it can be avoided) - see the sketch after this list
  • Consider making TLB invalidation coincide with virtual address space switches (by postponing TLB invalidation or doing virtual address space switches sooner). For example, if a device driver only has 2 ms of time left to run but has allocated/freed a lot of RAM, just do the task switch early and let the (unavoidable) CR3 reload invalidate TLBs instead of doing it explicitly and having some extra TLB misses before the virtual address space switch.
  • Look into supporting the "address space IDs" (PCID) feature in recent Intel CPUs. This allows you to switch virtual address spaces without causing TLB invalidations (the TLB effectively holds translations for multiple virtual address spaces at the same time, each tagged with its ID).
  • Don't forget that TLB misses can be satisfied using data from the CPU's L2/L3 data cache. This leads to some potentially interesting approaches (like using prefetch hints to get paging structure data into the data cache before the CPU needs it, so that a TLB miss costs a lot less). It also means that balancing your device driver's cache pressure (mentioned above) can reduce TLB miss costs by increasing the amount of cache drivers use.
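
A small sketch of two of the points above (GCC-style inline assembly assumed; the PTE bit value is from the Intel manuals):

Code:
#include <stdint.h>

#define PTE_GLOBAL (1ULL << 8)   /* "G" bit: entry survives CR3 reloads
                                    (requires CR4.PGE); for kernel pages */

/* Invalidate the TLB entry for a single page instead of reloading CR3
 * and throwing away far more translations than necessary. */
static inline void invlpg(void *va)
{
    __asm__ volatile("invlpg (%0)" : : "r"(va) : "memory");
}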


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject: Re: OdinOS: I'd love some design feedback
PostPosted: Fri Nov 22, 2013 9:29 am 
Joined: Sun Feb 01, 2009 6:11 am
Posts: 1070
Location: Germany
Brendan wrote:
If no VM is involved (e.g. application running directly on the OS); most OSs just add a "priority" tag to the IO request (and let the corresponding device driver/s figure out which requests are more important at any given point in time).

And it takes that priority out of thin air? Remember, applications don't tell the kernel which of their requests are more important and which are less important. (And if they did, would you really want to trust them? I mean, obviously my requests are always the most important, and it's only the other tasks that should use lower priorities for their requests.)

Quote:
If a VM is involved, the VM and host OS receive IO requests in "prioritised by guest" order (so that still works); but the priority information is lost so the host can't determine if one guest's requests are more or less important than its own requests or requests from other guests.

So when talking about single-tasking VMs, this is equivalent even without doing clever things in the VM. The application didn't tell the kernel about the priorities, and the VM doesn't tell the host about them. It needs to figure them out by itself, most likely by applying the same priority to all requests from the same application.

Quote:
To pass the priority information from guest to host you'd need to add support to "virtual hardware"; but virtual devices typically mimic real hardware so that can't happen.

I disagree, for two reasons:
a) If you can, you use paravirtual hardware, so you certainly can modify that.
b) Chapter "8.7 Command priority" in SAM-5 looks like SCSI can do what you want, so even mimicing real hardware allows you to get the desired result.

_________________
Developer of tyndur - community OS of Lowlevel (German)


 Post subject: Re: OdinOS: I'd love some design feedback
PostPosted: Fri Nov 22, 2013 10:06 am 
Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

Kevin wrote:
Brendan wrote:
If no VM is involved (e.g. application running directly on the OS); most OSs just add a "priority" tag to the IO request (and let the corresponding device driver/s figure out which requests are more important at any given point in time).

And it takes that priority out of thin air? Remember, applications don't tell the kernel which of their requests are more important and which are less important. (And if they did, would you really want to trust them? I mean, obviously my requests are always the most important, and it's only the other tasks that should use lower priorities for their requests.)


Applications can tell the kernel the priority of their requests (e.g. using POSIX Asynchronous IO). If applications don't do this, the kernel can use thread priority as IO request priority. There are plenty of ways of dealing with "trust" (e.g. limit the range of priorities a process may use).
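
For example (a minimal sketch using POSIX AIO; per POSIX, aio_reqprio is subtracted from the calling thread's priority, so a request can be deprioritised but never boosted):

Code:
#include <aio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Submit an asynchronous read at lower-than-thread priority. The aiocb
 * must stay valid until the request completes, so it is heap-allocated
 * here; error handling and cleanup are omitted for brevity. */
int submit_low_priority_read(int fd, void *buf, size_t len, off_t off)
{
    struct aiocb *cb = calloc(1, sizeof *cb);
    if (!cb)
        return -1;
    cb->aio_fildes  = fd;
    cb->aio_buf     = buf;
    cb->aio_nbytes  = len;
    cb->aio_offset  = off;
    cb->aio_reqprio = 10;       /* deprioritise relative to the caller */
    return aio_read(cb);
}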

Kevin wrote:
Quote:
If a VM is involved, the VM and host OS receive IO requests in "prioritised by guest" order (so that still works); but the priority information is lost so the host can't determine if one guest's requests are more or less important than its own requests or requests from other guests.

So when talking about single-tasking VMs, this is equivalent even without doing clever things in the VM. The application didn't tell the kernel about the priorities, and the VM doesn't tell the host about them. It needs to figure them out by itself, most likely by applying the same priority to all requests from the same application.


If the application is bad, the guest OS is bad, the VM is bad, and the host OS is trying to do the right thing; then there's not a lot the host OS can do other than use the same priority for everything the application and guest OS do, and the performance of the application and guest OS will be "worse than ideal".

Kevin wrote:
Quote:
To pass the priority information from guest to host you'd need to add support to "virtual hardware"; but virtual devices typically mimic real hardware so that can't happen.

I disagree, for two reasons:
a) If you can, you use paravirtual hardware, so you certainly can modify that.
b) Chapter "8.7 Command priority" in SAM-5 looks like SCSI can do what you want, so even mimicing real hardware allows you to get the desired result.


If the VM adds support and the OS is modified to use that support; or if the hardware does have something "close enough" that could be used (local APIC TPR is another possibility); then it could work, but only if the guest OSs are the same or agree on range/s (e.g. if one OS thinks 127 is high priority while another thinks 65000 is high priority, then that's not going to end well).
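
(Illustrative only: the rescaling itself is trivial; the missing piece is every guest agreeing to advertise its maximum in the first place.)

Code:
#include <stdint.h>

/* Map a guest priority onto the host's scale. Only meaningful if the
 * guest somehow tells the host what its maximum (guest_max) is. */
static inline unsigned host_priority(unsigned guest_prio,
                                     unsigned guest_max,
                                     unsigned host_max)
{
    return (unsigned)(((uint64_t)guest_prio * host_max) / guest_max);
}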

Of course all of this is theoretical - I doubt many/any virtual machines support any of it.


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject: Re: OdinOS: I'd love some design feedback
PostPosted: Fri Nov 22, 2013 11:38 am 
Joined: Sun Feb 01, 2009 6:11 am
Posts: 1070
Location: Germany
Brendan wrote:
Of course all of this is theoretical

Sounds like a great conclusion for this part of the discussion. The examples are theoretical, your advantages of directly running applications are theoretical, and my solution for bringing VMs to the same level is theoretical as well. Now that we understand this, we can talk about something else that actually matters.

_________________
Developer of tyndur - community OS of Lowlevel (German)


 Post subject: Re: OdinOS: I'd love some design feedback
PostPosted: Fri Nov 22, 2013 8:52 pm 
Joined: Fri Nov 08, 2013 7:40 pm
Posts: 62
rdos wrote:
At one time I planned to add hard real-time extensions. My idea was to reserve a single core for real-time tasks that would not use preemption, and that would receive no hardware IRQs. In my OS, I use IRQ balancing to even out load, so I regularly change IRQ assignments between cores to achieve even load. This works because many IRQs will schedule a thread on the core the IRQ executes on. Threads are mostly sticky, so if they start executing on one core, they stay on that core, unless the load balancer moves them to the global thread queue where any core can pick them up. The load balancer works on a time scale of 100s of milliseconds, so moving threads has little effect on performance.

Similar concept, some differences (one core reserved for tasks only), but still similar. I like the idea of IRQ balancing like that; additional overhead, but as you pointed out, rather minimal. Also, because the balanced thread is likely to be a small and consistent bit of code with very consistent behavior, I would imagine its impact on CPU state would be minimal as well. Worth exploring.

I'm starting to think that I can simplify things quite a bit by switching to a monolithic kernel design... keep the concept the same (minimal hardware support, CPU0's job is the same, etc), but it would help take some of the guesswork about who notifies whom, of what, and why out of the equation. Probably a better starting place anyway, so I can quickly start experimenting to see if the basic idea even works and learn some lessons from it. The downside is the lack of recovery potential if a driver misbehaves, but I can refactor to a microkernel as needed later.

rdos wrote:
Unfortunately, this is tied to the operating model. If you want to use x86-64, you must use the 4-level thing. You can get down to 2-levels by using 32-bit protected mode instead, but then you cannot use more than 4G linear memory. In order to operate without paging, you need to use 4G physical addresses directly, which is kind of awkward.

At the very least I can get 4KB/4-level and 2MB/3-level, and possibly 1G/2-level (although at that granularity the built-in paging access features like NX and RW go out the window in terms of practicality, and it would only be useful on systems that have lots of RAM and whose purpose is something like memory caching or a db; disk swap goes out the window too, but disk swap is a bit useless with this design anyway, so that doesn't matter much). I was thinking of playing with seeing if there's a way to use 32/36-bit addressing in long mode (haven't tried it, but it was an idea I had the other day). I mean, when you enter long mode with 36-bit addressing, it uses that page setup until you tell it otherwise, so I was thinking there might be a way to take advantage of that. Processes that used this would have to be mcmodel=small, which might be perfectly fine for less memory-hungry tasks. No idea if this is doable, or reasonable even if it is; just a thought.

Thanks for making me think! :)


 Post subject: Re: OdinOS: I'd love some design feedback
PostPosted: Fri Nov 22, 2013 9:39 pm 
Joined: Fri Nov 08, 2013 7:40 pm
Posts: 62
Kevin wrote:
Okay, I guess now you know why it doesn't happen too often that Brendan and I agree... ;) I still think that this is the crucial point in the whole discussion: How do you deal with the multiple processes that you do get with your drivers and how do they play together with your application process?

If I understood it correctly from the latest few posts, you're going to run all drivers on CPU 0 and the application threads on CPU 1-n. Is this correct? If so, interfacing with a driver always means that you need to switch to a different CPU. Doesn't this hurt your latencies? If everything can indeed be done asynchronously, at least the throughput should be okay, but if your static HTTP server only delivers small pages instead of huge files, you're probably more interested in latency.


Again, that's still the fuzzy bit. The general idea is that CPU0 acts as a classic microkernel in all senses (including a standard scheduler for the drivers). A worker thread/the ring3 main thread would be able to talk to the drivers directly (maybe) or use the kernel as an intermediary (probably). It's asynchronous by default though, so after a request is fired off they simply go about their day, doing anything else they can keep doing. When a request has been fulfilled, CPU0 gives it back to the originating thread; if that thread is sleeping because there was nothing left to do in the meantime (or it finished everything else up), it gets woken back up. It does introduce latency, undoubtedly; the question is how much. I'm gambling that the latency introduced by this design is far less than the latency introduced by scheduling the 20-50 threads you see in even a pretty minimal every-day server setup.
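
(A rough sketch of that worker-side flow; every name here is hypothetical and the queueing details are hand-waved.)

Code:
#include <stddef.h>

struct request;
struct reply;

extern void          send_request(struct request *r);  /* enqueue for CPU0 */
extern struct reply *try_get_reply(void);              /* non-blocking */
extern void          sleep_until_reply(void);          /* CPU0 wakes us */
extern int           have_other_work(void);
extern void          do_other_work(void);              /* may send_request() */
extern void          handle_reply(struct reply *rep);

/* Fire-and-forget: requests go to CPU0 asynchronously; the worker keeps
 * doing useful work and only sleeps once it runs out of things to do. */
void worker_loop(void)
{
    for (;;) {
        struct reply *rep;
        while ((rep = try_get_reply()) != NULL)
            handle_reply(rep);
        if (have_other_work())
            do_other_work();
        else
            sleep_until_reply();
    }
}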

As of today however, I'm thinking I should switch to a monolithic design (same overall concept though), if nothing else to simplify things and get to a place where I can do practical experiments faster. I can refactor it into a microkernel later and see if that's reasonably effective if the monolithic version works out.

Kevin wrote:
Also, I think it requires that your drivers do nothing CPU-heavy. You won't be able to implement compressed or encrypted file systems, for example, without hurting other threads that just want to send some network packets at the same time, when you force both drivers to run on the same CPU.

Wouldn't it make more sense to run the drivers in the CPU of their caller, and perhaps also distribute IRQ handlers across the CPUs?

Then, of course, you would have multiple processes on each CPU, and the limitation to one application process becomes rather arbitrary. So maybe another question is what advantages you get from the limitation.

That's why this is a novel idea. I don't know if it will work. Know any projects that have tried it? I'd love to learn their lessons so I don't have to learn them the hard way. I just responded to rdos, who mentioned they were at one point thinking of doing an IRQ balancing act by reassigning IRQs in low-frequency ticks. Something to explore.

As for your general question about the advantages: if all goes well, a lot of the CPU bottleneck we simply take for granted will basically be removed from the system. Presuming, of course, that I can engineer the I/O messaging to be significantly lower latency than that bottleneck. You're right about the HTTP example, but keep in mind that large requests require very little CPU time per request, while tiny requests are almost purely CPU time per request. I'm trying to figure out a way to give as much of that time back to processing headers, resolving resources, and sanity checking at every step as humanly possible. I think I mentioned this earlier, but imagine an 8-core system that's trying to serve up tens of thousands or hundreds of thousands of concurrent requests per second (like small JS files or whatever); that's most certainly a CPU-bound system.


 Post subject: Re: OdinOS: I'd love some design feedback
PostPosted: Fri Nov 22, 2013 10:07 pm 
Joined: Fri Nov 08, 2013 7:40 pm
Posts: 62
Kevin wrote:
mrstobbe wrote:
Brendan wrote:
If one process starts a second process and waits until the child process terminates, then that's 2 processes (where only one is given CPU time, but both share memory, have file handles, etc) and not a single process. Of course if you're planning to have drivers running in their own virtual address spaces (as processes) it's not really single process anyway; and you're effectively doing "multiple processes and multi-tasking, with different obscure limits on what different types of processes can do".

Pure semantics... if one process starts another but can't execute (in terms of processor time... can't see the light of day again) until the other one exits, it's still a mono-process system. Again, pure semantics.

Yes, but I think Brendan still has a valid point: Your drivers are processes that exist all the time and run in parallel with the single application. So you already have to have some kind of multitasking in order to run the drivers. (And if Brendan and I agree on something, there are chances it is right - because it doesn't happen too often.)

For this reason, a microkernel without multitasking is probably a contradiction in itself. The difference that you can make compared to the "normal" OS is that you don't schedule based on a timer, but only on events like IRQs or explicit calls into a driver function.

Super sorry Kevin, must have missed this post the other day, replying now...

It's semantics again... how would you define this design? "Micro-kernel/multi-tasking on one CPU, but the rest not"? "MkMtoocBtrn"? I should trademark that now :). Fact of the matter is, the kernel is a microkernel design, but it's mono-process (application or whatever you want to call it) in a system-wide sense. Pure semantics though. It's not set in stone what you call it. I think I'll call it microkernel-mononormalizing-polytasking from now on.

You're wrong about the difference though: with no context switching on all but one CPU, I'd say that's significantly different from the "normal" OS.


 Post subject: Re: OdinOS: I'd love some design feedback
PostPosted: Fri Nov 22, 2013 10:27 pm 
Joined: Fri Nov 08, 2013 7:40 pm
Posts: 62
Kevin wrote:
Also, I think it requires that your drivers do nothing CPU-heavy. You won't be able to implement compressed or encrypted file systems, for example, without hurting other threads that just want to send some network packets at the same time, when you force both drivers to run on the same CPU.

Sorry, I didn't respond to this directly. Doing so now...

You're absolutely right... but compression is expensive, while encryption (mostly; depending on stream-or-not, pre-flight-compute-done-already-or-not, etc) is pretty trivial. I digress though, because it doesn't address your underlying point. Don't forget about buffering. I'm also operating under the assumption that CPU0 will be the most underutilized resource under full load given its nature, so tacking on one or two CPU-intensive tasks shouldn't be a concern (we'll see). Also keep in mind that I mentioned that I planned only minimal support for hardware. I also stated that I didn't care about supporting any kind of standards that facilitate quick development (POSIX or what-not). I should have also said (and will edit in a moment) that I don't care about supporting any standard formats (like file systems or what-not) that don't fit this OS's design goals. If you're talking about highly computational tasks in general, name one that can't be part of the worker flow. Also keep in mind that the use cases I envision are situations where "cold-start" isn't a chronic state: buffers/caches are primed, and commits can happen as resources allow or need preempts.


 Post subject: Re: OdinOS: I'd love some design feedback
PostPosted: Sat Nov 23, 2013 12:38 am 
Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

mrstobbe wrote:
As of today however, I'm thinking I should switch to a monolithic design (same overall concept though), if nothing else to simplify things and get to a place where I can do practical experiments faster. I can refactor it into a microkernel later and see if that's reasonably effective if the monolithic version works out.


Micro-kernel sacrifices some performance for security/stability. Monolithic sacrifices some security/stability for performance.

For "single application" there's a third way that sacrifices more security/stability to get even more performance (and let's be honest, that one application is the only thing that can matter, so protecting the kernel and drivers from that application isn't really that important): run the application at CPL=0 (to minimise the cost privilege level switching); and implement the OS as a library that is linked into the application (so that the compiler can do optimisation, including having parts of the kernel in header files that are inlined, and possibly link-time optimisation).

This gives you the absolute maximum possible performance for that single application. Of course this ends up being equivalent to a "bootable application" developer kit.
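
As a sketch of what that might look like (the "libos.h" header and every os_* call here are hypothetical, purely to show the shape of the idea):

Code:
#include "libos.h"   /* hypothetical kernel-as-a-library, all at CPL=0 */

extern void serve(int conn);             /* the application's own logic */

/* Entered directly from the boot stub: no syscalls, no privilege
 * transitions; "kernel" functions are ordinary (inlinable) calls. */
void app_main(void)
{
    os_net_init();
    int sock = os_tcp_listen(80);
    for (;;)
        serve(os_tcp_accept(sock));
}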

[EDIT]: Strangely; a "bootable application" developer kit actually makes a lot of sense to me. I can imagine application developers (possibly HPC) using "network boot" to start a large number of computers from a central server (possibly with something like NFS for file IO), with fast application updates and zero setup/configuration/installation hassle.[/EDIT]


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.

