Kevin wrote:
Okay, I guess now you know why it doesn't happen too often that Brendan and I agree...
I still think that this is the crucial point in the whole discussion: How do you deal with the multiple processes that you do get with your drivers and how do they play together with your application process?
If I understood it correctly from the latest few posts, you're going to run all drivers on CPU 0 and the application threads on CPU 1-n. Is this correct? If so, interfacing with a driver always means that you need to switch to a different CPU. Doesn't this hurt your latencies? If everything can indeed be done asynchronously, at least the throughput should be okay, but if your static HTTP server only delivers small pages instead of huge files, you're probably more interested in latency.
Again, that's still the fuzzy bit. The general idea is that CPU 0 acts as a classic microkernel in every sense (including a standard scheduler for the drivers). A worker thread, or the ring 3 process's main thread, would be able to talk to the drivers directly (maybe) or use the kernel as an intermediary (probably). It's asynchronous by default, though, so after the request is fired off they simply go about their day, doing whatever else they can keep doing. When a request has been fulfilled, CPU 0 hands the result back to the originating thread, and if that thread is sleeping because there was nothing left to do in the meantime (or it finished everything else up), it gets woken back up. It undoubtedly introduces latency; the question is how much. I'm gambling that the latency introduced by this design is far less than the latency introduced by scheduling the 20-50 threads you see in even a pretty minimal everyday server setup.
As of today, however, I'm thinking I should switch to a monolithic design (same overall concept, though), if nothing else to simplify things and get to a point where I can do practical experiments faster. If the monolithic version works out, I can refactor it into a microkernel later and see whether that's reasonably effective.
Kevin wrote:
Also, I think it requires that your drivers do nothing CPU-heavy. You won't be able to implement compressed or encrypted file systems, for example, without hurting other threads that just want to send some network packets at the same time, if you force both drivers to run on the same CPU.
Wouldn't it make more sense to run the drivers in the CPU of their caller, and perhaps also distribute IRQ handlers across the CPUs?
Then, of course, you would have multiple processes on each CPU, and the limitation to one application process becomes rather arbitrary. So maybe another question is what advantages you get from the limitation.
That's why this is a novel idea. I don't know if it will work. Know any projects that have tried it? I'd love to learn their lessons so I don't have to learn them the hard way. I just responded to rdos, who mentioned they were at one point thinking of doing an IRQ balancing act by reassigning IRQs on low-frequency timer ticks. Something to explore.
As for your general question about the advantages: if all goes well, a lot of the CPU bottleneck we simply take for granted will basically be removed from the system. Presuming, of course, that I can engineer the I/O messaging to have significantly lower latency than that bottleneck. You're right about the HTTP example, but keep in mind that large requests require very little CPU time per byte served, while tiny requests are almost pure CPU time. I'm trying to figure out a way to give as much of that time back to processing headers, resolving resources, and sanity checking at every step as humanly possible. I think I mentioned this earlier, but imagine an 8-core system trying to serve tens of thousands or hundreds of thousands of concurrent requests per second (small JS files or whatever); that's most certainly a CPU-bound system.