Hi Octocontrabass,
Thank you for your feedback. Challenges like this help me reflect on the design and (hopefully) make it stronger.
I have found that the first and best use for caches is to avoid repeated fetches and writes of frequently used blocks.
Consider two primary use cases of reads and writes to RAM with caching enabled.
1. Writing a tiny amount of data, such as an octet.
2. Reading a huge amount of data, such as scanning a disk buffer of database indexes, much larger than the on-chip cache.
If the RAM in question is not already cached, then the CPU must first fetch it. In scenario 1, a whole cache line is fetched, not just the variable in question, and an existing line is evicted to make room (requiring a write-back first, if that line is dirty). So to cache one small write, data that might have been better left cached is swapped out.
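For what it's worth, x86 already offers a partial escape hatch for scenario 1: non-temporal stores, which write to memory without filling a cache line first (the smallest is 4 bytes, so a single octet still can't quite get this treatment). A minimal sketch, assuming SSE2 and a compiler that provides the standard intrinsics header:

Code:
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_stream_si32, _mm_sfence */

/* Write a 32-bit value straight to memory, bypassing the cache, so a
 * one-off store doesn't evict a line that was better left cached.
 * 'dst' must be a valid, writable address. */
static void store_uncached(int *dst, int value)
{
    _mm_stream_si32(dst, value);  /* non-temporal store: no line fill */
    _mm_sfence();                 /* make the store globally visible */
}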
In scenario 2, every line of the buffer must be read into the cache before it can be used. The lines don't get dirty, so there is no write on eviction, yet whatever was in the cache beforehand is evicted to make room for this sequence of one-time reads. In all likelihood, what was in the cache to start with will have to be read back in afterward.
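Scenario 2 also has a documented mitigation: prefetching with the non-temporal hint, which asks the CPU to stage the incoming data so it pollutes the existing cache contents as little as possible. A sketch of a one-pass scan, assuming a 64-byte line size (prefetch instructions never fault, so running past the end of the buffer is harmless):

Code:
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_NTA */
#include <stddef.h>

/* Sum a large buffer once, hinting that the data is non-temporal so
 * the scan evicts as little of the existing cache as possible. */
static long sum_one_pass(const char *buf, size_t len)
{
    long total = 0;
    for (size_t i = 0; i < len; i++) {
        if ((i & 63) == 0)  /* once per 64-byte line: hint the next one */
            _mm_prefetch(buf + i + 64, _MM_HINT_NTA);
        total += buf[i];
    }
    return total;
}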
In both of these scenarios there is little benefit to caching: the memory must all be read anyway, and any gain, it seems to me, would be lost to the cache churn (up to twice the overhead). (Your comment makes me think I am missing something here, so I will give it some thought; perhaps you can help me see the flaw.)
AFAIK the stack is by far the most frequently reused area of RAM, and I haven't found anything to indicate the CPU explicitly caches the stack. Because the stack is CPU-specific, it doesn't even need to be written back; only an individual CPU cares about its own stack. (Perhaps a better hardware design would be to dedicate on-chip RAM to the stack; it really doesn't need to be in main memory at all.) Even if Intel agreed to the change (unlikely), we'd still be five years away from it reaching the mainstream. If I find a way to do it, I'd have the CPU keep the stack cached indefinitely, never evicted.
So, by caching only the stack (and a few other things) I believe I can minimize the amount of time the CPU has to wait on memory reads and writes. (And if I'm wrong, it's one flag change!)
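On x86 that "one flag change" is literally one bit: PCD (page-level cache disable, bit 4 of a page-table entry), with PWT (write-through, bit 3) next to it. A sketch of flipping it, where pte_of() is a hypothetical helper assumed to exist in the paging code:

Code:
#include <stdint.h>

#define PTE_PWT (1u << 3)  /* page-level write-through (x86) */
#define PTE_PCD (1u << 4)  /* page-level cache disable (x86) */

/* Hypothetical helper: returns a pointer to the page-table entry
 * that maps 'vaddr'. */
extern uint64_t *pte_of(void *vaddr);

/* Mark one page as uncacheable (or cacheable again). After changing
 * the PTE, the stale TLB entry must be flushed with invlpg. */
static void set_page_cacheable(void *vaddr, int cacheable)
{
    uint64_t *pte = pte_of(vaddr);
    if (cacheable)
        *pte &= ~(uint64_t)PTE_PCD;
    else
        *pte |= PTE_PCD;
    __asm__ volatile("invlpg (%0)" :: "r"(vaddr) : "memory");
}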
Having all PCI DMA access take place in non-cached areas should mean the PCI bus and the CPUs never have to synchronize their caches with RAM to start and end a DMA transfer: fewer cycles there as well. With one CPU the overhead is relatively small, but with multiple CPUs it grows geometrically rather than linearly. (Unless the PCI bus knows what is already cached, this must be very similar to spinlock performance losses, if not quite as steep a drop-off.)
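For contrast, here is roughly what the synchronization looks like when a DMA buffer *is* cached: every line covering the buffer has to be flushed back to RAM before the device reads it. A sketch assuming SSE2's clflush and a 64-byte line size (real code should query CPUID for the actual size):

Code:
#include <emmintrin.h>  /* _mm_clflush, _mm_mfence */
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed line size */

/* Flush every cache line covering [buf, buf+len) back to RAM so a
 * bus-mastering device sees the current data. This per-line walk is
 * exactly the overhead an uncached DMA region avoids. */
static void flush_for_dma(const void *buf, size_t len)
{
    uintptr_t p   = (uintptr_t)buf & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)buf + len;
    for (; p < end; p += CACHE_LINE)
        _mm_clflush((const void *)p);
    _mm_mfence();  /* order the flushes before the DMA kick-off */
}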
Because my OS will not allow data to be written to the stack, only return addresses, and because the compiler will enforce low call depth and no recursion, the stacks will be relatively tiny (some OSes use 1 MB stacks; I expect far less than 64 KB, maybe even 16 KB).
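The arithmetic behind that estimate: if a frame is just a return address (8 bytes on x86-64) and the compiler caps the call depth, the worst case is tiny. A back-of-the-envelope sketch, with the depth limit as an assumed placeholder:

Code:
#include <stdio.h>

#define RET_ADDR_SIZE  8   /* bytes per return address on x86-64 */
#define MAX_CALL_DEPTH 64  /* assumed compiler-enforced limit */

int main(void)
{
    /* With no data on the stack and no recursion, the worst case is
     * one return address per call level. */
    int worst_case = RET_ADDR_SIZE * MAX_CALL_DEPTH;
    printf("worst-case stack: %d bytes\n", worst_case);  /* 512 bytes */
    return 0;
}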
Quote:
OSes have been doing this for the past 20 years [paraphrased]
Since the early 90s, at least! But only in response to OSes that were written with compilers that were not a good match for the job. C as it stands is great for synchronous code but a poor choice for asynchronous code: you either have to write programs that are specifically architected at a high level to be cooperatively multitasked, or use preemption, which is standard practice but computationally inefficient. I am proposing a simpler programming language/compiler that compiles to cooperatively multitasked code (and what is perhaps a new approach to synchronization that doesn't tie up the processor in wait spins).
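To make "compiles to cooperatively multitasked code" concrete, here is a toy shape such a compiler could target: each task is a resumable step function that does a small unit of work and returns, and a flat loop round-robins the tasks with no timer interrupts and no spinning. All names here are illustrative, not part of any real design:

Code:
#include <stdio.h>

/* A task is a step function: it does one small unit of work and
 * returns nonzero while it still has work left. Yielding is just
 * returning. */
typedef int (*task_fn)(void *state);

static int count_task(void *state)
{
    int *n = state;
    printf("tick %d\n", (*n)++);
    return *n < 3;  /* done after three ticks */
}

int main(void)
{
    int a = 0, b = 0;
    task_fn tasks[] = { count_task, count_task };
    void *states[]  = { &a, &b };
    int live = 2;

    /* Round-robin until every task reports completion: cooperative
     * returns instead of preemption or wait spins. */
    while (live > 0) {
        live = 0;
        for (int i = 0; i < 2; i++) {
            if (tasks[i] && tasks[i](states[i]))
                live++;
            else
                tasks[i] = 0;  /* finished; skip from now on */
        }
    }
    return 0;
}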
Most barriers to progress are resistance to change, particularly "we have always done it this way, why change?". I think having done it a certain way for 30 years, in what is supposed to be a pioneering field, is a reason to look for alternatives.
Perhaps I am being naive, or arrogant, or perhaps they are the same thing. Or perhaps this task is just too large for one developer to pull off. But I think it's worth throwing out the rulebook and starting over.
Thanks again, and I will definitely remember your feedback as I make progress.