OSDev.org

The Place to Start for Operating System Developers

All times are UTC - 6 hours




Post new topic Reply to topic  [ 126 posts ]  Go to page Previous  1, 2, 3, 4, 5 ... 9  Next
 Post subject: Re: OSDev's dream CPU
PostPosted: Thu May 03, 2012 11:54 am 

Joined: Thu Nov 03, 2011 9:30 am
Posts: 146
Maybe I should make one bigger post instead of many short ones, but anyway. I also would like a computer with no moving parts. But on the other hand, I don't like the wear on SSDs. So having both the moving parts and the wear removed would be cool.

Oh, and again, about simplicity: it would be cool to have simpler CPUs, just so there are no bugs and no microcode, which, I think, slows down the computer. Am I right on that?


 Post subject: Re: OSDev's dream CPU
PostPosted: Thu May 03, 2012 11:55 am 

Joined: Wed Mar 09, 2011 3:55 am
Posts: 509
I'm more of a segmentation / multiple-address-space fan than most people seem to be, although I think backward compatibility with the 80(2)86 crippled the 386's segmentation scheme.

Ideally, rather than a base and limit, a segment would designate the top of a paging hierarchy (like access registers do on ESA/390 or z/Architecture).

I'd also want 8 or 16 segment registers, including separate user and kernel code and stack segment registers (so that instead of loading a new value into a single code or stack segment register on a mode switch, you'd just switch between already loaded segment registers).


 Post subject: Re: OSDev's dream CPU
PostPosted: Thu May 03, 2012 11:55 am 

Joined: Wed Oct 18, 2006 3:45 am
Posts: 9301
Location: On the balcony, where I can actually keep 1½m distance
Rudster816 wrote:
You also have to provide a way to update the TLB. It would be absolutely debilitating for the CPU to have to check every single TLB entry against every store operation to see if it needed to update something.
You'd have to do something similar for any store on SMP: if you don't own it, get it.

Which essentially means that you make caching TLB^C/D and consequently you need to check the TLB for coherency only on a cache miss - out of the hot path. Plus caches are silicon, so you can drop an address on a dedicated TLB bus and pretty much pull out all entries in one go.

The advantage is that this scheme works combined with TLB tagging without the need for retrofitting the existing system codebase - i.e. not flushing entries on an address space switch but only disabling them in case stuff gets switched back before the cache runs out of pages, and cache coherency will pull them out if another process pokes into memory = less pain on microkernels.

_________________
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]


Last edited by Combuster on Thu May 03, 2012 11:56 am, edited 1 time in total.

 Post subject: Re: OSDev's dream CPU
PostPosted: Thu May 03, 2012 11:56 am 

Joined: Mon Jun 05, 2006 11:00 pm
Posts: 2293
Location: USA (and Australia)
Rudster816 wrote:
MessiahAndrw wrote:
I'd like slower but many cores, e.g. 3,000 1 MHz cores rather than one 3 GHz core.


That would be many orders of magnitude slower than its single 3 GHz counterpart. It would also never fit in a single chip, as frequency has nothing to do with silicon usage.

http://en.wikipedia.org/wiki/Amdahl%27s_law
With the technological advances we have, it would be hard to justify limiting the speed to 1 MHz. But I was just fantasizing about how great it would be to have virtually unlimited cores. You wouldn't have to worry about time-sharing a core.

"Multitasking" would be about how you allocate cores (and deal with processes requesting extra cores) rather than context switching every 5-10 milliseconds.
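As a rough sketch of the Amdahl's-law objection quoted above (the 95% parallel fraction is an illustrative assumption, not a measured figure):

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Amdahl's law: speedup on n cores for a workload whose fraction p is parallel."""
    return 1.0 / ((1.0 - p) + p / n)

# Compare 3,000 cores at 1 MHz against a single 3 GHz core.
CORES = 3000
CLOCK_RATIO = 3000  # one 3 GHz core matches 3,000 1 MHz cores at peak throughput

speedup_vs_one_slow_core = amdahl_speedup(0.95, CORES)      # ~19.9x
slowdown_vs_fast_core = CLOCK_RATIO / speedup_vs_one_slow_core
print(f"{slowdown_vs_fast_core:.0f}x slower")               # ~151x slower
```

Even with 95% of the work parallelizable, the serial 5% caps the 3,000 slow cores at about a 20x speedup over one of themselves, leaving the machine roughly two orders of magnitude behind the single fast core.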

_________________
My OS is Perception.


 Post subject: Re: OSDev's dream CPU
PostPosted: Thu May 03, 2012 1:44 pm 

Joined: Tue Mar 24, 2009 8:11 pm
Posts: 1249
Location: Sunnyvale, California
I don't know what everyone is complaining about; I kind of like the x86. Its ISA is definitely bloated, but in a comfy, bean-bag-chair sort of way. More importantly, it's well documented, fast, and easy/cheap to find hardware for.


 Post subject: Re: OSDev's dream CPU
PostPosted: Thu May 03, 2012 1:49 pm 

Joined: Thu Jun 17, 2010 2:36 am
Posts: 141
Combuster wrote:
Rudster816 wrote:
You also have to provide a way to update the TLB. It would be absolutely debilitating for the CPU to have to check every single TLB entry against every store operation to see if it needed to update something.
You'd have to do something similar for any store on SMP: if you don't own it, get it.

Which essentially means that you make caching TLB^C/D and consequently you need to check the TLB for coherency only on a cache miss - out of the hot path. Plus caches are silicon, so you can drop an address on a dedicated TLB bus and pretty much pull out all entries in one go.

The advantage is that this scheme works combined with TLB tagging without the need for retrofitting the existing system codebase - i.e. not flushing entries on an address space switch but only disabling them in case stuff gets switched back before the cache runs out of pages, and cache coherency will pull them out if another process pokes into memory = less pain on microkernels.


I can't entirely make out what you mean because your post is poorly written. Anyways...

You greatly overestimate the capabilities of hardware. You can't just drop an address on a bus and, within a couple of clock cycles, check the entire TLB for colliding entries. Normal TLB design dictates that you don't need to store the physical address an entry was loaded from, just its virtual->physical mapping. Snooping stores would mean storing a physical address for each TLB entry, and a range for each paging-structure entry. This would increase the size of the TLB significantly, meaning fewer entries and more capacity misses.

Caches also have a very limited number of read/write ports; you can't just access the entire cache in parallel. There's a reason caches have limited associativity: this is one, and the other is that you need to compare all the entries, which sounds simple but takes time and silicon.

Say we have a level-2 TLB with 256 entries and a (quite massive) 8 read ports. The read latency of such a cache in a multi-gigahertz CPU is multiple clock cycles, but for the sake of argument, let's say it takes only one. It would take another clock cycle to compare each entry as well. So 256 entries / 8 reads per cycle * 2 (comparing takes an additional cycle) = 64 cycles for every single store operation, at least. You also can't rely on any associativity of the TLB, because it is indexed by its virtual->physical mapping, not by the physical address it's stored at. But that's the tip of the iceberg as far as cost goes. While you're checking all the TLBs (in a high-end chip, probably at least three: L1 data, L1 instruction, and unified L2) for coherency, you can't access them (all their read ports are occupied, and they might be inconsistent, so adding more ports doesn't help). This means no load/store instructions can proceed, and you can't even fetch instructions. So every store instruction would take AT LEAST as many cycles as it takes to check all of the TLBs, even if L1 access is only 1 or 2 cycles and there's a hit.
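As a sketch, the 64-cycle figure above is just the port math, using the post's assumed parameters (256 entries, 8 read ports, one-cycle reads, plus a compare cycle per batch):

```python
# Cycles to scan a whole TLB against one store, batch by batch.
TLB_ENTRIES = 256
READ_PORTS = 8          # entries readable per cycle (assumed)
CYCLES_PER_BATCH = 2    # 1 cycle to read a batch + 1 cycle to compare it

batches = TLB_ENTRIES // READ_PORTS
stall_cycles = batches * CYCLES_PER_BATCH
print(stall_cycles)     # 64 cycles stalled per store, and that's the best case
```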

You could speculate that there won't be a hit, which means loads/fetches (but not additional stores) could proceed as usual, but you would need a repair mechanism in case you're wrong. Even if such a repair mechanism were completely free (which it most definitely would not be), you would still have to wait for the TLB check to finish before you could commit any instructions after the store (as your speculation may be wrong). So the instruction latency would still be the entire time it takes to check the whole TLB hierarchy, just with improved throughput.

Software would have to ensure that one core doesn't make another's TLB inconsistent; snooping other cores' TLBs would be completely out of the question. In reality, I think implementing such automagic detection would cost literally hundreds of cycles per store instruction, plus major increases in transistor/gate count. Since 99.99% of the time this mechanism would be in vain, it's far simpler and faster to just let software do it. The cost of an INVLPG-like instruction might be high, but it's more than acceptable.


 Post subject: Re: OSDev's dream CPU
PostPosted: Fri May 04, 2012 1:01 am 

Joined: Wed Oct 01, 2008 1:55 pm
Posts: 3192
My dream CPU is a proper implementation of segmentation in 64-bit mode that can handle multi-mode threads without virtualization. Intel managed to keep compatibility when they introduced their 386 processors, but AMD failed miserably when they did this for 64-bit. (To be honest, it was Intel that first messed up the 64-bit transition, with their Itanium.)

Other than that, my dream processor would have:
- 32-bit segment registers, and a 32-bit GDT/LDT
- Would implement segment register loads in hardware, not in microcode
- Would have a switch to disable limit checking and protection tests in production releases
- Would cache large parts of the GDT/LDT/IDT to improve the speed of segment register loads and control transfers
- Would allow 16- and 32-bit code to use 64-bit registers and addressing modes
- Would allow the TR selector to be read directly with a new segment register override
- Would implement multicore-safe hardware task switching (probably in two steps: save and load)


 Post subject: Re: OSDev's dream CPU
PostPosted: Fri May 04, 2012 3:09 am 

Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

I'd want something like this:
  • Cache coherent
  • CISC with variable length instructions
  • 4 separate sets of flags, and support for predicates on all instructions. For example, "if(flags1.z) add flags2,rax,rbx" would only do the addition if the zero flag in "flags1" is set, and would set "flags2" depending on the result of the addition.
  • 16 (64-bit) general purpose registers and 16 optional SIMD registers; where all registers (including general purpose registers) can handle integers or floating point (e.g. "fadd rax,rbx"). If the CPU supports SIMD, the width of the SIMD registers would depend on the CPU itself (e.g. some CPUs might only support 128-bit SIMD, while others might support 256-bit SIMD or 512-bit SIMD).
  • Only 2 privilege levels ("user" and "supervisor")
  • Doesn't support 32-bit or 16-bit code
  • Supports 64-bit virtual addresses (and "up to 64-bit" physical addresses)
  • Uses larger paging structures to reduce the number of layers used for paging, e.g.:
    • 4 KiB pages (bits 0 to 11 are offset in page)
    • 512 KiB page tables (bits 12 to 27 select PT entry)
    • 512 KiB page directories (bits 28 to 43 select PD entry)
    • 512 KiB page directory pointer table (bits 44 to 59 select PDPT entry)
    • A set of 16 registers (like CR3) select a directory pointer table for each zone
  • Has a few MSRs that the OS can use for anything it wants (e.g. for things like "address of current task's data")
  • Doesn't support segmentation at all (not even GS)
  • No GDT at all (e.g. "SYSCALL" instruction, no software interrupts, no call gates)
  • No IDT in RAM. Use a group of 256 MSRs instead to avoid fetching from memory, where the interrupt handler must be aligned on a 16-byte boundary, and the lowest 4 bits of the MSR's value is used for DPL/attributes and the highest 60 bits are used for the address of the interrupt handler
  • No task register, no IST. MSR stores "supervisor RSP" which is used when user code is interrupted.
  • All devices are memory mapped (no IO ports, paging used instead of an IO permission bitmap)
  • No SMM. All possible sources of NMI are "opt in" and disabled by default. Hardware errors reported via something like machine check (and never NMI).
  • A special "use cache as RAM" mode; where cache coherency is disabled and each CPU uses L3 cache (or L2 cache if there's no L3) for its own private/isolated RAM. This would include special instructions to (explicitly) transfer 4 KiB pages between the "L3/L2 RAM" and the normal/external RAM (e.g. so an OS that uses this mode could use normal/external RAM as swap space). Note: this is mostly intended for firmware to use before it has initialised RAM chips; but would be fun for an OS to mess about with, and could be good for tiny systems and/or distributed systems and/or low cost "single-CPU" embedded systems (that have no external RAM at all)
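A minimal sketch of how a 64-bit virtual address would split under the paging layout above. The field widths follow the bit ranges listed; treating the top 4 bits as the selector for one of the 16 CR3-like zone registers is my assumption, since the post doesn't spell that out:

```python
def split_vaddr(va: int) -> dict:
    """Decompose a virtual address per the proposed 4-level, 512 KiB-table layout."""
    return {
        "offset": va & 0xFFF,           # bits 0-11: offset within a 4 KiB page
        "pt":     (va >> 12) & 0xFFFF,  # bits 12-27: page table index
        "pd":     (va >> 28) & 0xFFFF,  # bits 28-43: page directory index
        "pdpt":   (va >> 44) & 0xFFFF,  # bits 44-59: directory pointer table index
        "zone":   (va >> 60) & 0xF,     # bits 60-63: CR3-like register select (assumed)
    }

print(split_vaddr(0xF123_4567_89AB_CDEF))
```

Each 16-bit index implies 65,536 entries; at 8 bytes per entry that is exactly the 512 KiB table size listed above.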

Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject: Re: OSDev's dream CPU
PostPosted: Fri May 04, 2012 3:48 am 

Joined: Thu Aug 11, 2011 12:04 am
Posts: 125
Location: Watching You
I like Brendan's arch, except for the CISC part; I would prefer RISC. I was just pondering completely removing the decoder by using one-hot instruction encodings like 0001, 0010, 0100, and 1000. Issuing multiple such instructions at once would mean MIMD or MISD. This may not be very space-efficient, but it would remove the decode stage.

_________________
Get back to work!
Github


 Post subject: Re: OSDev's dream CPU
PostPosted: Fri May 04, 2012 4:25 am 

Joined: Thu Nov 03, 2011 9:30 am
Posts: 146
As for CISC/RISC...

It's like the microkernel/monolithic kernel/megalithic kernel problem. Or like BusyBox. Let me explain: you have a small program that does one thing. You can't remove anything, because then it wouldn't do anything at all. But you could add things, and here's the problem: when you start adding things, there's no end to it. You have to draw the limit somewhere.

As for CISC/RISC, I'd implement as many instructions in a processor as possible, while they still take only one cycle to execute.


 Post subject: Re: OSDev's dream CPU
PostPosted: Fri May 04, 2012 5:26 am 

Joined: Wed Oct 01, 2008 1:55 pm
Posts: 3192
Brendan's idea has basically already been implemented: Intel called it Itanium, and it was a big failure. The reason was that it couldn't run legacy software at reasonable speed. AMD's half-baked solution did offer a way to run legacy software at reasonable speed, so that's why it succeeded.

If you want x86 to become like ARM, do an ARM chip instead, and don't call it "x86"!


 Post subject: Re: OSDev's dream CPU
PostPosted: Fri May 04, 2012 7:19 am 

Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

rdos wrote:
Brendan's idea has basically already been implemented. Intel called it Itanium, and it was a big failure. The reason was that it couldn't run legacy software at reasonable speed.


Itanium expected the compiler to do instruction scheduling, etc instead of the CPU. It "failed" because there wasn't a good enough compiler, and because it severely restricts the range of optimisations that could be done in future versions of the CPU (without causing performance problems for code tuned for older Itanium CPUs). For high-end servers, support for legacy software is virtually irrelevant (it's not like the desktop space where everyone wants to be able to run ancient Windows applications).

Of course "failed" depends on your perspective. Once upon a time there were a variety of CPUs competing in the high-end server market (80x86, Alpha, Sparc, PA-RISC, etc). Itanium helped kill everything else, leaving Intel's Itanium, Intel's 80x86 and AMD's 80x86 (which can barely be called competition at all now). It's a massive success for Intel - even if they discontinue Itanium they're still laughing all the way to the bank from (almost) monopolising an entire (very lucrative) market.

That's a huge success on its own, but that's not all. The funny part is that other companies (e.g. HP) paid for most of it - it cost Intel almost nothing to slaughter the competition.


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject: Re: OSDev's dream CPU
PostPosted: Fri May 04, 2012 8:17 am 

Joined: Mon Jul 05, 2010 4:15 pm
Posts: 595
Brendan wrote:
Hi,

Itanium expected the compiler to do instruction scheduling, etc instead of the CPU. It "failed" because there wasn't a good enough compiler, and because it severely restricts the range of optimisations that could be done in future versions of the CPU (without causing performance problems for code tuned for older Itanium CPUs).


Itanium has no monopoly on compiler instruction scheduling, and there have been other successful implementations of it, like Tilera and Xtensa. The problem with Itanium in general was that it was too complex to optimize for. I've read the Itanium architecture specifications for system developers, and the ISA really makes your head hurt, while instruction scheduling for Xtensa or Tilera is pretty much straightforward and can be described on one page. Itanium is a good example of over-engineering; add that it didn't really run any cooler or faster than an Opteron, and it was pretty much obsolete.

Compiler instruction scheduling is the way forward when you go massively multicore, as out-of-order execution in hardware increases complexity a lot; multiply that by all the cores you have, and you suddenly save a lot by removing it.


 Post subject: Re: OSDev's dream CPU
PostPosted: Fri May 04, 2012 1:07 pm 

Joined: Wed Oct 01, 2008 1:55 pm
Posts: 3192
Brendan wrote:
Itanium expected the compiler to do instruction scheduling, etc instead of the CPU.


CPU instruction scheduling is a dead end. The only way to achieve more performance is to write applications with many threads.

Based on my tests, I can only conclude that Moore's law is no longer valid. We had 3 GHz CPUs several years ago, and they are just as fast as the standard CPUs of today. So, in reality, the only way to achieve better performance is to write parallel software, which your typical C compiler cannot help you with.

Brendan wrote:
It "failed" because there wasn't a good enough compiler, and because it severely restricts the range of optimisations that could be done in future versions of the CPU (without causing performance problems for code tuned for older Itanium CPUs).


Typical users couldn't care less about compilers. They want their software to work, and to run as fast as possible. Practically nobody sold native Itanium software, so the CPU ended up executing legacy software at terribly low speed.

Brendan wrote:
For high-end servers, support for legacy software is virtually irrelevant (it's not like the desktop space where everyone wants to be able to run ancient Windows applications).


That is not the high-volume market. The high-volume market is desktop PCs and portable PCs.

Brendan wrote:
Of course "failed" depends on your perspective. Once upon a time there were a variety of CPUs competing in the high-end server market (80x86, Alpha, Sparc, PA-RISC, etc). Itanium helped kill everything else, leaving Intel's Itanium, Intel's 80x86 and AMD's 80x86 (which can barely be called competition at all now). It's a massive success for Intel - even if they discontinue Itanium they're still laughing all the way to the bank from (almost) monopolising an entire (very lucrative) market.


That's not the way I understand it. AFAIK, Intel launched (and patented) Itanium in order to get rid of the competition once and for all. That was a big failure, since practically nobody bought Itanium. Then it was AMD that extended x86 to 64 bits, and thereby made sure they stayed in the market.


 Post subject: Re: OSDev's dream CPU
PostPosted: Fri May 04, 2012 1:49 pm 

Joined: Thu Jun 17, 2010 2:36 am
Posts: 141
Brendan wrote:
Hi,

I'd want something like this:
  • Cache coherent
  • CISC with variable length instructions
  • 4 separate sets of flags, and support for predicates on all instructions. For example, "if(flags1.z) add flags2,rax,rbx" would only do the addition if the zero flag in "flags1" is set, and would set "flags2" depending on the result of the addition.
  • 16 (64-bit) general purpose registers and 16 optional SIMD registers; where all registers (including general purpose registers) can handle integers or floating point (e.g. "fadd rax,rbx"). If the CPU supports SIMD, the width of the SIMD registers would depend on the CPU itself (e.g. some CPUs might only support 128-bit SIMD, while others might support 256-bit SIMD or 512-bit SIMD).
  • Only 2 privilege levels ("user" and "supervisor")
  • Doesn't support 32-bit or 16-bit code
  • Supports 64-bit virtual addresses (and "up to 64-bit" physical addresses)
  • Uses larger paging structures to reduce the number of layers used for paging, e.g.:
    • 4 KiB pages (bits 0 to 11 are offset in page)
    • 512 KiB page tables (bits 12 to 27 select PT entry)
    • 512 KiB page directories (bits 28 to 43 select PD entry)
    • 512 KiB page directory pointer table (bits 44 to 59 select PDPT entry)
    • A set of 16 registers (like CR3) select a directory pointer table for each zone
  • Has a few MSRs that the OS can use for anything it wants (e.g. for things like "address of current task's data")
  • Doesn't support segmentation at all (not even GS)
  • No GDT at all (e.g. "SYSCALL" instruction, no software interrupts, no call gates)
  • No IDT in RAM. Use a group of 256 MSRs instead to avoid fetching from memory, where the interrupt handler must be aligned on a 16-byte boundary, and the lowest 4 bits of the MSR's value is used for DPL/attributes and the highest 60 bits are used for the address of the interrupt handler
  • No task register, no IST. MSR stores "supervisor RSP" which is used when user code is interrupted.
  • All devices are memory mapped (no IO ports, paging used instead of an IO permission bitmap)
  • No SMM. All possible sources of NMI are "opt in" and disabled by default. Hardware errors reported via something like machine check (and never NMI).
  • A special "use cache as RAM" mode; where cache coherency is disabled and each CPU uses L3 cache (or L2 cache if there's no L3) for its own private/isolated RAM. This would include special instructions to (explicitly) transfer 4 KiB pages between the "L3/L2 RAM" and the normal/external RAM (e.g. so an OS that uses this mode could use normal/external RAM as swap space). Note: this is mostly intended for firmware to use before it has initialised RAM chips; but would be fun for an OS to mess about with, and could be good for tiny systems and/or distributed systems and/or low cost "single-CPU" embedded systems (that have no external RAM at all)

Cheers,

Brendan


Kind of looks like you're stuck in an x86 mindset.

I don't know what your definition of CISC is, but I think any new architecture that isn't load/store would be poor. I don't think there is a microarchitecture that doesn't turn MUL RAX, [RBX] into two micro-ops anyway; it just makes instructions longer and a lot more complex. Not sure if this is what you had in mind, though. I do agree that the fetish for fixed-length instructions is kind of silly. Right now I plan on supporting 2, 4, 6, 8, 10, and 12 byte instructions, mostly out of my desire to support full 64-bit immediates, though I could use the extra bytes for anything in the future. Instructions always falling on 2-byte boundaries should make decoding variable-length instructions a bit easier.
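For illustration only: with instructions always on 2-byte boundaries, the length can be pulled out of the first halfword before the rest of the instruction is decoded. The encoding below (length in halfwords in the low 3 bits) is entirely hypothetical; the post doesn't define one:

```python
def insn_length_bytes(first_halfword: int) -> int:
    """Hypothetical: low 3 bits of the first halfword give the length in halfwords (1..6)."""
    halfwords = first_halfword & 0b111
    return halfwords * 2

print(insn_length_bytes(0b0000_0000_0000_0011))  # a 3-halfword, 6-byte instruction
```

The point is only that a fixed-position length field lets the fetch stage find instruction boundaries without decoding opcodes, which is what makes variable-length decoding on 2-byte boundaries tractable.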

Predicates on every instruction have been shown to be unnecessary. The latest ARM architecture dropped them, mainly because they were not used on most instructions, and branch prediction has gotten much better (92%-96% nowadays). Certain instructions could still have them; I'm not sure which, but I think just CMOV and CADD would be adequate.

Just 16 registers for both GP/FP is quite anemic. I also see no purpose in using the same registers for FP and GP instructions. I'm willing to bet a simple pair of MOV gpreg, fpreg and MOV fpreg, gpreg instructions would prove more than adequate. I can hardly think of any useful ALU operations one would want to perform on FP values; in the rare case, just moving the value to a GP register and back to an FP register would be fine.

Interesting idea for non-page-size paging structures. I think 512KB is too large though, as that means a minimum of 1.5MB to map an arbitrary place in virtual memory, which is extremely high. The idea of a full 64-bit virtual address space is nice, but unnecessary; the same goes for 64-bit physical addresses. It would also greatly decrease the number of entries in the TLB. Say we have 48-bit virtual / 40-bit physical addresses (the original AMD64 implementation). With 4KB pages, that's at least 64 bits to store a virtual->physical mapping; in reality you need to store some additional things, so let's add 8 bits to make a TLB entry 72 bits, which makes a 256-entry TLB 18 Kbit. Upping the virtual address to a full 64 bits (keeping the same physical address size) would require 16 additional bits and an 88-bit TLB entry. That makes the same 18 Kbit TLB able to hold only about 209 entries, or a full 256-entry TLB about 22 Kbit. A full 64-bit scheme would require a 112-bit entry, which would make the TLB about 164 entries, or 28 Kbit. I think the costs far outweigh the benefits, especially in the case of physical addresses, because you would also need a full 64-bit address path to your memory controller. But even in the case of a 64-bit virtual address space, I don't see any benefit. What could you possibly do with a 16EB address space that you couldn't do with a 256TB one, that would be worth the decreased TLB entries?
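The entry-size bookkeeping above can be sketched as follows. This assumes the tag is the virtual page number, the payload is the physical frame number, plus 8 attribute bits, which is one plausible way to count; exact sizes depend on what attribute/ASID bits you include:

```python
PAGE_OFFSET_BITS = 12  # 4 KiB pages
ATTR_BITS = 8          # permissions, dirty/accessed, etc. (assumed)

def tlb_entry_bits(virt_bits: int, phys_bits: int) -> int:
    vpn = virt_bits - PAGE_OFFSET_BITS  # virtual page number (the tag)
    pfn = phys_bits - PAGE_OFFSET_BITS  # physical frame number (the payload)
    return vpn + pfn + ATTR_BITS

print(tlb_entry_bits(48, 40))        # 72 bits  (original AMD64-style 48/40 split)
print(tlb_entry_bits(64, 40))        # 88 bits  (full 64-bit virtual addresses)
print(tlb_entry_bits(64, 64))        # 112 bits (full 64-bit virtual and physical)
print(256 * tlb_entry_bits(48, 40))  # 18432 bits, i.e. 18 Kbit for 256 entries
```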

SMM was created out of necessity more than elegance, so I don't think it will be missed.

Using cache as RAM is another interesting idea, but again, IMO the costs outweigh the benefits. System initialization is typically done with only one CPU active in an SMP system, so disabling cache coherency just adds complexity for little reason. There would also be a huge question of how addresses are used, one to which I doubt an answer exists. Since you would still need to access outside hardware (in order to initialize the RAM), the CPU would have to differentiate between an address that is RAM and an address that is memory-mapped I/O. Even if it automatically knew (or you told it), what would it do if you tried to read/write actual RAM? Most importantly, how would you map the cache? It just adds a huge mess of logic onto the chip that would be better used elsewhere. There are also many SoCs that integrate DRAM/SRAM onto the same chip as the CPU, which would be the same thing as using cache as RAM (in the case of integrated SRAM).


Powered by phpBB © 2000, 2002, 2005, 2007 phpBB Group