OSDev.org

The Place to Start for Operating System Developers




 Post subject: The Mill: a new low-power, high-performance CPU design
PostPosted: Mon Feb 17, 2014 1:09 pm 

Joined: Wed Jan 06, 2010 7:07 pm
Posts: 792
I'm not sure if this is the right place to post this, but... :P

http://ootbcomp.com/topic/introduction- ... g-model-2/

So, they're claiming a 10x power/performance gain over traditional, superscalar, out-of-order CPUs. From what I can see, they do bring some genuine innovations to general-purpose computing. It's very VLIW-esque, but their CTO is both an experienced compiler-writer and CPU designer, so they've managed to avoid the issues of Itanium and friends.

Basically, it's a wide-issue (33 pipelines!), statically scheduled belt machine (no general registers; you just reference "this many results back"), with only a 5-cycle pipeline depth, plus some new tricks (deferred and pickup loads) to hide memory latency just as well as or better than an out-of-order core. A call instruction, including passing all arguments (on the callee's belt rather than in general registers), is a single cycle, and old stack cache lines are discarded on return (rather than evicted to DRAM). Interrupts work just like calls.
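
To make the belt idea concrete, here's a toy model in Java (purely illustrative, not Mill code; the belt length of 8 is just a placeholder) - results drop onto the front of a fixed-length queue and operands are named by how far back they sit, so there are no register names to allocate at all:

Code:
import java.util.ArrayDeque;

// Toy model of a belt: new results drop onto the front, old ones fall off
// the end, and operands are addressed as "b0, b1, ..." (0 = most recent).
// Purely conceptual, not cycle-accurate.
class ToyBelt {
    private final int length;
    private final ArrayDeque<Long> slots = new ArrayDeque<>();

    ToyBelt(int length) { this.length = length; }

    // Read the value that dropped onto the belt 'back' results ago.
    long get(int back) {
        int i = 0;
        for (long v : slots) {
            if (i++ == back) return v;
        }
        throw new IllegalArgumentException("fell off the belt: b" + back);
    }

    // Every operation drops its result on the front of the belt.
    void drop(long value) {
        slots.addFirst(value);
        if (slots.size() > length) slots.removeLast();  // oldest value is discarded
    }

    public static void main(String[] args) {
        ToyBelt belt = new ToyBelt(8);
        belt.drop(2);                                   // b0 = 2
        belt.drop(3);                                   // b0 = 3, b1 = 2
        belt.drop(belt.get(0) + belt.get(1));           // "add b0, b1" -> 5
        System.out.println(belt.get(0));                // prints 5
    }
}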

They have a new error model that uses metadata bits instead of exceptions/faults whenever possible, which greatly simplifies speculative execution and enables pipelining and vectorizing of many, many more types of loops. For example, they can do a maximum-width vectorized strcpy in a single instruction that issues in a single cycle, by masking out the bits past the null terminator (there's an illustration and a more in-depth explanation on the linked page).
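
Here's a rough sketch of the masking idea, again just an analogy in plain Java (the 8-byte chunk width and the scalar inner scan stand in for what the hardware does in one wide operation): copy a whole chunk, but only keep the bytes up to and including the null terminator.

Code:
import java.util.Arrays;

// Sketch of a "masked" strcpy step: process a fixed-width chunk at a time
// and mask out everything past the null terminator. The hardware would do
// one chunk per wide vector op; the loop over the chunk stands in for that.
class MaskedStrcpy {
    static final int WIDTH = 8;  // placeholder vector width in bytes

    // Copies the C-style string from src into dst chunk by chunk and
    // returns the number of bytes copied (including the terminating 0).
    static int strcpy(byte[] dst, byte[] src) {
        int off = 0;
        while (true) {
            int n = Math.min(WIDTH, src.length - off);
            byte[] chunk = Arrays.copyOfRange(src, off, off + n); // "vector load"
            int keep = n;
            boolean found = false;
            for (int i = 0; i < n; i++) {                // build the mask
                if (chunk[i] == 0) { keep = i + 1; found = true; break; }
            }
            System.arraycopy(chunk, 0, dst, off, keep);  // "masked store"
            if (found || off + n >= src.length) return off + keep;
            off += n;
        }
    }

    public static void main(String[] args) {
        byte[] src = "hello".getBytes();
        byte[] srcZ = Arrays.copyOf(src, src.length + 1); // append the NUL
        byte[] dst = new byte[16];
        System.out.println(strcpy(dst, srcZ) + " bytes: " + new String(dst, 0, 5));
    }
}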

The memory hierarchy has been reorganized to avoid TLB-miss latency and even trapping to the OS to allocate physical pages (that's done with a hardware free list, populated by a background process). It's single-address-space, with protection available down to byte granularity. It's immune to false aliasing, making static load scheduling even better.
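
A loose OS-level analogy for the free-list part (this is software, not what the Mill hardware does, and the frame numbers and pool size are invented): a background refiller keeps a queue of ready frames topped up, so the allocation fast path never has to trap or search.

Code:
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Software analogy of a hardware-populated free list: a background refiller
// keeps a pool of ready-to-use frame numbers so the hot path just dequeues
// one instead of trapping into a slow allocator.
class FrameFreeList {
    private final BlockingQueue<Integer> free = new ArrayBlockingQueue<>(64);
    private int nextFrame = 0;

    FrameFreeList() {
        Thread refiller = new Thread(() -> {
            try {
                while (true) {
                    free.put(nextFrame++);   // "zero a frame" and publish it
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        refiller.setDaemon(true);
        refiller.start();
    }

    // Fast path: grab a prepared frame without any slow allocation work.
    int allocateFrame() throws InterruptedException {
        return free.take();
    }

    public static void main(String[] args) throws InterruptedException {
        FrameFreeList list = new FrameFreeList();
        for (int i = 0; i < 4; i++) {
            System.out.println("allocated frame " + list.allocateFrame());
        }
    }
}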

They have a really good series of talks if you have time to watch the videos; if not, the linked page has some more in-depth explanation.

_________________
[www.abubalay.com]


 Post subject: Re: The Mill: a new low-power, high-performance CPU design
PostPosted: Mon Feb 24, 2014 2:01 am 

Joined: Wed Oct 01, 2008 1:55 pm
Posts: 3192
I like it. Especially the call part and the memory segmentation. :-)


 Post subject: Re: The Mill: a new low-power, high-performance CPU design
PostPosted: Thu Feb 27, 2014 6:45 am 

Joined: Mon Jan 26, 2009 2:48 am
Posts: 792
I like the belt.


 Post subject: Re: The Mill: a new low-power, high-performance CPU design
PostPosted: Sat Mar 01, 2014 4:32 pm 
Rusky wrote:
So, they're claiming a 10x power/performance gain over traditional, superscalar, out-of-order CPUs ... It's very VLIW-esque

Extended VLIW is the future of the processor industry. And the claim of a 10x gain is just a baby step compared to the potential of a fully compiler-controlled processor with a very good compiler.
Rusky wrote:
so they've managed to avoid the issues of Itanium and friends.
And what were the issues with Itanium and friends? Just not enough VLIW?
Rusky wrote:
it's a wide-issue (33 pipelines!)
What is the problem with just 33 pipelines? Why not 1000? Just because a processor is not suitable for high-level optimization. It can't do such simple things as optimized deep method inlining, for example. It's just dumb hardware. There must be a good compiler and a managed language environment; a processor lacks both.

The future lies with compilers that optimize everything end to end in a managed environment.


 Post subject: Re: The Mill: a new low-power, high-performance CPU design
PostPosted: Sat Mar 01, 2014 11:36 pm 

Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

Rusky wrote:
So, they're claiming a 10x power/performance gain over traditional, superscalar, out-of-order CPUs.


For what it's worth, I'm sceptical.

For things that GPUs can do 50 times faster than conventional CPUs, it might be 10 times faster than a conventional CPU (but still worse than a GPU). For everything else I'd expect performance similar to a conventional CPU, or worse. For a simple example, consider binary search or iterating over a linked list - these cases end up being a combination of "pointer chasing" and branching, where long pipelines kill performance.
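
To make the "pointer chasing" case concrete, here's a plain Java example (nothing Mill-specific): in a linked-list walk, every load's address comes from the previous load, so no amount of issue width helps until the previous miss resolves.

Code:
// Each iteration needs the result of the previous load before it can even
// compute the next address, so wide issue and static scheduling can't
// overlap the cache misses - the walk is one long dependency chain.
class PointerChase {
    static final class Node {
        final int value;
        Node next;
        Node(int value) { this.value = value; }
    }

    static int sum(Node head) {
        int total = 0;
        for (Node n = head; n != null; n = n.next) {  // load -> use -> load ...
            total += n.value;
        }
        return total;
    }

    public static void main(String[] args) {
        Node head = null;
        for (int i = 0; i < 1000; i++) {   // build a small list
            Node n = new Node(i);
            n.next = head;
            head = n;
        }
        System.out.println(sum(head));     // 0 + 1 + ... + 999 = 499500
    }
}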

The other thing I'd worry about is clock speeds. From what I've seen they're claiming a lot more instructions per clock cycle but not saying how fast the clock is. A 100 MHz CPU that does 20 instructions per clock cycle is not faster than a 2 GHz CPU that does 1 instruction per clock cycle.
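
To spell the arithmetic out (the two configurations are just the ones from the sentence above):

Code:
// 100 MHz x 20 instructions/cycle versus 2 GHz x 1 instruction/cycle:
// both come out to the same 2,000 million instructions per second.
class Throughput {
    public static void main(String[] args) {
        long wide   = 100_000_000L * 20;   // wide issue, slow clock
        long narrow = 2_000_000_000L * 1;  // narrow issue, fast clock
        System.out.println(wide + " vs " + narrow); // 2000000000 vs 2000000000
    }
}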

Finally, assuming that a compiler will be able to optimise the code sufficiently seems like a bad idea to me. It sounds great in theory, until later on when everyone realises the compiler can't do as much as you hoped and your nice Itanium or Mill CPU fails to get close to its theoretical maximum performance in practice.


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject: Re: The Mill: a new low-power, high-performance CPU design
PostPosted: Sun Mar 02, 2014 6:36 am 
Brendan wrote:
assuming that a compiler will be able to optimise the code sufficiently seems like a bad idea to me. It sounds great in theory, until later on when everyone realises the compiler can't do as much as you hoped and your nice Itanium or Mill CPU fails to get close to its theoretical maximum performance in practice.
But there is a reason for the difference between theory and practice. The compiler loses required information when pointer manipulation is involved, and it doesn't know whether a global variable is accessed from outside the code being optimized. But in a managed environment like Java there are no pointers, and many global variables are replaced with object fields, where the objects are often just local storage for their field data. This means the information the compiler requires can be found and used. As a consequence, theory can close the gap with practice.
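
To make this concrete, here is a sketch of the kind of optimization I mean (standard JVM escape analysis and scalar replacement, nothing exotic; the class is invented for the example): when an object provably never leaves a method, its fields can be treated as plain locals and kept in registers.

Code:
// If 'p' never escapes this method, a JIT's escape analysis can replace the
// object with two locals (scalar replacement): no allocation, no loads or
// stores, just register arithmetic. The same code with a shared/global Point
// would force memory traffic, because other code might observe the fields.
class EscapeDemo {
    static final class Point {
        int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    static int distSquared(int x, int y) {
        Point p = new Point(x, y);   // never stored anywhere, never returned
        return p.x * p.x + p.y * p.y;
    }

    public static void main(String[] args) {
        System.out.println(distSquared(3, 4)); // 25
    }
}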


 Post subject: Re: The Mill: a new low-power, high-performance CPU design
PostPosted: Sun Mar 02, 2014 9:07 am 

Joined: Fri Jun 13, 2008 3:21 pm
Posts: 1700
Location: Cambridge, United Kingdom
embryo wrote:
Brendan wrote:
assuming that a compiler will be able to optimise the code sufficiently seems like a bad idea to me. It sounds great in theory, until later on when everyone realises the compiler can't do as much as you hoped and your nice Itanium or Mill CPU fails to get close to its theoretical maximum performance in practice.
But there is a reason for the difference between theory and practice. The compiler loses required information when pointer manipulation is involved, and it doesn't know whether a global variable is accessed from outside the code being optimized. But in a managed environment like Java there are no pointers, and many global variables are replaced with object fields, where the objects are often just local storage for their field data. This means the information the compiler requires can be found and used. As a consequence, theory can close the gap with practice.


Accurate, situational branch prediction. Parallelization appropriate for all hardware, present and future. The processor has a wealth of information available to it that the compiler does not.

That said, the Mill is quite interesting because it explicitly isn't heavily compiler dependent. Discussions on comp.arch (which involve several industry professionals) are quite positive about it.


 Post subject: Re: The Mill: a new low-power, high-performance CPU design
PostPosted: Sun Mar 02, 2014 1:02 pm 
Owen wrote:
The processor has a wealth of information available to it that the compiler does not.
The compiler doesn't have it because the processor manufacturer is simply hiding it. With the full information available to the compiler, there is no way for the processor to win the competition.
Owen wrote:
the Mill is quite interesting because it explicitly isn't heavily compiler dependent
Why should the processor be compiler-independent? Only because of the very wide usage of such processors; it is a legacy problem.

But the Mill is interesting (to me) simply because it is in line with the general direction of a 'good' processor architecture - VLIW.


 Post subject: Re: The Mill: a new low-power, high-performance CPU design
PostPosted: Sun Mar 02, 2014 2:19 pm 

Joined: Fri Jun 13, 2008 3:21 pm
Posts: 1700
Location: Cambridge, United Kingdom
embryo wrote:
Owen wrote:
The processor has a wealth of information available to it that the compiler does not.
The compiler doesn't have it because the processor manufacturer is simply hiding it. With the full information available to the compiler, there is no way for the processor to win the competition.

The compiler has no knowledge of branch directions that the program takes, no knowledge of the memory bus constraints of various systems, etc, etc. How would it? Is it supposed to have a time machine? Is it supposed to know every detail of every future microarchitecture design?

The processor will win every time. You believe in the same delusions that sunk the Itanium. Everyone in the know who was involved in the design process knew of the disaster that was happening there then and fled.

embryo wrote:
Owen wrote:
the Mill is quite interesting because it explicitly isn't heavily compiler dependent
Why should the processor be compiler-independent? Only because of the very wide usage of such processors; it is a legacy problem.

But the Mill is interesting (to me) simply because it is in line with the general direction of a 'good' processor architecture - VLIW.


Itanium depended upon the compiler. It stunk (The compiler didn't have the information). It stinks even more now, because there are many Itaniums, all with different properties, and the compiler has no hope. Never mind that in times of increasingly constrained memory bandwidth one cannot afford to have 33% of your instruction fetch bandwidth wasted by NOPs as Itanium does. (And even if you completely discounted the NOPs, the Itanium encoding still has pathetic density.)

VLIW is even worse because every compiled binary becomes invalid (or highly inefficient) whenever you increase the issue width. (Or else you have to add out-of-order execution support, in which case the whole VLIW thing was an enormous distraction, because you're now in the same spot as a non-VLIW processor, except your instruction stream is now stuffed with NOPs nobody wants.)

VLIW is great, if and only if you are designing an embedded processor which is never expected to exhibit binary backwards compatibility. In those cases you can get away with it (in fact it's often the obvious tradeoff where performance is needed). The AMD Radeons famously used VLIW architectures (VLIW4, VLIW5) up through and including the Radeon HD 6000 series. It is perhaps noteworthy, then, that the GCN architecture is not VLIW (or EPIC, or anything similar), but instead scalar.


 Post subject: Re: The Mill: a new low-power, high-performance CPU design
PostPosted: Sun Mar 02, 2014 5:42 pm 

Joined: Wed Jan 06, 2010 7:07 pm
Posts: 792
They're planning to start out at around 1GHz, but they also say there's nothing preventing them from moving up to the usual 3-4GHz range (here's an interesting analysis). The new developments are what allow the Mill to move from "low power like a DSP but slow at general purpose" to "low power like a DSP and good at general purpose."

For example, they tossed the "sufficiently-smart compiler" magic requirement. It's statically scheduled, with an exposed pipeline, so the compiler knows the latency of each operation and how many of each type it can issue in a cycle (also, NOPs are in the padding between ops in an instruction, so most of the time they're free, encoding-wise). Obviously this is dependent on a specific Mill, so they plan on distributing higher-level modules and using install-time compilation for the last step.

For loads, which are variable-latency, they specify in the instruction how many cycles ahead they want the load to finish. They also intercept potentially-aliasing stores, so loads can be hoisted as far as possible (this is an example of why the processor beats the compiler). This hides the same memory latencies an OoOE CPU does, but at far lower power and without the instruction window limitations (the compiler can see the whole program).
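
As a very loose software analogy of a deferred load (only a sketch - the Mill encodes the distance as a cycle count in the load instruction itself, while the Java future below works quite differently under the hood): issue the request early, do independent work, and only pick the value up where it's needed.

Code:
import java.util.concurrent.CompletableFuture;

// Software analogy of "issue the load early, use it later": the request is
// started well before the value is needed, independent work fills the gap,
// and the consumer only blocks if the data hasn't arrived yet.
class DeferredLoad {
    static CompletableFuture<Integer> issueLoad(int[] memory, int index) {
        return CompletableFuture.supplyAsync(() -> memory[index]); // starts now
    }

    public static void main(String[] args) {
        int[] memory = {10, 20, 30, 40};
        CompletableFuture<Integer> load = issueLoad(memory, 2); // hoisted early

        int independent = 0;
        for (int i = 0; i < 1000; i++) independent += i;  // latency-hiding work

        int value = load.join();            // "pickup": value wanted here
        System.out.println(value + independent);
    }
}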

Metadata stored with in-CPU values (Not-a-Result, None, width, etc.) allows for compiler-specified speculative execution (greatly reducing the need for actual branch instructions), vectorizing even with control flow, a smaller instruction encoding, etc.

They also have a split instruction stream (half the ops go up and half go down), which means they can double the icache size without slowing down the critical path. This also effectively organizes code into basic blocks (you jump into a block with a single pointer which then goes both ways). The processor does run-ahead transfer prediction on these blocks, the results of which can be stored back in the load module, so something like binary search should be awesome here (combine cheap speculative execution with run-ahead prediction).

_________________
[www.abubalay.com]


 Post subject: Re: The Mill: a new low-power, high-performance CPU design
PostPosted: Sun Mar 02, 2014 6:15 pm 
Owen wrote:
The compiler has no knowledge of branch directions that the program takes
The processor also doesn't know it until it's too late; it just executes instructions speculatively and throws away all the speculative work once the branch outcome is calculated.
Owen wrote:
no knowledge of the memory bus constraints of various systems
There are the PCI standard and the processor manufacturer's electrical constraints, and nothing more. The processor knows just two things - what its manufacturer writes in a specification, and the same PCI standard. Why can the compiler not know such information?

But the compiler knows the program structure. It knows, for example, where a branch is located, can infer limits for both outcomes of the branch, and can then simply cache both code fragments to help speed things up. The processor has no need to speculate, because the compiler can feed it with work the compiler knows for certain should be done. And it can have that knowledge because the program structure is available to it.
Owen wrote:
You believe in the same delusions that sunk the Itanium. Everyone in the know who was involved in the design process knew of the disaster that was happening there then and fled.
The Itanium has its market share. Maybe a bad design has significantly reduced that share, but it's not a problem of the architecture, just of the design. There was no efficient compiler to support the architecture; maybe the processor internals had design problems. And there was no market for it. No market and no good compiler - the result is a disaster. But what if there had been a good compiler? And a market? That is the case for the Java server applications market: it is very big, and a processor with a good compiler, supported by a Java Operating System, could perform very well even with relatively small investment.
Owen wrote:
Itanium depended upon the compiler. It stunk (The compiler didn't have the information)
The information, or a bad design? It was the first thing of its class from Intel, which had never before made a real processor architecture change.

But with C code it is really hard to create a good compiler. So there should be another compilation target - Java, for example. There are no pointers in Java, which prevents the programmer from using many hacks that hide required information from the compiler.
Owen wrote:
Never mind that in times of increasingly constrained memory bandwidth one cannot afford to have 33% of your instruction fetch bandwidth wasted by NOPs as Itanium does.
It is not the Itanium that wastes the bandwidth, but the compiler. It is not efficient. Why should the Itanium be blamed?
Owen wrote:
VLIW is even worse because every compiled binary becomes invalid (or highly inefficient) whenever you increase the issue width
If the processor changes, then there should be a change in the software, yes. And the C-like approach really does make permanent recompilation a very important issue. But there is another approach, where recompilation is limited to a few components only - the Java Operating System approach. Having such an alternative, I can ask: why should we throw away a really good technology like VLIW? It manages to hold a market share even with the inherent C compiler problems. What could it deliver if there were no such problems?
Owen wrote:
VLIW is great, if and only if you are designing an embedded processor which is never expected to exhibit binary backwards compatibility.
The limited recompilation mentioned above solves the backward-compatibility problem. Then I can declare a win for VLIW :)


 Post subject: Re: The Mill: a new low-power, high-performance CPU design
PostPosted: Sun Mar 02, 2014 6:27 pm 
Rusky wrote:
They also intercept potentially-aliasing stores, so loads can be hoisted as far as possible (this is an example of why the processor beats the compiler)
Why can a compiler not take care of potentially-aliasing stores? It knows the program structure and every variable in it; why can't it detect aliases?


 Post subject: Re: The Mill: a new low-power, high-performance CPU design
PostPosted: Sun Mar 02, 2014 7:00 pm 

Joined: Fri Jun 13, 2008 3:21 pm
Posts: 1700
Location: Cambridge, United Kingdom
embryo wrote:
Owen wrote:
The compiler has no knowledge of branch directions that the program takes
The processor also doesn't know it until it's too late; it just executes instructions speculatively and throws away all the speculative work once the branch outcome is calculated.

The processor has branch predictors which tell it which branch is likely. Obviously, these are right more often than wrong. It speculatively executes the likely branch, and most of the time that is the right call.

Even if you do profile guided optimizations (i.e. the compiler looks at a previous run and encodes branch hints to the processor so it can choose the best direction based upon which was most likely in the profiling run), that can't adequately deal with cases where the situation on the user's computer is different (e.g. they're doing something different with the same code).

Accurate branch prediction is really important. Two decades ago, a significant performance enhancement could be had just by getting the branch predictor to predict the exit condition of a loop correctly (note that a pure history-based predictor will get this wrong every time). It's only gotten more exacting since.
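
To make the loop-exit example concrete, here's a toy two-bit saturating counter predictor (a generic textbook scheme, not any particular CPU's): trained on a loop's back-edge it predicts "taken" every time, so the single not-taken exit branch is mispredicted on every trip through the loop.

Code:
// Classic 2-bit saturating counter: 0-1 predict not-taken, 2-3 predict taken.
// Run it on a loop back-edge (taken, taken, ..., not taken) and the single
// exit branch is mispredicted every time the loop finishes.
class TwoBitPredictor {
    private int counter = 2;  // start at "weakly taken"

    boolean predict() { return counter >= 2; }

    void update(boolean taken) {
        if (taken)  { if (counter < 3) counter++; }
        else        { if (counter > 0) counter--; }
    }

    public static void main(String[] args) {
        TwoBitPredictor p = new TwoBitPredictor();
        int mispredicts = 0;
        int trips = 100, iterations = 10;
        for (int t = 0; t < trips; t++) {
            for (int i = 0; i < iterations; i++) {
                boolean taken = i < iterations - 1;  // back-edge taken except on exit
                if (p.predict() != taken) mispredicts++;
                p.update(taken);
            }
        }
        System.out.println(mispredicts + " mispredicts in " + trips + " loop exits");
    }
}
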
embryo wrote:
Owen wrote:
no knowledge of the memory bus constraints of various systems
There are the PCI standard and the processor manufacturer's electrical constraints, and nothing more. The processor knows just two things - what its manufacturer writes in a specification, and the same PCI standard. Why can the compiler not know such information?

I don't know about your system, but in mine the RAM is not connected by PCI. It's probably DDR3 - and in both of our machines likely different speeds of DDR3. You might have a lot of money and hence an Intel Sandy Bridge/Ivy Bridge E series processor with quad-channel memory for incredible bandwidth.

My compiler can't know this. It can't know the latency of the memory. It can't know the current cache pressure. It can't know the current contention between the various devices vying for the bus - the processor core does not exist in isolation; it sits on a chip, probably with a number of other cores, plus an on-chip or external GPU, alongside a bunch of hardware devices which are all competing for access to memory.

It's a dynamic situation. The exact nature of the memory bandwidth available changes from microsecond to microsecond. The latency of said memory access varies from microarchitecture to microarchitecture, from individual machine to individual machine.

The compiler can't know. So the processor works around it - out of order execution and all.
embryo wrote:
But the compiler knows the program structure. It knows, for example, where a branch is located, can infer limits for both outcomes of the branch, and can then simply cache both code fragments to help speed things up. The processor has no need to speculate, because the compiler can feed it with work the compiler knows for certain should be done. And it can have that knowledge because the program structure is available to it.
Owen wrote:
You believe in the same delusions that sunk the Itanium. Everyone in the know who was involved in the design process knew of the disaster that was happening there then and fled.
The Itanium has its market share. Maybe a bad design has significantly reduced that share, but it's not a problem of the architecture, just of the design. There was no efficient compiler to support the architecture; maybe the processor internals had design problems. And there was no market for it. No market and no good compiler - the result is a disaster. But what if there had been a good compiler? And a market? That is the case for the Java server applications market: it is very big, and a processor with a good compiler, supported by a Java Operating System, could perform very well even with relatively small investment.
Owen wrote:
Itanium depended upon the compiler. It stunk (The compiler didn't have the information)
The information, or a bad design? It was the first thing of its class from Intel, which had never before made a real processor architecture change.

Itanium was a joint project by Intel (x86, of course, but also i860, i960, StrongARM and XScale), HP (HPPA) and Compaq (who had purchased DEC, who had developed VAX and Alpha). The expertise was all there.

But the architecture depends upon the compiler to deal with the fact that it doesn't do anything out of order and the compiler needs to fill in exactly what instructions can be executed in parallel. The thing is, the compiler can't predict memory latencies, so it can't get the optimization perfect, and the processor doesn't do anything out of order so it can't paper over the fact that the compiler needs to be conservative at boundaries of what it can see.

For example, a function call to a virtual method (i.e. every non final method in Java, since that seems to be your language of choice). How can the compiler know what implementation of that method it is landing in? It can't, so it has to be conservative and assume that any object it has a handle to may be modified by said method unless it can prove otherwise (that is, the object was created in this method and never passed to something it can't see the code for). This creates lots of unnecessary memory loads and stores, and therefore the processor has to deal with this (out-of-order execution saves the day).
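
A small Java illustration of that conservatism (the class names are invented): across a call whose target is unknown, a field value cannot be kept cached in a register, because the callee might overwrite it.

Code:
// Because 'sink.accept(c)' might be any override, the compiler cannot assume
// c.total is unchanged across the call: it has to reload the field afterwards
// instead of keeping it in a register. If the JIT can prove there is only one
// implementation (or the method is final), the reload disappears.
class ConservativeCalls {
    static class Counter { int total; }

    interface Sink { void accept(Counter c); }

    static int addTwice(Counter c, Sink sink) {
        int before = c.total;     // load
        sink.accept(c);           // unknown code: may write c.total
        return before + c.total;  // must re-load c.total here
    }

    public static void main(String[] args) {
        Counter c = new Counter();
        c.total = 5;
        System.out.println(addTwice(c, x -> x.total += 1)); // 5 + 6 = 11
    }
}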


embryo wrote:
But with C code it is really hard to create a good compiler. So there should be another compilation target - Java, for example. There are no pointers in Java, which prevents the programmer from using many hacks that hide required information from the compiler.


Any compiler developer will tell you that pointers aren't the real problem: C99 and above place enough constraints on their use that optimizing with them around is very possible.

embryo wrote:
Owen wrote:
Never mind that in times of increasingly constrained memory bandwidth one cannot afford to have 33% of your instruction fetch bandwidth wasted by NOPs as Itanium does.
It is not the Itanium that wastes the bandwidth, but the compiler. It is not efficient. Why should the Itanium be blamed?

Because it's not the compiler's fault. For most code (i.e. everything outside the core loops of numeric code - and I'll admit Itanium shines there... but most of the time so does the GPU, and where it doesn't you might find a modern x86 cheaper and almost as fast), people can't do better even with hand-written assembly.

The architecture is fundamentally flawed.
embryo wrote:
Owen wrote:
VLIW is even worse because every compiled binary becomes invalid (or highly inefficient) whenever you increase the issue width
If the processor changes, then there should be a change in the software, yes. And the C-like approach really does make permanent recompilation a very important issue. But there is another approach, where recompilation is limited to a few components only - the Java Operating System approach. Having such an alternative, I can ask: why should we throw away a really good technology like VLIW? It manages to hold a market share even with the inherent C compiler problems. What could it deliver if there were no such problems?

Because VLIW still has the problem of grossly wasting memory bandwidth, which only gets worse with greater instruction word widths (and anyway, VLIW is of no help at all for the majority of control-flow-oriented code).
embryo wrote:
Owen wrote:
VLIW is great, if and only if you are designing an embedded processor which is never expected to exhibit binary backwards compatibility.
The limited recompilation mentioned above solves the backward-compatibility problem. Then I can declare a win for VLIW :)

If compilers didn't universally suck at a lot of problems. They really suck at vectorizing register intensive maths code, such as that found at the core of every video codec ever. When pushed, they really struggle with register allocation, causing unnecessary loads and/or stores which just compound matters.

These are problems which we can never fix completely, because optimal register allocation for arbitrary problems is only possible in NP time, and most people would like their compilation to complete this millennium. Vectorization is highly complex (though I don't think there is any formal analysis of its complexity, I wouldn't expect it to be better than complexity class P, and suspect it is probably NP).
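
To make the vectorization point concrete, here is the standard textbook pair of loops (purely illustrative, nothing Itanium- or Mill-specific): the first has independent iterations and can be vectorized mechanically; the second carries a dependence from one iteration to the next and cannot, no matter how clever the register allocator is.

Code:
// a[i] = b[i] + c[i] has independent iterations: a compiler (or a JIT) can
// process several i per vector instruction. The prefix-sum loop below needs
// a[i-1] before it can produce a[i], so straightforward vectorization is
// impossible - each iteration depends on the one before it.
class VectorizeOrNot {
    static void vectorizable(int[] a, int[] b, int[] c) {
        for (int i = 0; i < a.length; i++) {
            a[i] = b[i] + c[i];
        }
    }

    static void loopCarried(int[] a) {
        for (int i = 1; i < a.length; i++) {
            a[i] += a[i - 1];   // depends on the previous iteration
        }
    }

    public static void main(String[] args) {
        int[] b = {1, 2, 3, 4}, c = {10, 20, 30, 40}, a = new int[4];
        vectorizable(a, b, c);
        loopCarried(a);
        System.out.println(java.util.Arrays.toString(a)); // [11, 33, 66, 110]
    }
}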

Compilers must make approximations. Never underestimate the processing power of the human brain.

It may take a human 24 hours of thought and work to optimize an algorithm for a processor, but that only needs to be done once. Meanwhile, people tend to frown if a compiler spends 2 hours doing the same thing every compile, and I think they'd complain if it takes more than 0.1s for a JIT compiler...

embryo wrote:
Rusky wrote:
They also intercept potentially-aliasing stores, so loads can be hoisted as far as possible (this is an example of why the processor beats the compiler)
Why can a compiler not take care of potentially-aliasing stores? It knows the program structure and every variable in it; why can't it detect aliases?
I addressed this above in my comments regarding virtual methods.


 Post subject: Re: The Mill: a new low-power, high-performance CPU design
PostPosted: Mon Mar 03, 2014 6:06 am 
Owen wrote:
The processor has branch predictors which tell it which branch is likely
This and the following arguments are based on the same thing - the processor has runtime information. But how does the runtime information make the processor work faster? It is all about algorithms. The processor has microprograms with those algorithms and just runs some of them when that seems helpful. Now recall what the compiler is: it is also an entity with algorithms, and even much better algorithms. The compiler also has a lot of static information.

Now we can see the picture - just runtime information and some weak algorithms against a lot of static information and very powerful algorithms. And we should add one more thing to the picture - the compiler can inject those weak algorithms at compilation time and feed the processor with locally optimized actions depending on the content of the current code section. In simple words, the compiler just replaces the internal microprograms with compiler-driven ones. The compiler also takes care of placing the injected algorithms in the right cache, ensuring that the processor has everything it requires.

In the end, what is hardware? It's just dumb silicon. And where does it get the power to work fast? It is provided with algorithms by some external entity. And if that entity is a compiler - what would the problem be in that case?

Once more - the compiler has all information except the runtime information. The runtime behaviour of a processor is defined by externally provided algorithms. The compiler can provide the processor with ANY algorithm it decides is suitable for a particular task. Can you imagine a processor with every possible algorithm? I can imagine such a compiler, but not such a processor. At the very least, it takes a lot less time to update a compiler than to update a processor design and redistribute new chips across the world.

The compiler wins without question. Isn't it obvious?
Owen wrote:
My compiler can't know this. It can't know the latency of the memory. It can't know the current cache pressure. It can't know the current contention between the various devices vying for the bus.
It is all just a mix of specification information availability and runtime data processing. The first is the same for the compiler and for the processor; the second is algorithm-driven and, in the processor's case, leads to monstrous algorithm storage within the processor. Why can't we eliminate that storage and allow the compiler to provide a suitable algorithm?
Owen wrote:
The latency of said memory access varies from microarchitecture to microarchitecture
It's just another point to consider when we think about how fast a system could be with all the chip area allocated to logic units instead of a big (and, with compiler-provided algorithms, useless) microprogram store.
Owen wrote:
Itanium was a joint project by Intel (x86, of course, but also i860, i960, StrongARM and XScale), HP (HPPA) and Compaq (who had purchased DEC, who had developed VAX and Alpha). The expertise was all there.
Corporate internals can look worse when viewed from inside. Maybe it was communication issues, maybe bad management, maybe the vision was too simplistic.
Owen wrote:
For example, a function call to a virtual method (i.e. every non final method in Java, since that seems to be your language of choice). How can the compiler know what implementation of that method it is landing in?
Actually, in Java it is also non-private, non-static and non-special methods (constructors and static initializers are special). But to the question: the processor must wait until the actual object reference (address) is available, and only then can it decide on the actual method address. What does the processor know about the method address? Only that the address should exist, and nothing more. But the compiler always knows the data structures responsible for the address lookup, and with ease (compared to the processor's blind attempts) it can order the system to cache the actual address information for a few successors of the base class. Then, when the processor executes the jump to the actual method, all the candidate addresses are already in its register file; the only additional operation before the jump is an index calculation to select a particular register. Again we see that the compiler beats the processor with ease.
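
What I describe is close to what JIT compilers already do under the name of guarded devirtualization (inline caches); here is a hedged sketch of the transformation written out in plain Java (the class names are invented, and a real compiler does this on machine code, not source):

Code:
// Sketch of guarded devirtualization: the compiler observed that the receiver
// is almost always a Circle or a Square, so it emits cheap type checks and
// direct (inlinable) calls, falling back to full virtual dispatch otherwise.
class GuardedDispatch {
    interface Shape { double area(); }
    static final class Circle implements Shape {
        double r; Circle(double r) { this.r = r; }
        public double area() { return Math.PI * r * r; }
    }
    static final class Square implements Shape {
        double s; Square(double s) { this.s = s; }
        public double area() { return s * s; }
    }

    static double areaDevirtualized(Shape shape) {
        if (shape instanceof Circle) {          // guard for the common case
            return ((Circle) shape).area();     // direct call, can be inlined
        } else if (shape instanceof Square) {   // second expected type
            return ((Square) shape).area();
        }
        return shape.area();                    // fallback: full virtual dispatch
    }

    public static void main(String[] args) {
        System.out.println(areaDevirtualized(new Square(3))); // 9.0
    }
}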

It should be mentioned that there are cases, such as the successors of the root class (Object), where the number of successors can be measured in thousands; but even in such situations the compiler can just look at the code a bit before the virtual call and find the actual object's type there, as in the example:
Code:
Object var = new String();
var.hashCode(); // hashCode() is Object's method, but the receiver is statically known to be a String

It means the compiler wins again.
Owen wrote:
Any compiler developer will tell you that pointers aren't the real problem
If we have a pointer to some structure, how can we be sure there is no access to the same address from another thread? Next we use pointer arithmetic: what structure does the pointer address now? How do we deal with it? What do we cache? It is an inherent problem of unmanaged languages, and there are a lot of problems like it. All problems of that kind are simply eliminated in Java.
Owen wrote:
If compilers didn't universally suck at a lot of problems. They really suck at vectorizing register intensive maths code
Can you provide an example of a compiler fault here?
Owen wrote:
optimal register allocation for arbitrary problems is only possible in NP time
Why should we limit ourselves to the arbitrary case? We have all the information required to remove the uncertainty. So we should just use that information.
Owen wrote:
people tend to frown if a compiler spends 2 hours doing the same thing every compile
Yes, it is an issue. But for the Java Operating System there should be just a few full compiles per month - is that a problem? It is not a consumer toy, and there are no children waiting for a game to react quickly.


 Post subject: Re: The Mill: a new low-power, high-performance CPU design
PostPosted: Mon Mar 03, 2014 7:03 am 

Joined: Fri Jul 03, 2009 6:21 am
Posts: 359
embryo wrote:
This and the following arguments are based on the same thing - the processor has runtime information. But how does the runtime information make the processor work faster? It is all about algorithms. The processor has microprograms with those algorithms and just runs some of them when that seems helpful. Now recall what the compiler is: it is also an entity with algorithms, and even much better algorithms. The compiler also has a lot of static information.


The negative answer to the Entscheidungsproblem says Owen is right and you're wrong. You've been wrong since 1936 on this one.


[Edit: German nouns start with capital letters].

_________________
Every universe of discourse has its logical structure --- S. K. Langer.


Last edited by bwat on Mon Mar 03, 2014 7:20 am, edited 1 time in total.



