OSDev.org

The Place to Start for Operating System Developers
It is currently Thu Apr 18, 2024 10:27 pm

All times are UTC - 6 hours




Post new topic Reply to topic  [ 144 posts ]  Go to page Previous  1, 2, 3, 4, 5, 6, 7 ... 10  Next
Author Message
 Post subject: Re: The Mill: a new low-power, high-performance CPU design
PostPosted: Wed Mar 05, 2014 12:44 pm 
Offline
Member
Member
User avatar

Joined: Wed Oct 18, 2006 3:45 am
Posts: 9301
Location: On the balcony, where I can actually keep 1½m distance
Quote:
You know what? Show us the code a compiler would output if it did branch prediction itself. Give us an example that's better than the CPU doing it.

Code:
(...)
for (int i = 0; i < 50; i++)
{
    (do stuff)
}
(...)

Code:
(...)
CLR r1
.loop:
CMP r1, 50-1
MOV r2, .endloop
(add n-2 parallel ops here)
MOVNBE r2, .loop
(add n-1 parallel ops here)

(do parts of stuff)

DJMP r2 ; fetching starts now
(do last few instructions of stuff to fill delay slot)
INC r1
.endloop:
; jump will have taken effect here.
(...)

_________________
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]


Top
 Profile  
 
 Post subject: Re: The Mill: a new low-power, high-performance CPU design
PostPosted: Wed Mar 05, 2014 1:20 pm 
Offline
Member
Member
User avatar

Joined: Wed Jan 06, 2010 7:07 pm
Posts: 792
And what if stuff modifies i?

You also have the problem of telling the CPU where and how much to prefetch. You could add a prefetch instruction so the CPU doesn't have to worry about whether you change r2 before the jump, which would even let you prefetch basic blocks rather than just guessing the size, but...

The Mill does that in hardware, running ahead through the chain of basic blocks in parallel with execution, to prefetch them all (the way it does this is pretty cool, you should check out the talk). It predicts which branch will be taken from each BB, and caches that, as well as storing the cache back with the executable so it can be used later. Do that from the compiler alone!

_________________
[www.abubalay.com]


Top
 Profile  
 
 Post subject: Re: The Mill: a new low-power, high-performance CPU design
PostPosted: Wed Mar 05, 2014 4:32 pm 
Rusky wrote:
What other algorithm are you possibly going to use for detecting aliasing stores?
Stores in general? It is better to analyse high level code and to find aliasing assignments. If it is impossible then additional information is required about the situation. How far potential aliases are one from another? What are the instructions in between?


Top
  
 
 Post subject: Re: The Mill: a new low-power, high-performance CPU design
PostPosted: Wed Mar 05, 2014 5:58 pm 
Offline
Member
Member
User avatar

Joined: Tue Dec 25, 2007 6:03 am
Posts: 734
Location: Perth, Western Australia
There are two very large issues with the compiler doing all the work.

1. The compiler has an inherently limited view of the system state. Unless you feed in the entire system (or at least a large chunk of it), there will be unpredictable data sources. The CPU on the other hand knows what these data sources are as soon as they enter cache (which, due to prefetch, may be many cycles before they're used).

2. How would the compiler encode the variable likelyhood of a branch being taken? It would have to indicate a rather complex set of rules for the prefetch code... almost as if it was re-encoding the instruction.

_________________
Kernel Development, It's the brain surgery of programming.
Acess2 OS (c) | Tifflin OS (rust) | mrustc - Rust compiler
Currently Working on: mrustc


Top
 Profile  
 
 Post subject: Re: The Mill: a new low-power, high-performance CPU design
PostPosted: Wed Mar 05, 2014 8:18 pm 
Offline
Member
Member
User avatar

Joined: Fri Jun 13, 2008 3:21 pm
Posts: 1700
Location: Cambridge, United Kingdom
Combuster wrote:
Quote:
You know what? Show us the code a compiler would output if it did branch prediction itself. Give us an example that's better than the CPU doing it.

Code:
(...)
for (int i = 0; i < 50; i++)
{
    (do stuff)
}
(...)

Code:
(...)
CLR r1
.loop:
CMP r1, 50-1
MOV r2, .endloop
(add n-2 parallel ops here)
MOVNBE r2, .loop
(add n-1 parallel ops here)

(do parts of stuff)

DJMP r2 ; fetching starts now
(do last few instructions of stuff to fill delay slot)
INC r1
.endloop:
; jump will have taken effect here.
(...)


BETTER. Trivial loop exit conditions are uninteresting and, well, trivial. Intel have been correctly predicting them for 15 years.

In fact your code is worse, because it wastes two registers and two instructions per cycle.

Please try again :-)


Top
 Profile  
 
 Post subject: Re: The Mill: a new low-power, high-performance CPU design
PostPosted: Wed Mar 05, 2014 11:50 pm 
Offline
Member
Member
User avatar

Joined: Wed Oct 18, 2006 3:45 am
Posts: 9301
Location: On the balcony, where I can actually keep 1½m distance
Owen wrote:
BETTER. Trivial loop exit conditions are uninteresting and, well, trivial. Intel have been correctly predicting them for 15 years.

In fact your code is worse, because it wastes two registers and two instructions per cycle.
And you think it's fair to simply strip silicon somewhere and not giving it back at an equal rate in another part of the machine?

Quote:
Now that we've got enough background, let's take a look at Intel's own description of the loop detector:

The Loop Detector analyzes branches to see if they have loop behavior. Loop behavior is defined as moving in one direction (taken or not-taken) a fixed number of times interspersed with a single movement in the opposite direction. When such a branch is detected, a set of counters are allocated in the predictor such that the behavior of the program can be predicted completely accurately for larger iteration counts than typically captured by global or local history predictors.
Which means that it only works if the loop always has the same number of iterations, and always fails the first time :wink:

_________________
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]


Top
 Profile  
 
 Post subject: Re: The Mill: a new low-power, high-performance CPU design
PostPosted: Thu Mar 06, 2014 12:03 am 
Offline
Member
Member
User avatar

Joined: Wed Jan 06, 2010 7:07 pm
Posts: 792
And yet, your example is still broken.

Not to mention the Mill prefetches entire functions at a time, even from cold code, so if it did mispredict loop exit (they don't specify a predictor algorithm because different models will probably use different algos) the penalty would be only 5 cycles (which can be helped with speculative execution).

How is your example better than that?

_________________
[www.abubalay.com]


Top
 Profile  
 
 Post subject: Re: The Mill: a new low-power, high-performance CPU design
PostPosted: Thu Mar 06, 2014 12:57 am 
Offline
Member
Member
User avatar

Joined: Wed Oct 18, 2006 3:45 am
Posts: 9301
Location: On the balcony, where I can actually keep 1½m distance
Because zero mispredictions are better than occasional mispredictions?

(Also, you just used a "no it isn't" as an argument from authority)

_________________
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]


Top
 Profile  
 
 Post subject: Re: The Mill: a new low-power, high-performance CPU design
PostPosted: Thu Mar 06, 2014 1:05 am 
Offline
Member
Member
User avatar

Joined: Wed Jan 06, 2010 7:07 pm
Posts: 792
Combuster wrote:
Because zero mispredictions are better than occasional mispredictions?

That depends on the cost of a misprediction vs the cost of your implementation.

_________________
[www.abubalay.com]


Top
 Profile  
 
 Post subject: Re: The Mill: a new low-power, high-performance CPU design
PostPosted: Thu Mar 06, 2014 1:08 am 
Offline
Member
Member
User avatar

Joined: Wed Oct 18, 2006 3:45 am
Posts: 9301
Location: On the balcony, where I can actually keep 1½m distance
So at least you agree that we're all comparing apples to oranges here with this sort of stuff. :wink:

_________________
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]


Top
 Profile  
 
 Post subject: Re: The Mill: a new low-power, high-performance CPU design
PostPosted: Thu Mar 06, 2014 1:28 am 
Offline
Member
Member
User avatar

Joined: Wed Jan 06, 2010 7:07 pm
Posts: 792
My point was that this quote was wrong:
Combuster wrote:
Bottom line: there's no such thing as the CPU being superior to the compiler.

_________________
[www.abubalay.com]


Top
 Profile  
 
 Post subject: Re: The Mill: a new low-power, high-performance CPU design
PostPosted: Thu Mar 06, 2014 1:31 am 
Offline
Member
Member
User avatar

Joined: Wed Oct 18, 2006 3:45 am
Posts: 9301
Location: On the balcony, where I can actually keep 1½m distance
Because you're trying to compare apples to oranges again? What is a CPU without compiler worth?

_________________
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]


Top
 Profile  
 
 Post subject: Re: The Mill: a new low-power, high-performance CPU design
PostPosted: Thu Mar 06, 2014 4:22 am 
Offline
Member
Member
User avatar

Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

Combuster wrote:
Because you're trying to compare apples to oranges again? What is a CPU without compiler worth?


A CPU without a compiler is worth about the same as a compiler without a CPU.

I think the main thing here is that compilers are good at doing the expensive static optimisations before run-time, while CPUs are good at doing the dynamic optimisations during run-time. These are mutually beneficial, not exclusive.


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


Top
 Profile  
 
 Post subject: Re: The Mill: a new low-power, high-performance CPU design
PostPosted: Thu Mar 06, 2014 5:48 am 
Offline
Member
Member
User avatar

Joined: Wed Aug 21, 2013 3:53 am
Posts: 449
Location: Asia, Singapore
Quote:
A CPU without a compiler is worth about the same as a compiler without a CPU.

I haven't read this thread carefully.......
What is meant by a compiler the above sentence?
A CPU with an assembler but no compiler has a greater worth than that of a compiler that targets a non-existing CPU (Is that what you mean by a compiler with no CPU?).
I might have misunderstood your point.
A valid point against this could be that an assembler is theoretically a compiler that translates assembly language into machine code.

_________________
"In a time of universal deceit - telling the truth is a revolutionary act." -- George Orwell
(R3X Runtime VM)(CHIP8 Interpreter OS)


Top
 Profile  
 
 Post subject: Re: The Mill: a new low-power, high-performance CPU design
PostPosted: Thu Mar 06, 2014 10:14 am 
Offline
Member
Member
User avatar

Joined: Fri Jun 13, 2008 3:21 pm
Posts: 1700
Location: Cambridge, United Kingdom
Combuster wrote:
Owen wrote:
BETTER. Trivial loop exit conditions are uninteresting and, well, trivial. Intel have been correctly predicting them for 15 years.

In fact your code is worse, because it wastes two registers and two instructions per cycle.
And you think it's fair to simply strip silicon somewhere and not giving it back at an equal rate in another part of the machine?

A loop predictor does not comprise anywhere near the amount of silicon to add an additional execution unit, or an additional retirement unit.

Combuster wrote:
Quote:
Now that we've got enough background, let's take a look at Intel's own description of the loop detector:

The Loop Detector analyzes branches to see if they have loop behavior. Loop behavior is defined as moving in one direction (taken or not-taken) a fixed number of times interspersed with a single movement in the opposite direction. When such a branch is detected, a set of counters are allocated in the predictor such that the behavior of the program can be predicted completely accurately for larger iteration counts than typically captured by global or local history predictors.
Which means that it only works if the loop always has the same number of iterations, and always fails the first time :wink:


"Because the Pentium M did it that way, all Intel processors do it that way"?

Nehalem and above implement Macro-op fusion, in which "dec reg, jnz backwards", "cmp reg, const, jnz backwards" pairs can be fused into one micro-op. Especially in the former case, tell me why it would be difficult for the processor to correctly predict that every time? It would surprise me entirely if Intel weren't predicting that correctly.

Now: Please tell me how you would beat branch-history predictors while using equal silicon area. Remember, if you use register or memory bandwidth to do so, the predictor wins, because those are things which are hard to expand (i.e. every increase in them consumes large amounts of silicon area) compared to the small dedicated resources predictors take.


Top
 Profile  
 
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 144 posts ]  Go to page Previous  1, 2, 3, 4, 5, 6, 7 ... 10  Next

All times are UTC - 6 hours


Who is online

Users browsing this forum: No registered users and 38 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
Powered by phpBB © 2000, 2002, 2005, 2007 phpBB Group