Discussions on more advanced topics such as monolithic vs micro-kernels, transactional memory models, and paging vs segmentation should go here. Use this forum to expand and improve the wiki!
Brendan wrote:Your example is so trivial that it's entirely idiotic and pointless
The point was that there are many algorithms of relatively low complexity (1+1 being an extreme case), and such complexity has nothing to do with the limits of computability. More complex algorithms can come closer to those limits, but their results can still be computed. There is no point in citing a theoretical conclusion about the fully general case, because the cases we are discussing are not fully general.
#include <stdio.h>
#include <stdlib.h>

void foo(void) {
    int c;
    int x = 0;
    while ((c = getchar()) != EOF) {
        switch (c) {
        case 'a':
            x /= 3;
            break;
        case '4':
            x += 4;
            break;
        case 'q':
            exit(0);
        default:
            x++;
            break;
        }
        if (x % 7 == 0) {
            printf("Bork!\n");
        }
    }
}
Questions:
How often will the "if(x%7 == 0)" branch be predicted correctly by a traditional "ahead of time" compiler?
How often will the "if(x%7 == 0)" branch be predicted correctly by a modern CPU's branch predictors?
How often will the "if(x%7 == 0)" branch be predicted correctly by a modern "just in time" compiler?
For a "just in time" compiler, would the overhead of attempting to correctly predict the "if(x%7 == 0)" branch cost far more than a misprediction would?
What about the "switch/case" branches?
Even though this case suits the processor's simple branch prediction algorithm, the compiler can cache the code of all three branches and then order the processor to execute all three simultaneously. This lets us forget about prediction altogether. Meanwhile the processor will count branch usage, and after a small number of loop iterations the 'a', '4' and 'q' branches will never be predicted correctly (assuming a flat distribution of input probabilities). Then we can see that the compiler wins again.
Rusky wrote:Only the CPU knows exactly which addresses are being loaded at a given moment.
The CPU is provided with one hardwired algorithm - just one simple algorithm. The compiler can provide any number of algorithms, including the simplest. Which is the more flexible solution?
Rusky wrote:There's absolutely no reason to make this an all-or-nothing contest- the best outcome is often when the CPU and compiler cooperate
It is simpler to think about one managing entity (the compiler) instead of two (compiler plus processor). Both entities can contain some algorithms, but only one of them can contain many algorithms, including complex ones. Why should we be bothered with another entity? What is this extra complexity for?
What other algorithm are you possibly going to use for detecting aliasing stores? And for that matter, how would you express the one I described in the compiler? It's a hardware thing.
You know what? Show us the code a compiler would output if it did branch prediction itself. Give us an example that's better than the CPU doing it.
(...)
    CLR r1
.loop:
    CMP r1, 50-1
    MOV r2, .endloop
    (add n-2 parallel ops here)
    MOVNBE r2, .loop
    (add n-1 parallel ops here)
    (do parts of stuff)
    DJMP r2             ; fetching starts now
    (do last few instructions of stuff to fill delay slot)
    INC r1
.endloop:
    ; jump will have taken effect here.
(...)
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
You also have the problem of telling the CPU where and how much to prefetch. You could add a prefetch instruction so the CPU doesn't have to worry about whether you change r2 before the jump, which would even let you prefetch basic blocks rather than just guessing the size, but...
The Mill does that in hardware, running ahead through the chain of basic blocks in parallel with execution, to prefetch them all (the way it does this is pretty cool, you should check out the talk). It predicts which branch will be taken from each BB, and caches that, as well as storing the cache back with the executable so it can be used later. Do that from the compiler alone!
Rusky wrote:What other algorithm are you possibly going to use for detecting aliasing stores?
Stores in general? It is better to analyse the high-level code and find the aliasing assignments there. If that is impossible, then additional information about the situation is required: how far apart are the potential aliases? What instructions lie between them?
There are two very large issues with the compiler doing all the work.
1. The compiler has an inherently limited view of the system state. Unless you feed in the entire system (or at least a large chunk of it), there will be unpredictable data sources. The CPU on the other hand knows what these data sources are as soon as they enter cache (which, due to prefetch, may be many cycles before they're used).
2. How would the compiler encode the variable likelihood of a branch being taken? It would have to emit a rather complex set of rules for the prefetch code... almost as if it were re-encoding the instruction.
(...)
    CLR r1
.loop:
    CMP r1, 50-1
    MOV r2, .endloop
    (add n-2 parallel ops here)
    MOVNBE r2, .loop
    (add n-1 parallel ops here)
    (do parts of stuff)
    DJMP r2             ; fetching starts now
    (do last few instructions of stuff to fill delay slot)
    INC r1
.endloop:
    ; jump will have taken effect here.
(...)
BETTER. Trivial loop exit conditions are uninteresting and, well, trivial. Intel have been correctly predicting them for 15 years.
In fact your code is worse, because it wastes two registers and two instructions per cycle.
Owen wrote:BETTER. Trivial loop exit conditions are uninteresting and, well, trivial. Intel have been correctly predicting them for 15 years.
In fact your code is worse, because it wastes two registers and two instructions per cycle.
And you think it's fair to simply strip silicon somewhere and not give it back at an equal rate in another part of the machine?
Now that we've got enough background, let's take a look at Intel's own description of the loop detector:
The Loop Detector analyzes branches to see if they have loop behavior. Loop behavior is defined as moving in one direction (taken or not-taken) a fixed number of times interspersed with a single movement in the opposite direction. When such a branch is detected, a set of counters are allocated in the predictor such that the behavior of the program can be predicted completely accurately for larger iteration counts than typically captured by global or local history predictors.
Which means it only works if the loop always has the same number of iterations, and it always fails the first time through.
Not to mention the Mill prefetches entire functions at a time, even from cold code, so if it did mispredict loop exit (they don't specify a predictor algorithm because different models will probably use different algos) the penalty would be only 5 cycles (which can be helped with speculative execution).
Because zero mispredictions are better than occasional mispredictions?
(Also, you just used a "no it isn't" as an argument from authority)
So at least you agree that we're all comparing apples to oranges here with this sort of stuff.
Because you're trying to compare apples to oranges again? What is a CPU without a compiler worth?