Brendan wrote:
For example, rather than having an 8-core CPU at 3 GHz consuming 85 watts you could have a 32-core CPU running at 1.5 GHz consuming the same 85 watts and be able to do up to twice as much processing.
Now, some people take this to extremes. They suggest cramming a massive number of very slow wimpy cores (e.g. no out-of-order execution or branch prediction, tiny caches, etc.) onto a chip. For example, maybe 200 cores running at 500 MHz. This is part of what Linus is saying is silly, and for this part I think Linus is right. Things like out-of-order execution help, even for "many core", and those wimpy cores would be stupid.
A "wimpy core" is just as stupid as its user (a programmer). And an "out-of-order core" is no smarter than its programmer. If a programmer can wrap their head around an algorithm, then there can be some performance. But if the algorithm's implementation details are hidden under a great deal of caches, out-of-order machinery and other intervening stuff, then the performance will be just an accident.
So, if we have a simple system of wimpy cores instead of complex hardware, the chances are that we can get better performance.
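As an aside, Brendan's arithmetic above can be sanity-checked with a toy model. One set of assumptions that makes his numbers work out is power per core scaling roughly with frequency squared (dynamic power P ∝ C·V²·f with the supply voltage held fixed) — this is my own illustrative sketch, not anything claimed in the thread, and the constant is chosen to match his 8-core/85 W example:

```python
# Toy model: assume dynamic power per core scales roughly as f^2
# (P ~ C * V^2 * f with voltage held fixed).
# The constant k is an assumption, calibrated so that
# 8 cores at 3 GHz draw 85 W, matching Brendan's example.

K = 85 / (8 * 3.0**2)  # watts per (core * GHz^2)

def chip(cores, freq_ghz):
    """Return (total watts, naive throughput in core-GHz)."""
    watts = cores * K * freq_ghz**2
    throughput = cores * freq_ghz  # crude "work per second" proxy
    return watts, throughput

for cores, freq in [(8, 3.0), (32, 1.5)]:
    watts, thr = chip(cores, freq)
    print(f"{cores:2d} cores @ {freq} GHz: {watts:5.1f} W, {thr:4.1f} core-GHz")
```

Under these made-up constants the 32-core chip draws the same 85 W and delivers twice the core-GHz, exactly as quoted; with a steeper power law (e.g. f³, where voltage also scales with frequency) the wimpy-core side looks even better on paper, which is why the idea keeps coming back.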
Brendan wrote:
Between the CPU and things like RAM and IO devices there's a whole pile of stuff (caches, memory controllers, etc). As the number of cores increases all of that stuff becomes a problem - twice as many CPUs fighting for the same interconnect bandwidth do not go twice as fast.
Then we should think about a way to feed our cores efficiently. If the hardware supports us in this quest (with caches and simple, predictable operations, for example), then our cores will never stall. But if we have some murky hardware that lives its own life, then of course we are simply unable to supply it with data and commands, because we don't know when (and often how) the data or commands should be delivered.
Brendan wrote:
The other (much larger?) problem is software scalability.
It's our way of thinking that scales badly. We are used to interacting with a computer in a perverted way, e.g. typing letters instead of talking or even just thinking. And to get rid of that perversion we need much more processing power. But Linus tells us that he needs no more processing power. So we keep typing and "mousing" around instead of doing very simple things.
Brendan wrote:
The "very fine grained actor model" is too fine grained - the overhead of communication between actors is not free and becomes a significant problem. The ideal granularity is something between those extremes.
Basically; I think the ideal situation is where "actors" are about the size of threads.
But the next question is: how big should a thread be? And the answer is that there should be an optimum of the processing-to-communication ratio. It's a very old rule, in fact: processing is limited by communication, and communication is limited by processing. These mutually interdependent parts of the same problem were identified very long ago. The solution is an optimum, but the optimum is very hardware-dependent. If we have only hints about the communication bottlenecks between a processor and its memory, then how can we find an optimum? And even the processor itself today has a lot of hidden mechanics that prevent us from proving that some solution is really optimal. We need understandable hardware. And we need a way to manage the hardware at the lowest possible level, instead of hoping for some "good" out-of-order behavior. If hardware is advertised as ready to "do its best" for us, then we have no way to reach an optimum: we get only "its best", with no clear understanding of whether that really is the best or is much worse than expected.
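This processing/communication optimum can be made concrete with a toy model (my own illustration; every constant is an arbitrary assumption). Suppose N work items are split into chunks of g items spread over P workers; each chunk pays a fixed communication cost, and the last straggling chunk adds roughly g items' worth of load imbalance. Sweeping g finds the sweet spot — and the sweet spot moves the moment any hardware constant changes, which is the hardware-dependence argument above:

```python
# Toy model of task granularity: chunk size g trades per-chunk
# communication overhead against load imbalance at the end.
# All constants are arbitrary illustrative assumptions.

N = 100_000        # items of work
P = 16             # workers
T_COMPUTE = 1e-6   # seconds to process one item
T_COMM = 1e-4      # fixed communication cost per chunk (seconds)

def runtime(g):
    """Estimated wall-clock time for chunk size g."""
    chunks = N / g
    overhead = chunks * T_COMM / P   # communication, spread over workers
    compute = N * T_COMPUTE / P      # perfectly parallel compute
    straggler = g * T_COMPUTE        # imbalance from the last chunk
    return overhead + compute + straggler

best = min(range(1, N + 1), key=runtime)
print(f"optimal chunk size ~ {best} items")
```

Tiny chunks drown in communication overhead, huge chunks waste time on the straggler; the optimum sits in between. Double T_COMM or P and it shifts — and if the hardware hides those constants behind unpredictable machinery, the optimum can only be guessed at, never proven.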
Brendan wrote:
He's saying that for most normal desktop software more cores won't help much, which is a little like saying "excluding all the cases where parallelism is beneficial, parallelism isn't beneficial"
Yes, that really is the only point of the message.
Brendan wrote:
Also note that there are people (including me) that want CPUs (with wide SIMD - e.g. AVX) to replace GPUs. For example, rather than having a modern "core i7" chip containing 8 CPU cores and 40 "GPU execution units" I'd rather have a chip containing 48 CPU cores (with no GPU at all) and use those CPU cores for both graphics and normal processing - partly because it'd be more able to adapt to the current load (e.g. rather than having GPU doing nothing while CPUs are struggling with load and/or CPUs doing nothing while GPU is struggling with load); but also partly because supporting GPUs is a massive pain in the neck due to lack of common standards between Nvidia, ATI/AMD and Intel and poor and/or non-existent documentation; and also partly because I think there are alternative graphics pipelines that need to be explored but don't suit the "textured polygon pipeline" model GPUs force upon us.
I hope you agree that this is just yet another call for hardware simplicity and manageability. Otherwise the optimum will always slip away from us.