Do you agree with linus regarding parallel computing?

dansmahajan · **Joined:** Mon Jan 07, 2013 10:38 am **Posts:** 62

http://highscalability.com/blog/2014/12 ... bunch.html

So what do you think ?

Combuster · **Posted:** Mon Jan 19, 2015 12:20 pm

"4 cores is enough" sounds just as future-proof as "640KB is enough". #-o

Actually we had 4 cores in household equipment for two decades now. Did that prove to be sufficient?

Brendan · **Posted:** Mon Jan 19, 2015 1:28 pm

Hi,

Combuster wrote:

"4 cores is enough" sounds just as future-proof as "640KB is enough". #-o

Actually we had 4 cores in household equipment for two decades now. Did that prove to be sufficient?

Yes and no.

For about 75% of desktop software, an ancient 166 Mhz Pentium is all you really need and 4-core is complete overkill.

For about 10% of desktop software, 4-core is adequate.

For about 15% of desktop software, having more cores would've given noticeable improvements.

An additional 30% of desktop software (yes, 30% of a theoretical 130%) simply doesn't exist because existing machines didn't have enough performance to handle it and we'll never really know how much stuff we never saw.

However; that initial 75% of desktop software (where you only really need an ancient 166 MHz Pentium) was a lie; because for half of it 4-cores actually wasn't enough so it got shifted to the GPU (and the ancient 166 Mhz Pentum is only good enough for the tiny little bit that wasn't shifted to the GPU).

Basically, 4-core CPUs were sufficient for about 47.5% out of 130% of desktop software (or in other words, about 36.5% of 100% of desktop software).

Cheers,

Brendan

SpyderTL · **Joined:** Sun Sep 19, 2010 10:05 pm **Posts:** 1074

Combuster wrote:

"4 cores is enough" sounds just as future-proof as "640KB is enough". #-o

I had this exact same thought.

You could give Microsoft, Apple, or the Linux developers a 128-core system with 1 TB of RAM, and within a year, they would find a way to make it take 50 seconds to boot to the login screen.

More hardware just makes software developers lazier.

You do have to admit that a quad-core 4ghz machine is better than a 4.77 MHz machine from 1985. Is it 4 thousand times better? Not really. But it is 10 times better.

seuti · **Joined:** Tue Aug 19, 2014 1:20 pm **Posts:** 74

SpyderTL wrote:

Combuster wrote:

"4 cores is enough" sounds just as future-proof as "640KB is enough". #-o

I had this exact same thought.

You could give Microsoft, Apple, or the Linux developers a 128-core system with 1 TB of RAM, and within a year, they would find a way to make it take 50 seconds to boot to the login screen.

More hardware just makes software developers lazier.

You do have to admit that a quad-core 4ghz machine is better than a 4.77 MHz machine from 1985. Is it 4 thousand times better? Not really. But it is 10 times better.

Whatever performance gains hardware manufacturers make software developers take away.

iansjack · **Posted:** Mon Jan 19, 2015 3:57 pm

Quote:

More hardware just makes software developers lazier.

Perhaps that is the point that Linus is making?

Rusky · **Joined:** Wed Jan 06, 2010 7:07 pm **Posts:** 792

I agree with Linus in the sense that once you scale beyond 4 (strawman for maybe 2-16?) cores it's often the sort of problem you can offload to the GPU. The number of types of workloads people have figured out how to do that with is increasing, and the types of cores you get in CPUs don't make sense in that quantity.

For example, game engines can gain a lot of performance using things like Entity component systems that think about things in terms of large data transforms rather than collections of interacting entities. Trying to parallelize the interacting-entities version leads to a lot of wasteful locking and inter-thread communication that is rarely worth it. On the other hand, parallelizing data transforms scales way beyond the number of cores in a CPU and doesn't need that kind of OoOE/superscalar power in any individual core anyway.

There are and will probably continue to be problems that can take advantage of large numbers of powerful cores, but they're pretty specialized. That might change, but there's not a lot of sign of that in the near future.

Brendan · **Posted:** Tue Jan 20, 2015 12:58 am

Hi,

dansmahajan wrote:

http://highscalability.com/blog/2014/12/31/linus-the-whole-parallel-computing-is-the-future-is-a-bunch.html

So what do you think ?

First, let's put this into context.

For single-thread performance, a lot of people (including me) think Intel can't make cores go much faster. In the last 10 years clock speeds have gone from ~3.5 GHz Pentium 4 all the way up to ~3.5 Ghz Haswell; and while Intel have found ways of optimising the CPU's "instructions per cycle" they've been getting diminishing returns for a long while now - Haswell is only about 3% faster than Sandy Bridge, which was only about 7% faster than Nehalem, which was about 10% faster than Core. Their next micro-arch (Broadwell) might only be 1.5% faster than Haswell.

For modern CPUs the main limiting factor is the thermal envelope; and the relationship between clock frequency and power consumption/heat is not linear. For example, rather than having an 8-core CPU at 3 GHz consuming 85 watts you could have a 32-core CPU running at 1.5 Ghz consuming the same 85 watts and be able to do up to twice as much processing.

Now, some people take this to extremes. They suggest cramming a massive number of very slow wimpy cores (e.g. without things like out-of-order execution, branch prediction, tiny caches, etc) onto a chip. For example, maybe 200 cores running at 500 Mhz. This is part of what Linus is saying is silly, and for this part I think Linus is right. Things like out-of-order execution help, even for "many core", and those wimpy cores would be stupid.

There are 2 other things that come into this: hardware scalability and software scalability. Between the CPU and things like RAM and IO devices there's a whole pile of stuff (caches, memory controllers, etc). As the number of cores increases all of that stuff becomes a problem - twice as many CPUs fighting for the same interconnect bandwidth do not go twice as fast. There are ways to mitigate this problem (e.g. NUMA); but the problem doesn't go away entirely, and a complex and fast "uncore" also consumes power (and contributes to the thermal envelope). That's part of why those wimpy cores aren't such a good idea (and it's better to have fewer complex cores and less hardware scalability problems in the "uncore").

The other (much larger?) problem is software scalability. For most programming the languages/environments were mostly designed for sequential/procedural code; and to make it work for multi-threading you have to use things like locks/mutexes, critical sections, etc. History has shown that this is excessively difficult for programmers (and that work-arounds, like transactional memory, don't help).

The funny part is that this is a problem mostly caused by the languages/environments. If we switch to a different programming model (e.g. the actor model) most of the problems disappear. Basically, the "software scalability problem" mostly represents the difficulty of getting C/C++ programmers to shift to something that isn't crap.

However; while there are programming languages designed around the actor model that do give extremely good scalability and fault tolerance and have done so for 40+ years (e.g. Erlang); these languages are equally bad. There really isn't a good alternative for people to switch to (yet).

The main problem with these languages is what I'd call "granularity". For the actor model, you have "actors" communicating via. asynchronous messaging; but how big is an "actor"?

You could think of the Internet as a massive system that consists of millions of computers communicating with asynchronous messages (e.g. UDP and TCP packets) - it's relatively obvious that scalability and fault tolerance have been achieved. In this case, you could think of things like web browsers, FTP servers, etc as "huge actors". Basically, it's sort of like "very coarse grained actor model".

For languages like Erlang it's the opposite - an actor is a very small thing (e.g. an actor may be similar in size to an object in OOP). Let's call this "very fine grained actor model".

Now let's talk about "ideal granularity". The "very coarse grained actor model" is not fine-grained enough to take advantage of multiple cores (e.g. imagine a single-threaded HTTP server talking to a single-threaded web browser). The "very fine grained actor model" is too fine grained - the overhead of communication between actors is not free and becomes a significant problem. The ideal granularity is something between those extremes.

Basically; I think the ideal situation is where "actors" are about the size of threads. Essentially; I think the solution is threads communicating via. asynchronous messaging (and not the actor model as it's typically described, and not languages like Erlang - in fact, a simple language sort of like C would be almost ideal if asynchronous messaging replaced retarded nonsense like linking).

Also note that Linus is partially "begging the question". He's saying that for most normal desktop software more cores won't help much, which is a little like saying "excluding all the cases where parallelism is beneficial, parallelism isn't beneficial". There is a lot of things desktop software can do and does do that benefits from more cores (especially if we fix the "software scalability/software developer problem"); and it would be very foolish to look at software designed for 4-core and assume it would be designed the same if hardware was 100-core.

Also note that there are people (including me) that want CPUs (with wide SIMD - e.g. AVX) to replace GPUs. For example, rather than having a modern "core i7" chip containing 8 CPU cores and 40 "GPU execution units" I'd rather have a chip containing 48 CPU cores (with no GPU at all) and use those CPU cores for both graphics and normal processing - partly because it'd be more able to adapt to the current load (e.g. rather than having GPU doing nothing while CPUs are struggling with load and/or CPUs doing nothing while GPU is struggling with load); but also partly because supporting GPUs is a massive pain in the neck due to lack of common standards between Nvidia, ATI/AMD and Intel and poor and/or non-existent documentation; and also partly because I think there are alternative graphics pipelines that need to be explored but don't suit the "textured polygon pipeline" model GPUs force upon us.

Cheers,

Brendan

embryo · **Posted:** Tue Jan 20, 2015 3:41 am

dansmahajan wrote:

http://highscalability.com/blog/2014/12/31/linus-the-whole-parallel-computing-is-the-future-is-a-bunch.html

So what do you think ?

I think that Linus has expressed his quick thoughts without any serious analysis. A spontaneous expression of some feelings from past experience mostly works as an advertising hype or as a trigger of a massive flooding process. And here we have an example of flooding.

But of course, if a controversial question is thrown out too often (or "too wide" as in this case), then it can produce some positive outcomes in a form of some common understanding of a problem, and even may be some usable solutions.

embryo · **Posted:** Tue Jan 20, 2015 4:30 am

Brendan wrote:

For example, rather than having an 8-core CPU at 3 GHz consuming 85 watts you could have a 32-core CPU running at 1.5 Ghz consuming the same 85 watts and be able to do up to twice as much processing.

Now, some people take this to extremes. They suggest cramming a massive number of very slow wimpy cores (e.g. without things like out-of-order execution, branch prediction, tiny caches, etc) onto a chip. For example, maybe 200 cores running at 500 Mhz. This is part of what Linus is saying is silly, and for this part I think Linus is right. Things like out-of-order execution help, even for "many core", and those wimpy cores would be stupid.

A "wimpy core" is just as stupid as it's user (a programmer). And "out-of-order core" is no more smart than it's programmer. If a programmer can grasp around an algorithm, then there can be some performance. But if algorithm implementation details are hidden under a great deal of caches, out-of-order things and other intervening stuff, then the performance will be just an accident.

So, if we have a simple system of wimpy cores instead of a complex hardware, the chances are that we can get better performance.

Brendan wrote:

Between the CPU and things like RAM and IO devices there's a whole pile of stuff (caches, memory controllers, etc). As the number of cores increases all of that stuff becomes a problem - twice as many CPUs fighting for the same interconnect bandwidth do not go twice as fast.

Then we should think about a way of efficient feeding of our cores. If hardware supports us in this quest (with caches and simple and predictable operations, for example), then our cores will never get stalled. But if we have some murky hardware, that lives it's own life, then of course, we just unable to provide it with data and commands just because we don't know when (and often - how) the data or commands should be delivered.

Brendan wrote:

The other (much larger?) problem is software scalability.

It's our way of thinking that scales badly. We are used to interact with a computer in a perverted way, e.g. typing letters instead of talking or even just thinking. And to get rid of the perversion we need much more processing power. But Linus tells us that he needs no more processing power. Then we still are typing and "mousing" around instead of doing very simple things.

Brendan wrote:

The "very fine grained actor model" is too fine grained - the overhead of communication between actors is not free and becomes a significant problem. The ideal granularity is something between those extremes.

Basically; I think the ideal situation is where "actors" are about the size of threads.

But next question is - how big a thread should be? And the answer is that there just should be an optimum of a processing/communication ratio. It's very old rule, in fact. Processing has limits defined by the communication and communication has limits defined by the processing. This mutually interdependent parts of a same problem were identified very long ago. And a solution is an optimum, but the optimum is very hardware dependent. If we have just some hints about communication bottlenecks between a processor and a memory, then how can we find an optimum? And even the processor itself today has a lot of hidden mechanics to prevent us from proving that some solution is really optimal. We need some understandable hardware. And we need a way to manage the hardware at the lowest possible level instead of trying to hope for some "good" out-of-order behavior. If hardware is advertised as a ready "to make it's best" for us then we just have no way to get an optimum, but only "it's best" instead of our clear understanding that it's really best or it's really much worse than expected.

Brendan wrote:

He's saying that for most normal desktop software more cores won't help much, which is a little like saying "excluding all the cases where parallelism is beneficial, parallelism isn't beneficial"

Yes, it's really the only point of the message.

Brendan wrote:

Also note that there are people (including me) that want CPUs (with wide SIMD - e.g. AVX) to replace GPUs. For example, rather than having a modern "core i7" chip containing 8 CPU cores and 40 "GPU execution units" I'd rather have a chip containing 48 CPU cores (with no GPU at all) and use those CPU cores for both graphics and normal processing - partly because it'd be more able to adapt to the current load (e.g. rather than having GPU doing nothing while CPUs are struggling with load and/or CPUs doing nothing while GPU is struggling with load); but also partly because supporting GPUs is a massive pain in the neck due to lack of common standards between Nvidia, ATI/AMD and Intel and poor and/or non-existent documentation; and also partly because I think there are alternative graphics pipelines that need to be explored but don't suit the "textured polygon pipeline" model GPUs force upon us.

I hope you agree that this is just a yet another call for the hardware simplicity and manageability. Else the optimum will always slip away from us.

SpyderTL · **Joined:** Sun Sep 19, 2010 10:05 pm **Posts:** 1074

I would say that the only reason we really need 4-cores is to "separate" Windows (or "the OS") from our applications.

Most applications run just fine on a single CPU. However, when the OS decides that it needs to do some "house work", like indexing some files, or refreshing some cached network addresses, it will "steal" resources from your application to do its work.

If it weren't for this "behavior", we probably wouldn't even have multi-core CPUs...

OSwhatever · **Joined:** Mon Jul 05, 2010 4:15 pm **Posts:** 595

No, I don't agree with Linus. I think there is still a lot of work to be done in order to crank out more performance and especially "parallel" performance. Linus is stuck in his C-land of thinking. What we need are new tools that can describe a parallel task.

Take this:

int vector_with_numbers[lotsof];
int sum = 0;

for(int i = 0; i < lotsof; i++)
{
sum += vector_with_numbers[lotsof]
}

As you can see, this is something that is described as something that you should do in sequence. However, we don't need to sum this up in sequence and there are ways to do this in parallel. What we are missing is how to describe this task and how the compiler should recognize dependencies so that this can be done more efficiently.

Even if this example can't be parallelized as much as completely independent tasks like graphics, you can significantly reduce this but you also have to pay attention IO bottlenecks and this is how it becomes complicated.

I don't know any language that can do this (but there are so many that I have probably overlooked it).

AlexShpilkin · **Joined:** Thu Aug 29, 2013 4:10 pm **Posts:** 15

OSwhatever wrote:

I don't know any language that can do this (but there are so many that I have probably overlooked it).

You can have this, today, not from the Sufficiently Smart Compiler, but from smart programming and expressive libraries.[*]

What is needed here is to stop thinking with actions and start thinking with values, and acknowledge that concurrency (hard and ad hoc) ≠ parallelism (still hard, but not that ad hoc). The “loop” then becomes a list fold (aka reduce) using an associative operation, (+); it is the associativity of addition that yields the parallelism.

Two real-world implementations of that would be Google’s MapReduce and Haskell’s Control.Parallel.

[*] At least not until implementations of Agda, Idris et al. start exploiting strong normalization (irrelevance and guaranteed termination of any evaluation order) to choose evaluation strategies transparently, or (even better, and) allow the programmer to prove her low-level optimizations. Herds of unslain PhDs here, but not for me

AlexShpilkin · **Joined:** Thu Aug 29, 2013 4:10 pm **Posts:** 15

Brendan wrote:

Basically; I think the ideal situation is where "actors" are about the size of threads. Essentially; I think the solution is threads communicating via. asynchronous messaging (and not the actor model as it's typically described, and not language like Erlang - in fact, a simple language sort of like C would be almost ideal if asynchronous messaging replaced retarded nonsense like linking).

Looks like the all-“new” and fashionable Hoare’s CSP model to me. So Rust or Go; they also solve the problem of GCing shared mutable data by not sharing. Also, Haskell Chans have been there for years, but probably fail your simplicity test; there, no mutation. (Alef anyone?) To be more specific, is Plan 9’s Design of a Concurrent Window System “it”?

Brendan wrote:

For example, rather than having a modern "core i7" chip containing 8 CPU cores and 40 "GPU execution units" I'd rather have a chip containing 48 CPU cores (with no GPU at all) and use those CPU cores for both graphics and normal processing.

Did you mean: 512 cores? Seriously, 40 cores are a joke, even taking into account the weird SIMD/MIMD hybrid execution model.

The number of times “throughput” is mentioned during the Trip through the Graphics Pipeline 2011 and the absurd efficiency of DSPs in comparison with PC CPUs leads me to think that this is Not Going to Happen. I’d, however, love to be proven wrong by people experienced with, e.g., Cell on PS3 or Epiphany-16 on the Parallella.

In particular,

Brendan wrote:

the "textured polygon pipeline" model GPUs force upon us

is there for a reason: keeping pipes saturated (throughput!) is easier if you know exactly how much data and when will flow through them. Also, as you must already know, a texture sampler (or AES or SHA round, for that matter) in specialized silicon is much smaller than a GP processing unit that is equally fast at it. Sad; unsolved. Sorry RISC, we all loved you.

Edit: A previous version mistakenly referred to an earlier similarly-named paper A Concurrent Window System about a different window system in Newsqueak, by the same author.

AlexShpilkin · **Joined:** Thu Aug 29, 2013 4:10 pm **Posts:** 15

Brendan wrote:

[Some people] suggest cramming a massive number of very slow wimpy cores (e.g. without things like out-of-order execution, branch prediction, tiny caches, etc) onto a chip. For example, maybe 200 cores running at 500 Mhz.

They even make absurdly expensive evaluation boards: I’m surprised to see no mention of Charles Moore’s GA144. (If it doesn’t work, there should be some evidence for that, esotericity notwithstanding.)

They also have a point, there: throwing more silicon at a single core doesn’t scale (anymore), but throwing it at more cores does, sort of (better), if you connect them in a 2D network; three examples of differing complexity/size are already mentioned: Cell, Epiphany, GA144.

OSDev.org

Do you agree with linus regarding parallel computing?

Who is online