Brendan wrote:
I'm not saying AVX is identical to an Nvidia GPU. I'm only saying that Nvidia's "SIMT" is functionally identical to SIMD. Yes, there are very different trade-offs (different goals, different memory hierarchies, different complexity, different instruction sets, different numbers of cores/registers/execution units, different hardware threading, different register widths, etc); but none of that changes the fact that Nvidia's "SIMT" is functionally identical to SIMD.
SIMD and SIMT both work by a single instruction fetch/decode sending its results to multiple execution units, but the similarity ends there:
- SIMD uses a (usually small, but that's not important) architecturally-fixed number of vector elements that a controlling loop has to cycle through, handling boundary conditions. SIMT, on the other hand, has a separate register set per "core" (or if you prefer, a separate SIMD register set per SM, although these individual "cores" are more powerful than plain SIMD lanes, as I'll explain in the next bullet point), so adding more cores or SIMD lanes doesn't change the program-visible architecture. You specify things in terms of single vector elements rather than baking the specifics of a particular SIMD implementation into your program (see the first sketch after this list).
- SIMD does sometimes have scatter/gather for parallel memory access, but SIMT handles the resulting memory latencies much better. Instead of superscalar/OoOE, which works well for low-latency, low-thread-count workloads (with SMT occasionally thrown in afterward to "fill in the gaps"), SIMT has enough threads (and separate register sets/"cores"/glorified SIMD lanes) that it simply switches threads when a single thread stalls. Note that this works at the level of individual SIMD vector elements, not just the larger SMs. This technique makes individual cores much cheaper without making them "wimpy", and further allows things like vastly larger, higher-latency register files without hurting throughput.
- SIMD does have select instructions and sometimes predication, but SIMT supports control flow divergence much better. The biggest enabler here is the above-mentioned thread switching, so that threads not executing predicated instructions can hand off their execution units to threads that are. Of course this trick helps the most when your workload has lots of very similar threads, so it doesn't work for e.g. the actor model, but it'll destroy a bunch of SIMD cores for doing massive data transformations.
- SIMD synchronization is the same as scalar synchronization: you get locks, atomics, or abstractions on top like message passing. SIMT workloads instead tend to need a way to synchronize different stages of processing, doing something like "wait for all threads to reach this point." This is easy to implement in SIMT (NVIDIA's __syncthreads()), but very expensive with multiple SIMD CPU cores (see the second sketch after this list).
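To make the first point concrete, here's a rough sketch (entirely hypothetical; the function names, the scale-and-clamp operation, and the sizes are made up for illustration). The SIMD version bakes the 8-wide AVX register into the controlling loop and has to mop up the leftover elements, while the CUDA version is written in terms of one element, and the launch configuration rather than the code decides how wide the hardware runs it. The divergent branch in the kernel is also the kind of control flow the third point is about.

Code:
// Hypothetical example: scale every value by k, clamping negatives to zero.

// SIMD (AVX) version: the 8-wide vector is baked into the loop,
// and the remainder has to be handled separately.
#include <immintrin.h>

void scale_clamp_avx(float *data, int n, float k)
{
    __m256 vk   = _mm256_set1_ps(k);
    __m256 zero = _mm256_setzero_ps();
    int i = 0;
    for (; i + 8 <= n; i += 8) {                    // 8 floats per iteration
        __m256 v = _mm256_mul_ps(_mm256_loadu_ps(&data[i]), vk);
        _mm256_storeu_ps(&data[i], _mm256_max_ps(v, zero));  // branchless clamp
    }
    for (; i < n; i++) {                            // boundary condition
        float v = data[i] * k;
        data[i] = v < 0.0f ? 0.0f : v;
    }
}

// SIMT (CUDA) version: written in terms of a single element; the launch
// configuration, not the code, decides how many lanes/warps run it.
__global__ void scale_clamp_kernel(float *data, int n, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                                    // divergence handled by hardware
        float v = data[i] * k;
        data[i] = v < 0.0f ? 0.0f : v;
    }
}

// Launch (assuming dev_data is already on the GPU):
// scale_clamp_kernel<<<(n + 255) / 256, 256>>>(dev_data, n, k);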
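And for the last point, a minimal sketch of the "wait for all threads to reach this point" pattern: two stages communicating through shared memory with a __syncthreads() barrier between them. Everything here (the kernel name, the tile size, the doubling in stage 1) is made up for illustration; the equivalent with several SIMD CPU cores would need a software barrier across cores, which is far more expensive than a hardware barrier inside one SM.

Code:
#define TILE 256   // assumed to match the threads-per-block at launch

// Hypothetical two-stage kernel: stage 1 writes a tile into shared memory,
// __syncthreads() waits for every thread in the block, then stage 2 reads
// elements produced by other threads.
__global__ void two_stage(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Stage 1: each thread writes its own element into shared memory.
    tile[threadIdx.x] = (i < n) ? in[i] * 2.0f : 0.0f;

    __syncthreads();   // every thread in the block has finished stage 1

    // Stage 2: each thread reads its neighbour's result, which is only
    // safe because of the barrier above.
    if (i < n) {
        int next = (threadIdx.x + 1) % TILE;
        out[i] = tile[threadIdx.x] + tile[next];
    }
}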
Overall, the flexibility of being able to switch your CPU cores between "graphics"-like workloads and "actor"-like workloads on demand just costs you performance; heterogeneous computing is more efficient. The problems you mentioned earlier, like the GPU waiting on the CPU and vice versa, are better solved by fixing your program architecture and drivers. Your complaint about proprietary GPU architectures is better solved by making open GPU architectures, as Intel is moving toward with its integrated graphics.
Brendan wrote:
Rusky wrote:
Okay, I misspoke. I wasn't specific enough. For large actors (browsers, servers) in large quantities (the internet) where you don't have a lot of overlap in what they're working on (entirely different websites), the actor model scales well. For your "thread-sized" actors working on the same problem (trying to get an N-times speedup for N actors, for example), you cannot scale nearly as high.
Still wrong. For example, for something like SETI@home, where everything is working on exactly the same problem, it still scales incredibly well. In general, all "embarrassingly parallel" problems (which includes graphics) are embarrassingly easy regardless of whether you use locks or messages or whatever (until you realise that some approaches fail to scale beyond the limits of shared memory).
SETI@home isn't using the actor model for its actual processing so much as it's using it to orchestrate the computers' separate workloads over the network. Just like other map/reduce frameworks, it's using the same basic model as a GPU for the actual processing, although its distributed nature makes for more complicated coordination, which is something the actor model excels at.
For example, at a smaller scale, a typical way to use the PS3 is to split up a problem like SETI@home (which does run on the PS3) into chunks that run in isolation, and then hand them out to the SPUs, each of which has its own local memory. Nobody calls that the actor model, though it is a perfect example of what I said the actor model is good at (large actors without much overlap in what they're working on).
On the other hand, if you tried to do something like SETI@home by having several clients work on the same chunk of data and send messages back and forth to coordinate, the actor model can't help you at all. They did it the right way for that scale by relegating message passing to coordination between actors working on separate chunks.
Brendan wrote:
Rusky wrote:
On the other hand, when you can hand out work to GPU or SPU-style processors, you can actually get an N-times speedup until you run out of data to split up.
GPUs struggle just to handle "CPU+GPU" working on the same problem in the same computer. In contrast, it's relatively trivial to use the actor model to scale graphics processing across a pool of many computers, both with and without shared memory, and even with heterogeneous processors (got a mixture of 80x86, ARM and Sparc CPUs; and a mixture of Intel, AMD and Nvidia GPUs; all on the same LAN? No problem - let's use all of them at once!).
You might be making the mistake of assuming that the actor model has higher communication costs than shared memory...
As you can see, the actor model and data-oriented design are pretty much orthogonal. The actor model works well to enforce some structure on the infrequent communication a data-oriented approach needs, especially if you're doing things over a network or something - and the data-oriented approach excels there because it makes it easier to split up both the data and the transforms. But actors can't help you scale past a few threads if you don't think in terms of data. Trying to do too much communication between actors kills it: not because the actor model is any less efficient at communicating (I don't think it is, per se), but because the communication itself is what keeps you from scaling.
Brendan wrote:
Rusky wrote:
And the premise of data-oriented design is that you can do that for a lot more problems when you formulate the problem as data transforms instead of thinking about threads/actors/objects.
If you formulate the problem as data/messages being processed by data transforms/actors, instead of thinking about threads/functions/objects; then...
Surely you can see that data-oriented design has some significant similarities to the actor model.
No, they're not that similar. The actor model doesn't help you at all in terms of how to split up the data and operations; it just gives you the tools to organize things once you've done so. I'm all for the actor model as a way to organize communications once you know you need them, but as a way to organize your program it's just as bad as OOP: thinking in terms of agents doing things (i.e. threads doing work, actors sending and receiving messages, or objects implementing features) makes it too easy to ignore the actual workload, the actual data, and the actual transforms.
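As a concrete (and entirely made-up) illustration of that difference: the "agents doing things" formulation puts the update behind each object, while the data-oriented formulation is just a transform over arrays, which is exactly the shape that maps onto SIMD lanes, a CUDA kernel, or a pool of machines. The Particle/Particles names and the integration step are hypothetical.

Code:
// "Objects doing things": each object owns its own update.
struct Particle {
    float x, y, vx, vy;
    void update(float dt) { x += vx * dt; y += vy * dt; }
};
// for (Particle &p : particles) p.update(dt);

// Data-oriented: the same work expressed as a transform over parallel arrays
// (struct-of-arrays), which splits trivially across lanes, cores or machines.
struct Particles {
    float *x, *y, *vx, *vy;
    int count;
};

void integrate(Particles &p, float dt)
{
    for (int i = 0; i < p.count; i++) {   // a compiler can vectorise this...
        p.x[i] += p.vx[i] * dt;
        p.y[i] += p.vy[i] * dt;
    }
}

// ...or it becomes a one-line-per-element CUDA kernel:
__global__ void integrate_kernel(float *x, float *y,
                                 const float *vx, const float *vy,
                                 int count, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) { x[i] += vx[i] * dt; y[i] += vy[i] * dt; }
}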