Brendan wrote:
I'm not saying AVX is identical to an Nvidia GPU. I'm only saying that Nvidia's "SIMT" is functionally identical to SIMD. Yes, there are very different trade-offs (different goals, different memory hierarchies, different complexity, different instruction sets, different numbers of cores/registers/execution units, different hardware threading, different register widths, etc); but none of that changes the fact that Nvidia's "SIMT" is functionally identical to SIMD.
SIMD and SIMT both work by a single instruction fetch/decode sending its results to multiple execution units, but the similarity ends there:
- SIMD uses a (usually small, but that's not important) architecturally-fixed number of vector elements that a controlling loop has to cycle through, handling boundary conditions. SIMT, on the other hand, has a separate register set per "core" (or if you prefer, a separate SIMD register set per SM, although these individual "cores" are more powerful than plain SIMD lanes, as I'll explain in the next bullet point), so adding more cores or SIMD lanes doesn't change the program-visible architecture. You specify things in terms of single vector elements rather than baking the specifics of a particular SIMD implementation into your program (see the first sketch after this list).
- SIMD does sometimes have scatter/gather for parallel memory access, but SIMT handles the resulting memory latencies much better. Instead of superscalar/OoOE, which works well for low-latency, low-thread-count workloads (with SMT occasionally thrown in afterward to "fill in the gaps"), SIMT has enough threads (and separate register sets/"cores"/glorified SIMD lanes) that it simply switches threads when a single thread stalls. Note that this works at the level of individual SIMD vector elements, not just the larger SMs. This technique makes individual cores much cheaper without making them "wimpy", and further allows things like vastly larger, higher-latency register files without hurting throughput.
- SIMD does have select instructions and sometimes predication, but SIMT supports control flow divergence much better. The biggest enabler here is the above-mentioned thread switching, so that threads not executing predicated instructions can hand off their execution units to threads that are. Of course this trick helps the most when your workload has lots of very similar threads, so it doesn't work for e.g. the actor model, but it'll destroy a bunch of SIMD cores for doing massive data transformations.
- SIMD synchronization is the same as scalar synchronization: you get locks, atomics, or abstractions on top like message passing. SIMT workloads instead tend to need a way to synchronize different stages of processing, doing something like "wait for all threads to reach this point." This is easy to implement in SIMT (NVIDIA's __syncthreads()), but very expensive with multiple SIMD CPU cores (see the second sketch after this list).
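To make the first point concrete, here's a rough sketch (entirely hypothetical; the function names, the scale-and-clamp operation, and the sizes are made up for illustration). The SIMD version bakes the 8-wide AVX register into the controlling loop and has to mop up the leftover elements, while the CUDA version is written in terms of one element, and the launch configuration rather than the code decides how wide the hardware runs it. The divergent branch in the kernel is also the kind of control flow the third point is about.

Code:
// Hypothetical example: scale every value by k, clamping negatives to zero.

// SIMD (AVX) version: the 8-wide vector is baked into the loop,
// and the remainder has to be handled separately.
#include <immintrin.h>

void scale_clamp_avx(float *data, int n, float k)
{
    __m256 vk   = _mm256_set1_ps(k);
    __m256 zero = _mm256_setzero_ps();
    int i = 0;
    for (; i + 8 <= n; i += 8) {                    // 8 floats per iteration
        __m256 v = _mm256_mul_ps(_mm256_loadu_ps(&data[i]), vk);
        _mm256_storeu_ps(&data[i], _mm256_max_ps(v, zero));  // branchless clamp
    }
    for (; i < n; i++) {                            // boundary condition
        float v = data[i] * k;
        data[i] = v < 0.0f ? 0.0f : v;
    }
}

// SIMT (CUDA) version: written in terms of a single element; the launch
// configuration, not the code, decides how many lanes/warps run it.
__global__ void scale_clamp_kernel(float *data, int n, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                                    // divergence handled by hardware
        float v = data[i] * k;
        data[i] = v < 0.0f ? 0.0f : v;
    }
}

// Launch (assuming dev_data is already on the GPU):
// scale_clamp_kernel<<<(n + 255) / 256, 256>>>(dev_data, n, k);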
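And for the last point, a minimal sketch of the "wait for all threads to reach this point" pattern: two stages communicating through shared memory with a __syncthreads() barrier between them. Everything here (the kernel name, the tile size, the doubling in stage 1) is made up for illustration; the equivalent with several SIMD CPU cores would need a software barrier across cores, which is far more expensive than a hardware barrier inside one SM.

Code:
#define TILE 256   // assumed to match the threads-per-block at launch

// Hypothetical two-stage kernel: stage 1 writes a tile into shared memory,
// __syncthreads() waits for every thread in the block, then stage 2 reads
// elements produced by other threads.
__global__ void two_stage(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Stage 1: each thread writes its own element into shared memory.
    tile[threadIdx.x] = (i < n) ? in[i] * 2.0f : 0.0f;

    __syncthreads();   // every thread in the block has finished stage 1

    // Stage 2: each thread reads its neighbour's result, which is only
    // safe because of the barrier above.
    if (i < n) {
        int next = (threadIdx.x + 1) % TILE;
        out[i] = tile[threadIdx.x] + tile[next];
    }
}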
Overall, the flexibility of being able to switch your CPU cores between "graphics"-like workloads and "actor"-like workloads on demand just costs you performance; heterogeneous computing is more efficient. The problems you mentioned earlier, like the GPU waiting on the CPU and vice versa, are better solved by fixing your program architecture and drivers. Your complaint about proprietary GPU architectures is better solved by making open GPU architectures, as Intel is moving toward with its integrated graphics.
Brendan wrote:
Rusky wrote:
Okay, I misspoke. I wasn't specific enough. For large actors (browsers, servers) in large quantities (the internet) where you don't have a lot of overlap in what they're working on (entirely different websites), the actor model scales well. For your "thread-sized" actors working on the same problem (trying to get an N-times speedup for N actors, for example), you cannot scale nearly as high.
Still wrong. For example, for something like SETI@home, where everything is working on exactly the same problem, it still scales incredibly well. In general, all "embarrassingly parallel" problems (which includes graphics) are embarrassingly easy regardless of whether you use locks or messages or whatever (until you realise that some approaches fail to scale beyond the limits of shared memory).
SETI@home isn't using the actor model for its actual processing so much as it's using it to orchestrate the computers' separate workloads over the network. Just like other map/reduce frameworks, it's using the same basic model as a GPU for the actual processing, although its distributed nature makes for more complicated coordination, which is something the actor model excels at.
For example, at a smaller scale, a typical way to use the PS3 is to split up a problem like SETI@home (which does run on the PS3) into chunks that run in isolation, and then hand them out to the SPUs, each of which has its own local memory. Nobody calls that the actor model, though it is a perfect example of what I said the actor model is good at (large actors without much overlap in what they're working on).
On the other hand, if you tried to do something like SETI@home by having several clients work on the same chunk of data and send messages back and forth to coordinate, the actor model can't help you at all. They did it the right way for that scale by relegating message passing to coordination between actors working on separate chunks.
Brendan wrote:
Rusky wrote:
On the other hand, when you can hand out work to GPU or SPU-style processors, you can actually get an N-times speedup until you run out of data to split up.
GPUs struggle just to handle "CPU+GPU" working on the same problem in the same computer. In contrast, it's relatively trivial to use the actor model to scale graphics processing across a pool of many computers, both with and without shared memory, and even with heterogeneous processors (got a mixture of 80x86, ARM and Sparc CPUs; and a mixture of Intel, AMD and Nvidia GPUs; all on the same LAN? No problem - let's use all of them at once!).
You might be making the mistake of assuming that the actor model has higher communication costs than shared memory...
As you can see, the actor model and data-oriented design are pretty much orthogonal. The actor model works well to enforce some structure on the infrequent communication a data-oriented approach needs, especially if you're doing things over a network or something - and the data-oriented approach excels there because it makes it easier to split up both the data and the transforms. But actors can't help you scale past a few threads if you don't think in terms of data. Trying to do too much communication between actors kills it: not because the actor model is any less efficient at communicating (I don't think it is, per se), but because the communication itself is what keeps you from scaling.
Brendan wrote:
Rusky wrote:
And the premise of data-oriented design is that you can do that for a lot more problems when you formulate the problem as data transforms instead of thinking about threads/actors/objects.
If you formulate the problem as data/messages being processed by data transforms/actors, instead of thinking about threads/functions/objects; then...
Surely you can see that data-oriented design has some significant similarities to the actor model.
No, they're not that similar. The actor model doesn't help you at all in terms of how to split up the data and operations; it just gives you the tools to organize things once you've done so. I'm all for the actor model as a way to organize communications once you know you need them, but as a way to organize your program it's just as bad as OOP: thinking in terms of agents doing things (i.e. threads doing work, actors sending and receiving messages, or objects implementing features) makes it too easy to ignore the actual workload, the actual data, and the actual transforms.
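As a concrete (and entirely made-up) illustration of that difference: the "agents doing things" formulation puts the update behind each object, while the data-oriented formulation is just a transform over arrays, which is exactly the shape that maps onto SIMD lanes, a CUDA kernel, or a pool of machines. The Particle/Particles names and the integration step are hypothetical.

Code:
// "Objects doing things": each object owns its own update.
struct Particle {
    float x, y, vx, vy;
    void update(float dt) { x += vx * dt; y += vy * dt; }
};
// for (Particle &p : particles) p.update(dt);

// Data-oriented: the same work expressed as a transform over parallel arrays
// (struct-of-arrays), which splits trivially across lanes, cores or machines.
struct Particles {
    float *x, *y, *vx, *vy;
    int count;
};

void integrate(Particles &p, float dt)
{
    for (int i = 0; i < p.count; i++) {   // a compiler can vectorise this...
        p.x[i] += p.vx[i] * dt;
        p.y[i] += p.vy[i] * dt;
    }
}

// ...or it becomes a one-line-per-element CUDA kernel:
__global__ void integrate_kernel(float *x, float *y,
                                 const float *vx, const float *vy,
                                 int count, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) { x[i] += vx[i] * dt; y[i] += vy[i] * dt; }
}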