Brendan wrote:
Hi
Hi Brendan, and thanks for your thorough explanation. But I'm still missing a few things.
First, I should be more specific about the actual goal. The goal is a long calculation. "Long" here means that the data array, which is the source of numbers for the processor to work on, is much larger than any cache can hold. Such arrays can be found in any area related to data processing (engineering, databases, data analysis and so on). And very often such data is used just once, as is the case when we have to calculate a checksum, so even if the data fits in a cache at some level, caching it is just a useless waste of resources.
But I have explored some data processing code and found that the actual clock cycle count per iteration is somehow far below any worst-case estimate; the difference was close to an order of magnitude. That's where I discovered a gap in my knowledge.
Brendan wrote:
The worst case I can think of is a TLB miss (where everything misses) in long mode (where PML4, PDPT, PD and PT all have to be fetched from RAM), followed by a cache miss for the actual data; where all RAM involved is in a different NUMA domain (e.g. CPU has to ask another CPU for it via. a quick-path or hyper-transport link). In this case if you estimate 150 cycles per "fetch from RAM" you're looking at about 750 cycles.
I suppose that is a rare case, which can occur after some idle time and additional memory requests from other applications, when the system swaps the data out to disk. But in the most likely case the data is first read from disk, and at the time of reading the TLB is updated as needed. So for long calculations it is not a big issue.
Brendan wrote:
For pro-active approaches the single biggest is caches themselves.
Yes, but for long calculations the caches have limited impact on performance. The good case is when we can select some data that is used really often and structure the program so that it maximizes use of that data in one place, without losing the cached state. But with long arrays that is not the case. Nor is it the case if we calculate a checksum for a buffer that was last used long ago, so we can be sure the data is no longer in any cache.
Brendan wrote:
Then there's prefetching. This comes in 2 flavours - software prefetching and hardware prefetching; where both reduce the number of cache misses too.
The actual mechanics of prefetching are still unclear to me. Is it the 64-byte line fill mentioned earlier, triggered by any read from memory? Or are there other mechanisms for prefetching data into a cache?
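For reference, the software flavour can be spelled out explicitly. This is a sketch using GCC/Clang's `__builtin_prefetch` (which compiles to a `prefetch` instruction on x86): while summing element `i`, we ask for the line holding element `i + 16` ahead of time. The distance of 16 elements (one 64-byte line for `uint32_t`) is an assumption to be tuned, not a universal constant; real code often prefetches several lines ahead.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of explicit software prefetching. The hardware prefetcher would
   likely detect this sequential stream on its own; the builtin just makes
   the mechanism visible. Arguments: address, rw (0 = read), locality
   (0 = no temporal reuse expected, i.e. data used once). */
uint64_t sum_with_prefetch(const uint32_t *data, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&data[i + 16], 0, 0);
        sum += data[i];
    }
    return sum;
}
```

Hardware prefetching is a separate mechanism: the CPU's prefetch units watch the stream of misses, detect sequential or strided patterns, and start fetching lines ahead on their own, with no instruction in the code at all.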
Brendan wrote:
For retro-active approaches, there's out-of-order execution, speculative execution and hyper-threading. The basic idea of all of these is hide the time it takes for data to arrive by finding other work for the CPU to do while it waits.
Yes, these are very helpful techniques, but what if we just need to add or multiply some long arrays? What would the outcome be with SSE or AVX instructions? Is using just the general purpose registers a simpler, and no slower, alternative to SSE or AVX? Or can SSE/AVX instructions somehow tell the processor to start some efficient data-feeding pipeline?
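To make the question concrete, here is a sketch of the SSE2 variant of "add two long arrays", processing four 32-bit lanes per instruction (the unaligned-load intrinsics are used so no alignment is assumed). My understanding is that the wider registers speed up the arithmetic, but they carry no hint to the memory system; a loop like this over an uncached array is still limited by how fast lines arrive, not by the ALUs.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Sketch: element-wise addition of two uint32_t arrays, four lanes at a
   time, with a scalar loop for the leftover tail. */
void add_arrays_sse2(const uint32_t *a, const uint32_t *b,
                     uint32_t *out, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_add_epi32(va, vb));
    }
    for (; i < n; i++)          /* scalar tail for n not divisible by 4 */
        out[i] = a[i] + b[i];
}
```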
Brendan wrote:
Now; if you combine all of the pro-active and retro-active approaches will significantly reduce the overhead (e.g. depending on the specific code and it's access patterns; instead of every access costing 750 cycles for every access it might be more like 20 cycles of "non-hidden overhead" per access on average).
This phrase suggests there is no way to read data from memory fast enough to saturate the processor's demand for it, even when we know that all we need is the maximum throughput of the memory-to-processor channel (bus).
Brendan wrote:
For increasing cache line size; if you do that you'd increase the cost of transferring data at every level (e.g. from RAM to L3, from L3 to L2, etc).
Well, given that, I see just one way: we should utilize the memory-to-processor channel throughput as fully as possible for uncached data. But is it possible to somehow tell the memory controller to start reading some long data array and to provide it to the processor "just in time"?
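As far as I know there is no instruction that tells the memory controller "start streaming this array"; the closest tools are the prefetch hints on the read side and non-temporal (streaming) stores on the write side, which bypass the caches entirely so a once-written buffer doesn't evict useful lines. A sketch with SSE2's `_mm_stream_si128` (which requires a 16-byte-aligned destination):

```c
#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Sketch: fill a buffer with non-temporal stores that go around the
   cache hierarchy. ASSUMES out is 16-byte aligned for the vector part. */
void fill_streaming(uint32_t *out, size_t n, uint32_t value)
{
    __m128i v = _mm_set1_epi32((int)value);
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_stream_si128((__m128i *)(out + i), v);  /* bypasses the cache */
    for (; i < n; i++)          /* scalar tail */
        out[i] = value;
    _mm_sfence();   /* order the streamed stores before later reads/writes */
}
```

For plain loads from ordinary write-back memory there is no true streaming-load equivalent; `prefetchnta` is the usual hint for "I will read this once, don't keep it".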
Brendan wrote:
Of course "modern programming practices" (e.g. things like OOP, abstractions, modern memory management, etc) goes a long way to preventing important optimisations and mostly just destroy all hope.
Unfortunately, that is just old dogma, not reality. Any programmer who knows about optimization techniques can identify a program's bottlenecks and then rewrite the program so that it no longer has such a narrow place. But when a programmer needs to create a model of some situation, it is much better to have a tool that allows object-oriented modeling. And of course, if a programmer simply doesn't know about optimization, there will be a lot of very slow code, and it is irrelevant whether he uses C or Java or something else.