OSDev.org

The Place to Start for Operating System Developers
 Post subject: Memory access performance
PostPosted: Sat Dec 20, 2014 5:53 am 
I should admit that I have missed something about the internal mechanics of memory access, so now I'm in a situation where I don't know how to improve performance. The problem is as follows:

When a processor executes some operation on data from memory, it first has to read it. The read is very time consuming if measured in clock cycles: if there is no cache hit, it takes about 100 clock cycles to get the required data into registers. But somehow processors manage to crunch a lot of data in very few cycles. My first idea was about the cache loading procedure. If we read just one byte, the memory controller in fact reads a whole cache line, which is 64 bytes long. So now we have 64 bytes read in about 100 clock cycles. But that is still too little, because any modern processor needs much more data to keep its load at an acceptable level. With SSE operations a processor can add 16 bytes every clock cycle, but the memory access mechanism provides just 64 bytes per 100 clock cycles, so we get just 4 SSE operations per 100 clock cycles. And now we have AVX operations, where 32 bytes are required per operation, so processor load drops to just 2 operations per 100 clock cycles.

So here is another idea - the cache line should be much longer. But again we get something very ugly - the memory bus would have to feed the processor at the speed of light: 6400 bytes per 100 clock cycles. For a modern processor at a 5 GHz clock rate that is 320 GB per second. Is it possible? And is it possible to increase the cache line up to 6400 bytes? And isn't it a waste of resources for the controller to read 6400 bytes if we need just one byte? Maybe I just miss the real picture of memory access?

So actually the question is about memory access mechanics - how does it work in modern computers?


 Post subject: Re: Memory access performance
PostPosted: Sat Dec 20, 2014 7:34 am 
Hi,

embryo wrote:
So actually the question is about memory access mechanics - how does it work in modern computers?


Let's start by characterising the actual worst case. The worst case I can think of is a TLB miss (where everything misses) in long mode (where PML4, PDPT, PD and PT all have to be fetched from RAM), followed by a cache miss for the actual data; where all RAM involved is in a different NUMA domain (e.g. CPU has to ask another CPU for it via a quick-path or hyper-transport link). In this case if you estimate 150 cycles per "fetch from RAM" you're looking at about 750 cycles.

Now let's consider both pro-active approaches to minimise the overhead, and retro-active approaches to hide the overhead.

For pro-active approaches the single biggest is caches themselves. This is a carefully designed set of caches, ranging from "larger and slower" (e.g. L3) all the way up to "smaller and faster" (e.g. L1). This includes TLBs (e.g. on a TLB miss where CPU needs to fetch the page table entry, it might get the page table entry from L3 cache and avoid fetching from RAM). It turns out that for a lot of code the "working set" (the data that's actually used often) is small compared to cache sizes, and a lot of accesses are cache hits.

Then there's prefetching. This comes in 2 flavours - software prefetching and hardware prefetching; where both reduce the number of cache misses too.

For retro-active approaches, there's out-of-order execution, speculative execution and hyper-threading. The basic idea of all of these is to hide the time it takes for data to arrive by finding other work for the CPU to do while it waits.

Now; if you combine all of the pro-active and retro-active approaches, the overhead is significantly reduced (e.g. depending on the specific code and its access patterns, instead of every access costing 750 cycles it might be more like 20 cycles of "non-hidden overhead" per access on average).

It's not perfect. That's why (for best performance) software has to be designed carefully, to maximise cache locality (and minimise cache pollution). There are a lot of things software can do to improve performance (and not just better software prefetching). For example; you might split larger structures into "hot fields" (things that are accessed frequently) and "cold fields" (things that aren't accessed so often); or you might replace a linked list (where iterating over the list means a potential cache miss that can't be prefetched for every single iteration) with an array (where iterating over the "list" becomes sequential/easily prefetched accesses). Of course "modern programming practices" (e.g. things like OOP, abstractions, modern memory management, etc) go a long way towards preventing important optimisations and mostly just destroy all hope.
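
As a rough C sketch of the "hot fields"/"cold fields" and "array instead of linked list" ideas (the structures and field names here are invented purely for illustration):

Code:
    /* "Hot" data that's touched on every iteration is kept small and
       contiguous, so many entries fit into each 64-byte cache line. */
    struct particle_hot {
        float x, y, z;
        float vx, vy, vz;
    };

    /* "Cold" data that's rarely needed lives elsewhere, so it doesn't
       pollute the cache while the main loop runs. */
    struct particle_cold {
        char name[32];
        long creation_time;
    };

    /* Plain arrays (not linked lists), so the hardware prefetcher sees
       simple sequential accesses. */
    struct particle_hot  hot[100000];
    struct particle_cold cold[100000];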

For increasing cache line size; if you do that you'd increase the cost of transferring data at every level (e.g. from RAM to L3, from L3 to L2, etc). For example, if you read 1 byte, then instead of costing 150 cycles to fetch 64 bytes from RAM it might cost 300 cycles to fetch 256 bytes from RAM. It would also increase the chance of something called "false sharing", where data that was not changed has to be invalidated because something else (that happened to be in the same cache line) was modified.
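
As a rough sketch of the usual way to avoid false sharing - keeping data touched by different CPUs on separate cache lines (C11 alignas; the 64-byte line size and the counter layout are assumptions for the example):

Code:
    #include <stdalign.h>

    /* Without the alignment, counters updated by different CPUs could share
       one 64-byte cache line and bounce it between cores on every update. */
    struct per_cpu_counter {
        alignas(64) unsigned long value;
    };

    struct per_cpu_counter counters[8];   /* one per CPU */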


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject: Re: Memory access performance
PostPosted: Sat Dec 20, 2014 1:59 pm 
Brendan wrote:
Hi

Hi Brendan, and thanks for your good explanation. But I'm still missing some things.

First I should be more specific about the actual goal. The goal is a long calculation. "Long" here means that the data array, which is the source of numbers for the processor to work on, is much bigger than any cache can fit. Such arrays can frequently be found in any area related to data processing (engineering, databases, data analysis and so on). And very often such data is used just once, as is the case when we have to calculate a checksum, so even if the data fits in a cache at some level, caching it is just a useless waste of resources.

But I have explored some data processing and found that somehow the actual clock cycle count per iteration is much lower than anything close to the worst case assessment. The difference was close to an order of magnitude. That's why I realised I was missing some knowledge.
Brendan wrote:
The worst case I can think of is a TLB miss (where everything misses) in long mode (where PML4, PDPT, PD and PT all have to be fetched from RAM), followed by a cache miss for the actual data; where all RAM involved is in a different NUMA domain (e.g. CPU has to ask another CPU for it via a quick-path or hyper-transport link). In this case if you estimate 150 cycles per "fetch from RAM" you're looking at about 750 cycles.

I suppose it is a rare case, which can occur after some period of inactivity plus additional memory requests from other applications, when the system swaps the data to disk. But in the most probable case the data is first read from disk, and at the time of reading the TLB is updated as needed. So for long calculations it is not a big issue.
Brendan wrote:
For pro-active approaches the single biggest is caches themselves.

Yes, but in the case of long calculations the caches have limited impact on performance. It is a good case when we are able to select some data that is used really often and write our program in a way that maximizes the use of that data in one place, without losing the cached state. But with long arrays it is not that good case. It is also not the case if we calculate a checksum for some buffer that was used long ago, so we can be sure the data is not in a cache.
Brendan wrote:
Then there's prefetching. This comes in 2 flavours - software prefetching and hardware prefetching; where both reduce the number of cache misses too.

The actual mechanics of prefetching is still unclear to me. Is it the 64-byte load mentioned earlier that happens on any read from memory? Or are there different mechanisms to prefetch some data into a cache?
Brendan wrote:
For retro-active approaches, there's out-of-order execution, speculative execution and hyper-threading. The basic idea of all of these is to hide the time it takes for data to arrive by finding other work for the CPU to do while it waits.

Yes, these are very helpful techniques, but what if we just need to add or multiply some long arrays? What will be the outcome in the case of SSE or AVX instructions? Is using just general purpose registers instead of SSE or AVX the simpler, and not slower, way? Or can SSE/AVX instructions somehow tell the processor it should start some efficient pipeline of data feeding?
Brendan wrote:
Now; if you combine all of the pro-active and retro-active approaches, the overhead is significantly reduced (e.g. depending on the specific code and its access patterns, instead of every access costing 750 cycles it might be more like 20 cycles of "non-hidden overhead" per access on average).

It seems from this phrase that there is no way to read the required amount of data from memory to saturate the processor's need for it - even if we know that all we need is the maximum throughput of the memory-to-processor channel (bus).
Brendan wrote:
For increasing cache line size; if you do that you'd increase the cost of transferring data at every level (e.g. from RAM to L3, from L3 to L2, etc).

Well, after this I see just one way - we should utilize the memory-to-processor channel throughput as fully as possible in the case of uncached data. But is it possible to somehow tell the memory controller to start reading some long data array and to provide it to the processor "just in time"?
Brendan wrote:
Of course "modern programming practices" (e.g. things like OOP, abstractions, modern memory management, etc) goes a long way to preventing important optimisations and mostly just destroy all hope.

Unfortunately, that's just old dogma, not reality. Any programmer who knows about optimization techniques can isolate a program's bottlenecks and then rewrite the program in a way that removes those narrow places. But when a programmer needs to create a model of some situation, it is much better to have a tool that allows object oriented modeling. And of course, if some programmer just doesn't know about optimization, then there will be a lot of very slow code - and it is irrelevant whether he uses C or Java or something else.


 Post subject: Re: Memory access performance
PostPosted: Sat Dec 20, 2014 4:22 pm 
Hi,

embryo wrote:
First I should be more specific about the actual goal. The goal is a long calculation. "Long" here means that the data array, which is the source of numbers for the processor to work on, is much bigger than any cache can fit. Such arrays can frequently be found in any area related to data processing (engineering, databases, data analysis and so on). And very often such data is used just once, as is the case when we have to calculate a checksum, so even if the data fits in a cache at some level, caching it is just a useless waste of resources.

But I have explored some data processing and found that somehow the actual clock cycle count per iteration is much lower than anything close to the worst case assessment. The difference was close to an order of magnitude. That's why I realised I was missing some knowledge.


Are cache lines being used in sequential order? If yes; then the hardware prefetcher would've noticed and (for most of them) would be prefetching the next cache line while you're using the current one, so that by the time you need the next cache line it's either already on its way or already in the cache.

Of course this also depends a little on what you're actually doing - e.g. if the calculation is expensive and works on an array of bytes (e.g. 64 iterations per cache line) then you can end up CPU bound (and not limited by RAM latency or RAM bandwidth much at all).

embryo wrote:
Brendan wrote:
For pro-active approaches the single biggest is caches themselves.

Yes, but in the case of long calculations the caches have limited impact on performance. It is a good case when we are able to select some data that is used really often and write our program in a way that maximizes the use of that data in one place, without losing the cached state. But with long arrays it is not that good case. It is also not the case if we calculate a checksum for some buffer that was used long ago, so we can be sure the data is not in a cache.


Even for long arrays, it's possible that the CPU is accessing extremely easily cached things (e.g. code and stack). Also note that modern CPUs have special TLB/caches for "higher level paging structures" (e.g. a CPU might cache the PDPT's contents to avoid the cost of fetching PML4 and PDPT for a full TLB miss).

embryo wrote:
Brendan wrote:
Then there's prefetching. This comes in 2 flavours - software prefetching and hardware prefetching; where both reduce the number of cache misses too.

The actual mechanics of prefetching is still unclear to me. Is it the 64-byte load mentioned earlier that happens on any read from memory? Or are there different mechanisms to prefetch some data into a cache?


It depends a lot on which CPU. For most, the CPU's hardware prefetcher monitors accesses, is able to handle several independent "sequential streams", and begins prefetching them. However; it takes a few cache misses for the hardware prefetcher to notice and start prefetching; and because it works on the caches (which cache the physical address space) the hardware prefetcher doesn't know about paging (or the virtual address space), so it stops prefetching when it reaches a page boundary (instead of prefetching the next cache line in physical memory, where the next cache line in virtual memory is likely to be completely different).

What this mostly means is that in practice, when you're sequentially accessing cache lines, you'll get a few cache misses until the hardware prefetcher notices, then it'll prefetch the next 62 cache lines (3968 bytes) up to the page boundary, then it'll take another few cache misses before the hardware prefetcher notices the new page and begins prefetching again.

Also note that (e.g. if your array doesn't end on a page boundary) the hardware prefetcher will "over-fetch" - fetching a cache line past the end of your array that you won't need. For large arrays this is mostly irrelevant. For tiny arrays it can be a significant problem. For a worst case example, imagine if you're accessing a lot of 128 byte arrays - you get 2 cache misses before the hardware prefetcher starts, but then when the hardware prefetcher does start it's "over-fetching" data you don't want (and wasting RAM bandwidth that you need for the next 128 byte array).

In addition to that; more recent CPUs have "TLB prefetchers" - e.g. if the CPU notices that you're sequentially accessing virtual memory pages then it'll prefetch the TLBs to avoid a future TLB miss.

Finally, some CPUs do "adjacent cache line prefetch", where if you get a cache miss for one cache line it'll prefetch 2 cache lines from RAM instead of one. I'm not too sure how this interacts with the hardware's "sequential streams" prefetching (e.g. whether an adjacent cache line prefetch causes the "sequential streams" detector to take longer to notice the sequential accesses).

embryo wrote:
Brendan wrote:
For retro-active approaches, there's out-of-order execution, speculative execution and hyper-threading. The basic idea of all of these is to hide the time it takes for data to arrive by finding other work for the CPU to do while it waits.

Yes, these are very helpful techniques, but what if we just need to add or multiply some long arrays? What will be the outcome in the case of SSE or AVX instructions? Is using just general purpose registers instead of SSE or AVX the simpler, and not slower, way? Or can SSE/AVX instructions somehow tell the processor it should start some efficient pipeline of data feeding?


In general; if you're limited by RAM bandwidth then it won't matter too much what you do (but for hyper-threading it's probably better to use SSE/AVX if you can, so that there's more CPU resources left for the other logical CPU in the core to use). If you are limited by CPU then anything you can do to reduce the time CPU spends processing (including using SSE/AVX) will help.

However; for some bizarre and ill-advised reason, Intel like to pretend that general purpose cache management instructions (e.g. prefetch, clflush, non-temporal stores) that have nothing to do with SIMD are part of the SSE/AVX extensions. If used properly, these instructions can help, because you don't need 2 cache misses per page (to get the hardware prefetcher started) and you can avoid "over-fetch".

Note that software prefetching isn't necessarily easy - the "prefetch scheduling distance" (or how far in advance to prefetch) is hard to calculate (and depends on RAM latency and how much time is spent processing between fetches); although it's actually more forgiving than it seems (prefetching "too early" typically works fine, as long as nothing else can cause that data to get pushed out of the cache before it's needed). Also; depending on which CPU, the software prefetch instructions don't do anything if there's a TLB miss, which means that you may need to "preload" the first cache line of a page (e.g. do a dummy access to force the CPU to fetch the TLB entry and the cache line) before prefetching the remainder of the cache lines in that page.


Cheers,

Brendan



 Post subject: Re: Memory access performance
PostPosted: Sun Dec 21, 2014 7:31 am 
Thanks for clarification!

I see that I need to experiment with sequential access and the prefetch instruction. Other methods like non-temporal stores are important in the case when the data almost completely fits into the existing caches. But when it is guaranteed that the size of the array is bigger than any cache, then my hope is with the prefetcher, which is able to detect sequential access and preload a whole page instead of just one cache line.

On the issue of virtual memory access, I can see that if software requests memory for an array then (most probably) it will get one contiguous chunk of physical memory without any virtualization effects (even if the chunk is placed far from the program's other data in physical memory). So I hope virtual memory issues will not affect the performance.

To prevent the first 2 cache misses per page in the case of sequential memory access, I hope the prefetch instruction will be of some help. But even if it is not, I only lose about 1/32 of the memory access bandwidth.

However, some boundary situations, where calculations are performed in chunks that are relatively small, can take a performance hit due to a multitude of cache fetching effects. In this case it is important to look at every effect separately and to implement the program in a way that escapes the most critical performance damage. And in the case of chunks it is possible to get a TLB miss or any other virtualization effect.
Brendan wrote:
However; for some bizarre and ill-advised reason, Intel like to pretend that general purpose cache management instructions (e.g. prefetch, clflush, non-temporal stores) that have nothing to do with SIMD are part of the SSE/AVX extensions.

But what problem does it create? Every modern x86 processor has SSE, so why should we be bothered by the name of an instruction group?
Brendan wrote:
Also; depending on which CPU, the software prefetch instructions don't do anything if there's a TLB miss, which means that you may need to "preload" the first cache line of a page (e.g. do a dummy access to force the CPU to fetch the TLB entry and the cache line) before prefetching the remainder of the cache lines in that page.

Is the TLB miss detectable in any way? Or do I need the dummy access every time a miss is possible?


 Post subject: Re: Memory access performance
PostPosted: Sun Dec 21, 2014 7:38 am 
You seem to be confusing latency with time.

Yes, if you have a cache miss it can take ~100 cycles for the data to come back from RAM; but that doesn't mean it takes the RAM controller 100 cycles per read.

The whole memory interface on modern processors is pipelined and often even out of order. As long as the processor has enough information to keep decoding instructions, it'll continue doing so until its capacity to do so fills up. So, while it's quite possible that the first memory access will stall, it's also possible that the next 4 loop iterations will be decoded and waiting inside the processor, each with its set of memory accesses pushed out to the memory system; and when the data starts coming back, the processor will already be ready to process it.

If doing prefetches, you'll want to do so quite a distance ahead (potentially 50+ loop iterations); this is very processor microarchitecture specific. Also bear in mind that prefetches take up some of your instruction decode and issue bandwidth - they may sometimes make things worse (especially if the hardware prefetcher has predicted your accesses and the limitation is your instruction issue/retire rate)
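
As a minimal sketch of prefetching "a distance ahead" (this uses the GCC/Clang __builtin_prefetch builtin; the distance of 64 iterations is only an assumed starting point and would need tuning per microarchitecture):

Code:
    #include <stddef.h>
    #include <stdint.h>

    /* Assumed tuning constant: how many iterations ahead to prefetch. */
    #define PREFETCH_DISTANCE 64

    uint64_t sum_with_prefetch(const uint32_t *data, size_t count)
    {
        uint64_t sum = 0;
        for (size_t i = 0; i < count; i++) {
            /* Ask the CPU to start loading a future element into cache. */
            if (i + PREFETCH_DISTANCE < count)
                __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 0);
            sum += data[i];
        }
        return sum;
    }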


 Post subject: Re: Memory access performance
PostPosted: Mon Dec 22, 2014 3:12 am 
Owen wrote:
You seem to be confusing latency with time.

Yes, if you have a cache miss it can take ~100 cycles for the data to come back from RAM; but that doesn't mean it takes the RAM controller 100 cycles per read.

Ok, maybe there is a memory pipeline, but when the processor needs some new data, it will be available no earlier than 100 cycles later. But if the pipeline is scheduled according to the programmer's expectations, then it can deliver up to one read every cycle. The problem here is the match between the programmer's expectations and the pipeline scheduling. I still wonder why hardware developers just refuse to create some means of direct schedule manipulation by a programmer. All those predictions are very interesting, but when we have just a bit of non-triviality in our algorithm, the undocumented predictions only make things worse. Even if they are documented somewhere (I'd appreciate a link here), it is done in some kind of Aesopian language, with many notes along the lines of "the processor will decide somehow, and such a decision can be changed right in the next processor model". If they want all those predictions, ok, let the predictions be with us, but why not add some explicit means of memory read/write management? It's just so simple that I have a feeling of something bad within the silicon industry.
Owen wrote:
If doing prefetches, you'll want to do so quite a distance ahead (potentially 50+ loop iterations); this is very processor microarchitecture specific. Also bear in mind that prefetches take up some of your instruction decode and issue bandwidth - they may sometimes make things worse (especially if the hardware prefetcher has predicted your accesses and the limitation is your instruction issue/retire rate)

Yes, optimization is an area of many trials and errors. And I see it as just another argument for explicit hardware management accessible to a programmer. Or would Intel then lose its competitive advantage if everything were transparent and clear?


 Post subject: Re: Memory access performance
PostPosted: Mon Dec 22, 2014 3:26 am 
Quote:
It's just so simple that I have a feeling of something bad within the silicon industry.
If it is that simple, show us a design and we'll point out all the cases you have missed and which problems you just introduced.

_________________
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]


 Post subject: Re: Memory access performance
PostPosted: Mon Dec 22, 2014 6:32 am 
Quote:
Ok, maybe there is a memory pipeline, but when the processor needs some new data, it will be available no earlier than 100 cycles later. But if the pipeline is scheduled according to the programmer's expectations, then it can deliver up to one read every cycle. The problem here is the match between the programmer's expectations and the pipeline scheduling. I still wonder why hardware developers just refuse to create some means of direct schedule manipulation by a programmer. All those predictions are very interesting, but when we have just a bit of non-triviality in our algorithm, the undocumented predictions only make things worse. Even if they are documented somewhere (I'd appreciate a link here), it is done in some kind of Aesopian language, with many notes along the lines of "the processor will decide somehow, and such a decision can be changed right in the next processor model". If they want all those predictions, ok, let the predictions be with us, but why not add some explicit means of memory read/write management? It's just so simple that I have a feeling of something bad within the silicon industry.

Unfortunately, these types of optimizations have little effect in the real world, when you consider that you will typically have hundreds of processes and thousands of threads waiting on CPU time, memory bus time, PCI bus time, USB bus time, storage device read/write time and network send/receive time.

The hardware/software cache control improvements you are talking about would be beneficial if you were designing a real-time, dedicated, single purpose machine. But I doubt they would even be noticeable in a modern desktop environment.

Also, keep in mind that the best person to make decisions about memory access optimizations would be the application developer (not the CPU designer or the OS developer). They are the ones who know, ahead of time, how the application will use system memory.

But, unfortunately, application developers rarely focus on performance at the memory bus level. There are usually much bigger gains to be found at the network and storage device level.

And, honestly, how many developers would actually use this functionality at the CPU level, considering that most modern development is done in .NET, Java, JavaScript, or some other dynamic scripting language?

On the other hand, as an OS developer, you really have no way of knowing what your system will be used for, which makes it pretty difficult to "fine tune" your memory access code to improve performance at the user level. The best you can do is tweak your kernel performance, and maybe choose a particular type of application to focus your performance improvements on, like databases, or network servers, or 3D graphics rendering.

CPU designers and hardware manufacturers do this all the time, which is why you don't see too many Intel Xeon laptops, or ARM based web servers.

First, make it work, then make it fast. That's my motto. :)

_________________
Project: OZone
Source: GitHub
Current Task: LIB/OBJ file support
"The more they overthink the plumbing, the easier it is to stop up the drain." - Montgomery Scott


 Post subject: Re: Memory access performance
PostPosted: Mon Dec 22, 2014 9:53 am 
Hi,

embryo wrote:
I see that I need to experiment with sequential access and the prefetch instruction. Other methods like non-temporal stores are important in the case when the data almost completely fits into the existing caches. But when it is guaranteed that the size of the array is bigger than any cache, then my hope is with the prefetcher, which is able to detect sequential access and preload a whole page instead of just one cache line.


When the arrays are larger than the cache; what happens is that after you've finished accessing the array/s the caches end up full of "most recently used, but unlikely to be used again soon" data, and all of the "less recently used, but more likely to be used again" data gets pushed out of the cache; and the performance of the code that's executed after you've finished accessing the array/s is affected. Things like non-temporal stores and/or the CLFLUSH instruction are important to prevent the cache from getting filled with "most recently used, but unlikely to be used again soon" data, and to prevent the performance of the code that's executed after you've finished accessing the array/s from being affected.

For an example of all this, imagine you've got an array of 4096000 8-bit integers, and you're doing this:

Code:
    for(i = 0; i < 4096000; i++) {
        doSomething(array[i]);
    }


The first step would be to ensure the array is aligned on a cache line boundary and work on cache lines, like this:

Code:
    for(i = 0; i < 4096000; i += 64) {
        for(j = i; j < i + 64; j++) {
            doSomething(array[j]);
        }
        cacheLineFlush(&array[i]);
    }


The next step would be to ensure the array is aligned on a page boundary and work on pages, like this:

Code:
    for(i = 0; i < 4096000; i += 4096) {
        for(j = i; j < i + 4096; j += 64) {
            for(k = j; k < j + 64; k++) {
                doSomething(array[k]);
            }
            cacheLineFlush(&array[j]);
        }
    }


Next, preload 1 page in advance:

Code:
    preload(&array[0]);

    for(i = 0; i < 4096000; i += 4096) {
        preload(&array[i+4096]);
        for(j = i; j < i + 4096; j += 64) {
            for(k = j; k < j + 64; k++) {
                doSomething(array[k]);
            }
            cacheLineFlush(&array[j]);
        }
    }


This will preload a page past the end of the array, so fix that:

Code:
    preload(&array[0]);

    // Do all pages except last (while preloading next page)
    for(i = 0; i < 4096000 - 4096; i += 4096) {
        preload(&array[i+4096]);
        for(j = i; j < i + 4096; j += 64) {
            for(k = j; k < j + 64; k++) {
                doSomething(array[k]);
            }
            cacheLineFlush(&array[j]);
        }
    }

    // Do last page (without preloading next page)
    for(j = i; j < i + 4096; j += 64) {
         for(k = j; k < j + 64; k++) {
            doSomething(array[k]);
        }
        cacheLineFlush(&array[j]);
    }


Next, let's assume we want to prefetch 2 cache lines in advance:

Code:
    preload(&array[0]);
    // prefetch(&array[0]);
    prefetch(&array[64]);

    // Do all pages except last (while preloading next page)
    for(i = 0; i < 4096000 - 4096; i += 4096) {
        preload(&array[i+4096]);
        for(j = i; j < i + 4096; j += 64) {
            prefetch(&array[j+128]);
            for(k = j; k < j + 64; k++) {
                doSomething(array[k]);
            }
            cacheLineFlush(&array[j]);
        }
    }

    // Do last page (without preloading next page)
    for(j = i; j < i + 4096; j += 64) {
         for(k = j; k < j + 64; k++) {
            prefetch(&array[k+128]);
            doSomething(array[k]);
        }
        cacheLineFlush(&array[j]);
    }


This will prefetch 2 cache lines past the end of the array, so fix that:

Code:
    preload(&array[0]);
    // prefetch(&array[0]);
    prefetch(&array[64]);

    // Do all pages except last (while preloading next page)
    for(i = 0; i < 4096000 - 4096; i += 4096) {
        preload(&array[i+4096]);
        for(j = i; j < i + 4096; j += 64) {
            prefetch(&array[j+128]);
            for(k = j; k < j + 64; k++) {
                doSomething(array[k]);
            }
            cacheLineFlush(&array[j]);
        }
    }

    // Do last page (without preloading next page)
    for(j = i; j < i + 4096 - 128; j += 64) {
         for(k = j; k < j + 64; k++) {
            prefetch(&array[k+128]);
            doSomething(array[k]);
        }
        cacheLineFlush(&array[j]);
    }
    for(k = j; k < j + 64; k++) {
        doSomething(array[k]);
    }
    cacheLineFlush(&array[j]);
    j += 64;
    for(k = j; k < j + 64; k++) {
        doSomething(array[k]);
    }
    cacheLineFlush(&array[j]);


Of course this is just an example (not practical code optimised for any specific CPU).

In theory, you'd hope that the compiler can do all this for you. In practice, it's more likely that the compiler won't optimise properly and will then screw things up when you do it yourself. [-o<
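
For what it's worth, on x86 with GCC or Clang the preload/prefetch/cacheLineFlush helpers used above could plausibly be implemented along these lines (a sketch only; the prefetch hint choice is an assumption and doSomething stays a placeholder):

Code:
    #include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */
    #include <emmintrin.h>   /* _mm_clflush (SSE2) */

    /* Dummy read that forces the TLB entry and the first cache line of a
       page to be fetched (a software prefetch may be dropped on a TLB miss). */
    static inline void preload(const void *p)
    {
        (void)*(volatile const char *)p;
    }

    /* Software prefetch hint; it never faults and may be ignored by the CPU. */
    static inline void prefetch(const void *p)
    {
        _mm_prefetch((const char *)p, _MM_HINT_T0);
    }

    /* Evict a line we know we won't touch again, to limit cache pollution. */
    static inline void cacheLineFlush(const void *p)
    {
        _mm_clflush(p);
    }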

embryo wrote:
Brendan wrote:
However; for some bizarre and ill-advised reason, Intel like to pretend that general purpose cache management instructions (e.g. prefetch, clflush, non-temporal stores) that have nothing to do with SIMD are part of the SSE/AVX extensions.

But what problem does it create? Every modern x86 processor has SSE, so why should we be bothered by the name of an instruction group?


Imagine in 10 years time, where all code is using the prefetch instructions and AVX, and because nobody is using SSE anymore Intel wants to remove it. If they do "SSE supported = no" (in the CPUID feature flags) then software will assume prefetch instructions aren't supported, so Intel can't do that without causing huge problems and therefore can't remove SSE.

Of course there are also some CPUs (early Athlon) that do support the prefetch instructions but don't support SSE.

embryo wrote:
Brendan wrote:
Also; depending on which CPU, the software prefetch instructions don't do anything if there's a TLB miss, which means that you may need to "preload" the first cache line of a page (e.g. do a dummy access to force the CPU to fetch the TLB entry and the cache line) before prefetching the remainder of the cache lines in that page.

Is the TLB miss detectable in any way? Or do I need the dummy access every time a miss is possible?


There's no way to detect if a TLB miss will happen before it happens. Of course if it was possible to do (e.g.) "if( TLBnotPresent(address) ) preload(address);" then the branch would probably cost more than the preload.


Cheers,

Brendan



 Post subject: Re: Memory access performance
PostPosted: Tue Dec 23, 2014 6:30 am 
Combuster wrote:
If it is that simple, show us a design and we'll point out all the cases you have missed and which problems you just introduced.

Ok.

I'd prefer to have an assembly language where the following code is possible:
Code:
long;      enter long loop mode
      ;      maybe this command can be omitted and its job will be done by the first range instruction
range eax, ebx;      this range will be used by the following loop
range [ecx], [ecx+0xaaaa];      this range also will be used by the following loop
...;      some loop instructions
   ;      (more than one instruction, but there should be some limit on the total number)
start;      here the processor and memory controller do their job of decoding loop instructions,
       ;      starting the prefetching process and executing some other code while data or code are still not ready

Here we need one buffer for loop instructions and another (maybe just a few registers) for the expected memory ranges. On a task switch the system needs to push the buffers onto the stack, just as it now pushes the contents of 16*256+16*128+16*64+[other registers] bits of all the registers. It means we can have something like one kilobyte for the mentioned buffers without serious performance degradation.

Another possible solution is to expose the memory controller as another processor with its own set of commands. It could work like this:
Code:
x86;      directive that tells the assembler which kind of code to expect
...;      some x86 instructions
signal eax;      notify the memory controller about the need for its actions,
               ;      which are described in the action table in the row with index=eax
               ;      (note - the instruction can be replaced with a port access);
               ;      this instruction also tells the processor to turn off memory access optimization
mwait;     some kind of synchronization between the processor and the memory controller
...;      some other x86 instructions that expect quick memory access
done;      a signal to the processor that tells about the end of the intensive memory access section
       ;      and turns any hardware memory access optimizations back on
memory 123;      directive that tells the assembler about the memory controller's interrupt #123 actions
cache L1, [eax], [eax+0xbbbbb];      memory controller instruction that fetches the memory range [eax]-[eax+0xbbbbb] into the L1 cache
x86;      x86 code again

Here a task switch can break the harmony, so we need some means to prevent it. Such means can be of a different nature, but at first sight it seems that some OS-defined signal that postpones the task switch for one timer interrupt would work. But of course the critical section should then not take more than one timer interval in total to complete. That is absolutely possible if we split our processing into small parts and yield run time to other threads after each part is completed (a kind of cooperative multitasking).

And, of course, hardware operations should be more transparent, to help us create efficient programs that are not spoiled by some unpredictable hardware behavior.


 Post subject: Re: Memory access performance
PostPosted: Tue Dec 23, 2014 7:02 am 
SpyderTL wrote:
Unfortunately, these types of optimizations have little effect in the real world, when you consider that you will typically have hundreds of processes and thousands of threads waiting on CPU time, memory bus time, PCI bus time, USB bus time, storage device read/write time and network send/receive time.

First - it's all about concurrency. Hiding it from the programmer with the help of some dumb predictors leads to performance degradation. On the other hand, the complexity of many competing processes is really high for a programmer. But here is the second point - we can fight complexity by splitting it into small parts, exactly as it works now. USB bus time is hidden from the programmer by the USB driver, and the same goes for PCI, storage devices and the network. And before a driver/OS developer gets to a storage device, he has to win over CPU and memory controller time sharing. So our task is simplified a bit; now we just need to understand how to deal with many processing units. And the area of multiprocessing is explored relatively well - it's not trivial, but the complexity is acceptable and a human being can live with it.
SpyderTL wrote:
The hardware/software cache control improvements you are talking about would be beneficial if you were designing a real-time, dedicated, single purpose machine.

But if I design an OS, the mentioned improvements can help me make an OS that can leverage much more hardware power than existing OSes are able to. And as a result the cost of clusters, clouds and data farms can be decreased significantly. And cost is a very powerful driver in our world.
SpyderTL wrote:
Also, keep in mind that the best person to make decisions about memory access optimizations would be the application developer (not the CPU designer or the OS developer). They are the ones who know, ahead of time, how the application will use system memory.

There is another person: the compiler developer. If he is able to create a really smart compiler, then it is possible for the compiler to inject its memory management code without help from the application programmer. If we see an array access within a loop, it is relatively simple to determine the memory range that will be needed by the program. So the compiler developer should pay attention to such things and improve the "smartness" level of the compiler.
SpyderTL wrote:
There are usually much bigger gains to be found at the network and storage device level.

After an application developer has identified a program's bottlenecks, he is able to remove them, but only if he has the right tools. For network access it is enough to use some standard tooling, available for almost every language. But for bottlenecks in CPU or memory related areas we still have no such tools. I suppose we still need them.
SpyderTL wrote:
And, honestly, how many developers would actually use this functionality at the CPU level, considering that most modern development is done in .NET, Java, JavaScript, or some other dynamic scripting language?

It will be OS and compiler developers. But some application programmers can also use optimization techniques, so why should we discount them?
SpyderTL wrote:
On the other hand, as an OS developer, you really have no way of knowing what your system will be used for, which makes it pretty difficult to "fine tune" your memory access code to improve performance at the user level.

It depends on the OS type. If my OS uses only managed code, then I can do much more for application performance than an OS with unmanaged code can.
SpyderTL wrote:
First, make it work, then make it fast. That's my motto. :)

And when your intent of "making it fast" hits the wall of memory or CPU optimization limits, then it is time to think about new hardware.


 Post subject: Re: Memory access performance
PostPosted: Tue Dec 23, 2014 7:26 am 
Brendan, your patience and drive for quality even in forum messages are impressive :)
Brendan wrote:
Code:
    preload(&array[0]);
    // prefetch(&array[0]);
    prefetch(&array[64]);

    // Do all pages except last (while preloading next page)
    for(i = 0; i < 4096000 - 4096; i += 4096) {
        preload(&array[i+4096]);
        for(j = i; j < i + 4096; j += 64) {
            prefetch(&array[j+128]);
            for(k = j; k < j + 64; k++) {
                doSomething(array[k]);
            }
            cacheLineFlush(&array[j]);
        }
    }

    // Do last page (without preloading next page)
    for(j = i; j < i + 4096 - 128; j += 64) {
         for(k = j; k < j + 64; k++) {
            prefetch(&array[k+128]);
            doSomething(array[k]);
        }
        cacheLineFlush(&array[j]);
    }
    for(k = j; k < j + 64; k++) {
        doSomething(array[k]);
    }
    cacheLineFlush(&array[j]);
    j += 64;
    for(k = j; k < j + 64; k++) {
        doSomething(array[k]);
    }
    cacheLineFlush(&array[j]);


Of course this is just an example (not practical code optimised for any specific CPU).

In theory, you'd hope that the compiler can do all this for you. In practice, it's more likely that the compiler won't optimise properly and will then screw things up when you do it yourself. [-o<

Your code is explanatory and clear. For a complete tutorial on code optimization, the internals of doSomething, cacheLineFlush, prefetch and preload are all we are still missing :) But that's just a joke :)

But another question here is how to align an array on a page boundary? I see only one way - to request an array that is 4096 bytes bigger than required and then determine the correct offset into it.
Brendan wrote:
Imagine in 10 years time, where all code is using the prefetch instructions and AVX, and because nobody is using SSE anymore Intel wants to remove it. If they do "SSE supported = no" (in the CPUID feature flags) then software will assume prefetch instructions aren't supported, so Intel can't do that without causing huge problems and therefore can't remove SSE.

It's the destiny of almost all instructions, regardless of the name they have. Most probably Intel will never remove SSE because there will be a lot of programs that use it. It's like the 16-bit mode of Intel's processors - it is used just at boot time, but its usage is very important. I wonder why Intel decided to remove 8-bit mode from its 8086/8088 processors? :)
Brendan wrote:
Of course there are also some CPUs (early Athlon) that do support the prefetch instructions but don't support SSE.

It's another way Intel can take.
Brendan wrote:
"if( TLBnotPresent(address) ) preload(address);" then the branch would probably cost more than the preload.

So if an OS just doesn't use virtual memory then there is no performance hit in any case. But passivated programs will require more time to be activated again. However, it is a really frequent situation that OSes swap memory fragments to disk, and such swapping often ends in a hang-like situation. So even OSes that use virtual memory are not very exciting when the memory limit is hit.


 Post subject: Re: Memory access performance
PostPosted: Tue Dec 23, 2014 10:23 am 
Hi,

embryo wrote:
But another question here is how to align an array on a page boundary? I see only one way - to request an array that is 4096 bytes bigger than required and then determine the correct offset into it.


That depends on your environment. Maybe "malloc()" will give you something that's page aligned if the allocation is large enough, maybe you have a "mallocAligned()" function that lets you ask for whichever alignment you need, maybe there's a "mmap()" function that can only do page aligned allocations, maybe you put it in your ".bss" and tell the linker to align it, and maybe none of those options are possible and you do have to allocate 4096 bytes extra and align it yourself.

Alternatively, maybe you write your code to handle misaligned arrays; with an initial loop to handle from the start of the array up to the first page boundary, then a middle loop for the majority of the array (that is page aligned), then another loop for the (partially used) last page of the array.
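
A couple of those options sketched in C (posix_memalign is the POSIX call; the second helper is the "allocate 4096 bytes extra and align it yourself" approach, where the original pointer must be kept so it can be freed later):

Code:
    #include <stdint.h>
    #include <stdlib.h>

    #define PAGE_SIZE 4096

    /* Option 1: ask the allocator for page-aligned memory directly. */
    void *alloc_page_aligned(size_t size)
    {
        void *p = NULL;
        if (posix_memalign(&p, PAGE_SIZE, size) != 0)
            return NULL;
        return p;
    }

    /* Option 2: over-allocate by one page and round the pointer up. */
    void *align_up_to_page(void *raw)
    {
        uintptr_t addr = (uintptr_t)raw;
        return (void *)((addr + PAGE_SIZE - 1) & ~(uintptr_t)(PAGE_SIZE - 1));
    }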

embryo wrote:
Brendan wrote:
Imagine in 10 years time, where all code is using the prefetch instructions and AVX, and because nobody is using SSE anymore Intel wants to remove it. If they do "SSE supported = no" (in the CPUID feature flags) then software will assume prefetch instructions aren't supported, so Intel can't do that without causing huge problems and therefore can't remove SSE.

It's the destiny of almost all instructions, regardless of the name they have. Most probably Intel will never remove SSE because there will be a lot of programs that use it. It's like the 16-bit mode of Intel's processors - it is used just at boot time, but its usage is very important. I wonder why Intel decided to remove 8-bit mode from its 8086/8088 processors? :)


Ideally, the length of an instruction's opcode should depend on how often it's used - the most frequently used instructions should have the shortest opcodes possible. One of the problems Intel currently/still has is that there are too few unused shorter opcodes; and new instructions have to use longer/less efficient opcodes. That's why SSE and AVX instructions begin with a series of one or more prefixes (and it's also why AMD deprecated certain instructions in 64-bit code - pusha/popa, inc/dec, etc). The end result of this is less efficient instruction fetch/caches, and more complex decoders.

Currently, FPU, MMX and 3DNow! are all relatively removable (superseded by SSE and AVX). Once we complete the transition to UEFI, real mode will become removable (and possibly, later, protected mode could be removed too), but this won't really help with opcodes.

embryo wrote:
Brendan wrote:
Of course there are also some CPUs (early Athlon) that do support the prefetch instructions but don't support SSE.

It's another way Intel can take.


I think (for early Athlon) AMD added Intel's prefetch instructions as part of "Extended 3DNow!". Intel could maybe remove SSE and support "Extended 3DNow!" instead (so that software designed for Extended 3DNow! would still be able to use the prefetches); but that doesn't sound like a great option to me.. ;)

Essentially it's the same problem - one flag being used for 2 different things (whether it's one "SSE plus prefetch" flag, or one "Extended 3DNow! plus prefetch" flag).

The other thing I didn't mention is that sometimes CPUs have bugs, and it might make sense to disable a certain feature. Recently we've seen this happen with Intel's transactional extensions ("Oops, we screwed up, here's a micro-code update that disables TSX"). For this specific case, I still don't know if the problem affects both hardware lock elision and restricted transactional memory, or if it only affects one and not the other (and both had to be disabled because there's only one flag).

Also note that it does make sense for an OS to create its own set of CPU feature flags, where all software uses the OS's "CPU feature flags" and nothing uses CPUID. That way the OS has the ability to fix certain bugs (e.g. CPUs that say something isn't supported when it is, CPUs that say something is supported when it's not, and/or CPUs that say something is supported when it's buggy), and the OS also has the ability to use different flags for different things (e.g. you can have separate flags for prefetch instructions, SSE and E3DNow!; separate flags for "syscall in 32-bit code" and "syscall in 64-bit code"; etc).
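
A minimal sketch of that idea (the OSFEAT_* names and the quirk-table comment are invented for illustration; the query uses GCC/Clang's <cpuid.h> helper and the SSE bit in CPUID leaf 1 EDX):

Code:
    #include <cpuid.h>    /* __get_cpuid, GCC/Clang */
    #include <stdint.h>

    /* OS-defined feature flags - deliberately separate from raw CPUID bits. */
    #define OSFEAT_SSE       (1u << 0)
    #define OSFEAT_PREFETCH  (1u << 1)   /* prefetchnta/prefetcht0..t2 usable */

    static uint32_t os_cpu_features;

    void detect_cpu_features(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
            if (edx & (1u << 25)) {         /* CPUID.01h:EDX bit 25 = SSE */
                os_cpu_features |= OSFEAT_SSE | OSFEAT_PREFETCH;
            }
            /* A quirk table could go here: set or clear OS flags for specific
               family/model/stepping values where CPUID is known to be wrong,
               or where a feature exists but is buggy. */
        }
    }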

embryo wrote:
Brendan wrote:
"if( TLBnotPresent(address) ) preload(address);" then the branch would probably cost more than the preload.

So if an OS just doesn't use virtual memory then there is no performance hit in any case. But passivated programs will require more time to be activated again. However, it is a really frequent situation that OSes swap memory fragments to disk, and such swapping often ends in a hang-like situation. So even OSes that use virtual memory are not very exciting when the memory limit is hit.


If an OS doesn't use paging, then it won't suffer from TLB misses at all; but it will suffer from other problems (related to efficient memory management), and the hardware prefetcher still won't cross page boundaries.

If pages (or segments) aren't in memory (e.g. they're on disk, either in swap space or due to memory mapped files) then the cost of the exception followed by the cost of disk IO is going to be many times higher than the cost of any TLB and/or cache misses, and TLB/cache misses become irrelevant. Of course this shouldn't be considered likely, and code should assume that the data is in RAM.


Cheers,

Brendan



 Post subject: Re: Memory access performance
PostPosted: Wed Dec 24, 2014 7:29 am 
Brendan wrote:
Alternatively, maybe you write your code to handle misaligned arrays; with an initial loop to handle from the start of the array up to the first page boundary, then a middle loop for the majority of the array (that is page aligned), then another loop for the (partially used) last page of the array.

It works for big arrays, but once the predictor catches the direction of a sequential memory access, wouldn't it preload the next page when memory reads hit somewhere close to the predictor's threshold? Then only the initial loop seems viable.

But if there is an opportunity to work with data in chunks (like using the cached data in more than one operation), then the predictor will be unable to preload the data, because the memory read pattern tends to exhibit locality instead of sequentiality. Here we need to use the prefetch instruction. And as I understand it, it should be executed at least twice to initiate the predictor's preload action, right? More generally - what kind of relationship exists between the prefetch instruction and the predictor's ability to preload a whole page? Can such a relationship be described in the form of a clear algorithm? If there is such an algorithm then we (almost) have a reliable means of preloading a required page into the caches. But we would still not know the cache level into which the page will be preloaded.
Brendan wrote:
Ideally, the length of an instruction's opcode should depend on how often it's used - the most frequently used instructions should have the shortest opcodes possible.

But if the instruction read bandwidth is measured in gigabytes per second, then it is possible to have a just slightly more complex decoder and forget about instruction length. Or are there situations when the processor stalls because of an empty instruction pipeline? As I see it, the instruction cache is independent and instruction reads should have higher priority than data reads, so it is hardly possible to stall the processor because of a high data load.
Brendan wrote:
Once we complete the transition to UEFI, real mode will become removable

I suppose such a transition will take us as far as the 2030s or even later. What has become removable in the last 15-20 years? Or in 35 years?
Brendan wrote:
Essentially it's the same problem - one flag being used for 2 different things (whether it's one "SSE plus prefetch" flag, or one "Extended 3DNow! plus prefetch" flag).

Then they will invent another flag which will tell us what has been removed. :)
Brendan wrote:
Also note that it does make sense for an OS to create its own set of CPU feature flags, where all software uses the OS's "CPU feature flags" and nothing uses CPUID.

Well, you are one step closer to managed code. :)
Brendan wrote:
If an OS doesn't use paging, then it won't suffer from TLB misses at all; but it will suffer from other problems (related to efficient memory management)

What kind of problems with memory management (without virtual memory) can you think of in the case of managed code?

