Intel CPUs don't implement true load-load and store-store ordering (see
here). What Intel has is closer to
processor consistency: each core observes another core's stores in the order that core issued them, but there is no total ordering over all stores or loads.
In their case, serializing reads is a matter of processing cache invalidations in order. Similarly, serializing writes requires only buffering them before the L1 cache (the store buffer). Note that these are predominantly local mechanisms, not global interactions. I believe fewer stalls are possible here (a store buffer overflow, a store forcing the invalidation queue to drain, etc.), but a total load or store ordering is harder to attain.
The benefits depend on the software design. Double-checked locking, as well as
eventual consistency without barriers, are possible under this memory model. In cases where you use filters or software caches to optimize performance, an occasional false negative from stale data on one core has a smaller performance impact than memory barriers all over the place. Corruption from a stale read due to reordered reads would be intolerable, however. Essentially, the aim is simply to avoid memory barriers, which impose a performance penalty on all operations in their vicinity, even when there are no updates to the address of interest. This is not possible in all cases, but it usually is when contention is low.
In my opinion, the primary performance drain is cache-line coherency (i.e., ensuring no lost updates), especially due to false sharing. And it is essentially necessary, whether or not it is implemented optimally. I would like to find documents detailing the impact of dependent-load ordering on software performance; it seems to me a difficult constraint to implement without overhead comparable to a memory barrier, depending on how sophisticated the invalidation queue interface is.
Googling various forums indicates that x86 oostore was used in the
WinChip, an experimental architecture that is now obsolete.