Intel CPUs don't implement true load-load and store-store ordering (see
here). What Intel has is closer to
processor consistency: each core observes another core's stores in the order that core issued them, but there is no total ordering over all stores or loads.
In their case, serializing reads is a matter of processing cache invalidations in order. Similarly, serializing writes requires only buffering them before the L1 cache (the store buffer). Note that these are predominantly local mechanisms, not global interactions. I believe fewer stalls are possible here (a store buffer overflow, a store forcing the invalidation queue to drain, etc.), but a total load or store ordering is harder to attain.
The benefits depend on the software design. Double-checked locking, as well as
eventual consistency without barriers, are possible under this memory model. In cases where you use filters or software caches to optimize performance, an occasional false negative from stale data on one core has a smaller performance impact than memory barriers all over the place. Corruption from a stale read due to reordered reads would be intolerable, however. Essentially, the aim is simply to avoid memory barriers, which impose a performance penalty on all operations in their vicinity, even when there are no updates to the address of interest. This is not possible in all cases, but it usually is when contention is low.
In my opinion, the primary performance drain is cache-line coherency (i.e., ensuring no lost updates), especially due to false sharing. And it is essentially necessary, whether or not it is implemented optimally. I would like to find documents detailing the impact of dependent-load ordering on software performance; it seems to me a difficult constraint to implement without overhead comparable to a memory barrier, depending on how sophisticated the invalidation queue interface is.
Googling various forums indicates that x86 oostore was used in the
WinChip, an experimental architecture that is now obsolete.