Disable MTRRs

johnsa · **Joined:** Mon Oct 15, 2007 3:04 pm **Posts:** 296

Originally you said "1500 fps with WC" and now you're saying "180 fps with WC".

The 5fps vs 1500fps was just the copy portion (rep movsq).
So using W/C it's pushing about 12 GB/s, which is totally plausible.

I was surprised by the overall difference too. With W/C enabled it runs pretty much the same as it does under Windows using a GDI bitmap. Without W/C the LFB area is marked as uncached and it's only pushing out about 40 MB/s (5 fps or so). The frame rate doesn't really vary whether it renders the pixel data or not, because it's already so slow.

To me it looks like the performance with W/C is as it should be, perhaps there is something else wrong on my side that makes it "extra" slow in the non-W/C mode..

Korona · **Joined:** Thu May 17, 2007 1:27 pm **Posts:** 999

Brendan wrote:

Korona wrote:

You'll have to use the PAT (or the corresponding PT flags if the PAT is not supported) anyway at some point because you won't have enough MTRRs for each device that wants to perform DMA (PCI DMA to cached pages is not allowed; PCIe does allow it by issuing cache snoops which kill performance especially on NUMA systems).

That's very much wrong. There are no problems with PCI devices doing DMA to normal RAM configured as write-back, write-through, etc. There would be a potential problem with DMA to RAM configured as WC caused by "buffer on the side that has nothing to do with normal caches", but even in that case you can probably work around it with fences to ensure the data is out of that "buffer on the side" before you begin the DMA.

As far as I understand something like the following can happen if you DMA to WB or WT memory:
Assume you want to perform a DMA write to some physical page p.
- Some CPU c1 (e.g. speculatively, because the page is mapped as WB somewhere) fetches some cache lines from p.
- The device writes to p (and does not participate in the cache coherency protocol).
- Another CPU c2 fetches the same cache line from p (and marks it as shared).
Now two CPUs have the same line in their caches (both of them marked as shared) but with different contents. If this is detected it may result in a MCE (?).

Is my understanding wrong in this regard? If it is correct I don't see how using fences can ensure coherency in this situation (in a race-free manner). Sure, I can clflush/wbinvd after the DMA completes (on all CPUs) but that might race with speculative fetches.

Brendan wrote:

For code that is even less well optimised (e.g. writes all pixels to display memory regardless of whether they changed or not, and doesn't use SSE or AVX or non-temporal stores) I'd still expect that using WC (in the MTRR) might make it no more than 10 times faster (because "pointless writes that should've been avoided" would be affecting "with WC" and "without WC" the same).

WC does not only affect WC buffering but also memory ordering. Memory ordering should have a great influence even on optimized code.

Brendan · **Posted:** Wed Jan 04, 2017 11:58 am

Hi,

Korona wrote:

Brendan wrote:

Korona wrote:

You'll have to use the PAT (or the corresponding PT flags if the PAT is not supported) anyway at some point because you won't have enough MTRRs for each device that wants to perform DMA (PCI DMA to cached pages is not allowed; PCIe does allow it by issuing cache snoops which kill performance especially on NUMA systems).

That's very much wrong. There are no problems with PCI devices doing DMA to normal RAM configured as write-back, write-through, etc. There would be a potential problem with DMA to RAM configured as WC caused by "buffer on the side that has nothing to do with normal caches", but even in that case you can probably work around it with fences to ensure the data is out of that "buffer on the side" before you begin the DMA.

As far as I understand something like the following can happen if you DMA to WB or WT memory:
Assume you want to perform a DMA write to some physical page p.
- Some CPU c1 (e.g. speculatively, because the page is mapped as WB somewhere) fetches some cache lines from p.
- The device writes to p (and does not participate in the cache coherency protocol).

The device does participate in the cache coherency protocol; or possibly more correctly, read and write requests that originated from devices are received by something (e.g. memory controller) and that something is responsible for ensuring coherency (e.g. by forcing "modified, write-back" to be written back for reads and writes, and also invalidation on writes).
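As a rough illustration of that idea, here is a toy model in C of one memory location, two CPU caches, and a "memory controller" path for device accesses. Everything in it (`toy_system`, `cpu_read`, `device_write` and so on) is made up for this sketch and is not how any real chipset is implemented; it only shows the rule being described: a device write forces modified data back to RAM and invalidates cached copies, so the c1/c2 scenario above cannot end with two different values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_CPUS 2

/* Toy model: one cache line per CPU plus RAM, all for a single address. */
struct toy_system {
    bool     valid[NUM_CPUS];   /* does this CPU hold a copy? */
    bool     dirty[NUM_CPUS];   /* is that copy modified relative to RAM? */
    uint32_t cached[NUM_CPUS];  /* the cached value */
    uint32_t ram;               /* the value in main memory */
};

/* Write any modified copy back so RAM holds the latest data. */
static void write_back_dirty(struct toy_system *s)
{
    for (int i = 0; i < NUM_CPUS; i++) {
        if (s->valid[i] && s->dirty[i]) {
            s->ram = s->cached[i];
            s->dirty[i] = false;
        }
    }
}

/* CPU read: hit in its own cache, otherwise fill from (up-to-date) RAM. */
uint32_t cpu_read(struct toy_system *s, int cpu)
{
    assert(cpu >= 0 && cpu < NUM_CPUS);
    if (!s->valid[cpu]) {
        write_back_dirty(s);            /* snoop other caches first */
        s->cached[cpu] = s->ram;
        s->valid[cpu] = true;
        s->dirty[cpu] = false;
    }
    return s->cached[cpu];
}

/* CPU write: invalidate every other copy, keep the new value as modified. */
void cpu_write(struct toy_system *s, int cpu, uint32_t value)
{
    assert(cpu >= 0 && cpu < NUM_CPUS);
    write_back_dirty(s);
    for (int i = 0; i < NUM_CPUS; i++) {
        if (i != cpu) {
            s->valid[i] = false;
        }
    }
    s->cached[cpu] = value;
    s->valid[cpu] = true;
    s->dirty[cpu] = true;
}

/* Device write, as seen by the memory controller: write back, invalidate, store. */
void device_write(struct toy_system *s, uint32_t value)
{
    write_back_dirty(s);
    for (int i = 0; i < NUM_CPUS; i++) {
        s->valid[i] = false;
    }
    s->ram = value;
}

/* Device read: modified data is written back before RAM is read. */
uint32_t device_read(struct toy_system *s)
{
    write_back_dirty(s);
    return s->ram;
}
```

Replaying the earlier scenario in this model: CPU 0 reads the line, the device writes a new value, then CPU 1 and CPU 0 both read. Both get the device's value, because CPU 0's stale copy was invalidated by the device write.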

Korona wrote:

Brendan wrote:

For code that is even less well optimised (e.g. writes all pixels to display memory regardless of whether they changed or not, and doesn't use SSE or AVX or non-temporal stores) I'd still expect that using WC (in the MTRR) might make it no more than 10 times faster (because "pointless writes that should've been avoided" would be affecting "with WC" and "without WC" the same).

WC does not only affect WC buffering but also memory ordering. Memory ordering should have a great influence even on optimized code.

I'd find it hard to believe that ordering is responsible for "about a hundred times slower than expected".

I'm assuming that CPU, RAM and VRAM are all "faster than PCI bus/link" and PCI bus/link is the bottleneck. For that assumption ordering of writes across the PCI bus/link either makes no difference or strict sequential order is better; and for both cases (uncached and WC) I'd expect sequential order almost always (close enough to "100% always" for any difference to be insignificant).

I'm thinking more along the lines of "takes slightly more than a time slice rather than slightly less, so it gets hit with a task switch that causes CPU to run some other task for many ms".

Cheers,

Brendan

Korona · **Joined:** Thu May 17, 2007 1:27 pm **Posts:** 999

Hi,

Brendan wrote:

The device does participate in the cache coherency protocol; or possibly more correctly, read and write requests that originated from devices are received by something (e.g. memory controller) and that something is responsible for ensuring coherency (e.g. by forcing "modified, write-back" to be written back for reads and writes, and also invalidation on writes).

That makes a lot of sense. Is this documented somewhere? I always thought PCI (non PCIe) transactions didn't participate in cache coherency but I don't remember where I got that info from. PCIe has a "no snoop" attribute that explicitly controls this behavior.

Brendan wrote:

I'm assuming that CPU, RAM and VRAM are all "faster than PCI bus/link" and PCI bus/link is the bottleneck. For that assumption ordering of writes across the PCI bus/link either makes no difference or strict sequential order is better; and for both cases (uncached and WC) I'd expect sequential order almost always (close enough to "100% always" for any difference to be insignificant).

That is right. What I meant by "ordering" is that WC allows the CPU to post multiple writes to main memory (or in this case to the PCI bus) without waiting for a single write to complete (see SDM section 11.3). With UC it has to wait until main memory signals completion after every single write (it has to respect store-store ordering and there is no cache that takes care of it like in the WB case).
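A back-of-the-envelope check of what that would mean, as a small C sketch. The assumptions are mine and not measured: that with UC each `rep movsq` iteration goes out as its own 8-byte write, that with WC the writes leave as full 64-byte lines, and that the throughput figures are the rough ones reported earlier in the thread (about 40 megabytes per second uncached, about 12 gigabytes per second with WC). The function names are made up for the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Number of separate stores per second needed to sustain
 * `bytes_per_second` when each store carries `store_size` bytes. */
double stores_per_second(double bytes_per_second, uint32_t store_size)
{
    assert(store_size > 0);
    return bytes_per_second / store_size;
}

/* Average time available per store, in nanoseconds. */
double ns_per_store(double bytes_per_second, uint32_t store_size)
{
    return 1e9 / stores_per_second(bytes_per_second, store_size);
}
```

Under those assumptions, `ns_per_store(40e6, 8)` gives 200 ns per 8-byte write for the uncached case, while `ns_per_store(12e9, 64)` gives roughly 5.3 ns per 64-byte line for WC. A couple of hundred nanoseconds per write would be consistent with each write waiting for some kind of round trip, but it does not prove that this is what is happening.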

In theory non-temporal writes should also allow bypassing store-store ordering, but I do not know if non-temporal stores to UC memory actually work.

Brendan · **Posted:** Wed Jan 04, 2017 2:30 pm

Hi,

Korona wrote:

Brendan wrote:

The device does participate in the cache coherency protocol; or possibly more correctly, read and write requests that originated from devices are received by something (e.g. memory controller) and that something is responsible for ensuring coherency (e.g. by forcing "modified, write-back" to be written back for reads and writes, and also invalidation on writes).

That makes a lot of sense. Is this documented somewhere?

I can't think of anywhere that it's documented; but 80x86 has always been cache coherent and far too much would break if it wasn't.

Korona wrote:

I always thought PCI (non PCIe) transactions didn't participate in cache coherency but I don't remember where I got that info from. PCIe has a "no snoop" attribute that explicitly controls this behavior.

For PCI (from host bridge to end point device) there's no need to worry about cache coherency because it's handled elsewhere. For PCIe, the "no snoop" seems to only be used for isochronous transfers that have strict timing (latency) requirements, and isn't used for normal reads/writes for bus mastering or DMA, and I'd be tempted to assume it exists to prevent any unexpected additional latency caused by (e.g.) caches writing back data from caches to RAM in response to a snooped write.

Korona wrote:

Brendan wrote:

I'm assuming that CPU, RAM and VRAM are all "faster than PCI bus/link" and PCI bus/link is the bottleneck. For that assumption ordering of writes across the PCI bus/link either makes no difference or strict sequential order is better; and for both cases (uncached and WC) I'd expect sequential order almost always (close enough to "100% always" for any difference to be insignificant).

That is right. What I meant by "ordering" is that WC allows the CPU to post multiple writes to main memory (or in this case to the PCI bus) without waiting for a single write to complete (see SDM section 11.3). With UC it has to wait until main memory signals completion after every single write (it has to respect store-store ordering and there is no cache that takes care of it like in the WB case).

You can send "write requests" in program order without waiting for any kind of acknowledgement to come back - PCI doesn't reorder in transit requests and if anything goes wrong you get asynchronous notification (e.g. machine check exception, NMI) at some point after the write was considered completed.

Essentially, "waiting for write to complete" means "waiting for memory controller or northbridge or PCI host bridge to say the write request has been forwarded to PCI" and doesn't mean "waiting for an acknowledgement to come back from a device all the way on the other side of a PCI bus/link".

Note that (if you have access to the PCI Express Base Specification) there's a list of memory transaction types (in "2.1.1.1. Memory Transactions"), which include things like "memory write request", "completion without data", "completion with data", etc. All of the "completions" are described as either being used for reads (where the CPU is waiting for the data to be fetched) or for writes to IO ports or PCI configuration space; and there are no completions for writes to memory mapped IO (even for the "status other than successful completion" case). This also helps to explain why memory mapped IO writes are faster than IO port writes (and PCI configuration space writes); and why writes to video memory are faster than reads from video memory. Basically, writes to memory mapped IO are mostly "fire and forget" (once they reach PCI).

Korona wrote:

In theory non-temporal writes should also allow bypassing store-store ordering, but I do not know if non-temporal stores to UC memory actually work.

The CPU has a set of write-combining buffers. Writes using the WC type (as determined by MTRRs or PAT) get shoved into the write combining buffers (and bypass normal caches); and non-temporal stores get shoved into the write combining buffers (and bypass normal caches). If non-temporal stores didn't work for UC, then the WC type (as determined by MTRRs or PAT) wouldn't work for UC either.
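For reference, a minimal sketch of what a copy using non-temporal stores looks like in C with the SSE2 intrinsics (`_mm_load_si128`, `_mm_stream_si128`, `_mm_sfence`). The function name `copy_streaming` and the alignment requirements are choices made for this sketch. It only shows the mechanics of issuing such stores and draining them with a fence; it says nothing about how a particular CPU treats them when the destination is mapped UC, which would need to be checked against the SDM and measured.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Copy `bytes` from src to dst using non-temporal (streaming) stores.
 * Both pointers must be 16-byte aligned and `bytes` a multiple of 16. */
void copy_streaming(void *dst, const void *src, size_t bytes)
{
    assert(((uintptr_t)dst % 16) == 0);
    assert(((uintptr_t)src % 16) == 0);
    assert(bytes % 16 == 0);

    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < bytes / 16; i++) {
        __m128i v = _mm_load_si128(&s[i]);  /* ordinary aligned load */
        _mm_stream_si128(&d[i], v);         /* MOVNTDQ: non-temporal store */
    }

    /* Make the streamed stores globally visible before anything that
     * follows, e.g. before telling a device to start reading the data. */
    _mm_sfence();
}
```

In a framebuffer copy, `dst` would be the (page-aligned) LFB mapping and `src` the back buffer, so the alignment requirements are met automatically.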

Cheers,

Brendan

johnsa · **Joined:** Mon Oct 15, 2007 3:04 pm **Posts:** 296

Oddly, in my case I have everything disabled at the moment, interrupts, scheduler, timers.. so this is running as a single task with no switching at all. So the performance variance I get is purely down to "something" changing between WC and UC that is so significant. I guess it would be hard to show you without taking a video capture of it perhaps with and without to see the difference (I can do that if you'd be interested?)

Brendan · **Posted:** Wed Jan 04, 2017 4:03 pm

Hi,

johnsa wrote:

Oddly, in my case I have everything disabled at the moment, interrupts, scheduler, timers.. so this is running as a single task with no switching at all. So the performance variance I get is purely down to "something" changing between WC and UC that is so significant. I guess it would be hard to show you without taking a video capture of it perhaps with and without to see the difference (I can do that if you'd be interested?)

I believe you are getting 5 fps for uncached (which is what a video would show); but I don't see how the difference can be purely because of UC vs. WC (and a video won't help for that).

What might help is knowing if the writes are aligned or not; and a description of the hardware involved (CPU, video card, bus types/speeds, memory type/speed, chipset).

For one example, maybe you're using a CPU with built-in eDRAM and integrated graphics; and when you use WC the writes go to eDRAM and don't go to display memory at all, and when you use UC the writes aren't cached by eDRAM and do go to display memory.

Cheers,

Brendan

johnsa · **Joined:** Mon Oct 15, 2007 3:04 pm **Posts:** 296

The back-buffer is aligned on a page boundary, and each pixel is drawn in sequence, so no surprises there. The LFB is also page aligned and it's just a rep movsq between them.

CPU is an Intel i7 Broadwell, in a Dell laptop.
Graphics is an integrated Intel HD 5500.
DDR3 1600 MHz (I think, offhand).
Chipset is Wildcat Point-LP.

I had considered the integrated eDRAM too as something which under WC may be considerably faster than would normally be the case if writing to a discrete gfx card over PCIe.

Brendan · **Posted:** Thu Jan 05, 2017 1:27 am

Hi,

johnsa wrote:

The back-buffer is aligned on a page boundary, and each pixel is drawn in sequence, so no surprises there. The LFB is also page aligned and it's just a rep movsq between them.

CPU is an Intel i7 Broadwell, in a Dell laptop.
Graphics is an integrated Intel HD 5500.
DDR3 1600 MHz (I think, offhand).
Chipset is Wildcat Point-LP.

I had considered the integrated eDRAM too as something which under WC may be considerably faster than would normally be the case if writing to a discrete gfx card over PCIe.

Hrm - it's not eDRAM.

I've mostly reached the point where it's unexplainable (due to lack of documentation). I've been assuming "PCI of some sort" (mostly PCI-Express); but the CPU, memory controller and video are all in the same chip and in that case PCI doesn't apply - Intel are free to do whatever they like and have no reason to comply with PCI specs internally (beyond creating a plausible illusion). Unfortunately the communication between a core, the memory controller and the GPU isn't documented by Intel, mostly because only Intel's chip designers would need that information.

One thing I've noticed is that (with 1500 frames per second and 1920*1080*4 bytes per frame), for the WC case, you'd be reading about 11.6 GiB of data per second and writing 11.6 GiB of data per second (so crudely 23.2 GiB/s total); and DDR3-1600 has a peak bandwidth of about 11.9 GiB/s (12.8 GB/s) per channel, for a total of about 23.8 GiB/s for dual channel. Essentially, for the WC case (allowing a little for misc. overheads, like video reading to refresh the monitor) I think the bottleneck is RAM bandwidth alone.
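The copy-traffic arithmetic above can be reproduced with a few lines of C. This is only a sketch of the calculation: the function names are made up here, and the 1920*1080*4 frame size is the one assumed in the paragraph above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Bytes in one frame of the given geometry. */
uint64_t frame_bytes(uint32_t width, uint32_t height, uint32_t bytes_per_pixel)
{
    return (uint64_t)width * height * bytes_per_pixel;
}

/* Throughput in GiB/s when `fps` frames of `bytes` each move per second. */
double gib_per_second(uint64_t bytes, double fps)
{
    assert(fps >= 0.0);
    return (double)bytes * fps / (1024.0 * 1024.0 * 1024.0);
}

/* Print the one-way and combined (read + write) traffic of the frame copy. */
void print_copy_traffic(double fps)
{
    uint64_t bytes = frame_bytes(1920, 1080, 4);
    double one_way = gib_per_second(bytes, fps);
    printf("%.0f fps: %.2f GiB/s each way, %.2f GiB/s total\n",
           fps, one_way, 2.0 * one_way);
}
```

Calling `print_copy_traffic(1500.0)` reports about 11.59 GiB/s each way and about 23.17 GiB/s in total, matching the figures quoted above.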

Of course that doesn't help much to explain why UC is so slow in comparison; and doesn't mean that you can't make WC significantly faster (e.g. by only moving data that changed) or that making WC significantly faster won't also make UC significantly faster, and won't also reduce the performance difference between WC and UC.

It also doesn't mean you'd get similar performance differences on other computers. For a simple example, for discrete (rather than integrated) video it would have to comply with PCI and your WC performance will be worse (PCI bandwidth limit rather than RAM bandwidth limit), and likely to be much closer to the UC performance.

Cheers,

Brendan

Korona · **Joined:** Thu May 17, 2007 1:27 pm **Posts:** 999

Brendan wrote:

I can't think of anywhere that it's documented; but 80x86 has always been cache coherent and far too much would break if it wasn't.

I see. Okay, I guess this means that I can update my DMA code to allow weaker caching behavior. Thank you for taking the time to answer my comments. This really cleared up some misconceptions I had before.

johnsa · **Joined:** Mon Oct 15, 2007 3:04 pm **Posts:** 296

My other machine should be up and running shortly for testing. It's using an Nvidia board, so it will be interesting to see the comparison of UC vs. WC on that machine, as the PCIe bus will have to come into play there.
I will let you know my findings/results.

OSDev.org
