Hi,
Korona wrote:
Brendan wrote:
Also note that the CPU has to assume that the same page might be mapped multiple times with different PAT values (in different virtual address spaces or in the same virtual address space), and therefore the data could (e.g.) be cached even when the PAT says "uncached". The result of this assumption is that (for most cases) the CPU has to do extra work when the PAT is used to modify the "base cache-ability" described by MTRRs. Basically, it's better to use MTRRs if you can (so that CPU doesn't have to do extra work), and only use PAT if you can't use MTRRs.
Are you sure about that (specifically the performance hit when using the PAT)?
Yes. Take a look at the notes for "Table 11-7. Effective Page-Level Memory Types for Pentium III and More Recent Processor Families", where it says:
1. The UC attribute comes from the MTRRs and the processors are not required to snoop their caches since the data could never have been cached. This attribute is preferred for performance reasons.
2. The UC attribute came from the page-table or page-directory entry and processors are required to check their caches because the data may be cached due to page aliasing, which is not recommended.
Korona wrote:
Intel explicitly states that mapping the same page with different memory types is undefined behavior:
Section 11.12.4: Programming the PAT wrote:
The PAT allows any memory type to be specified in the page tables, and therefore it is possible to have a single physical page mapped to two or more different linear addresses, each with different memory types. Intel does not support this practice because it may lead to undefined operations that can result in a system failure. In particular, a WC page must never be aliased to a cacheable page because WC writes may not check the processor caches.
The manual also states that you must issue a wbinvd (i.e. do the same thing you would do if you reprogrammed the PAT) before you use a WC mapping of a previously cached page unless the processor supports self-snooping. I guess that scenarios different from cached -> WC might work by chance (i.e. because no support in the cache coherency protocol is required for them) but are not architecturally supported either.
That section is slightly badly worded - it's only talking about WC, not about the other caching types. This is because WC is not like any of the other caching types and is not really a true caching type at all - WC is more accurately described as "uncached as far as normal caches are concerned, but where the CPU combines writes in a buffer on the side that has nothing to do with normal caches". It's this "buffer on the side that has nothing to do with normal caches" that causes the potential problems.
Korona wrote:
As the PAT cannot be disabled on processors that support it I always use it when it is available (e.g. on x86_64).
PAT can't be disabled in the same way that segmentation (in protected mode) can't be disabled - in both cases you can achieve "effectively disabled" by configuring it to do nothing (e.g. "base = 0, limit = 4 GiB" for segments). For PAT, "configured to do nothing/effectively disabled" is the default setting (where it behaves identically to older CPUs that didn't have PAT and only had the PCD and PWT flags).
Korona wrote:
AFAIR this is the same strategy Linux uses. As you said, reprogramming MTRRs is hard in general and might be impossible to do correctly for unknown chipsets.
Normally "same strategy Linux uses" means that it's bad; however for this case I don't think that applies (if the PAT exists there's no real reason not to use it). Note that Linux does support fully reconfiguring MTRRs during boot (for when firmware doesn't configure MTRRs well) and will use MTRRs (and not PAT) for things like device's memory mapped IO areas if it can.
Korona wrote:
You'll have to use the PAT (or the corresponding PT flags if the PAT is not supported) anyway at some point because you won't have enough MTRRs for each device that wants to perform DMA (PCI DMA to cached pages is not allowed; PCIe does allow it by issuing cache snoops which kill performance especially on NUMA systems).
That's very much wrong. There are no problems with PCI devices doing DMA to normal RAM configured as write-back, write-through, etc. There would be a potential problem with DMA to RAM configured as WC caused by "buffer on the side that has nothing to do with normal caches", but even in that case you can probably work around it with fences to ensure the data is out of that "buffer on the side" before you begin the DMA.
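For the WC case, a minimal sketch of that "fence before you begin the DMA" idea might look like the following (the dma_request structure and the doorbell step are hypothetical - the real details depend entirely on the device):

```c
#include <stdint.h>
#include <immintrin.h>  /* _mm_sfence */

/* Hypothetical descriptor telling a device where to read from. */
struct dma_request {
    const void *src;
    uint32_t len;
};

void start_dma(volatile struct dma_request *req, const void *buf, uint32_t len)
{
    /* The caller has already filled buf. If buf is mapped WC, some of
       those stores may still be sitting in the CPU's write-combining
       buffers rather than in RAM. */
    req->src = buf;
    req->len = len;

    /* Drain the write-combining buffers so the data is globally visible
       before the device is told to start reading. */
    _mm_sfence();

    /* ...now ring the device's doorbell register (device-specific). */
}
```

For normal write-back RAM none of this is needed for PCI/PCIe DMA correctness - the fence only matters because of WC's "buffer on the side".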
Korona wrote:
Brendan wrote:
That doesn't add up. How much time do you spend generating the data to blit, and how much time (e.g. in nanoseconds or something) do you spend doing the blitting itself for each case? How are you blitting (e.g. a simple "for each row of pixels { if row of pixels changed, copy row of pixels from buffer to display memory" loop)? Are the writes aligned on a 8-byte boundary?
What performance should be expected then? WC allows the CPU to buffer writes which should not improve the performance of rep movsq. However it also allows the CPU to issue delayed writes. With UC each iteration of rep movsq has to wait until the previous iteration hit main memory. With WC the CPU can issue many iterations of rep movsq concurrently and does not have to wait for main memory.
When unknown software running on an unknown number of unknown CPUs (with unknown cache sizes, speed, etc) is spending an unknown amount of time to generate pixel data in unknown ways in a buffer of unknown size in unknown RAM, and then doing unknown things (with unknown alignment, etc) to blit that data across an unknown bus to an unknown video controller; I would be "extremely shocked" if it didn't take exactly 1234.5678 nanoseconds.
For code that is well optimised (e.g. avoids writing pixels that didn't change to display memory, uses SSE or AVX, uses non-temporal stores, etc) I would expect that using WC (in the MTRR) makes no difference at all.
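For example, a non-temporal copy loop for one row of pixels (a sketch assuming SSE2, 16-byte aligned pointers and a length that's a multiple of 16 bytes) would look something like this - the streamed stores bypass the caches and get write-combined regardless of what the MTRRs say:

```c
#include <stdint.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_load_si128, _mm_sfence */

/* Copy len bytes using non-temporal stores. Assumes dst and src are
   16-byte aligned and len is a multiple of 16. */
void copy_row_nt(void *dst, const void *src, size_t len)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < len / 16; i++)
        _mm_stream_si128(&d[i], _mm_load_si128(&s[i]));
    _mm_sfence();  /* make the streamed data globally visible */
}
```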
For code that is less well optimised (e.g. avoids writing pixels that didn't change to display memory, but doesn't use SSE or AVX or non-temporal stores) I would expect that using WC (in the MTRR) might make it no more than 10 times faster (and that the time spent to blit the data is negligible compared to the time spent generating that pixel data in the first place).
For code that is even less well optimised (e.g. writes all pixels to display memory regardless of whether they changed or not, and doesn't use SSE or AVX or non-temporal stores) I'd still expect that using WC (in the MTRR) might make it no more than 10 times faster (because "pointless writes that should've been avoided" would be affecting "with WC" and "without WC" the same).
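For reference, the "avoid writing pixels that didn't change" part can be sketched as a per-row comparison against a shadow copy kept in normal RAM (the buffer names and 32 bits per pixel are assumptions for illustration):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Copy only the rows that changed since the last blit. The shadow buffer
   lives in normal RAM so we never read from (slow) display memory. */
void blit_dirty_rows(uint32_t *display, uint32_t *shadow,
                     const uint32_t *back_buffer, int width, int height)
{
    size_t row_bytes = (size_t)width * sizeof(uint32_t);
    for (int y = 0; y < height; y++) {
        const uint32_t *src = back_buffer + (size_t)y * width;
        uint32_t *shd = shadow + (size_t)y * width;
        if (memcmp(shd, src, row_bytes) != 0) {
            memcpy(shd, src, row_bytes);                        /* update shadow */
            memcpy(display + (size_t)y * width, src, row_bytes); /* update screen */
        }
    }
}
```

With this structure the cost of the blit depends mostly on how many rows actually changed, which is why the MTRR/PAT setting for display memory matters far less than it does for naive "rewrite everything" code.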
If you assume that johnsa's code is spending 0.33333 ms to generate the pixel data, 0.33333 ms to blit all pixel data with a single "rep movsq" when WC is being used, and 199.666666 ms to blit all pixel data with a single "rep movsq" when WC is not being used; then that would imply that WC makes it 600 times faster. That's so far beyond the expected performance differences that something else must be causing misleading results.
Cheers,
Brendan