Disable MTRRs

johnsa · **Joined:** Mon Oct 15, 2007 3:04 pm **Posts:** 296

Hi,

So I've recently changed my memory management and paging system and while doing so wanted to see how much difference WC memory would make to the linear frame buffer. (It was huge, even more than I expected) push 1920x1080x32 from 5fps up to 1500fps!).

I was doing that purely as a quick test so to accomplish it I quickly changed 3gb-4gb large page to be write combining just to check the performance, obviously it's a terrible idea as there are many other things in that range that need to be uncached and a bit of space which can be WB.. but I digress..

I thought never mind, I'll map in the LFB range using one of the variable MTRRs, (I happen to have 10 on my test machine) and one free for use as planned.
The problem I ran into is that the MTRR setup from boot happens to cover the range where the LFB is as uncached, I don't have enough free ranges to split that and still add the write combining range.
Another issue is that when you want to start using large pages (2mb, 1gb) there are some issues with pages that span multiple mtrr ranges leading to undefined behaviour (which you really want to avoid).

So given that my minimum requirement is support for PAT and I have that setup, should I (or could I) just completely disable the MTRRs altogether and just transfer their memory types/ranges into the paging table setup?
I was also curious if switching off MTRRs might improve memory access performance as in theory it should be less for the cpu to check accesses against.
I'm running in longmode exclusively, so I wouldn't expect any legacy bios or firmware calls to happen that might require the MTRRs to still be present, I'm not sure about ACPI/SMI.. but I would think if the page tables are setup to map
the same ranges the same way as the MTRRs it should be seamless ?

It just seems cleaner to have one system responsible for memory typing that two different setups which have a bit of an impedence mismatch when it comes to larger pages , and also MTRRs being pretty out-dated in comparison to PAT.

So the question is :
1) Can I disable MTTRs completely (is it safe to do in longmode / with regards to acpi/smi/smm etc)? - Assuming that I remap pages into paging tables that mirror the mtrr settings?

Brendan · **Posted:** Mon Jan 02, 2017 7:10 pm

Hi,

johnsa wrote:

So the question is :
1) Can I disable MTTRs completely (is it safe to do in longmode / with regards to acpi/smi/smm etc)? - Assuming that I remap pages into paging tables that mirror the mtrr settings?

It's "safe" as long as all CPUs have the same MTRR configuration (and as long as you follow the sequence that Intel describes to change MTRRs on all CPUs at the same time).

However, with MTRRs disabled everything ends up as "uncached", and PAT can't change that to anything except "write combining" (and can't change it to "write-back"); so the performance of everything will be destroyed.

Also note that the CPU has to assume that the same page might be mapped multiple times with different PAT values (in different virtual address spaces or in the same virtual address space), and therefore the data could (e.g.) be cached even when the PAT says "uncached". The result of this assumption is that (for most cases) the CPU has to do extra work when the PAT is used to modify the "base cache-ability" described by MTRRs. Basically, it's better to use MTRRs if you can (so that CPU doesn't have to do extra work), and only use PAT if you can't use MTRRs.

Also, for old CPUs, some don't support PAT (but do support MTRRs), and some ancient CPUs (Cyrix) had "something like MTRRs" (I think Cyrix they called then "address range registers" or something) before Intel did (and Intel mostly stole the idea in implemented it in an incompatible way). For newer CPUs, AMD extended MTRRs in some way (I can't remember - something to do with IOMMU and/or virtualisation I think). This means that you (potentially) have about 4 different cases to worry about (nothing like MTRRs, "address range registers", MTRRs, and "extended MTRRs"); and could/should provide an abstraction (e.g. some kind of "change base cache-ability for physical region" function in your physical memory manager) to hide the differences.

Finally; there are 2 different use cases to worry about:

Cache-ability being changed because of memory mapped device (because a device driver asked for it)
Cache-ability being changed for performance reasons (because a normal user-space process asked for it - e.g. possibly to avoid cache pollution in rare cases where "least recently used" doesn't make sense).

For the first case (memory mapped devices); you should use MTRRs or equivalent whenever possible, and then use PAT (if present) as a fall-back in cases where you can't use MTRRs or equivalent (e.g. because you ran out of variable MTRRs).

For the second case (normal "write-back" RAM being changed for performance reasons by user-space) you only want to use PAT (and don't want to use MTRRs or equivalent) and if PAT can't be used (e.g. not supported) then do nothing (consider it a "performance hint" that can be ignored).

johnsa wrote:

So I've recently changed my memory management and paging system and while doing so wanted to see how much difference WC memory would make to the linear frame buffer. (It was huge, even more than I expected) push 1920x1080x32 from 5fps up to 1500fps!).

If your code is extremely bad (e.g. not well optimised and does lots of tiny writes, possibly including writing the same pixel/s multiple times), then WC can make a huge difference.

Cheers,

Brendan

johnsa · **Joined:** Mon Oct 15, 2007 3:04 pm **Posts:** 296

Hi,

Thanks for the info.

I was basically just writing to a normal cached buffer for the screen then transferring it to the LFB with a rep movsq. Without WC it was about 5fps, with about 1500fps.. so no individual pixel writes etc, I was surprised by the increase myself, it runs about the same speed as the same code (software rendering) does under windows 7 64bit, in fact it's about 20% faster.. but I take that with a pinch of salt without a full stack of drivers running firing off interrupts and 50 odd tasks switching.

The problem I have with the mtrr's is that the configuration as it stands leaves me with only a single free variable mtrr and the range I need for the LFB to be marked as write combining is already in a range marked uncached. If I split the range, uncached -> write combine -> uncached again I'd need two free mtrrs to map it which I don't have and then mtrr's only allow you to have overlapping areas when the types are writeback and uncached, even if I could the uncached would take priority over write combining where they overlap so that doesn't seem to be an option.

I didn't realise that the PAT entries couldn't convert accesses to write back if the mtrr's were disabled, that is bad news in deed! As now.. Even if I wanted to mark the LFB as write combining with the PAT, that PAT entries will overlap an MTRR saying it's uncached.. which I assume will take precedence..

The only thing I can think of doing is setting the default mtrr type to uncached (which it should be anyway) and then removing all ranges from the mtrr's that are uncached, as that would then be the default for any memory not covered by an mtrr.
Then the MTRR's would only serve to map areas as Write Back or Write Combine .. thus freeing up a bunch of mtrr's .. the other way would be to make the default cached and then only map areas that are uncached or write-back (which is opposite to the intel manual suggestion of the default being uncached).. but either way my mtrr's contain a mix of type 6 and 0 .. which seems silly as one of those should be the default and we should be able to remove the others.

Why this has to be such a mess and a pain grrr

Octocontrabass · **Joined:** Mon Mar 25, 2013 7:01 pm **Posts:** 5143

johnsa wrote:

Even if I wanted to mark the LFB as write combining with the PAT, that PAT entries will overlap an MTRR saying it's uncached.. which I assume will take precedence..

Nope! According to Intel, if the MTRR says memory should be UC and the PAT says memory should be WC, the result is WC. (I believe the same applies to AMD, but I haven't checked AMD's manuals to confirm.)

With that said, you still shouldn't map a large page across multiple MTRRs. That tends to make bad things happen.

johnsa · **Joined:** Mon Oct 15, 2007 3:04 pm **Posts:** 296

Do you know off-hand where in the manual it mentions that ? I can't find any mention of the a PAT W/C entry taking precedence over a MTRR UC type ?

Thanks!

I see in the AMD Manual section 7.8.5 it says a PAT entry of WC and - for the MTRR type (which I assume to mean any) results in WC .. I would imagine Intel must be the same for consistency.

Octocontrabass · **Joined:** Mon Mar 25, 2013 7:01 pm **Posts:** 5143

johnsa wrote:

Do you know off-hand where in the manual it mentions that ?

Volume 3, chapter 11, section 11.5.2.2.

Brendan · **Posted:** Tue Jan 03, 2017 10:42 pm

Hi,

johnsa wrote:

I was basically just writing to a normal cached buffer for the screen then transferring it to the LFB with a rep movsq. Without WC it was about 5fps, with about 1500fps..

That doesn't add up. How much time do you spend generating the data to blit, and how much time (e.g. in nanoseconds or something) do you spend doing the blitting itself for each case? How are you blitting (e.g. a simple "for each row of pixels { if row of pixels changed, copy row of pixels from buffer to display memory" loop)? Are the writes aligned on a 8-byte boundary?

johnsa wrote:

I didn't realise that the PAT entries couldn't convert accesses to write back if the mtrr's were disabled, that is bad news in deed! As now.. Even if I wanted to mark the LFB as write combining with the PAT, that PAT entries will overlap an MTRR saying it's uncached.. which I assume will take precedence..

In general, PAT can only make caching worse (e.g. "write-back" can be changed to "write-through" or "uncached"; "write-through" can be changed to "uncached", etc). Write-combining is a special case, and (if CPU supports both PAT and WC) you can use PAT to convert "uncached" to "write-combining".

johnsa wrote:

The only thing I can think of doing is setting the default mtrr type to uncached (which it should be anyway) and then removing all ranges from the mtrr's that are uncached, as that would then be the default for any memory not covered by an mtrr. Then the MTRR's would only serve to map areas as Write Back or Write Combine .. thus freeing up a bunch of mtrr's .. the other way would be to make the default cached and then only map areas that are uncached or write-back (which is opposite to the intel manual suggestion of the default being uncached).. but either way my mtrr's contain a mix of type 6 and 0 .. which seems silly as one of those should be the default and we should be able to remove the others.

If firmware's default is currently "write-back", then changing the default to "uncached" means that all RAM will become "uncached" (and PAT can't change that to "write-back" so the performance of everything involving RAM would be severely crippled). To fix that you'd have to create new MTRRs to describe RAM as "write-back". Maybe this means that you can delete 2 existing MTRRs for "uncached" and have to create 3 new entries for "write-back", and then you'd have no MTRRs left over for the "write-combining" anyway.

Note that some firmware simply isn't very good at creating the MTRRs; and maybe you can save some MTRRs by reconfiguring all of them. This is a relatively risking proposition - it's hard to write code that generates "ideal MTRRs" that works for all possible cases (with different memory maps, different CPUs with different numbers of MTRRs and different "overlapping" rules), and its impossible to test all possible cases; which means there's a good chance of "works on some computers" (which is the same as "fails on other computers but you don't know that").

johnsa wrote:

Why this has to be such a mess and a pain grrr

Improving performance or capabilities always increases complexity, so (unless you're willing to accept "bad but simple") everything is always complicated.

Cheers,

Brendan

Korona · **Joined:** Thu May 17, 2007 1:27 pm **Posts:** 999

Brendan wrote:

Also note that the CPU has to assume that the same page might be mapped multiple times with different PAT values (in different virtual address spaces or in the same virtual address space), and therefore the data could (e.g.) be cached even when the PAT says "uncached". The result of this assumption is that (for most cases) the CPU has to do extra work when the PAT is used to modify the "base cache-ability" described by MTRRs. Basically, it's better to use MTRRs if you can (so that CPU doesn't have to do extra work), and only use PAT if you can't use MTRRs.

Are you sure about that (specifically the performance hit when using the PAT)? Intel explicitly states that mapping the same page with different memory types is undefined behavior:

Section 11.12.4: Programming the PAT wrote:

The PAT allows any memory type to be specified in the page tables, and therefore it is possible to have a single physical page mapped to two or more different linear addresses, each with different memory types. Intel does not support this practice because it may lead to undefined operations that can result in a system failure. In particular, a WC page must never be aliased to a cacheable page because WC writes may not check the processor caches.

The manual also states that you must issue a wbinvd (i.e. do the same thing you would do if you reprogrammed the PAT) before you use a WC mapping of a previously cached page unless the processor supports self-snooping. I guess that scenarios different from cached -> WC might work by chance (i.e. because no support in the cache coherency protocol is required for them) but are not architecturally supported either.

As the PAT cannot be disabled on processors that support it I always use it when it is available (e.g. on x86_64). AFAIR this is the same strategy Linux uses. As you said, reprogramming MTRRs is hard in general and might be impossible to do correctly for unknown chipsets.

You'll have to use the PAT (or the corresponding PT flags if the PAT is not supported) anyway at some point because you won't have enough MTRRs for each device that wants to perform DMA (PCI DMA to cached pages is not allowed; PCIe does allow it by issuing cache snoops which kill performance especially on NUMA systems).

Brendan wrote:

That doesn't add up. How much time do you spend generating the data to blit, and how much time (e.g. in nanoseconds or something) do you spend doing the blitting itself for each case? How are you blitting (e.g. a simple "for each row of pixels { if row of pixels changed, copy row of pixels from buffer to display memory" loop)? Are the writes aligned on a 8-byte boundary?

What performance should be expected then? WC allows the CPU to buffer writes which should not improve the performance of rep movsq. However it also allows the CPU to issue delayed writes. With UC each iteration of rep movsq has to wait until the previous iteration hit main memory. With WC the CPU can issue many iterations of rep movsq concurrently and does not have to wait for main memory.

Octocontrabass · **Joined:** Mon Mar 25, 2013 7:01 pm **Posts:** 5143

Korona wrote:

Are you sure about that (specifically the performance hit when using the PAT)?

That's also described in Volume 3, chapter 11, section 11.5.2.2. Check the footnotes for table 11-7.

Korona · **Joined:** Thu May 17, 2007 1:27 pm **Posts:** 999

Octocontrabass wrote:

Korona wrote:

Are you sure about that (specifically the performance hit when using the PAT)?

That's also described in Volume 3, chapter 11, section 11.5.2.2. Check the footnotes for table 11-7.

Ah I see, thank you. That means you pay a performance penalty if you use the PAT to force pages to be UC that are not UC in the MTRRs. Performance never suffers if you mark pages as WC in the PAT compared to marking them WC in the MTRRs.

The part I cited from the SDM about cached -> WC makes a lot more sense now and transitions cached <-> UC are indeed architecturally supported.

That is nice because you generally want to set DMA regions to WC and only have UC for memory mapped device registers. So you can forget about the MTRRs (and use the PAT without performance penalties) if you assume that your firmware gets at least the memory mapped registers right. Only if your firmware messes up the MTRRs for memory mapped registers you need to reprogram MTRRs to get full performance. It does not even matter if your firmware gets the MTRRs for the frame buffer right.

johnsa · **Joined:** Mon Oct 15, 2007 3:04 pm **Posts:** 296

That doesn't add up. How much time do you spend generating the data to blit, and how much time (e.g. in nanoseconds or something) do you spend doing the blitting itself for each case? How are you blitting (e.g. a simple "for each row of pixels { if row of pixels changed, copy row of pixels from buffer to display memory" loop)? Are the writes aligned on a 8-byte boundary?

The rendering code itself is a checkerboard raytrace, it determines the intersection against the board for each pixel (1920x1080). It's single-core using AVX, it takes about 5ms for the actual render code.
Under Windows using a GDI bitmap and 16byte aligned buffer (in both cases) it achieves about 140fps. Without the write-combining it was getting about 5fps for me, with write-combining on the LFB I get 180fps (render + copy to lfb with rep movsq).

Brendan · **Posted:** Wed Jan 04, 2017 7:32 am

Hi,

Korona wrote:

Brendan wrote:

Also note that the CPU has to assume that the same page might be mapped multiple times with different PAT values (in different virtual address spaces or in the same virtual address space), and therefore the data could (e.g.) be cached even when the PAT says "uncached". The result of this assumption is that (for most cases) the CPU has to do extra work when the PAT is used to modify the "base cache-ability" described by MTRRs. Basically, it's better to use MTRRs if you can (so that CPU doesn't have to do extra work), and only use PAT if you can't use MTRRs.

Are you sure about that (specifically the performance hit when using the PAT)?

Yes. Take a look at the notes for "Table 11-7. Effective Page-Level Memory Types for Pentium III and More Recent Processor Families", where it says:

1. The UC attribute comes from the MTRRs and the processors are not required to snoop their caches since the data could never have been cached. This attribute is preferred for performance reasons.
2. The UC attribute came from the page-table or page-directory entry and processors are required to check their caches because the
data may be cached due to page aliasing, which is not recommended.

Korona wrote:

Intel explicitly states that mapping the same page with different memory types is undefined behavior:

Section 11.12.4: Programming the PAT wrote:

The PAT allows any memory type to be specified in the page tables, and therefore it is possible to have a single physical page mapped to two or more different linear addresses, each with different memory types. Intel does not support this practice because it may lead to undefined operations that can result in a system failure. In particular, a WC page must never be aliased to a cacheable page because WC writes may not check the processor caches.

The manual also states that you must issue a wbinvd (i.e. do the same thing you would do if you reprogrammed the PAT) before you use a WC mapping of a previously cached page unless the processor supports self-snooping. I guess that scenarios different from cached -> WC might work by chance (i.e. because no support in the cache coherency protocol is required for them) but are not architecturally supported either.

That section is slightly badly worded. For this they are only talking about WC and not talking about other caching types. This is because WC is not like any of the other caching types and is not really a true caching type - WC is more accurately described as "uncached caching type as far as normal caches are concerned, but where CPU combines writes in a buffer on the side that has nothing to do with normal caches". It's this "buffer on the side that has nothing to do with normal caches" that causes potential problems.

Korona wrote:

As the PAT cannot be disabled on processors that support it I always use it when it is available (e.g. on x86_64).

PAT can't be disabled in the same way that segmentation (in protected mode) can't be disabled - in both cases you can achieve "effectively disabled" by configuring it to do nothing (e.g. "base = 0, limit = 4 GiB" for segments). For PAT, "configured to do nothing/effectively disabled" is the default setting (where it behaves identically to older CPUs that didn't have PAT and only had the PCD and PWT flags).

Korona wrote:

AFAIR this is the same strategy Linux uses. As you said, reprogramming MTRRs is hard in general and might be impossible to do correctly for unknown chipsets.

Normally "same strategy Linux uses" means that it's bad; however for this case I don't think that applies (if the PAT exists there's no real reason not to use it). Note that Linux does support fully reconfiguring MTRRs during boot (for when firmware doesn't configure MTRRs well) and will use MTRRs (and not PAT) for things like device's memory mapped IO areas if it can.

Korona wrote:

You'll have to use the PAT (or the corresponding PT flags if the PAT is not supported) anyway at some point because you won't have enough MTRRs for each device that wants to perform DMA (PCI DMA to cached pages is not allowed; PCIe does allow it by issuing cache snoops which kill performance especially on NUMA systems).

That's very much wrong. There are no problems with PCI devices doing DMA to normal RAM configured as write-back, write-through, etc. There would be a potential problem with DMA to RAM configured as WC caused by "buffer on the side that has nothing to do with normal caches", but even in that case you can probably work around it with fences to ensure the data is out of that "buffer on the side" before you begin the DMA.

Korona wrote:

Brendan wrote:

That doesn't add up. How much time do you spend generating the data to blit, and how much time (e.g. in nanoseconds or something) do you spend doing the blitting itself for each case? How are you blitting (e.g. a simple "for each row of pixels { if row of pixels changed, copy row of pixels from buffer to display memory" loop)? Are the writes aligned on a 8-byte boundary?

What performance should be expected then? WC allows the CPU to buffer writes which should not improve the performance of rep movsq. However it also allows the CPU to issue delayed writes. With UC each iteration of rep movsq has to wait until the previous iteration hit main memory. With WC the CPU can issue many iterations of rep movsq concurrently and does not have to wait for main memory.

When unknown software running on an unknown number of unknown CPUs (with unknown cache sizes, speed, etc) is spending an unknown amount of time to generate pixel data in unknown ways in a buffer of unknown size in unknown RAM, and then doing unknown things (with unknown alignment, etc) to blit that data across an unknown bus to an unknown video controller; I would be "extremely shocked" if it didn't take exactly 1234.5678 nanoseconds. :roll:

For code that is well optimised (e.g. avoids writing pixels that didn't change to display memory, uses SSE or AVX, uses non-temporal stores, etc) I would expect that using WC (in the MTRR) makes no difference at all.

For code that is less well optimised (e.g. avoids writing pixels that didn't change to display memory, but doesn't use SSE or AVX or non-temporal stores) I would expect that using WC (in the MTRR) might make it no more than 10 times faster (and that the time spent to blit the data is negligible compared to the time spent generating that pixel data in the first place).

For code that is even less well optimised (e.g. writes all pixels to display memory regardless of whether they changed or not, and doesn't use SSE or AVX or non-temporal stores) I'd still expect that using WC (in the MTRR) might make it no more than 10 times faster (because "pointless writes that should've been avoided" would be effecting "with WC" and "without WC" the same).

If you assume that johnsa's code is spending 0.33333 ms to generate the pixel data, 0.33333 ms to blit all pixel data with a single "rep movsq" when WC is being used, and 199.666666 ms to blit all pixel data with a single "rep movsq" when WC is being used; then that would imply that WC makes it 600 times faster. That's so far beyond the expected performance differences that something else must be causing misleading results.

Cheers,

Brendan

johnsa · **Joined:** Mon Oct 15, 2007 3:04 pm **Posts:** 296

So it looks like my approach is as follows based on the above information:

The firmware has correctly mapped all my device mmio ranges, I'm not convinced it's optimal but hey. I can use the PAT to create the Write combining area due to the special provision that MTRR=UC can be forced to WC by PAT without penalty
(So I will use that for LFB as well as any DMA accesses/buffers that require WC). So in theory, apart from transferring the BSP MTRR settings to the other cores on trampoline I shouldn't have to look at them again.

Brendan · **Posted:** Wed Jan 04, 2017 8:01 am

Hi,

johnsa wrote:

Quote:

johnsa wrote:

I was basically just writing to a normal cached buffer for the screen then transferring it to the LFB with a rep movsq. Without WC it was about 5fps, with about 1500fps..

That doesn't add up. How much time do you spend generating the data to blit, and how much time (e.g. in nanoseconds or something) do you spend doing the blitting itself for each case? How are you blitting (e.g. a simple "for each row of pixels { if row of pixels changed, copy row of pixels from buffer to display memory" loop)? Are the writes aligned on a 8-byte boundary?

The rendering code itself is a checkerboard raytrace, it determines the intersection against the board for each pixel (1920x1080). It's single-core using AVX, it takes about 5ms for the actual render code.
Under Windows using a GDI bitmap and 16byte aligned buffer (in both cases) it achieves about 140fps. Without the write-combining it was getting about 5fps for me, with write-combining on the LFB I get 180fps (render + copy to lfb with rep movsq).

Originally you said "1500 fps with WC" and now you're saying "180 fps with WC".

At 180 fps it's one frame every 5.555 ms, and with 5 fps it's one frame every 200 ms. That means with WC it's 5 ms to generate the data and 0.555 ms to blit; and without WC it's the same 5 ms to generate the data and 199.4444 ms to blit. That implies it's 0.555 vs 199.4444 ms - around 400 times faster with WC.

Now think of those writes as packets across a bus, where each packet has a header (saying that it's a write, which address is being written, the number of bytes being written, etc) and the data itself. The amount of data is the same in both cases - the difference is the amount of bandwidth consumed by "per packet overhead" (those headers, etc). Essentially it becomes "total_traffic = packets * per_packet_overhead + total_bytes". If WC has no packet overhead then it'd be "total_traffic = 1920*1080*4", and without WC it would have to be 400 times worse and therefore it'd have to be "400 * 1920*1080*4 = packets * per_packet_overhead + 1920*1080*4". That means "packets * per_packet_overhead = 399 * 1920*1080*4". We know that you're writing 8 bytes at a time, and that we're looking at "1920*1080*4 / 8 = 1036800" packets. Therefore we can estimate "per_packet_overhead = 399 * 1920*1080*4 / 1036800 = 3192".

Essentially; for what you're saying to be believable, you have to assume that sending 8 bytes across the PCI bus costs the equivalent of at least (because we ignored all packet overhead for WC when we probably shouldn't have) 3192 bytes of overhead.

That is simply not believable.

Believable might be more like 16 bytes of per packet overhead (e.g. a 1-byte packet type field, an 8 byte "address of write" field, a 16-bit size field, and 5 extra bytes for no particular reason). That would work out to a maximum performance difference of "0 + 1920*1080*4" with WC (and no packet overhead) vs. "16*1036800 + 1920*1080*4"; or "8294400 vs 24883200"; or WC being (no more than) 3 times faster.

In that case, assuming 5 ms to generate the pixels data, it would be the same 5.555 ms (180 fps) for WC, and "5+3*0.555 = 6.555 ms" (150 fps) without WC.

Cheers,

Brendan

Brendan · **Posted:** Wed Jan 04, 2017 8:11 am

Hi,

johnsa wrote:

So it looks like my approach is as follows based on the above information:

The firmware has correctly mapped all my device mmio ranges, I'm not convinced it's optimal but hey.

Firmware mostly only sets up MTRRs for RAM and "firmware special areas", and doesn't/shouldn't set up any MTRRs for any MMIO devices.

johnsa wrote:

I can use the PAT to create the Write combining area due to the special provision that MTRR=UC can be forced to WC by PAT without penalty

Yes.

johnsa wrote:

(So I will use that for LFB as well as any DMA accesses/buffers that require WC).

Don't use WC for any DMA - it's bizarre and complicates things far too much (because WC involves those "special side-buffers that normal caches don't know about"). For DMA to normal RAM (the only case that's actually likely to occur) it's simpler and usually faster to leave the RAM as "write-back".

johnsa wrote:

So in theory, apart from transferring the BSP MTRR settings to the other cores on trampoline I shouldn't have to look at them again.

I wouldn't want to assume that AP CPUs have caches fully disabled properly; which means that (for "defensive programming" against potentially buggy firmware, etc) I'd start AP CPUs before touching MTRRs, so that I can update all CPU's MTRRs at the same time.

Cheers,

Brendan

OSDev.org

Disable MTRRs

Who is online