Hi,
oscoder wrote:
2. Caching flags in a page table entry. I (roughly) understand what the write-through flag does. What I don't get is what the use cases are. When would this be used? Is there anything I *should* be using it for? If I don't have this flag set, when does data get written to physical memory?
There are two uses. The first is for memory-mapped devices - if you don't use MTRRs to control cacheability (e.g. because you ran out of variable-range MTRRs) then you can use the "slightly less good" paging flags instead.
The second case is cache management for normal software. Caches rely on the assumption that recently used data is more likely to be used again soon; but sometimes that assumption is false, and in those cases caches become less efficient. For an example, imagine your application is logging data, where data that's added to the log is not likely to be used again soon. If that "not likely to be used again" data is cached, then "more likely to be used again" data has to be evicted from the cache to make room for it, and the performance of your application suffers because of those evictions. For old CPUs (that don't support CLFLUSH or non-temporal moves) the application could ask the kernel to make the log's pages "uncached" to avoid this problem. Of course for newer CPUs it's easier to use CLFLUSH and/or non-temporal moves (and/or prefetching) for cache management instead.
oscoder wrote:
3. Shared memory and caching on multi-processor systems. When writing to a page mapped into the address space of two processors/cores, do I need to do anything to make sure both processors "see" it? (eg enabling the write-through bit) By shared memory I don't mean global memory - eg I mean memory available to two or more processes, but not to *every* process
For normal caches (excluding TLBs), it's all cache coherent and you don't need to do anything. However, you may need to be aware of store-forwarding problems. Store forwarding is where a value being stored is forwarded directly to a later load; this can cause problems in extremely rare cases (typically involving using memory alone for synchronisation) if another CPU modifies the value after it was stored but before it was loaded. For a pathological case, consider something like this code (which might be waiting until another CPU finishes doing some work and modifies "foo" to say the work was finished):
Code:
.wait:
mov [foo],eax ;This store
cmp eax,[foo] ;..may be forwarded to here, causing it to work like "cmp eax,eax" without reading from memory
je .wait ;..and turning this into an infinite loop
oscoder wrote:
4. Paging structures and caching. When changing a paging structure mapped into virtual memory, should I do anything to make sure it's written to physical memory too? Or is calling the INVLPG instruction good enough?
You have this backwards. You modify physical memory, then use INVLPG to tell the CPU it needs to update the TLB entry from physical memory. The INVLPG instruction only affects one CPU, and when there are multiple CPUs that could have the old translation in their TLBs you need to do INVLPG on all of them. This is called "multi-CPU TLB shootdown" and typically involves sending an "inter-processor interrupt"/IPI to the other CPUs (where the IPI handler does the INVLPG). Because this is expensive there are multiple tricks to avoid it in various cases, starting with "lazy TLB invalidation".
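The sequence might be sketched like this in kernel-style C (not runnable code - everything here except the INVLPG instruction itself is a hypothetical name standing in for whatever your kernel provides):

Code:
```c
/* Flush one stale TLB entry on the current CPU */
static inline void invlpg(void *virt)
{
    __asm__ volatile("invlpg (%0)" :: "r"(virt) : "memory");
}

void unmap_page(uint64_t *pte, void *virt)
{
    *pte = 0;                             /* 1. modify the page table in RAM */
    invlpg(virt);                         /* 2. flush this CPU's TLB entry */
    send_ipi_to_others(IPI_INVLPG, virt); /* 3. hypothetical: IPI so every
                                                other CPU's handler does
                                                INVLPG too */
    wait_for_acks();                      /* 4. hypothetical: wait until all
                                                CPUs have acknowledged */
}
```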
Cheers,
Brendan