Software page-table walk in virtual address space

josecm · **Joined:** Sun Apr 30, 2017 10:14 am **Posts:** 18

Hello, I'm struggling with the question on how to perform a software page table walk (for example, for purposes of creating a new mapping) when the operating system itself is running in a virtual address space. My question arises from the fact that the page table entries contain the physical address space of the next level table. So, how do we know the virtual address for the next level page so we can access it, How do people usually solve this problem?

Thank you in advance

bellezzasolo · **Joined:** Sun Feb 20, 2011 2:01 pm **Posts:** 110

There's not guaranteed to be a virtual address for a page table. In theory, you have to map it in.

In practice, this is usually handled by pointing the last entry (or an entry of your choosing) of the page directory or PML4 to itself. Then every page table physical address becomes mapped as a page in the top 1/1024 or 1/512 of the address space. If you look at the top slot of that space, you find the page directories, and so on. The top level paging structure is then at FFFFF000-FFFFFFFF or (FFFF)FFFFFFFFF000 to (FFFF)FFFFFFFFFFFF

The downside of this approach becomes apparent with PAE, where the PDPT has 4 entries. You'd dedicate 1/4 of the address space. For PAE, I'd instead map each page directory into the top page directory entries manually, and just map the PDPT as an ordinary page.

This is also the case on ARM with the short paging descriptor format. These aren't virtuallly identical between levels, and I don't think recursive mapping is possible. The long descriptor format, on the other hand, is trivial (and I believe so is AArch64)

josecm · **Joined:** Sun Apr 30, 2017 10:14 am **Posts:** 18

Thank you for your answer @bellezzasolo. I think what you are referring to is recursive mappings right? Is this actually implemented in practice in OSes such as Windows or Linux?

But actually, I'd like another solution. I'm actually working with the RISC-V architecture which does not support recursive mappings because each PTE must be either a pointer to table or pointer to a page.

Is there any other method you know about to tackle this issue?

bellezzasolo · **Joined:** Sun Feb 20, 2011 2:01 pm **Posts:** 110

As I understand it, Linux on x86 just uses its directly mapped physical memory at 3GB (which is a poor approach).

If you can't use recursive page tables, then you need to be able to map a particular page table on demand, which isn't too different to mapping a physical address. Of course, if you have no page table mapped, or only full page tables, you end up unable to map any more.

I'd be contemplating keeping a cache of page tables virtually mapped, and when you're running out allocate and map some more.

josecm · **Joined:** Sun Apr 30, 2017 10:14 am **Posts:** 18

I didn't get what you meant by "full-page tables", but I have at the least the first level/root page-table always mapped at a known address.

But I also didn't fully get your "cache of page tables" idea. If I'm not mapping the page tables on demand as I traverse the multiple levels, and use this cache, wouldn't I also need some other kind of auxiliary mapping or tagging to tell me what are the physical addresses of the page-tables I actually have mapped right?

bellezzasolo · **Joined:** Sun Feb 20, 2011 2:01 pm **Posts:** 110

josecm wrote:

I didn't get what you meant by "full-page tables", but I have at the least the first level/root page-table always mapped at a known address.

But I also didn't fully get your "cache of page tables" idea. If I'm not mapping the page tables on demand as I traverse the multiple levels, and use this cache, wouldn't I also need some other kind of auxiliary mapping or tagging to tell me what are the physical addresses of the page-tables I actually have mapped right?

By full page tables, I just mean page tables with no free entries (so you can't use them to map a new page table). Storing the root is a good start, but you need to do more. If you write an address into the root, you still don't get access to it, you need to have at least one bottom level page table that you can write it into.

If every page table that is in use is mapped, you should be able to implement virtual->physical address conversion through what is effectively a page table walk.

josecm · **Joined:** Sun Apr 30, 2017 10:14 am **Posts:** 18

I now see what you mean by a cache of page tables now. I think I'll use your idea then.

bellezzasolo wrote:

If every page table that is in use is mapped, you should be able to implement virtual->physical address conversion through what is effectively a page table walk.

I see this, but with the overhead of the cache look-up at each page-table level right?

Korona · **Joined:** Thu May 17, 2007 1:27 pm **Posts:** 999

bellezzasolo wrote:

As I understand it, Linux on x86 just uses its directly mapped physical memory at 3GB (which is a poor approach).

I have seen this claim (i.e., the claim that mapping all physical memory is a bad strategy) so many times on this forum (particularly by Brendan when he was not banned yet) but I have never seen an alternative design. Specifically I have not seen an alternative design that takes into account: (i) asynchronous TLB shootdown on SMP systems and (ii) that caching modes of pages can change. I have tried multiple times to implement something which avoids the full physical mapping (without massive drops in performance) but I have not been able to do it.

Designs where physical memory is not directly mapped quickly turn into a mess. If you map only a part of physical RAM and you change this mapping, you have to perform TLB shootdown. This means allocating some data structure for an inter-CPU message (hint: allocating heap memory in the physical memory allocation path without deadlocks is complicated), sending an IPI to all processors and waiting for them to process the message. Thus, you either need to eagerly perform shootdown when remapping, or you need to track pages which have previously been part of the physical mapping but which are not shot down yet. This needs to be done both in the VM mapping path and in the VM unmapping path. (At some point I had a design where all kernel page tables were directly mapped but user memory was still not directly mapped; even that turned out to be a nightmare.) Similar problems arise when you want to change caching modes, with the difference that consequences are even worse: for example, if you change the caching mode of a page that was previously used as a page table to WC without performing proper shootdown, you might get an MCE.

Before I recognize the claim that a full physical mapping is bad, I want to see an alternative approach, that exactly explains (i) how it accesses physical memory, and (ii) how it performs TLB shootdown after unmapping and/or changing the caching mode of pages. I do not see how this can be done without massively sacrificing performance. And I want the alternative explained without hand waving as this is a problem where details matter a lot. I would very much like to be wrong here, as it would allow 32-bit kernels to use much more physical RAM. But I don't think I am.

josecm · **Joined:** Sun Apr 30, 2017 10:14 am **Posts:** 18

Korona wrote:

Designs where physical memory is not directly mapped quickly turn into a mess. If you map only a part of physical RAM and you change this mapping, you have to perform TLB shootdown.

What if you could keep this region of virtual space you use to remap (the page tables caches we talked about above), private to a core? You wouldn't need TLB shootdowns, right? As bellezzasolo said (and of course this depends on the kind of system you are designing), if you don't have that many page-tables, and the region is large enough, you wouldn't even need that much remapping and accessing physical memory would only have the overhead of the cache look-up.

Also, having in mind the spectre vulnerability, keeping all physical memory mapped at all times might not be good from a security perspective.

bellezzasolo · **Joined:** Sun Feb 20, 2011 2:01 pm **Posts:** 110

Korona wrote:

bellezzasolo wrote:

As I understand it, Linux on x86 just uses its directly mapped physical memory at 3GB (which is a poor approach).

I have seen this claim (i.e., the claim that mapping all physical memory is a bad strategy) so many times on this forum (particularly by Brendan when he was not banned yet) but I have never seen an alternative design. Specifically I have not seen an alternative design that takes into account: (i) asynchronous TLB shootdown on SMP systems and (ii) that caching modes of pages can change. I have tried multiple times to implement something which avoids the full physical mapping (without massive drops in performance) but I have not been able to do it.

Designs where physical memory is not directly mapped quickly turn into a mess. If you map only a part of physical RAM and you change this mapping, you have to perform TLB shootdown. This means allocating some data structure for an inter-CPU message (hint: allocating heap memory in the physical memory allocation path without deadlocks is complicated), sending an IPI to all processors and waiting for them to process the message. Thus, you either need to eagerly perform shootdown when remapping, or you need to track pages which have previously been part of the physical mapping but which are not shot down yet. This needs to be done both in the VM mapping path and in the VM unmapping path. (At some point I had a design where all kernel page tables were directly mapped but user memory was still not directly mapped; even that turned out to be a nightmare.) Similar problems arise when you want to change caching modes, with the difference that consequences are even worse: for example, if you change the caching mode of a page that was previously used as a page table to WC without performing proper shootdown, you might get an MCE.

Before I recognize the claim that a full physical mapping is bad, I want to see an alternative approach, that exactly explains (i) how it accesses physical memory, and (ii) how it performs TLB shootdown after unmapping and/or changing the caching mode of pages. I do not see how this can be done without massively sacrificing performance. And I want the alternative explained without hand waving as this is a problem where details matter a lot. I would very much like to be wrong here, as it would allow 32-bit kernels to use much more physical RAM. But I don't think I am.

In respect to TLB shootdowns, there are three scenarios to consider:
Two+ cores all in different virtual address spaces. Assuming a higher half kernel, just check the high bit set/clear, if clear, no IPI is necessary. Otherwise, the IPI is sent. Assuming the kernel code for IPI handling isn't being paged out, that should resolve itself fairly easily.
Two+ cores running in the same address space. In this scenario, again just a TLB shootdown.
Shared memory between different address spaces: There's some form of reference counting going on already here. You need to consult kernel data structures to find out to which virtual addresses it's mapped, and which CPUs are using the corresponding address spaces. A bit more interesting.

Regarding allocating IPI data structures? Keeping a non-paged cache of shootdown IPI structures, and waiting for one IPI to finish before the next starts should do the trick. You need room for n*(n-1)*2 IPI structures, where n is the number of CPU threads. (In theory, each CPU sending an IPI to every other CPU). Making a pool work is fairly trivial.

Of course, a physical page can only be replaced on the free list when all TLB shootdowns have finished. I'd be looking at using lock cmpxchg to access a shared variable in the data structure, and when the counter reaches 0 either an event is signalled or the IPI handler itself marks the page as free.

Korona · **Joined:** Thu May 17, 2007 1:27 pm **Posts:** 999

josecm wrote:

Korona wrote:

Designs where physical memory is not directly mapped quickly turn into a mess. If you map only a part of physical RAM and you change this mapping, you have to perform TLB shootdown.

What if you could keep this region of virtual space you use to remap (the page tables caches we talked about above), private to a core? You wouldn't need TLB shootdowns, right?

Correct. But if you want to keep a per-CPU VM region, you either have to copy 2048 bytes at each CR3 switch (to move the lower half to a different per-CPU higher half), or you need to duplicate all VM spaces for each CPU. While the cost of the first approach could be reduced by some caching, I find neither approaches very compelling. Note that just not accessing the physical window of a different CPU is not enough: CPUs are allowed to speculatively prefetch any TLB entry, regardless of whether pages are accessed or not. Also keep in mind that when the caching mode is changed, the OS must be sure that the physical page is not in any TLB of any CPU.

Note that for Spectre defense, it's not enough to get rid of the full physical mapping. You'd have to guarantee that no sensitive information is mapped in the higher half at all while user space runs. That seems to be much more difficult. At most, getting rid of the full physical mapping gives you some defense-in-depth, but it doesn't allow you to get rid of KPTI.

@bellezzasolo I'm not asking about how TLB shootdown works in general (in fact, managarm handles it quite well), but how it would interact with a physical memory window that needs to be remapped. There are multiple questions that arise in this context: for example, is it possible to work with a fixed number of shootdown messages? If yes, what do you do if you run out of those? Can you block in all contexts where you unmap until a shootdown finishes? Keep in mind that this now includes all contexts where you need to access a physical page, as you do not have all physical pages mapped! Can you guarantee that this never leads to priority inversion? If not, can you handle priority inheritance on TLB shootdown? Is your chosen number of shootdown structs enough to prevents deadlocks (e.g. because the CPU that needs to finish the shootdown needs to issue another shootdown itself? So far, I have only ever seen hand-wavy answers to these questions but no concrete design that explains how exactly it would work in reality.

For your suggestion about always waiting for each IPIs to finish: I don't think that's practical in a real-world system. If you do that, a task that maps and unmaps memory in a tight loop can starve the whole system. It also does not allow you to do lazy TLB invalidation, which is quite a nice performance boost.

Thus, my current position is that the introduced complexity make it just not worth to be able to utilize slightly more memory on 32-bit systems. And that's the only real advantage that getting rid of the full physical mapping would have.

OSDev.org

Software page-table walk in virtual address space

Who is online