Octocontrabass wrote:
But there are other ways besides recursive paging and mapping all physical memory.
I don't think there is any paging scheme that can compete with a full physical mapping on x86_64. (That includes recursive paging, which is bad and problematic for many other reasons as well; don't do it!) This claim is often repeated on these forums, but people never actually measure its impact. It seems to be one of those "Oh look, I can outsmart mainstream OSes" claims that falls apart when you take a closer look.
The first reason is complexity. If you don't map all physical memory, you need to perform frequent TLB shootdowns (whenever you unmap an on-demand-mapped page). To perform lazy TLB shootdown, you need some chunk of memory to store the current shootdown progress. Note that this happens in your deallocation routine, so you don't want to allocate memory there (at least not from the same heap; otherwise you can get nasty free-allocate-free recursions). If all physical memory were mapped, you could just grab a page of physical memory and use it to store the state. Since that is no longer possible, you have to switch to a more complex scheme. For example, you could store the shootdown state in a few static locations and block if they are all in use. Another scheme (the one traditionally employed by Linux on 32-bit CPUs) is to have "low" physical memory that is mapped at all times and "high" physical memory that is mapped on demand. Note that Linux is removing this scheme from its kernel due to its complexity (since today you either have a lot of memory and a 64-bit CPU, or little memory and a 32-bit CPU). Anecdotally, if Linux removes something for complexity reasons, you don't want it in your kernel.
The second reason is performance. Ideally you want to map a page into only your current CPU, but that's not possible on x86_64 (in contrast to other archs that have per-CPU higher halves). One solution would be to dynamically modify the PML4 of each process to swap in a per-CPU higher half. But now you end up with even more TLB invalidations¹, a larger CPU migration cost, and more overall memory consumption, since you will probably have more than one PML4 per process (because you want to cache these PML4s). In any case (per-CPU PML4 or not), you need global synchronization among all CPUs when you map and unmap a page, e.g., because you must ensure that you don't map a page as uncached on one CPU and as writeback-cached on another (that can trigger an MCE when the memory controller detects it). You could consider a scheme that uses quiescent states to wait until page-attribute changes are done (similar to what Linux's RCU does for garbage collection), but that introduces even more complexity. Note that there is also an increase in memory consumption: to map all physical memory with 1 GiB pages, you need around 8 bytes of page-table entries per GiB of memory.
Any reasonable scheme that tracks which pages are mapped (or unmapped but not yet invalidated on all CPUs) and which aren't will need considerably more memory than that.
I would gladly be proven wrong here, but I will not buy the idea that on-demand mapping of physical pages can be fast without seeing some benchmarks first. Note that the Linux guys did perform the benchmarks, and they noticed a 25% performance difference:
Quote:
If you have 8GB of RAM or more, the biggest advantage [of 64-bit mode vs. 32-bit mode + high memory] _by_far_ for the kernel is that you don't spend 25% of your system time playing with k[un]map() and the TLB flushing that goes along with it.
¹ You will even have to invalidate TLBs from your CPU-migration routine. Depending on your kernel's design, this might be triggered from IRQ context; you don't want to block on TLB shootdown in IRQ context, so you need to allocate memory for lazy invalidation...