Octocontrabass wrote:
But there are other ways besides recursive paging and mapping all physical memory.
I don't think there is any paging scheme that can compete with a full physical mapping on x86_64. (That includes recursive paging, which is bad and problematic for many other reasons as well; don't do it!) This claim is often repeated on these forums, but people never actually measure its impact. It seems to be one of those "Oh look, I can outsmart mainstream OSes" claims that falls apart when you take a closer look.
The first reason is complexity. If you don't map all physical memory, you need to perform frequent TLB shootdowns (whenever you unmap an on-demand-mapped page). To perform lazy TLB shootdown, you need some chunk of memory to store the current shootdown progress. Note that this happens in your deallocation routine, so you don't want to allocate memory there (at least not from the same heap; otherwise you can get nasty free-allocate-free recursions). If all physical memory were mapped, you could just grab a page of physical memory and use it to store the state. Since that is no longer possible, you have to switch to a more complex scheme. For example, you could store the shootdown state in a few static locations and block if they are all in use. Another scheme (the one traditionally employed by Linux on 32-bit CPUs) is to have "low" physical memory that is mapped at all times and "high" physical memory that is mapped on demand. Note that Linux is removing this scheme from its kernel due to its complexity (since today you either have a lot of memory and a 64-bit CPU, or little memory and a 32-bit CPU). Anecdotally, if Linux removes something for complexity reasons, you don't want it in your kernel.
The second reason is performance. Ideally you want to map a page into only your current CPU, but that's not possible on x86_64 (in contrast to other archs that have per-CPU higher halves). One solution would be to dynamically modify the PML4 of each process to swap in a per-CPU higher half. But now you end up with even more TLB invalidations¹, a larger CPU migration cost, and more overall memory consumption, since you will probably have more than one PML4 per process (because you want to cache these PML4s). In any case (per-CPU PML4 or not), you need global synchronization among all CPUs when you map and unmap a page, e.g., because you must ensure that you don't map a page as uncached on one CPU and as writeback-cached on another (that can trigger an MCE when the memory controller detects it). You could consider a scheme that uses quiescent states to wait until page-attribute changes are done (similar to what Linux's RCU does for garbage collection), but that introduces even more complexity. Note that there is also an increase in memory consumption: to map all physical memory with 1 GiB pages, you need around 8 bytes of page-table entries per GiB of memory.
Any reasonable scheme that tracks which pages are mapped (or unmapped but not yet invalidated on all CPUs) and which aren't will need considerably more memory than that.
I would gladly be proven wrong here, but I will not buy the idea that on-demand mapping of physical pages can be fast without seeing some benchmarks first. Note that the Linux guys did perform the benchmarks, and they noticed a 25% performance difference:
Quote:
If you have 8GB of RAM or more, the biggest advantage [of 64-bit mode vs. 32-bit mode + high memory] _by_far_ for the kernel is that you don't spend 25% of your system time playing with k[un]map() and the TLB flushing that goes along with it.
¹ You will even have to invalidate TLBs from your CPU-migration routine. Depending on your kernel's design, this might be triggered from IRQ context; you don't want to block on TLB shootdown in IRQ context, so you need to allocate memory for lazy invalidation...