Hi,
AlexShpilkin wrote:
Brendan wrote:
There's almost never any sane reason to allocate a physical page and not map it somewhere; so you have to map it (and do the TLB stuff) regardless of whether you use the first approach or the second approach.
Spot on as usual, thanks! Put another way: if you need to touch a free list node, you’re in the process of allocating (or freeing) the page it resides in, thus it will need to be (or is) mapped anyway. Missed that!
This doesn’t hold if you have a more complicated structure (buddy allocator?), but presumably you only need this to allocate consecutive physical pages, which is weird. (And AFAIU page colouring ≈ multiple free lists[?] so it’s not “complicated” in this sense.)
For a buddy allocator I doubt it can work. However, a buddy allocator is better suited to heap management (e.g. "malloc()") than to virtual address space management anyway.
For virtual address space management, most of the time it's one page only, allocated within the page fault handler. For example, the caller might ask for 100 MiB, but you just map the same page full of zeros everywhere in that 100 MiB as "read only", and don't actually allocate a physical page until/unless they write to it and trigger a page fault. Anything involving "copy on write" shared memory is mostly the same (allocate one page in the page fault handler when there's a write), which includes memory mapped files (where file data is loaded into the VFS cache and shared with the processes that mapped the file). Swap space (e.g. allocating a physical page to load data from swap space) is also one page at a time.
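A rough sketch of that "allocate on first write" idea is below. All the helpers (alloc_phys_page(), map_page(), zero_phys_page(), zero_page) are made up for the sketch, not from any real kernel, and a real handler would also check permissions and handle out-of-memory:

Code: Select all
#include <stdint.h>

#define PAGE_SIZE 4096

/* Assumed helpers: alloc_phys_page() returns the physical address of a free
   page, map_page() creates/replaces a mapping, zero_phys_page() fills a
   physical page with zeros, and zero_page is the shared all-zeros page. */
extern uintptr_t zero_page;
uintptr_t alloc_phys_page(void);
void map_page(uintptr_t virt, uintptr_t phys, int writable);
void zero_phys_page(uintptr_t phys);

/* Called for faults within an "allocate on demand" area. */
void demand_fault(uintptr_t fault_addr, int was_write) {
    uintptr_t virt = fault_addr & ~(uintptr_t)(PAGE_SIZE - 1);

    if (!was_write) {
        /* First read: map the shared zero page read-only, allocate nothing. */
        map_page(virt, zero_page, 0);
    } else {
        /* First write: allocate a real page now, zero it, and replace the
           read-only zero page mapping; the faulting write is then retried. */
        uintptr_t phys = alloc_phys_page();
        zero_phys_page(phys);
        map_page(virt, phys, 1);
    }
}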
Sometimes you'll need a page that has a 32-bit physical address (e.g. for a "32-bit only" PCI device); and for that reason I have different free page stacks for pages with 32-bit physical addresses and for pages with larger physical addresses.
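As an aside, a free page stack itself can be trivially simple: the "next" link lives in the free page itself, which is exactly why touching the node means the page has to be accessible anyway. A minimal sketch, assuming a phys_to_virt() helper (e.g. a linear mapping of all RAM; not the only way to do it):

Code: Select all
#include <stdint.h>

/* A free page stack where the "next" link is stored in the free page itself.
   phys_to_virt() is an assumed helper, e.g. a linear mapping of all RAM. */
typedef struct {
    uintptr_t top;   /* physical address of the top free page; 0 = empty */
} page_stack_t;

void *phys_to_virt(uintptr_t phys);

void page_stack_push(page_stack_t *s, uintptr_t phys) {
    *(uintptr_t *)phys_to_virt(phys) = s->top;   /* link new page to old top */
    s->top = phys;
}

uintptr_t page_stack_pop(page_stack_t *s) {
    uintptr_t phys = s->top;
    if (phys != 0)
        s->top = *(uintptr_t *)phys_to_virt(phys);
    return phys;   /* 0 means "this stack is empty" */
}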
For some devices you need physically contiguous RAM (sometimes with more bizarre restrictions, like "must be below 16 MiB and must not cross a 64 KiB boundary"), and free page stacks aren't good for that. For these cases, I normally use a bitmap covering the first ~16 MiB of RAM instead.
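A bitmap search for that kind of request might look roughly like this. It's only a sketch under those example restrictions (4 KiB frames, 16 MiB bitmap, 64 KiB boundary); low_bitmap and its helpers are made up, and a real version would hold a lock:

Code: Select all
#include <stdint.h>

/* Hypothetical bitmap covering the first 16 MiB of RAM (4096 page frames).
   Bit set = frame free. */
#define LOW_FRAMES 4096                       /* 16 MiB / 4 KiB */
static uint8_t low_bitmap[LOW_FRAMES / 8];

static int frame_free(unsigned f) {
    return (low_bitmap[f / 8] >> (f % 8)) & 1;
}
static void frame_clear(unsigned f) {
    low_bitmap[f / 8] &= ~(1u << (f % 8));
}

/* Returns the first frame of a free run of `count` frames that doesn't
   cross a 64 KiB (16 frame) boundary, or -1 if none exists.
   (For count > 16 the boundary restriction can't be satisfied at all.) */
long alloc_low_contiguous(unsigned count) {
    for (unsigned start = 0; start + count <= LOW_FRAMES; start++) {
        if ((start % 16) + count > 16)
            continue;                         /* run would cross 64 KiB */
        unsigned ok = 1;
        for (unsigned i = 0; i < count; i++)
            if (!frame_free(start + i)) { ok = 0; break; }
        if (ok) {
            for (unsigned i = 0; i < count; i++)
                frame_clear(start + i);
            return (long)start;               /* frame number; *4096 = phys */
        }
    }
    return -1;
}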
For page colouring and for NUMA optimisations, I just have many free page stacks. For example, with 2 NUMA domains and 256 page colours I might have 2*256 = 512 free page stacks, and do something like "best_free_page_stack = (requested_NUMA_domain << NUMA_shift) | requested_page_colour;". Having many free page stacks (with one spinlock per stack) also reduces lock contention, because it's much less likely that multiple CPUs will want to acquire the same spinlock at the same time.
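As a concrete sketch of that index calculation (the constants are just this example's 2 domains and 256 colours, and the spinlock details are omitted):

Code: Select all
#include <stdint.h>

/* Hypothetical layout: 2 NUMA domains * 256 page colours = 512 free page
   stacks, each with its own spinlock so CPUs rarely contend for one lock. */
#define PAGE_COLOURS  256
#define NUMA_SHIFT    8      /* log2(PAGE_COLOURS) */
#define NUMA_DOMAINS  2
#define NUM_STACKS    (NUMA_DOMAINS * PAGE_COLOURS)

typedef struct {
    volatile int lock;   /* per-stack spinlock (implementation omitted) */
    uintptr_t    top;    /* top of this free page stack */
} page_stack_t;

static page_stack_t free_page_stacks[NUM_STACKS];

static inline page_stack_t *best_free_page_stack(unsigned numa_domain,
                                                 unsigned page_colour) {
    return &free_page_stacks[(numa_domain << NUMA_SHIFT)
                             | (page_colour & (PAGE_COLOURS - 1))];
}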
I also tend to use a trick to improve performance for higher priority threads. Essentially, have a "dirty page stack" (for pages that still contain stale data from before they were freed) and a "clean page stack" (for pages that have been filled with zeros). When a high priority thread frees a page, you put it on the dirty page stack; when a high priority thread allocates a page, you try to allocate from the clean page stack. When a low priority thread frees a page, you fill it with zeros and put it on the clean page stack; when a low priority thread allocates a page, you allocate from the dirty page stack and fill it with zeros. This means that most of the "page cleaning" is done by low priority threads (where performance/overhead matters less), and most of the time you can allocate/free pages for high priority threads faster (without the overhead of filling pages with zeros).
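In code, the trick might look something like this sketch (page_stack_push/pop and phys_to_virt() are the assumed helpers from the earlier sketch; the fallbacks for an empty stack are my own assumption):

Code: Select all
#include <stdint.h>
#include <string.h>

typedef struct { uintptr_t top; } page_stack_t;
static page_stack_t clean_stack, dirty_stack;

void page_stack_push(page_stack_t *s, uintptr_t phys);
uintptr_t page_stack_pop(page_stack_t *s);
void *phys_to_virt(uintptr_t phys);

static void zero_page_contents(uintptr_t phys) {
    memset(phys_to_virt(phys), 0, 4096);
}

uintptr_t alloc_page(int high_priority) {
    uintptr_t p;
    if (high_priority) {
        if ((p = page_stack_pop(&clean_stack))) return p;  /* already zeroed */
        if ((p = page_stack_pop(&dirty_stack)))            /* fall back */
            zero_page_contents(p);
        return p;
    }
    /* Low priority: take a dirty page and pay the cost of cleaning it. */
    if (!(p = page_stack_pop(&dirty_stack)))
        return page_stack_pop(&clean_stack);  /* fall back; already zeroed */
    zero_page_contents(p);
    return p;
}

void free_page(uintptr_t phys, int high_priority) {
    if (high_priority) {
        page_stack_push(&dirty_stack, phys);  /* defer the zeroing */
    } else {
        zero_page_contents(phys);             /* clean now, while it's cheap */
        page_stack_push(&clean_stack, phys);
    }
}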
Note that with all of this combined you end up with a kind of "search order" when allocating normal pages. E.g. you might start by trying to allocate a clean page above 4 GiB; if there's none, try for a dirty page above 4 GiB; if there's none, try for a clean page with a 32-bit address; and so on (until you're trying for the wrong page colour and/or the wrong NUMA domain as a last resort).
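That search order could be written as a simple chain of attempts, roughly like this (try_pop() is an assumed helper that pops from the stack matching the given attributes and returns 0 when it's empty; the exact fallback order is just one possibility):

Code: Select all
#include <stdint.h>

uintptr_t try_pop(int clean, int above_4gib,
                  unsigned numa_domain, unsigned page_colour);

uintptr_t alloc_normal_page(unsigned domain, unsigned colour) {
    uintptr_t p;

    /* Preferred: clean, above 4 GiB, right NUMA domain and colour... */
    if ((p = try_pop(1, 1, domain, colour)) != 0) return p;
    /* ...then dirty above 4 GiB... */
    if ((p = try_pop(0, 1, domain, colour)) != 0) return p;
    /* ...then clean/dirty with a 32-bit physical address... */
    if ((p = try_pop(1, 0, domain, colour)) != 0) return p;
    if ((p = try_pop(0, 0, domain, colour)) != 0) return p;

    /* Last resort: wrong page colour (a full version would also try the
       wrong NUMA domain, and every clean/dirty/32-bit combination). */
    for (unsigned c = 0; c < 256; c++) {
        if ((p = try_pop(1, 1, domain, c)) != 0) return p;
        if ((p = try_pop(0, 0, domain, c)) != 0) return p;
    }
    return 0;   /* out of usable physical memory */
}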
For large pages, I simply don't bother and only use them for memory mapped IO (e.g. video display memory). This is partly due to the nature of my OS (many smaller communicating processes rather than large "single process" applications), which makes large pages less useful (and more wasteful). You could have additional free page stacks for large pages; but the large pages tend to get split up into smaller pages while the OS is running until there are none left (so you end up with 4 KiB pages and nothing else anyway). To avoid that you need some way to recombine many free 4 KiB pages back into a free large page, which tends to be messy and requires a much more complicated/slower physical memory manager (and the extra overhead can easily negate any performance advantage of using large pages).
Cheers,
Brendan