OSDev.org

The Place to Start for Operating System Developers

All times are UTC - 6 hours




Post new topic Reply to topic  [ 93 posts ]  Go to page Previous  1, 2, 3, 4, 5, 6, 7  Next
 Post subject: Re: Article on physical memory management
PostPosted: Tue Aug 16, 2011 8:40 am 
Joined: Fri Oct 20, 2006 10:14 am
Posts: 313
Embedded systems and 64GB RAM?


 Post subject: Re: Article on physical memory management
PostPosted: Tue Aug 16, 2011 8:53 am 
Joined: Mon Jul 05, 2010 4:15 pm
Posts: 496
FlashBurn wrote:
Embedded systems and 64GB RAM?


You will see it sooner than you think ;)


 Post subject: Re: Article on physical memory management
PostPosted: Tue Aug 16, 2011 9:30 am 
Joined: Fri Nov 16, 2007 1:59 pm
Posts: 612
OSwhatever wrote:
You will see it sooner than you think ;)


Not before it will work that quickly :D

_________________
If you have seen bad English in my words, tell me what's wrong, please.


 Post subject: Re: Article on physical memory management
PostPosted: Tue Aug 16, 2011 9:31 am 
Joined: Fri Oct 20, 2006 10:14 am
Posts: 313
Yeah, but by then the CPU will be fast enough to init a 64GB stack in less than a second ;)


 Post subject: Re: Article on physical memory management
PostPosted: Tue Aug 16, 2011 3:37 pm 
Joined: Sat Jan 15, 2005 12:00 am
Posts: 8169
Location: At his keyboard!
Hi,

turdus wrote:
gerryg400 wrote:
A stack uses only 1 pointer. That's it.

ROTFL. Please show me a stack implementation that uses *ONLY* a pointer. No further allocation allowed.


I've written plenty of them.

You need to understand that there are 3 different (but related) quantities:
  • the amount of virtual address space used
  • the amount of "used" RAM used (RAM that can't be allocated)
  • the amount of "free" RAM used (RAM that can be allocated)

For the amount of virtual address space used, a free page stack needs a minimum of one pointer (the physical address of the page on the top of the stack), plus maybe 1 bit for a spinlock (which can be any of the 12 unused bits of the physical address). The amount of "used" RAM is the same (4 or 8 bytes). The amount of "free" RAM used depends on how many pages are actually free, but it never matters how much "free" RAM is used.
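As a sketch of this (all names invented here, physical RAM simulated by a static array, and the spinlock bit omitted): the allocator's only permanent state is one pointer, because each free page stores the address of the next free page in its own first word.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096
#define NPAGES    8

/* Stand-in for physical RAM; a real kernel would access a free page
 * through its physical/virtual mapping instead of this array. */
static uint8_t arena[NPAGES * PAGE_SIZE];

/* The allocator's ONLY permanent state: physical address of the page on
 * top of the stack. 0 doubles as "stack empty", so page 0 is never used
 * in this sketch. */
static uintptr_t stack_top = 0;

static void free_page(uintptr_t paddr)          /* push: O(1) */
{
    *(uintptr_t *)(arena + paddr) = stack_top;  /* link lives in the free page */
    stack_top = paddr;
}

static uintptr_t alloc_page(void)               /* pop: O(1) */
{
    uintptr_t paddr = stack_top;
    if (paddr != 0)
        stack_top = *(uintptr_t *)(arena + paddr);
    return paddr;                               /* 0 means out of RAM */
}
```

Note how the links occupy only "free" RAM: the moment a page is popped its link word becomes irrelevant, which is why the "used" RAM cost stays at one pointer.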

There are variations on the free page stack idea though. One variation is to have a linked list of "index" pages, where each "index page" contains the physical addresses of up to 1023 (for 32-bit addresses) or 511 (for 64-bit addresses) more pages. In this case the amount of virtual address space used is 4 KiB, but the last index page can be allocated when it's empty and therefore doesn't count as "used" RAM (so "used" RAM is effectively zero bytes). The amount of "free" RAM used is the same as before.
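The index-page variation can be sketched the same way (again with simulated physical RAM and invented names; this layout keeps an explicit count next to the link, so it fits 510 addresses per 4 KiB page on a 64-bit system rather than the 511 quoted above):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096
#define NONE      ((uintptr_t)-1)
/* One next link and one count leave room for this many addresses. */
#define ENTRIES   (PAGE_SIZE / sizeof(uintptr_t) - 2)

struct index_page {
    uintptr_t next;             /* physical address of next index page */
    uintptr_t count;            /* valid entries in addr[]             */
    uintptr_t addr[ENTRIES];
};

static uint8_t arena[8 * PAGE_SIZE];            /* simulated physical RAM */
#define VIRT(pa) ((struct index_page *)(arena + (pa)))

static uintptr_t head = NONE;

static void free_page(uintptr_t pa)
{
    if (head != NONE && VIRT(head)->count < ENTRIES) {
        struct index_page *h = VIRT(head);
        h->addr[h->count++] = pa;               /* append to top index page */
    } else {
        struct index_page *ip = VIRT(pa);       /* freed page becomes an index */
        ip->next = head;
        ip->count = 0;
        head = pa;
    }
}

static uintptr_t alloc_page(void)
{
    if (head == NONE)
        return NONE;
    struct index_page *h = VIRT(head);
    if (h->count > 0)
        return h->addr[--h->count];
    uintptr_t pa = head;        /* empty index page: hand it out itself,  */
    head = h->next;             /* which is why "used" RAM is effectively */
    return pa;                  /* zero                                   */
}
```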

For performance, there are 7 separate things to consider:
  • allocating 1 page
  • freeing 1 page
  • allocating multiple (potentially non-contiguous) pages
  • freeing multiple (potentially non-contiguous) pages
  • allocating multiple contiguous (potentially with other restrictions - e.g. "must not cross 64 KiB boundary") pages
  • freeing multiple contiguous pages
  • initialisation/setup time

For most OSs, virtual memory is virtual. For example, when a process allocates 123 MiB of RAM the OS doesn't actually allocate anything. Instead it maps the same "page full of zeros" as read-only for the entire 123 MiB area. When the process attempts to write to that area you get a page fault and only then does the kernel allocate a page. Similar tricks are used for memory mapped files (pages marked as "not present" and fetched if/when accessed), swap space, etc. This means that it's very rare to want to allocate more than 1 page at a time; and also means that it's extremely unlikely that a process' pages will ever be physically contiguous.
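A toy simulation of that "map the shared zero page read-only, allocate on the first write fault" trick (the MMU and fault are faked with a tiny table here; the names are illustrative, not any real kernel's API):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 4096

/* Toy "PTE": either the shared read-only zero page or a private frame. */
enum kind { P_ZERO, P_PRIVATE };
struct pte { enum kind kind; uint8_t *frame; };

static uint8_t zero_frame[PAGE_SIZE];   /* the one "page full of zeros" */
static int frames_allocated;            /* how much RAM really got used */

static void map_anonymous(struct pte *pt, int npages)
{
    for (int i = 0; i < npages; i++)    /* "allocating" costs no RAM */
        pt[i] = (struct pte){ P_ZERO, zero_frame };
}

static void write_byte(struct pte *pt, int page, int off, uint8_t val)
{
    if (pt[page].kind == P_ZERO) {      /* simulated write page fault */
        pt[page].frame = calloc(1, PAGE_SIZE);  /* allocate only now */
        pt[page].kind = P_PRIVATE;
        frames_allocated++;
    }
    pt[page].frame[off] = val;
}
```

Mapping a large region touches no physical memory at all; only the pages that are actually written ever cost a frame, one at a time.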

This also means that for allocating 1 page, freeing 1 page, and freeing multiple non-contiguous pages; performance is very important. The other things don't happen often enough to matter.

For free page stacks, allocating 1 page and freeing 1 page is extremely fast. Freeing multiple (potentially non-contiguous) pages is also relatively fast, because you can write all the "next" links into each page before adding them all to the stack itself (which is great for lock contention too). It gets more complex when you're using multiple free page stacks (e.g. one stack for pages with 32-bit addresses and another for pages with 64-bit physical addresses), but it's essentially the same (just done once per stack type). The same applies to initialisation/setup time (it's the equivalent of freeing multiple potentially non-contiguous pages).

Anyway, let's have a look at various methods with all the characteristics.

Free page stacks
  • virtual address space used: 4 or 8 bytes
  • "used" RAM used: 4 or 8 bytes
  • "free" RAM used: depends on amount of free RAM
  • allocating 1 page: extremely fast
  • freeing 1 page: extremely fast
  • allocating multiple (potentially non-contiguous) pages: average (worst case is lots of 1 page allocations)
  • freeing multiple (potentially non-contiguous) pages: average (worst case is lots of 1 page deallocations)
  • allocating multiple contiguous pages with restrictions: extremely slow
  • freeing multiple contiguous pages: very fast
  • initialisation/setup time: average or relatively slow (depending on how and when it's done)

Linked list of index pages
  • virtual address space used: 4 KiB
  • "used" RAM used: none
  • "free" RAM used: depends on amount of free RAM
  • allocating 1 page: extremely fast
  • freeing 1 page: extremely fast
  • allocating multiple (potentially non-contiguous) pages: average (worst case is lots of 1 page allocations)
  • freeing multiple (potentially non-contiguous) pages: average (worst case is lots of 1 page deallocations)
  • allocating multiple contiguous pages with restrictions: extremely slow
  • freeing multiple contiguous pages: very fast
  • initialisation/setup time: average or relatively slow (depending on how and when it's done)

Free page bitmap
  • virtual address space used: "total RAM / 4 KiB / 8" bytes (32 KiB per GiB of RAM)
  • "used" RAM used: "total RAM / 4 KiB / 8" bytes (32 KiB per GiB of RAM)
  • "free" RAM used: none
  • allocating 1 page: relatively slow (worst case is searching the entire bitmap)
  • freeing 1 page: extremely fast (can be done with one "bts" instruction)
  • allocating multiple (potentially non-contiguous) pages: relatively slow (worst case is searching the entire bitmap)
  • freeing multiple (potentially non-contiguous) pages: average (worst case is lots of 1 page deallocations)
  • allocating multiple contiguous pages with restrictions: relatively slow (worst case is searching the entire bitmap)
  • freeing multiple contiguous pages: very fast
  • initialisation/setup time: average
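The bitmap trade-off above is visible in a few lines (sizes are illustrative; `__builtin_ctzll` is a GCC/Clang builtin standing in for a `bsf`-style instruction):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NPAGES 1024                     /* illustrative; 1 bit per 4 KiB page */
static uint64_t bitmap[NPAGES / 64];    /* bit set = page free */

static void free_page(size_t page)      /* one bit set: the "bts" case, O(1) */
{
    bitmap[page / 64] |= 1ULL << (page % 64);
}

static long alloc_page(void)            /* worst case scans the whole bitmap */
{
    for (size_t i = 0; i < NPAGES / 64; i++)
        if (bitmap[i]) {
            int bit = __builtin_ctzll(bitmap[i]);   /* lowest set bit */
            bitmap[i] &= ~(1ULL << bit);
            return (long)(i * 64 + bit);
        }
    return -1;                          /* out of memory */
}
```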

Turdus' "linked list of extents" method, with limited number of extents
  • virtual address space used: Don't know - maybe 4 KiB (for one page full of "next + start + length" entries)?
  • "used" RAM used: Don't know - maybe 4 KiB (for one page full of "next + start + length" entries)?
  • "free" RAM used: none
  • allocating 1 page: extremely fast
  • freeing 1 page: may not be possible if there's no space left for a new extent ("infinitely slow"?)
  • allocating multiple (potentially non-contiguous) pages: average (worst case is lots of 1 page extents)
  • freeing multiple (potentially non-contiguous) pages: "infinitely slow" (same as freeing 1 page)
  • allocating multiple contiguous pages with restrictions: "infinitely slow" (if you need to allocate the middle pages of an extent to suit restrictions like "must not cross 64 KiB boundary" then the original extent may need to be split into 2 extents, and it may not be possible to create the second extent).
  • freeing multiple contiguous pages: "infinitely slow" (same as freeing 1 page)
  • initialisation/setup time: extremely fast

Turdus' "linked list of extents" method, with unlimited number of extents
  • virtual address space used: have to be able to handle worst case (every second page free or "total pages / 2 * sizeof(entry)") so maybe 1 MiB per GiB
  • "used" RAM used: none (assuming "allocation on demand" used for storing extents)
  • "free" RAM used: with "allocate on demand" depends on number of extents (worst case is maybe 1 MiB per GiB)
  • allocating 1 page: extremely fast
  • freeing 1 page: very slow. Worst case is searching the entire linked list and not finding an extent to add the page to
  • allocating multiple (potentially non-contiguous) pages: average (worst case is lots of 1 page extents)
  • freeing multiple (potentially non-contiguous) pages: extremely slow (worst case is lots of 1 page deallocations)
  • allocating multiple contiguous pages with restrictions: slow. Worst case is searching (almost) the entire linked list
  • freeing multiple contiguous pages: very slow (worst case is searching the entire linked list and not finding an extent to add the page to)
  • initialisation/setup time: extremely fast

For the most important things (allocating 1 page, freeing 1 page, and freeing multiple non-contiguous pages) the page stacks method wins. It also wins for virtual address space used and "used" RAM used. For the less important things (allocating multiple contiguous pages with restrictions and freeing multiple contiguous pages) the bitmap method wins. The only case where the "linked list of extents" method wins is for initialisation/setup time (but the difference is unlikely to matter - despite the faster initialisation/setup time, boot time will probably be slower anyway because of the extra overhead of allocations/deallocations that occur during boot).

On top of everything above, there's the ability to support 3 more (potentially important) features: NUMA optimisations, page colouring, and large page sizes.

For NUMA optimisation, for page stacks and "linked list of index pages" you just have one stack or list per NUMA domain and it's both easy and fast. For "one global bitmap" it makes allocations slower; and for "one bitmap per NUMA domain" it doesn't affect allocation (but virtual space usage and "used" RAM usage increases). For "linked list of extents" you could have one linked list of extents per NUMA domain, which doesn't affect performance (but virtual space usage increases).

For page colouring; for page stacks and "linked list of index pages" you just have one stack or list per page colour and it's both easy and fast. For bitmaps it's also easy and fast to do (e.g. check each nth bit rather than each bit). For "global linked list of extents" it'd be a disaster (you'd be splitting and joining extents all the time). One "linked list of extents" per page colour would work better (but that will also make it very hard to allocate contiguous pages).

Then there's the ability to support large page sizes (2 MiB or 4 MiB pages, and 1 GiB pages). The key characteristic here is being able to coalesce multiple smaller pages into a larger page (e.g. multiple free 4 KiB pages converted into one free 4 MiB page, and multiple free 4 MiB pages converted into one free 1 GiB page), because without that you quickly end up with all larger pages broken up with only 4 KiB pages left (which makes large page support pointless after the OS has been running a little while).

It's always been my opinion that for most OSs (where virtual memory is virtual), regardless of how you do memory management, supporting large page sizes is a waste of time because you're still almost always allocating 4 KiB pages; and the disadvantages (e.g. the extra overhead of coalescing free pages, allocated pages or both) outweigh any advantages (e.g. fewer TLB misses). It's because of this that most OSs only use larger pages for the kernel's RAM and memory mapped devices, where you get some of the "fewer TLB misses" advantage without any disadvantages.

However, if I had to support large page sizes; then page stacks and "linked list of index pages" fail completely. Bitmaps alone could work (but would be slow and could easily be improved on). For "linked list of extents", it'd be easy to add support for large page sizes without making the performance any worse.

I'd probably use a bitmap (or a bitmap per NUMA domain) to track 4 KiB pages, with an array of "number of free 4 KiB pages" entries for each 2 MiB page (to make it easier to find 2 MiB pages, and find 2 MiB pages that are only missing a few 4 KiB pages). Of course this would mean searching the array and then part of the bitmap when allocating 4 KiB pages, and searching the array when allocating 2 MiB pages; and would also cost 1 bit per 4 KiB page plus 2 bytes per 2 MiB page (or a total of 33 KiB per GiB, for both virtual address space and "used" RAM usage). Of course part of the reason for this is that it'd be easy to make it work for NUMA optimisations and page colouring too. I haven't thought about it much though - there might be a better way.
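A rough sketch of that hybrid (structure names and the tiny sizes are invented for illustration): the bitmap still tracks individual 4 KiB pages, while a counter per 2 MiB region records how many of its 512 sub-pages are free, so finding a whole 2 MiB page only scans the much smaller counter array.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGES_PER_2M 512
#define NREGIONS     4      /* a tiny 8 MiB "RAM" for the sketch */

static uint64_t bmap[NREGIONS * PAGES_PER_2M / 64]; /* 4 KiB page bitmap  */
static uint16_t free_count[NREGIONS];  /* free 4 KiB pages per 2 MiB page */

static void free_4k(uint32_t page)
{
    bmap[page / 64] |= 1ULL << (page % 64);
    free_count[page / PAGES_PER_2M]++;
}

/* A 2 MiB allocation never touches the bitmap until a region is found. */
static int alloc_2m(void)
{
    for (int r = 0; r < NREGIONS; r++)
        if (free_count[r] == PAGES_PER_2M) {
            memset(&bmap[r * PAGES_PER_2M / 64], 0, PAGES_PER_2M / 8);
            free_count[r] = 0;
            return r;           /* region index */
        }
    return -1;                  /* no fully-free 2 MiB page */
}
```

The counter array is also what makes "2 MiB pages that are only missing a few 4 KiB pages" cheap to spot: just look for counters close to 512.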


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject: Re: Article on physical memory management
PostPosted: Tue Aug 16, 2011 4:04 pm 
Joined: Thu Mar 25, 2010 11:26 pm
Posts: 1801
Location: Melbourne, Australia
If you really want an allocator/deallocator that approaches O(1) for both operations and will support both page sizes, you can use a stack for the 2MB pages and a slab allocator on top of that to keep track of the 4kB pages. You can keep the slab info in one of the pages of the slab (and thus waste 0.2% of each big page that's broken up) or in one of the other memory structures, the way (IIRC) the Linux SLUB allocator does it. Unless you're prepared to run some sort of defragger, though, you'll always have the possibility of running out of big pages.
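One way that layout could look (a sketch only, with invented names; the header-in-the-first-sub-page is the "waste 0.2%" option above, and a real slab allocator would keep partial/full lists and free counts instead of the linear scans shown here):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SUBPAGES 512                 /* 4 KiB sub-pages per 2 MiB page */

/* Slab header kept inside the big page's first 4 KiB sub-page. */
struct big_page {
    uint64_t map[SUBPAGES / 64];     /* bit set = sub-page free */
    struct big_page *next;           /* LIFO stack link */
};

static struct big_page *big_stack;   /* stack of broken-up 2 MiB pages */

static void push_big(struct big_page *bp)   /* add a whole 2 MiB page */
{
    memset(bp->map, 0xff, sizeof bp->map);
    bp->map[0] &= ~1ULL;             /* sub-page 0 holds this header */
    bp->next = big_stack;
    big_stack = bp;
}

static int alloc_4k(struct big_page **from) /* sub-page index, or -1 */
{
    for (struct big_page *bp = big_stack; bp; bp = bp->next)
        for (int i = 0; i < SUBPAGES / 64; i++)
            if (bp->map[i]) {
                int b = __builtin_ctzll(bp->map[i]);
                bp->map[i] &= ~(1ULL << b);
                *from = bp;
                return i * 64 + b;
            }
    return -1;
}
```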

I've not found any reason to allocate big pages yet, other than for static stuff at boot time so I've not really explored this.

_________________
If a trainstation is where trains stop, what is a workstation ?


 Post subject: Re: Article on physical memory management
PostPosted: Tue Aug 16, 2011 5:10 pm 
Joined: Tue Apr 15, 2008 6:37 pm
Posts: 191
Location: Gotham, Batmanistan
Well, in theory the whole point of using large pages would be to avoid hammering the PMM with allocation requests when it can be determined ahead of time that a larger area of memory will be needed. Instead of firing off 512 or 1024 page faults you simply get one and fill the request. Given the multimedia nature of today's computing environment, I don't think it's hard to envision a great number of scenarios where an application will predictably and quickly need in excess of 2-4 MB of memory.

That said, any frame allocator which implements multiple page sizes, and the ability to split and coalesce them, will be much slower than a single-size LIFO stack allocator. Even a well-designed 2^N system with a predictable worst case like the one Combuster described is much, much more complicated and time-consuming than an allocator that just pushes and pops. So much so that you will end up with slower overall allocation speed if you're just allocating 4k pages most of the time anyway.

Ultimately it'll come down to how applications actually request memory. If you have a scenario where small applications won't frequently request small blocks of memory and large applications will frequently request larger spans of memory then a mixed size allocator would probably be best. If on the other hand every application is just nibbling away at 4k pages there's not a really compelling reason to avoid a single size stack allocator.

_________________
Reserved for OEM use.


 Post subject: Re: Article on physical memory management
PostPosted: Tue Aug 16, 2011 6:38 pm 
Joined: Wed Oct 18, 2006 3:45 am
Posts: 9228
Location: An der schönen blauen Donau, watching muppets
Quote:
However, I still prefer to use doubly linked lists for my physical memory. De-allocation is faster than O(1). I found that I tended to allocate pages 1 (or a few) at a time, but de-allocate in big chunks when processes are killed or a file is closed. If all the pages in a process are linked together, you can return the entire list with 4 pointer assignments.
I can't deduce your implementation from that description, but one implementation of such a tactic that I can see is a lazy-reclaiming one: in each page of free RAM you have the following: the number of pointers that are potentially present, followed by an array of pointers of that size, which might contain null pointers. The first entry in the array (if it exists) points to a structure of similar format; the others point to free pages potentially containing garbage. You can initialize the system by giving each page an array of size one and a pointer to the next page, i.e. twice the amount of work you would do for a stack:
Code:
start
   |
   v
+------+      +------+      +------+
|   4  |  +-->|   2  |  +-->|   3  |
+------+  |   +------+  |   +------+
| next |--+   | next |--+   | null |--X
+------+      +------+      +------+
|  p1  |      |  p4  |      | null |
+------+      +------+      +------+
|  p2  |                    |  p5  |
+------+                    +------+
|  p3  |
+------+

The trick is, when allocating, to pop items off the first page (skipping nulls if necessary), and adjust n by 1+#nulls. If the next pointer is reached, return the page itself and set the start pointer to the following page.
Freeing one page just involves setting n to 1, next to start, and start to the freed page.
The trick comes with block deallocations of virtual memory. If you have a page table with several free pages referenced from it, you can simply reinterpret that page table as a free page stack. The only thing you need to do is make sure the n and next fields have the proper values. If you do no other bookkeeping, n will have to be 1023 (32-bit entries) or 511 (64-bit entries), and next should point to the existing stack. Any pointers you would overwrite this way can be turned into a second list with a known size.
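A toy version of that allocation step (field names invented; the post stores next as the array's first entry, here it is split out for readability, and a real page would have ~1023/511 slots rather than 4):

```c
#include <assert.h>
#include <stddef.h>

#define SLOTS 4     /* tiny for the sketch */

/* One free page interpreted as "count + array possibly containing nulls"
 * (e.g. a recycled page table). */
struct fpage {
    size_t n;                   /* entries possibly present in arr[] */
    struct fpage *next;
    struct fpage *arr[SLOTS];   /* free pages, or NULL holes */
};

static struct fpage *start;

static struct fpage *alloc_page(void)
{
    struct fpage *h = start;
    if (!h)
        return NULL;
    while (h->n > 0) {
        struct fpage *p = h->arr[--h->n];   /* pop, skipping null holes */
        if (p)
            return p;
    }
    start = h->next;            /* array exhausted: hand out head itself */
    return h;
}

static void free_one(struct fpage *p)   /* the "set n, next, start" trick */
{
    p->n = 0;                   /* no array entries: page IS the free page */
    p->next = start;
    start = p;
}
```

The cost of the null holes is paid lazily by whoever allocates next, which is exactly the amortisation argued for above.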

In the likely case, such page tables are filled contiguously where code/data/bss is, whereas the heap may have some fragmentation. If you do no other bookkeeping, finding the used pages within that table might cost as much as searching the page table completely and freeing the pages one at a time. Instead that work is now performed by the page request function, which in the worst case strips all null values where necessary; but in more likely cases there are several allocated pages within a page table and fewer operations have to be performed overall. In many cases you even know exactly the n needed to cover all pages used. You might even want to add a signature bit when the page is known to be full, to make batch allocations even faster.

Let's add to Brendan's list for completeness' sake. If you have a better description for your algorithm you may want to share it :wink:

Pagetables as index tables
  • virtual address space used: 4 or 8 bytes
  • "used" RAM used: 4 or 8 bytes
  • "free" RAM used: depends on amount of free RAM
  • allocating 1 page: amortized extremely fast (read 4k of ram in the worst case)
  • freeing 1 page: extremely fast
  • allocating multiple (potentially non-contiguous) pages: average (amortized lots of 1-page allocations worst case; the real worst case is when there are lots of fragmented pages)
  • freeing multiple (potentially non-contiguous) pages: average (lots of 1-page deallocations)
  • allocating multiple contiguous pages with restrictions: extremely slow
  • freeing multiple contiguous pages: fast
  • initialisation/setup time: average or relatively slow (depending on how and when it's done)

Binary aligned extent trees (as posted earlier)
  • virtual address space used: 4 or 8 bytes
  • "used" RAM used: 4 or 8 bytes
  • "free" RAM used: depends on amount of free RAM
  • allocating 1 page: average (worst case requires halving the entire address space until a 4k node is returned)
  • freeing 1 page: average (worst case requires merging the entire address space)
  • allocating multiple (potentially non-contiguous) pages: average (worst case is lots of 1 page extents)
  • freeing multiple (potentially non-contiguous) pages: average (worst case is lots of 1 page extents with no positional relation)
  • allocating multiple contiguous pages with restrictions: fast (worst case when unaligned allocations are allowed, the actual requested size is very small, and fragmentation is extremely high: the algorithm scales inversely proportional to the requested size)
  • freeing multiple contiguous pages: extremely fast
  • initialisation/setup time: extremely fast

Obviously the extent trees are designed for large allocations with better worst-case performance in other areas than Turdus' method (since O(log n log n) for an alloc-dealloc pair is faster than O(1)+O(n) in their respective worst cases). Pagetable indexes are no better here than their stack counterparts.

NUMA can be covered by keeping a structure per domain for the pagetable indexes, which makes the optimal case less achievable because directory-wide deallocations require knowing whether the NUMA criterion was satisfied. The alternative is to keep two lists per domain, one with guaranteed members and one with likely members. If the guaranteed list is exhausted, the likely list is checked in turn, and then the foreign lists. Any out-of-place items can be dispatched to their correct guaranteed lists, nulls are moved to the end, and uniform lists can likewise be dispatched to the correct guaranteed list, which still maintains the same time per actual deallocation, but with potentially much higher outliers for individual requests. Coloured pages are even nastier to optimize and move away from the best-case scenarios even further, to the point where simply using a stack is both faster and more practical.

Where NUMA domains are single large ranges, the extent tree needs no modification at all, as operations can simply add the desired criterion in the descent; otherwise the tree can be separated per domain. Page colouring can be easily added in a node's data as a bitfield (where, for example, bit n set means this node or its subnodes contain pages with colour n).


Actually, my current implementation is a reference-counting tree with exactly the same dimensions as the x86 PAE structures, with entries that are either a 32-bit reference count, or a pointer to the next table with the number of unused pages in the bottom bits. The top 8 bits of the pointer (which are architecturally never part of the address) are used as a tag to indicate RAM, and whether the data is a pointer/counter pair or a usage reference count. It makes deallocations trivial and small allocations decent, as the tree is only recursed if there's free RAM available, and it allows automatic reclaiming of larger pages; and all operations are O(1) when userland wishes to apply a different allocation strategy and just supplies the page locations to map, with the kernel doing little more than refcounting and sanity checks.

_________________
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]


 Post subject: Re: Article on physical memory management
PostPosted: Wed Aug 17, 2011 7:57 am 
Joined: Tue Feb 08, 2011 1:58 pm
Posts: 496
Combuster wrote:
Ok, It has been shown again that religion is more important than fact, due to the fact it's blatantly obvious that turdus thinks his implementation is flawless and the world disagrees.

Nothing like that. You only point out flaws that have been imagined. For example, it will definitely not crash when the buffer is full, as many of you thought. I was only arguing against those.


 Post subject: Re: Article on physical memory management
PostPosted: Wed Aug 17, 2011 8:04 am 
Joined: Tue Feb 08, 2011 1:58 pm
Posts: 496
Brendan wrote:
Hi,
turdus wrote:
gerryg400 wrote:
A stack uses only 1 pointer. That's it.

ROTFL. Please show me a stack implementation that uses *ONLY* a pointer. No further allocation allowed.

I've written plenty of them.

Dear Brendan,

I think you misunderstand. I was referring to implementing a stack that uses *ONLY* a pointer as data, in other words, please show me a stack implementation that uses only 4 bytes of memory to keep track of pages. I'm pretty sure you agree that a pointer that points to nowhere is quite useless :-)

Cheers


 Post subject: Re: Article on physical memory management
PostPosted: Wed Aug 17, 2011 12:30 pm 
Joined: Fri Jun 13, 2008 3:21 pm
Posts: 1700
Location: Cambridge, United Kingdom
But it does only use one pointer's worth of memory. All the other data is stored on the free pages. Now, as we have ascertained that these pages are free (After all, if they weren't free, they wouldn't be on the free page stack), any memory used on these free pages does not count.


 Post subject: Re: Article on physical memory management
PostPosted: Wed Aug 17, 2011 3:02 pm 
Joined: Wed Oct 18, 2006 3:45 am
Posts: 9228
Location: An der schönen blauen Donau, watching muppets
turdus wrote:
Combuster wrote:
Ok, It has been shown again that religion is more important than fact

Nothing like that.
(...)
Dear Brendan, I think you misunderstand.

I rest my case.

_________________
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]


 Post subject: Re: Article on physical memory management
PostPosted: Wed Aug 17, 2011 9:45 pm 
Joined: Thu Mar 25, 2010 11:26 pm
Posts: 1801
Location: Melbourne, Australia
You can probably reduce the amount of physical memory used by a bitmap to almost zero too.

If a page in the bitmap contains all 1's (i.e. every page that it represents is in use), the physical page behind the bitmap can be removed for use by the OS and replaced by a COW page containing all 1's. When a page is freed in a bitmap page that is COW, the freed page can be used as the physical memory behind the page.

It requires that the backing page for each page in the bitmap come from the group of pages that that section of the bitmap represents. It might be doable but I've never tried it.

_________________
If a trainstation is where trains stop, what is a workstation ?


 Post subject: Re: Article on physical memory management
PostPosted: Thu Aug 18, 2011 2:26 am 
Joined: Fri Oct 20, 2006 10:14 am
Posts: 313
The question should be: why is it a problem that a structure for memory management needs something like 128 KB (e.g. the bitmap) for managing 4 GB of RAM?

@turdus

You haven't answered the "flaws" I found in your algo.


 Post subject: Re: Article on physical memory management
PostPosted: Thu Aug 18, 2011 9:01 am 
Joined: Tue Feb 08, 2011 1:58 pm
Posts: 496
Owen wrote:
any memory used on these free pages does not count.

As I wrote before, I'd rather use that area as cache, so it does count for me.


Powered by phpBB © 2000, 2002, 2005, 2007 phpBB Group