
So, instead of trying to invent a better stack, I went ahead to invent (or re-invent, I'm not sure

Characteristics:
- Looking up for a free page: O(log_32(n))
- Allocating: O(log_32(n))
- Freeing: O(log_32(n))
- Checking if page is allocated or not: O(1) - a feature that really helps to catch bugs.
- Memory spent: O(n)
I've posted basic idea (when it was in it's infancy) here, it somewhat resembles Cottontail (mentioned before by Legendmythe) or Brendan's idea about array of "number of free 4 KiB pages" entries for each 2 MiB page, in this thread.
Here's a simplified illustration for 4-bit indexing groups: (in reality I'm using 32-bit groups, but it wouldn't be practical to draw that)
To make it clear: 1 means free page, 0 - allocated. Bit set in a upper level means that there's at least one bit set in corresponding group at lower level.
Basic algorithms
- Look up for a free page: start at a top, find first nonzero bit at that level (preferably using BSF instruction or GCC's __builtin_ctz()). Descend to next level, find a nonzero bit there, and so on until you hit the bottom.
- Allocating: clear a bit in lower level, then if a whole group becomes 0, update a bit in upper level, and so on until the top (or stop).
- Freeing: set a bit in lower level, if whole group was 0 before, update a bit in upper level and so on...
- Checking if page is allocated: just check the bit in lower level.
Support for 4 MiB pages also could be added - it just needs additional, parallel upper levels (slightly different, checking for 0xFFFFFFFF, not 0). 2 MiB pages would be trickier, as it involves 16-bit groups somewhere between 4KiB and 2 MiB, but are doable anyway.
Currently I'm using 4KiB-only version, and it works great
