Bucket Allocator

BenLunt · Post by **BenLunt** » Sun May 01, 2022 9:00 pm

Hi guys,

It's been a while. I haven't visited here near as much as I use to. However, I have been a little ill the past little while and seem to spend more time sitting here at my desk than out doing things. I hope it isn't old age, right? :-)

Anyway, recently I have decided to do a complete re-write of my project. I am dropping anything considered legacy. No more Legacy BIOS, FDC, PIC, PIT, etc. Strictly 64-bit, UEFI, HPET, APIC, APCI, etc., and of course USB :-).

Anyway, I was looking through lists and lists of malloc() replacements to see what I could come up with to write one for my own. Some I found where very simple, though the minimum allocation was a 4096 byte block. Others were more complex, allowing smaller allocations, being more conservative--i.e.: less fragmentation. However, the more complex it was, the more difficult it was to understand what was actually happening under the hood.

Therefore, I decided to write a simple malloc() drop-in replacement, with documentation, for the newbie. Something that was much easier to understand. (note: that malloc() function does not take the standard C set of parameters. However, this is for your kernel's internal memory, so...)

It uses mmap() to allocate virtual memory, but does not require mmap() in case the reader doesn't have that much implemented yet. It can simply use a contiguous set of physical blocks.

Anyway, it was written over a few hours this weekend, and it seems to work as expected, though I have not tested it thoroughly.

It has two source files, malloc.cpp and malloc.h, which are (mostly) drop-in source files. (what source is always drop-in?) It also has bucket.cpp as the caller to show what initialization needs to be done.

The source and documentation is at https://github.com/fysnet/FYSOS/tree/master/bucket. It includes a seven page PDF trying to explain exactly what is happening under the hood.

I would love to hear comments, good and bad, improvements, etc., though remember that the purpose is simplicity for the newbie starting out.

One thing I might change is to have the spinlock function only lock the current Bucket (which in turn locks any children Pebbles) being modified, rather than lock all function. This way another task/thread can access any other Bucket at the same time as this task/thread, as long as it is another Bucket.

A few advantages to this code's simplicity is the mmap() function is completely optional if only one Bucket is ever allocated, as well as the spinlock function if no multitasking is used.

Thank you to everyone here, for always welcoming my comments. I have been a long time visitor and very much enjoyed reading this forum.

Thank you,
Ben
- https://www.fysnet.net/osdesign_book_series.htm

Voldemort · Post by **Voldemort** » Mon May 02, 2022 6:28 pm

Good Job!. It well documented and easy to read. I will take a deeper look when time permits and get back with my suggestions.

replace 4096 with global constant perhaps.

davmac314 · Post by **davmac314** » Tue May 03, 2022 2:35 am

Hi Ben,

That looks pretty decent given the purpose. I think the one thing I'd consider doing differently is to not have the distinction between physical and virtual memory, insist that all buckets are physically contiguous (as well as virtually contiguous), and have the user (caller) do the virtual -> physical address conversion if necessary. That's what I am doing in my (tinkering, early-stage) kernel and IIRC it is what Linux does for kmalloc. But maybe there are advantages to doing it the way you have done, I'm just "thinking aloud".

Regarding your comment at the bottom in the readme:

This code may be similar to another malloc I once saw. As I was writing it, things seemed to be a little familiar. Kind of a deja-vu type thing. If you know of a malloc that is similar, even using the name 'bucket', please let me know. I can't seem to find the one in question anymore. It would be interesting to see what I remembered.

I definitely know the term "bucket" from an allocator. I suspect maybe it is the Doug Lea malloc. However, for the "bucket" concept that I am familiar with, a bucket is a free list of "pebbles" of a particular size, rather than a region from which allocations are made (I know that as "arena"). So there is always a fixed number of buckets, each corresponding to a different sized allocation: starting at perhaps 16 bytes, then 24 bytes, 32 bytes, and so on, maybe jumping by larger increments for higher index ranges. The bucket just points to the first free block (pebble) of the specified size, which is part of a linked list. The size of a free block is duplicated at the start and end of the block so you can merge "backwards" if freeing an allocated block that has a prior free block. I have implemented this myself but the code is not public yet.

Voldemort · Post by **Voldemort** » Tue May 03, 2022 11:23 am

Linux Kernel Development by Rober Love is excellent small book, the memory management chapter supplemented with Bens explanation should help everyone.

nullplan · Post by **nullplan** » Tue May 03, 2022 12:37 pm

davmac314 wrote:I definitely know the term "bucket" from an allocator. I suspect maybe it is the Doug Lea malloc. However, for the "bucket" concept that I am familiar with, a bucket is a free list of "pebbles" of a particular size, rather than a region from which allocations are made (I know that as "arena"). So there is always a fixed number of buckets, each corresponding to a different sized allocation: starting at perhaps 16 bytes, then 24 bytes, 32 bytes, and so on, maybe jumping by larger increments for higher index ranges. The bucket just points to the first free block (pebble) of the specified size, which is part of a linked list. The size of a free block is duplicated at the start and end of the block so you can merge "backwards" if freeing an allocated block that has a prior free block. I have implemented this myself but the code is not public yet.

This is pretty much how the old musl allocator worked (now called oldmalloc). Every free chunk is part of a doubly linked list, all sorted into buckets of varying sizes. Each chunk must be a multiple of 4 machine words in size (and has an overhead of 2 machine words), and the first 32 buckets just contain fixed size chunks from 1 to 32 times that size. The next 32 buckets contain a logarithmic scale for minimum size. When an allocation request comes in, it is rounded up to the next minimum bucket size, and then all buckets of that index and higher are checked to see if they contain anything. If not, a chunk of the next higher multiple of the page size is allocated, and then split.

The algorithm featured a sophisticated fine-grained locking mechanism, so you would only take a lock on the bucket you were operating on. Meaning that different threads allocating objects of wildly different size could work at the same time. But that didn't really work out for two reasons: One, modern C++ applications tend to allocate lots of similarly sized objects, so most threads only work on one of a few buckets, and two, as part of the allocation at one point, it is possible that one thread consumes all already allocated free memory. Only temporarily; the memory chunk will then be split and the remainder freed again, but in that time, any other thread would see that the heap is empty and expand the heap, leading to unlimited heap growth. So before abandoning that algorithm, Rich published a version with a global malloc lock to get rid of the problem.

BenLunt · Post by **BenLunt** » Tue May 03, 2022 1:28 pm

Hi guys, Thanks for your comments.

davmac314 wrote: I think the one thing I'd consider doing differently is to not have the distinction between physical and virtual memory, insist that all buckets are physically contiguous (as well as virtually contiguous), and have the user (caller) do the virtual -> physical address conversion if necessary.

If I understand you correctly, the heap would allocate contiguous pages of physical memory, all identity mapped to virtual memory addresses, and visa-versa. If this is what you mean, then there is no need for paging, correct? Or did I understand that incorrectly?

Though not added in the code, in the specification (which is what my much-advanced heap code does), I add the Physical distinction, the ability to allocate physical contiguous memory because this is used by the kernel. The kernel needs to be able to allocate physical memory for hardware drivers. In a user privileged heap, the ability is disabled by the kernel, with a simple mask of the flags, not allowing physical allocation.

My memory structure has three layers. The physical layer, the virtual layer, and the heap layer. All allocation goes to the heap. If it can satisfy the request, it does so and returns, never calling the underlining layer. If it cannot, it calls the virtual layer to extend the heap, which in turn calls the physical layer.

My kernel supports 32 Gig of RAM. In fact, the way it is coded, my kernel thinks there is 32 gig of RAM at all times, period, even if there is considerably less. The virtual layer supports this 32 Gig independent of how much is actually physically installed. The kernel's heap, as well as a user heap, can request memory as it pleases. The kernel has the 32-gig limit, while a user heap has a much smaller limit, though it can be lifted if needed. The physical layer is where all the work is done, paging out to the media, etc.

Anyway, back to what I originally had in mind with this post. The code I link to is a simple heap allocator, usable even if there is no underlining layers. A beginner may simply point the mmap() call to a string of physical pages and return. As long as these pages are guaranteed to be free and present, *and* there is never a call to add a bucket to the heap, this code works just fine. A beginner can have a heap allocation malloc() drop-in with no other memory layers available. Then when they write the virtual layer, simply implement the mmap() functionality and this heap continues to work as is, no modifications.

The reason was that most of the code I looked at, when deciding what kind of heap layer I wanted, was all either extremely simple, but memory hungry; or more complex, but very difficult to understand what was happening. In fact, I did see Doug Lea's code. Without sounding like I am disrespecting it in any way, there were so many macros used, a beginner would be completely lost.

The code I provided is for the beginner. It will allocate and free memory, very similar to a libc call. Granted, the alignment part is not included. This is left for the beginner to implement. When the beginner has enough of a project to allocate physical memory for hardware drivers, I believe they should have the knowledge to advance this code, or even write their own. For now, in my humble opinion, it allows a beginner to understand and use a memory chunk less than the page size, eliminating fragmentation, but still allocating large chunks as well.

I do appreciate the comments. As I said, I started many years ago and would not be anywhere near what I have today if it wasn't for forums like this one. As a nostalgic note, how many of us remember BBSs. :-)

Thanks,
Ben

davmac314 · Post by **davmac314** » Wed May 04, 2022 6:44 am

BenLunt wrote:If I understand you correctly, the heap would allocate contiguous pages of physical memory, all identity mapped to virtual memory addresses, and visa-versa. If this is what you mean, then there is no need for paging, correct? Or did I understand that incorrectly?

Not exactly - I meant that the heap would allocate contiguous pages of physical memory at some virtual address which may be different from the physical address. If the caller wants the physical address then they must convert from the virtual address to the physical (by whatever means is provided elsewhere, not part of the allocator). Or putting it another way: all allocations are for virtual addresses, but the virtual memory allocated by any single allocation is guaranteed to correspond to a contiguous block of physical memory. Allocations for special ranges (<16mb for example) probably could be handled at a separate level (the page-based physical memory allocator).

So for instance the system I'm working on is 64-bit. All of physical memory is mapped contiguously at 0xFFFF 8000 0000 0000 (i.e. at the beginning of the "high half", assuming 4-level page tables). There's a physical memory allocator which can be used to allocate a physical range, and "mapping" it is a trivial operation: just add the 0xFFFF8...000 offset. The heap allocator allocates arenas via the physical memory allocator, "maps" them, and returns addresses from within. The caller (of malloc for example) can take the returned address, subtract 0xFFFF8...000, and they have the physical address (if they need it).

In fact, I did see Doug Lea's code. Without sounding like I am disrespecting it in any way, there were so many macros used, a beginner would be completely lost.

I remember having a similar experience looking at Doug Lea's code for the first time!

I think it's great you're writing an allocator that's easy to adapt and easy to understand. That will hopefully be useful for people who are starting out making their own kernels!

Ethin · Post by **Ethin** » Wed May 04, 2022 1:37 pm

So I was actually thinking about a way of implementing a memory allocator that could work with Ada's constructs. I'm not really sure how feasible it is, but here's my idea, in the form of an algorithmic process:

Initialize a counter of pages and a position indicator to 0. Both are 64-bit. Both are of type Natural.
Identify the page size. If 1G pages are supported and available, then each "allocation bucket" is 1G. If 2M pages are supported, then each bucket will be 2M. Otherwise, each bucket is 4K.
Set the hard limit on buckets to 1024, such that the maximum amount of memory available to the kernel is either 1T, 2G, or 4M. (Maybe allow this to be adjustable or dynamic?)
When either Allocate is called or when an allocator is invoked:
1. If the requested allocation size would cross a page boundary, skip to the next page. If there are no more pages available, abort the allocation.
2. Save the current position indicator.
3. If a new page was not allocated, increase the position indicator by the requested allocation size, and return the address of the (old) position indicator. The current position indicator becomes the next allocation.
4. If a new page was allocated, set the position indicator to zero, increase it by the requested allocation size, and return the address of the start of the page.
When Deallocate is called, either implicitly or explicitly:
1. If the deallocated size results in the entire page being free, unmap it and restore the old position indicator.
2. Otherwise, subtract the position indicator by the size of the item being deallocated, zeroing all memory along the way.

I'm curious what you guys think of this. It seems, on the surface, to be a relatively simple design, but it probably has flaws somewhere (e.g. pages that haven't been used for a while should be deallocated). I've never written a memory allocator, ever, so this is my first (mental) attempt at designing one that's easy for me (and hopefully anyone else) to understand. I know that there are people who are a lot better at this than I am, though, so maybe there's a much better design I could pursue and understand. Hell, maybe this already exists and there's a name for it and I'm not aware of it. Thoughts?
Edit: I feel like this might be an interesting way of me learning and mastering paging. Using a 1T hard maximum for example would create addresses from 0h-ffffffffffh, assuming I did start at absolute address zero (though I've read that starting from the null address is a bad idea). I would obviously need to design a page table where physical memory is mapped in the upper half -- this would be a horrible idea on an identity-mapped system and could easily cause weird HW problems when program memory overwrote physical device bits. Page alignment wouldn't be overly difficult to guarantee, and though I've looked at the intel docs on paging a few times, I'm still a bit stumped (I'm not sure what's making this so difficult for me to understand -- I feel like the designers of paging made it deliberately convoluted just because engineers).

nullplan · Post by **nullplan** » Wed May 04, 2022 3:15 pm

Well, that just simply doesn't work. What if the thing being deallocated isn't the most recent item allocated? You are making a watermark allocator, and those are known for the fact that they cannot deallocate anything except the last item on the heap. Simple and effective, but runs consumes unbounded memory if the allocation patterns don't pan out.

Ethin · Post by **Ethin** » Wed May 04, 2022 4:33 pm

Eh, it was a good try, I suppose. I'll probably implement this one as a starting point -- thanks, by the way, Ben!

Octocontrabass · Post by **Octocontrabass** » Wed May 04, 2022 4:41 pm

Ethin wrote:I feel like this might be an interesting way of me learning and mastering paging.

Heap allocation and page allocation are two very different tasks. While it's true that most heap allocators care about paging, that's only an optimization: applications usually want things like cache/TLB locality, or the ability to release unused memory back to the OS. If your heap allocator isn't going to provide those features, it doesn't need to care about paging.

Ethin wrote:I've read that starting from the null address is a bad idea

Most C compilers use the address 0 to represent a null pointer (since in C, the integer 0 converted to a pointer is a null pointer). Compilers optimize on the assumption that a null pointer will always be invalid, so programs can behave oddly if you allow memory at address 0.

Ethin wrote:I would obviously need to design a page table where physical memory is mapped in the upper half

Physical memory doesn't have to be mapped when you're not using it. Having permanent access to all physical memory is a distinctly Linux thing - Windows doesn't work that way.

Ethin · Post by **Ethin** » Wed May 04, 2022 5:41 pm

Octocontrabass wrote:
Ethin wrote:I feel like this might be an interesting way of me learning and mastering paging.
Heap allocation and page allocation are two very different tasks. While it's true that most heap allocators care about paging, that's only an optimization: applications usually want things like cache/TLB locality, or the ability to release unused memory back to the OS. If your heap allocator isn't going to provide those features, it doesn't need to care about paging.

Ethin wrote:I've read that starting from the null address is a bad idea
Most C compilers use the address 0 to represent a null pointer (since in C, the integer 0 converted to a pointer is a null pointer). Compilers optimize on the assumption that a null pointer will always be invalid, so programs can behave oddly if you allow memory at address 0.

Ethin wrote:I would obviously need to design a page table where physical memory is mapped in the upper half
Physical memory doesn't have to be mapped when you're not using it. Having permanent access to all physical memory is a distinctly Linux thing - Windows doesn't work that way.

Thanks for those explanations. Yeah, I'll probably end up designing my OS the way windows works -- by only mapping those physical memory regions that I actually need. But I first need to master paging. Time to go re-read wiki and docs. Then I'll check out the bucket allocator some more.

neon · Post by **neon** » Wed May 04, 2022 11:50 pm

Hi,

If interested, we use a cache-based slab allocator. It allocates from the system pool if needed to expand a cache. The process works like this: get a free kernel page (pool), get a free page frame number (pmm), map it (paging) and delink it (pmm). I originally went with the slab allocator both because my target is a microkernel and because I always go the more complicated route.

A series of 2^k size general caches exist. So for general allocations, MmAlloc finds the cache that best fits and allocates from it. This would be the equivalent to malloc.

For an introduction though, I can see your code being very helpful. I think it is important to see different designs and how others work things out - everyone has their own creative approach to memory management.

OSDev.org

Bucket Allocator

Bucket Allocator

Re: Bucket Allocator

Re: Bucket Allocator

Re: Bucket Allocator

Re: Bucket Allocator

Re: Bucket Allocator

Re: Bucket Allocator

Re: Bucket Allocator

Re: Bucket Allocator

Re: Bucket Allocator

Re: Bucket Allocator

Re: Bucket Allocator

Re: Bucket Allocator