OSDev.org

The Place to Start for Operating System Developers

How does your kernel/OS keep track of used memory regions in a process's address space?
Singly linked list: 5% [ 1 ]
Doubly linked list: 32% [ 7 ]
Binary tree: 5% [ 1 ]
Other tree (give details): 9% [ 2 ]
Static array: 9% [ 2 ]
Hash table: 5% [ 1 ]
Other method (give details): 36% [ 8 ]
Total votes: 22
 Post subject: Question about tracking memory regions in processes...
PostPosted: Tue May 01, 2007 7:49 am 

Joined: Sun Oct 22, 2006 5:31 am
Posts: 66
Location: Oxford, UK
I'm redesigning the address space support in my kernel, and I'm wondering which is the best/most popular option for tracking memory regions.

Discuss! :D

_________________
Code:
-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GCS/M/MU d- s:- a--- C++++ UL P L++ E--- W+++ N+ w++ M- V+ PS+ Y+ PE- PGP t-- 5- X R- tv b DI-- D+ G e h! r++ y+
------END GEEK CODE BLOCK------


 Post subject: Re: Question about tracking memory regions in processes...
PostPosted: Tue May 01, 2007 12:24 pm 

Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

senaus wrote:
I'm redesigning the address space support in my kernel, and I'm wondering which is the best/most popular option for tracking memory regions.

Discuss! :D


I tend to use the flags in page table entries as much as possible. For "present" pages there are at least 3 available bits, which are combined to give one of the following page types:
    0 = normal RAM page
    1 = locked RAM page
    2 = locked RAM page that's part of a DMA buffer
    3 = normal RAM page that's part of a memory mapped file (see note below)
    4 = special mapping (ROMs, video display memory, etc)
For "not present" pages there's at least 31 bits that are available. These are combined into a 31 bits number, where:
    0x00000000 = normal not present page
    0x00000001 = "allocate on demand" not present page
    0x00000002 = not present part of a memory mapped file

All other values mean "not present page that was sent to swap", and contain the "block number" in swap space. For 32-bit paging this gives me a maximum (combined) swap space size of "(2^31 - 3) * 4 KiB", or almost 8192 GiB of swap space. For PAE and long mode, page table entries are 64-bit and the maximum amount of swap space is insanely huge (almost 34359738368 TiB).
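
In C, the encoding might look something like this (a rough, untested sketch with invented names; bits 9-11 are the "available" bits of a present PTE, and everything except the present bit is free in a not-present one):
Code:
#include <stdint.h>

#define PTE_PRESENT    0x001u
#define PTE_AVL_SHIFT  9
#define PTE_AVL_MASK   (0x7u << PTE_AVL_SHIFT)

/* Page types for present pages, stored in the 3 available bits. */
enum page_type { PT_NORMAL = 0, PT_LOCKED, PT_DMA, PT_MMAP_FILE, PT_SPECIAL };

static inline uint32_t pte_set_type(uint32_t pte, enum page_type t)
{
    return (pte & ~PTE_AVL_MASK) | ((uint32_t)t << PTE_AVL_SHIFT);
}

/* Codes for not-present pages, stored in the 31 free bits (bits 1-31). */
#define NP_NORMAL     0u   /* normal not present page               */
#define NP_ON_DEMAND  1u   /* "allocate on demand" not present page */
#define NP_MMAP_FILE  2u   /* not present part of a mapped file     */
/* Any code >= 3 means "swapped"; the swap block number is code - 3. */

static inline uint32_t pte_make_swapped(uint32_t block)
{
    return (block + 3u) << 1;            /* present bit stays clear */
}

static inline int pte_get_swap_block(uint32_t pte, uint32_t *block)
{
    uint32_t code = pte >> 1;
    if ((pte & PTE_PRESENT) || code < 3u)
        return 0;                        /* present, or a special code */
    *block = code - 3u;
    return 1;
}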

For memory mapped files the page table flags only indicate whether a page is part of a memory mapped file or not. There will be separate structures for working out which file is mapped where (probably simple linked lists).

For swap space the only thing missing is a way to work out which pages should be sent to swap if the OS runs out of RAM (e.g. an LRU page eviction algorithm). I need to think about this more, but I want a global page eviction algorithm rather than a per-process one (i.e. when a page needs to be sent to swap, it could be any page in any process, not just a page in the currently running process). This means some sort of structure in kernel space for tracking when pages were last used.

On top of that I have regions that behave differently - one region for "process space", one for "thread space", and one for "kernel space". The boundaries between these regions are all at fixed addresses, except for the boundary between process space and thread space, which is determined by a per-process variable that's initialised from the executable file's header. There are also flags for each process that affect the behaviour of process space, and flags for each thread that affect the behaviour of thread space (mostly, whether allocation on demand is enabled by default or not).

Process space itself is split into separate areas - one for code and read-only data, one for read/write data, and the remainder for uninitialised data. The sizes of these areas are determined from the executable file's header too. Thread space is always treated as uninitialised data. For normal pages these areas determine the page protection flags (read-only, read/write, executable).

That's it.

There are a few deficiencies with this system - it can't handle shared memory areas, and doesn't allow for DLLs or shared libraries. This is fine for my OS, where shared memory is deliberately avoided (as processes may be running on different computers), and where DLLs or shared libraries should be implemented as static libraries and/or services that run as separate processes.

It also takes a little juggling in the linear memory manager to make it work. For example, if a process says "make this area allocation on demand" when the default is "allocation on demand disabled", then the memory manager needs to pre-allocate page tables for that area so it can remember. It also means that if a process says "make this area locked", the linear memory manager has to pre-allocate pages, because there is no "not present locked RAM" page type.

Of course for large areas the same thing can be done with flags in page directory entries (and page directory pointer table entries, etc.). This means (for example, assuming 32-bit paging) that if a process wants to change 1 GB of space to "allocation on demand", the linear memory manager can just set flags in the page directory for most of it, and only pre-allocate page tables if the area doesn't start or end on a 4 MB boundary.
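
As a rough sketch (untested, and the two helpers are invented for illustration), the linear memory manager would do something like:
Code:
#include <stdint.h>

#define CHUNK 0x400000u   /* 4 MB - one page directory entry (32-bit paging) */

/* Hypothetical helpers: flag a whole PDE, or a sub-range via a page table. */
extern void set_pde_on_demand(uint32_t vaddr);
extern void set_pte_range_on_demand(uint32_t start, uint32_t end);

void mark_on_demand(uint32_t start, uint32_t end)
{
    uint32_t lo = (start + CHUNK - 1) & ~(CHUNK - 1);   /* round up   */
    uint32_t hi = end & ~(CHUNK - 1);                   /* round down */

    if (lo >= hi) {                       /* range fits inside one chunk */
        set_pte_range_on_demand(start, end);
        return;
    }
    if (start < lo)
        set_pte_range_on_demand(start, lo);             /* unaligned head */
    for (uint32_t a = lo; a < hi; a += CHUNK)
        set_pde_on_demand(a);                           /* aligned middle */
    if (hi < end)
        set_pte_range_on_demand(hi, end);               /* unaligned tail */
}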


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject:
PostPosted: Tue May 01, 2007 1:30 pm 

Joined: Wed Oct 18, 2006 3:45 am
Posts: 9301
Location: On the balcony, where I can actually keep 1½m distance
I'm building an exokernel. Hence my memory tracking mechanism consists of the paging structures (a 1024- or 512-way tree) and a mirror tree in the kernel that keeps reference counts (also 512-way, as if it were an alternate-format PAE structure). This allows userland applications to manage physical and virtual memory with mostly O(1), lock-free and (under conditions) realtime calls to the kernel.
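
For illustration, the mirror might look roughly like this (a sketch of the idea only, not the actual kernel code):
Code:
#include <stdint.h>

#define ENTRIES 512   /* PAE-style tables, matching the real structures */

/* One counter table per page table; the directory mirrors the real one. */
struct ref_table {
    uint32_t count[ENTRIES];
};

struct ref_dir {
    struct ref_table *tables[ENTRIES];
};

/* O(1), lock-free reference count update for one mapping. */
static inline void ref_inc(struct ref_dir *d, unsigned dir_i, unsigned tbl_i)
{
    __atomic_fetch_add(&d->tables[dir_i]->count[tbl_i], 1, __ATOMIC_RELAXED);
}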

I'm currently trying to find a way to layer a page allocator on top of that without breaking the properties of the other calls. I've got some concepts, but I'm trying to mentally work out a proof of correctness to make sure it will actually work 8)

_________________
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]


 Post subject: Re: Question about tracking memory regions in processes...
PostPosted: Tue May 01, 2007 3:17 pm 

Joined: Thu Mar 08, 2007 11:08 am
Posts: 670
Brendan wrote:
All other values mean "not present page that was sent to swap", and contain the "block number" in swap space. For 32-bit paging this gives me a maximum (combined) swap space size of "(2^31 - 3) * 4 KiB", or almost 8192 GiB of swap space. For PAE and long mode, page table entries are 64-bit and the maximum amount of swap space is insanely huge (almost 34359738368 TiB).


I don't think swap space size is the biggest issue with the above scheme. I think the biggest issue is that if you need to bring a page in from swap for a read, and it's never modified, you end up writing it back into swap anyway if you don't keep track of the swap block while the page is present. In a sense every page is then dirty (assuming you map pages on demand; if not, then any non-zero page is always dirty).

If every page is dirty, there will never be any pages to free "fast" from processes, because removing a dirty page from a process involves writing it to disk. On the other hand, if the swap-space allocation is remembered even for in-memory pages, you can clean pages when you don't have anything else to do, so that hopefully you always have some clean pages that you can remove from processes essentially instantly.

Once you have a way to remember swap blocks for in-memory pages, you can use the same method for memory mapped files. If several processes map the same file (and you allow this) you'll (in a sane implementation) end up with shared memory as well, which allows shared libraries as a special case (though allowing CoW as well can simplify them).

_________________
The real problem with goto is not with the control transfer, but with environments. Properly tail-recursive closures get both right.


 Post subject:
PostPosted: Tue May 01, 2007 3:32 pm 

Joined: Thu Mar 08, 2007 11:08 am
Posts: 670
Oh, and my own kernel doesn't do that kind of stuff yet, but I've got a few alternatives in mind. Most likely I'll do region management with either (resizeable) arrays or binary trees (that's easy to abstract, though, so experimentation with different schemes is possible), to allow binary searches over sparse address spaces. Each region mapped into an address space will then map to a region in some file, with anonymous swap space handled by a special swapfs that isn't visibly mounted in the normal directory tree. Memory-mapped devices can likewise be handled as generalized files. Copy-on-write will need special support in the region structures, but that's pretty much it.

It'll take a while before I get around to writing that, though, since VFSv3 is under development, I want some networking first, and I'm planning to write a new "fair" scheduler. Until then my process memory management is simple: there's a "break" which you can move around, and the system will serve page faults by mapping zeroed pages below the break, never removing any page below it.

_________________
The real problem with goto is not with the control transfer, but with environments. Properly tail-recursive closures get both right.


Last edited by mystran on Tue May 01, 2007 3:33 pm, edited 1 time in total.

 Post subject: Re: Question about tracking memory regions in processes...
PostPosted: Tue May 01, 2007 3:33 pm 

Joined: Tue Oct 17, 2006 6:06 pm
Posts: 1437
Location: Vancouver, BC, Canada
mystran wrote:
I don't think swap space size is the biggest issue with the above scheme. I think the biggest issue is that if you need to bring a page in from swap for a read, and it's never modified, you end up writing it back into swap anyway if you don't keep track of the swap block while the page is present. In a sense every page is then dirty (assuming you map pages on demand; if not, then any non-zero page is always dirty).


This would be a pretty big problem... In my MM design (yet to be implemented for lack of time :roll:), I store the block number in the PTE as well, but copy it to the corresponding entry in the page frame DB (my physical memory manager's main data structure) when bringing a page in from swap. BCOS uses a linked list of pages for its PMM, I believe... Brendan, how would you handle the case that mystran pointed out?
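
Conceptually the PFDB entry would hold something like this (a simplified sketch, not the actual layout):
Code:
#include <stdbool.h>
#include <stdint.h>

#define NO_BLOCK 0xFFFFFFFFu

/* One entry per physical page frame. */
struct pfdb_entry {
    uint32_t swap_block;   /* block the frame was read from, or NO_BLOCK */
    bool     dirty;        /* set on write; the swap copy is now stale   */
    uint16_t refcount;
};

/* A clean frame with a known swap block can be reclaimed instantly:
   unmap it and point the PTE back at swap_block - no write-back needed. */
static inline bool can_reclaim_instantly(const struct pfdb_entry *e)
{
    return !e->dirty && e->swap_block != NO_BLOCK;
}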

mystran wrote:
Once you have a way to remember swap blocks for in-memory pages, you can use the same method for memory mapped files. If several processes map the same file (and you allow this) you'll (in a sane implementation) end up with shared memory as well, which allows shared libraries as a special case (though allowing CoW as well can simplify them).


I think the lack of shared memory and shared library support in BCOS is a feature, not a bug (Brendan, correct me if I'm wrong). Like Singularity, it is designed as a "sealed process architecture" (although with traditional hardware protection, unlike Singularity).

The paper on the sealed process architecture is here:

ftp://ftp.research.microsoft.com/pub/tr/TR-2006-51.pdf

_________________
Top three reasons why my OS project died:
  1. Too much overtime at work
  2. Got married
  3. My brain got stuck in an infinite loop while trying to design the memory manager
Don't let this happen to you!


 Post subject:
PostPosted: Tue May 01, 2007 3:38 pm 

Joined: Thu Mar 08, 2007 11:08 am
Posts: 670
Yeah, I didn't mean it'd be a bug not to allow shared memory. I just pointed out that if you allow memory-mapping a single file into several processes at the same time, you end up with either shared memory or a huge synchronization issue.

I know the Singularity stuff. ;)

Anyway, I remember having read that the Windows NT approach to the in-memory swap address problem is to use a separate "shadow page directory" holding the swap addresses. No idea about the details though, so I could have understood it totally wrong.

_________________
The real problem with goto is not with the control transfer, but with environments. Properly tail-recursive closures get both right.


 Post subject:
PostPosted: Tue May 01, 2007 4:28 pm 

Joined: Tue Oct 17, 2006 6:06 pm
Posts: 1437
Location: Vancouver, BC, Canada
mystran wrote:
Anyway, I remember having read that the Windows NT approach to the in-memory swap address problem is to use a separate "shadow page directory" holding the swap addresses. No idea about the details though, so I could have understood it totally wrong.


I wouldn't really call it a "shadow page directory"... actually, the PFDB in my kernel is based on NT's. You might be thinking of "prototype PTEs", which are like "shadow PTEs" I guess. They're used to help keep PTEs that point to shared pages consistent. I never really understood the details... I'm not planning to implement shared memory either. :)

_________________
Top three reasons why my OS project died:
  1. Too much overtime at work
  2. Got married
  3. My brain got stuck in an infinite loop while trying to design the memory manager
Don't let this happen to you!


 Post subject:
PostPosted: Tue May 01, 2007 5:42 pm 

Joined: Sat Apr 14, 2007 12:13 pm
Posts: 58
I just keep a static array of 4 or more memory regions (program data, heap, stack, and an IPC area - really just the other half of the heap, but I keep it separate because it makes things easier).

BUT, I keep my own page tables separate from the "real" page tables. This makes things a lot easier, since I can store arbitrary data with any MM object (too much overhead is not good of course, but 12 bytes per page is acceptable in my opinion).

These structures keep track of shared memory, CoW and all that stuff. Keeping them in sync might seem troublesome, but most of that is invisible outside the MM.
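
For example, a shadow entry might look something like this (just an illustration of the idea, not my exact layout):
Code:
#include <stdint.h>

/* One shadow entry per page, kept alongside the "real" page tables. */
struct shadow_pte {
    uint32_t flags;        /* CoW, shared, present-in-swap, ...        */
    uint32_t share_link;   /* index of the next sharer, if shared      */
    uint32_t swap_block;   /* where this page lives (or lived) in swap */
};                         /* 12 bytes per page, as quoted above */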


 Post subject:
PostPosted: Tue May 01, 2007 8:01 pm 

Joined: Thu Oct 21, 2004 11:00 pm
Posts: 248
I use an array of arrays. The higher-level array contains structures denoting a starting virtual address and access rights, each holding the array of pages mapped at that virtual address with those permissions.
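
In C terms, roughly (an illustrative sketch; the field names are invented):
Code:
#include <stddef.h>
#include <stdint.h>

/* Inner level: one physical frame per mapped page. */
struct region {
    uintptr_t  vbase;      /* starting virtual address         */
    uint32_t   rights;     /* access rights for the region     */
    size_t     npages;
    uintptr_t *frames;     /* frames mapped from vbase onwards */
};

/* Outer level: all regions of one address space, sorted by vbase. */
struct address_space {
    size_t         nregions;
    struct region *regions;
};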


 Post subject: Re: Question about tracking memory regions in processes...
PostPosted: Tue May 01, 2007 9:55 pm 

Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

mystran wrote:
Brendan wrote:
All other values mean "not present page that was sent to swap", and contain the "block number" in swap space. For 32-bit paging this gives me a maximum (combined) swap space size of "(2^31 - 3) * 4 KiB", or almost 8192 GiB of swap space. For PAE and long mode, page table entries are 64-bit and the maximum amount of swap space is insanely huge (almost 34359738368 TiB).


I don't think swap space size is the biggest issue with the above scheme. I think the biggest issue is that if you need to bring a page in from swap for a read, and it's never modified, you end up writing it back into swap anyway if you don't keep track of the swap block while the page is present. In a sense every page is then dirty (assuming you map pages on demand; if not, then any non-zero page is always dirty).


Inside the kernel there are "kernel modules". One of these keeps track of page usage (a "page usage manager"?). It tells the linear memory manager which pages to evict from RAM (and which pages are "high usage" and should have their placement corrected if a process is migrated from one NUMA node to another, but that's another story).

Because the page usage manager is responsible for deciding which pages get sent to swap, it needs to know about anything that can make one page a better choice than another. This means that if the physical memory manager brings a page in from swap for a read, it needs to tell the page usage manager what it did (which process/thread, and which swap block was read at what linear address), and the linear memory manager can forget about that page after that. In this case the page usage manager would remember that the page is still in swap, and if the page is modified (or if the system starts running out of swap space) it'd free the "swap block" and forget about the page.

Basically, for each "in use swap block", either the physical memory manager keeps track of it (if the data isn't in RAM) or the page usage manager keeps track of it (if the data is in RAM).

To be honest I haven't spent much time thinking about the interface between the linear memory manager and the page usage manager. The nice thing about my linear memory management is that I don't actually need to think about the page usage manager until after I've got basic linear memory management working - everything except swap space and memory mapped files can be done with page table flags and nothing else (and swap space and memory mapped files can be added later). ;)

mystran wrote:
Once you have a way to remember swap blocks for in-memory pages, you can use the same method for memory mapped files. If several processes map the same file (and you allow this) you'll (in a sane implementation) end up with shared memory as well, which allows shared libraries as a special case (though allowing CoW as well can simplify them).


A file can be opened read-only any number of times. If a file is opened read/write then (conceptually) a virtual copy of the file is created, which is atomically written to the file system as a new version of the old file when the file is closed. The same file can be opened read/write any number of times to create any number of new versions of that file. For legacy file systems (file systems that don't support versioning, like FAT) the OS refuses to allow the file to be opened read/write if it's already open as read-only or read/write, and refuses to allow it to be opened read-only if it's already open as read/write.

For memory mapped files, creating the memory mapping is the same as opening the file, and unmapping the file is the same as closing the file. It doesn't make any difference which process or thread opens or closes what (e.g. a single thread could open the same file read/write 6 times and still end up creating 6 new versions of the old file). Lastly, file handles can't be transferred - if a thread opens a file and gets a file handle, other threads in the same process can't use that file handle.


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject:
PostPosted: Wed May 02, 2007 7:09 pm 

Joined: Sun Oct 22, 2006 5:31 am
Posts: 66
Location: Oxford, UK
Thanks everyone, some very interesting ideas there. Now it's time for me to explain my design... here goes!

Basically, I use three structures alongside the page tables - MemoryRegions, BackingStores and Pages. I'll explain these first.

A MemoryRegion defines an arbitrary sized range of virtual addresses in a process's address space. I am currently trying to work out a way of storing these structures efficiently.

A Page structure represents a physical page frame. These structures are stored in a large array, allocated upon initialization, where each element represents its corresponding page frame.

A BackingStore is an object (managed by my object manager) which represents the physical backing storage behind MemoryRegions - a MemoryRegion can be a "window" into a BackingStore (this is optional; a MemoryRegion with no BackingStore is treated as heap). Each BackingStore also contains a "Page Cache Table" - a hash table of pointers to Page structures representing the parts of the BackingStore currently resident in memory.

The page tables are the link entities between MemoryRegions and Pages; when a page frame is committed to a MemoryRegion, a page table entry is filled in, linking the two.
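
In C terms the three structures might look roughly like this (a sketch for illustration - the field names are invented, not my actual code):
Code:
#include <stddef.h>
#include <stdint.h>

struct Page {                        /* one per physical page frame */
    struct BackingStore *store;      /* store this frame caches, if any  */
    uint64_t             offset;     /* offset of the frame in the store */
    uint32_t             flags;      /* clean/dirty, shared, locked, ... */
    struct Page         *hash_next;  /* chain in the Page Cache Table    */
};

struct BackingStore {                /* object-manager object */
    struct Page **cache_buckets;     /* "Page Cache Table": offset -> Page */
    size_t        nbuckets;
    int           persistent_rw;     /* nonzero once mapped persistent RW  */
};

struct MemoryRegion {                /* range in a process address space */
    uintptr_t            base;
    size_t               length;
    uint32_t             prot;         /* read/write/persistent flags  */
    struct BackingStore *store;        /* NULL means plain heap        */
    uint64_t             store_offset; /* window position in the store */
};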

All memory allocation is done on demand, and can be satisfied from the Page Cache. When a page fault comes in for a MemoryRegion with a BackingStore (which could be a memory mapped file), the kernel does one of four things (sketched in code below):

    If the fault was caused by a write, and the MemoryRegion is persistent read-write (meaning changes must be flushed back to the BackingStore), the mapped page frame is moved (or copied, if shared) to the dirty cache.

    If the fault was caused by a write, and the MemoryRegion is read-write, the page is moved (or copied) out of the Page Cache.

    If the page is not present, the kernel looks in the BackingStore's Page Cache Table to find the relevant page. If it isn't found, a fresh page is allocated and the data is read in (whilst the faulting thread is blocked).

    If none of the above conditions are satisfied, an exception is thrown for the process.


Pages in the dirty cache are locked and written to the BackingStore lazily. Once written, they are moved back to the clean cache.
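
In rough pseudo-C, using the sketched structures from above (helper names are invented; locking and error handling omitted), the fault path is:
Code:
#define PROT_WRITE      0x2u
#define PROT_PERSISTENT 0x4u   /* invented flag: changes flushed to store */

extern struct Page *lookup_mapped_frame(uintptr_t addr);
extern struct Page *page_cache_lookup(struct BackingStore *s, uint64_t off);
extern struct Page *allocate_and_read(struct BackingStore *s, uint64_t off);
extern void move_or_copy_to_dirty_cache(struct BackingStore *s, struct Page *p);
extern void move_or_copy_out_of_cache(struct BackingStore *s, struct Page *p);
extern void map_page(uintptr_t addr, struct Page *p);
extern void raise_exception(struct MemoryRegion *mr, uintptr_t addr);

void on_page_fault(struct MemoryRegion *mr, uintptr_t addr, int is_write)
{
    uint64_t     off = mr->store_offset + (addr - mr->base);
    struct Page *pg  = lookup_mapped_frame(addr);    /* via page tables */

    if (is_write && pg && (mr->prot & PROT_PERSISTENT)) {
        /* 1: persistent RW - changes must reach the BackingStore */
        move_or_copy_to_dirty_cache(mr->store, pg);
    } else if (is_write && pg && (mr->prot & PROT_WRITE)) {
        /* 2: plain RW - take the page out of the Page Cache */
        move_or_copy_out_of_cache(mr->store, pg);
    } else if (!pg) {
        /* 3: not present - check the Page Cache Table, else read it in */
        pg = page_cache_lookup(mr->store, off);
        if (!pg)
            pg = allocate_and_read(mr->store, off);  /* thread blocks */
        map_page(addr, pg);
    } else {
        /* 4: nothing matched - throw an exception for the process */
        raise_exception(mr, addr);
    }
}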

The design of this system allows a BackingStore to be mapped persistent RW only once; this can be done whilst there are non-persistent mappings on the BackingStore - however, no more mappings can be made on a BackingStore once it is mapped as persistent RW.

I hope that all makes sense :?
It is late here, and I'll probably be suitably embarrassed once I re-read this in the morning, but what the hell. I'll be glad to hear some constructive criticism on this!

Cheers,
Senaus

_________________
Code:
-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GCS/M/MU d- s:- a--- C++++ UL P L++ E--- W+++ N+ w++ M- V+ PS+ Y+ PE- PGP t-- 5- X R- tv b DI-- D+ G e h! r++ y+
------END GEEK CODE BLOCK------


 Post subject:
PostPosted: Sat May 19, 2007 8:06 pm 

Joined: Thu Oct 21, 2004 11:00 pm
Posts: 248
Since I'm writing a microkernel, I've decided to add L4-style map/unmap/grant virtual memory management. Does anyone know of an efficient implementation of an L4-style mapping database?


 Post subject:
PostPosted: Thu Aug 30, 2007 10:07 pm 

Joined: Thu Aug 30, 2007 9:09 pm
Posts: 102
:oops: I'm still writing down the specifications for mine.

I've been planning to modify reiser4 to achieve Hans' file system semantics, and to implement a multi-entry-point, multi-threading executable file format on top of it with a privilege-based security framework.

My line of thought was to allow a "program" to have multiple entry points, possibly for different threads, and therefore multiple code and stack segments. To accomplish this, "programs" need to provide information (generated at compile time) about the initial size of each of these segments, as well as their maximum size. This information allows "the OS" to allocate pages to the "program" marked with the appropriate access bits.
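
For instance, the header information might boil down to one descriptor per segment, along these lines (a hypothetical layout):
Code:
#include <stdint.h>

/* One descriptor per code/data/stack/dynamic-array segment,
   generated at compile time and read by the loader. */
struct seg_desc {
    uint64_t initial_size; /* pages allocated up front               */
    uint64_t max_size;     /* hard cap; hitting it signals overflow  */
                           /* or runaway recursion                   */
    uint32_t access;       /* read/write/execute bits for the pages  */
};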

The initial size of the code and data segments tends to be accurate, while the stack and dynamic arrays tend to expand.
Quote:
For this reason the dynamic arrays are placed in the middle of the (64-bit) address space and the stack at the top (0xFFFF_FFFF_FFFF_FFFF). This makes it impossible to achieve a stack overflow condition with rationally chosen program maximums (which are tested on allocation of real resources, and exist to detect overflows, infinite recursion, etc.).


On second thought, there are multiple stacks. Allocate them like dynamic arrays. Provide no guarantees about the address space of stacks other than that ESP and EBP will be adjusted accordingly. Re-address stack pages and adjust ESP and EBP if an address space collision occurs. Guarantee dynamic array addresses, except across allocations.

An added benefit of this mechanism seems to be that, with sparse addressing, one can not only append space to large dynamically allocated arrays but, often enough, prepend it as well.

_________________
There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies.
- C. A. R. Hoare

