Handling physical memory schedules with paging

rdos · Post by **rdos** » Sat Oct 31, 2020 4:03 am

Many modern hardware devices are built on setting up physical memory schedules, like network cards, AHCI, and USB controllers. There is no direct way of accessing physical memory when using paging with a non-unity mapping. Many of these drivers tend to need extra entries in the schedules for virtual addresses, and when having a physical address it's hard to translate it to a virtual without maping it with paging, something that is slow and inefficient.

Another problem with using physical memory is that simple mistakes can lead to devices corrupting kernel or application memory. It's easy enough to do this with malicious code in a PCI device, but even well-behaved PCI devices can corrupt things in kernel by handing them corrupt schedules.

With 32-bit paging you can use 4M pages, and with long mode and PAE paging you can use 2M pages. These are large enough to create a memory pool of descriptors and buffers typically used by PCI devices that are based on physical memory schedules. It provides easy translation between physical address and ordinary adress and the reverse by adding or subtracting a "base". It's also easy to determine if a given physical address is within the 2M area by comparing the most significant bits.

For a segmented OS (like mine), it would be a good idea to allocate a selector with the linear base and a 2M limit. This way kernel software cannot exceed the limits of the 2M area and thus cannot access linear or physical memory that doesn't belong to the object. The translation between physical address and offset would then just subtract the physical base. If the offset is above 2M, a protection fault will result rather than accessing invalid memory.

Typical scheduled devices use many objects of the same size, which typically should be aligned. I think a bitmap can be used for allocation (rather than linked lists). The bitmap could either be part of the 2M area or a separate object.

There are a few challenges though. One is that it must be possible to allocate both 2M and 4M physical pages from both below 4G and anywhere in memory. This is because some PCI devices only support 32-bit physical addresses. Another is that it also must be possible to alocate 2M and 4M aligned linear addresses. When using 32-bit paging, 4M linear addresses and 4M physical pages must be used, and in order not to waste to much memory these blocks needs to be split into two 2M pages. It might also be possible to split blocks even further for devices that are not expected to use up to 2M of memory. For devices that need more than 2M, they could allocate new 2M areas on demand. It might even be a better idea to use a 64k base size instead of 2M. Or maybe support for both could be added.

So, maybe an API like this:
CreateMemoryBlock32() and CreateMemoryBlock64(). Returns a handle and creates a header with no memory objects
AllocateMemoryBlock(handle, size). Returns a 32-bit block handle, a 48-bit segmented address (or a linear address) and a physical address
GetMemoryBlockPhysical(handle). Converts between 32-bit block handle and physical address
GetMemoryBlockAddress(handle). Converts between 32-bit block handle and 48-bit address
FindMemoryBlockPhysical(physical). Finds the 32-bit block handle for a physical address.

The 32-bit handle could be split into different parts. The higher bits would be from CreateMemoryBlock(). The lowest bits would be from the 2M memory block. If the minimum allocation size is 32 bytes then up to 64k entries could be allocated from a 2M memory block and so 16-bits will be needed for block handles within a 2M block. This would leave 16-bits for the CreateMemoryBlock handle and associating multiple memory blocks to the same handle. Maybe 10 bits could be used for the handle (which would allow up to 1024 handles to be allocated) and 6 bits for the memory block, which would allow up to 64 memory blocks with a 2M size (a total size of 128M). Altough a more realistic scenario would be to initially allocate a few 64k blocks and then 2M blocks to save memory.

Using a 32 byte minimum block limit is a bit problematic as the handles will need to be two bytes and so would consume 128k. A better approach might be to only use 64k memory blocks for allocation of 32-byte memory blocks. This will only consume 4k. Always using 4k would set a minimum limit for allocation from 2M blocks to 1k. Thus, allocations of memory objects below 1k would use 64k blocks and above would use 2M blocks. This also means that this part will only use 10 bits instead of 16. The bitmaps used could also be made smaller with this method.

nullplan · Post by **nullplan** » Sat Oct 31, 2020 4:48 am

rdos wrote:Many modern hardware devices are built on setting up physical memory schedules, like network cards, AHCI, and USB controllers. There is no direct way of accessing physical memory when using paging with a non-unity mapping. Many of these drivers tend to need extra entries in the schedules for virtual addresses, and when having a physical address it's hard to translate it to a virtual without maping it with paging, something that is slow and inefficient.

You know, in 64-bit mode, you could just map all physical memory into kernel space, thereby giving you a linear mapping for all physical memory, making translation between physical and virtual memory entirely trivial. But I suppose you'd just be lazy if you did that.

rdos wrote:Another problem with using physical memory is that simple mistakes can lead to devices corrupting kernel or application memory. It's easy enough to do this with malicious code in a PCI device, but even well-behaved PCI devices can corrupt things in kernel by handing them corrupt schedules.

There is a solution to that as well, and it is called IOMMU. It allows you to intercept transactions from external devices before they reach the memory, and determine if something went wrong. And halt the system as needed.

rdos wrote:For a segmented OS (like mine), it would be a good idea to allocate a selector with the linear base and a 2M limit. This way kernel software cannot exceed the limits of the 2M area and thus cannot access linear or physical memory that doesn't belong to the object. The translation between physical address and offset would then just subtract the physical base. If the offset is above 2M, a protection fault will result rather than accessing invalid memory.

How does that address either of the problems you mentioned? When a PCI device corrupts your kernel, the OS is not performing a mistaken memory access, the device is (being made to by the OS). The PCI devices still write into your kernel, and you haven't solved the translation issue unless you are also mapping that segment linearly, in which case congratulations, you have invented linear memory mappings.

rdos wrote:There are a few challenges though. One is that it must be possible to allocate both 2M and 4M physical pages from both below 4G and anywhere in memory.

It's called zoning and it isn't hard. Fundamentally, you run independent memory allocators on each zone you wish to differentiate (I have a 1MB zone, a 4GB zone, and a zone beyond). They don't even need to all be the same allocator. And the overarching allocator just calls each of these in turn (first allocating from the zone beyond, then the 4GB zone, then the 1MB zone, as they get progressively more valuable). Then a restriction to a given maximum address is achieved by just allocating from one zone.

Or, like Linux, you maintain a list of physical memory blocks, and when a request for memory comes in, you iterate over the holes between the allocated chunks to see if you can find a hole that fits whatever criteria you need (size, alignment, maximum address, maybe even "does not cross 64kB line" if you are doing ISA DMA).

I admit that I fail to see a question in your post. Did you want to ask something or are you just bouncing ideas off a wall?

rdos · Post by **rdos** » Sat Oct 31, 2020 5:09 am

nullplan wrote:You know, in 64-bit mode, you could just map all physical memory into kernel space, thereby giving you a linear mapping for all physical memory, making translation between physical and virtual memory entirely trivial. But I suppose you'd just be lazy if you did that.

That's true, but I'm not in 64-bit mode. I can even run on very old PCs that doesn't support PAE, or has less than 4G of physical memory and then 32-bit paging requires less physical memory for setting up the environment.

nullplan wrote:
rdos wrote:Another problem with using physical memory is that simple mistakes can lead to devices corrupting kernel or application memory. It's easy enough to do this with malicious code in a PCI device, but even well-behaved PCI devices can corrupt things in kernel by handing them corrupt schedules.
There is a solution to that as well, and it is called IOMMU. It allows you to intercept transactions from external devices before they reach the memory, and determine if something went wrong. And halt the system as needed.

You cannot do that in production releases.

nullplan wrote: How does that address either of the problems you mentioned? When a PCI device corrupts your kernel, the OS is not performing a mistaken memory access, the device is (being made to by the OS). The PCI devices still write into your kernel, and you haven't solved the translation issue unless you are also mapping that segment linearly, in which case congratulations, you have invented linear memory mappings.

Because I can effectively monitor devices for returning the correct addresses and when I construct the schedules the physical addresses are effectively evaluated for being correct. I construct the memory buffert within the 2M selector and all physical addresses in the schedule are constructed by adding the physical base of 2M block. This way the schedule will not contain invalid physical addresses. It works the other way too. I can take a physical entry, subtract the physical base, and then use it to address within the 2M block. If the address is invalid, a protection fault will result. This solves many common mistakes and isolates problems to the driver itself.

When this is done in the "usual way", I will allocate a linear memory block, get it's physical page and set it in the schedule. If this page is overrun by hardware, or freed and reused by the OS, kernel memory will be overwritten. Here, at most, other blocks in the schedule itself might become corrupt by such mistakes, but not random memory in the kernel or application space.

nullplan wrote: Or, like Linux, you maintain a list of physical memory blocks, and when a request for memory comes in, you iterate over the holes between the allocated chunks to see if you can find a hole that fits whatever criteria you need (size, alignment, maximum address, maybe even "does not cross 64kB line" if you are doing ISA DMA).

I never thought Linux was so backwards. Once upon a time I also used linked lists for physical memory, but they are easily corrupted and allocating larger blocks of adjacent addresses becomes a nightmare.

nullplan wrote: I admit that I fail to see a question in your post. Did you want to ask something or are you just bouncing ideas off a wall?

The latter.

nullplan · Post by **nullplan** » Sat Oct 31, 2020 11:00 am

rdos wrote:That's true, but I'm not in 64-bit mode. I can even run on very old PCs that doesn't support PAE, or has less than 4G of physical memory and then 32-bit paging requires less physical memory for setting up the environment.

But in that case, mapping in all physical memory into the kernel side of address space becomes feasible again. Only after ca. 768MB of RAM (depending on how much I/O space you need) does this become impossible. With a higher-half kernel, anyway. And even then, you can move the border between kernel and user space to 2GB and get a bit more mileage out of the scheme.

rdos wrote:You cannot do that in production releases.

Can too, it's called Blue Screen Of Death, or Kernel Panic, or Guru Meditation, depending on your particular taste and vintage. If something is trying to write into your kernel, then your kernel has a bug. Better to stop it and force correction on part of the developer than to soldier on into even more bugs. If you have a well-separated driver system, you can figure out which driver is buggy and shut it down, and maybe relaunch it (like Minix does), but other than that, halting the system is the simplest choice.

rdos wrote: This way the schedule will not contain invalid physical addresses.

And there is your central assumption. In my experience, there is no algorithm so simple that a programmer cannot f*** it up. More to the point, if your memory segment is reusable, it is possible to break the reuse contracts. If it is not reusable, then you can only serve a limited number of requests before requiring a reboot. Again, nothing is gained by using segmentation. You changed the address calculation (by the way, you still have to take paging into account after segmentation) into something not particularly different (you need to add/subtract a base address in either case)

rdos wrote:I never thought Linux was so backwards.

Well, I did a bad job of representing it. What I described was the memblock allocator, which Linux uses only to bootstrap, and to setup the more advanced slab allocator. I understood the memblock allocator (was a bit difficult to understand that it is iterating over the holes between the allocated bits), but I have no idea about any of the possible slab allocators (SLAB, SLOB, and SLUB).

rdos · Post by **rdos** » Sat Oct 31, 2020 12:14 pm

nullplan wrote:But in that case, mapping in all physical memory into the kernel side of address space becomes feasible again. Only after ca. 768MB of RAM (depending on how much I/O space you need) does this become impossible. With a higher-half kernel, anyway. And even then, you can move the border between kernel and user space to 2GB and get a bit more mileage out of the scheme.

Well, I have a computer with 128GB of RAM (the motherboard only exports 100GB, but anyway), and 12 cores with hyperthreading. I certainly want that one to work just as well as the old PC-compatibles without PAE. I even think I can still run on a 386 SX processor, but, unfortunately the one I had no longer booted because of a bad CMOS RAM so I recycled it.

So, no, only supporting 768MB of RAM is not feasible. Besides, the kernel only owes 1GB while applications have 3GB.

nullplan wrote:
rdos wrote:You cannot do that in production releases.
Can too, it's called Blue Screen Of Death, or Kernel Panic, or Guru Meditation, depending on your particular taste and vintage. If something is trying to write into your kernel, then your kernel has a bug. Better to stop it and force correction on part of the developer than to soldier on into even more bugs. If you have a well-separated driver system, you can figure out which driver is buggy and shut it down, and maybe relaunch it (like Minix does), but other than that, halting the system is the simplest choice.

In production versions this results in a reboot and saving fault information. In debug versions, it triggers either the kernel debugger so I can debug the fault in a running OS, or the crash debugger which goes into it's own isolated environment so core state can be inspected if you have a supported keyboard.

nullplan wrote: And there is your central assumption. In my experience, there is no algorithm so simple that a programmer cannot f*** it up. More to the point, if your memory segment is reusable, it is possible to break the reuse contracts. If it is not reusable, then you can only serve a limited number of requests before requiring a reboot. Again, nothing is gained by using segmentation. You changed the address calculation (by the way, you still have to take paging into account after segmentation) into something not particularly different (you need to add/subtract a base address in either case)

It works just as well with a flat memory model, except for the limit checking. The central assumption is that a linear memory area will be mapped to consequtive physical memory so calculations between the two are made simple and safe. It's a bit like a unity mapping, except for it being local and "biased".

Sure, anything in kernel can be messed up with poor programs, but the issue is to lower the probability that it happens and spreads to unrelated code & data.

I've worked on providing crash debugger support for USB keyboards, and since this environment use unity mapping on a single 4k page writing USB drivers was much easier and resulted in code that is easier to read & understand. The experiences from that project has motivated my to rethink how the OS based USB drivers are done. It would be a lot easier if they were based on "biased" unity mapping.

rdos · Post by **rdos** » Sat Oct 31, 2020 4:30 pm

Apparently, the 4M page feature was introduced with the Pentium processor, and so it couldn't be expected to work on all processors. Also given that the size differs from the size of the directory in PAE mode (and long mode) which is 2M, a better strategy to implement this appears to be to fill in 512 page table entries with consequtive physical addresses rather than attempting to use 4M pages. This also means both 32-bit paging and PAE paging can use the same 2M size and there will be no need to allocate 4M aligned physical addresses.

rdos · Post by **rdos** » Sat Nov 21, 2020 4:57 pm

I know have functional code for this. It uses 4k pages instead as this is less wasteful when a device doesn't need a lot of blocks. I also have an extend table with linked 4k pages so I can have rather large buffers. Since many older USB chips cannot handle 64-bit addresses, and many new machines might have those chips and more than 4G of physical memory, I have one function to create 32-bit physical addresses and one for using any physical address. Blocks will be aligned to their size too, and so when a 32 byte block is requested it will be 32 byte aligned. Many hardware devices have requirements for alignment.

I will gradually convert all my USB drivers to use this. I then will be able to quickly convert physical addresses in the schedule to linear (with proper validation). I can use different base size of the blocks, and generally the transfer descriptor size will determine the base size.

Initially, I thought the XHCI design was pretty horrible, but I now find that this is a pretty good design that I would want to use with older USB standards. For UHCI, OHCI and EHCI I can create transfer rings that point to transfer descriptors (XHCI have the descriptors themselves in the transfer ring) so it's possible to schedule a large number of blocks on an USB device and know which one's are done and which are pending by using read/write pointers (just like XHCI does).

I will create a memory block object per USB device, and then all transfer descriptors and buffers will allocate memory from it. This also makes it simple to delete the device since all data is in the device selector and so everything is deleted by deleting this selector.

Since the design uses bitmaps, and x86 has functions to make those atomic with lock prefix, I decided to implement it to be multicore safe without requiring synchronization primitives. This also means that IRQs can use the functions since they never will block.

Bitmaps also have the advantage of detecting incorrect free operations, like freeing an invalid or non-allocated object or double freeing. I have breakpoints in the code to detect all of these mistakes.

OSDev.org

Handling physical memory schedules with paging

Handling physical memory schedules with paging

Re: Handling physical memory schedules with paging

Re: Handling physical memory schedules with paging

Re: Handling physical memory schedules with paging

Re: Handling physical memory schedules with paging

Re: Handling physical memory schedules with paging

Re: Handling physical memory schedules with paging