rdos wrote:
When I designed my current algorithm, the amount of linear memory was large and the amount of physical memory was small, so it made sense to create buffering based on linear memory. Today, it is more or less the opposite. A PC can have several GB of physical memory, and the kernel only has maybe 1/2 GB of linear memory for buffers. Basing the buffering algorithm on linear memory thus makes it impossible to use all of the spare physical memory for disc buffering.
Of course, with long mode this is a non-issue, but I have no plans to move to long mode right now. It's also a lot more challenging (and thus fun) to solve this for 32-bit protected mode.
So, if it is linear memory that is the scarce resource, and physical memory that is the abundant one, buffers should be based on physical pages rather than memory-mapped linear memory. This is also advantageous with most modern disc hardware, like AHCI, which uses physical addresses in its command tables rather than linear ones. The AHCI disc driver actually doesn't need to map buffers into linear memory at all; it works better with physical addresses only. So buffers handed to modern disc drivers will work best as physical memory buffers.
Another way to increase the linear memory available to filesystem drivers is to run the disc driver & filesystem driver as a separate process and then use the 3 GB of linear memory available to a user process for linear memory buffers. This means the linear buffers can be six times larger.
When my OS grows up, it wants to use a similar scheme. I want to have at least the filesystem drivers in user space, à la FUSE. Once the OS has determined, while resolving a page fault, that the required page is not in memory, you're already resigned to a slow device I/O path, so the added overhead of transitioning to user space is probably not significant, and the filesystem complexity can be contained safely within user space, keeping kernel mode simpler.
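As a rough sketch of that handoff (all names here are hypothetical, not rdos's API, and the "IPC" is just a function call; a real FUSE-style design would cross a process boundary at that point):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Request the kernel hands to the user-space filesystem daemon. */
struct fs_request {
    unsigned long inode;
    unsigned long offset;     /* byte offset into the file, page aligned */
    char data[PAGE_SIZE];     /* filled in by the daemon */
};

/* User-space side: all filesystem complexity lives here. A real
 * daemon would walk its caches and issue disc reads; this stub
 * just synthesizes recognizable contents. */
static int fs_daemon_handle(struct fs_request *req)
{
    sprintf(req->data, "inode %lu @ %lu", req->inode, req->offset);
    return 0;
}

/* Kernel side: the fault handler has already decided the page is not
 * in memory, so a slow path (including a transition into the daemon's
 * address space) is acceptable. */
static int handle_fs_fault(unsigned long inode, unsigned long offset,
                           char page[PAGE_SIZE])
{
    struct fs_request req = { .inode = inode, .offset = offset };
    int err = fs_daemon_handle(&req);  /* IPC + context switch in reality */

    if (err)
        return err;
    memcpy(page, req.data, PAGE_SIZE);
    return 0;
}
```

The point is the layering: the kernel only knows "get me this page of this inode", and everything filesystem-specific stays on the daemon's side of the request structure.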
rdos wrote:
A problem with mapping & unmapping linear pages is that the TLB needs to be invalidated, and also that IPIs must be sent to all cores so they also invalidate their TLBs. However, if the OS can make sure that all threads in the disc process runs on the same core, then only the TLB on that core will need to be invalidated when mappings change.
Thoughts? Is this a usable idea?
If you have your filesystem at least partly in user space, you'll have less buffer pressure in the kernel, so less buffer reuse and unmapping, and less need for TLB shootdown.
For kernel-space buffers, however, if each buffer has a dedicated linear address that is mapped on demand to the required physical page, then as long as a thread must hold the buffer's lock to access the buffer memory, you'll have no need to issue TLB shootdown IPIs at all. Buffer access would look something like:
Code:
struct buffer {
    mutex lock;
    page_t page;    /* physical page backing this buffer */
    size_t size;
    void *p;        /* this buffer's dedicated linear address */
};

struct buffer *get_buffer(page_t page, size_t size)
{
    /* Get an unused buffer from the pool - comes back locked */
    struct buffer *buf = get_free_buffer();

    buf->page = page;
    buf->size = size;
    /* Temporarily map the page(s) at the buffer's linear address */
    arch_map(buf->page, buf->size, buf->p);
    return buf;
}

void put_buffer(struct buffer *buf)
{
    /* Unmap the temporary mapping - local invalidation only, no IPIs */
    arch_unmap_no_tlb_shootdown(buf->p, buf->size);
    /* Put the still-locked buffer back into the pool */
    put_free_buffer(buf);
}
Thus, stale TLB entries from unlocked buffers are also guaranteed to be unused and won't be referenced. Once a buffer is handed out again, the first thing the locking thread does is remap the buffer's address, overwriting any stale TLB entry anyway. No TLB shootdown necessary.
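To see the scheme end to end, here's a rough user-space simulation of that pool (a sketch with invented names: `arch_map` is stood in by a memcpy from a fake physical-memory array, and the free list is a trylock scan over a fixed array):

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define NBUF      2
#define PAGE_SIZE 4096

typedef unsigned long page_t;

struct buffer {
    pthread_mutex_t lock;
    page_t page;
    size_t size;
    void *p;                /* this buffer's dedicated "linear" window */
};

/* Fake physical memory: frame n lives at phys_mem[n]. */
static char phys_mem[4][PAGE_SIZE];
static char windows[NBUF][PAGE_SIZE];
static struct buffer pool[NBUF];
static int pool_ready;

/* Stand-in for arch_map(): a real kernel would rewrite the PTE for
 * the window and do a local invlpg; here we just copy the frame in. */
static void fake_arch_map(page_t page, size_t size, void *va)
{
    memcpy(va, phys_mem[page], size);
}

struct buffer *get_buffer(page_t page, size_t size)
{
    if (!pool_ready) {
        for (int i = 0; i < NBUF; i++) {
            pthread_mutex_init(&pool[i].lock, NULL);
            pool[i].p = windows[i];
        }
        pool_ready = 1;
    }
    /* "Free list" by trylock scan: a locked buffer is in use. */
    for (int i = 0; i < NBUF; i++) {
        if (pthread_mutex_trylock(&pool[i].lock) == 0) {
            pool[i].page = page;
            pool[i].size = size;
            /* Remapping here is what makes stale TLB entries harmless. */
            fake_arch_map(page, size, pool[i].p);
            return &pool[i];
        }
    }
    return NULL;            /* pool exhausted */
}

void put_buffer(struct buffer *buf)
{
    /* No shootdown IPI: the next get_buffer() remaps before any access.
     * (Simplification: the lock doubles as the in-use flag here, whereas
     * the version above hands the still-locked buffer back to the pool.) */
    pthread_mutex_unlock(&buf->lock);
}
```

The simulation obviously elides the actual paging, but it exercises the invariant that matters: the mapping is refreshed under the lock before any access, so nothing ever reads through a stale window.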