thewrongchristian wrote:
So, from a FS server POV, a request will be self-contained in a page-sized buffer? What will that buffer contain? Stuff like the file operation and details, file offset to read/write, size, etc.?
The operation type has its own field, and most things will be passed & returned in x86 registers (eax, ebx, ecx, edx, esi and edi). The client side, as well as the kernel side of the server, is written in x86 assembly, so a register-based interface seems to be the best alternative. Even the user-mode part that decodes the requests is in x86 assembly, and it interfaces with C/C++ using OpenWatcom register pragmas.
thewrongchristian wrote:
How is the data buffer aligned in that case? A common use case would be paging in data, in which case you want the data buffer to be page aligned and sized, and the request data to be disjoint and pointing at the buffer to use. Your request buffer needs to be separate from your data buffer.
Data will be passed as physical addresses to the objects in question. This allows up to 510 physical pages to be passed from server to client. I will add a syscall that extracts physical addresses from server space and places them in the reply. For the client side, which runs in kernel mode, these functions are already accessible. This needs to be handled carefully, though, so the server process cannot gain access to the kernel. Perhaps the mapping should always be part of a call that returns the buffer to the client, so the server process never sees the physical addresses. I think passing physical addresses from client to server should not be possible, and it is probably not needed either.
thewrongchristian wrote:
Consider a request to read in a page sized block of data (to fulfill a demand page request in the VMM):
- VFS is called with a getpage() request to return a page filled in with data from the given vnode/offset. In my VFS, getpage() supplies and returns the page filled in; it is not pre-allocated, as the vnode might be a device vnode to a memory mapped device, in which case it can just return the physical page from the MMIO region for the device.
- VFS sees this vnode is attached to a user space FS, and so formulates an IPC request to the FS server, queues the request, and waits for the response. (notice, no memory is allocated to be filled in.)
- FS server now runs, and retrieves the request to read from vnode/offset/size. At this point, there is no data buffer, so the FS server uses a page-aligned pre-allocated virtual buffer, and populates the buffer with the desired data.
- FS server responds to the request, with a list of page sized references to buffers in the FS server address space. These buffers could have been filled in on the FS server's behalf by an underlying disk device driver, or they could be filled in on the fly by the FS server itself if it is providing some sort of virtual filesystem (such as a transparent compression wrapper.)
- VFS gets the response, and immediately uses the buffers returned in the response to return the underlying physical page(s) from getpage(). It should steal these pages, unmapping them from the FS server at the same time, so the pages can't be subsequently referenced by the FS server after they're returned.
I've not yet completely decided how to implement this, but my guess is that the client will issue a file lock request, putting the file position in EDX:EAX and the number of pages in ECX. The server will then return the physical addresses of the file data (which will typically reside in the disc cache). When the client no longer needs access to the file data, it must unlock that part of the file with a new request.
Perhaps more interesting is that normal file I/O can be implemented by mapping file data into user space and letting user space read directly from these buffers. In this case, the kernel will send the lock requests as needed when the application tries to read non-mapped data.
thewrongchristian wrote:
You'd need a similar mechanism for page writes. The FS server would need some way of getting the data to write mapped, perhaps the IPC mechanism can have some way of reserving page aligned buffers in the FS server address space, and the write request will just reference the existing buffer addresses in the FS server address space of where the written data can be found. The FS server can then either use those buffers as is and pass them off to the underlying device, or read the buffers and process them in whatever way it sees fit (compressing the data, for example, in the case of a transparent compression wrapper.)
Writes are a bit more complicated, but they will typically start with a read lock and then be followed by write requests on the modified pages. A complication is that if file data is not page-aligned in the FS, things get more involved. There is also a need to "collect" writes so that discs are not worn out by too many small writes when applications write byte-by-byte.
thewrongchristian wrote:
Basically, you need to separate out the management of buffers from the management of IPC requests.
They basically are. The IPC structure is kernel-only, while the 4k data block that contains registers & data is accessible to the user-mode FS server process.
I have already implemented the first (and simplest) request: returning free space. The free-cluster count is collected when the partition starts up, so reading it out is as simple as returning it in EDX:EAX. It seems to work.
The next step is to cache directories and to implement opendir, readdir and closedir. In the current solution, I create a snapshot of the directory in a memory object, and readdir then returns a particular entry. In the new design, I think I will use version numbers for directories. Since directory contents are meta-data that differ by FS, they will need to be kept in the server process's user address space, and this cache should be page-aligned.

My idea is that the client will put a path in the data area of the request and issue a "lock directory" request. The server will then return physical pages in the data area: ECX will hold the number of pages, and EBX might hold the version of the directory. The server will keep a lock count per version of the directory data and delete versions once they are no longer current and no longer used. The client will map the directory entries into a kernel area and service readdir by reading particular entries. It's also possible that the entries should be mapped read-only into user mode, with readdir implemented completely in user mode instead.