Avoiding redundant copies with direct user IPC

Demindiro · Post by **Demindiro** » Fri Jun 24, 2022 11:37 am

Hello, I'm currently trying to reduce excessive copying in my OS. While experimenting with various approaches I've come up with a model that should allow eliding most redundant copies relatively easily. I wonder what you think about it.

I'll briefly explain my current design and the problems I face with it, then I'll explain how my new model would (roughly) work.

Current design

There are two major aspects to my design: process isolation and asynchronous I/O.

The process isolation works by not explictly exposing any global state. Instead, all operations are performed on an object. For example, where a UNIX system would use `int open(const char *path)`, my design essentially uses `int open(int object, const char *path)`. Not only makes this sandboxing trivial, it also makes it easy to add "proxy" objects. E.g a firewall can be implemented by imitating a real network interface and filtering requests as needed.

Asynchronous I/O is implemented via ring buffers: a client puts requests in this buffer, notifies the kernel and the kernel then passes all these requests to the appropriate objects.

User processes are able to receive these requests by exposing a "table". Currently the only available table type is a "StreamTable", which exposes all basic operations and has a simple API, but more specialized tables are intended to be added later (e.g. FileTable).

Problems with the current design

One obvious disadvantage of the current design is that data associated with a request must also go through the kernel. For small amounts of data this is not an issue, but for large amounts the impact of (double!) copying becomes noticeable.

There are ways to avoid these redundant copies, such as sharing mappings (and using COW where necessary) but these aren't entirely free either as you need to walk the page table and invalidate TLB entries when necessary. These techniques also can't be used if the alignment of the source and destination doesn't match.

Potential solution: Sharing memory directly between processes

One way to avoid these redundant copies is allowing two processes to access shared range of memory. One process can then write data to this range and another process can read data from this range, trivially avoiding redundant copies.

This can be taken further advantage of by using specialized I/O structures, much like how different hardware define their own structures, and bypassing the kernel's asynchronous I/O queue.

I am still unsure how exactly these memory objects should be shared though: while there is already `share()` operation, it may make sense to allow sharing objects directly with e.g. `open()`, reducing the amount of I/O operations needed.

Notifying processes of changes to memory

Of course, we need a way to notify processes when an update has occurred to this shared memory. This can be done with semaphores.

A semaphore can hold a value (often an integer) and is atomically updated. When a thread wants to check if an update has occurred, it polls this semaphore. If no update has occurred, it can tell the kernel to wake it when it has.

The API for waiting on a single semaphore may look like `wait(semaphore, timeout)`, where the kernel will return to the thread when either the semaphore has been updated or the timeout has expired.

It may also be useful to wait on a (small) set of semaphores. For this there can be a`wait_any` call which takes a list of semaphores.

For a large set of semaphores it makes more sense to let the kernel keep track of all the semaphores. Such an API may look like this:

Code: Select all

create_semaphore_set()
add_semaphores(set, semaphores)
remove_semaphores(set, semaphores)
wait_set_any(set, timeout)

Potential problems with the solution

Two potential issues come to mind:

Mapping and unmapping objects and allocating a semaphore for transferring only a small amount of data once may be excessively expensive compared to simply copying it into kernel space and back. It may make sense to keep the old API around for such operations.
In many languages unsynchronized access to memory is undefined behaviour. It is possible to avoid this by using volatile accesses. However, the compiler is unable to coalesce or elide these accesses. Luckily, LLVM (and Rust) exposes an intrinsic for copying data with volatile accesses. While these accesses cannot be elided they can be coalesced. A better alternative would be to have a concept of "untrusted" memory in the language with corresponding operations. I've seen this also referred to as "freezing operations". This thread goes into more detail.

nexos · Post by **nexos** » Fri Jun 24, 2022 1:36 pm

I see the basic idea with your design. It would work great for heavily cooperating processes. In reality, a heavily loaded server would have to create a lot of shared memory blocks due to a lot of concurrent requests, which is expensive.

I am going to write a microkernel. How a message gets copied will depend on the size of the message.
First, let me preface by saying messaging will be synchronous. Asynchronous-ity will be implemented higher up.

For small messages (0 - to the max supported by the architecture), data from the source will first be checked to be present, and then data will be copied into registers. The receiver will take the date from the registers into the destination buffer.

For medium sized messages, (probably to around 250 KiB or so) the kernel will grab the physical address of the sender's buffer. The receiver will temporarily map that address to kernel space using a fast, direct mechanism. From there, it will copy it to the receiver's buffer. I would like to use SSE or something, but that might be impractical.

For large messages, CoW will be used.

Also, an API will be implemented that uses shared memory for heavily cooperating processes.

Demindiro · Post by **Demindiro** » Fri Jun 24, 2022 2:39 pm

nexos wrote:I see the basic idea with your design. It would work great for heavily cooperating processes. In reality, a heavily loaded server would have to create a lot of shared memory blocks due to a lot of concurrent requests, which is expensive.

For heavy networking I intend to let the server process share just one large shared memory block which is used for multiple connections. This block will contain a specialized ring buffer for communication. Packet payload data is also copied to this block.

nexos · Post by **nexos** » Fri Jun 24, 2022 4:33 pm

Ah, that makes sense

Demindiro · Post by **Demindiro** » Sat Jul 02, 2022 1:24 pm

nexos wrote:I would like to use SSE or something, but that might be impractical.

On modern processors `rep movsb/q` is quite fast, is only 2 bytes (vs potentially a few hundred) and doesn't require saving/restoring SSE registers. I recommend using it since an equivalent SSE algorithm is much larger and puts more pressure on the cache. I also expect the performance of rep instructions to further improve in the future.

For comparison, here are results for a Ryzen 2700X (memcpy_rust uses `rep movsq` first, then `rep movsb`):

Code: Select all

test memcpy_builtin_1048576           ... bench:      28,929 ns/iter (+/- 5,901) = 36246 MB/s
test memcpy_builtin_1048576_misalign  ... bench:      28,395 ns/iter (+/- 5,468) = 36928 MB/s
test memcpy_builtin_1048576_offset    ... bench:      28,450 ns/iter (+/- 5,723) = 36856 MB/s
test memcpy_builtin_4096              ... bench:          71 ns/iter (+/- 11) = 57690 MB/s
test memcpy_builtin_4096_misalign     ... bench:          76 ns/iter (+/- 13) = 53894 MB/s
test memcpy_builtin_4096_offset       ... bench:          73 ns/iter (+/- 8) = 56109 MB/s
test memcpy_rust_1048576              ... bench:      58,909 ns/iter (+/- 9,668) = 17799 MB/s
test memcpy_rust_1048576_misalign     ... bench:      61,874 ns/iter (+/- 7,405) = 16946 MB/s
test memcpy_rust_1048576_offset       ... bench:      61,128 ns/iter (+/- 7,211) = 17153 MB/s
test memcpy_rust_4096                 ... bench:          87 ns/iter (+/- 1) = 47080 MB/s
test memcpy_rust_4096_misalign        ... bench:         162 ns/iter (+/- 9) = 25283 MB/s
test memcpy_rust_4096_offset          ... bench:         163 ns/iter (+/- 11) = 25128 MB/s

(The misaligned/offset benchmarks are slow because writes aren't aligned. Aligning the destination increases perf to about 40GB/s)

I've finished a proof-of-concept of direct user IPC by rewriting the StreamTable to use shared memory. While it is more complex in terms of LoC the API is actually nicer than before IMO.

Like the standard I/O queue it uses a ring buffer for responses and requests. A major difference is that the request and response queue are "swapped": instead of the user process needing to initiate a read request to accept jobs and a write request to finish one the kernel can directly put entries on the queue. The entries themselves are also a lot smaller: requests are only 16 bytes and responses 12, whereas the standard I/O queue's are 32 and 16 bytes respectively.

There are two shared regions: one region for the queues and another for transferring data. The kernel allocates the region for the queue but the user provides the data region. This allows using e.g. a DMA region for data, which avoids the need for either bounce buffers or needing to look up physical addresses.

I haven't implemented the semaphore API. Instead I've added a "notify" object. The user process can do an empty read to wait for requests and an empty write to notify the kernel of available responses.

Sharing the data buffers proved to be tricky, since I can't reliably use a lock as a malicious user process can mess with it. A full blown allocator like dlmalloc would also have been overkill.

I ended up dividing the region in blocks of equal size. These blocks are added to a lock-free linked list which is used for allocation.

To transfer data, either a single block or a linked chain of scatter-gather lists is used, like so:

Code: Select all

	  L0 -------> L1 -------> .. -------> LN -------> Dn
	 /  \        /  \                    /  \
	 D0 D1 ..   Dm  Dm+1 ..             ..  Dn-1

Creating & using a StreamTable typically goes as follows:

Allocate storage for shared data buffers
Create table with data buffers
Map both the table's queue and the buffers
In a loop:
1. Dequeue any requests, handle them and send a response now or later
2. If any responses were send, do a write to the notify object
3. Do a read on the notify object

To make defining IPC data formats easier I've decided to implement something akin to Permebufs. The main difference is that my schema has a large focus on bitpacking, e.g.:

Code: Select all

struct Request {
	ty: RequestType
	binary: u1
	advance: u1
	meta: u1
	_reserved: u1
	id: Id
	handle: Handle
	args: RequestArgs
}

enum RequestType {
	Read
	Write
	Open
	Close
	Create
	Destroy
	SeekStart
	SeekCurrent
	SeekEnd
	Share
}

union RequestArgs {
	offset_u: u64
	offset_s: s64
	share: Handle
	slice: Slice
	amount: u32
}

alias Id: u24
alias Handle: u32

...

It is still very WIP though. I'm not sure what features it should have either, let alone how closely tied it should be to my OS.

nexos · Post by **nexos** » Wed Jul 06, 2022 12:11 pm

Demindiro wrote:On modern processors `rep movsb/q` is quite fast, is only 2 bytes (vs potentially a few hundred) and doesn't require saving/restoring SSE registers. I recommend using it since an equivalent SSE algorithm is much larger and puts more pressure on the cache. I also expect the performance of rep instructions to further improve in the future.

Hmm, that is interesting. I guess SSE would be good only for extremely large buffers. But in that case, you might as well use shared memory.

Come to think of it, why do C libraries (I'm talking to you, glibc!) use SSE and AVX so much for memcpy, memset, strlen, et.al? For applications where that is done a lot, that makes sense, but for a typical application, I think the overhead of saving / restoring with fxsave or xsave is greater than using versions of said functions optimized without processor extensions.

nullplan · Post by **nullplan** » Wed Jul 06, 2022 1:24 pm

nexos wrote:Come to think of it, why do C libraries (I'm talking to you, glibc!) use SSE and AVX so much for memcpy, memset, strlen, et.al?

To brag, mostly.

Well, that's half true. Historically, repeated string instructions were a mixed bag, with some optimization guides explicitly calling for new code to avoid them (they are Vector Path instructions and take quite long to execute -- by themselves. Of course, they also do more than any other integer instruction). It's sort of like unaligned memory accesses, which went from slow to fast and back again quite a few times (as I recall, currently they are fast). So some computer scientists got together and found quite a few very clever algorithms for all of these functions, and the glibc people implemented them and rolled them out for SSE and AVX. And then they benchmarked it and found their versions to be faster than the old version, and in glibc, nobody cares about code size. Of course, this can only ever matter when you start copying gigabytes of data, which only a select few applications are ever going to do. But the metrics were good.

To my knowledge, it is still the case that SSE and AVX manage to copy faster than repeated string instructions. At vastly increased footprint for something that is not really a central part of your code. The largest thing the processor is ever going to have to copy around in a modern PC is the frame buffer. All other major memory accesses are done with DMA.

nexos · Post by **nexos** » Wed Jul 06, 2022 3:19 pm

nullplan wrote:Of course, this can only ever matter when you start copying gigabytes of data, which only a select few applications are ever going to do. But the metrics were good.

Exactly. It may look cool on a graph, but I am sure that in the average use case, there is net drop (or at least no affect) on performance.

halofreak1990 · Post by **halofreak1990** » Wed Jul 06, 2022 3:55 pm

To add to what nullplan said, an SSE memcpy only really comes into its own when dealing with buffer sizes >= 256 bytes. Below that, the set-up cost isn't really worth it. You can find memcpy implementations that will pick either the classic rep movs(bwlq) or SSE depending on such a threshold. The one I use (but can't remember where I found it) does exactly this, using the tried and true string copy instructions for the small path, and also uses rep movsb in the SSE path to align any misaligned memory to the 16-byte boundary preferred by SSE to ensure the fastest copy throughput.

nexos · Post by **nexos** » Wed Jul 06, 2022 4:43 pm

To add to what nullplan said, an SSE memcpy only really comes into its own when dealing with buffer sizes >= 256 bytes.

Ah, good to know an actual threshold

. Does this account for save / restoring XMM / YMM on context switches? If a program calls memcpy on 300 byte buffers a lot, then I think the repeated calls to xsave / xrstor would negate some of this performance benefit.

glibc AFAICT unconditionally uses AVX or SSE, as on GDB I've seen __strlen_avx2 be called with strings that certainly weren't that big. Let me double check in the source.

EDIT: After coming out in shivers from glibc's codebase, it looks like it unconditionally calls said functions.

AndrewAPrice · Post by **AndrewAPrice** » Thu Aug 25, 2022 5:40 pm

Creator of Permebuf here. It's wonderful knowing I could have inspired someone.

OSDev.org

Avoiding redundant copies with direct user IPC

Avoiding redundant copies with direct user IPC

Re: Avoiding redundant copies with direct user IPC

Re: Avoiding redundant copies with direct user IPC

Re: Avoiding redundant copies with direct user IPC

Re: Avoiding redundant copies with direct user IPC

Re: Avoiding redundant copies with direct user IPC

Re: Avoiding redundant copies with direct user IPC

Re: Avoiding redundant copies with direct user IPC

Re: Avoiding redundant copies with direct user IPC

Re: Avoiding redundant copies with direct user IPC

Re: Avoiding redundant copies with direct user IPC