Hi,
jnc100 wrote:
Brendan wrote:
During "sendMessage()" the kernel moves the entire message from the sender's message buffer to the end of the receiver's queue. During "getMessage()" the kernel moves the entire message from the start of the receiver's queue to the receiver's message buffer area.
I think I'm implementing a similar system to you. Out of interest, do you have one message buffer per process, or one per thread? Can processes/threads create buffers on demand, or is there one statically defined per process?
In my last kernel each thread had one static message buffer (created by the kernel when a thread is spawned), however my next kernel will have 2 static message buffers per thread, as I've found it's easier to write software when you can use one message buffer for sending messages while you're processing a message received in the other buffer.
jnc100 wrote:
My thinking is, if there is a single message buffer per thread, or if there is one for the process but it is only accessible by one thread, then read operations can be non-locking. Or do you simply rely on the process to only access the message buffer from one thread, as I believe windows does (?).
I'm a little unusual, in that each thread has its own address space (and "process space" is mapped into all address spaces belonging to that process' threads). Message buffer/s exist at the same linear address in each thread's address space, and can only be accessed by that thread. This means that no read or write operations on a thread's message buffer need locks. Of course each thread's message queue does need locks, as several threads might be trying to add a message to the same queue at the same time (possibly while the thread that owns the queue is trying to remove a message from the queue).
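To illustrate, here's a minimal sketch (not Brendan's actual code - the struct names, spinlock choice, and layout are all my own assumptions) of a per-thread queue whose enqueue/dequeue paths take a lock, while the per-thread message buffer itself needs none:

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct message {
    struct message *next;
    size_t length;              /* bytes of payload (or page entries) */
} message_t;

typedef struct {
    atomic_flag lock;           /* many senders may enqueue at once */
    message_t *head, *tail;
} msg_queue_t;

static void queue_lock(msg_queue_t *q)   { while (atomic_flag_test_and_set(&q->lock)) ; }
static void queue_unlock(msg_queue_t *q) { atomic_flag_clear(&q->lock); }

/* Called from sendMessage(): append to the receiving thread's queue. */
static void queue_append(msg_queue_t *q, message_t *m)
{
    m->next = NULL;
    queue_lock(q);
    if (q->tail) q->tail->next = m; else q->head = m;
    q->tail = m;
    queue_unlock(q);
}

/* Called from getMessage(): pop from the front of the owner's queue. */
static message_t *queue_pop(msg_queue_t *q)
{
    queue_lock(q);
    message_t *m = q->head;
    if (m) {
        q->head = m->next;
        if (!q->head) q->tail = NULL;
    }
    queue_unlock(q);
    return m;
}
```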
A more complete description of my messaging follows...
As far as normal software is concerned, when a message is sent the kernel fills from the end of the message to the end of the buffer with zeros (for security purposes), then moves (not copies) the entire message buffer from the sender's address space to the message queue, then fills the sender's message buffer with zeros. Similarly, when a message is received the kernel overwrites everything in the target message buffer with the new message (including all the zeros between the end of the received message and the end of the message buffer).
This gives a "conceptually clean" model, where the entire message buffer vanishes and re-appears in the receiver's message buffer, except for data past the end of the message which vanishes and doesn't re-appear for security purposes.
This is only how it looks though - the kernel doesn't actually do this.
Instead, (when a message is sent) if the message is "small" (e.g. less than 2048 bytes) the kernel copies the message data to the message queue, then frees all pages in the message buffer. If the message is "medium" (e.g. less than 2 MB) the kernel frees any pages that are above the end of the message, fills from the end of the message to the start of the next page with zeros, then moves page table entries onto the message queue. If the message is "large" (e.g. 2 MB or more) the kernel frees any pages that are above the end of the message, fills from the end of the message to the start of the next page with zeros, then moves page directory entries onto the message queue.
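The three send paths above can be sketched as a simple size classification. The thresholds (2048 bytes and 2 MB) are the examples given in the text; the enum names are my own:

```c
#include <stddef.h>

enum send_path {
    SEND_COPY_DATA,     /* small:  copy the bytes onto the queue  */
    SEND_MOVE_PTES,     /* medium: move page table entries        */
    SEND_MOVE_PDES      /* large:  move page directory entries    */
};

static enum send_path classify_message(size_t bytes)
{
    if (bytes < 2048)
        return SEND_COPY_DATA;
    if (bytes < 2u * 1024 * 1024)
        return SEND_MOVE_PTES;
    return SEND_MOVE_PDES;
}
```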
This means that after a message is sent the message buffer actually contains no pages at all (good for reducing memory usage, considering how often a thread would send a message then block until a new message is received). The kernel uses "allocation on demand" - when a thread tries to access the message buffer (after sending a message) a page full of zeros is allocated and mapped there, so that it looks like the message buffer is full of zeros as far as other software can tell (even though it's actually full of "not-present" pages). It also reduces the size of the message queues in kernel space - for example, a message queue might be 48 bytes long in kernel space, even though there are 4 messages on the queue that are 8 MB each (as each message would be a few page directory entries, not the data itself).
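The rough arithmetic behind the "48 bytes for four 8 MB messages" example: on 32-bit x86 (without PAE) one page directory entry maps 4 MB and is 4 bytes, so an 8 MB message is two PDEs plus a small per-message header. The 4-byte header size here is my assumption, chosen so the numbers come out as in the text:

```c
#include <stddef.h>

#define PDE_SPAN   (4u * 1024 * 1024)   /* address range one PDE maps (4 MB pages) */
#define PDE_SIZE   4u                   /* bytes per page directory entry          */
#define MSG_HEADER 4u                   /* assumed per-message queue header        */

/* Kernel-space bytes a "large" message occupies on the queue. */
static size_t large_msg_queue_bytes(size_t msg_bytes)
{
    size_t pdes = (msg_bytes + PDE_SPAN - 1) / PDE_SPAN;
    return MSG_HEADER + pdes * PDE_SIZE;
}
```

With these assumptions, one 8 MB message costs 12 bytes of kernel space, and four of them cost 48 bytes.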
When a small message is received, the kernel frees all pages in the receiver's message buffer except the first (if present) then copies the message data from the message queue to the message buffer. When a medium message is received the kernel frees all pages in the receiver's message buffer and then replaces the page table entries with page table entries from the message queue. When a large message is received the kernel frees all pages and page tables in the receiver's message buffer and then replaces the page directory entries with page directory entries from the message queue.
To avoid hassles, the kernel's linear memory manager only allows some types of pages to exist in the message buffers. For example, a thread can't memory map a file into its message buffer, or allocate a DMA buffer there, or map a "special mapping" there (video display memory, pages of ROM, etc). If this was possible the kernel's "sendMessage()" code would have to check for strange page types and deal with them so that the receiver doesn't receive strange pages (which would complicate the IPC code and usually be an unnecessary waste of CPU time). This means that the kernel's linear memory manager must know where the thread's message buffer/s are so it can restrict the types of pages that can be mapped in them (which is simple to do when they're at the same linear address in all address spaces).
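A sketch of how the linear memory manager might reject "strange" mappings inside the message buffer region. The base address, buffer size, and page-type names are illustrative assumptions, not the OS's real layout:

```c
#include <stdbool.h>
#include <stdint.h>

#define MSG_BUFFER_BASE 0xC0000000u          /* assumed fixed linear address */
#define MSG_BUFFER_SIZE (4u * 1024 * 1024)   /* assumed buffer size          */

enum page_type {
    PAGE_NORMAL,              /* ordinary RAM-backed page                 */
    PAGE_ALLOC_ON_DEMAND,     /* "not-present, looks like zeros"          */
    PAGE_MEMORY_MAPPED_FILE,  /* rejected inside the message buffer...    */
    PAGE_DMA,                 /* ...as is a DMA buffer...                 */
    PAGE_SPECIAL              /* ...or display memory, ROM, etc.          */
};

static bool mapping_allowed(uint32_t linear, enum page_type type)
{
    bool in_buffer = linear >= MSG_BUFFER_BASE &&
                     linear <  MSG_BUFFER_BASE + MSG_BUFFER_SIZE;
    if (!in_buffer)
        return true;          /* no restriction outside the buffer */
    return type == PAGE_NORMAL || type == PAGE_ALLOC_ON_DEMAND;
}
```

Because the buffer sits at the same linear address in every address space, this check is a single range comparison rather than a per-process lookup.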
Note: it is possible to send and receive "not-present" pages as part of a message (if the message is medium or large). For example, if a page in the message buffer was sent to swap space, then it's possible for that page to still be in swap space after the receiver receives the message. In the same way it's possible to send and receive "not-present, allocate on demand" pages that look like they're full of zeros (but actually don't exist). These "not-present" pages don't really complicate the kernel's IPC code - if the IPC code touches data in a "not-present" page then the page will be mapped in (same as if the sender or receiver tries to touch the data), and the IPC code itself doesn't really need to care.
Also, I guess I should mention that for the last version of my OS messages could be up to 32 MB, and all message buffers took up 32 MB of the address space. For the next version of my OS this will probably be reduced, but I haven't decided what the maximum size of a message will be yet (it'll be somewhere between 4 MB and 16 MB).
Cheers,
Brendan