Exception catching by kernel?

mariuszp · **Joined:** Sat Oct 16, 2010 3:38 pm **Posts:** 587

There are cases in which the kernel must catch unexpected exceptions. For example, when copying data to/from usermode, a thread may choose to munmap() a region that another kernel thread is currently writing to/reading from, which would raise a page fault. It is not possible to directly dispatch a SIGSEGV signal while the thread is in kernel mode, and it is also not possible to simply return to it.

Currently my kernel simply panics if kernel code raises a page fault, but that is not a good solution in a multithreaded environment for the reason I have described.

What are the possible ways to handle this situation? Perhaps setting an exception handler by means of something similar to longjmp()?

Brendan · **Posted:** Mon May 09, 2016 8:36 am

Hi,

mariuszp wrote:

There are cases in which the kernel must catch unexpected exceptions.

The only cases where a kernel "must" catch unexpected exceptions are those where the kernel failed to avoid unexpected exceptions.

mariuszp wrote:

For example, when copying data to/from usermode, a thread may choose to munmap() a region that another kernel thread is currently writing to/reading from, which would raise a page fault. It is not possible to directly dispatch a SIGSEGV signal while the thread is in kernel mode, and it is also not possible to simply return to it.

When 2 or more threads/CPUs try to access the same data at the same time in "conflicting" ways (e.g. at least one writing/modifying); then you have a concurrency problem and need locks.

If you have a problem because one thread/CPU is copying data to/from user-space while another thread/CPU is doing "munmap()"; then you've failed at basic concurrency (e.g. failed to lock the pages).

Cheers,

Brendan

mariuszp · **Joined:** Sat Oct 16, 2010 3:38 pm **Posts:** 587

Is that not what Linux does? The copy_to_user() function just checks the validity of the address and calls memcpy() without any locks.

Brendan · **Posted:** Mon May 09, 2016 9:08 am

Hi,

mariuszp wrote:

Is that not what Linux does? The copy_to_user() function just checks the validity of the address and calls memcpy() without any locks.

I don't know; but I wouldn't be too surprised if Linux is full of stupid security vulnerabilities like that.

For a simple case, imagine a kernel function that copies data from user-space to kernel space; which checks everything (including checking that the data itself is "safe") before copying, then copies the data. Now imagine another thread in the same process (that has good timing) modifies the data in user-space after the kernel has checked it but before the kernel has copied it. Now you've got data in kernel space that the kernel has checked and that the kernel thinks is perfectly "safe"; but is actually malicious. Oops.

Cheers,

Brendan

mariuszp · **Joined:** Sat Oct 16, 2010 3:38 pm **Posts:** 587

That's why I copy into kernel space (stack or heap) before checking the data - then use the kernel copy to check for validity etc., so userspace cannot modify this data on the fly.
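
A minimal sketch of that copy-first pattern, in C. The `io_request` layout, `BUFFER_SIZE`, and the `fetch_from_user()` helper are hypothetical names used only for illustration; `fetch_from_user()` is a plain `memcpy()` stand-in for whatever fault-safe copy routine the kernel ends up with.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical syscall argument block; the layout is invented for illustration. */
struct io_request {
    size_t offset;
    size_t length;
};

#define BUFFER_SIZE 4096u

/* Stand-in for the real user copy; a kernel would use its fault-safe routine. */
static void fetch_from_user(void *kdst, const void *usrc, size_t n)
{
    memcpy(kdst, usrc, n);
}

/* Snapshot first, validate the snapshot second, and never read *ureq again. */
int get_request(struct io_request *kreq, const struct io_request *ureq)
{
    fetch_from_user(kreq, ureq, sizeof *kreq);

    /* All checks read the kernel copy, which user space cannot touch. */
    if (kreq->offset > BUFFER_SIZE || kreq->length > BUFFER_SIZE - kreq->offset)
        return -1;
    return 0;
}
```

After `get_request()` returns, later writes to the user structure cannot affect the values that were validated, because only the kernel-side copy is read from then on.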

What is, then, the safe way to copy data to/from userspace? Do note that I employ page faults for load-on-demand, copy-on-write, and all that magic. My current way of copying to/from userspace is as follows:

1. Call a function which ensures that the appropriate permissions (read/write) are marked for the specific mapping in question, and that the address actually belongs to userspace.
2. Do a memcpy()

The memcpy() may then raise page faults which result in load-on-demand or copy-on-write; and that is done safely and the page fault returns to the faulter.

This, however, would not be possible for a multi-threaded process, because another thread could munmap() or mprotect() the mapping in question while the kernel is still copying.

So I thought to resolve it like this:

1. Check if the address is within the userland range.
2. Set a "fault handler" in my Thread structure by calling some function (let's call it catch() ): the catch() function returns 0 and sets the values of registers to restore in the Thread description, for when a page fault occurs.
3. Perform the memcpy(): if a page fault occurs, the "catch registers" are restored, causing a jump-back to step 2, and catch() returns -1.
4. Call uncatch(), which would report that we no longer want to catch the exceptions.

Is there a problem with this approach? How would I "lock pages" as you suggested? With your example, another thread could still modify data half-way through even if I used some kind of spinlock, because userspace cannot be trusted to respect that spinlock.
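
The four steps above can be sketched in user space with `setjmp()`/`longjmp()` standing in for the saved "catch registers". This is only an illustration under assumptions: `struct thread`, `unresolvable_fault()`, and the `simulated_fault_at` hook are hypothetical, and the fault is injected by hand rather than raised by the MMU. One detail it does show is that catch() has to behave like `setjmp()` (a macro or assembly stub that "returns twice" in the caller's frame); an ordinary C function that saves registers and then returns would leave a dead stack frame to jump back into.

```c
#include <setjmp.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-thread state (the post's "Thread structure"). */
struct thread {
    jmp_buf catch_regs;  /* registers to restore on a fault (step 2) */
    int     catching;    /* nonzero while a handler is armed */
};

static struct thread current_thread;

/* Test hook: byte index at which the simulated mapping "disappears". */
static size_t simulated_fault_at = SIZE_MAX;

/* Stand-in for the page fault handler's "cannot resolve this fault" path. */
static void unresolvable_fault(void)
{
    if (!current_thread.catching)
        abort();                          /* a real kernel would panic here */
    current_thread.catching = 0;          /* implicit uncatch() */
    longjmp(current_thread.catch_regs, 1);
}

int copy_from_user_sketch(char *dst, const char *src, size_t n)
{
    /* Step 2: catch(). setjmp must run in this frame so the frame is still
     * alive when the fault path jumps back into it. */
    if (setjmp(current_thread.catch_regs) != 0)
        return -1;                        /* re-entered via longjmp: fault */
    current_thread.catching = 1;

    /* Step 3: the copy; a byte loop so that a fault can be injected. */
    for (size_t i = 0; i < n; i++) {
        if (i == simulated_fault_at)
            unresolvable_fault();
        dst[i] = src[i];
    }

    current_thread.catching = 0;          /* Step 4: uncatch() */
    return 0;
}
```

On a fault the destination may already hold a partial copy, so callers should treat the buffer as undefined when -1 is returned. Linux handles faults in its user-copy routines in a broadly similar spirit, using exception-table fixups for the copying instructions rather than a saved register set.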

Brendan · **Posted:** Mon May 09, 2016 12:02 pm

Hi,

mariuszp wrote:

Brendan wrote:

For a simple case, imagine a kernel function that copies data from user-space to kernel space; which checks everything (including checking that the data itself is "safe") before copying, then copies the data. Now imagine another thread in the same process (that has good timing) modifies the data in user-space after the kernel has checked it but before the kernel has copied it. Now you've got data in kernel space that the kernel has checked and that the kernel thinks is perfectly "safe"; but is actually malicious. Oops.

That's why I copy into kernel space (stack or heap) before checking the data - then use the kernel copy to check for validity etc., so userspace cannot modify this data on the fly.

What is, then, the safe way to copy data to/from userspace?

Write a list of all the different operations that depend on (e.g.) page table entries. Copying data to/from user-space, loading pages from swap when they're accessed, copying "copy on write" pages when they're written to, establishing an area as copy on write during "fork()", "mmap()", "munmap()", ....

Now tell me how you guarantee that kernel code doing any of these operations (on one CPU) won't interfere with kernel code (on another CPU) doing the same operation or a different operation on the same area at the same time.

Essentially, virtual address space ranges (including their page tables, etc) are a shared resource being (potentially) accessed from multiple CPUs at the same time. Like any shared resource that may be accessed by multiple CPUs at the same time (excluding the "multiple readers, no writers" case) something (e.g. locks) is necessary to ensure correct behaviour.

mariuszp wrote:

So I thought to resolve it like this:

1. Check if the address is within the userland range.
2. Set a "fault handler" in my Thread structure by calling some function (let's call it catch() ): the catch() function returns 0 and sets the values of registers to restore in the Thread description, for when a page fault occurs.
3. Perform the memcpy(): if a page fault occurs, the "catch registers" are restored, causing a jump-back to step 2, and catch() returns -1.
4. Call uncatch(), which would report that we no longer want to catch the exceptions.

Is there a problem with this approach? How would I "lock pages" as you suggested? With your example, another thread could still modify data half-way through even if I used some kind of spinlock, because userspace cannot be trusted to respect that spinlock.

What I'm suggesting is more like:
1. Do basic range checking (is the area entirely in user-space, or is someone trying to trick the kernel into trashing itself?).
2. Call a function that has a "for each page in area" loop that:
   - acquires a lock for the page and makes the page "supervisor only" at the same time (so user-space can't modify the data without getting a page fault that waits until the lock is released). Note: the user/supervisor bit in the page table entry may be the bit you use for the lock itself.
   - ensures that the appropriate permissions (read/write) are marked for the specific page in question
   - checks if the page needs to be fetched from disk or something, and fetches it if necessary (because it's cheaper than paying for the overhead of a page fault and then having to figure out its cause)
   - copies data to/from that page (after ensuring nothing else can access it, and that it's in RAM and no page faults can happen)
   - releases the lock for the page table entry and makes the page "user" again
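
The loop described above can be sketched against a simulated address space. Everything here is an assumption made for illustration: the `PTE_*` bit values, the `pte[]` and `user_mem[]` arrays, and `make_resident()` are hypothetical, and a real kernel would also have to invalidate stale TLB entries on other CPUs after clearing the user bit. The sketch also assumes every page in range normally has the user bit set, since a page that never has it would look permanently locked.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE   4096u
#define NPAGES      4u
#define PTE_PRESENT 0x1u
#define PTE_WRITE   0x2u
#define PTE_USER    0x4u   /* doubles as the lock: clear means "locked" */
#define PTE_DEFAULT (PTE_PRESENT | PTE_WRITE | PTE_USER)

/* Simulated user address space: NPAGES pages and one PTE per page. */
static unsigned char user_mem[NPAGES * PAGE_SIZE];
static _Atomic uint32_t pte[NPAGES] = {
    PTE_DEFAULT, PTE_DEFAULT, PTE_DEFAULT, PTE_DEFAULT,
};

static void lock_page(size_t p)
{
    /* Clear USER atomically; if it was already clear, someone else holds
     * the lock, so spin and retry. */
    while (!(atomic_fetch_and(&pte[p], ~PTE_USER) & PTE_USER))
        ;
}

static void unlock_page(size_t p)
{
    atomic_fetch_or(&pte[p], PTE_USER);   /* page becomes "user" again */
}

/* Stand-in for demand loading or swap-in. */
static void make_resident(size_t p)
{
    atomic_fetch_or(&pte[p], PTE_PRESENT);
}

int copy_to_user_locked(size_t uaddr, const void *src, size_t n)
{
    const unsigned char *s = src;

    /* Step 1: range check, written so that it cannot overflow. */
    if (uaddr > sizeof user_mem || n > sizeof user_mem - uaddr)
        return -1;

    /* Step 2: one iteration per page touched by the copy. */
    while (n > 0) {
        size_t p = uaddr / PAGE_SIZE;
        size_t chunk = PAGE_SIZE - uaddr % PAGE_SIZE;
        if (chunk > n)
            chunk = n;

        lock_page(p);
        if (!(atomic_load(&pte[p]) & PTE_WRITE)) {   /* permission check */
            unlock_page(p);
            return -1;   /* simplification: no copy-on-write handling */
        }
        if (!(atomic_load(&pte[p]) & PTE_PRESENT))
            make_resident(p);                        /* no fault needed */
        memcpy(&user_mem[uaddr], s, chunk);
        unlock_page(p);

        uaddr += chunk;
        s += chunk;
        n -= chunk;
    }
    return 0;
}
```

Because each page is locked only while its own chunk is copied, two copies into different pages never contend with each other.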

Cheers,

Brendan

mariuszp · **Joined:** Sat Oct 16, 2010 3:38 pm **Posts:** 587

As you've asked, here's a list of all page operations: load-on-demand, copy-on-write (both happening during page faults), mapping and unmapping.

Those are all protected by a spinlock in the ProcMem object (which describes the address space of a process, by maintaining a list of mappings, allocated frames, and of course the page tables understood by the CPU). This ensures that multiple such operations don't happen at the same time.

And as for access to the memory by another thread:

For load-on-demand, the lock is acquired, and the page entry is marked as present only once the data is loaded to it. It then releases the spinlock and returns for a re-try; if another thread tries reading/writing that area before the page is marked present, that causes a page fault which then waits for the spinlock, sees the area has now been mapped by another thread, and so simply releases the lock and returns for a retry.

For copy-on-write it is similar, but the procedure goes: map, then mark writeable.
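
A rough sketch of the load-on-demand path just described, with the re-check after taking the lock. The cut-down `struct proc_mem`, the `present[]` array, and the `loads` counter are hypothetical simplifications; the real ProcMem also tracks mappings and frames.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8u

/* Hypothetical cut-down ProcMem: one spinlock for the whole address space. */
struct proc_mem {
    atomic_flag lock;
    bool        present[NPAGES];
    unsigned    loads;          /* how many times a page was actually loaded */
};

static struct proc_mem pm = { ATOMIC_FLAG_INIT, { false }, 0 };

static void pm_lock(struct proc_mem *m)
{
    while (atomic_flag_test_and_set_explicit(&m->lock, memory_order_acquire))
        ;  /* spin */
}

static void pm_unlock(struct proc_mem *m)
{
    atomic_flag_clear_explicit(&m->lock, memory_order_release);
}

enum fault_result { FAULT_LOADED, FAULT_ALREADY_RESOLVED, FAULT_BAD_ADDRESS };

enum fault_result handle_demand_fault(struct proc_mem *m, size_t page)
{
    enum fault_result r;

    if (page >= NPAGES)
        return FAULT_BAD_ADDRESS;

    pm_lock(m);
    if (m->present[page]) {
        /* Another thread resolved the fault while we waited for the lock. */
        r = FAULT_ALREADY_RESOLVED;
    } else {
        /* ... read the page contents from the backing store here ... */
        m->loads++;
        m->present[page] = true;   /* mark present only once data is loaded */
        r = FAULT_LOADED;
    }
    pm_unlock(m);
    return r;   /* in both non-error cases the faulting access is retried */
}
```

Two threads faulting on the same page therefore cause a single load: the second one finds the page present and simply retries.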

I guess the way you explained could work, I'd just lock the ProcMem object, force all necessary pages into memory (in the same way the fault handler would), and hence access them safely, then release the lock.

I don't think I would need to lock each individual page using the supervisor bit, since as I've said, I always copy the data to kernel space before processing it in any way, and likewise after it's moved to user space I leave it alone.

Brendan · **Posted:** Tue May 10, 2016 3:21 am

Hi,

That sounds like it should work correctly, but if a process has (e.g.) 10 threads (running on 10 CPUs) that all do something to 10 completely different areas then they all fight for the same ProcMem lock. Maybe the first CPU needs to fetch a page from swap space and the other 9 CPUs spin for a relatively long time (several milliseconds) before the first CPU can release the lock, maybe the second CPU accessed a copy on write area but you've got no free memory left to copy the page and need to send page/s to swap space so the remaining 8 CPUs spin for a relatively long time before the second CPU can release the lock, maybe by that time the first CPU has touched something else and starts spinning on the ProcMem lock too. Maybe 4 of those threads/CPUs are trying to free memory but they can't get the lock because other CPUs are desperately trying to send data to swap space because there's no free memory.

Also note that there's a "fairness" issue - e.g. 2 CPUs might be frequently lucky and always get the lock, while other CPUs are frequently unlucky and can't get the lock for an extremely long time.
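
One common way to get first-come, first-served ordering while still spinning is a ticket lock. The sketch below uses C11 atomics; `struct ticket_lock` and its function names are illustrative, and the lock must start zero-initialised. It only addresses fairness, not the long spin times described above.

```c
#include <stdatomic.h>

struct ticket_lock {
    atomic_uint next_ticket;  /* ticket handed to the next arrival */
    atomic_uint now_serving;  /* ticket currently allowed to hold the lock */
};

void ticket_lock_acquire(struct ticket_lock *l)
{
    /* Take a number; arrivals are served strictly in ticket order. */
    unsigned my = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                            memory_order_relaxed);
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != my)
        ;  /* spin until our number comes up */
}

void ticket_lock_release(struct ticket_lock *l)
{
    /* Pass the lock to the holder of the next ticket. */
    atomic_fetch_add_explicit(&l->now_serving, 1, memory_order_release);
}
```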

Technically, you shouldn't really use a spinlock for this; it should be more like a mutex with a FIFO queue of "waiters": if the lock is already held, tell the scheduler this thread is blocked waiting for the lock (so this thread gets no CPU time), and ensure that whoever releases the lock tells the scheduler to unblock the next waiting thread in the queue. Technically, my description should've used something more like a mutex too, but with one lock per page the chance of 2 or more threads contending for the same lock is much lower.
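
A minimal sketch of such a mutex, modelling only the queueing logic. The names are hypothetical, thread IDs are plain integers, and `sched_block()`/`sched_unblock()` are empty stand-ins for real scheduler calls; a real kernel would also protect the structure itself with a short spinlock.

```c
#include <stddef.h>

#define MAX_WAITERS 16
#define NO_THREAD   (-1)

/* Hypothetical blocking mutex with a FIFO wait queue. */
struct fifo_mutex {
    int    owner;                  /* NO_THREAD when free */
    int    waiters[MAX_WAITERS];   /* ring buffer of blocked thread IDs */
    size_t head, count;
};

/* Stand-ins for the scheduler; last_unblocked exists only for inspection. */
static int last_unblocked = NO_THREAD;
static void sched_block(int tid)   { (void)tid; }
static void sched_unblock(int tid) { last_unblocked = tid; }

void fifo_mutex_init(struct fifo_mutex *m)
{
    m->owner = NO_THREAD;
    m->head = 0;
    m->count = 0;
}

/* Returns 1 if acquired at once, 0 if queued and blocked, -1 if queue full. */
int fifo_mutex_lock(struct fifo_mutex *m, int tid)
{
    if (m->owner == NO_THREAD) {
        m->owner = tid;
        return 1;
    }
    if (m->count == MAX_WAITERS)
        return -1;
    m->waiters[(m->head + m->count) % MAX_WAITERS] = tid;
    m->count++;
    sched_block(tid);   /* the thread gets no CPU time until it is woken */
    return 0;
}

/* Hands ownership directly to the oldest waiter, if there is one. */
void fifo_mutex_unlock(struct fifo_mutex *m)
{
    if (m->count == 0) {
        m->owner = NO_THREAD;
        return;
    }
    m->owner = m->waiters[m->head];
    m->head = (m->head + 1) % MAX_WAITERS;
    m->count--;
    sched_unblock(m->owner);
}
```

Handing the lock straight to the head of the queue is what gives the FIFO guarantee: a newly arriving thread can never overtake one that is already waiting.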

Cheers,

Brendan

~ · **Joined:** Tue Mar 06, 2007 11:17 am **Posts:** 1225

Brendan wrote:

Hi,

mariuszp wrote:

Is that not what Linux does? The copy_to_user() function just checks the validity of the address and calls memcpy() without any locks.

I don't know; but wouldn't be too surprised if Linux is full of stupid security vulnerabilities like that.

For a simple case, imagine a kernel function that copies data from user-space to kernel space; which checks everything (including checking that the data itself is "safe") before copying, then copies the data. Now imagine another thread in the same process (that has good timing) modifies the data in user-space after the kernel has checked it but before the kernel has copied it. Now you've got data in kernel space that the kernel has checked and that the kernel thinks is perfectly "safe"; but is actually malicious. Oops.

Cheers,

Brendan

We could check the data several times.

We could use a space-efficient test with 2 SHA2-512 slots (one for the original user data and the other for the temporary kernel copy buffer).

Then check if at the end they still have the same hash (which means no changes).

Once we actually copy the data to the temporary kernel data space, we can have the kernel check whether it's safe, and changes to the original user data no longer matter at this point (it's now the responsibility of the user to ensure that the program is consistent).

If it's safe, we process the user data in kernel or driver space (for example registered window classes in WinAPI).

So in short:
- User tells the kernel to "register" a data structure/function pointers.
- The kernel makes a SHA2-512 to watch for changes.
- The kernel copies the structure to a safe buffer (given it already knows its size and that it's sane).
- The kernel makes a second SHA2-512 hash to check that it's still identical to the original data.
- The kernel checks that the parameters in the data structure are perfectly safe.
- The kernel validates, allocates/reserves resources and IDs for the user data.
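
The steps above can be sketched as follows. To keep it short, FNV-1a is used as a plain stand-in for SHA2-512, and `struct wnd_class`, `MAX_EXTRA_BYTES`, and the `racing_writer` hook (which simulates another thread writing between the first hash and the copy) are hypothetical. Note that the hash comparison only reports that the source changed during the copy; the safety check itself still runs on the kernel copy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical structure that a process asks the kernel to register. */
struct wnd_class {
    uint32_t style;
    uint32_t extra_bytes;
};

#define MAX_EXTRA_BYTES 40u   /* illustrative limit */

/* FNV-1a, used here only as a short stand-in for SHA2-512. */
static uint64_t fnv1a(const void *data, size_t n)
{
    const unsigned char *p = data;
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Test hook standing in for another thread writing between the steps. */
static void (*racing_writer)(struct wnd_class *user);

/* Example writer that corrupts the size field. */
void example_racing_writer(struct wnd_class *user)
{
    user->extra_bytes = 1000;
}

/* Returns 0 on success, -1 if validation fails, -2 if data changed mid-copy. */
int register_class(struct wnd_class *user, struct wnd_class *kernel_copy)
{
    uint64_t before = fnv1a(user, sizeof *user);            /* first hash */

    if (racing_writer)
        racing_writer(user);

    memcpy(kernel_copy, user, sizeof *user);                /* safe buffer */

    if (fnv1a(kernel_copy, sizeof *kernel_copy) != before)  /* second hash */
        return -2;

    if (kernel_copy->extra_bytes > MAX_EXTRA_BYTES)         /* check the copy */
        return -1;
    return 0;   /* resources and IDs would be allocated from here on */
}
```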

It raises the question of whether the Windows kernel detects changes to something like a window class after it has been registered. It would be necessary to test obvious things, like changing the window class after it's registered, to see whether the kernel uses only its own copy of the fields or still checks the validity of the structure on the user side. Surely there should be 2 copies (user and kernel) for such a vital function that touches fundamental kernel functions.

As far as I remember, in WinAPI, the window class is registered and an ATOM handle is returned, so the caller could probably free the window class structure after registering it and keep only the ATOM (and, implicitly, the kernel-side window class it refers to).
