Multithreading and memory mapping/unmapping

rod · **Joined:** Mon Feb 10, 2014 7:42 am **Posts:** 21

Hello,
I was wondering for some time... how can the following problem be solved:
When we have processes with multiple threads, one thread might alter the memory mapping (sbrk, etc.) mapping or unmapping some pages, and the kernel, and other threads concurrently running (using SMP) would need to know about the change. The change could originate from userspace and affect other userspace threads, and kernel threads, or originate from the kernel and affect both spaces likewise.
E.g.: one thread is running the write() system call and the kernel has already validated the memory range, and starts to read data from userspace. Then another thread of the same process calls the sbrk() or similar system call and unmaps some pages that happen to be some of the ones holding the data of write(). Then the kernel, which was still copying the data, might get an exception or might be reading from physical memory that is already mapped to other processes, etc.
I think there can be variations about who initiates the change and whom it affects like: kernel-kernel, kernel-userspace, userspace-kernel, and userspace-userspace. Also it might be different when mapping and when unmapping.
How can this be solved?
Because I think that by writing directly to the page tables (especially when unmapping), other threads might not know about the change and might be using the old mapping.

nakst · **Joined:** Sun Jan 17, 2016 7:57 am **Posts:** 51

rod wrote:

When we have processes with multiple threads, one thread might alter the memory mapping (sbrk, etc.) mapping or unmapping some pages, and the kernel, and other threads concurrently running (using SMP) would need to know about the change. The change could originate from userspace and affect other userspace threads, and kernel threads, or originate from the kernel and affect both spaces likewise.

It sounds like you're implying that userspace threads have access to their page tables, but generally they shouldn't for security reasons. For example, a malicious process could map each physical page on the system into its address space in turn and search for sensitive data.

rod wrote:

E.g.: one thread is running the write() system call and the kernel has already validated the memory range, and starts to read data from userspace. Then another thread of the same process calls the sbrk() or similar system call and unmaps some pages that happen to be some of the ones holding the data of write().

Pages aren't just unmapped when they're freed with something like VirtualFree/sbrk. The memory manager will be constantly unmapping pages that haven't been accessed recently to reduce the size of the system's working set - how much RAM is in use. Once they're unmapped the memory manager can move them into the swap file/partition, and then the physical pages can be zeroed for reuse. When the process, or the kernel, tries to access the pages again a page fault will be generated and the data can be read back into RAM, and the pages will be mapped.

rod wrote:

Then the kernel, that was still copying the data, might get an exception or might be reading from physical memory that is already mapped to other processes, etc.

The kernel won't end up reading from physical memory mapped to another process, since it'll be reading from the calling process's address space. And page faults should be expected - mapping is not the same as allocating. As noted above, as the kernel accesses the buffer, memory might be read from secondary storage, demand zero pages may be mapped, etc.

In terms of allocating and freeing memory, there are only a few things that can happen:

1) The original memory region is still allocated - the intended buffer is read/modified by the kernel.
2) The region is freed and nothing has been allocated in its place - the page faults cannot be resolved.
3) The original region is freed and a new region overlaps it - the wrong buffer is read/modified by the kernel.

The only problematic case is the second, and there are 2 solutions I can see.

1) Prevent any memory regions in use by system calls from being freed.
2) When the kernel has an unresolvable page fault in a user's address space, it can longjmp out of the system call.

The first solution is likely a lot easier to implement, and it's what I do in my kernel. It also prevents the third case from above happening. Every time a memory region is referenced by a system call, a counter is incremented, and decremented when the call ends. If a userspace process attempts to free a region with a nonzero counter, it crashes (why would you want to free memory you've passed to a system call, anyway?). It also helps with asynchronous file IO, where you increment the counter when the request is placed, and decrement the counter once the request is finished and everything's copied into the user's address space.
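A minimal sketch of the counter idea described above, not the actual code from any kernel. The `struct region` layout and the function names are made up for illustration, and a real SMP kernel would need to protect the counter with a lock or make it atomic:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical region descriptor; the field names are illustrative only. */
struct region {
    uintptr_t base;
    size_t    length;
    unsigned  pin_count;  /* system calls (or async requests) using the region */
    bool      allocated;
};

/* Called when a system call starts using a buffer inside the region. */
void region_pin(struct region *r)
{
    assert(r->allocated);
    r->pin_count++;
}

/* Called when the system call or asynchronous request has finished. */
void region_unpin(struct region *r)
{
    assert(r->pin_count > 0);
    r->pin_count--;
}

/*
 * Returns true if the region was freed. A false return means the region is
 * still referenced by a system call; the caller decides what to do then
 * (the approach described above crashes the offending process).
 */
bool region_free(struct region *r)
{
    if (!r->allocated || r->pin_count != 0)
        return false;
    r->allocated = false;
    return true;
}
```

With this shape, a write() handler would call `region_pin()` after validating the buffer and `region_unpin()` on every exit path, so a concurrent free attempt fails instead of pulling the pages out from under the copy.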

rod wrote:

I think there can be variations about who initiates the change and whom it affects like: kernel-kernel, kernel-userspace, userspace-kernel, and userspace-userspace.

Userspace processes shouldn't be allowed to read from or write to anything in the kernel's address space, again for security reasons (except for a few special cases, maybe such as getting the time).

rod wrote:

Because I think that by writing directly to the page tables (especially when unmapping), other threads might not know about the change and might be using the old mapping.

This is a problem, but not for the reason you're thinking of. When you modify page tables you need to invalidate the TLB entry for the mapping on each processor. This is called a TLB shootdown. You should send an IPI to all processors that might have a TLB entry for the modified page mapping (be careful if you're using PCID) and get them to INVLPG on the changed virtual address. You should try to avoid sending unnecessary IPIs, such as when a page transitions from invalid to valid, since you can just invalidate the page in the page fault handler and return (some processors automatically invalidate the TLB entry for an invalid page and you'll never even get the page fault).

rod · **Joined:** Mon Feb 10, 2014 7:42 am **Posts:** 21

nakst wrote:

rod wrote:

I think there can be variations about who initiates the change and whom it affects like: kernel-kernel, kernel-userspace, userspace-kernel, and userspace-userspace.

Userspace processes shouldn't be allowed to read from or write to anything in the kernel's address space, again for security reasons (except for a few special cases, maybe such as getting the time).

I know. I was meaning that the address space change could be requested by userspace through a system call, or could be an internal operation of the kernel.

nakst wrote:

rod wrote:

Because I think that by writing directly to the page tables (especially when unmapping), other threads might not know about the change and might be using the old mapping.

This is a problem, but not for the reason you're thinking of. When you modify page tables you need to invalidate the TLB entry for the mapping on each processor. This is called a TLB shootdown. You should send an IPI to all processors that might have a TLB entry for the modified page mapping (be careful if you're using PCID) and get them to INVLPG on the changed virtual address. You should try to avoid sending unnecessary IPIs, such as when a page transitions from invalid to valid, since you can just invalidate the page in the page fault handler and return (some processors automatically invalidate the TLB entry for an invalid page and you'll never even get the page fault).

I see. So when the TLB shootdown happens, the IPIs should only be sent to the processors that have the changed address mapped? Then I think there could be some races, such as: I check and find 2 processors using that address space, but before I send the IPIs, some other processor switches to that address space and never gets notified...

On another line of thought, these things make me think about my kernel (x86_64, SMP), which currently has interrupts always disabled in kernel mode (interrupt gates instead of trap gates). I wonder whether it is impossible to have a complex kernel with interrupts always disabled in kernel mode and 'advanced' features like TLB shootdown, etc., or whether there could be some other solution to these problems that does not involve enabling interrupts. Is it uncommon for a kernel to have interrupts always disabled in kernel mode?

If I decide to start enabling interrupts then I should think about interrupt nesting, and where interrupts would have to be disabled temporarily...

rod · **Joined:** Mon Feb 10, 2014 7:42 am **Posts:** 21

rod wrote:

On another line of thought, these things make me think about my kernel (x86_64, SMP), which currently has interrupts always disabled in kernel mode (interrupt gates instead of trap gates). I wonder whether it is impossible to have a complex kernel with interrupts always disabled in kernel mode and 'advanced' features like TLB shootdown, etc., or whether there could be some other solution to these problems that does not involve enabling interrupts. Is it uncommon for a kernel to have interrupts always disabled in kernel mode?

If I decide to start enabling interrupts then I should think about interrupt nesting, and where interrupts would have to be disabled temporarily...

Sorry, this might be offtopic. I will post it in another thread.

linguofreak · **Joined:** Wed Mar 09, 2011 3:55 am **Posts:** 509

nakst wrote:

rod wrote:

When we have processes with multiple threads, one thread might alter the memory mapping (sbrk, etc.) mapping or unmapping some pages, and the kernel, and other threads concurrently running (using SMP) would need to know about the change. The change could originate from userspace and affect other userspace threads, and kernel threads, or originate from the kernel and affect both spaces likewise.

It sounds like you're implying that userspace threads have access to their page tables, but generally they shouldn't for security reasons. For example, a malicious process could map each physical page on the system into its address space in turn and search for sensitive data.

Userspace threads generally have the ability to ask for memory to be mapped or unmapped at specific places in their address spaces through API calls like mmap/munmap, but no control over what physical memory gets mapped there. The last sentence of his post makes it a bit uncertain if the OP is aware of this, but his questions are valid without userspace threads having access to their own page tables.

Quote:

rod wrote:

E.g.: one thread is running the write() system call and the kernel has already validated the memory range, and starts to read data from userspace. Then another thread of the same process calls the sbrk() or similar system call and unmaps some pages that happen to be some of the ones holding the data of write().

Pages aren't just unmapped when they're freed with something like VirtualFree/sbrk. The memory manager will be constantly unmapping pages that haven't been accessed recently to reduce the size of the system's working set - how much RAM is in use. Once they're unmapped the memory manager can move them into the swap file/partition, and then the physical pages can be zeroed for reuse. When the process, or the kernel, tries to access the pages again a page fault will be generated and the data can be read back into RAM, and the pages will be mapped.

For the purposes of the OP's questions, swapped out means mapped, as the pages in question are still part of the process's address space and will be pulled back into RAM if accessed.

linguofreak · **Joined:** Wed Mar 09, 2011 3:55 am **Posts:** 509

rod wrote:

Hello,
I was wondering for some time... how can the following problem be solved:
When we have processes with multiple threads, one thread might alter the memory mapping (sbrk, etc.) mapping or unmapping some pages, and the kernel, and other threads concurrently running (using SMP) would need to know about the change. The change could originate from userspace and affect other userspace threads, and kernel threads, or originate from the kernel and affect both spaces likewise.

"Userspace thread" and "kernel thread" can have a few different meanings, which you need to keep distinct to reason about this clearly:

1) "Userspace thread" can mean a thread scheduled and accounted for by a userspace runtime library, while "kernel thread" can mean a thread scheduled and accounted for by the kernel. This is the definition generally used in the literature. Note that multiple userspace threads can be implemented on top of a single kernel thread (which is the way that multithreading can be supported on kernels that allow only one kernel thread per process), but if there is only one kernel thread per process, only one system call can be in flight at a time even if there are multiple userspace threads, so you wouldn't be able to unmap memory that was being used for a blocking system call (as unmapping memory would require a system call, requiring the present system call to be finished first), though non-blocking calls might still give you trouble.

2) "Userspace thread" can mean a thread scheduled and accounted for by the kernel that is currently executing code in userspace, and "kernel thread" can mean the same thread when processing a system call and executing code in kernelspace. This is the definition you seem to be using from the wording of your post.

3) "Userspace thread" can mean a thread scheduled and accounted for by the kernel that is used by a specific process and may at any time be executing code in userspace or kernelspace, and "kernel thread" can mean a thread scheduled and accounted for by the kernel that is used by the kernel for its own background work. This is another definition that the wording of your post suggests you may be using.

Quote:

E.g.: one thread is running the write() system call and the kernel has already validated the memory range, and starts to read data from userspace. Then another thread of the same process calls the sbrk() or similar system call and unmaps some pages that happen to be some of the ones holding the data of write(). Then the kernel, which was still copying the data, might get an exception or might be reading from physical memory that is already mapped to other processes, etc.

The kernel already needs to do something if a thread hands it a system call acting on an unmapped address. If it gets a page fault because a page was unmapped while the kernel was doing something with it, it should use the same mechanism to deal with it. For example, it could send a signal (e.g., SIGSEGV) to the offending process, and terminate the process if the signal is not handled.
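One way to picture a shared recovery path is a toy model in ordinary C, sketched here under heavy assumptions: the address space is a small array with a per-page present flag, the "page fault" is a plain function call, and the recovery uses setjmp/longjmp as suggested earlier in the thread. All names (`toy_aspace`, `toy_sys_write`, and so on) are hypothetical; a real kernel would hook this into its page fault handler and would keep the recovery point per thread rather than in one global:

```c
#include <assert.h>
#include <errno.h>
#include <setjmp.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096u
#define NPAGES    4u

/* Toy stand-in for a user address space: backing bytes plus a present bit. */
struct toy_aspace {
    unsigned char mem[NPAGES * PAGE_SIZE];
    bool          present[NPAGES];
};

/* Recovery point armed by the system call before it touches user memory. */
static jmp_buf fault_recovery;

/* Stand-in for the "this fault cannot be resolved" path of a fault handler. */
static void unresolvable_fault(void)
{
    longjmp(fault_recovery, 1);
}

/* Copy page by page, "faulting" if a page is missing (e.g. unmapped meanwhile). */
static void copy_from_user(void *dst, const struct toy_aspace *as,
                           size_t uaddr, size_t len)
{
    unsigned char *out = dst;

    while (len > 0) {
        size_t page  = uaddr / PAGE_SIZE;
        size_t chunk = PAGE_SIZE - uaddr % PAGE_SIZE;

        if (chunk > len)
            chunk = len;
        if (page >= NPAGES || !as->present[page])
            unresolvable_fault();          /* does not return */
        memcpy(out, &as->mem[uaddr], chunk);
        out   += chunk;
        uaddr += chunk;
        len   -= chunk;
    }
}

/* Toy write()-like call: returns the byte count, or -EFAULT after a fault. */
long toy_sys_write(const struct toy_aspace *as, size_t uaddr, size_t len,
                   void *kbuf)
{
    if (setjmp(fault_recovery) != 0)
        return -EFAULT;                    /* reached via longjmp */
    copy_from_user(kbuf, as, uaddr, len);
    return (long)len;
}
```

Whether the failure is then reported as an error code, as in this sketch, or as a signal such as SIGSEGV is a policy choice; the point is only that a fault taken in the middle of the copy and a bad pointer passed up front end up in the same place.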

Quote:

I think there can be variations about who initiates the change and whom it affects like: kernel-kernel, kernel-userspace, userspace-kernel, and userspace-userspace. Also it might be different when mapping and when unmapping.
How can this be solved?
Because I think that by writing directly to the page tables (especially when unmapping), other threads might not know about the change and might be using the old mapping.

As already discussed, if other processors are currently running threads from the same process, an IPI needs to be sent to let those processors know that they need to invalidate the mapping. And, of course, a physical page that has been unmapped should not be returned to the free page pool until it has been confirmed that all processors have invalidated the mapping, all DMA transfers involving the page have been completed, etc.

rod · **Joined:** Mon Feb 10, 2014 7:42 am **Posts:** 21

linguofreak wrote:

"Userspace thread" and "kernel thread" can have a few different meanings, which you need to keep distinct to reason about this clearly:

1) "Userspace thread" can mean a thread scheduled and accounted for by a userspace runtime library, while "kernel thread" can mean a thread scheduled and accounted for by the kernel. This is the definition generally used in the literature. Note that multiple userspace threads can be implemented on top of a single kernel thread (which is the way that multithreading can be supported on kernels that allow only one kernel thread per process), but if there is only one kernel thread per process, only one system call can be in flight at a time even if there are multiple userspace threads, so you wouldn't be able to unmap memory that was being used for a blocking system call (as unmapping memory would require a system call, requiring the present system call to be finished first), though non-blocking calls might still give you trouble.

2) "Userspace thread" can mean a thread scheduled and accounted for by the kernel that is currently executing code in userspace, and "kernel thread" can mean the same thread when processing a system call and executing code in kernelspace. This is the definition you seem to be using from the wording of your post.

3) "Userspace thread" can mean a thread scheduled and accounted for by the kernel that is used by a specific process and may at any time be executing code in userspace or kernelspace, and "kernel thread" can mean a thread scheduled and accounted for by the kernel that is used by the kernel for its own background work. This is another definition that the wording of your post suggests you may be using.

Thanks for the clarification. I am not sure which wording applies in my case. My kernel currently has one userspace stack per thread, and one kernel stack per CPU. So several threads of the same process might be in a system call simultaneously. So far I have not needed a kernel stack per thread, but if I hit any problem with the current design, I might consider it.

linguofreak wrote:

Quote:

E.g.: one thread is running the write() system call and the kernel has already validated the memory range, and starts to read data from userspace. Then another thread of the same process calls the sbrk() or similar system call and unmaps some pages that happen to be some of the ones holding the data of write(). Then the kernel, which was still copying the data, might get an exception or might be reading from physical memory that is already mapped to other processes, etc.

The kernel already needs to do something if a thread hands it a system call acting on an unmapped address. If it gets a page fault because a page was unmapped while the kernel was doing something with it, it should use the same mechanism to deal with it. For example, it could send a signal (e.g., SIGSEGV) to the offending process, and terminate the process if the signal is not handled.

Quote:

I think there can be variations about who initiates the change and whom it affects like: kernel-kernel, kernel-userspace, userspace-kernel, and userspace-userspace. Also it might be different when mapping and when unmapping.
How can this be solved?
Because I think that by writing directly to the page tables (especially when unmapping), other threads might not know about the change and might be using the old mapping.

As already discussed, if other processors are currently running threads from the same process, an IPI needs to be sent to let those processors know that they need to invalidate the mapping. And, of course, a physical page that has been unmapped should not be returned to the free page pool until it has been confirmed that all processors have invalidated the mapping, all DMA transfers involving the page have been completed, etc.

So, the IPI is a part of the equation, a thing to do once the kernel is developed enough.

The one remaining issue I see is the userspace address checking in the kernel: how can it be done? I see two options:
1) the kernel only checks that the pointer is in userspace (>=0 and <0x0000800000000000 in x86_64), then tries to read/write the range, and let the processor issue exceptions if the range is not mapped or if it has wrong permissions. Then send SIGSEGV or similar to the process, if needed.
2) check exhaustively that the provided range is valid (explicitly reading or "parsing" the page tables), so no exception can happen.

The problems I see are, respectively:
1) the kernel might not receive an exception on the same exact conditions that userspace would. I mean, for the kernel it might be valid to write to a page that is marked read-only for the userspace. I recall that there is a flag to set that would solve this, but I am not sure if it is portable.
2) the memory might be unmapped between the validation and the actual access, so either we fall back to option 1), or we provide some locking mechanism so that the mapped memory status will not be modified while reading/writing the validated range.

I started with option 2), but now I think option 1) might be better, as it does not require such heavy locking.

Octocontrabass · **Joined:** Mon Mar 25, 2013 7:01 pm **Posts:** 5137

rod wrote:

1) the kernel might not receive an exception on the same exact conditions that userspace would. I mean, for the kernel it might be valid to write to a page that is marked read-only for the userspace. I recall that there is a flag to set that would solve this, but I am not sure if it is portable.

That flag (CR0.WP on x86) was introduced with the 486, so it's a standard part of the x64 architecture. Do you have plans to port your OS to any other architectures?

rod wrote:

2) the memory might be unmapped between the validation and the actual access, so either we fall back to option 1), or we provide some locking mechanism so that the mapped memory status will not be modified while reading/writing the validated range.

A locking mechanism would work, but it adds an expensive synchronization to every system call. You already need IPIs to keep the TLB fresh across CPUs running in the same address space, so it's much cheaper to use method 1. The IPIs will ensure you receive a fault when the access is invalid.

rod · **Joined:** Mon Feb 10, 2014 7:42 am **Posts:** 21

Octocontrabass wrote:

rod wrote:

1) the kernel might not receive an exception on the same exact conditions that userspace would. I mean, for the kernel it might be valid to write to a page that is marked read-only for the userspace. I recall that there is a flag to set that would solve this, but I am not sure if it is portable.

It was introduced with the 486, so it's a standard part of the x64 architecture. Do you have plans to port your OS to any other architectures?

I try to keep it portable, and ideally I would like to port it to some other 64 bit architectures like arm64 and riscv64.
It would be good if these architectures supported it too.

nullplan · **Joined:** Wed Aug 30, 2017 8:24 am **Posts:** 1604

All architectures I know allow you to set page protection such that the kernel gets an exception even if it was the kernel accessing the page.

What are we even talking about at this point? Page not present is not something you can alleviate with elevated privileges. Let's go back to the original question: When you are increasing access privileges (mapping an unmapped page, turning a read-only page into a read-write one), you only update your OS structures to reflect that. If another core gets a page fault, the page fault handler will see that the access should have been allowed, and can update the CPU-bound page tables accordingly.
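The lazy path for raising privileges can be sketched roughly as follows. This is a toy model with hypothetical names: `os_perms` stands for the OS bookkeeping and `hw_perms` for what the page table entry seen by the CPU currently grants; a real handler would also invalidate the local TLB entry before retrying the access:

```c
#include <assert.h>
#include <stdbool.h>

#define PERM_R 1u
#define PERM_W 2u
#define PERM_X 4u

/* Toy model of one virtual page; both fields are illustrative. */
struct toy_vpage {
    unsigned os_perms;  /* what the OS structures allow (authoritative) */
    unsigned hw_perms;  /* what the page table entry currently grants */
};

/* Raising privileges: only the bookkeeping changes, no IPI is sent. */
void toy_grant(struct toy_vpage *p, unsigned perms)
{
    assert((perms & ~(PERM_R | PERM_W | PERM_X)) == 0u);
    p->os_perms |= perms;
}

/*
 * Page fault handler for an access needing `wanted` permissions.
 * Returns true if the access should have been allowed, in which case the
 * page table entry is brought up to date and the access can be retried.
 * Returns false for a genuine violation.
 */
bool toy_handle_fault(struct toy_vpage *p, unsigned wanted)
{
    if ((p->os_perms & wanted) != wanted)
        return false;
    p->hw_perms = p->os_perms;  /* real code: update the PTE, then invlpg */
    return true;
}
```

The cost of this scheme is an occasional extra fault on a core that still has the old, more restrictive view, in exchange for not interrupting every other core on each mmap or mprotect that only adds access.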

If you are changing a mapping to lower access (e.g. revoking execution rights), or rebinding a virtual page to another physical one, then you will have to perform a TLB shootdown. How to do that depends on your design. On my OS I have a global variable holding a VMM ID (which is pretty much the same as a PID) and a virtual address, and a counter. The initiating CPU counts how many other CPUs are currently executing anything, or are sleeping but have registered themselves to be running in the same VMM, and those get an IPI. The IPI handler confirms that the current VMM is the same as the requested one, and drops the translation from the page table, and increases the counter. The initiating CPU waits for the counter to become large enough. At the moment, I have a panic in place in case of timeout, and it has never been triggered.
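A rough single-threaded model of a counter-based handshake like the one described above, with IPIs replaced by direct function calls so it can run as a normal program. Everything here is an assumption for illustration: the names, the one-entry "TLB" per CPU, and in particular the choice to acknowledge even when the CPU has since switched to another address space, so that the initiator is not left waiting for it:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NCPUS 4

/* Toy CPU state: which address space is loaded, and a one-entry "TLB". */
struct toy_cpu {
    int  current_vmm;
    bool tlb_has_entry;
};

/* The single global request, as in the design described above. */
struct shootdown_req {
    int        vmm_id;
    uintptr_t  vaddr;
    atomic_int acks;
};

static struct toy_cpu       cpus[NCPUS];
static struct shootdown_req req;

/* Runs on a target CPU when the (simulated) IPI arrives. */
void shootdown_ipi_handler(int cpu)
{
    if (cpus[cpu].current_vmm == req.vmm_id)
        cpus[cpu].tlb_has_entry = false;  /* real code: invlpg on req.vaddr */
    atomic_fetch_add(&req.acks, 1);       /* assumption: always acknowledge */
}

/*
 * Initiator side: publish the request, invalidate locally, and count the
 * CPUs that need an IPI. Returns how many acknowledgements to wait for.
 */
int shootdown_begin(int self, int vmm_id, uintptr_t vaddr)
{
    int targets = 0;

    req.vmm_id = vmm_id;
    req.vaddr  = vaddr;
    atomic_store(&req.acks, 0);
    cpus[self].tlb_has_entry = false;

    for (int c = 0; c < NCPUS; c++) {
        if (c != self && cpus[c].current_vmm == vmm_id)
            targets++;                    /* real code: send the IPI to c */
    }
    return targets;
}

/* The initiator spins (with a timeout) until this returns true. */
bool shootdown_complete(int expected)
{
    return atomic_load(&req.acks) >= expected;
}
```

In a real kernel the request itself would need to be serialized (one shootdown in flight at a time, or one request slot per initiator), and the physical page should only go back to the free pool once `shootdown_complete()` holds.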

And about transfers to and from userspace: My kernel is higher-half. Therefore I need to check if "buffer + length" does not overflow and is not in kernel space. So in assembly, that's an add and two conditional jumps. The kernel has no special addresses in the lower half. This stuff is portable, as far as I know.
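In C, the check could look roughly like this. It is a sketch, not code from any particular kernel: `USER_LIMIT` is an assumed name set to the x86_64 bound quoted earlier in the thread (4-level paging), and `user_range_ok` is a made-up helper:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bound: first address above the lower half on x86_64, 4-level paging. */
#define USER_LIMIT UINT64_C(0x0000800000000000)

/* User pointers are passed in as 64-bit integers in this sketch. */
static_assert(sizeof(uintptr_t) <= sizeof(uint64_t), "pointer must fit in 64 bits");

/* Returns true if [buffer, buffer + length) lies entirely in the user half. */
bool user_range_ok(uint64_t buffer, uint64_t length)
{
    uint64_t end = buffer + length;  /* unsigned wrap-around is well defined */

    if (end < buffer)                /* "buffer + length" overflowed */
        return false;
    if (end > USER_LIMIT)            /* range reaches past the user half */
        return false;
    return true;
}
```

This only establishes that the range is in the user half; whether the pages are actually mapped with the right permissions is left to the page fault path.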
