josecm wrote:
Korona wrote:
Designs where physical memory is not directly mapped quickly turn into a mess. If you map only a part of physical RAM and you change this mapping, you have to perform TLB shootdown.
What if you could keep this region of virtual space you use to remap (the page tables caches we talked about above), private to a core? You wouldn't need TLB shootdowns, right?
Correct. But if you want to keep a per-CPU VM region, you either have to copy 2048 bytes at each CR3 switch (to move the lower half to a different per-CPU higher half), or you need to duplicate all VM spaces for each CPU. While the cost of the first approach could be reduced by some caching, I find neither approach very compelling. Note that just not accessing the physical window of a different CPU is not enough: CPUs are allowed to speculatively prefetch any TLB entry, regardless of whether pages are accessed or not. Also keep in mind that when the caching mode is changed, the OS must be sure that the physical page is not in any TLB of any CPU.
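To make the first option concrete, here is a minimal sketch of the 2048-byte copy, assuming x86-64 4-level paging where a top-level table has 512 entries of 8 bytes each and the lower half is the first 256 of them. The names `pml4_t` and `install_lower_half` are hypothetical and not taken from any particular kernel; the actual CR3 reload is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: x86-64 4-level paging, 512 top-level entries of 8 bytes. */
#define PML4_ENTRIES       512
#define LOWER_HALF_ENTRIES 256
#define LOWER_HALF_BYTES   (LOWER_HALF_ENTRIES * sizeof(uint64_t))

/* Hypothetical representation of a top-level page table. */
typedef struct {
    uint64_t entries[PML4_ENTRIES];
} pml4_t;

/*
 * Copy the lower (user) half of an address space's top-level table into
 * the top-level table owned by the current CPU. The higher half of the
 * per-CPU table, which holds the CPU-private window, is left untouched.
 * A real kernel would reload CR3 (or flush the TLB) afterwards.
 */
static void install_lower_half(pml4_t *per_cpu, const pml4_t *space)
{
    assert(per_cpu != NULL && space != NULL);
    memcpy(per_cpu->entries, space->entries, LOWER_HALF_BYTES);
}
```

This would run on every address space switch, which is why caching (skipping the copy when the same space is switched in again on that CPU) is the obvious way to reduce the cost.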
Note that for Spectre defense, it's not enough to get rid of the full physical mapping. You'd have to guarantee that no sensitive information is mapped in the higher half at all while user space runs. That seems to be much more difficult. At most, getting rid of the full physical mapping gives you some defense-in-depth, but it doesn't allow you to get rid of KPTI.
@bellezzasolo I'm not asking about how TLB shootdown works in general (in fact, managarm handles it quite well), but how it would interact with a physical memory window that needs to be remapped. There are multiple questions that arise in this context: for example, is it possible to work with a fixed number of shootdown messages? If yes, what do you do if you run out of those? Can you block in all contexts where you unmap until a shootdown finishes? Keep in mind that this now includes all contexts where you need to access a physical page, as you do not have all physical pages mapped! Can you guarantee that this never leads to priority inversion? If not, can you handle priority inheritance on TLB shootdown? Is your chosen number of shootdown structs enough to prevent deadlocks (e.g. because the CPU that needs to finish the shootdown needs to issue another shootdown itself)? So far, I have only ever seen hand-wavy answers to these questions but no concrete design that explains how exactly it would work in reality.
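To illustrate where the "fixed number of shootdown messages" question bites, here is a small sketch of a fixed pool of shootdown slots using C11 atomics. It is not a proposed design and does not reflect how managarm does it; `NUM_SLOTS`, `shootdown_slot`, `shootdown_begin` and `shootdown_ack` are all made-up names, and sending the IPI itself is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_SLOTS 4 /* hypothetical fixed pool size */

struct shootdown_slot {
    atomic_bool in_use;       /* slot is claimed by an ongoing shootdown */
    uintptr_t   vaddr;        /* page whose translation must be invalidated */
    atomic_int  pending_cpus; /* remote CPUs that have not acknowledged yet */
};

static struct shootdown_slot slots[NUM_SLOTS];

/*
 * Try to claim a slot for a new shootdown. Returns NULL if every slot is
 * still waiting for acknowledgements. What the caller does in that case
 * (block, spin, fail) is exactly the open question.
 */
static struct shootdown_slot *shootdown_begin(uintptr_t vaddr, int remote_cpus)
{
    assert(remote_cpus > 0);
    for (size_t i = 0; i < NUM_SLOTS; i++) {
        bool expected = false;
        if (atomic_compare_exchange_strong(&slots[i].in_use, &expected, true)) {
            slots[i].vaddr = vaddr;
            atomic_store(&slots[i].pending_cpus, remote_cpus);
            return &slots[i]; /* a real kernel would now send the IPIs */
        }
    }
    return NULL; /* pool exhausted */
}

/* Called by each remote CPU after it has invalidated its TLB entry. */
static void shootdown_ack(struct shootdown_slot *slot)
{
    /* The last CPU to acknowledge releases the slot for reuse. */
    if (atomic_fetch_sub(&slot->pending_cpus, 1) == 1)
        atomic_store(&slot->in_use, false);
}
```

The `NULL` return path is the problem: if the caller spins there with interrupts disabled while a remote CPU is itself stuck in `shootdown_begin`, nobody ever calls `shootdown_ack` and both CPUs deadlock.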
As for your suggestion of always waiting for each IPI to finish: I don't think that's practical in a real-world system. If you do that, a task that maps and unmaps memory in a tight loop can starve the whole system. It also does not allow you to do lazy TLB invalidation, which is quite a nice performance boost.
Thus, my current position is that the added complexity is just not worth it for the ability to utilize slightly more memory on 32-bit systems. And that's the only real advantage that getting rid of the full physical mapping would have.