Korona wrote:
kmalloc() does not need to remap memory. This is a huge performance boost. Memory allocated by kmalloc() does not trigger page faults.
You can have a non-paging dynamic allocator (such as the non-paged pool in Windows) without a linear map of physical memory. The linear map approach is the most immutable, but any other approach, implemented sanely, will not immediately reclaim physical memory for repurposing to other subsystems either. Repurposing will be deferred until another subsystem becomes starved, e.g. the filesystem cache, and then its space will be extended in bulk, anticipating future growth. The mappings of the virtual regions used by the fs cache, the kernel allocators, and the virtual memory manager will hopefully change only occasionally, not frequently. One situation where I expect the contrary is when memory has become very scarce, but that is probably not a very important case to target, because the system is reaching its limits at that point anyway.
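To make the deferral concrete, here is a minimal sketch of such a semi-persistent pool. The helper names (alloc_phys_page, map_kernel_page) are hypothetical placeholders for the PMM/VMM interfaces, not any real kernel's API: the virtual region only grows, in bulk, when the pool runs dry, and ordinary allocations never touch the page tables.

```c
/*
 * Minimal sketch of a semi-persistent, non-paged pool, assuming hypothetical
 * PMM/VMM helpers (alloc_phys_page, map_kernel_page). The virtual region only
 * grows, in bulk, when the pool runs dry; ordinary allocations never touch
 * the page tables and never fault.
 */
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE       4096
#define POOL_GROW_PAGES 256        /* grow in large chunks, anticipating future demand */

extern uintptr_t alloc_phys_page(void);                /* hypothetical PMM call */
extern void map_kernel_page(void *va, uintptr_t pa);   /* hypothetical VMM call */

static uint8_t *pool_next;   /* next unused byte inside the mapped region */
static uint8_t *pool_end;    /* end of the currently mapped region */

void pool_init(void *region_base)
{
    pool_next = pool_end = region_base;   /* reserved kernel virtual region */
}

static void pool_grow(void)
{
    /* The only place the mapping changes: map POOL_GROW_PAGES fresh pages
     * at the end of the region in one pass. */
    for (size_t i = 0; i < POOL_GROW_PAGES; i++) {
        map_kernel_page(pool_end, alloc_phys_page());
        pool_end += PAGE_SIZE;
    }
}

void *pool_alloc(size_t size)
{
    size = (size + 15) & ~(size_t)15;          /* 16-byte alignment */
    while (pool_next + size > pool_end)
        pool_grow();                           /* rare, bulk mapping event */
    void *p = pool_next;
    pool_next += size;
    return p;                                  /* no page fault, no remap */
}
```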
Korona wrote:
kfree() does not need to perform TLB shootdowns. No IPIs are required during kfree().
That is necessary only during subsystem memory-contention events, not for each individual buffer allocation and release in the system. As I already suggested, semi-persistent mapping and linear mapping (as in the Linux kernel) are not the same thing. No sane system will map and unmap unless it has to rebalance physical memory pages between different virtual regions.
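Here is a rough sketch of what such a rebalancing event might look like, with hypothetical helpers: a batch of pool pages is unmapped and handed back to a starved subsystem in one pass, so a single shootdown covers the whole batch rather than one per kfree().

```c
/*
 * Sketch of a rebalancing event, with hypothetical helpers: a batch of pool
 * pages is unmapped and handed back to a starved subsystem in one pass, so a
 * single TLB shootdown covers the whole range instead of one per kfree().
 */
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096

extern void unmap_kernel_page(void *va, uintptr_t *pa_out);   /* hypothetical */
extern void free_phys_page(uintptr_t pa);                     /* hypothetical */
extern void tlb_shootdown_range(void *va, size_t len);        /* one IPI round */

void pool_shrink(void *region, size_t pages)
{
    for (size_t i = 0; i < pages; i++) {
        uintptr_t pa;
        unmap_kernel_page((uint8_t *)region + i * PAGE_SIZE, &pa);
        free_phys_page(pa);
    }
    /* One shootdown for the whole batch, not one per freed buffer. */
    tlb_shootdown_range(region, pages * PAGE_SIZE);
}
```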
Korona wrote:
Kernel buffers (e.g. for IPC messages) can be accessed (page-wise) without remapping memory. This is also a huge performance boost (e.g. when accessing buffers in a different address space). There is no need to track which buffers are actually mapped and which are not mapped because all buffers are mapped.
The only complication for the case you describe comes from TLB shootdown, which you have pinpointed below, but by my arguments above it should hopefully be a rare event. There is, however, a related problem. With a full physical memory mapping, user buffers can be accessed directly from any system thread, whereas in the absence of such a mapping they must either be mapped into kernel space or accessed only from threads within the owning user process. This is not an issue for requests that can be served directly by DMA, but for software RAID, compression, or software encryption it will be. So the approach has its advantages for certain types of I/O.
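To illustrate the difference, here is a hedged sketch of a kernel thread touching one page of a user buffer given only its physical address, e.g. in a software RAID checksum path. The helper names (phys_to_virt, kmap_temporary) are assumptions, not any particular kernel's API.

```c
/*
 * Sketch of the two access paths for a page of a user buffer, given only its
 * physical address (e.g. in a software RAID or encryption path). The helper
 * names are assumptions, not any particular kernel's API.
 */
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096

extern void *phys_to_virt(uintptr_t pa);     /* valid only with a full linear map */
extern void *kmap_temporary(uintptr_t pa);   /* set up a transient kernel mapping */
extern void  kunmap_temporary(void *va);     /* tear it down again */

/* With a linear map: works from any kernel thread, no mapping traffic. */
uint32_t checksum_page_linear(uintptr_t pa)
{
    const uint8_t *p = phys_to_virt(pa);
    uint32_t sum = 0;
    for (size_t i = 0; i < PAGE_SIZE; i++)
        sum += p[i];
    return sum;
}

/* Without one: each page needs a temporary mapping (or the work must run in
 * a thread of the owning process, where the buffer is already mapped). */
uint32_t checksum_page_mapped(uintptr_t pa)
{
    uint8_t *p = kmap_temporary(pa);
    uint32_t sum = 0;
    for (size_t i = 0; i < PAGE_SIZE; i++)
        sum += p[i];
    kunmap_temporary(p);
    return sum;
}
```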
Korona wrote:
No "in order to access new memory I need to setup a page table - in order to setup a page table I need to access new memory" cycles.
The linear mapping approach simplifies the PMM significantly, IMO. If for no other reason, I think this alone is motivation enough to keep it around.
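A small sketch of the cycle-breaking, again with hypothetical alloc_phys_page/phys_to_virt helpers: with a linear map, a freshly allocated page-table page is already addressable, so installing it needs no temporary mapping and no recursion.

```c
/*
 * Sketch of how the linear map breaks the "need memory to map memory" cycle,
 * assuming hypothetical alloc_phys_page/phys_to_virt helpers: a freshly
 * allocated page-table page is already addressable through the linear map,
 * so installing it needs no temporary mapping and no recursion.
 */
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE    4096
#define PTE_PRESENT  0x1ULL
#define PTE_WRITABLE 0x2ULL

extern uintptr_t alloc_phys_page(void);      /* hypothetical PMM call */
extern void     *phys_to_virt(uintptr_t pa);

/* Install a new page table under an empty page-directory entry. */
uintptr_t install_page_table(uint64_t *pde)
{
    uintptr_t pt_phys = alloc_phys_page();
    uint64_t *pt = phys_to_virt(pt_phys);    /* directly addressable, no remap */
    memset(pt, 0, PAGE_SIZE);
    *pde = pt_phys | PTE_PRESENT | PTE_WRITABLE;
    return pt_phys;
}
```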
Korona wrote:
TLB shootdown is sane. Allocate a shootdown request in permanently mapped memory, IPI all processors, wait until all processors respond, free the request. No "how do I shootdown the memory holding the shootdown request?" shenanigans.
This is a CPU-related request and thus can use per-CPU memory. It does not need to be dynamically allocated on each request. If it is dynamically allocated at all, that will happen during a CPU hot-plug event, if such events are supported.
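A minimal sketch of that idea, with hypothetical names throughout: each CPU's request slot is allocated once (at boot or at hot-plug) and then reused for every shootdown, so issuing an IPI round never touches a dynamic allocator.

```c
/*
 * Sketch of the per-CPU shootdown slot, hypothetical names throughout: each
 * CPU owns one request structure, allocated once (at boot or at hot-plug) and
 * reused for every shootdown.
 */
#include <stdatomic.h>
#include <stdint.h>

#define MAX_CPUS 64

struct shootdown_request {
    uintptr_t   start;     /* first virtual address to invalidate */
    uintptr_t   end;       /* one past the last address */
    atomic_uint pending;   /* remote CPUs that have not yet acknowledged */
};

static struct shootdown_request shootdown_slot[MAX_CPUS];

extern unsigned this_cpu_id(void);                               /* hypothetical */
extern unsigned online_cpu_count(void);                          /* hypothetical */
extern void send_shootdown_ipi_all(struct shootdown_request *);  /* hypothetical */

/* The IPI handler on each remote CPU invalidates [start, end) and then
 * decrements req->pending. */
void tlb_shootdown(uintptr_t start, uintptr_t end)
{
    struct shootdown_request *req = &shootdown_slot[this_cpu_id()];
    req->start = start;
    req->end   = end;
    atomic_store(&req->pending, online_cpu_count() - 1);
    send_shootdown_ipi_all(req);
    while (atomic_load(&req->pending) != 0)
        ;   /* spin until every other CPU has responded */
}
```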
Korona wrote:
What do you lose?
The problem is not that the mapping is persistent. The mapping on systems like Windows is kept as persistent as possible under changing system circumstances. In all honesty, I expect that it does thrash the virtual space more, but I don't think that will produce a perceptible difference unless the load is intentionally engineered with this goal in mind. The issue with the Linux approach is that the kernel pool and the filesystem cache share the same virtual region (because they both operate on the linearly mapped memory) and thus also share fragmentation with each other. It is difficult to make multi-page allocations, because the filesystem cache uses individual pages and has dusted the memory with them. The buddy allocator tries to remedy this by coalescing, but it can only do that for a few levels (it becomes exponentially less effective the larger the allocation is). There are even constants in the kernel itself that signify the largest allocation that can be reliably performed under a normal statistical load.
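As a toy illustration of why coalescing degrades (not any particular kernel's code): merging a freed block upward requires its buddy at each order to be entirely free, so a single in-use page inside the buddy blocks the merge, and the chance of hitting one grows with the block size.

```c
/*
 * Toy illustration of buddy coalescing (not any particular kernel's code):
 * merging a freed block upward requires its buddy at each order to be
 * entirely free, so a single in-use page (e.g. held by the fs cache) inside
 * the buddy stops the merge.
 */
#include <stdbool.h>
#include <stdint.h>

#define MAX_ORDER 11   /* orders 0..10, i.e. largest block = 2^10 pages */

extern bool block_is_free(uintptr_t pfn, unsigned order);   /* hypothetical */
extern void merge_blocks(uintptr_t pfn, unsigned order);    /* hypothetical */

/* Coalesce the freed block starting at page frame 'pfn' as far up as possible. */
unsigned coalesce(uintptr_t pfn, unsigned order)
{
    while (order + 1 < MAX_ORDER) {
        uintptr_t buddy = pfn ^ ((uintptr_t)1 << order);   /* buddy differs in one bit */
        if (!block_is_free(buddy, order))
            break;                       /* one busy page in the buddy blocks the merge */
        merge_blocks(pfn < buddy ? pfn : buddy, order);
        pfn &= ~((uintptr_t)1 << order); /* merged block starts at the lower buddy */
        order++;
    }
    return order;
}
```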
I would consider a hybrid approach a good strategy, but someone else will have to comment on the security implications. It is a little disconcerting that all of physical memory is visible, even if the contents are generally at randomized locations. The general trend in security mitigations has been to limit visibility and determinism as much as possible, including for kernel code.
P.S. I forgot to mention, although it has probably become obvious from the entire tirade: the linear mapping approach can use large pages, even huge pages (i.e. 1 GB). This means that, depending on the CPU model's limit on 1 GB TLB entries, accesses in that range may not thrash the TLB at all. This is actually a significant advantage. Windows tries to back its virtual memory ranges with physical memory in 2 MB large-page chunks, I think, but the efficiency of 1 GB translations cannot be beaten. Altogether, I have to agree that TLB and translation efficiency is better with the linear mapping model, but I am not sure it outweighs the memory fragmentation issues. Reader's choice.
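For reference, a sketch of how the linear map could be built with 1 GB pages on x86-64. The base address and flag names are hypothetical, and the code assumes the CPU advertises 1 GB page support and that RAM fits under one PDPT (at most 512 GB).

```c
/*
 * Sketch of building the linear physical map with 1 GB pages on x86-64. The
 * base address and flag names are hypothetical; the code assumes the CPU
 * advertises 1 GB page support and that ram_bytes fits under one PDPT
 * (i.e. at most 512 GB).
 */
#include <stdint.h>

#define PHYSMAP_BASE 0xffff880000000000ULL   /* hypothetical, 512 GB aligned kernel VA */
#define GIB          (1ULL << 30)
#define PTE_PRESENT  (1ULL << 0)
#define PTE_WRITABLE (1ULL << 1)
#define PTE_HUGE     (1ULL << 7)             /* PS bit: 1 GB page at the PDPT level */
#define PTE_NX       (1ULL << 63)

/* pdpt must already be installed under the PML4 entry covering PHYSMAP_BASE. */
void map_physmap_1gib(uint64_t *pdpt, uint64_t ram_bytes)
{
    for (uint64_t pa = 0; pa < ram_bytes; pa += GIB) {
        unsigned idx = (unsigned)((pa >> 30) & 0x1ff);   /* PHYSMAP_BASE is 512 GB aligned */
        pdpt[idx] = pa | PTE_PRESENT | PTE_WRITABLE | PTE_HUGE | PTE_NX;
    }
}
```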