How to fix Meltdown on my OS ?

zesterer · **Posted:** Wed Jan 10, 2018 4:36 am

DavidCooper wrote:

How often does an OS actually need to access data that needs to be kept hidden? Wouldn't it be possible for the kernel itself to run under two different sets of page tables, with one of them shutting out any memory that you don't want apps to be able to access through these vulnerabilities? That way, if an interrupt occurs at CPL=3 you wouldn't need a CR3 switch unless the kernel actually needs to access private data, and I suspect that in most cases it doesn't.

The problem with this is that the performance overhead associated with determining what is security-sensitive data and what isn't would likely be at least close to the cost of simply implementing KPTI. All that, for a massive increase in the complexity of the memory-management code of your kernel. Not fun. You might as well go develop a microkernel because that's effectively what your design becomes.

tom9876543 · **Joined:** Wed Jul 18, 2007 5:51 am **Posts:** 170

Meltdown is a major Intel f#$k up.

Example code:

Code:

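; rcx = kernel address to read
; rbx = probe array of 256 * 4 KiB
xor rax, rax                  ; rax = 0, sets ZF
jz skipCode                   ; always taken; the body runs only speculatively, on a misprediction
  mov al, byte [rcx]          ; speculative load of the kernel byte
  shl rax, 0xc                ; byte value * 4096 selects one probe page
  mov rdx, qword [rbx + rax]  ; touching that page pulls it into the cache
skipCode:
; time an access to each of the 256 probe pages: the cached one -> byte at [rcx]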

The Intel CPU speculatively executes the instruction:

Code:

mov al, byte [rcx]

But it does NOT do security checks.
A major f#$k up.
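
The last comment in the listing above (work out which probe slot is cached) is the step that turns the cache side effect into a byte. A minimal sketch of that step in C, assuming the access latency of each of the 256 probe slots has already been measured and is passed in as a plain array. `probe_offset`, `recover_byte` and the threshold are hypothetical names for illustration, not something from the post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PROBE_SLOTS 256   /* one slot per possible byte value */
#define PROBE_SHIFT 12    /* 4 KiB stride, matches "shl rax, 0xc" */

/* Offset into the probe array that gets touched for a given byte value. */
static size_t probe_offset(uint8_t value)
{
    return (size_t)value << PROBE_SHIFT;
}

/*
 * Given one measured access latency per probe slot, return the index of
 * the fastest slot if it is below `threshold` (it looks cached), or -1
 * if no slot looks cached.
 */
static int recover_byte(const uint64_t latency[PROBE_SLOTS], uint64_t threshold)
{
    int best = -1;
    uint64_t best_latency = threshold;

    assert(latency != NULL);
    for (int i = 0; i < PROBE_SLOTS; i++) {
        if (latency[i] < best_latency) {
            best_latency = latency[i];
            best = i;
        }
    }
    return best;
}
```

The threshold separating "cached" from "not cached" depends on the machine and would have to be calibrated; the sketch just takes it as a parameter.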

bluemoon · **Posted:** Sat Jan 13, 2018 11:45 pm

tom9876543 wrote:

The Intel CPU speculatively executes the instruction:

Code:

mov al, byte [rcx]

But it does NOT do security checks. A major f#$k up.

From the other Meltdown code, where you trigger a page fault and use intermediate values from the (non-aliased) register file, we can assume the security check is done asynchronously, probably by a separate unit. So it "is" reasonable that speculative execution doesn't call the checker; after all, the checker is designed to be asynchronous.

It turns out everything you do has side effects, and yes, I would say it's a design defect.

tom9876543 · **Joined:** Wed Jul 18, 2007 5:51 am **Posts:** 170

bluemoon wrote:

tom9876543 wrote:

The Intel CPU speculatively executes the instruction:

Code:

mov al, byte [rcx]

But it does NOT do security checks. A major f#$k up.

From the other Meltdown code, where you trigger a page fault and use intermediate values from the (non-aliased) register file, we can assume the security check is done asynchronously, probably by a separate unit. So it "is" reasonable that speculative execution doesn't call the checker; after all, the checker is designed to be asynchronous.

It turns out everything you do has side effects, and yes, I would say it's a design defect.

Yes, it looks like Intel CPUs do asynchronous security validation.
Intel needs to fix its CPU design.
When doing speculative execution, do NOT keep any values in the cache. Send all speculative loads to a hidden on-CPU buffer (similar to the CPU's hidden registers).
Only when the instructions are committed should the data be saved to the L1/L2/L3 cache.
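
To make the proposal concrete, here is a toy software model in C. It is only a sketch of the idea in this post, not a description of how any real CPU works. The structure, the buffer sizes and all function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINES  64   /* toy size of the visible cache (assumption) */
#define SHADOW_LINES 8    /* toy size of the hidden buffer (assumption) */

struct toy_cpu {
    uint64_t cache[CACHE_LINES];    /* line addresses a timing probe can see */
    size_t   cache_used;
    uint64_t shadow[SHADOW_LINES];  /* lines fetched speculatively, not yet visible */
    size_t   shadow_used;
};

/* True if a timing probe would observe `line` as cached. */
static bool line_visible(const struct toy_cpu *cpu, uint64_t line)
{
    for (size_t i = 0; i < cpu->cache_used; i++)
        if (cpu->cache[i] == line)
            return true;
    return false;
}

/* A speculative load only fills the hidden buffer. */
static bool speculative_load(struct toy_cpu *cpu, uint64_t line)
{
    assert(cpu != NULL);
    if (cpu->shadow_used == SHADOW_LINES)
        return false;               /* buffer full: the load would have to stall */
    cpu->shadow[cpu->shadow_used++] = line;
    return true;
}

/* Instructions committed: publish the buffered lines to the visible cache.
 * Toy simplification: no eviction, lines are dropped once the cache is full. */
static void commit(struct toy_cpu *cpu)
{
    for (size_t i = 0; i < cpu->shadow_used; i++) {
        if (!line_visible(cpu, cpu->shadow[i]) && cpu->cache_used < CACHE_LINES)
            cpu->cache[cpu->cache_used++] = cpu->shadow[i];
    }
    cpu->shadow_used = 0;
}

/* Misprediction or fault: discard the buffered lines, leaving no trace. */
static void squash(struct toy_cpu *cpu)
{
    cpu->shadow_used = 0;
}
```

In this model a squashed load never becomes visible to `line_visible`, which is exactly the property the probe-array trick depends on being absent.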

AJ · **Posted:** Mon Jan 15, 2018 4:59 am

Hi All,

Just putting this out there as a quick thought from an amateur.

The slowdown from fixing Meltdown is, by my limited understanding, caused by the fact that the kernel now has to reside in a separate process space. This means a call to the kernel involves at the very least:

Code:

Process --TS--> Kernel --TS--> Process

Of course, this will be worse where IPC is involved.

How about mapping a stub into, say, the first large page of kernel space? This stub contains no sensitive data and purely implements a SYSCALL handler. For a 64-bit kernel, the stub then writes a single PML4 entry which points to the kernel. The kernel handles the call, the stub clears the present bit in that PML4 entry and jumps back to the calling process:

Code:

Process--SYSCALL-->Stub--CALL-->Kernel--CALL-->Stub--SYSRET-->Process

I realise that mapping in / out a PML4 page has associated TLB costs, but just wondered firstly if those costs are lower than task switching and secondly whether this would actually rectify the cache issues?

I'm waiting to be put right!

Cheers,
Adam
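
A minimal sketch in C of the PML4 toggling described in this post, run against a plain array instead of a live page table. It assumes the kernel proper sits in its own PML4 slot, separate from the stub. The function names and the example addresses are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PML4_ENTRIES 512
#define PTE_PRESENT  (1ULL << 0)   /* bit 0 of a paging-structure entry */

/* Index of the PML4 entry that covers a virtual address (bits 47:39). */
static unsigned pml4_index(uint64_t vaddr)
{
    return (unsigned)((vaddr >> 39) & 0x1FF);
}

/* Stub entry path: make the kernel's PML4 slot present. */
static void stub_map_kernel(uint64_t pml4[PML4_ENTRIES], uint64_t kernel_base)
{
    assert(pml4 != NULL);
    pml4[pml4_index(kernel_base)] |= PTE_PRESENT;
}

/* Stub exit path: clear the present bit again. On real hardware, stale
 * translations cached in the TLB would also have to be invalidated here. */
static void stub_unmap_kernel(uint64_t pml4[PML4_ENTRIES], uint64_t kernel_base)
{
    assert(pml4 != NULL);
    pml4[pml4_index(kernel_base)] &= ~PTE_PRESENT;
}

/* True if the kernel's PML4 slot is currently marked present. */
static bool kernel_mapped(const uint64_t pml4[PML4_ENTRIES], uint64_t kernel_base)
{
    return (pml4[pml4_index(kernel_base)] & PTE_PRESENT) != 0;
}
```

The sketch only flips the bit; it says nothing about the TLB or SMP costs of doing so.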

bluemoon · **Posted:** Mon Jan 15, 2018 6:04 am

That would work on machines without PCID, where you have to invalidate some TLB entries anyway.

On modern machines, the problem is making sure you do not run out of PCIDs (the kernel's should be sticky) and not wasting entries on infrequently run processes. For example, you could give no PCID to daemons and background processes and fall back to invalidating page entries for them.

Korona · **Joined:** Thu May 17, 2007 1:27 pm **Posts:** 999

AJ: Yeah, that is what the fixes are actually doing (at least for Linux). They map the syscall and IRQ handlers as usual and change cr3 on entry.
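
A rough sketch in C of the "change cr3 on entry" idea, assuming each process keeps two page-table roots and that PCIDs are enabled. It only computes the value that would be loaded into CR3; the privileged `mov cr3` itself is left out. The struct layout and function names are hypothetical and are not taken from the Linux patches.

```c
#include <assert.h>
#include <stdint.h>

#define CR3_PCID_MASK 0xFFFULL       /* CR3 bits 11:0 hold the PCID when CR4.PCIDE = 1 */
#define CR3_NOFLUSH   (1ULL << 63)   /* MOV to CR3 with bit 63 set keeps TLB entries for that PCID */

struct address_space {
    uint64_t user_pml4;     /* physical address: user pages plus entry stubs only */
    uint64_t kernel_pml4;   /* physical address: full kernel mapping */
    uint16_t user_pcid;
    uint16_t kernel_pcid;
};

/* Compose a CR3 value from a 4 KiB aligned table address and a PCID. */
static uint64_t make_cr3(uint64_t pml4_phys, uint16_t pcid, int noflush)
{
    assert((pml4_phys & CR3_PCID_MASK) == 0);
    assert(pcid <= CR3_PCID_MASK);
    return pml4_phys | pcid | (noflush ? CR3_NOFLUSH : 0);
}

/* Value to load when a syscall or IRQ enters the kernel. */
static uint64_t cr3_on_kernel_entry(const struct address_space *as)
{
    return make_cr3(as->kernel_pml4, as->kernel_pcid, 1);
}

/* Value to load just before returning to user mode. */
static uint64_t cr3_on_kernel_exit(const struct address_space *as)
{
    return make_cr3(as->user_pml4, as->user_pcid, 1);
}
```

Without PCID support the low 12 bits and bit 63 would stay clear, and every switch would flush the non-global TLB entries.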

mallard · **Posted:** Mon Jan 15, 2018 8:54 am

bluemoon wrote:

On modern machines, the problem is making sure you do not run out of PCIDs (the kernel's should be sticky) and not wasting entries on infrequently run processes. For example, you could give no PCID to daemons and background processes and fall back to invalidating page entries for them.

You've got 4096 PCIDs... It's pretty unusual for any system to be running that many processes simultaneously.

Sure, it's possible and you should handle that situation, but it's not something that's going to be common, so pre-emptively not assigning PCIDs out of worry that you might eventually run out is almost certainly a bad approach. It's probably best just to assign PCIDs to every process until you run out, then remove them from low-priority/not-run-recently/some-other-appropriate-heuristic processes as needed.
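
The "assign until you run out, then take one back" policy could look something like the following C sketch. It assumes least-recently-used as the reclaim heuristic, which is only one of the options named above. The table layout and function names are hypothetical, and the linear scan is there for clarity, not speed.

```c
#include <assert.h>
#include <stdint.h>

#define PCID_COUNT 4096         /* 12-bit PCID field */
#define NO_OWNER   UINT32_MAX   /* marks a PCID nobody holds */

struct pcid_table {
    uint32_t owner[PCID_COUNT];       /* process id holding each PCID */
    uint64_t last_used[PCID_COUNT];   /* tick of the most recent use */
};

static void pcid_init(struct pcid_table *t)
{
    for (int i = 0; i < PCID_COUNT; i++) {
        t->owner[i] = NO_OWNER;
        t->last_used[i] = 0;
    }
}

/*
 * Return a PCID for `pid`: its existing one, else a free one, else the
 * least recently used one. *stolen is set to 1 when the PCID was taken
 * from another process, in which case the caller must flush the TLB
 * entries tagged with it before reuse.
 */
static uint16_t pcid_assign(struct pcid_table *t, uint32_t pid,
                            uint64_t now, int *stolen)
{
    int free_slot = -1;
    int lru = 0;

    assert(pid != NO_OWNER);
    *stolen = 0;
    for (int i = 0; i < PCID_COUNT; i++) {
        if (t->owner[i] == pid) {
            t->last_used[i] = now;
            return (uint16_t)i;
        }
        if (t->owner[i] == NO_OWNER) {
            if (free_slot < 0)
                free_slot = i;
        } else if (t->last_used[i] < t->last_used[lru]) {
            lru = i;
        }
    }

    int slot = (free_slot >= 0) ? free_slot : lru;
    if (free_slot < 0)
        *stolen = 1;
    t->owner[slot] = pid;
    t->last_used[slot] = now;
    return (uint16_t)slot;
}
```

A real kernel would probably track this per CPU and fold process priority into the choice of victim; the sketch ignores both.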

AJ · **Posted:** Mon Jan 15, 2018 8:58 am

@Korona: OK - I guess the only difference in the suggestion I made then is that only a single PML4 entry is changed - CR3 doesn't need to be touched.

Korona · **Joined:** Thu May 17, 2007 1:27 pm **Posts:** 999

Ah, I see. However, changing the PML4 could be more complex in the presence of SMP (other processors might share the CR3 with the processor that does the syscall). It also requires costly invalidation (via INVLPG of the affected range) at unmap time, which probably negates any performance advantage over the CR3 switch.

Regarding PCIDs: Yes, theoretically there are many PCIDs but they are hashed at the hardware level. The processors do not actually implement 4096 PCIDs; in reality the number of available PCIDs might be as low as 8.

OSDev.org
