x86 double fault caused by invalid LSTAR MSR

aaronlu · **Posted:** Tue Jan 07, 2020 5:53 am

I have been chasing a #DF bug caused by an invalid LSTAR. The bug itself is another story, but I wanted to know exactly how the #DF occurred, so I ran an experiment under Linux.
If I set the LSTAR MSR to an invalid value (say 0x10, which is not mapped) while in kernel mode just before returning to user space, somewhere like do_syscall_64(), the system double-faults instantly.

Below is my guess at what happens after LSTAR has been set to an invalid value.
1 CPU returns to user space; user code issues a system call using the 'syscall' instruction.
2 CPU switches to privileged (kernel) mode; %rip is loaded from LSTAR, while %rsp is still the user-space %rsp.
3 Because LSTAR is invalid, the CPU page-faults on 0x10 (right?)
4 Since the CPU is already in privileged mode, no stack switch occurs when the page fault handler is invoked through the IDT.
5 The CPU needs to push flags/RIP/CS on the current stack before invoking the page fault handler, and since the current stack is the user-space stack, it may be unmapped too, which would cause another page fault. The CPU then decides to #DF.
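
Steps 1 and 2 can be modeled in a few lines of C. This is only a sketch: struct cpu_state and do_syscall_insn() are made-up names for illustration, not kernel code, and the loading of CS/SS from the STAR MSR is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified model of the CPU state that SYSCALL touches. */
struct cpu_state {
    uint64_t rip, rsp, rflags, rcx, r11;
    uint64_t gs_base;        /* active GS base (IA32_GS_BASE) */
    unsigned cpl;            /* current privilege level, 0..3 */
    uint64_t lstar, fmask;   /* IA32_LSTAR and IA32_FMASK MSRs */
};

/* Apply the effects of a 64-bit SYSCALL whose following instruction is at next_rip. */
static void do_syscall_insn(struct cpu_state *c, uint64_t next_rip)
{
    assert(c != 0);
    c->rcx = next_rip;        /* return address is saved in RCX */
    c->r11 = c->rflags;       /* old flags are saved in R11 */
    c->rflags &= ~c->fmask;   /* flags are masked by IA32_FMASK */
    c->rip = c->lstar;        /* new RIP comes straight from LSTAR, unchecked for mapping */
    c->cpl = 0;               /* now running in ring 0 */
    /* RSP and the GS base are left untouched: both still hold user values. */
}
```

With lstar set to 0x10, the model ends up in ring 0 with rip == 0x10 and a user rsp, which is the state step 3 starts from.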

Not sure if the above is correct.

I also tried touching every page of the user stack after setting the invalid LSTAR but before returning to user mode, hoping that when the CPU faulted on the invalid LSTAR address, the user stack in use would be fully mapped and no #DF would occur. In reality the system still double-faults, and df_debug() showed me that %rsp is at the bottom of the user stack, so any push triggers a #DF since the CPU is already delivering a page fault. I have no idea why %rsp would be at the bottom of the stack; maybe it's a user-space trick, maybe I missed something.

Thanks for your time.

feryno · **Posted:** Tue Jan 07, 2020 10:14 am

1-4 correct.
5 I don't think so. The ring 3 stack is usually mapped, and you wrote as much yourself: "it's all mapped".
Look into your OS page fault handler. It probably checks whether the #PF came from ring 0 or ring 3 by looking at the CS pushed on the stack. If CS is a ring 3 selector, the handler executes SWAPGS; if CS is a ring 0 selector, SWAPGS is NOT executed because the OS assumes it is already running with the ring 0 GS base.
But executing syscall from ring 3 leaves the CPU with the ring 3 GS base; this should be added to your step 2.
So your page fault handler is very likely accessing the ring 3 GS base instead of the ring 0 GS base, and that causes another #PF (it reads a bogus pointer through the ring 3 GS base, and the second #PF happens when that bogus pointer is used to access memory).
Maybe your OS has an advanced #PF handler similar to an #NMI handler, where the OS checks whether the interrupt happened in ring 0 or ring 3 using the pushed CS and also checks whether the GS base is the ring 3 or the ring 0 one using RDMSR on 0xC0000101. In that case I do not yet have an explanation for your #DF.
Without analyzing your #PF handler I can only guess, but very likely the problem is that you entered ring 0 while still holding the ring 3 GS base (the first instruction of a syscall handler is usually SWAPGS, which establishes the ring 0 GS base).
Surprisingly, it is possible to execute syscall from ring 0, in which case the syscall handler is entered with the ring 0 GS base. It has no sensible use, and I have seen it in only one place: the MS kernel developer team used syscall in ring 0 as a way to detect a poorly designed hypervisor (KiErrataSkx55Present, KiErrata704Present; camouflaged names, not errata at all).
The #NMI handler checks not only the pushed CS but also the GS base, because an #NMI can arrive right on entry to the syscall handler, before its first instruction (the SWAPGS) has executed: CS is already ring 0 but the GS base is still the ring 3 one. The #PF handler is usually less careful than the #NMI handler and does not expect a #PF to happen with the "wrong" GS base.
The #NMI handler must also always run on a good stack, so it uses the IST feature of the 64-bit IDT gate descriptor. The #PF handler does not use IST and relies on the legacy stack-switching mechanism (TSS.RSP0 instead of TSS.IST1-IST7), which leaves open the possibility of generating a #DF.

Here is something to study; compare the cheap way and the expensive way:
https://lore.kernel.org/lkml/[email protected]/
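
To make the cheap/expensive distinction concrete, here is a rough sketch in C. The function names are invented for illustration, and the "expensive" check assumes that kernel GS bases live in the upper canonical half (bit 63 set) while user GS bases do not; on real hardware the value would come from RDMSR on 0xC0000101 rather than a function argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Cheap way: trust the privilege bits (RPL) of the CS selector saved in the frame. */
static bool swapgs_needed_by_cs(uint16_t saved_cs)
{
    return (saved_cs & 3) != 0;   /* fault came from ring 3, so GS base is the user one */
}

/* Expensive way: inspect the active GS base itself.
 * Assumption: a kernel GS base has bit 63 set, a user GS base does not. */
static bool swapgs_needed_by_gs_base(uint64_t active_gs_base)
{
    return (active_gs_base >> 63) == 0;   /* user-half value: kernel GS base not loaded yet */
}
```

For a fault taken right after syscall with a bad LSTAR, the saved CS is a ring 0 selector while the GS base is still the user one, so the cheap check says "no SWAPGS" and the expensive check says "SWAPGS needed".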

quirck · **Posted:** Tue Jan 07, 2020 10:25 am

Could you check what's on the user stack when you get #DF?

aaronlu · **Posted:** Tue Jan 07, 2020 9:19 pm

Hi feryno and quirck, thanks for your replies.

Let me clarify the "user stack is all mapped" part. Normally we can't guarantee that. But since I suspected the #DF was due to the user stack not being mapped, I made sure it was all mapped, from kernel mode, before returning to user space. I did this by finding the VMA the user stack belongs to and touching every page in that VMA's range. E.g. if the user stack resides in the VMA region [0x7ffc5a04d000 - 0x7ffc5a06e000], I made sure all of these addresses are accessible.

But when the #DF occurs, in the double fault handler (which uses a dedicated stack by means of IST), I can see that the saved %rsp on the #DF stack points to the bottom of the VMA region, i.e. 0x7ffc5a04d000. So it appears that when user space (perhaps libc?) issues 'syscall', %rsp is at the bottom of the region. With this %rsp the CPU cannot push anything onto the stack before invoking the page fault handler, and thus it double-faults.

The tricky part is: why would user space set %rsp to the bottom of the stack region? It should have something to do with my touching the VMA region in kernel mode, because if I don't do that, the saved %rsp seen by the double fault handler doesn't point to the bottom of the stack (but then I can't be sure the whole user stack is mapped).

BTW, the Linux fault handler checks CS in error_entry() to decide whether the fault came from kernel mode or user mode, IIUC. The problem is, I don't think the CPU can successfully invoke the page fault handler, or it wouldn't #DF. Alternatively, the CPU might have successfully invoked the page fault handler and then faulted again due to things like GS, as you pointed out, until eventually the stack overflowed and it double-faulted. I will take a look at the user stack as suggested by quirck.

aaronlu · **Posted:** Wed Jan 08, 2020 2:36 am

OK, so inspired by feryno about the GS register -

Currently the Linux fault handler does 'swapgs' depending on which mode the CPU was in before the fault occurred. But since the fault here is caused by an invalid RIP due to the wrong LSTAR on 'syscall', the CPU is already in kernel mode when it invokes the page fault handler, so Linux doesn't do 'swapgs'. That causes more page faults while the page fault handler runs with a user GS (e.g. get_cpu() uses per-CPU data). These page faults repeat until the stack pointer crosses the user stack boundary; then the CPU can no longer invoke the page fault handler and delivers #DF.

I modified the Linux fault handler to also 'swapgs' if CS indicates kernel mode but the stack is a user stack. After this change, the CPU no longer double-faults and the kernel can now correctly dump an error message and die.
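
The decision described above could look roughly like this in C. This is not the actual patch: fault_entry_needs_swapgs() and USER_ADDR_LIMIT are hypothetical names, and the limit assumes 4-level paging with a 47-bit user address range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: user addresses are below 2^47 (4-level paging). */
#define USER_ADDR_LIMIT 0x0000800000000000ULL

/* Hypothetical helper: should the fault entry path execute SWAPGS? */
static bool fault_entry_needs_swapgs(uint16_t saved_cs, uint64_t saved_rsp)
{
    if ((saved_cs & 3) != 0)
        return true;                      /* fault came from user mode */
    return saved_rsp < USER_ADDR_LIMIT;   /* kernel CS, but still on a user stack */
}
```

The second condition is what catches the bad-LSTAR case: the saved CS is a kernel selector, but the saved %rsp is still something like 0x7ffc5a06d000.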

I'll try to summarize the whole picture again here (the starting point below is after an invalid LSTAR has been set in kernel mode):
1 CPU returns to user space; user code issues a system call using the 'syscall' instruction.
2 CPU switches to privileged (kernel) mode; %rip is loaded from LSTAR, %rsp is still the user-space %rsp, and GS is still the user GS.
3 Because LSTAR is invalid, the CPU page-faults on 0x10.
4 Since the CPU is already in privileged mode, no stack switch occurs when the page fault handler is invoked through the IDT.
5 The CPU needs to push flags/RIP/CS on the current stack before invoking the page fault handler. Most likely the user stack is mapped, so the page fault handler is invoked.
6 The Linux fault handler checks the saved CS to decide which mode the CPU was in; if it was kernel mode, no swapgs occurs (as is the case here).
7 During the execution of the page fault handler, per-CPU data needs to be accessed, and that causes another page fault.
8 Go back to step 5, until the user stack boundary is crossed and the CPU can no longer invoke the page fault handler; then #DF.
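
The loop in steps 5 to 8 can be simulated with a small C sketch. It assumes the 64-bit #PF frame (SS, RSP, RFLAGS, CS, RIP plus an error code, pushed after aligning %rsp to 16 bytes); handler_bytes is a made-up parameter standing in for however much stack the handler uses before it faults again, which I have not measured.

```c
#include <assert.h>
#include <stdint.h>

#define PF_FRAME_BYTES 48u   /* SS, RSP, RFLAGS, CS, RIP, error code: 6 * 8 bytes */

/* Count how many nested #PF deliveries succeed before the frame push would
 * cross stack_lo (the lowest mapped address of the stack region). */
static unsigned nested_faults_before_df(uint64_t rsp, uint64_t stack_lo,
                                        uint64_t handler_bytes)
{
    unsigned delivered = 0;
    assert(rsp >= stack_lo);
    for (;;) {
        uint64_t aligned = rsp & ~0xFULL;        /* CPU aligns RSP to 16 bytes first */
        if (aligned < stack_lo + PF_FRAME_BYTES)
            return delivered;                    /* frame does not fit: #DF */
        rsp = aligned - PF_FRAME_BYTES;          /* frame pushed, handler entered */
        delivered++;
        uint64_t room = rsp - stack_lo;          /* handler's own stack use, clamped */
        rsp -= handler_bytes < room ? handler_bytes : room;
    }
}
```

Whatever the starting %rsp is, the loop always ends with %rsp within one frame of stack_lo, which would explain why the double fault handler sees the saved %rsp at the bottom of the region.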

To quirck: I guess checking the user stack could also reveal this, but it is hard to read for my untrained eyes...

feryno · **Posted:** Fri Jan 10, 2020 5:25 am

It seems the #PF handler code tried to read something through the kernel GS:, which in your case was the user-mode GS:, so the #PF handler got a bogus pointer from user-mode memory instead of the correct one from kernel-mode memory. The subsequent memory access through this bogus pointer caused an exception inside the #PF handler, and thus the CPU generated #DF.
#DF very likely uses the IST feature (TSS.IST1-IST7; the IDT descriptor for #DF has IST enabled and says which one of the 7 to use), so the CPU always loads a good stack. That allows the #DF handler to run even when the #DF was generated by faulting #PF handler code (e.g. in case of kernel stack corruption, and in your case by an invalid memory access caused by a bogus pointer).
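
As a rough illustration of how the CPU picks the stack when it delivers an exception to a ring 0 handler in 64-bit mode, here is a sketch in C. struct tss64 is a trimmed-down, hypothetical view of the TSS and delivery_stack() is an invented name.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down view of the 64-bit TSS fields that matter here. */
struct tss64 {
    uint64_t rsp0;     /* stack loaded on a ring 3 -> ring 0 transition */
    uint64_t ist[8];   /* ist[1]..ist[7] are usable; index 0 means "no IST" */
};

/* Return the stack pointer the CPU starts pushing the exception frame from. */
static uint64_t delivery_stack(const struct tss64 *tss, unsigned ist_index,
                               unsigned cpl_at_fault, uint64_t current_rsp)
{
    assert(ist_index <= 7);
    if (ist_index != 0)
        return tss->ist[ist_index];   /* IST: unconditional switch to a known-good stack */
    if (cpl_at_fault != 0)
        return tss->rsp0;             /* privilege change: switch to the kernel stack */
    return current_rsp;               /* already ring 0: keep whatever RSP holds */
}
```

The last branch is the one the #PF after a bad-LSTAR syscall takes: no IST, CPL already 0, so the frame goes onto the user stack that %rsp still points to.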

aaronlu · **Posted:** Mon Jan 13, 2020 12:14 am

Yes, the Linux kernel uses GS to access per-CPU data, and the fault handler uses per-CPU data, so additional page faults will occur. But an exception inside the #PF handler will not cause the CPU to generate #DF. Faults will nest, i.e. the CPU will invoke the page fault handler again (as long as it can).

The final #DF occurred when the user stack boundary was crossed, when the CPU could not push CS/IP/Flags onto the stack before invoking the page fault handler.

For a page-fault-induced #DF, my current understanding is that the CPU will only #DF when it cannot invoke the page fault handler while trying to deliver a page fault. If it manages to invoke the page fault handler, any additional page fault just nests.
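
This matches the exception classes the CPU uses to decide on #DF. Below is a sketch of that rule in C, covering only the classic vectors (newer ones are ignored here, which is a simplification). It applies only to a second exception raised while the first is still being delivered, not to one raised by code inside an already-running handler. The names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

enum fault_class { BENIGN, CONTRIBUTORY, PAGE_FAULT };

/* Classify a vector; only the classic vectors are covered in this sketch. */
static enum fault_class classify(unsigned vector)
{
    switch (vector) {
    case 0: case 10: case 11: case 12: case 13:   /* #DE, #TS, #NP, #SS, #GP */
        return CONTRIBUTORY;
    case 14:                                      /* #PF */
        return PAGE_FAULT;
    default:
        return BENIGN;
    }
}

/* True if a second exception raised while delivering the first escalates to #DF. */
static bool escalates_to_df(unsigned first, unsigned second)
{
    enum fault_class a = classify(first), b = classify(second);
    if (a == CONTRIBUTORY)
        return b == CONTRIBUTORY;
    if (a == PAGE_FAULT)
        return b != BENIGN;      /* #PF followed by #PF or a contributory fault */
    return false;                /* benign first exception: handled one after the other */
}
```

So a #PF while pushing the frame for a #PF (vector 14 twice) gives #DF, while a #PF taken by handler code after delivery has completed is simply a new, nested #PF.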

feryno · **Posted:** Wed Jan 15, 2020 12:17 pm

Yes, when the stack is exhausted, the CPU on #PF tries to push SS, RSP, RFLAGS, CS, RIP (after aligning the stack to a 0x10 boundary) onto the stack but cannot complete that, so this second #PF happens while the CPU is trying to deliver the first #PF, hence the #DF. Because Linux uses the IST feature, the CPU loads a good stack for the #DF handler via IST (if the OS did not use IST, a third #PF would be generated = triple fault = CPU shutdown).
Just do not forget to use SWAPGS as the very first instruction in your syscall handler (where MSR LSTAR points), because the syscall instruction is intended to be executed from ring 3 (although it can be executed from ring 0 too, where it is almost useless).
A syscall handler typically performs this sequence:

Code:

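swapgs                  ; switch from the ring 3 GS base to the ring 0 GS base
mov gs:[xxx],rsp        ; save the user stack pointer in per-CPU data
mov rsp,gs:[yyy]        ; load the kernel stack pointer from per-CPU data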

Use the IST feature for at least #DF, #NMI, #MCE so they always load a good stack. At the beginning of an exception handler, when determining whether the CS pushed on the stack is ring 0 or ring 3, do not forget for some critical exception handlers to also check the GS base and, if necessary, load the ring 0 GS base.
