QEMU "-enable-kvm" mishandles #VE?

mallard
Member
Posts: 280
Joined: Tue May 13, 2014 3:02 am
Location: Private, UK

QEMU "-enable-kvm" mishandles #VE?

Post by mallard »

This is less a request for help, more of an FYI/RFC...

Recently, while looking for ways to speed up execution of my OS in QEMU, especially on my slower computers, I came across the "-enable-kvm" parameter. This lets QEMU use the KVM virtualisation engine built into the Linux kernel, enabling hardware virtualisation and making it significantly faster (similar in performance to VirtualBox).

However, while doing a little bit of testing with this I was finding that my OS would inevitably double-fault within a minute or so of booting up. Since I have a TSS-based double-fault handler, I was able to investigate the state of the system at the time of the #DF, and I found the following:
  • It always occurred while the idle thread was active.
  • It always occurred while the scheduler was running in response to a PIT IRQ (the idle thread manually yields to check for runnable threads after each interrupt, which never resulted in a #DF).
  • Most oddly: the #DF occurred because of a stack overflow (that causes a #PF, due to the guard page below the stack, which then can't be handled normally for obvious reasons; a stack overflow in kernel mode is by far the most common cause of a #DF in my OS) that occurred when the kernel attempted to handle interrupt 20, a "virtualization exception" (#VE in Intel's documentation).

The immediate cause is fairly obvious; the idle thread has a very small stack space allocated (only 1KB) and handling two interrupts simultaneously (the PIT IRQ and #VE) takes at least 600 bytes, so if the scheduler uses more than about 400 bytes, a stack overflow will occur, as it did.

This doesn't happen on other virtualizers/emulators or on real hardware because the scheduler doesn't usually trigger a CPU exception (if it did, that would probably indicate something very wrong). The fact that #VE is delivered to the guest OS seems to be a bug/issue in QEMU/KVM; #VE is not useful to a guest that isn't expecting it. My research indicates that it's supposed to be something like a "page fault" for virtualization, raised when the guest's page tables map memory correctly but the VM host hasn't yet provided actual memory to back that mapping. Since KVM supports "nested virtualization" (i.e. KVM can run inside a KVM guest), there will be cases where it's correct to send #VE to the guest, which is probably why it happens here. But since my OS doesn't use hardware virtualisation (surely there's a way for QEMU/KVM to detect that?), it has no need to be sent #VE.

The "fix" that I've applied is to increase the size of the idle thread's stack (1KB is probably a bit tight even for normal situations), which does solve the problem, but I'm concerned that this issue exists; surely some OSs won't be at all tolerant of spurious CPU exceptions and having them happen unexpectedly can cause other issues as I've found...

Does this analysis make sense? Should this be considered a bug in QEMU/KVM (I'm not sure which is responsible for this; it appears that the "-enable-kvm" switch is not very commonly used)?