OSDev.org

The Place to Start for Operating System Developers
 Post subject: Re: Clarify how x86 interrupts work
PostPosted: Thu Jun 22, 2017 6:36 pm 
Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

LtG wrote:
I liked the idea of segmentation (more granularity) but never really benchmarked it, and since AMD64 dropped it I guess I don't have much reason to care about it at this point, though I may check its performance if I ever get around to creating an x86_32 kernel.


I've never really liked segmentation; but I do still like the idea of setting "CPL=3 CS limit" during task switches in protected mode (especially for old CPUs that don't support "no execute" page protection), which is something that would work well for me because I don't have shared libraries.
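
In case it isn't obvious what I mean; a rough sketch (untested; the GDT pointer and slot index are placeholders, not code from my kernel) of patching the "CPL=3 code" descriptor's limit on each task switch:

Code:
#include <stdint.h>

/* Protected mode, for CPUs without "no execute": shrink the user code
   segment's limit to cover only the current task's code. */
static void set_user_cs_limit(uint64_t *gdt, int slot, uint32_t limit_in_pages)
{
    uint64_t d = gdt[slot];

    d &= ~0x000F00000000FFFFULL;                          /* clear limit 19:16 and 15:0 */
    d |= (uint64_t)(limit_in_pages & 0xFFFF);             /* limit 15:0 */
    d |= (uint64_t)((limit_in_pages >> 16) & 0xF) << 48;  /* limit 19:16 */
    d |= 1ULL << 55;                                      /* G=1: limit counted in 4 KiB pages */

    gdt[slot] = d;

    /* The new limit only takes effect when CS is reloaded from the GDT,
       which happens anyway when the task switch returns to CPL=3. */
}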

LtG wrote:
Brendan wrote:
You can take "another NMI interrupts NMI handler somewhere" into consideration and minimise the chance that it can happen (e.g. become immune to the problem if the second NMI interrupts after you've managed to execute 10 or more instructions). The question is how much "ugly work-around" you're willing to have, and whether or not it's possible to be 100% immune (if it's possible for a second NMI to occur before you've executed the first instruction).

I don't think adjusting NMI handler IST entry is that ugly, and at least it's quite simple and straightforward, and either minimizes the risk to one instruction "timing" or completely gets rid of the issue.


For my purposes; I'd rather just not use IST (and not support SYSCALL either). For a micro-kernel (where there shouldn't be a large number of kernel API functions) it's probably just as fast to use "call gate per kernel API function" (and avoid the likely branch misprediction for "call [table+rax*8]" dispatch); and most CPUs (all Intel) support SYSENTER anyway.
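
For reference; a rough sketch (in C, untested; the descriptor layout is from the manual, the helper and field names are mine) of what one of those per-function call gates would look like:

Code:
#include <stdint.h>

/* 16-byte call gate descriptor, IA-32e mode. */
struct call_gate64 {
    uint16_t offset_low;    /* target RIP bits 15:0 */
    uint16_t selector;      /* kernel code segment selector */
    uint8_t  reserved0;     /* no IST/param count field for call gates */
    uint8_t  type_attr;     /* P=1, DPL=3, type 0xC (64-bit call gate) */
    uint16_t offset_mid;    /* target RIP bits 31:16 */
    uint32_t offset_high;   /* target RIP bits 63:32 */
    uint32_t reserved1;     /* must be zero */
} __attribute__((packed));

/* One gate per kernel API function, installed in a 16-byte GDT slot. */
static void set_call_gate(struct call_gate64 *slot, void (*target)(void),
                          uint16_t kernel_cs)
{
    uint64_t rip = (uint64_t)target;

    slot->offset_low  = rip & 0xFFFF;
    slot->selector    = kernel_cs;
    slot->reserved0   = 0;
    slot->type_attr   = 0xEC;           /* present, DPL=3, type=0xC */
    slot->offset_mid  = (rip >> 16) & 0xFFFF;
    slot->offset_high = rip >> 32;
    slot->reserved1   = 0;
}

User code then reaches each kernel function with a far call through that gate's selector (the offset in the far pointer is ignored), instead of going through one SYSCALL entry point plus an indirect dispatch table.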

Note that SYSCALL also causes awkwardness for machine check exception handler, and if you support "machine check error recovery" (and don't just halt everything when a machine check occurs) you can't (easily) use IST for machine check exceptions either. The problem here is that if the MCIP flag is set a second machine check exception causes a triple fault (which destroys "machine check error recovery"), and if you clear the MCIP flag as soon as possible (so that a second machine check exception won't cause triple fault) then the machine check exception handler needs to be re-entrant.
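
To show the ordering I mean (sketch only; the rdmsr/wrmsr wrappers and the caller-provided buffer are assumptions, and all of the actual recovery logic is left out): pull everything out of the MSRs first, then clear MCIP, and treat everything after that point as something that may be re-entered by a second machine check:

Code:
#include <stdint.h>

#define IA32_MCG_CAP     0x179
#define IA32_MCG_STATUS  0x17A
#define IA32_MC0_STATUS  0x401          /* MCi_STATUS = 0x401 + i*4 */

static inline uint64_t rdmsr(uint32_t msr)
{
    uint32_t lo, hi;
    __asm__ volatile("rdmsr" : "=a"(lo), "=d"(hi) : "c"(msr));
    return ((uint64_t)hi << 32) | lo;
}

static inline void wrmsr(uint32_t msr, uint64_t val)
{
    __asm__ volatile("wrmsr" :: "c"(msr), "a"((uint32_t)val),
                                "d"((uint32_t)(val >> 32)));
}

/* First part of a #MC handler: copy the banks somewhere safe, then clear
   MCIP (bit 2 of IA32_MCG_STATUS) so a second #MC can't triple fault. */
static void mce_snapshot(uint64_t *bank_status, unsigned *nbanks)
{
    unsigned count = rdmsr(IA32_MCG_CAP) & 0xFF;

    for (unsigned i = 0; i < count; i++)
        bank_status[i] = rdmsr(IA32_MC0_STATUS + i * 4);
    *nbanks = count;

    wrmsr(IA32_MCG_STATUS, 0);          /* clears MCIP (and RIPV/EIPV) */
}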

LtG wrote:
Brendan wrote:
It's a messy and infrequent corner-case with no easy solution; and a technical writer working for one of these companies wrote "something" (that another technical writer at the other company probably quietly copied). There's no real guarantee that it's correct, and it's certainly not as simple as "just use IST and don't worry!".

I can't remember AMD even mentioning the nested NMI/NMI-SMI-NMI issue at all, and if Intel decided to add a section specifically mentioning this and also says that OS should prepare for it, that implies it can be dealt with. I can't really think of anything except IST to avoid it (or not using SYSCALL).


I'd just assume most OSs reduce the risk but aren't immune (and that Intel's advice only covers "risk reduction" and not immunity).

LtG wrote:
Brendan wrote:
Intel manual 34.3.1 Entering SMM wrote:
An SMI has a greater priority than debug exceptions and external interrupts. Thus, if an NMI, maskable hardware interrupt, or a debug exception occurs at an instruction boundary along with an SMI, only the SMI is handled. Subsequent SMI requests are not acknowledged while the processor is in SMM. The first SMI interrupt request that occurs while the processor is in SMM (that is, after SMM has been acknowledged to external hardware) is latched and serviced when the processor exits SMM with the RSM instruction. The processor will latch only one SMI while in SMM.

To clarify the situation; I'd expect that this is entirely possible (and can't find anything in Intel's manual to convincingly prove or disprove it):
  • An NMI occurs
  • CPU begins starting the NMI handler (IDT lookup, checks, etc); and while this is happening an SMI is received causing "pending SMI"
  • CPU finishes starting the NMI handler (RIP pointing to first instruction of NMI handler) and commits changes to visible state
  • Immediately after "commit changes to visible state" CPU checks for pending interrupts, sees the pending SMI, and starts handling it. The "old RIP value" that the CPU stores in the SMM state save map is the address of the first instruction in the NMI handler.
  • The CPU executes the firmware's SMM code; and while that is happening a second NMI is received causing "pending NMI" (because NMIs are blocked in SMM)
  • The firmware's SMM code does "RSM"; the CPU executes this instruction (including loading "old RIP value" from SMM state save map that still points to the first instruction of the NMI handler) and commits changes to visible state
  • Immediately after "commit changes to visible state" CPU checks for pending interrupts, sees the pending NMI, and starts handling it.
  • CPU trashes the first NMI handler's stack when starting the second NMI handler.

Earlier I mentioned the two possibilities I could think of for instruction boundary to allow for the issue, one where the instruction boundary is the entire duration of "invoking" NMI and the other where each uop (or something smaller than an actual instruction) is considered to be an instruction thus creating extra instruction boundaries between two "real" instructions.

The Intel quote above says that if NMI and SMI occur at the same instruction boundary then SMI wins and NMI gets forgotten (though if still present after SMI then it would get taken care of). I think that means that the "prolonged instruction boundary" case can be dismissed.


I very much doubt Intel means "only the SMI is handled (and the NMI is forgotten forever)" and would assume "only the SMI is handled (at this instruction boundary)". Essentially; NMI isn't discarded but remains "pending".

LtG wrote:
I can't point to anything in the manual that explicitly says that the "uop boundaries = instruction boundaries", which would imply that the sequence you made up is maybe possible. I would consider it a bit pathological of Intel/AMD however. The "invoke NMI" has already been committed to, changing to SMI midway (before first NMI handler instruction) doesn't seem reasonable to me.


You definitely won't find "uop boundaries = instruction boundaries". If an instruction is split into 5 uops there will only be one instruction boundary when the last uop reaches retirement and the entire instruction (all changes from all uops) are committed to visible state together.

LtG wrote:
Also worth noting, AFAIK all of this applies only if the SMI handler _intentionally_ enables NMI's, which was missing in your "sequence"..


No, my sequence is for "SMI does not intentionally enable/unmask NMI". Please note that Intel like to use the word "disabled" when they actually mean "held pending" (e.g. the normal IRQ enable/disable flag, which does not enable/disable IRQs and only causes them to be "held pending while IF is clear").

LtG wrote:
But do SMI handlers do that? And if they do, aren't they prone to race conditions?


I'd hope most SMI handlers don't intentionally enable NMI (in the same way that I hope software is never released with critical vulnerabilities, CPUs don't have errata, and unicorns actually exist ;) ).

LtG wrote:
How can the SMI change IDT back to OS IDT (I assume they change IDT, or does the CPU restore it from the state it's already saved) and RSM without allowing NMI in between (assuming they enabled NMI)?


I think (not entirely sure); that in 16-bit code with "32-bit operand size overrides" the SIDT and LIDT instructions only affect the lowest 32 bits of the IDT base address; and that (if the OS is 64-bit) SMM code can safely use SIDT to store the lowest 32 bits of the OS's IDT base, then do whatever it likes, then use LIDT to restore the lowest 32 bits of the OS's IDT base.

LtG wrote:
Brendan wrote:
Due to risk of malware (rootkits, etc); it's "intentionally almost impossible" to modify firmware's SMM code on almost all motherboards. I'm not sure, but it might be possible to use hardware virtualization to bypass this (at least I vaguely remember something about SMM virtualization in AMD's manual), but if it is possible I'm also not sure if it'd influence results.

I'm not sure how often it applies, but there was a "hack" that allowed access to SMM, and I think it was pretty simple and mostly universal. Might be fixed in newer systems though.

IIRC the idea was:
- LAPIC "hijacks" memory references that are directed at the LAPIC
- Relocating LAPIC memory to overlay SMM memory
- SMI occurs (generate it or wait for it)
- SMI handler code references SMM (data) memory, but that is now hijacked by LAPIC and writes to it are discarded and reads from it (for the most part) return zero
- SMI handler jumps to wrong location where you've planted your code
- You now have ring -2 access

As I remember the paper, they suggested that all SMI handlers begin the same way, with a couple of variations, so relocating the LAPIC memory works quite well..

I can try to find the paper if you like...


I think I found it here. As far as vulnerabilities go, I'd rate this one as "extremely plausible in practice". :(

LtG wrote:
Maybe I should get an Intel dev account and ask at their forums, which also led me to this, though I'm not 100% sure if I trust the answer:
https://software.intel.com/en-us/forums/watercooler-catchall/topic/305672


I don't trust that answer at all - it completely ignores the part of the manual that the original poster quoted in their question.


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject: Re: Clarify how x86 interrupts work
PostPosted: Thu Jun 22, 2017 7:00 pm 
Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

LtG wrote:
Brendan wrote:
NMI was a horrible thing from the start (the "iret unblocks NMI" is a disaster - why not have a flag in EFLAGS?), then became a badly documented horrible thing (I'm still not entirely sure which parts of which chipsets might trigger NMI), then became a buggy badly documented horrible thing (when SMM was introduced), then became a potentially extra buggy badly documented horrible thing (when Pentium was released with what I'd consider errata), then it hasn't improved since.


What's the Pentium thing you are referring to?


That would be this:
Intel manual wrote:
Also, for the Pentium processor, exceptions that invoke a trap or fault handler will enable NMI interrupts from inside of SMM. This behavior is implementation specific for the Pentium processor and is not part of the IA-32 architecture.


LtG wrote:
In your list SMM comes before it, and at least the issue we are discussing is SYSCALL related, so what's the one you are referring to?


SMM dates back to 80486 (and was backported to some later 80386 CPUs intended for embedded systems). SYSCALL came later and would've been fine if SMM didn't cause nested NMI problems.

LtG wrote:
The way I see NMI it's just another priority of interrupts given that NMI's can be masked by external hardware. So the "correct" way of adding NMI's would have been to make the interrupt controller "better" from the beginning and allow proper interrupt prioritization so that instead of enabling/disabling all interrupts you could use more granularity and for example never disable the highest priority "NMI". I guess interfacing with the PIC was too slow, so they had to resort to two interrupt priority levels of which the "normal" level is further prioritized by the PICs.


Originally NMI was used for things like "RAM parity error" (it wasn't something that should ever be masked or discarded). Later that type of thing got shifted to "Machine Check Architecture" (a different exception that can't be masked).

LtG wrote:
One other thing on the actual NMI-SMI-NMI topic: for some instructions the CPU blocks all interrupts (IIRC SMI's included), such as when switching stacks.


From Intel's manual:
    "To prevent this situation, the processor inhibits interrupts, debug exceptions, and single-step trap exceptions after either a MOV to SS instruction or a POP to SS instruction, until the instruction boundary following the next instruction is reached. All other faults may still be generated."

From this; I'd assume that neither NMI nor SMI nor machine check is disabled, so now I'm worried about potential "MOV SS then SMI/NMI/machine check then IRQ" problems.


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject: Re: Clarify how x86 interrupts work
PostPosted: Thu Jun 22, 2017 7:13 pm 
Joined: Thu Aug 13, 2015 4:57 pm
Posts: 384
Brendan wrote:
LtG wrote:
I liked the idea of segmentation (more granularity) but never really benchmarked it, and since AMD64 dropped it I guess I don't have much reason to care about it at this point, though I may check its performance if I ever get around to creating an x86_32 kernel.


I've never really liked segmentation; but I do still like the idea of setting "CPL=3 CS limit" during task switches in protected mode (especially for old CPUs that don't support "no execute" page protection), which is something that would work well for me because I don't have shared libraries.

I was contemplating not having shared libraries as such, but rather, as an optimization for non-critical software (i.e. games, potentially parts of browsers, etc), allowing services to be mapped into the same address space. So in that way there are no real shared libs, but any process/service could be made into one for some client processes for performance.

I'd actually need to get the OS pretty much completed to see if there's even any point or if the performance gain would be negligible.

Maybe if I stopped thinking about x86_32 features I might get more coding done...

Brendan wrote:
For my purposes; I'd rather just not use IST (and not support SYSCALL either). For a micro-kernel (where there shouldn't be a large number of kernel API functions) it's probably just as fast to use "call gate per kernel API function" (and avoid the likely branch misprediction for "call [table+rax*8]" dispatch); and most CPUs (all Intel) support SYSENTER anyway.

I'll likely (hopefully) create a few different syscall dispatch methods, including call gates, to see the performance difference. I have no idea of the performance impact of call gates. Note, I'm not planning on a similar async syscall interface to yours for various reasons, so for me the syscall overhead is likely a bigger deal..
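
For comparison, the table-based dispatch I'd be measuring against call gates is basically just this (sketch; the two functions are placeholders, not a real API):

Code:
#include <stdint.h>

typedef int64_t (*syscall_fn)(uint64_t, uint64_t, uint64_t);

/* Placeholder kernel API functions, just so the table has entries. */
static int64_t sys_send(uint64_t a, uint64_t b, uint64_t c) { (void)a; (void)b; (void)c; return 0; }
static int64_t sys_recv(uint64_t a, uint64_t b, uint64_t c) { (void)a; (void)b; (void)c; return 0; }

static const syscall_fn syscall_table[] = { sys_send, sys_recv };

/* Called from the SYSCALL/SYSENTER entry stub with the call number in 'nr';
   the indirect call below is the "call [table+rax*8]" that may mispredict. */
int64_t syscall_dispatch(uint64_t nr, uint64_t a0, uint64_t a1, uint64_t a2)
{
    if (nr >= sizeof(syscall_table) / sizeof(syscall_table[0]))
        return -1;
    return syscall_table[nr](a0, a1, a2);
}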

Do any AMD CPUs support SYSENTER in long mode? If not then that leaves out a pretty big part.

Brendan wrote:
Note that SYSCALL also causes awkwardness for machine check exception handler, and if you support "machine check error recovery" (and don't just halt everything when a machine check occurs) you can't (easily) use IST for machine check exceptions either.

Haven't really thought about MCE, but I'm assuming that there's a way to handle it..

As an aside, I keep forgetting that with NMI I'm not sure if this whole thing even matters. Is there any reasonable way to continue operation after an NMI instead of BSOD/reboot?

Even if a watchdog is used then it would only trigger (on my OS) because of bugs/issues (in the kernel/OS), so there's no recovery.

Brendan wrote:
You definitely won't find "uop boundaries = instruction boundaries". If an instruction is split into 5 uops there will only be one instruction boundary when the last uop reaches retirement and the entire instruction (all changes from all uops) are committed to visible state together.

Then again, the issue (using the IST-swap trick) shouldn't exist, unless you consider NMI itself to be an instruction, which I've seen no support for.

Brendan wrote:
LtG wrote:
Also worth noting, AFAIK all of this applies only if the SMI handler _intentionally_ enables NMI's, which was missing in your "sequence"..


No, my sequence is for "SMI does not intentionally enable/unmask NMI". Please note that Intel like to use the word "disabled" when they actually mean "held pending" (e.g. the normal IRQ enable/disable flag, which does not enable/disable IRQs and only causes them to be "held pending while IF is clear").

According to the manuals, if SMI is triggered during NMI then all further NMI's remain "held pending", during and after SMI (after RSM). The second NMI won't be triggered at RSM, not until the first IRET. So unless the SMI explicitly re-enables NMI (by using IRET), NMI's will remain "disabled"/masked until the initial NMI is handled. So the "NMIs blocked" state is preserved through an SMI handler.

Brendan wrote:
LtG wrote:
How can the SMI change IDT back to OS IDT (I assume they change IDT, or does the CPU restore it from the state it's already saved) and RSM without allowing NMI in between (assuming they enabled NMI)?


I think (not entirely sure); that in 16-bit code with "32-bit operand size overrides" the SIDT and LIDT instructions only affect the lowest 32 bits of the IDT base address; and that (if the OS is 64-bit) SMM code can safely use SIDT to store the lowest 32 bits of the OS's IDT base, then do whatever it likes, then use LIDT to restore the lowest 32 bits of the OS's IDT base.

Ah, quite possible. However that would still leave a race condition where the OS's NMI handler could potentially be called while still in SMM, which would give the NMI handler ring -2 access. And I'm guessing that it could even be exploited relatively easily. Assuming the SMI handler does LIDT to restore the OS's IDT and then does RSM, if an NMI is triggered between those two -> the OS's NMI handler is called.

Brendan wrote:

I think I found it here. As far as vulnerabilities go, I'd rate this one as "extremely plausible in practice". :(

That's the one.. Interesting read, and only two pages =)

I guess that's what happens with extremely complex systems, which keep getting added to for 30 years running..

Brendan wrote:
LtG wrote:
Maybe I should get an Intel dev account and ask at their forums, which also led me to this, though I'm not 100% sure if I trust the answer:
https://software.intel.com/en-us/forums/watercooler-catchall/topic/305672


I don't trust that answer at all - it completely ignores the part of the manual that the original poster quoted in their question.

Same here, at the very least they should have given some info as to why the other part doesn't apply, instead of referring to the NMI nesting earlier in the manual. But then again, I don't understand why the manual doesn't point from the OS dev NMI nesting part to the SMM NMI nesting part, given that OS devs need to worry about it. It should be obvious that an OS dev wouldn't read the SMM chapter since it doesn't apply to them, yet they've dropped that little nugget there with no reference from the OS dev part.

However the OP had tested it and was not able to reproduce the documented NMI nesting..


 Post subject: Re: Clarify how x86 interrupts work
PostPosted: Thu Jun 22, 2017 7:49 pm 
Joined: Thu Aug 13, 2015 4:57 pm
Posts: 384
Brendan wrote:
That would be this:
Intel manual wrote:
Also, for the Pentium processor, exceptions that invoke a trap or fault handler will enable NMI interrupts from inside of SMM. This behavior is implementation specific for the Pentium processor and is not part of the IA-32 architecture.


Ah yes, forgot about that for a moment =)

Brendan wrote:
From Intel's manual:
    "To prevent this situation, the processor inhibits interrupts, debug exceptions, and single-step trap exceptions after either a MOV to SS instruction or a POP to SS instruction, until the instruction boundary following the next instruction is reached. All other faults may still be generated."

From this; I'd assume that neither NMI nor SMI nor machine check is disabled, so now I'm worried about potential "MOV SS then SMI/NMI/machine check then IRQ" problems.

But it says "other _faults_", #MCE is an abort, not a fault, and both NMI and SMI are interrupts, are they not?


 Post subject: Re: Clarify how x86 interrupts work
PostPosted: Thu Jun 22, 2017 11:45 pm 
Joined: Thu Jul 05, 2012 5:12 am
Posts: 923
Location: Finland
I would hate to bring this up again but the stack trick should completely solve these problems. Please note that my first "implementation" is not important, but the idea behind it is.

  • Absolutely no unsafe instruction windows.
  • Stack does not overflow. System is immune to the "SMI/NMI" storm, e.g. a true bug walking on electronics. If this or some other physical interference lasts for a few seconds, there may be "SMIs/NMIs" literally triggered one after another with no gaps.

For the latter, what could "a few seconds" mean in the CPU world? 4 KiB stacks... are we serious? :D

_________________
Undefined behavior since 2012


 Post subject: Re: Clarify how x86 interrupts work
PostPosted: Fri Jun 23, 2017 12:09 am 
Joined: Thu Aug 13, 2015 4:57 pm
Posts: 384
Antti wrote:
I would hate to bring this up again but the stack trick should completely solve these problems. Please note that my first "implementation" is not important, but the idea behind it is.

  • Absolutely no unsafe instruction windows.
  • Stack does not overflow. System is immune to the "SMI/NMI" storm, e.g. a true bug walking on electronics. If this or some other physical interference lasts for a few seconds, there may be "SMIs/NMIs" literally triggered one after another with no gaps.

For the latter, what could "a few seconds" mean in the CPU world? 4 KiB stacks... are we serious? :D

I don't think I fully understand your solution, maybe the solution was explained in earlier posts? Is it supposed to utilize IST's or what is the stack set to upon entry to the NMI handler? I'm assuming the IDT directs NMI's to the NmiHandler, not to GeneralNmiHandler..?

If it uses IST, then how does it solve the problem Brendan says exists, where the first NMI never executes a single instruction (the CPU sets RIP to point to the handler's first instruction and pushes the return address on the stack) and then the second NMI overwrites the stack with a return address pointing to the first instruction of the NMI handler?


 Post subject: Re: Clarify how x86 interrupts work
PostPosted: Fri Jun 23, 2017 12:19 am 
Joined: Thu Jul 05, 2012 5:12 am
Posts: 923
Location: Finland
LtG wrote:
the first NMI never executes a single instruction


This is the key problem for which I came up with a solution. The stack and the TSS overlap, so the IST entry is modified before a single instruction of the NMI handler executes.

_________________
Undefined behavior since 2012


 Post subject: Re: Clarify how x86 interrupts work
PostPosted: Fri Jun 23, 2017 12:36 am 
Joined: Thu Aug 13, 2015 4:57 pm
Posts: 384
Antti wrote:
LtG wrote:
the first NMI never executes a single instruction


This is the key problem for which I came up with a solution. The stack and the TSS overlap, so the IST entry is modified before a single instruction of the NMI handler executes.

Ah yes, it was explained in some earlier post, I remember it now, thanks.

If possible I'd prefer something less hacky, but if it's the only alternative then it should work... Haven't really thought about every aspect of it, but at least the premise sounded plausible.


 Post subject: Re: Clarify how x86 interrupts work
PostPosted: Fri Jun 23, 2017 5:30 am 
Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

LtG wrote:
Do any AMD CPUs support SYSENTER in long mode? If not then that leaves out a pretty big part.


No - AMD CPUs don't support SYSENTER in 64-bit code (even if the feature flag is set in CPUID).

Intel does something "similar to opposite" - they don't support SYSCALL in 32-bit code (even if the feature flag is set in CPUID).

This stupidity means that you can't rely on the feature flags in CPUID alone, and (at a minimum) need to follow them with a vendor check (e.g. "if feature flag FOO is set && vendor == BAR"). Then there's some errata (one Intel CPU misreports SYSENTER support). I normally have a "CPU identification" thing during boot that sorts out this kind of stuff and generates sanitised/corrected CPU information (including having separate "SYSENTER32, SYSENTER64, SYSCALL32, SYSCALL64" flags, fixing errata, etc) and then (try to) use my sanitised data instead of CPUID after that. Unfortunately there's no "disable CPUID at CPL=3" possibility (like the "disable RDTSC at CPL=3" flag) so I can't use a "trap and emulate" approach to force user-space to use sanitised/corrected information.
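
The relevant part of that "CPU identification" thing would look something like this (sketch only; the struct and its names are just for illustration, and the errata fix-ups are left out):

Code:
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

struct fast_syscall_caps {
    bool sysenter32, sysenter64, syscall32, syscall64;
};

static inline void cpuid(uint32_t leaf, uint32_t *a, uint32_t *b,
                         uint32_t *c, uint32_t *d)
{
    __asm__ volatile("cpuid" : "=a"(*a), "=b"(*b), "=c"(*c), "=d"(*d)
                             : "a"(leaf), "c"(0));
}

static void detect_fast_syscalls(struct fast_syscall_caps *f)
{
    uint32_t a, b, c, d;
    char vendor[13];

    cpuid(0, &a, &b, &c, &d);
    memcpy(vendor + 0, &b, 4);          /* vendor string is EBX, EDX, ECX */
    memcpy(vendor + 4, &d, 4);
    memcpy(vendor + 8, &c, 4);
    vendor[12] = '\0';
    bool is_intel = strcmp(vendor, "GenuineIntel") == 0;
    bool is_amd   = strcmp(vendor, "AuthenticAMD") == 0;

    cpuid(1, &a, &b, &c, &d);
    bool sep = d & (1u << 11);          /* SYSENTER/SYSEXIT flag */

    cpuid(0x80000001, &a, &b, &c, &d);
    bool syscall_flag = d & (1u << 11); /* SYSCALL/SYSRET flag */

    f->sysenter32 = sep;
    f->sysenter64 = sep && is_intel;         /* AMD: no SYSENTER in long mode */
    f->syscall64  = syscall_flag;
    f->syscall32  = syscall_flag && is_amd;  /* Intel: no SYSCALL outside 64-bit code */
    /* ...errata fix-ups for specific models would go here... */
}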

LtG wrote:
Brendan wrote:
Note that SYSCALL also causes awkwardness for machine check exception handler, and if you support "machine check error recovery" (and don't just halt everything when a machine check occurs) you can't (easily) use IST for machine check exceptions either.

Haven't really thought about MCE, but I'm assuming that there's a way to handle it..


That depends on how you want to handle it (e.g. if you care about "second MCE at wrong time causes triple fault with no chance of logging and no chance of recovery").

LtG wrote:
As an aside, I keep forgetting that with NMI I'm not sure if this whole thing even matters. Is there any reasonable way to continue operation after an NMI instead of BSOD/reboot?


Unless you know what caused it, all you can really do is "kernel panic". For my project, a kernel wouldn't use NMI for anything itself, and when an NMI occurs the kernel would ask the motherboard driver to deal with it (hoping that the motherboard driver would have some clue about why it happened and how it should be handled). Of course the motherboard driver is in user-space, so...

Note that even if you can't recover; if you care about reliability then you must also care about making sure that the user/admin can find out what went wrong ("post-mortem"). Crashing before starting "kernel panic" or crashing during "kernel panic" (and failing to give user/admin any information about why the kernel panic happened) isn't a desirable option.

LtG wrote:
Even if a watchdog is used then it would only trigger (on my OS) because of bugs/issues (in the kernel/OS), so there's no recovery.


Do you think you could recover if an NMI was caused by an "NMI button" that the user pressed by accident?

LtG wrote:
Brendan wrote:
You definitely won't find "uop boundaries = instruction boundaries". If an instruction is split into 5 uops there will only be one instruction boundary when the last uop reaches retirement and the entire instruction (all changes from all uops) are committed to visible state together.

Then again, the issue (using the IST-swap trick) shouldn't exist, unless you consider NMI itself to be an instruction, which I've seen no support for.


For fun; try to convince me that in a world where every single thing that matters (the manual, all CPUs, all other hardware, all firmware, etc) is riddled with "things that shouldn't happen", your suspiciously convenient interpretation of "instruction boundary" is a safe assumption.

You only need to look at the various notes for "instructions retired" performance monitoring counters to see that the majority of the CPU (everything after "decode") has no clue what is/isn't an instruction (and that they had to add some kind of special tag to the last micro-op of a "multi-micro-op instruction" to make it work, and that they've got that special tag wrong in various different ways in various CPUs). Once you accept that the majority of the CPU (everything after "decode") has no clue about instructions the entire concept of "instruction boundary" has to be considered an over-simplification at best. "Results committed to visible state" is the only assumption that can really be considered "safe" (if a CPU gets "results committed to visible state" wrong it'd be plagued with issues involving corrupted visible state); but even that might be considered an "overly optimistic" interpretation after you've seen some of the issues various CPUs have had (see note).

Note: One quirk is that for SYSRET (on Intel CPUs only) if the RIP being returned to isn't canonical the CPU generates a GPF with RSP set to the user-space stack. This indicates that deep in the CPU's internals "transactional properties" aren't upheld - it's not necessarily a case of "either changes are committed to visible state or changes are not committed to visible state" but can be a case of "some changes are committed to visible state while other changes are not committed to visible state".
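
The usual guard against that quirk is to never hand SYSRET a non-canonical RIP in the first place; something like this (sketch assuming 48-bit virtual addresses; the two return stubs are placeholders for assembly code, not real functions):

Code:
#include <stdint.h>
#include <stdbool.h>

extern void sysret_to_user(uint64_t rip);   /* fast path: SYSRET */
extern void iretq_to_user(uint64_t rip);    /* slow path: IRETQ */

/* Canonical for 48-bit virtual addresses: bits 63:47 are a sign extension. */
static bool is_canonical(uint64_t addr)
{
    return (uint64_t)((int64_t)(addr << 16) >> 16) == addr;
}

/* IRETQ checks the RIP while still on the kernel stack, instead of
   generating a GPF after the user RSP has already been loaded. */
void return_to_user(uint64_t user_rip)
{
    if (is_canonical(user_rip))
        sysret_to_user(user_rip);
    else
        iretq_to_user(user_rip);
}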

LtG wrote:
Brendan wrote:
LtG wrote:
How can the SMI change IDT back to OS IDT (I assume they change IDT, or does the CPU restore it from the state it's already saved) and RSM without allowing NMI in between (assuming they enabled NMI)?


I think (not entirely sure); that in 16-bit code with "32-bit operand size overrides" the SIDT and LIDT instructions only affect the lowest 32 bits of the IDT base address; and that (if the OS is 64-bit) SMM code can safely use SIDT to store the lowest 32 bits of the OS's IDT base, then do whatever it likes, then use LIDT to restore the lowest 32 bits of the OS's IDT base.

Ah, quite possible. However that would still leave a race condition where the OS's NMI handler could potentially be called while still in SMM, which would give the NMI handler ring -2 access. And I'm guessing that it could even be exploited relatively easily. Assuming the SMI handler does LIDT to restore the OS's IDT and then does RSM, if an NMI is triggered between those two -> the OS's NMI handler is called.


Heh - while swimming several layers deep in "things that happen that wouldn't happen in a perfect world", you've found another layer of things that wouldn't happen in a perfect world (involving race conditions between NMI and LIDT within SMM code). :)

This is progress. It won't be too long before you start seeing things as a huge ball of rusty old shrapnel being held together by duct tape, chewing gum and hacks; that is being pushed forward by thousands of assorted CPU designers, hardware manufacturers and software developers who are all unable to see all the details of the huge ball and are all just trying to avoid being sacked this week. ;)

LtG wrote:
Brendan wrote:
I think I found it here. As far as vulnerabilities go, I'd rate this one as "extremely plausible in practice". :(

That's the one.. Interesting read, and only two pages =)

I guess that's what happens with extremely complex systems, which keep getting added to for 30 years running..


As the huge ball of rusty old shrapnel rolls along, those thousands of people pushing it forward add new vulnerabilities, discover old vulnerabilities, and stick more duct tape and hacks on the ball to cover up old vulnerabilities.

Note that for a "huge ball of new shiny shrapnel" nobody has had a chance to discover vulnerabilities or cover any of them up. It's the complexity (the size of the ball) that is the majority of the problem.

LtG wrote:
Brendan wrote:
LtG wrote:
Maybe I should get an Intel dev account and ask at their forums, which also led me to this, though I'm not 100% sure if I trust the answer:
https://software.intel.com/en-us/forums/watercooler-catchall/topic/305672


I don't trust that answer at all - it completely ignores the part of the manual that the original poster quoted in their question.

Same here, at the very least they should have given some info as to why the other part doesn't apply, instead of referring to the NMI nesting earlier in the manual. But then again, I don't understand why the manual doesn't point from the OS dev NMI nesting part to the SMM NMI nesting part, given that OS devs need to worry about it. It should be obvious that an OS dev wouldn't read the SMM chapter since it doesn't apply to them, yet they've dropped that little nugget there with no reference from the OS dev part.

However the OP had tested it and was not able to reproduce the documented NMI nesting..


The OP crafted a bow and used it to shoot several arrows at the huge ball, and those arrows happened to hit something solid and bounce back. Maybe if the OP waited a little while for the ball to roll forward (e.g. next firmware update for the same computer OP tested) the arrows would've hit something else and not bounced back.


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject: Re: Clarify how x86 interrupts work
PostPosted: Fri Jun 23, 2017 5:44 am 
Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

LtG wrote:
Brendan wrote:
From Intel's manual:
    "To prevent this situation, the processor inhibits interrupts, debug exceptions, and single-step trap exceptions after either a MOV to SS instruction or a POP to SS instruction, until the instruction boundary following the next instruction is reached. All other faults may still be generated."

From this; I'd assume that neither NMI nor SMI nor machine check is disabled, so now I'm worried about potential "MOV SS then SMI/NMI/machine check then IRQ" problems.

But it says "other _faults_", #MCE is an abort, not a fault, and both NMI and SMI are interrupts, are they not?


Here, you'd have to interpret "interrupts" as "interrupts that are affected by IF in EFLAGS", because it doesn't make sense to say "the processor inhibits interrupts, debug exceptions (technically a type of interrupt which has already been mentioned), and single-step trap exceptions (technically a type of interrupt which has already been mentioned)".

More specifically; I'd assume it means "processor behaves as if IF=0 and RF=1 in EFLAGS".


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject: Re: Clarify how x86 interrupts work
PostPosted: Fri Jun 23, 2017 5:58 am 
Joined: Thu May 17, 2007 1:27 pm
Posts: 999
Handling NMI on a non-IST stack is broken. If the NMI-SMI-NMI situation indeed happens often enough your stack will overflow. Even if it only happens once it requires all kernel stacks to be large enough to handle their normal control flow + possibly multiple faults + at least two NMIs. This is not practical. What is worse is that your OS is not even able to detect this and display a nice panic screen that tells the user that there is something wrong. If you handle NMIs on an IST stack, you can at least panic.

Handling MCE on a non-IST stack is even more broken: if you clear MCIP instantly, nested MCEs will just trash the MCE state. MCEs are designed not to be nested. The CPU triple faulting on a nested MCE is a feature that prevents the OS from worrying about situations in which the whole system is broken anyways. If you're unlucky enough, handling MCE on a non-IST stack might just triple fault, e.g. if the memory that contains your stack reports errors. Sure, this might even happen for IST stacks, but it is much more unlikely as there is no single point (aka cache line/bank/DIMM) of failure. Besides, how are you going to handle an MCE in your MCE handler/kernel? If kernel memory is broken you're unable to recover anyways.

IMHO the only sane way to handle the situation is to put NMI on an IST stack, detect nested NMIs that trash your return stack frame (by comparing the stored IP to some "critical NMI entry code" range), and panic if such NMIs happen. If there are rapidly nested NMIs your system is likely to be broken anyways. Why compromise the entire kernel architecture just to recover from this case? I don't think there are systems that repeatedly cause NMI without any sort of acknowledgement, precisely because this would break the x86 platform. The realistic case is not NMI-SMI-NMI but NMI-SMI-(NMI acknowledged by chipset driver)-NMI, which is handled just fine on ISTs.
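
In pseudo-C the detection amounts to something like this (the entry-range symbols, the frame layout and panic() are placeholders, not code from any real kernel):

Code:
#include <stdint.h>

/* What the CPU pushes on the NMI IST stack. */
struct interrupt_frame {
    uint64_t rip, cs, rflags, rsp, ss;
};

/* Linker-script symbols bracketing the critical NMI entry stub. */
extern char nmi_entry_start[], nmi_entry_end[];

void panic(const char *msg);

/* Called from the assembly entry stub with a pointer to the CPU-pushed
   frame. If the interrupted RIP lies inside the entry stub itself, a
   nested NMI has already reused (and trashed) the same IST frame, so
   give up loudly instead of returning to garbage. */
void nmi_handler(struct interrupt_frame *frame)
{
    if (frame->rip >= (uint64_t)nmi_entry_start &&
        frame->rip <  (uint64_t)nmi_entry_end)
        panic("nested NMI clobbered the IST stack frame");

    /* ...normal NMI handling... */
}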

You'll have to accept that there are cases (where the peripheral hardware is broken, even though the CPU is working fine) that will just triple fault your OS without any chance to react.

_________________
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].


 Post subject: Re: Clarify how x86 interrupts work
PostPosted: Fri Jun 23, 2017 6:12 am 
Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

Antti wrote:
I would hate to bring this up again but the stack trick should completely solve these problems. Please note that my first "implementation" is not important, but the idea behind it is.

  • Absolutely no unsafe instruction windows.
  • Stack does not overflow. System is immune to the "SMI/NMI" storm, e.g. a true bug walking on electronics. If this or some other physical interference lasts for a few seconds, there may be "SMIs/NMIs" literally triggered one after another with no gaps.

For the latter, what could "a few seconds" mean in the CPU world? 4 KiB stacks... are we serious? :D


That can/should work, even for the "SMI before NMI handler executes its first instruction" case.

However; you'd need to ensure that (for multi-CPU) different CPUs use different (physical) addresses for most of the instructions (e.g. for "cmp rsp, [Tss_RIP]", "mov rsp, [Tss_RIP]", etc); which means you'd need to use one of the tricks I mentioned (per CPU interrupt handlers and IDT; or per CPU areas). With this addition; it goes a little too far beyond my "too clever in a scary way" threshold.. ;)


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject: Re: Clarify how x86 interrupts work
PostPosted: Fri Jun 23, 2017 6:53 am 
Joined: Thu May 17, 2007 1:27 pm
Posts: 999
Did anyone ever test whether NMI-SMI-NMI can actually happen? The latest Intel SDM contains the wording
Intel SDM, chapter 34.8 wrote:
If NMIs were blocked before the SMI occurred, they are blocked after execution of RSM.

along with the well-known NMI-SMI-NMI warning. I suspect that NMI-SMI-NMI cannot actually happen in hardware (or maybe can only happen if the SMI handler executes IRET?) and the documentation is just plain wrong/misleading in the SDM (or targeted at SMM code writers that have to handle NMI inside SMM). One observation that supports this is that mainstream OSs do not randomly triple fault if you're using USB keyboard emulation and watchdog timers at the same time.

_________________
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].


 Post subject: Re: Clarify how x86 interrupts work
PostPosted: Fri Jun 23, 2017 7:19 am 
Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

Korona wrote:
Handling NMI on a non-IST stack is broken. If the NMI-SMI-NMI situation indeed happens often enough your stack will overflow. Even if it only happens once it requires all kernel stacks to be large enough to handle their normal control flow + possibly multiple faults + at least two NMIs. This is not practical. What is worse is that your OS is not even able to detect this and display a nice panic screen that tells the user that there is something wrong. If you handle NMIs on an IST stack, you can at least panic.


No, that's very wrong.

Using IST is broken (or at least impractical to work-around properly) for the "NMI-SMI-NMI" case. Not using IST is not broken for the "NMI-SMI-NMI" case, and as an extra added bonus you can also get a free "triple fault when no progress is possible due to NMI-SMI-NMI-SMI-NMI-SMI-NMI.... storm" advantage.

Note that (with an NMI coalescing scheme) you probably only need less than 64 bytes of stack space per nested NMI. This means that for the (extremely unlikely, in a "not going to happen for thousands of computers running for thousands of years") NMI-SMI-NMI-SMI-NMI-SMI-NMI.... storm case; with 4 KiB of kernel stack you can handle NMI nested 64 deep.

Korona wrote:
Handling MCE on a non-IST stack is even more broken: if you clear MCIP instantly, nested MCEs will just trash the MCE state. MCEs are designed not to be nested. The CPU triple faulting on a nested MCE is a feature that prevents the OS from worrying about situations in which the whole system is broken anyways.


If the machine check exception handler is capable of recovery in some cases; IST is completely broken because it's impossible to avoid "second MCE trashes first MCE's stack after first MCE cleared MCIP but before first MCE did IRET". Also, for this case you do want to clear MCIP as soon as possible (after pulling information out of MSRs and storing it somewhere safe, and before you bother processing any of it) to reduce the risk of triple fault ruining your ability to recover.

Korona wrote:
If you're unlucky enough, handling MCE on a non-IST stack might just triple fault, e.g. if the memory that contains your stack reports errors. Sure, this might even happen for IST stacks, but it is much more unlikely as there is no single point (aka cache line/bank/DIMM) of failure.


Not using IST means that each nested MCE uses different cache lines/memory for its stack and there is no single point of failure (e.g. if the memory used by the first MCE is bad then it may cause a second MCE, but the second MCE will not use the same bad memory as the first). IST is a single point of failure (if the first MCE uses bad memory, the 2nd, 3rd, 4th, ... will also use the same bad memory).

Korona wrote:
Besides, how are you going to handle an MCE in your MCE handler/kernel? If kernel memory is broken you're unable to recover anyways.


How about; send an IPI to other CPUs (to tell them you're doing an "emergency soft-offline"), then send "INIT IPI" to yourself (to reset CPU and put it into a "wait-for-SIPI" state); then keep the OS running (including cleaning up any mess left behind from "emergency soft-offline") using all the remaining CPUs?

Korona wrote:
IMHO the only sane way to handle the situation is to put NMI on an IST stack, detect nested NMIs that trash your return stack frame (by comparing the stored IP to some "critical NMI entry code" range), and panic if such NMIs happen. If there are rapidly nested NMIs your system is likely to be broken anyways. Why compromise the entire kernel architecture just to recover from this case? I don't think there are systems that repeatedly cause NMI without any sort of acknowledgement, precisely because this would break the x86 platform. The realistic case is not NMI-SMI-NMI but NMI-SMI-(NMI acknowledged by chipset driver)-NMI, which is handled just fine on ISTs.


Yes; if you don't care about minimising the risk of preventable failures, using IST is an easier option. For some cases (e.g. games console?) I'd even consider leaving the NMI, machine check, and double fault as "not present" in the IDT so it all just goes straight to triple fault.

Korona wrote:
You'll have to accept that there are cases (where the peripheral hardware is broken, even though the CPU is working fine) that will just triple fault your OS without any chance to react.


No, I don't need to accept that. What I do need to accept is that for a "peer to peer distributed" OS like mine, the failure of any one computer can affect many computers; and minimising the risk of failures as much as possible, and recovering from failures as much as possible; is a necessity.


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject: Re: Clarify how x86 interrupts work
PostPosted: Fri Jun 23, 2017 7:27 am 
Joined: Thu May 17, 2007 1:27 pm
Posts: 999
Brendan wrote:
Using IST is broken (or at least impractical to work-around properly) for the "NMI-SMI-NMI" case. Not using IST is not broken for the "NMI-SMI-NMI" case, and as an extra added bonus you can also get a free "triple fault when no progress is possible due to NMI-SMI-NMI-SMI-NMI-SMI-NMI.... storm" advantage.

As I said, I'm not sure if this can actually happen (my reading of the SDM is that it cannot, at least for sane firmware that doesn't execute an IRET inside the SMI handler to unblock NMIs before RSM), and even if it can, it is astronomically rare and the IST solution can just panic in this case.

Brendan wrote:
If the machine check exception handler is capable of recovery in some cases; IST is completely broken because it's impossible to avoid "second MCE trashes first MCE's stack after first MCE cleared MCIP but before first MCE did IRET". Also, for this case you do want to clear MCIP as soon as possible (after pulling information out of MSRs and storing it somewhere safe, and before you bother processing any of it) to reduce the risk of triple fault ruining your ability to recover.

Just self-IPI and clear MCIP inside the IPI. Besides, if there is an MCE before the IRET your kernel memory is doomed anyways.

Brendan wrote:
How about; send an IPI to other CPUs (to tell them you're doing an "emergency soft-offline"), then send "INIT IPI" to yourself (to reset CPU and put it into a "wait-for-SIPI" state); then keep the OS running (including cleaning up any mess left behind from "emergency soft-offline") using all the remaining CPUs?

That won't work. If your MCE happens in kernel space (and all nested MCEs that we're talking about here happen in kernel space) those other CPUs will just MCE too. The main purpose of MCE is reporting broken RAM. The fix is to disable that RAM. You cannot reliably disable kernel RAM. Yes, MCE can also report SERR and similar errors but those are even more critical and you cannot recover from them.

Brendan wrote:
No, I don't need to accept that. What I do need to accept is that for a "peer to peer distributed" OS like mine, the failure of any one computer can affect many computers; and minimising the risk of failures as much as possible, and recovering from failures as much as possible; is a necessity.

If you're writing your OS with a distributed environment in mind you should be even better off: failure of a single machine should easily be recovered from. This recovery cannot (in case of a kernel space MCE) save a single machine but should instead rely on the other machines replicating the failing machine's tasks.

_________________
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].

