Hi,
simeonz wrote:
Haven't posted for a while, but will indulge in blurting out for some reason.
Brendan wrote:
Any thoughts/ideas or suggested improvements appreciated!
With large pages, or with a predictable physical allocator sequence, an application process could probe different page rows in RAM to determine which RAM device was used during the last API call.
Also, I am not sure how easily it is exploitable, but it would not be easy to hide the I/O latency that can be used to infer whether two objects the kernel had to access during a couple of preceding API calls reside in the same storage unit (so one of them hit the cache) or in different locations.
For "row select" timing in the RAM chips themselves, it'd be ruined by CPU fetching the first instruction from user-space. Doing WBINVD (invalidating all the caches) between kernel API calls would make it impossible to infer anything from preceding kernel API calls. The timing mitigation/s I mentioned should (I hope) cover the problem of using 2 kernel API functions at the same time (where one is used to infer something about the other).
simeonz wrote:
Generally, I assume you can get into trouble with all devices, as long as they or their drivers optimize requests using a shared discipline, or their latency is influenced by the history of their operation. The whole problem is: as long as optimizations (in the CPU, a device driver, etc.) correlate performance variations with privileged information, and operation completion is delivered without deferment, and the application has access to timing information, something, no matter how small, does leak out.
For device drivers; a micro-kernel using the IOMMU (to restrict what devices can/can't access) goes a long way to prevent device drivers from using devices to figure out what the kernel did/didn't fetch into cache. Assuming that the kernel handles timers itself (e.g. and doesn't have a "HPET driver" in user-space); I'm not sure if a device driver can use a device to get better time measurements (e.g. to bypass the timing mitigation/s I mentioned).
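As a rough illustration of the IOMMU idea (all the iommu_* names below are invented, but they mirror what Intel VT-d / AMD-Vi actually let you express): each device gets its own translation domain, and any DMA to something that wasn't explicitly mapped just faults:

Code:

/* Sketch: per-device IOMMU domain, so a driver's device can only DMA
 * to buffers the kernel explicitly granted. All names are invented. */
#include <stdint.h>
#include <stddef.h>

#define IOMMU_READ  1u
#define IOMMU_WRITE 2u

struct iommu_domain;  /* opaque: per-device DMA page tables */

extern struct iommu_domain *iommu_domain_create(void);
extern int iommu_attach(struct iommu_domain *d, int bus, int dev, int fn);
extern int iommu_map(struct iommu_domain *d, uint64_t dma_addr,
                     uint64_t phys_addr, size_t len, unsigned flags);

/* When a user-space driver asks for a DMA buffer, map only that buffer;
 * the device can't read/write (or cache-probe) anything else in RAM. */
int grant_dma_buffer(struct iommu_domain *d, uint64_t phys, size_t len)
{
    return iommu_map(d, /*dma_addr=*/phys, phys, len,
                     IOMMU_READ | IOMMU_WRITE);
}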
simeonz wrote:
With micro-kernels, since synchronous APIs are less intrinsic to the programming model, if the client has some kind of completion event queue on which to block, it could be intentionally reordered. And if it has only one event, it could be randomly stalled, because performance will probably not be impacted so strongly. Also, timing data may be restricted, provided as an opaque object to unprivileged processes, that could then be passed to other privileged services, such as visualization and timer APIs.
Yes.
simeonz wrote:
The question is how to facilitate legitimate uses, such as frame timing, while prohibiting the process from inferring the encapsulated time by performing an activity with a known fixed duration.
Most legitimate uses don't need nanosecond precision - e.g. frame timing can be done with timing that's many orders of magnitude worse (microsecond precision). Note that if you add too much "random jitter" to timing then an attacker will just use other approaches (e.g. a different CPU doing "lock inc [shared_counter]"), so there's no real benefit from having more than several hundred cycles of "random jitter", which means that you'd still be providing "tenths of microseconds" for legitimate uses.
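For reference, that "other approach" is trivial to build - here's a minimal sketch of the shared-counter clock, with plain C11 atomics standing in for the "lock inc [shared_counter]" loop:

Code:

/* Sketch: one thread increments a shared counter in a tight loop, the
 * other reads it as a timestamp. Resolution is a few cycles per tick,
 * so fuzzing the official timers doesn't help much. Build: -pthread */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_ulong shared_counter;

static void *ticker(void *arg)
{
    (void)arg;
    for (;;)   /* the C equivalent of "lock inc [shared_counter]" */
        atomic_fetch_add_explicit(&shared_counter, 1,
                                  memory_order_relaxed);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, ticker, NULL);

    unsigned long start = atomic_load(&shared_counter);
    /* ...the operation being timed would go here... */
    unsigned long end = atomic_load(&shared_counter);

    printf("elapsed: %lu ticks\n", end - start);
    return 0;   /* process exit also kills the ticker thread */
}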
simeonz wrote:
Which opens the general question: what restrictions can be made on a system service without surrendering its primary function? Vaguely reminds me of the sci-fi concept of homomorphic encryption - information to process, but not to inspect.
Edit: After some consideration, I realized that malicious applications could try to use one core to time another, relative to a fixed activity. Also, applications could send execution traces to a server which is responsible for timing. If the timing is inaccurate, with enough volume it can be averaged with high confidence to sufficient precision. So, I am not sure what the solution is, generally speaking. Then again, I am probably drifting off on a tangent here, as this thread may be aiming at much more specific issues.
Edit 2: On the other hand, such deliberate actions imply hostile code being executed, not an initial exploit, so it depends on the assumptions for the software ecosystem and the progress level of the attack.
You'd still need to either make a small number of "accurate enough" measurements (which would be ruined by random delays before returning from kernel to user-space) or make a large number of measurements (which the kernel could hopefully detect). If you can't do either of these things then you can't gather the data to send to a remote server to analyse.
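Roughly, the two countermeasures together would look something like this (rand32(), delay_cycles(), terminate_process() and read_clock() are assumed kernel helpers, not a real API):

Code:

/* Sketch: random jitter before every kernel exit ruins single accurate
 * measurements; a per-process counter of time queries catches the
 * "average millions of samples" approach. All helper names assumed. */
#include <stdint.h>

#define JITTER_MASK       0xFFu     /* up to ~255 cycles of noise */
#define TIME_QUERY_LIMIT  100000u   /* per accounting window */

struct process { uint64_t time_queries; };

extern uint32_t rand32(void);            /* kernel PRNG */
extern void delay_cycles(uint32_t n);    /* short busy-wait */
extern void terminate_process(struct process *p);
extern uint64_t read_clock(void);

void return_to_user(void)
{
    delay_cycles(rand32() & JITTER_MASK);   /* random delay */
}

uint64_t sys_get_time(struct process *p)
{
    if (++p->time_queries > TIME_QUERY_LIMIT)
        terminate_process(p);   /* too many measurements = hostile */
    return read_clock();
}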
Note that network access would be disabled by default (a process can't send/receive packets unless the admin has explicitly enabled network access for that executable).
simeonz wrote:
Solar wrote:
Anyway... the crux of these side-channel attacks is that the activities undergone by exploiting code cannot be distinguished from perfectly benevolent usage with any degree of certainty
I agree with this. Still, the question is whether you can limit access to the "types" of system resources and activities that are essential to the software's operation. Even if the restrictions are self-imposed by the application, or manually imposed by a security administrator.
Solar wrote:
If you do identify a process as malicious, crash and burn it.
Even if the software and all objects are trusted (and possibly come from the same original source as the OS), they are still a virus of a different degree, due to exploitability. The tendency is to limit content to "trustworthy" online stores anyway, in the hope that they can blacklist malicious sources, enforce non-repudiation, "guarantee" delivery integrity, etc. If the OS provides a sensible security model for self-restriction that a trusted application can employ (per-application, per-thread, per-process, through separation of concerns, impersonation, elevation, etc.), and if this model is both efficient and effective (which seems like a pipe dream), the exploit will ideally be limited to the extent of the foothold the attacker already has. Some loss of privacy and authenticity will thus be unavoidable, by the very nature of software and the people interacting with it.
I'd say that if a process is intentionally malicious it should crash and burn; and if a process is unintentionally exploitable then it should crash and burn. For both cases maybe the system should try to inform the developer that their code was a problem, so that if it was unintentional the developer gets notified and can fix the problem sooner.
Note that I'm planning executables that are digitally signed by the developer (where the developer has to register for a key to be added to a whitelist), where how trusted an executable is depends on how trusted the developer is. If an executable is intentionally malicious or exploitable, it would affect/decrease how much the system trusts the developer and all executables created by that developer. Ideally this would be tied into a (semi-automated) "world-wide blacklist/whitelist" maintained as part of the OS project (and probably involve a messy appeals process where people complain because the system stopped trusting them).
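In sketch form (every type and function below is invented for illustration), the load-time check would boil down to:

Code:

/* Sketch: an executable carries its developer's signature; the OS keeps
 * a (remotely updated) whitelist of developer keys with trust levels,
 * so revoking one key affects everything that developer signed. */
#include <stdbool.h>

enum trust { TRUST_REVOKED, TRUST_LOW, TRUST_NORMAL, TRUST_HIGH };

struct dev_key {
    unsigned char public_key[32];
    enum trust level;            /* synced from the central list */
};

extern struct dev_key *whitelist_lookup(const unsigned char key_id[32]);
extern bool signature_ok(const void *image, unsigned long len,
                         const struct dev_key *key);

bool may_execute(const void *image, unsigned long len,
                 const unsigned char key_id[32])
{
    struct dev_key *key = whitelist_lookup(key_id);
    if (key == NULL || key->level == TRUST_REVOKED)
        return false;            /* unknown or blacklisted developer */
    return signature_ok(image, len, key);
}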
simeonz wrote:
On a side note, I doubt that this current security paradigm, which relies on object identity so much, can be applied to prospective adaptive (i.e. AI) software, which at least in some places may replace conventional hard-coded applications.
I don't think it makes any difference if a piece of software is created by a human or created by a machine - if the software is malicious or exploitable then it dies regardless of what created it.
Cheers,
Brendan