Hi,
bluemoon wrote:
It seems the current workaround is to implement kernel page table isolation, but that incurs a significant performance penalty.
Since everyone is slower, does this mean the performance gap between monolithic and micro-kernels is smaller, and micro-kernels will become a more viable design?
I've been trying to think of effective work-arounds, and there really aren't many. Apart from making kernel pages inaccessible ("not present" or using PCID):
- Disabling caches for CPL=3 pages would work but would have an extreme performance cost
- For a 32-bit OS, segmentation might work, but to be honest I very much doubt it (I'd assume segment limit checks are done at the same time as page permission checks and would therefore have the same problem)
- In theory, managed languages could work (by making it impossible for programmers to generate code that tries to access kernel space); but quite frankly, every "managed language" attempt that has ever hit production machines has had so many security problems that it's much safer to assume a managed language would only make security worse (far more code needs to be trusted than the kernel alone), and the performance is likely to be worse than a PTI approach (especially for anything where performance matters).
This leaves "make kernel pages inaccessible" as the least worst option.
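For a concrete picture of "make kernel pages inaccessible", here's a minimal 64-bit sketch (mine, not anyone's actual implementation), assuming the usual higher-half kernel layout: each process gets a second PML4 for user mode where the kernel half is "not present", except for a small trampoline (entry stubs, IDT, per-CPU entry stack) that must contain no secrets. All the names here (build_user_pml4, TRAMPOLINE_PML4_SLOT, etc.) are hypothetical:

```c
#include <stdint.h>
#include <string.h>

#define PML4_ENTRIES         512
#define KERNEL_HALF_START    256   /* entries 256..511 cover the higher half */
#define TRAMPOLINE_PML4_SLOT 511   /* one entry kept for the entry/exit stubs */

/* Build the user-mode PML4 from the process's full (kernel-mode) PML4. */
void build_user_pml4(uint64_t *pml4_user, const uint64_t *pml4_kernel)
{
    /* User half: identical to the kernel-mode view of this process. */
    memcpy(pml4_user, pml4_kernel, KERNEL_HALF_START * sizeof(uint64_t));

    /* Kernel half: "not present", so loads from CPL=3 (speculative or
     * otherwise) have nothing to hit. */
    memset(&pml4_user[KERNEL_HALF_START], 0,
           (PML4_ENTRIES - KERNEL_HALF_START) * sizeof(uint64_t));

    /* Keep only the trampoline mapped; it holds no sensitive data. */
    pml4_user[TRAMPOLINE_PML4_SLOT] = pml4_kernel[TRAMPOLINE_PML4_SLOT];
}
```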
For "make kernel pages inaccessible" it doesn't necessarily need to be all kernel pages. Pages that contain sensitive information (e.g. encryption keys) would need to be made inaccessible, but pages that don't contain sensitive information don't need to be made inaccessible. This gives 2 cases.
If PCID can't be used, then you could separate everything into "sensitive kernel data" and "not sensitive kernel data" and leave all of the "not sensitive kernel data" mapped in all address spaces all the time to minimise the overhead. For a monolithic kernel (especially a pre-existing monolithic kernel) it'd be almost impossible to separate "sensitive" and "not sensitive" (because there's all kinds of drivers, etc. to worry about) and it'd be easy to overlook something; so you'd mostly want a tiny stub where almost everything is treated as "sensitive" to avoid the headaches. For a micro-kernel it wouldn't be too hard to distinguish between "sensitive" and "not sensitive", and it'd be possible to create a micro-kernel where everything is "not sensitive", simply because there's very little in the kernel to begin with. The performance of a micro-kernel would be much less affected, or not affected at all; closing the performance gap between micro-kernels and monolithic kernels, and potentially making micro-kernels faster than monolithic kernels.
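As a rough illustration of that split, here's a sketch of tagging non-sensitive kernel data with a dedicated linker section so it can stay mapped in every address space. The section name, the __not_sensitive_* symbols (which a custom linker script would have to provide), and map_in_user_tables() are all hypothetical:

```c
#include <stdint.h>

#define NOT_SENSITIVE __attribute__((section(".kdata.not_sensitive")))

/* Safe to leave mapped in every address space; leaks nothing useful. */
NOT_SENSITIVE volatile uint64_t timer_ticks;

/* Sensitive: stays in the normal kernel sections, never mapped at CPL=3. */
static uint8_t disk_encryption_key[32];

/* Symbols a (hypothetical) linker script places around the section. */
extern char __not_sensitive_start[], __not_sensitive_end[];

/* Hypothetical helper: maps one kernel page into a user-visible PML4. */
void map_in_user_tables(uint64_t *pml4_user, uintptr_t vaddr);

/* Map only the "not sensitive" pages into the user-mode page tables. */
void map_not_sensitive_pages(uint64_t *pml4_user)
{
    for (uintptr_t page = (uintptr_t)__not_sensitive_start;
         page < (uintptr_t)__not_sensitive_end;
         page += 4096)
        map_in_user_tables(pml4_user, page);
}
```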
Note: For this case, especially for monolithic kernels, if you're paying for the TLB thrashing anyway then it wouldn't take much more to have fully separated virtual address spaces, so that both user-space and kernel-space can be larger (e.g. on a 32-bit CPU, let user-space have almost 4 GiB of space and let the kernel have a separate 4 GiB of space).

If PCID can be used (which excludes 32-bit OSs), then the overhead of making kernel pages inaccessible is significantly less. In this case, if nothing in the kernel is "sensitive" you can do nothing; and if anything in the kernel is "sensitive" you'd probably just use PCID to protect everything (including the "not sensitive" data). In practice this probably means that monolithic kernels and some micro-kernels are affected, but a "100% not sensitive" micro-kernel wouldn't be affected.
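To show why PCID helps, here's a minimal context-switch sketch, assuming a 64-bit CPU with CR4.PCIDE set: bits 11:0 of the value written to CR3 select the PCID, and setting bit 63 tells the CPU not to flush TLB entries tagged with that PCID. The struct and field names are my own invention:

```c
#include <stdint.h>

#define CR3_NOFLUSH (1ULL << 63)   /* don't flush entries for this PCID */

struct address_space {
    uint64_t pml4_phys;   /* physical address of the PML4, 4 KiB aligned */
    uint16_t asid;        /* 12-bit PCID assigned to this address space */
};

static inline void write_cr3(uint64_t value)
{
    __asm__ volatile("mov %0, %%cr3" :: "r"(value) : "memory");
}

void switch_address_space(const struct address_space *as)
{
    /* TLB entries tagged with this PCID survive the switch, so returning
     * to a recently-run address space doesn't re-walk every page table. */
    write_cr3(as->pml4_phys | (as->asid & 0xFFF) | CR3_NOFLUSH);
}
```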
In other words, it reduces the performance gap between monolithic kernels and some micro-kernels (but not all micro-kernels), and probably not by enough to make those micro-kernels faster than monolithic kernels.
The other thing I'd want to mention is that for all approaches and all kernel types (but excluding "kernel given a completely separate virtual address space so that both user-space and kernel-space can be larger"), the kernel could distinguish between "more trusted" processes and "less trusted" processes and leave the kernel mapped (avoiding the PTI overhead) while "more trusted" processes are running. In practice this means that if the OS supports (e.g.) digitally signed executables (and is therefore able to assign different amounts of trust depending on the existence of a signature and on who the signer was) then it may perform far better than an OS that doesn't. This makes me think that various open source groups that shun things like signatures (e.g. GNU) may end up penalised on a lot of OSs (possibly including future versions of Linux).
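A sketch of that "more trusted vs. less trusted" idea, assuming a per-process trust flag set when a recognised signature was verified at exec time (the struct and field names are hypothetical):

```c
#include <stdint.h>
#include <stdbool.h>

struct process {
    uint64_t pml4_full;     /* kernel fully mapped: no PTI overhead */
    uint64_t pml4_isolated; /* kernel unmapped except for the trampoline */
    bool     trusted;       /* executable signed by a trusted key? */
};

/* Called when returning to user space: trusted processes run on the
 * cheap page tables, untrusted processes pay the isolation cost. */
uint64_t user_cr3_for(const struct process *p)
{
    return p->trusted ? p->pml4_full : p->pml4_isolated;
}
```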
Cheers,
Brendan