Breaking ASLR on systems with memory caches

xenos · **Posted:** Wed Feb 15, 2017 11:21 am

I was just pointed to an interesting article that I would like to share:

http://www.cs.vu.nl/~herbertb/download/ ... ndss17.pdf

tl;dr: The article presents a method how address space layout randomization, a security feature used to prevent attacks which are based on knowing the location of a program's code and data, can be broken by probing the memory cache. This probing is done in such as way that an attacker figures out which entries from page tables at all levels of the paging hierarchy have been recently fetched (as a consequence of a page walk, when a TLB miss occurs), and this basically tells the attacker for which virtual address the page walk was performed (and this is the information that ASLR is supposed to disguise). This is demonstrated in JavaScript.

onlyonemac · **Joined:** Sat Mar 01, 2014 2:59 pm **Posts:** 1146

I'm not the most clued-up about these things but, in my opinion, relying on ASLR to prevent an exploit is a bit like hiding a file containing all your passwords in a long tree of subdirectories with a false name. In other words, if information is held in a place that's accessible to an attacker, assume that they can access it, and don't think "oh, they won't find this".

Brendan · **Posted:** Wed Feb 15, 2017 1:03 pm

Hi,

XenOS wrote:

I was just pointed to an interesting article that I would like to share:

Once upon a time Intel invented a new instruction - "RDTSC". Intel weren't stupid and knew that access to such precise timing would be a security problem, so they also provided a flag in CR4 that an OS can use to prevent CPL=3 code from being able to use the RDTSC instruction, and in the "Volume 3: System Programming Guide" (starting from the very first version that mentioned the time stamp counter) they wrote (highlighting is mine):

Intel in 1995 wrote:

A secure operating system would set the TSD flag during system initialization to disable user access to the time stamp counter. An operating system that disables user access to the timestamp counter should emulate the instruction through a user-accessible programming interface.

This same warning (using the exact same wording) has existed in every single version of the Intel manuals since, and still exists today.

Sadly; incompetent morons ignored Intel then and continued to ignore Intel since. The end result is that every 6 months or so (for a period of over 20 years) there's yet another white-paper showing yet another security problem that relies on very precise timing side-channels that could've and should've been impossible.

Cheers,

Brendan

SpyderTL · **Joined:** Sun Sep 19, 2010 10:05 pm **Posts:** 1074

I recently watched a YouTube video on the CCCGen channel about measuring the timings of memory access to both find memory addresses that have been used by the OS, and also to communicate between processes by having one process either access memory or clear cache entries, and the second process measure how long it takes to access the same memory to determine if the access was slow (meaning the bit would be a 1) or fast (meaning the bit would be a 0).

I think they said that they got up to 75Kbps using this approach... Pretty fascinating stuff. Unfortunately, that channel has been (quite recently) shut down by YouTube for breaking TOS, somehow.

EDIT: Found it. https://www.youtube.com/watch?v=iS8YPDuZbKM

OSwhatever · **Joined:** Mon Jul 05, 2010 4:15 pm **Posts:** 595

Is there a good reason cache flush instructions should be available to user space at all?

Brendan · **Posted:** Wed Feb 15, 2017 10:06 pm

Hi,

OSwhatever wrote:

Is there a good reason cache flush instructions should be available to user space at all?

CLFLUSH (and the non-temporal moves that came later) are important for performance in cases where the data you touched recently won't be data you need again soon (especially when it's a lot of data you won't need again soon). They're an effective work around for pathological cases caused by the cache's "least recently used eviction" algorithm that are needed to avoid potentially severe performance problems.

There are also a few other uses - mostly, in code that benchmarks RAM chip latency and/or code that tests if RAM is faulty; where you don't want to accidentally test the CPUs cache and not test RAM.

If CLFLUSH was only allowed at CPL=0 and you had a kernel API function to do it instead, then the only thing it'd do is make good code much slower (because it's mostly used inside loops in performance critical code that processes a large amount of data, the overhead of an exception of kernel API call would be disastrous); and bad/malicious code would just use the kernel API function. If you didn't allow CLFLUSH (or non-temporal moves) at all (disabled with no alternative provided by kernel) you'd be stuck with the potentially severe performance problems (and it'd be harder to benchmark or test RAM).

Note that there is no actual security problem with CLFLUSH itself. There's a security problem caused by dodgy RAM chips ("rowhammer") where even good/non-malicious software can cause unintended RAM corruption; and there's a problem with "too precise timing" that leads to all kinds of timing side-channels (including data caches, and instruction caches, and TLBs, and shared resources in "attacker and victim sharing core" SMT cases, and ....).

The simple solution is for the kernel to disable RDTSC (and RDTSCP) at CPL=3 and either emulate them (for backward compatibility) or provide kernel API function/s. This allows kernel to nerf precision and/or nerf accuracy (and make a lot of timing side-channels far less practical). It also allows kernel to avoid the "unsynchronised time stamp on different CPUs" problem and/or allows kernel to provide a more convenient calibrated time stamp (that doesn't depend on CPU speed - e.g. always "microseconds since something" for all CPUs in all power saving states).

Cheers,

Brendan

Sik · **Joined:** Wed Aug 17, 2016 4:55 am **Posts:** 251

Brendan wrote:

Note that there is no actual security problem with CLFLUSH itself. There's a security problem caused by dodgy RAM chips ("rowhammer") where even good/non-malicious software can cause unintended RAM corruption; and there's a problem with "too precise timing" that leads to all kinds of timing side-channels (including data caches, and instruction caches, and TLBs, and shared resources in "attacker and victim sharing core" SMT cases, and ....).

That was my thought when I first heard about it, although rather than blame the DRAM chips themselves I'd blame the DRAM controller instead. Under absolutely no circumstance should any memory write interfere with memory refresh, yet that's exactly what seemed to be happening (hence the RAM corruption).

azblue · **Joined:** Sat Feb 27, 2010 8:55 pm **Posts:** 147

There are a couple questions I have arising from the OP and some of the comments:

1. Does this mean we shouldn't bother with ASLR? (Isn't it almost "security through obscurity" anyway?)

2. Regarding timing attacks, how precise can timing be before it becomes a security concern? This is something I've wondered for quite a while.

Brendan · **Posted:** Thu Feb 16, 2017 8:22 am

Hi,

azblue wrote:

There are a couple questions I have arising from the OP and some of the comments:

1. Does this mean we shouldn't bother with ASLR? (Isn't it almost "security through obscurity" anyway?)

First; "security through obscurity" is where you're relying on the attacker not being able to figure out how the security works. If the attacker knows exactly how the security works but doesn't know something else (a password, an encryption key, a seed for a random number generator, etc) then it's not considered "security through obscurity".

It's best to think of security as many layers, where an attacker has to penetrate through multiple layers to succeed. The question is whether or not ASLR is a worthwhile additional layer - do the benefits (for security) justify the disadvantages (for code complexity, performance, etc). There's no right answer - it's more a question of design goals.

Note that for 32-bit 80x86 usually you can't get enough entropy for ASLR to be very effective (the "random" addresses end up with the lowest 12 bits known due to page alignment and several upper bits known due splitting the virtual address space into zones; and that only leaves about 18 "random middle bits"); and because the CPU doesn't support efficient "IP relative addressing" for 32-bit code there's a performance cost involved with relocated code and data. This means that ASLR is closer to "not worthwhile" for 32-bit 80x86. For 64-bit 80x86 it's different - there are more addressing bits than can be random, and there is support for "IP relative addressing" and less performance cost involved; so it's "more worthwhile" for 64-bit 80x86 than it would be for 32-bit 80x86.

azblue wrote:

2. Regarding timing attacks, how precise can timing be before it becomes a security concern? This is something I've wondered for quite a while.

Assume you have an "uncertainty value" like this:

Code:

    do {
        last_tsc = rdtsc();
    } while(last_tsc % uncertainty != 0);
    return last_tsc/uncertainty;

I'd expect that an "uncertainty value" of a few hundred cycles would be enough to make half of the attacks (those that rely on cache timing or "shared resource in SMT timing") go from "vaguely plausible" to "impractical". I'd also be tempted to make the amount of uncertainty proportional to how often software asks for the time (e.g. 200 cycles of uncertainty if you ask for the TSC infrequently, and 999999 cycles of uncertainty if you ask for TSC millions of times per second).

Cheers,

Brendan

Korona · **Joined:** Thu May 17, 2007 1:27 pm **Posts:** 999

Brendan wrote:

I'd expect that an "uncertainty value" of a few hundred cycles would be enough to make half of the attacks (those that rely on cache timing or "shared resource in SMT timing") go from "vaguely plausible" to "impractical". I'd also be tempted to make the amount of uncertainty proportional to how often software asks for the time (e.g. 200 cycles of uncertainty if you ask for the TSC infrequently, and 999999 cycles of uncertainty if you ask for TSC millions of times per second).

Note that there are programs that legitimately ask for TSC 10s - 100s of thousands times per second. The overhead of a syscall would totally kill those applications.

Brendan · **Posted:** Thu Feb 16, 2017 8:46 am

Hi,

Korona wrote:

Brendan wrote:

I'd expect that an "uncertainty value" of a few hundred cycles would be enough to make half of the attacks (those that rely on cache timing or "shared resource in SMT timing") go from "vaguely plausible" to "impractical". I'd also be tempted to make the amount of uncertainty proportional to how often software asks for the time (e.g. 200 cycles of uncertainty if you ask for the TSC infrequently, and 999999 cycles of uncertainty if you ask for TSC millions of times per second).

Note that there are programs that legitimately ask for TSC 10s - 100s of thousands times per second. The overhead of a syscall would totally kill those applications.

Which applications, and why?

Cheers,

Brendan

willedwards · **Joined:** Sat Mar 15, 2014 3:49 pm **Posts:** 96

azblue wrote:

There are a couple questions I have arising from the OP and some of the comments:

1. Does this mean we shouldn't bother with ASLR? (Isn't it almost "security through obscurity" anyway?)

2. Regarding timing attacks, how precise can timing be before it becomes a security concern? This is something I've wondered for quite a while.

1. You should absolutely bother with exploit mitigation in general and ASLR in particular.

https://www.openbsd.org/papers/auug04/mgp00001.html is old but still excellent background.

You can harden your exploit mitigation further by W^X, even for JITed code, having an at-install-time or at-each-load-time code order randomizer and so on. I think I heard that apple and google are now doing the latter, e.g. apple for watch apps.

There's no substitute for strong memory protection and carefully audited kernel APIs. But you absolutely want exploit mitigation too.

Re 2 iirc they demonstrated a pure-Javascript timer loop that worked for this attack.

Korona · **Joined:** Thu May 17, 2007 1:27 pm **Posts:** 999

Brendan wrote:

Which applications, and why?

For example if you're polling (IRQs do not make sense at high I/O rates) a low-latency link (e.g. Infiniband) and want to keep track of RTT or wait times. I'm sure there are more examples in real-time applications, this is just one I know of.

EDIT: Another example would be performance measurements in HPC applications like SAT or ILP solvers or physics/weather/fluid/whatever simulation.

Brendan · **Posted:** Thu Feb 16, 2017 9:40 am

Hi,

Korona wrote:

Brendan wrote:

Which applications, and why?

For example if you're polling (IRQs do not make sense at high I/O rates) a low-latency link (e.g. Infiniband) and want to keep track of RTT or wait times. I'm sure there are more examples in real-time applications, this is just one I know of.

EDIT: Another example would be performance measurements in HPC applications like SAT or ILP solvers or physics/weather/fluid/whatever simulation.

That doesn't really sound like enough to matter; given that NIC drivers are drivers (which tend to be "more trusted" and could or would be exempt) and the rest don't need precision at all (e.g. benchmark how much work they can do in a 5 day period and you can use a sundial instead ;-)

).

Cheers,

Brendan

Rusky · **Joined:** Wed Jan 06, 2010 7:07 pm **Posts:** 792

Brendan wrote:

That doesn't really sound like enough to matter; given that NIC drivers are drivers (which tend to be "more trusted" and could or would be exempt) and the rest don't need precision at all (e.g. benchmark how much work they can do in a 5 day period and you can use a sundial instead ;-)

).

Low-latency networking tends to be done in userspace, not in a driver; performance measurements that only measure overall throughput are useful but insufficient.

Someone's always going to be able to come up with a good reason to want high-precision timing information. The solution is not to get rid of it, but to use a well-designed execution environment with enough mitigations that it becomes impractical to take advantage of timing for malicious purposes.

OSDev.org

Breaking ASLR on systems with memory caches

Who is online