

## Mitigation Plans for Microarchitectural Attacks

## **Daniel Gruss**

July 15, 2019

Graz University of Technology









- Observing cache utilization with performance counters? → No
- Observing cache utilization with performance counters and using it to infer a crypto key? → Yes
- Measuring memory access latency with Flush+Reload? → No
- Measuring memory access latency with Flush+Reload and using it to infer keystroke timings? → Yes



- traditional cache attacks (crypto, keys, etc.)
- actual misspeculation (e.g., branch misprediction)
- Meltdown, Foreshadow, ZombieLoad, etc.
- Let's avoid the term "Speculative Side Channels"

```
printf("%d", i);
printf("%d", i);
```

























































if cache attacks are simple because the mapping to sets is simple ...

instead of this:



let's do this:





- Index Derivation Function (IDF) takes an address and returns a cache set
- Depends on hardware key K and optional Security Domain ID (SDID)
- Unique combination of cache lines for each address
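
The idea of a keyed index derivation can be illustrated with a small software sketch. This is only a model under stated assumptions: the cache geometry (`N_WAYS`, `LINES_PER_WAY`, `LINE_BITS`) is made up for illustration, and the keyed BLAKE2b hash is a software stand-in for whatever low-latency mixing function real hardware would use. The function name `idf` and its signature are hypothetical.

```python
import hashlib

# Assumed geometry, chosen only for illustration.
N_WAYS = 8            # associativity
LINES_PER_WAY = 1024  # cache lines per way
LINE_BITS = 6         # 64-byte cache lines


def idf(address: int, key: bytes, sdid: int = 0) -> list:
    """Return one cache-line index per way for the given address.

    Software stand-in for a hardware IDF: a keyed hash over the
    security domain ID and the line address.
    """
    line_addr = address >> LINE_BITS  # drop the byte offset within the line
    msg = sdid.to_bytes(2, "little") + line_addr.to_bytes(8, "little")
    digest = hashlib.blake2b(msg, key=key, digest_size=2 * N_WAYS).digest()
    # Use two digest bytes per way and reduce them to a line index.
    return [
        int.from_bytes(digest[2 * w:2 * w + 2], "little") % LINES_PER_WAY
        for w in range(N_WAYS)
    ]


KEY = b"example hardware key K"  # hypothetical key, fixed here for the demo
indices_domain0 = idf(0x1000, KEY, sdid=0)
indices_domain1 = idf(0x1000, KEY, sdid=1)
```

In this model the same address maps to a different combination of lines in each security domain, and two addresses in the same cache line always map to the same combination.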





- ScatterCache requires no software support, default SDID = 0
- But OS support enables security domains
  - → shared read-only pages can be private in the cache!
- OS can define SDID per process and separate user space and kernel space
- Process can request distinct SDIDs for memory ranges



- Non-shared memory has no shared cache lines
  - → Flush+Reload, Flush+Flush and Evict+Reload are not possible
- Shared, read-only memory behaves like non-shared memory, given OS support; without OS support, eviction-based attacks are hindered
- Shared, writable memory can't be separated; eviction-based attacks are hindered

- Specialized Prime+Probe variants are still possible
- But overlap in more than 1 cache line is very unlikely
  - → Eviction is now probabilistic, $p = \frac{1}{n_{ways}^2}$ to evict
- Evicting an address with 99% certainty needs 275 addresses for an 8-way cache, instead of $\approx 8$ for standard Prime+Probe
- Constructing this set requires $\approx 2^{25}$ profiled victim accesses, compared to fewer than 100 accesses for standard, noise-free Prime+Probe

- Micro benchmarks GAP, MiBench, lmbench, scimark2 on gem5 full system simulator
- Macro benchmarks from SPEC CPU 2017 on custom cache simulator
- Cache hit rate always at or above levels of set-associative cache with random replacement
- Typically 2% to 4% below LRU on micro benchmarks, 0% to 2% for SPEC











- Mark secrets in source code
- Propagate taint through memory hierarchy:
  - Pages
  - Cache Lines (in caches and buffers)
  - Registers



## Serializing Barrier



## Unprotected















- Writing to unprotected memory exposes value to attackers
  - → Untaint register
- Split stack into protected and unprotected half
- Stack spills of unprotected data → stay unprotected as long as they stay in the cache
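
The taint rules above can be modelled with a toy simulator. This is a sketch under assumptions, not the ConTExT implementation: the class `ToyCpu`, its methods, and the 4 KiB page size are hypothetical, caches and buffers are not modelled, and the check `may_use_transiently` reflects the assumed policy that tainted values are withheld from transient execution.

```python
class ToyCpu:
    """Toy taint model: protected pages and a taint bit per register."""

    def __init__(self, protected_pages):
        self.protected_pages = set(protected_pages)  # page numbers marked secret
        self.memory = {}                             # address -> value
        self.regs = {}                               # name -> (value, tainted)

    def _is_protected(self, address):
        return (address >> 12) in self.protected_pages  # assumed 4 KiB pages

    def load(self, reg, address):
        # Taint follows the data from the page into the register.
        value = self.memory.get(address, 0)
        self.regs[reg] = (value, self._is_protected(address))

    def add(self, dst, a, b):
        (va, ta), (vb, tb) = self.regs[a], self.regs[b]
        self.regs[dst] = (va + vb, ta or tb)  # tainted if any input is tainted

    def store(self, reg, address):
        value, _tainted = self.regs[reg]
        self.memory[address] = value
        if not self._is_protected(address):
            # The value is exposed in unprotected memory, so untaint the register.
            self.regs[reg] = (value, False)

    def may_use_transiently(self, reg):
        # Assumed policy: tainted registers do not feed transient execution.
        return not self.regs[reg][1]


cpu = ToyCpu(protected_pages={1})
cpu.memory[0x1000] = 42      # lies in protected page 1
cpu.memory[0x2000] = 1       # lies in unprotected page 2
cpu.load("r1", 0x1000)
cpu.load("r2", 0x2000)
cpu.add("r3", "r1", "r2")    # r3 inherits the taint of r1
```

Storing `r3` to an unprotected address such as `0x3000` would clear its taint in this model, mirroring the "untaint register" rule.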







- Compiler Extension
- Linux Patch
- CPU Emulation in Bochs
- Native via uncacheable memory (ConTExT-light)

| Benchmark       | Baseline (SPEC score)         | ConTExT (SPEC score) | Overhead [%] |
|-----------------|-------------------------------|----------------------|--------------|
| 600.perlbench_s | 7.03                          | 6.86                 | +2.42        |
| 602.gcc_s       | 11.90                         | 11.80                | +0.84        |
| 605.mcf_s       | 9.06                          | 9.16                 | -1.10        |
| 620.omnetpp_s   | 5.07                          | 4.81                 | +5.13        |
| 623.xalancbmk_s | 6.06                          | 5.95                 | +1.82        |
| 625.x264_s      | 9.25                          | 9.25                 | 0.00         |
| 631.deepsjeng_s | 5.26                          | 5.22                 | +0.76        |
| 641.leela_s     | 4.71                          | 4.64                 | +1.48        |
| 648.exchange2_s | would require Fortran runtime |                      |              |
| 657.xz_s        | 12.10                         | 12.10                | 0.00         |
| Average         |                               |                      | +1.26        |

**Table 1:** Performance of the ConTExT split stack using the SPECspeed 2017 integer benchmark.



















64k cells, 1 capacitor and 1 transistor each



- Cells leak → repetitive refresh necessary
- Maximum interval between refreshes to guarantee data integrity
- Cells leak faster upon proximate accesses → Rowhammer




- no flush instruction
- increase refresh rate



Errors depending on refresh interval

- ECC protection: server can handle or correct single bit errors
- no standard for event reporting
- ECCploit paper (S&P 2019)
- RAMbleed (S&P 2020)

- one row closed → one adjacent row opened with low probability $p$
- Rowhammer: one row opened and closed a high number of times $N_{th}$
- statistically, neighbor rows are refreshed → no bit flip
- implementation at the memory controller level
- advantage: stateless → not expensive
- for $p = 0.001$ and $N_{th} = 100\text{K}$, experiencing one error in one year has a probability of $9.4 \times 10^{-14}$
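
Why "statistically, neighbor rows are refreshed" holds can be seen from a short calculation. The sketch below computes the chance that one particular neighbor is never opened during $N_{th}$ consecutive row closes. It assumes that the adjacent row is picked uniformly from the two neighbors, so each close opens a given neighbor with probability $p/2$; the function name `miss_probability` is hypothetical. It models a single hammering attempt only and does not reproduce the one-year figure, which also depends on how many attempts fit into a year.

```python
import math


def miss_probability(p: float, n_th: int) -> float:
    """Probability that n_th row closes never open one given neighbor.

    Assumption: each close opens a given neighbor with probability p / 2.
    """
    per_close = p / 2
    # (1 - p/2) ** n_th, computed via logarithms for numerical stability.
    return math.exp(n_th * math.log1p(-per_close))


single_attempt = miss_probability(p=0.001, n_th=100_000)
print(f"chance that a single hammering attempt goes unnoticed: {single_attempt:.2e}")
```

Even with a small $p$, the miss probability shrinks exponentially in $N_{th}$, which is why the scheme needs no per-row state.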

- counter per row
- increment neighbor rows
- refresh when counter reaches a threshold
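
The three steps above can be sketched as a small model. This is an illustration under assumptions, not a description of any real memory controller: the class `RowCounters`, the bank size, and the threshold value are hypothetical, and the regular periodic refresh (which would also reset the counters) is left out.

```python
from collections import defaultdict


class RowCounters:
    """Counter-based Rowhammer mitigation model for one DRAM bank."""

    def __init__(self, n_rows: int, threshold: int):
        self.n_rows = n_rows
        self.threshold = threshold
        self.counters = defaultdict(int)  # one counter per row
        self.refreshed = []               # log of targeted refreshes

    def activate(self, row: int) -> None:
        # Activating a row disturbs its neighbors, so their counters go up.
        for neighbor in (row - 1, row + 1):
            if 0 <= neighbor < self.n_rows:
                self.counters[neighbor] += 1
                if self.counters[neighbor] >= self.threshold:
                    self._refresh(neighbor)

    def _refresh(self, row: int) -> None:
        self.refreshed.append(row)
        self.counters[row] = 0  # a refreshed row starts from full charge again


bank = RowCounters(n_rows=8, threshold=3)
for _ in range(3):
    bank.activate(4)  # hammer row 4; rows 3 and 5 are the victims
print(bank.refreshed)
```

Unlike the probabilistic scheme, this approach needs state for every row, which is the main cost of the design.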






- B-CATT: disable vulnerable physical memory
- G-CATT/ZebRAM: isolate security domains in physical memory








- lower refresh rate = lower energy but more bit flips
- ECC memory → fewer bit flips
- → it's an optimization problem
  - what if "too aggressive" changes over time?
  - → difficult to optimize with an intelligent adversary



C. Canella, J. Van Bulck, M. Schwarz, M. Lipp, B. von Berg, P. Ortner, F. Piessens, D. Evtyushkin, and D. Gruss. A Systematic Evaluation of Transient Execution Attacks and Defenses. In: USENIX Security Symposium. 2019.

D. Gruss, E. Kraft, T. Tiwari, M. Schwarz, A. Trachtenberg, J. Hennessey, A. Ionescu, and A. Fogh. Page Cache Attacks. In: CCS. 2019.

M. Schwarz, R. Schilling, F. Kargl, M. Lipp, C. Canella, and D. Gruss. ConTExT: Leakage-Free Transient Execution. In: arXiv:1905.09100 (2019).



- many attacks out there
- thorough defenses can defeat entire classes of attacks
- important to distinguish between different attacks


