.. SPDX-License-Identifier: GPL-2.0

==========================
Page Table Isolation (PTI)
==========================

Overview
========

Page Table Isolation (PTI, previously known as KAISER [1]_) is a
countermeasure against attacks on the shared user/kernel address
space such as the "Meltdown" approach [2]_.

To mitigate this class of attacks, we create an independent set of
page tables for use only when running userspace applications. When
the kernel is entered via syscalls, interrupts or exceptions, the
page tables are switched to the full "kernel" copy. When the system
switches back to user mode, the user copy is used again.

The userspace page tables contain only a minimal amount of kernel
data: only what is needed to enter and exit the kernel, such as the
entry/exit functions themselves and the interrupt descriptor table
(IDT). A few strictly unnecessary things are also mapped, such as the
first C function called when entering an interrupt (see the comments
in pti.c).

This approach helps to ensure that side-channel attacks leveraging
the paging structures do not function when PTI is enabled. It can be
enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.
Once enabled at compile time, it can be disabled at boot with the
'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).
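
Whether PTI ended up active on a running system can be checked from
userspace. The sketch below is a minimal example, not a description of
an official interface. It assumes the ``meltdown`` file under
``/sys/devices/system/cpu/vulnerabilities/`` is present and reports
``Mitigation: PTI`` when isolation is active. The function name
``pti_status`` and its optional path argument are hypothetical and exist
only for illustration.

.. code-block:: shell

   # pti_status [FILE]: report whether PTI appears to be active.
   # FILE defaults to the sysfs Meltdown report; the argument exists so
   # that the check can be tried against a sample file.
   pti_status() {
       file="${1:-/sys/devices/system/cpu/vulnerabilities/meltdown}"
       if [ ! -r "$file" ]; then
           echo "unknown"
           return 2
       fi
       if grep -q 'Mitigation: PTI' "$file"; then
           echo "active"
       else
           echo "not active"
           return 1
       fi
   }

   # Print the status for the running system; ignore the exit code.
   pti_status || true

A result of "not active" does not by itself mean the system is
vulnerable: the file may instead state that the CPU is not affected.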

Page Table Management
=====================

When PTI is enabled, the kernel manages two sets of page tables.
The first set is very similar to the single set which is present in
kernels without PTI. This includes a complete mapping of userspace
that the kernel can use for things like copy_to_user().

Although *complete*, the user portion of the kernel page tables is
crippled by setting the NX bit in the top level. This ensures
that any missed kernel->user CR3 switch will immediately crash
userspace upon executing its first instruction.

The userspace page tables map only the kernel data needed to enter
and exit the kernel. This data is entirely contained in the 'struct
cpu_entry_area' structure, which is placed in the fixmap. The fixmap
gives each CPU's copy of the area a compile-time-fixed virtual address.

For new userspace mappings, the kernel makes the entries in its
page tables as usual. The only difference is when the kernel
makes entries in the top (PGD) level. In addition to setting the
entry in the main kernel PGD, a copy of the entry is made in the
userspace page tables' PGD.

This sharing at the PGD level also inherently shares all the lower
layers of the page tables. This leaves a single, shared set of
userspace page tables to manage: one PTE to lock, one set of
accessed bits, dirty bits, etc.

Overhead
========

Protection against side-channel attacks is important, but
this protection comes at a cost:

1. Increased Memory Use

   a. Each process now needs an order-1 PGD instead of order-0.
      (Consumes an additional 4k per process).
   b. The 'cpu_entry_area' structure must be 2MB in size and 2MB
      aligned so that it can be mapped by setting a single PMD
      entry. This consumes nearly 2MB of RAM once the kernel
      is decompressed, but no space in the kernel image itself.

2. Runtime Cost

   a. CR3 manipulation to switch between the page table copies
      must be done at interrupt, syscall, and exception entry
      and exit (it can be skipped when the kernel itself is
      interrupted, though). Moves to CR3 are on the order of a
      hundred cycles, and are required at every entry and exit.
   b. A "trampoline" must be used for SYSCALL entry. This
      trampoline depends on a smaller set of resources than the
      non-PTI SYSCALL entry code, so it requires mapping fewer
      things into the userspace page tables. The downside is
      that stacks must be switched at entry time.
   c. Global pages are disabled for all kernel structures not
      mapped into both kernel and userspace page tables. This
      feature of the MMU allows different processes to share TLB
      entries mapping the kernel. Losing the feature means more
      TLB misses after a context switch. The actual loss of
      performance is very small, however, never exceeding 1%.
   d. Process Context IDentifiers (PCID) is a CPU feature that
      allows us to skip flushing the entire TLB when switching page
      tables by setting a special bit in CR3 when the page tables
      are changed. This makes switching the page tables (at context
      switch, or kernel entry/exit) cheaper. But, on systems with
      PCID support, the context switch code must flush both the user
      and kernel entries out of the TLB. The user PCID TLB flush is
      deferred until the exit to userspace, minimizing the cost.
      See intel.com/sdm for the gory PCID/INVPCID details.
   e. The userspace page tables must be populated for each new
      process. Even without PTI, the shared kernel mappings
      are created by copying top-level (PGD) entries into each
      new process. But, with PTI, there are now *two* kernel
      mappings: one in the kernel page tables that maps everything
      and one for the entry/exit structures. At fork(), we need to
      copy both.
   f. In addition to the fork()-time copying, there must also
      be an update to the userspace PGD any time a set_pgd() is done
      on a PGD used to map userspace. This ensures that the kernel
      and userspace copies always map the same userspace
      memory.
   g. On systems without PCID support, each CR3 write flushes
      the entire TLB. That means that each syscall, interrupt
      or exception flushes the TLB.
   h. INVPCID is a TLB-flushing instruction which allows flushing
      of TLB entries for non-current PCIDs. Some systems support
      PCIDs, but do not support INVPCID. On these systems, addresses
      can only be flushed from the TLB for the current PCID. When
      flushing a kernel address, we need to flush all PCIDs, so a
      single kernel address flush will require a TLB-flushing CR3
      write upon the next use of every PCID.
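
The relationship between the two page table copies and their PCIDs can
be illustrated with plain shell arithmetic. This is a sketch under
stated assumptions, not the kernel's code. It assumes that the user PGD
is the second 4k page of the order-1 allocation (so its address differs
from the kernel PGD in bit 12) and that the user PCID is the kernel PCID
with bit 11 set. Both bit positions, the physical address and the PCID
value are illustrative assumptions; consult the x86 architecture code
for the authoritative definitions.

.. code-block:: shell

   # Example values only: an 8k-aligned physical address of a PGD pair
   # and a small kernel PCID.
   kernel_pgd=$((0x12346000))
   kernel_pcid=1

   pgtable_switch_bit=12   # assumed: selects the user half of the PGD pair
   user_pcid_bit=11        # assumed: distinguishes user from kernel PCID

   user_pgd=$(( kernel_pgd | (1 << pgtable_switch_bit) ))
   user_pcid=$(( kernel_pcid | (1 << user_pcid_bit) ))

   # CR3 carries the PGD physical address in its upper bits and the
   # PCID in its low 12 bits.
   kernel_cr3=$(( kernel_pgd | kernel_pcid ))
   user_cr3=$(( user_pgd | user_pcid ))

   printf 'kernel CR3: 0x%x\n' "$kernel_cr3"
   printf 'user   CR3: 0x%x\n' "$user_cr3"

With these example values the script prints ``0x12346001`` for the
kernel and ``0x12347801`` for userspace: under the stated assumptions,
the two CR3 values differ only in bits 11 and 12.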

Possible Future Work
====================

1. We can be more careful about not actually writing to CR3
   unless its value is actually changed.
2. Allow PTI to be enabled/disabled at runtime in addition to the
   boot-time switching.

Testing
=======

To test the stability of PTI, the following test procedure is
recommended, ideally doing all of these in parallel:

1. Set CONFIG_DEBUG_ENTRY=y
2. Run several copies of all of the tools/testing/selftests/x86/ tests
   (excluding MPX and protection_keys) in a loop on multiple CPUs for
   several minutes. These tests frequently uncover corner cases in the
   kernel entry code. In general, old kernels might cause these tests
   themselves to crash, but they should never crash the kernel.
3. Run the 'perf' tool in a mode (top or record) that generates many
   frequent performance monitoring non-maskable interrupts (see "NMI"
   in /proc/interrupts). This exercises the NMI entry/exit code which
   is known to trigger bugs in code paths that did not expect to be
   interrupted, including nested NMIs. Using "-c" boosts the rate of
   NMIs, and using two -c with separate counters encourages nested NMIs
   and less deterministic behavior.
   ::

     while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done

4. Launch a KVM virtual machine.
5. Run 32-bit binaries on systems supporting the SYSCALL instruction.
   This has been a lightly-tested code path and needs extra scrutiny.
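
While the perf loop from step 3 is running, it can be useful to confirm
that NMIs are actually arriving. The following sketch sums the per-CPU
counters on the ``NMI`` line of /proc/interrupts. The function name
``nmi_total`` and its optional file argument are hypothetical helpers
introduced for this example.

.. code-block:: shell

   # nmi_total [FILE]: print the sum of the per-CPU NMI counters.
   # FILE defaults to /proc/interrupts; nothing is printed if the file
   # has no NMI line.
   nmi_total() {
       awk '$1 == "NMI:" {
                total = 0
                # Counter columns are numeric; the description follows.
                for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++)
                    total += $i
                print total
            }' "${1:-/proc/interrupts}"
   }

   echo "NMIs so far: $(nmi_total)"

Calling ``nmi_total`` twice a few seconds apart and subtracting the
results gives a rough NMI rate, which should rise noticeably once perf
is sampling.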

Debugging
=========

Bugs in PTI cause a few different signatures of crashes
that are worth noting here.

* Failures of the selftests/x86 code. Usually a bug in one of the
  more obscure corners of entry_64.S.
* Crashes in early boot, especially around CPU bringup. Bugs
  in the trampoline code or mappings cause these.
* Crashes at the first interrupt. Caused by bugs in entry_64.S,
  such as getting a page table switch wrong. Also caused by
  incorrectly mapping the IRQ handler entry code.
* Crashes at the first NMI. The NMI code is separate from the main
  interrupt handlers and can have bugs that do not affect
  normal interrupts. Also caused by incorrectly mapping NMI
  code. NMIs that interrupt the entry code must be very
  careful and can be the cause of crashes that show up when
  running perf.
* Kernel crashes at the first exit to userspace. Caused by
  entry_64.S bugs, or by failing to map some of the exit code.
* Crashes at the first interrupt that interrupts userspace. The paths
  in entry_64.S that return to userspace are sometimes separate
  from the ones that return to the kernel.
* Double faults: overflowing the kernel stack because of page
  faults upon page faults. Caused by touching non-PTI-mapped
  data in the entry code, or forgetting to switch to kernel
  CR3 before calling into C functions which are not PTI-mapped.
* Userspace segfaults early in boot, sometimes manifesting
  as mount(8) failing to mount the rootfs. These have
  tended to be TLB invalidation issues, usually invalidating
  the wrong PCID or otherwise missing an invalidation.

.. [1] https://gruss.cc/files/kaiser.pdf
.. [2] https://meltdownattack.com/meltdown.pdf