Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   1) .. SPDX-License-Identifier: GPL-2.0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   2) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   3) ==========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   4) Page Table Isolation (PTI)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   5) ==========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   6) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   7) Overview
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   8) ========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   9) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  10) Page Table Isolation (PTI, previously known as KAISER [1]_) is a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  11) countermeasure against attacks on the shared user/kernel address
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  12) space such as the "Meltdown" approach [2]_.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  13) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  14) To mitigate this class of attacks, we create an independent set of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  15) page tables for use only when running userspace applications.  When
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  16) the kernel is entered via syscalls, interrupts or exceptions, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  17) page tables are switched to the full "kernel" copy.  When the system
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  18) switches back to user mode, the user copy is used again.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  19) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  20) The userspace page tables contain only a minimal amount of kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  21) data: only what is needed to enter/exit the kernel such as the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  22) entry/exit functions themselves and the interrupt descriptor table
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  23) (IDT).  There are a few strictly unnecessary things that get mapped
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  24) such as the first C function when entering an interrupt (see
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  25) comments in pti.c).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  26) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  27) This approach helps to ensure that side-channel attacks leveraging
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  28) the paging structures do not function when PTI is enabled.  It can be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  29) enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  30) Once enabled at compile-time, it can be disabled at boot with the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  31) 'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).
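As a rough illustration of the boot-time switches, the sketch below
parses a kernel command line for 'nopti' and 'pti=' and interprets the
text of the "meltdown" vulnerabilities file.  The function names are
hypothetical, the rule that the last parameter wins is an assumption of
this sketch, and the "Mitigation: PTI" string is what x86 kernels are
expected to report; check it against the running kernel.

```python
from typing import Optional


def pti_boot_override(cmdline: str) -> Optional[str]:
    """Return 'off', 'on' or 'auto' if the command line sets PTI, else None.

    Assumption of this sketch: if several PTI parameters are given,
    the last one wins.
    """
    override = None
    for token in cmdline.split():
        if token == "nopti":
            override = "off"
        elif token.startswith("pti="):
            value = token[len("pti="):]
            # Only the documented values are accepted; anything else is ignored.
            if value in ("on", "off", "auto"):
                override = value
    return override


def meltdown_mitigated_by_pti(status: str) -> bool:
    """Interpret the text read from the 'meltdown' vulnerabilities file."""
    return status.strip() == "Mitigation: PTI"
```

On a running x86 system the inputs would typically come from
/proc/cmdline and /sys/devices/system/cpu/vulnerabilities/meltdown.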
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  32) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  33) Page Table Management
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  34) =====================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  35) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  36) When PTI is enabled, the kernel manages two sets of page tables.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  37) The first set is very similar to the single set which is present in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  38) kernels without PTI.  This includes a complete mapping of userspace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  39) that the kernel can use for things like copy_to_user().
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  40) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  41) Although _complete_, the user portion of the kernel page tables is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  42) crippled by setting the NX bit in the top level.  This ensures
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  43) that any missed kernel->user CR3 switch will immediately crash
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  44) userspace upon executing its first instruction.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  45) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  46) The userspace page tables map only the kernel data needed to enter
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  47) and exit the kernel.  This data is entirely contained in the 'struct
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  48) cpu_entry_area' structure, which is placed in the fixmap.  The fixmap gives
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  49) each CPU's copy of the area a compile-time-fixed virtual address.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  50) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  51) For new userspace mappings, the kernel makes the entries in its
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  52) page tables like normal.  The only difference is when the kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  53) makes entries in the top (PGD) level.  In addition to setting the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  54) entry in the main kernel PGD, a copy of the entry is made in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  55) userspace page tables' PGD.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  56) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  57) This sharing at the PGD level also inherently shares all the lower
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  58) layers of the page tables.  This leaves a single, shared set of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  59) userspace page tables to manage: one PTE to lock and one set of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  60) accessed bits, dirty bits, etc.
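The PGD-level sharing described above can be sketched with a toy model.
This is not kernel code: the class names are hypothetical, the 512-entry
PGD with the lower 256 slots mapping userspace is an assumption matching
4-level x86-64 paging, and the cpu_entry_area mapping in the user copy is
left out for brevity.

```python
USER_SLOTS = range(0, 256)  # assumed: lower half of a 512-entry PGD is userspace


class PgdEntry:
    """One top-level entry: a reference to a lower-level table plus an NX flag."""

    def __init__(self, table, nx):
        self.table = table  # lower-level table, modelled here as a dict
        self.nx = nx        # True if nothing below this entry may execute


class AddressSpace:
    """Toy model of the two PGDs kept per process when PTI is enabled."""

    def __init__(self):
        self.kernel_pgd = {}
        self.user_pgd = {}

    def set_pgd(self, index, table):
        if index in USER_SLOTS:
            # Kernel copy: complete mapping of userspace, but marked NX so a
            # missed kernel->user switch cannot run userspace code.
            self.kernel_pgd[index] = PgdEntry(table, nx=True)
            # User copy: points at the *same* lower-level table.
            self.user_pgd[index] = PgdEntry(table, nx=False)
        else:
            # Kernel-only mapping: never copied into the user PGD.
            self.kernel_pgd[index] = PgdEntry(table, nx=False)
```

Because both entries reference one lower-level table object, an update
made through either PGD is visible through the other, which is the
"single, shared set" property the text relies on.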
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  61) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  62) Overhead
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  63) ========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  64) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  65) Protection against side-channel attacks is important, but
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  66) it comes at a cost:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  67) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  68) 1. Increased Memory Use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  69) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  70)   a. Each process now needs an order-1 PGD instead of order-0.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  71)      (Consumes an additional 4k per process).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  72)   b. The 'cpu_entry_area' structure must be 2MB in size and 2MB
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  73)      aligned so that it can be mapped by setting a single PMD
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  74)      entry.  This consumes nearly 2MB of RAM once the kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  75)      is decompressed, but no space in the kernel image itself.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  76) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  77) 2. Runtime Cost
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  78) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  79)   a. CR3 manipulation to switch between the page table copies
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  80)      must be done at interrupt, syscall, and exception entry
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  81)      and exit, although it can be skipped when an interrupt or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  82)      exception arrives while the CPU is already in the kernel.  Each
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  83)      move to CR3 costs on the order of a hundred cycles.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  84)   b. A "trampoline" must be used for SYSCALL entry.  This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  85)      trampoline depends on a smaller set of resources than the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  86)      non-PTI SYSCALL entry code, so requires mapping fewer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  87)      things into the userspace page tables.  The downside is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  88)      that stacks must be switched at entry time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  89)   c. Global pages are disabled for all kernel structures not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  90)      mapped into both kernel and userspace page tables.  This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  91)      feature of the MMU allows different processes to share TLB
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  92)      entries mapping the kernel.  Losing the feature means more
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  93)      TLB misses after a context switch.  The actual loss of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  94)      performance is very small, however, never exceeding 1%.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  95)   d. Process Context IDentifiers (PCID) is a CPU feature that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  96)      allows us to skip flushing the entire TLB when switching page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  97)      tables by setting a special bit in CR3 when the page tables
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  98)      are changed.  This makes switching the page tables (at context
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  99)      switch, or kernel entry/exit) cheaper.  But, on systems with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100)      PCID support, the context switch code must flush both the user
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101)      and kernel entries out of the TLB.  The user PCID TLB flush is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102)      deferred until the exit to userspace, minimizing the cost.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103)      See intel.com/sdm for the gory PCID/INVPCID details.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104)   e. The userspace page tables must be populated for each new
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105)      process.  Even without PTI, the shared kernel mappings
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106)      are created by copying top-level (PGD) entries into each
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107)      new process.  But, with PTI, there are now *two* kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108)      mappings: one in the kernel page tables that maps everything
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109)      and one for the entry/exit structures.  At fork(), we need to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110)      copy both.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111)   f. In addition to the fork()-time copying, there must also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112)      be an update to the userspace PGD any time a set_pgd() is done
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113)      on a PGD used to map userspace.  This ensures that the kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114)      and userspace copies always map the same userspace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115)      memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116)   g. On systems without PCID support, each CR3 write flushes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117)      the entire TLB.  That means that each syscall, interrupt
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118)      or exception flushes the TLB.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119)   h. INVPCID is a TLB-flushing instruction which allows flushing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120)      of TLB entries for non-current PCIDs.  Some systems support
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121)      PCIDs, but do not support INVPCID.  On these systems, addresses
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122)      can only be flushed from the TLB for the current PCID.  When
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123)      flushing a kernel address, we need to flush all PCIDs, so a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124)      single kernel address flush will require a TLB-flushing CR3
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125)      write upon the next use of every PCID.
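A back-of-the-envelope estimate can be built from the figures in items
1.a, 1.b and 2.a.  The sketch below is only that: the function names are
hypothetical, the 2MB figure is used as an upper bound for "nearly 2MB",
100 cycles stands in for "on the order of a hundred", and TLB effects
(items 2.c to 2.h) are ignored entirely.

```python
PAGE_SIZE = 4096                         # bytes; the extra "4k" per process
CPU_ENTRY_AREA_BYTES = 2 * 1024 * 1024   # upper bound for "nearly 2MB"
CR3_WRITE_CYCLES = 100                   # assumed cost of one move to CR3


def extra_memory_bytes(num_processes):
    """Extra RAM: one additional PGD page per process plus the entry area."""
    return num_processes * PAGE_SIZE + CPU_ENTRY_AREA_BYTES


def cr3_cycles_per_second(user_entries_per_second):
    """Cycles spent on CR3 writes: one on entry from userspace, one on exit."""
    return user_entries_per_second * 2 * CR3_WRITE_CYCLES


def cr3_overhead_fraction(user_entries_per_second, cpu_hz):
    """Share of one CPU's cycles consumed by those CR3 writes."""
    return cr3_cycles_per_second(user_entries_per_second) / cpu_hz
```

For example, 100,000 entries from userspace per second on a hypothetical
2 GHz CPU gives 20 million cycles per second, or about 1% of that CPU,
before any TLB effects are counted.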
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) Possible Future Work
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) ====================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) 1. We can be more careful about not actually writing to CR3
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130)    unless its value is actually changed.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) 2. Allow PTI to be enabled/disabled at runtime in addition to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132)    boot-time switching.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) Testing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) ========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) To test stability of PTI, the following test procedure is recommended,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) ideally doing all of these in parallel:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) 1. Set CONFIG_DEBUG_ENTRY=y
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) 2. Run several copies of all of the tools/testing/selftests/x86/ tests
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142)    (excluding MPX and protection_keys) in a loop on multiple CPUs for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143)    several minutes.  These tests frequently uncover corner cases in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144)    kernel entry code.  In general, old kernels might cause these tests
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145)    themselves to crash, but they should never crash the kernel.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146) 3. Run the 'perf' tool in a mode (top or record) that generates many
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147)    frequent performance monitoring non-maskable interrupts (see "NMI"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148)    in /proc/interrupts).  This exercises the NMI entry/exit code which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149)    is known to trigger bugs in code paths that did not expect to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150)    interrupted, including nested NMIs.  Using "-c" boosts the rate of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151)    NMIs, and using two -c with separate counters encourages nested NMIs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152)    and less deterministic behavior.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153)    ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) 	while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) 4. Launch a KVM virtual machine.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158) 5. Run 32-bit binaries on systems supporting the SYSCALL instruction.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159)    This has been a lightly-tested code path and needs extra scrutiny.
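Steps 2 and 3 both amount to looping commands concurrently for a fixed
time.  A minimal sketch of such a driver follows; the function names are
hypothetical, it does not pin work to particular CPUs, and the commands
to run (for example the selftest binaries) are left to the caller.

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor


def run_in_loop(cmd, seconds):
    """Run cmd repeatedly for roughly `seconds`; return (runs, failures)."""
    deadline = time.monotonic() + seconds
    runs = failures = 0
    while True:
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        runs += 1
        if result.returncode != 0:
            failures += 1
        # Check the deadline after each run so every command runs at least once.
        if time.monotonic() >= deadline:
            return runs, failures


def stress(cmds, seconds):
    """Loop every command in cmds concurrently, one worker thread each."""
    with ThreadPoolExecutor(max_workers=len(cmds)) as pool:
        futures = [pool.submit(run_in_loop, cmd, seconds) for cmd in cmds]
        return [future.result() for future in futures]
```

Each element of cmds is an argument list as accepted by subprocess.run().
A nonzero failure count points at a test that failed or crashed, which
is worth investigating but is a different matter from the kernel itself
crashing.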
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) Debugging
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162) =========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164) Bugs in PTI cause a few different signatures of crashes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) that are worth noting here.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167)  * Failures of the selftests/x86 code.  Usually a bug in one of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168)    more obscure corners of entry_64.S.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169)  * Crashes in early boot, especially around CPU bringup.  Bugs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170)    in the trampoline code or mappings cause these.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171)  * Crashes at the first interrupt.  Caused by bugs in entry_64.S,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172)    like screwing up a page table switch.  Also caused by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173)    incorrectly mapping the IRQ handler entry code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174)  * Crashes at the first NMI.  The NMI code is separate from main
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175)    interrupt handlers and can have bugs that do not affect
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176)    normal interrupts.  Also caused by incorrectly mapping NMI
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177)    code.  NMIs that interrupt the entry code must be very
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178)    careful and can be the cause of crashes that show up when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179)    running perf.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180)  * Kernel crashes at the first exit to userspace.  entry_64.S
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181)    bugs, or failing to map some of the exit code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182)  * Crashes at first interrupt that interrupts userspace. The paths
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183)    in entry_64.S that return to userspace are sometimes separate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184)    from the ones that return to the kernel.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185)  * Double faults: overflowing the kernel stack because of page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186)    faults upon page faults.  Caused by touching non-pti-mapped
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187)    data in the entry code, or forgetting to switch to kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188)    CR3 before calling into C functions which are not pti-mapped.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189)  * Userspace segfaults early in boot, sometimes manifesting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190)    as mount(8) failing to mount the rootfs.  These have
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191)    tended to be TLB invalidation issues.  Usually invalidating
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192)    the wrong PCID, or otherwise missing an invalidation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194) .. [1] https://gruss.cc/files/kaiser.pdf
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195) .. [2] https://meltdownattack.com/meltdown.pdf