Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   1) L1TF - L1 Terminal Fault
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   2) ========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   3) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   4) L1 Terminal Fault is a hardware vulnerability which allows unprivileged
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   5) speculative access to data which is available in the Level 1 Data Cache
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   6) when the page table entry controlling the virtual address, which is used
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   7) for the access, has the Present bit cleared or other reserved bits set.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   8) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   9) Affected processors
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  10) -------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  11) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  12) This vulnerability affects a wide range of Intel processors. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  13) vulnerability is not present on:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  14) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  15)    - Processors from AMD, Centaur and other non-Intel vendors
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  16) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  17)    - Older processor models, where the CPU family is < 6
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  18) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  19)    - A range of Intel ATOM processors (Cedarview, Cloverview, Lincroft,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  20)      Penwell, Pineview, Silvermont, Airmont, Merrifield)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  21) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  22)    - The Intel XEON PHI family
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  23) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  24)    - Intel processors which have the ARCH_CAP_RDCL_NO bit set in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  25)      IA32_ARCH_CAPABILITIES MSR. If the bit is set the CPU is not affected
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  26)      by the Meltdown vulnerability either. These CPUs should become
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  27)      available by end of 2018.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  28) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  29) Whether a processor is affected or not can be read out from the L1TF
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  30) vulnerability file in sysfs. See :ref:`l1tf_sys_info`.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  31) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  32) Related CVEs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  33) ------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  34) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  35) The following CVE entries are related to the L1TF vulnerability:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  36) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  37)    =============  =================  ==============================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  38)    CVE-2018-3615  L1 Terminal Fault  SGX related aspects
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  39)    CVE-2018-3620  L1 Terminal Fault  OS, SMM related aspects
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  40)    CVE-2018-3646  L1 Terminal Fault  Virtualization related aspects
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  41)    =============  =================  ==============================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  42) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  43) Problem
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  44) -------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  45) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  46) If an instruction accesses a virtual address for which the relevant page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  47) table entry (PTE) has the Present bit cleared or other reserved bits set,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  48) then speculative execution ignores the invalid PTE and loads the referenced
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  49) data if it is present in the Level 1 Data Cache, as if the page referenced
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  50) by the address bits in the PTE was still present and accessible.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  51) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  52) While this is a purely speculative mechanism and the instruction will raise
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  53) a page fault when it is retired eventually, the pure act of loading the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  54) data and making it available to other speculative instructions opens up the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  55) opportunity for side channel attacks to unprivileged malicious code,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  56) similar to the Meltdown attack.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  57) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  58) While Meltdown breaks the user space to kernel space protection, L1TF
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  59) allows attacks on any physical memory address in the system, and the attack
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  60) works across all protection domains. It allows attacks on SGX and also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  61) works from inside virtual machines because the speculation bypasses the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  62) extended page table (EPT) protection mechanism.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  63) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  64) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  65) Attack scenarios
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  66) ----------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  67) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  68) 1. Malicious user space
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  69) ^^^^^^^^^^^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  70) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  71)    Operating Systems store arbitrary information in the address bits of a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  72)    PTE which is marked non present. This allows a malicious user space
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  73)    application to attack the physical memory to which these PTEs resolve.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  74)    In some cases user-space can maliciously influence the information
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  75)    encoded in the address bits of the PTE, thus making attacks more
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  76)    deterministic and more practical.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  77) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  78)    The Linux kernel contains a mitigation for this attack vector, PTE
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  79)    inversion, which is permanently enabled and has no performance
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  80)    impact. The kernel ensures that the address bits of PTEs, which are not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  81)    marked present, never point to cacheable physical memory space.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  82) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  83)    A system with an up to date kernel is protected against attacks from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  84)    malicious user space applications.
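
As an illustration of the PTE inversion idea described above, the following Python sketch flips the address bits of a non-present entry so that they refer to the top of the physical address space instead of low memory. The bit layout (Present bit at bit 0, address bits 12 to 51) and all function names are assumptions chosen for illustration. The real kernel code is written in C and differs in detail.

```python
# Illustrative model of PTE inversion. The layout below is an assumption:
# bit 0 is the Present bit and bits 12..51 hold the physical address.
PRESENT_BIT = 1 << 0
ADDR_MASK = ((1 << 52) - 1) & ~((1 << 12) - 1)


def store_pte(pte: int) -> int:
    """Return the value that would be written to the page table.

    Present entries are stored unchanged. For non-present entries the
    address bits are inverted, so a speculative access no longer
    resolves to the low physical memory the bits originally encoded.
    """
    if pte & PRESENT_BIT:
        return pte
    return pte ^ ADDR_MASK


def load_pte(stored: int) -> int:
    """Undo the inversion when the kernel reads a non-present entry back."""
    if stored & PRESENT_BIT:
        return stored
    return stored ^ ADDR_MASK


def speculative_target(stored: int) -> int:
    """Physical address that the raw address bits of an entry refer to."""
    return stored & ADDR_MASK


if __name__ == "__main__":
    swap_entry = 0x1000  # hypothetical non-present entry with low address bits
    stored = store_pte(swap_entry)
    print(hex(speculative_target(swap_entry)), "->", hex(speculative_target(stored)))
```

In this model the inverted entry refers to an address close to the top of the 52-bit range, which is assumed not to be backed by cacheable memory.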
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  85) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  86) 2. Malicious guest in a virtual machine
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  87) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  88) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  89)    The fact that L1TF breaks all domain protections allows malicious guest
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  90)    OSes, which can control the PTEs directly, and malicious guest user
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  91)    space applications, which run on an unprotected guest kernel lacking the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  92)    PTE inversion mitigation for L1TF, to attack physical host memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  93) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  94)    A special aspect of L1TF in the context of virtualization is simultaneous
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  95)    multi threading (SMT). The Intel implementation of SMT is called
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  96)    HyperThreading. The fact that Hyperthreads on the affected processors
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  97)    share the L1 Data Cache (L1D) is important for this. As the flaw only
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  98)    allows attacks on data which is present in L1D, a malicious guest running
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  99)    on one Hyperthread can attack the data which is brought into the L1D by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100)    the context which runs on the sibling Hyperthread of the same physical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101)    core. This context can be host OS, host user space or a different guest.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103)    If the processor does not support Extended Page Tables, the attack is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104)    only possible when the hypervisor does not sanitize the content of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105)    effective (shadow) page tables.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107)    While solutions exist to mitigate these attack vectors fully, these
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108)    mitigations are not enabled by default in the Linux kernel because they
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109)    can affect performance significantly. The kernel provides several
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110)    mechanisms which can be utilized to address the problem depending on the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111)    deployment scenario. The mitigations, their protection scope and impact
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112)    are described in the next sections.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114)    The default mitigations and the rationale for choosing them are explained
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115)    at the end of this document. See :ref:`default_mitigations`.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) .. _l1tf_sys_info:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) L1TF system information
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) -----------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) The Linux kernel provides a sysfs interface to enumerate the current L1TF
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) status of the system: whether the system is vulnerable, and which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) mitigations are active. The relevant sysfs file is:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) /sys/devices/system/cpu/vulnerabilities/l1tf
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) The possible values in this file are:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130)   ===========================   ===============================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131)   'Not affected'		The processor is not vulnerable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132)   'Mitigation: PTE Inversion'	The host protection is active
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133)   ===========================   ===============================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) If KVM/VMX is enabled and the processor is vulnerable then the following
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) information is appended to the 'Mitigation: PTE Inversion' part:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138)   - SMT status:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140)     =====================  ================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141)     'VMX: SMT vulnerable'  SMT is enabled
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142)     'VMX: SMT disabled'    SMT is disabled
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143)     =====================  ================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145)   - L1D Flush mode:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147)     ================================  ====================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148)     'L1D vulnerable'		      L1D flushing is disabled
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150)     'L1D conditional cache flushes'   L1D flush is conditionally enabled
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152)     'L1D cache flushes'		      L1D flush is unconditionally enabled
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153)     ================================  ====================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) The resulting grade of protection is discussed in the following sections.
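
The status file can be read and summarized with a few lines of Python. The sketch below is only a minimal illustration: the function names are hypothetical, and the substring matching assumes the status text contains the phrases listed in the tables above, which may vary between kernel versions.

```python
from pathlib import Path

# sysfs file named in this document.
L1TF_FILE = Path("/sys/devices/system/cpu/vulnerabilities/l1tf")


def summarize_l1tf(status: str) -> dict:
    """Summarize an L1TF status string using simple substring checks."""
    text = status.strip()
    # Check the longer phrase first, because "cache flushes" is a
    # substring of "conditional cache flushes".
    if "conditional cache flushes" in text:
        flush = "cond"
    elif "cache flushes" in text:
        flush = "always"
    else:
        flush = None
    return {
        "affected": text != "Not affected",
        "pte_inversion": "PTE Inversion" in text,
        "smt_vulnerable": "SMT vulnerable" in text,
        "l1d_flush": flush,
    }


def read_l1tf_status(path: Path = L1TF_FILE):
    """Return the raw file content, or None if the file cannot be read."""
    try:
        return path.read_text()
    except OSError:
        return None


if __name__ == "__main__":
    raw = read_l1tf_status()
    if raw is None:
        print("L1TF status file is not available on this system")
    else:
        print(summarize_l1tf(raw))
```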
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158) Host mitigation mechanism
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159) -------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) The kernel is unconditionally protected against L1TF attacks from malicious
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162) user space running on the host.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) Guest mitigation mechanisms
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) ---------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) .. _l1d_flush:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) 1. L1D flush on VMENTER
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171) ^^^^^^^^^^^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173)    To make sure that a guest cannot attack data which is present in the L1D,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174)    the hypervisor flushes the L1D before entering the guest.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176)    Flushing the L1D evicts not only the data which should not be accessed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177)    by a potentially malicious guest, it also flushes the guest
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178)    data. Flushing the L1D has a performance impact as the processor has to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179)    bring the flushed guest data back into the L1D. Depending on the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180)    frequency of VMEXIT/VMENTER and the type of computations in the guest,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181)    performance degradation in the range of 1% to 50% has been observed. For
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182)    scenarios where guest VMEXIT/VMENTER are rare the performance impact is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183)    minimal. Virtio and mechanisms like posted interrupts are designed to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184)    confine the VMEXITs to a bare minimum, but specific configurations and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185)    application scenarios might still suffer from a high VMEXIT rate.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187)    The kernel provides two L1D flush modes:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188)     - conditional ('cond')
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189)     - unconditional ('always')
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191)    The conditional mode avoids L1D flushing after VMEXITs which execute
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192)    only audited code paths before the corresponding VMENTER. These code
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193)    paths have been verified not to expose secrets or other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194)    interesting data to an attacker, but they can leak information about the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195)    address space layout of the hypervisor.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197)    Unconditional mode flushes L1D on all VMENTER invocations and provides
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198)    maximum protection. It has a higher overhead than the conditional
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199)    mode. The overhead cannot be quantified correctly as it depends on the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200)    workload scenario and the resulting number of VMEXITs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202)    The general recommendation is to enable L1D flush on VMENTER. The kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203)    defaults to conditional mode on affected processors.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205)    **Note**, that L1D flush does not prevent the SMT problem because the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 206)    sibling thread will also bring back its data into the L1D which makes it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 207)    attackable again.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 208) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 209)    L1D flush can be controlled by the administrator via the kernel command
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 210)    line and sysfs control files. See :ref:`mitigation_control_command_line`
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 211)    and :ref:`mitigation_control_kvm`.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 212) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 213) .. _guest_confinement:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 214) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 215) 2. Guest VCPU confinement to dedicated physical cores
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 216) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 217) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 218)    To address the SMT problem, it is possible to make a guest or a group of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 219)    guests affine to one or more physical cores. The proper mechanism for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 220)    that is to utilize exclusive cpusets to ensure that no other guest or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 221)    host tasks can run on these cores.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 222) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 223)    If only a single guest or related guests run on sibling SMT threads on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 224)    the same physical core then they can only attack their own memory and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 225)    restricted parts of the host memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 226) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 227)    Host memory is attackable when one of the sibling SMT threads runs in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 228)    host OS (hypervisor) context and the other in guest context. The amount
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 229)    of valuable information from the host OS context depends on the context
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 230)    which the host OS executes, i.e. interrupts, soft interrupts and kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 231)    threads. The amount of valuable data from these contexts cannot be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 232)    declared as non-interesting for an attacker without deep inspection of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 233)    the code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 234) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 235)    **Note**, that assigning guests to a fixed set of physical cores affects
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 236)    the ability of the scheduler to do load balancing and might have
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 237)    negative effects on CPU utilization depending on the hosting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 238)    scenario. Disabling SMT might be a viable alternative for particular
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 239)    scenarios.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 240) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 241)    For further information about confining guests to a single or to a group
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 242)    of cores consult the cpusets documentation:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 243) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 244)    https://www.kernel.org/doc/Documentation/admin-guide/cgroup-v1/cpusets.rst
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 245) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 246) .. _interrupt_isolation:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 247) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 248) 3. Interrupt affinity
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 249) ^^^^^^^^^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 250) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 251)    Interrupts can be made affine to logical CPUs. This is not universally
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 252)    true because there are types of interrupts which are truly per CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 253)    interrupts, e.g. the local timer interrupt. Aside from that, multi queue
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 254)    devices affine their interrupts to single CPUs or groups of CPUs per
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 255)    queue without allowing the administrator to control the affinities.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 256) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 257)    Moving the interrupts, which can be affinity controlled, away from CPUs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 258)    which run untrusted guests, reduces the attack vector space.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 259) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 260)    Whether the interrupts which are affine to CPUs, which run untrusted
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 261)    guests, provide interesting data for an attacker depends on the system
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 262)    configuration and the scenarios which run on the system. While for some
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 263)    of the interrupts it can be assumed that they won't expose interesting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 264)    information beyond exposing hints about the host OS memory layout, there
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 265)    is no way to make general assumptions.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 266) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 267)    Interrupt affinity can be controlled by the administrator via the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 268)    /proc/irq/$NR/smp_affinity[_list] files. Limited documentation is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 269)    available at:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 270) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 271)    https://www.kernel.org/doc/Documentation/core-api/irq/irq-affinity.rst
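
The smp_affinity file takes a hexadecimal CPU bitmask, while smp_affinity_list takes a CPU list such as 0-3. The following Python sketch shows one way to build the bitmask form from a list of logical CPU numbers. The function name is hypothetical, and the comma-separated 32-bit grouping for larger masks is an assumption based on how the kernel prints these masks.

```python
def cpus_to_affinity_mask(cpus) -> str:
    """Build a hex bitmask string from logical CPU numbers.

    Bit N of the mask corresponds to logical CPU N. Masks wider than
    32 bits are split into comma-separated 32-bit groups, most
    significant group first.
    """
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu

    digits = format(mask, "x")
    groups = []
    # Cut the hex string into chunks of 8 digits, starting from the right.
    while digits:
        groups.append(digits[-8:])
        digits = digits[:-8]
    return ",".join(reversed(groups))


if __name__ == "__main__":
    # Hypothetical example: keep an interrupt on CPUs 0-3, away from
    # CPUs that run untrusted guests.
    print(cpus_to_affinity_mask([0, 1, 2, 3]))
```

Writing the resulting string to /proc/irq/$NR/smp_affinity requires root privileges and only works for interrupts whose affinity can be controlled.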
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 272) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 273) .. _smt_control:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 274) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 275) 4. SMT control
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 276) ^^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 277) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 278)    To prevent the SMT issues of L1TF it might be necessary to disable SMT
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 279)    completely. Disabling SMT can have a significant performance impact, but
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 280)    the impact depends on the hosting scenario and the type of workloads.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 281)    The impact of disabling SMT also needs to be weighed against the impact
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 282)    of other mitigation solutions like confining guests to dedicated cores.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 283) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 284)    The kernel provides a sysfs interface to retrieve the status of SMT and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 285)    to control it. It also provides a kernel command line interface to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 286)    control SMT.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 287) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 288)    The kernel command line interface consists of the following options:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 289) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 290)      =========== ==========================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 291)      nosmt	 Affects the bring up of the secondary CPUs during boot. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 292) 		 kernel tries to bring all present CPUs online during the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 293) 		 boot process. "nosmt" makes sure that from each physical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 294) 		 core only one, the so-called primary (hyper) thread, is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 295) 		 activated. Due to a design flaw of Intel processors related
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 296) 		 to Machine Check Exceptions the non primary siblings have
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 297) 		 to be brought up at least partially and are then shut down
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 298) 		 again.  "nosmt" can be undone via the sysfs interface.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 299) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 300)      nosmt=force Has the same effect as "nosmt" but does not allow the SMT
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 301) 		 disable to be undone via the sysfs interface.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 302)      =========== ==========================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 303) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 304)    The sysfs interface provides two files:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 305) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 306)    - /sys/devices/system/cpu/smt/control
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 307)    - /sys/devices/system/cpu/smt/active
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 308) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 309)    /sys/devices/system/cpu/smt/control:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 310) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 311)      This file reports the SMT control state and provides the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 312)      ability to disable or (re)enable SMT. The possible states are:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 313) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 314) 	==============  ===================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 315) 	on		SMT is supported by the CPU and enabled. All
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 316) 			logical CPUs can be onlined and offlined without
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 317) 			restrictions.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 318) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 319) 	off		SMT is supported by the CPU and disabled. Only
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 320) 			the so called primary SMT threads can be onlined
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 321) 			and offlined without restrictions. An attempt to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 322) 			online a non-primary sibling is rejected.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 323) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 324) 	forceoff	Same as 'off' but the state cannot be controlled.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 325) 			Attempts to write to the control file are rejected.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 326) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 327) 	notsupported	The processor does not support SMT. It's therefore
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 328) 			not affected by the SMT implications of L1TF.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 329) 			Attempts to write to the control file are rejected.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 330) 	==============  ===================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 331) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 332)      The possible states which can be written into this file to control SMT
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 333)      state are:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 334) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 335)      - on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 336)      - off
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 337)      - forceoff
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 338) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 339)    /sys/devices/system/cpu/smt/active:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 340) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 341)      This file reports whether SMT is enabled and active, i.e. if on any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 342)      physical core two or more sibling threads are online.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 343) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 344)    SMT control is also possible at boot time via the l1tf kernel command
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 345)    line parameter in combination with L1D flush control. See
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 346)    :ref:`mitigation_control_command_line`.
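
The rules for the SMT control file described above can be modelled in a short Python sketch. The helper names are hypothetical, and the state names are limited to the ones listed in this document. A running kernel may report additional states.

```python
from pathlib import Path

# sysfs directory named in this document.
SMT_DIR = Path("/sys/devices/system/cpu/smt")

# Values that may be written to the control file.
WRITABLE_STATES = ("on", "off", "forceoff")
# States in which writes to the control file are rejected.
LOCKED_STATES = ("forceoff", "notsupported")


def smt_write_allowed(current: str, requested: str) -> bool:
    """Return True if writing 'requested' is expected to be accepted."""
    return requested in WRITABLE_STATES and current not in LOCKED_STATES


def read_smt_file(name: str, base: Path = SMT_DIR):
    """Read 'control' or 'active'; return None if the file is missing."""
    try:
        return (base / name).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    control = read_smt_file("control")
    active = read_smt_file("active")
    print("control:", control, "active:", active)
    if control is not None:
        print("may disable SMT:", smt_write_allowed(control, "off"))
```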
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 347) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 348) 5. Disabling EPT
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 349) ^^^^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 350) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 351)   Disabling EPT for virtual machines provides full mitigation for L1TF even
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 352)   with SMT enabled, because the effective page tables for guests are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 353)   managed and sanitized by the hypervisor. However, disabling EPT has a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 354)   significant performance impact, especially when the Meltdown mitigation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 355)   KPTI is enabled.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 356) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 357)   EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 358) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 359) There is ongoing research and development for new mitigation mechanisms to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 360) address the performance impact of disabling SMT or EPT.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 361) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 362) .. _mitigation_control_command_line:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 363) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 364) Mitigation control on the kernel command line
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 365) ---------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 366) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 367) The kernel command line allows the L1TF mitigations to be controlled at
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 368) boot time with the option "l1tf=". The valid arguments for this option are:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 369) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 370)   ============  =============================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 371)   full		Provides all available mitigations for the L1TF
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 372) 		vulnerability. Disables SMT and enables all mitigations in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 373) 		the hypervisors, i.e. unconditional L1D flushing.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 374) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 375) 		SMT control and L1D flush control via the sysfs interface
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 376) 		is still possible after boot.  Hypervisors will issue a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 377) 		warning when the first VM is started in a potentially
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 378) 		insecure configuration, i.e. SMT enabled or L1D flush
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 379) 		disabled.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 380) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 381)   full,force	Same as 'full', but disables SMT and L1D flush runtime
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 382) 		control. Implies the 'nosmt=force' command line option.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 383) 		(i.e. sysfs control of SMT is disabled.)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 384) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 385)   flush		Leaves SMT enabled and enables the default hypervisor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 386) 		mitigation, i.e. conditional L1D flushing.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 387) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 388) 		SMT control and L1D flush control via the sysfs interface
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 389) 		is still possible after boot.  Hypervisors will issue a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 390) 		warning when the first VM is started in a potentially
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 391) 		insecure configuration, i.e. SMT enabled or L1D flush
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 392) 		disabled.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 393) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 394)   flush,nosmt	Disables SMT and enables the default hypervisor mitigation,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 395) 		i.e. conditional L1D flushing.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 396) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 397) 		SMT control and L1D flush control via the sysfs interface
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 398) 		is still possible after boot.  Hypervisors will issue a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 399) 		warning when the first VM is started in a potentially
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 400) 		insecure configuration, i.e. SMT enabled or L1D flush
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 401) 		disabled.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 402) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 403)   flush,nowarn	Same as 'flush', but hypervisors will not warn when a VM is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 404) 		started in a potentially insecure configuration.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 405) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 406)   off		Disables hypervisor mitigations and doesn't emit any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 407) 		warnings.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 408) 		It also drops the swap size and available RAM limit restrictions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 409) 		on both hypervisor and bare metal.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 410) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 411)   ============  =============================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 412) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 413) The default is 'flush'. For details about L1D flushing see :ref:`l1d_flush`.
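
The table above can be read as a lookup from the "l1tf=" argument to a set
of effects. The following Python sketch transcribes it for illustration
only; it is not kernel code. The names ``L1tfPolicy``, ``L1TF_POLICIES`` and
``l1tf_policy`` are hypothetical, and two points are assumptions rather than
statements from the table: that the last "l1tf=" occurrence on the command
line wins, and that 'off' leaves SMT and the runtime controls untouched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class L1tfPolicy:
    disables_smt: bool      # SMT is switched off at boot
    l1d_flush: str          # "always", "cond" or "never"
    warns: bool             # hypervisor warns on a potentially insecure VM start
    runtime_control: bool   # sysfs SMT / L1D flush control remains usable


# Transcription of the table above. For 'off' the table does not mention
# SMT or runtime control; the values used here are assumptions.
L1TF_POLICIES = {
    "full":         L1tfPolicy(True,  "always", True,  True),
    "full,force":   L1tfPolicy(True,  "always", True,  False),
    "flush":        L1tfPolicy(False, "cond",   True,  True),
    "flush,nosmt":  L1tfPolicy(True,  "cond",   True,  True),
    "flush,nowarn": L1tfPolicy(False, "cond",   False, True),
    "off":          L1tfPolicy(False, "never",  False, True),
}

DEFAULT_L1TF = "flush"  # documented default


def l1tf_policy(cmdline: str) -> L1tfPolicy:
    """Return the policy selected by the 'l1tf=' option in cmdline."""
    value = DEFAULT_L1TF
    for token in cmdline.split():
        if token.startswith("l1tf="):
            # Assumption: a later occurrence overrides an earlier one.
            value = token[len("l1tf="):]
    if value not in L1TF_POLICIES:
        # Design choice of this sketch: reject unknown arguments loudly.
        raise ValueError(f"unknown l1tf argument: {value!r}")
    return L1TF_POLICIES[value]
```

For example, ``l1tf_policy("ro quiet l1tf=full,force")`` yields a policy with
SMT disabled, unconditional flushing and no runtime control, while a command
line without "l1tf=" falls back to the 'flush' row.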
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 414) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 415) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 416) .. _mitigation_control_kvm:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 417) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 418) Mitigation control for KVM - module parameter
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 419) -------------------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 420) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 421) The KVM hypervisor mitigation mechanism, flushing the L1D cache when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 422) entering a guest, can be controlled with a module parameter.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 423) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 424) The option/parameter is "kvm-intel.vmentry_l1d_flush=". It takes the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 425) following arguments:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 426) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 427)   ============  ==============================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 428)   always	L1D cache flush on every VMENTER.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 429) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 430)   cond		Flush L1D on VMENTER only when the code between VMEXIT and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 431) 		VMENTER can leak host memory which is considered
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 432) 		interesting for an attacker. This can still leak host memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 433) 		which e.g. reveals the host's address space layout.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 434) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 435)   never		Disables the mitigation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 436)   ============  ==============================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 437) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 438) The parameter can be provided on the kernel command line, as a module
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 439) parameter when loading the modules, and modified at runtime via the sysfs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 440) file:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 441) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 442) /sys/module/kvm_intel/parameters/vmentry_l1d_flush
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 443) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 444) The default is 'cond'. If 'l1tf=full,force' is given on the kernel command
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 445) line, then 'always' is enforced and the kvm-intel.vmentry_l1d_flush
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 446) module parameter is ignored and writes to the sysfs file are rejected.
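
As a minimal sketch, the current setting can be inspected by reading the
sysfs file named above. The helper name ``read_vmentry_l1d_flush`` is
hypothetical, and returning ``None`` when the file cannot be read (for
example because the kvm_intel module is not loaded) is a design choice of
this example. The value is returned as found, without validation.

```python
from pathlib import Path
from typing import Optional

# Path taken from the text above.
VMENTRY_L1D_FLUSH = Path("/sys/module/kvm_intel/parameters/vmentry_l1d_flush")


def read_vmentry_l1d_flush(path: Path = VMENTRY_L1D_FLUSH) -> Optional[str]:
    """Return the current setting as a string, or None if it is unavailable."""
    try:
        # The file holds a single line; strip the trailing newline.
        return path.read_text().strip()
    except OSError:
        # File missing or not readable: report "unknown" instead of failing.
        return None
```

Calling ``read_vmentry_l1d_flush()`` without arguments reads the real sysfs
file; passing another path makes the helper usable on systems without KVM.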
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 447) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 448) .. _mitigation_selection:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 449) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 450) Mitigation selection guide
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 451) --------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 452) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 453) 1. No virtualization in use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 454) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 455) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 456)    The system is protected by the kernel unconditionally and no further
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 457)    action is required.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 458) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 459) 2. Virtualization with trusted guests
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 460) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 461) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 462)    If the guest comes from a trusted source and the guest OS kernel is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 463)    guaranteed to have the L1TF mitigations in place, the system is fully
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 464)    protected against L1TF and no further action is required.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 465) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 466)    To avoid the overhead of the default L1D flushing on VMENTER the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 467)    administrator can disable the flushing via the kernel command line and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 468)    sysfs control files. See :ref:`mitigation_control_command_line` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 469)    :ref:`mitigation_control_kvm`.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 470) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 471) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 472) 3. Virtualization with untrusted guests
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 473) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 474) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 475) 3.1. SMT not supported or disabled
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 476) """"""""""""""""""""""""""""""""""
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 477) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 478)   If SMT is not supported by the processor or disabled in the BIOS or by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 479)   the kernel, it's only required to enforce L1D flushing on VMENTER.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 480) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 481)   Conditional L1D flushing is the default behaviour and can be tuned. See
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 482)   :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 483) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 484) 3.2. EPT not supported or disabled
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 485) """"""""""""""""""""""""""""""""""
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 486) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 487)   If EPT is not supported by the processor or disabled in the hypervisor,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 488)   the system is fully protected. SMT can stay enabled and L1D flushing on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 489)   VMENTER is not required.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 490) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 491)   EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 492) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 493) 3.3. SMT and EPT supported and active
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 494) """""""""""""""""""""""""""""""""""""
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 495) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 496)   If SMT and EPT are supported and active then various degrees of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 497)   mitigations can be employed:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 498) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 499)   - L1D flushing on VMENTER:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 500) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 501)     L1D flushing on VMENTER is the minimal protection requirement, but it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 502)     is only potent in combination with other mitigation methods.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 503) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 504)     Conditional L1D flushing is the default behaviour and can be tuned. See
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 505)     :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 506) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 507)   - Guest confinement:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 508) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 509)     Confinement of guests to a single or a group of physical cores which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 510)     are not running any other processes, can reduce the attack surface
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 511)     significantly, but interrupts, soft interrupts and kernel threads can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 512)     still expose valuable data to a potential attacker. See
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 513)     :ref:`guest_confinement`.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 514) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 515)   - Interrupt isolation:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 516) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 517)     Isolating the guest CPUs from interrupts can reduce the attack surface
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 518)     further, but still allows a malicious guest to explore a limited amount
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 519)     of host physical memory. This can at least be used to gain knowledge
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 520)     about the host address space layout. The interrupts which have a fixed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 521)     affinity to the CPUs which run the untrusted guests can, depending on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 522)     the scenario, still trigger soft interrupts and schedule kernel threads
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 523)     which might expose valuable information. See
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 524)     :ref:`interrupt_isolation`.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 525) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 526) The above three mitigation methods combined can provide protection to a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 527) certain degree, but the risk of the remaining attack surface has to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 528) carefully analyzed. For full protection the following methods are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 529) available:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 530) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 531)   - Disabling SMT:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 532) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 533)     Disabling SMT and enforcing the L1D flushing provides the maximum
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 534)     amount of protection. This mitigation does not depend on any of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 535)     above mitigation methods.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 536) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 537)     SMT control and L1D flushing can be tuned by the command line
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 538)     parameters 'nosmt', 'l1tf', 'kvm-intel.vmentry_l1d_flush' and at run
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 539)     time with the matching sysfs control files. See :ref:`smt_control`,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 540)     :ref:`mitigation_control_command_line` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 541)     :ref:`mitigation_control_kvm`.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 542) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 543)   - Disabling EPT:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 544) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 545)     Disabling EPT provides the maximum amount of protection as well. It does
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 546)     not depend on any of the above mitigation methods. SMT can stay
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 547)     enabled and L1D flushing is not required, but the performance impact is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 548)     significant.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 549) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 550)     EPT can be disabled in the hypervisor via the 'kvm-intel.ept'
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 551)     parameter.
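
Sections 1 to 3.3 of this guide amount to a small decision procedure. The
following Python sketch restates it for illustration; the function name
``l1tf_recommendation`` and the returned strings are hypothetical and only
summarize the text. ``guests_trusted`` is meant to be true only when the
guest comes from a trusted source and its kernel has the L1TF mitigations in
place.

```python
def l1tf_recommendation(virtualization: bool, guests_trusted: bool,
                        smt_active: bool, ept_active: bool) -> str:
    """Summarize the selection guide for one host configuration."""
    if not virtualization:
        # Section 1: the kernel protects the system unconditionally.
        return "no action required"
    if guests_trusted:
        # Section 2: protected; the L1D flush may be disabled to save overhead.
        return "no action required; L1D flushing may be disabled"
    if not ept_active:
        # Section 3.2: checked before SMT, because without EPT the SMT
        # state does not matter.
        return "fully protected; SMT can stay enabled, no L1D flush required"
    if not smt_active:
        # Section 3.1: only the flush on VMENTER has to be enforced.
        return "enforce L1D flushing on VMENTER"
    # Section 3.3: the two options that provide full protection.
    return "disable SMT and enforce L1D flushing, or disable EPT"
```

The partial mitigations of section 3.3 (guest confinement and interrupt
isolation) are deliberately left out, since they reduce the attack surface
without providing full protection.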
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 552) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 553) 3.4. Nested virtual machines
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 554) """"""""""""""""""""""""""""
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 555) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 556) When nested virtualization is in use, three operating systems are involved:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 557) the bare metal hypervisor, the nested hypervisor and the nested virtual
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 558) machine.  VMENTER operations from the nested hypervisor into the nested
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 559) guest will always be processed by the bare metal hypervisor. If KVM is the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 560) bare metal hypervisor it will:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 561) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 562)  - Flush the L1D cache on every switch from the nested hypervisor to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 563)    nested virtual machine, so that the nested hypervisor's secrets are not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 564)    exposed to the nested virtual machine;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 565) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 566)  - Flush the L1D cache on every switch from the nested virtual machine to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 567)    the nested hypervisor; this is a complex operation, and flushing the L1D
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 568)    cache prevents the bare metal hypervisor's secrets from being exposed to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 569)    the nested virtual machine;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 570) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 571)  - Instruct the nested hypervisor to not perform any L1D cache flush. This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 572)    is an optimization to avoid double L1D flushing.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 573) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 574) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 575) .. _default_mitigations:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 576) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 577) Default mitigations
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 578) -------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 579) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 580)   The kernel default mitigations for vulnerable processors are:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 581) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 582)   - PTE inversion to protect against malicious user space. This is done
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 583)     unconditionally and cannot be controlled. The swap storage is limited
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 584)     to ~16TB.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 585) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 586)   - L1D conditional flushing on VMENTER when EPT is enabled for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 587)     a guest.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 588) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 589)   The kernel does not by default enforce the disabling of SMT, which leaves
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 590)   SMT systems vulnerable when running untrusted guests with EPT enabled.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 591) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 592)   The rationale for this choice is:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 593) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 594)   - Force disabling SMT can break existing setups, especially with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 595)     unattended updates.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 596) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 597)   - If regular users run untrusted guests on their machine, then L1TF is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 598)     just an add on to other malware which might be embedded in an untrusted
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 599)     guest, e.g. spam-bots or attacks on the local network.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 600) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 601)     There is no technical way to prevent a user from running untrusted code
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 602)     on their machines blindly.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 603) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 604)   - It's technically extremely unlikely and from today's knowledge even
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 605)     impossible that L1TF can be exploited via the most popular attack
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 606)     mechanisms like JavaScript because these mechanisms have no way to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 607)     control PTEs. If this were possible and no other mitigation were
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 608)     available, then the default might be different.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 609) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 610)   - The administrators of cloud and hosting setups have to carefully
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 611)     analyze the risk for their scenarios and make the appropriate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 612)     mitigation choices, which might even vary across their deployed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 613)     machines and also result in other changes of their overall setup.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 614)     There is no way for the kernel to provide a sensible default for this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 615)     kind of scenario.