^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) ======================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2) Kernel Self-Protection
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) ======================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) Kernel self-protection is the design and implementation of systems and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6) structures within the Linux kernel to protect against security flaws in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) the kernel itself. This covers a wide range of issues, including removing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8) entire classes of bugs, blocking security flaw exploitation methods,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9) and actively detecting attack attempts. Not all topics are explored in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10) this document, but it should serve as a reasonable starting point and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11) answer any frequently asked questions. (Patches welcome, of course!)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) In the worst-case scenario, we assume an unprivileged local attacker
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14) has arbitrary read and write access to the kernel's memory. In many
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15) cases, bugs being exploited will not provide this level of access,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16) but with systems in place that defend against the worst case we'll
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17) cover the more limited cases as well. A higher bar, and one that should
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18) still be kept in mind, is protecting the kernel against a _privileged_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) local attacker, since the root user has access to a vastly increased
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20) attack surface. (Especially when they have the ability to load arbitrary
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21) kernel modules.)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) The goals for successful self-protection systems would be that they
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24) are effective, on by default, require no opt-in by developers, have no
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25) performance impact, do not impede kernel debugging, and have tests. It
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26) is uncommon that all these goals can be met, but it is worth explicitly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27) mentioning them, since these aspects need to be explored, dealt with,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28) and/or accepted.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31) Attack Surface Reduction
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32) ========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34) The most fundamental defense against security exploits is to reduce the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35) areas of the kernel that can be used to redirect execution. This includes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) limiting the exposed APIs available to userspace, making in-kernel APIs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37) hard to use incorrectly, and minimizing the areas of writable kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40) Strict kernel memory permissions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41) --------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43) When all of kernel memory is writable, it becomes trivial for attacks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) to redirect execution flow. To reduce the availability of these targets
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45) the kernel needs to protect its memory with a tight set of permissions.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47) Executable code and read-only data must not be writable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) Any areas of the kernel with executable memory must not be writable.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51) While this obviously includes the kernel text itself, we must consider
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52) all additional places too: kernel modules, JIT memory, etc. (There are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53) temporary exceptions to this rule to support things like instruction
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) alternatives, breakpoints, kprobes, etc. If these must exist in a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) kernel, they are implemented in a way where the memory is temporarily
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56) made writable during the update, and then returned to the original
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57) permissions.)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59) In support of this are ``CONFIG_STRICT_KERNEL_RWX`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60) ``CONFIG_STRICT_MODULE_RWX``, which seek to make sure that code is not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61) writable, data is not executable, and read-only data is neither writable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) nor executable.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64) Most architectures have these options on by default and not selectable by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65) the user. For architectures such as arm that want them to be selectable,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66) the architecture Kconfig can select ``ARCH_OPTIONAL_KERNEL_RWX`` to enable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67) a Kconfig prompt. ``CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT`` determines
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68) the default setting when ``ARCH_OPTIONAL_KERNEL_RWX`` is enabled.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70) Function pointers and sensitive variables must not be writable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73) Vast areas of kernel memory contain function pointers that are looked
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74) up by the kernel and used to continue execution (e.g. descriptor/vector
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75) tables, file/network/etc operation structures, etc). The number of these
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76) variables must be reduced to an absolute minimum.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78) Many such variables can be made read-only by setting them "const"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79) so that they live in the .rodata section instead of the .data section
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80) of the kernel, gaining the protection of the kernel's strict memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81) permissions as described above.
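
As a rough userspace illustration of the ``const`` approach (the
``demo_ops`` structure and its handlers are hypothetical names, not kernel
APIs), a table of function pointers can be declared so that the toolchain
places it in read-only data:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operations table, loosely modeled on *_operations structs. */
struct demo_ops {
	int (*open)(int id);
	int (*close)(int id);
};

static int demo_open(int id)  { return id + 1; }
static int demo_close(int id) { return id - 1; }

/*
 * "static const" asks the toolchain to place the table in .rodata
 * rather than .data; with strict memory permissions that section is
 * mapped read-only.
 */
static const struct demo_ops demo_fops = {
	.open  = demo_open,
	.close = demo_close,
};

/* Callers only ever hold a pointer-to-const and cannot retarget entries. */
const struct demo_ops *demo_get_ops(void)
{
	return &demo_fops;
}

int demo_dispatch_open(const struct demo_ops *ops, int id)
{
	assert(ops != NULL && ops->open != NULL);
	return ops->open(id);
}
```

Any code that tries to assign to ``demo_fops.open`` fails to compile, and
a write through a forged pointer would fault once the section is mapped
read-only.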
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83) Variables that are initialized once at ``__init`` time can be marked
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84) with the ``__ro_after_init`` attribute, which makes them read-only once
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85) kernel initialization has finished.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87) What remains are variables that are updated rarely (e.g. GDT). These
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88) will need another infrastructure (similar to the temporary exceptions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89) made to kernel code mentioned above) that allows them to spend the rest
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90) of their lifetime read-only. (For example, when being updated, only the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91) CPU thread performing the update would be given uninterruptible write
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92) access to the memory.)
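
The "read-only except during a rare update" idea can be approximated in
userspace with ``mprotect()``. This is only a sketch: the ``guarded_*``
names are hypothetical, and unlike the per-CPU write window suggested
above, the window here is visible to every thread in the process:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical "mostly read-only" variable living in its own page. */
static long *guarded;
static size_t guarded_len;

/* Allocate a private page, store the initial value, then drop write access. */
int guarded_init(long initial)
{
	void *page;

	guarded_len = (size_t)sysconf(_SC_PAGESIZE);
	page = mmap(NULL, guarded_len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (page == MAP_FAILED)
		return -1;
	guarded = page;
	*guarded = initial;
	return mprotect(guarded, guarded_len, PROT_READ);
}

/* Rare update: open a short write window, store, and close it again. */
int guarded_update(long value)
{
	assert(guarded != NULL);
	if (mprotect(guarded, guarded_len, PROT_READ | PROT_WRITE) != 0)
		return -1;
	*guarded = value;
	return mprotect(guarded, guarded_len, PROT_READ);
}

long guarded_read(void)
{
	assert(guarded != NULL);
	return *guarded;
}
```

Outside ``guarded_update()``, any store to ``*guarded`` faults, which is
the property wanted for rarely updated kernel variables.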
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94) Segregation of kernel memory from userspace memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97) The kernel must never execute userspace memory. The kernel must also never
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98) access userspace memory without explicit expectation to do so. These
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99) rules can be enforced either by support of hardware-based restrictions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) (x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's Memory Domains).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) By blocking userspace memory in this way, execution and data parsing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) cannot be passed to trivially-controlled userspace memory, forcing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) attacks to operate entirely in kernel memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) Reduced access to syscalls
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) --------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) One trivial way to eliminate many syscalls for 64-bit systems is building
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) without ``CONFIG_COMPAT``. However, this is rarely a feasible scenario.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) The "seccomp" system provides an opt-in feature made available to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) userspace, which provides a way to reduce the number of kernel entry
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) points available to a running process. This limits the breadth of kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) code that can be reached, possibly reducing the availability of a given
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) bug to an attack.
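
A minimal sketch of such an opt-in filter follows. It assumes a Linux
system with the seccomp UAPI headers available. The helper names
(``build_deny_filter``, ``install_deny_filter``) are made up for this
example, and the filter skips the architecture check that a production
filter should perform:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>

#define DENY_FILTER_LEN 4

/*
 * Fill @insns with a classic-BPF program that fails syscall @nr with
 * EPERM and allows everything else.  A real filter should also check
 * seccomp_data.arch before trusting the syscall number.
 */
void build_deny_filter(struct sock_filter insns[DENY_FILTER_LEN], int nr)
{
	const struct sock_filter prog[DENY_FILTER_LEN] = {
		/* Load the syscall number. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* On a match run the next insn, otherwise skip over it. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (unsigned int)nr, 0, 1),
		BPF_STMT(BPF_RET | BPF_K,
			 SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};

	assert(nr >= 0);
	for (size_t i = 0; i < DENY_FILTER_LEN; i++)
		insns[i] = prog[i];
}

/* Install the filter on the calling thread; returns 0 on success. */
int install_deny_filter(int nr)
{
	struct sock_filter insns[DENY_FILTER_LEN];
	struct sock_fprog fprog = { .len = DENY_FILTER_LEN, .filter = insns };

	build_deny_filter(insns, nr);
	/* Needed so that an unprivileged task may attach a filter. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog);
}
```

Once installed, the filter is inherited across ``fork()`` and ``execve()``
and cannot be removed, so a sandboxed process cannot regain the syscall.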
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) An area of improvement would be creating viable ways to keep access to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) things like compat, user namespaces, BPF creation, and perf limited only
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) to trusted processes. This would keep the scope of kernel entry points
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) restricted to the more regular set normally available to unprivileged
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) userspace.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) Restricting access to kernel modules
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) ------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) The kernel should never allow an unprivileged user the ability to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) load specific kernel modules, since that would provide a facility to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) unexpectedly extend the available attack surface. (The on-demand loading
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) of modules via their predefined subsystems, e.g. MODULE_ALIAS_*, is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) considered "expected" here, though additional consideration should be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) given even to these.) For example, loading a filesystem module via an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) unprivileged socket API is nonsense: only the root or physically local
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) user should trigger filesystem module loading. (And even this can be up
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) for debate in some scenarios.)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) To protect against even privileged users, systems may need to either
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) disable module loading entirely (e.g. monolithic kernel builds or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) modules_disabled sysctl), or provide signed modules (e.g.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) ``CONFIG_MODULE_SIG_FORCE``, or dm-crypt with LoadPin), to keep from having
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) root load arbitrary kernel code via the module loader interface.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) Memory integrity
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) ================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146) There are many memory structures in the kernel that are regularly abused
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) to gain execution control during an attack. By far the most commonly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) understood is that of the stack buffer overflow in which the return
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) address stored on the stack is overwritten. Many other examples of this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150) kind of attack exist, and protections exist to defend against them.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152) Stack buffer overflow
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153) ---------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) The classic stack buffer overflow involves writing past the expected end
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) of a variable stored on the stack, ultimately writing a controlled value
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) to the stack frame's stored return address. The most widely used defense
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158) is the presence of a stack canary between the stack variables and the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159) return address (``CONFIG_STACKPROTECTOR``), which is verified just before
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) the function returns. Other defenses include things like shadow stacks.
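
The canary check is normally emitted by the compiler, but its logic can be
shown by hand. The sketch below is illustrative only: ``demo_frame`` is a
hypothetical stand-in for a real stack frame, and the canary is a fixed
constant, whereas a real canary is a random secret:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical "frame" layout: the canary sits just past the buffer. */
struct demo_frame {
	char buf[16];
	unsigned long canary;
};

/* A real canary is a random secret; it is fixed here for the demo. */
static const unsigned long demo_canary = 0x5ca1ab1eUL;

/*
 * Copy @len bytes into the frame the way a buggy function might (no
 * check against sizeof(buf)), then verify the canary as the function
 * epilogue would.  Returns 0 if intact, -1 if it was overwritten.
 */
int demo_copy_and_check(const char *src, size_t len)
{
	struct demo_frame frame;

	frame.canary = demo_canary;
	/* Stay inside the struct so the demo itself stays well defined. */
	assert(len <= sizeof(frame) - offsetof(struct demo_frame, buf));
	memcpy((char *)&frame + offsetof(struct demo_frame, buf), src, len);

	return frame.canary == demo_canary ? 0 : -1;
}
```

A linear overflow has to cross the canary before it can reach anything
stored beyond it, so the mismatch is caught before the function returns.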
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162) Stack depth overflow
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) --------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) A less well understood attack is using a bug that triggers the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) kernel to consume stack memory with deep function calls or large stack
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) allocations. With this attack it is possible to write beyond the end of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) the kernel's preallocated stack space and into sensitive structures. Two
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) important changes need to be made for better protections: moving the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) sensitive thread_info structure elsewhere, and adding a faulting memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171) hole at the bottom of the stack to catch these overflows.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) Heap memory integrity
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174) ---------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) The structures used to track heap free lists can be sanity-checked during
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177) allocation and freeing to make sure they aren't being used to manipulate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178) other memory areas.
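
To make the idea concrete, here is a toy fixed-size pool (all names are
hypothetical, and this is not how the kernel's slab allocators are
implemented) that validates free-list pointers on both allocation and
free:

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SLOTS 4

/* Hypothetical fixed-size object pool with an intrusive free list. */
struct pool_obj {
	struct pool_obj *next_free;
	char payload[24];
};

struct pool {
	struct pool_obj objs[POOL_SLOTS];
	struct pool_obj *free_head;
};

void pool_init(struct pool *p)
{
	assert(p != NULL);
	p->free_head = NULL;
	for (size_t i = 0; i < POOL_SLOTS; i++) {
		p->objs[i].next_free = p->free_head;
		p->free_head = &p->objs[i];
	}
}

/* True if @o points at one of the pool's own slots. */
static int pool_owns(const struct pool *p, const struct pool_obj *o)
{
	for (size_t i = 0; i < POOL_SLOTS; i++)
		if (o == &p->objs[i])
			return 1;
	return 0;
}

/* Pop an object, refusing to follow a link that leaves the pool. */
struct pool_obj *pool_alloc(struct pool *p)
{
	struct pool_obj *o = p->free_head;

	if (o == NULL || !pool_owns(p, o))
		return NULL;
	if (o->next_free != NULL && !pool_owns(p, o->next_free))
		return NULL;	/* corrupted link: do not hand it out */
	p->free_head = o->next_free;
	return o;
}

/* Push an object back; -1 on a foreign pointer or an immediate double free. */
int pool_free(struct pool *p, struct pool_obj *o)
{
	if (!pool_owns(p, o) || o == p->free_head)
		return -1;
	o->next_free = p->free_head;
	p->free_head = o;
	return 0;
}
```

The double-free check only catches an object freed twice in a row, which
keeps it cheap. A corrupted ``next_free`` is rejected before it can turn
into an allocation at an attacker-chosen address.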
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180) Counter integrity
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181) -----------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183) Many places in the kernel use atomic counters to track object references
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184) or perform similar lifetime management. When these counters can be made
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185) to wrap (over or under), this traditionally exposes a use-after-free
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186) flaw. By trapping atomic wrapping, this class of bug vanishes.
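
One way to trap wrapping is to saturate the counter instead. The sketch
below uses C11 atomics and hypothetical ``demo_ref_*`` names. It shows the
saturate-instead-of-wrap idea only and is not the kernel's own reference
counting code:

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical saturation value: once reached, the counter is pinned. */
#define DEMO_REF_SATURATED UINT_MAX

struct demo_ref {
	atomic_uint count;
};

void demo_ref_init(struct demo_ref *r, unsigned int n)
{
	assert(n > 0);
	atomic_init(&r->count, n);
}

/* Take a reference.  Refuses to resurrect 0 and pins instead of wrapping. */
bool demo_ref_get(struct demo_ref *r)
{
	unsigned int old = atomic_load(&r->count);

	do {
		if (old == 0)
			return false;	/* object already released */
		if (old == DEMO_REF_SATURATED)
			return true;	/* pinned: leak rather than wrap */
	} while (!atomic_compare_exchange_weak(&r->count, &old, old + 1));
	return true;
}

/* Drop a reference.  Returns true when the caller should free the object. */
bool demo_ref_put(struct demo_ref *r)
{
	unsigned int old = atomic_load(&r->count);

	do {
		if (old == 0 || old == DEMO_REF_SATURATED)
			return false;	/* underflow or pinned: never free */
	} while (!atomic_compare_exchange_weak(&r->count, &old, old - 1));
	return old == 1;
}
```

A saturated counter leaks the object, which is a far milder failure than
freeing memory that is still referenced.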
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188) Size calculation overflow detection
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189) -----------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191) Similar to counter overflow, integer overflows (usually size calculations)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192) need to be detected at runtime to kill this class of bug, which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193) traditionally leads to being able to write past the end of kernel buffers.
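
For size calculations, the overflow check can be made explicit at the
point of computation. The following sketch assumes a GCC- or
Clang-compatible compiler (for the ``__builtin_*_overflow`` helpers), and
the ``demo_*`` function names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Compute n * size + extra, or SIZE_MAX if any step overflows.
 * No allocator can satisfy SIZE_MAX, so a saturated result becomes
 * an allocation failure instead of an undersized buffer.
 */
size_t demo_alloc_size(size_t n, size_t size, size_t extra)
{
	size_t bytes;

	if (__builtin_mul_overflow(n, size, &bytes))
		return SIZE_MAX;
	if (__builtin_add_overflow(bytes, extra, &bytes))
		return SIZE_MAX;
	return bytes;
}

/* Allocate a header of @extra bytes followed by @n elements of @size bytes. */
void *demo_alloc_array(size_t n, size_t size, size_t extra)
{
	size_t bytes = demo_alloc_size(n, size, extra);

	if (bytes == SIZE_MAX)
		return NULL;
	return malloc(bytes);
}
```

With the unchecked form ``malloc(n * size + extra)``, a large ``n`` can
wrap to a small allocation that later code then overruns.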
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196) Probabilistic defenses
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197) ======================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199) While many protections can be considered deterministic (e.g. read-only
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200) memory cannot be written to), some protections provide only statistical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201) defense, in that an attack must gather enough information about a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202) running system to overcome the defense. While not perfect, these do
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203) provide meaningful defenses.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205) Canaries, blinding, and other secrets
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 206) -------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 207)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 208) It should be noted that things like the stack canary discussed earlier
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 209) are technically statistical defenses, since they rely on a secret value,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 210) and such values may become discoverable through an information exposure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 211) flaw.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 212)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 213) Blinding literal values for things like JITs, where the executable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 214) contents may be partially under the control of userspace, needs a similar
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 215) secret value.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 216)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 217) It is critical that the secret values used are separate (e.g.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 218) different canary per stack) and high entropy (e.g. is the RNG actually
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 219) working?) in order to maximize their success.
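
Blinding of literal values can be sketched with a made-up two-opcode
instruction set (``demo_insn`` and friends are hypothetical and are not
any real JIT's encoding). The caller is assumed to supply a fresh random
``rnd`` for every literal:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical two-opcode instruction set standing in for JIT output. */
enum demo_op { DEMO_MOV_IMM, DEMO_XOR_IMM };

struct demo_insn {
	enum demo_op op;
	uint32_t imm;
};

/*
 * Emit a load of @imm as two instructions carrying imm ^ rnd and rnd,
 * so the userspace-chosen literal does not appear verbatim (barring
 * an unlucky @rnd such as 0).
 */
void demo_blind_mov(struct demo_insn out[2], uint32_t imm, uint32_t rnd)
{
	out[0] = (struct demo_insn){ DEMO_MOV_IMM, imm ^ rnd };
	out[1] = (struct demo_insn){ DEMO_XOR_IMM, rnd };
}

/* Minimal interpreter, used only to show the blinded load is equivalent. */
uint32_t demo_run(const struct demo_insn *prog, size_t n)
{
	uint32_t reg = 0;

	assert(prog != NULL);
	for (size_t i = 0; i < n; i++) {
		if (prog[i].op == DEMO_MOV_IMM)
			reg = prog[i].imm;
		else
			reg ^= prog[i].imm;
	}
	return reg;
}
```

Because the attacker cannot predict ``rnd``, they cannot choose the bytes
that end up in executable memory.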
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 220)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 221) Kernel Address Space Layout Randomization (KASLR)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 222) -------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 223)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 224) Since the location of kernel memory is almost always instrumental in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 225) mounting a successful attack, making the location non-deterministic
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 226) raises the difficulty of an exploit. (Note that this in turn makes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 227) the value of information exposures higher, since they may be used to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 228) discover desired memory locations.)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 229)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 230) Text and module base
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 231) ~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 232)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 233) By relocating the physical and virtual base address of the kernel at
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 234) boot-time (``CONFIG_RANDOMIZE_BASE``), attacks needing kernel code will be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 235) frustrated. Additionally, offsetting the module loading base address
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 236) means that even systems that load the same set of modules in the same
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 237) order every boot will not share a common base address with the rest of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 238) the kernel text.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 239)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 240) Stack base
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 241) ~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 242)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 243) If the base address of the kernel stack is not the same between processes,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 244) or even not the same between syscalls, targets on or beyond the stack
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 245) become more difficult to locate.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 246)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 247) Dynamic memory base
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 248) ~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 249)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 250) Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 251) being relatively deterministic in layout due to the order of early-boot
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 252) initializations. If the base address of these areas is not the same
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 253) between boots, targeting them is frustrated, requiring an information
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 254) exposure specific to the region.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 255)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 256) Structure layout
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 257) ~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 258)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 259) By performing a per-build randomization of the layout of sensitive
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 260) structures, attacks must either be tuned to known kernel builds or expose
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 261) enough kernel memory to determine structure layouts before manipulating
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 262) them.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 263)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 264)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 265) Preventing Information Exposures
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 266) ================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 267)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 268) Since the locations of sensitive structures are the primary target for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 269) attacks, it is important to defend against exposure of both kernel memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 270) addresses and kernel memory contents (since they may contain kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 271) addresses or other sensitive things like canary values).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 272)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 273) Kernel addresses
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 274) ----------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 275)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 276) Printing kernel addresses to userspace leaks sensitive information about
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 277) the kernel memory layout. Care should be exercised when using any printk
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 278) specifier that prints the raw address, currently %px and %p[ad] (and %p[sSb]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 279) in certain circumstances [*]). Any file written to using one of these
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 280) specifiers should be readable only by privileged processes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 281)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 282) Kernels 4.14 and older printed the raw address using %p. As of 4.15-rc1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 283) addresses printed with the specifier %p are hashed before printing.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 284)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 285) [*] If KALLSYMS is enabled and symbol lookup fails, the raw address is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 286) printed. If KALLSYMS is not enabled the raw address is printed.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 287)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 288) Unique identifiers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 289) ------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 290)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 291) Kernel memory addresses must never be used as identifiers exposed to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 292) userspace. Instead, use an atomic counter, an idr, or similar unique
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 293) identifier.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 294)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 295) Memory initialization
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 296) ---------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 297)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 298) Memory copied to userspace must always be fully initialized. Unless it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 299) is explicitly cleared with memset(), this requires changes to the compiler
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 300) to make sure structure holes are cleared.
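
The explicit ``memset()`` pattern looks roughly like this. ``demo_reply``
is a hypothetical structure, and the ``out`` buffer stands in for the
destination of a copy to userspace:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical reply structure; most ABIs insert padding after "kind". */
struct demo_reply {
	char kind;
	long value;
};

/*
 * Build the reply in @out.  Zeroing the whole object first clears the
 * padding as well as any field a later code path forgets to set.
 */
void demo_fill_reply(void *out, char kind, long value)
{
	struct demo_reply r;

	assert(out != NULL);
	memset(&r, 0, sizeof(r));
	r.kind = kind;
	r.value = value;
	memcpy(out, &r, sizeof(r));
}
```

Without the ``memset()``, the bytes between ``kind`` and ``value`` would
carry whatever was previously on the stack.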
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 301)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 302) Memory poisoning
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 303) ----------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 304)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 305) When releasing memory, it is best to poison the contents, to avoid reuse
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 306) attacks that rely on the old contents of memory. E.g., clear stack on a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 307) syscall return (``CONFIG_GCC_PLUGIN_STACKLEAK``), wipe heap memory on a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 308) free. This frustrates many uninitialized variable attacks, stack content
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 309) exposures, heap content exposures, and use-after-free attacks.
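
A userspace sketch of poison-on-free follows. The ``demo_*`` names and
the poison byte are choices made for this example. The stores go through
a ``volatile`` pointer so that the compiler is not free to drop them as
dead writes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_POISON 0x6b	/* pattern chosen for this sketch */

/* Overwrite @len bytes through a volatile pointer so the stores are kept. */
void demo_poison(void *p, size_t len)
{
	volatile unsigned char *b = p;

	assert(p != NULL || len == 0);
	while (len--)
		*b++ = DEMO_POISON;
}

/* Release a heap object only after its old contents are gone. */
void demo_free_poisoned(void *p, size_t len)
{
	if (p == NULL)
		return;
	demo_poison(p, len);
	free(p);
}
```

A non-zero pattern has the side benefit that stale pointers read back
from poisoned memory are unlikely to be valid addresses.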
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 310)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 311) Destination tracking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 312) --------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 313)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 314) To help kill classes of bugs that result in kernel addresses being
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 315) written to userspace, the destination of writes needs to be tracked. If
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 316) the buffer is destined for userspace (e.g. seq_file backed ``/proc`` files),
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 317) it should automatically censor sensitive values.