^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) .. SPDX-License-Identifier: GPL-2.0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) =================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4) Memory Management
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) =================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) Complete virtual memory map with 4-level page tables
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8) ====================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10) .. note::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12) - Negative addresses such as "-23 TB" are absolute addresses in bytes, counted down
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) from the top of the 64-bit address space. It's easier to understand the layout
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14) when seen both in absolute addresses and in distance-from-top notation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16) For example, 0xffffe90000000000 == -23 TB: it is 23 TB lower than the top of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17) 64-bit address space (ffffffffffffffff).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) Note that as we get closer to the top of the address space, the notation changes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20) from TB to GB and then MB/KB.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22) - "16M TB" might look weird at first sight, but it is easier to visualize than
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) the "16 EB" notation, which few will recognize at first sight as 16 exabytes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24) It also shows nicely how incredibly large the 64-bit address space is.
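The distance-from-top notation can be reproduced with a little arithmetic. The following is a minimal user-space sketch, not kernel code: the helper name ``offset_from_top`` and the ``UNITS`` table are made up for illustration, and units are binary (1 TB = 2^40 bytes), as in the tables.

```python
# Illustrative helper (not part of the kernel): convert an absolute 64-bit
# address into the "distance from the top" notation used in the tables.
TOP = 1 << 64  # one past the highest 64-bit address (ffffffffffffffff)

# Binary units, matching the tables (1 TB = 2**40 bytes).
UNITS = {"KB": 1 << 10, "MB": 1 << 20, "GB": 1 << 30, "TB": 1 << 40, "PB": 1 << 50}


def offset_from_top(addr, unit="TB"):
    """Return the signed distance of addr from the top of the address space."""
    return (addr - TOP) / UNITS[unit]


print(offset_from_top(0xffffe90000000000))        # -23.0  (TB)
print(offset_from_top(0xffffffff80000000, "GB"))  # -2.0   (kernel text mapping)
print(offset_from_top(0xffffffffff600000, "MB"))  # -10.0  (legacy vsyscall ABI)
```

The three sample addresses are taken from the 4-level table; the unit argument mirrors how the notation switches from TB to GB and MB near the top.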
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28) ========================================================================================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29) Start addr | Offset | End addr | Size | VM area description
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30) ========================================================================================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31) | | | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32) 0000000000000000 | 0 | 00007fffffffffff | 128 TB | user-space virtual memory, different per mm
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) __________________|____________|__________________|_________|___________________________________________________________
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34) | | | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35) 0000800000000000 | +128 TB | ffff7fffffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) | | | | virtual memory addresses up to the -128 TB
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37) | | | | starting offset of kernel mappings.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) __________________|____________|__________________|_________|___________________________________________________________
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40) | Kernel-space virtual memory, shared between all processes:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41) ____________________________________________________________|___________________________________________________________
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42) | | | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43) ffff800000000000 | -128 TB | ffff87ffffffffff | 8 TB | ... guard hole, also reserved for hypervisor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) ffff880000000000 | -120 TB | ffff887fffffffff | 0.5 TB | LDT remap for PTI
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45) ffff888000000000 | -119.5 TB | ffffc87fffffffff | 64 TB | direct mapping of all physical memory (page_offset_base)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) ffffc88000000000 | -55.5 TB | ffffc8ffffffffff | 0.5 TB | ... unused hole
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47) ffffc90000000000 | -55 TB | ffffe8ffffffffff | 32 TB | vmalloc/ioremap space (vmalloc_base)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48) ffffe90000000000 | -23 TB | ffffe9ffffffffff | 1 TB | ... unused hole
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49) ffffea0000000000 | -22 TB | ffffeaffffffffff | 1 TB | virtual memory map (vmemmap_base)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) ffffeb0000000000 | -21 TB | ffffebffffffffff | 1 TB | ... unused hole
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51) ffffec0000000000 | -20 TB | fffffbffffffffff | 16 TB | KASAN shadow memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52) __________________|____________|__________________|_________|____________________________________________________________
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) | Identical layout to the 56-bit one from here on:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) ____________________________________________________________|____________________________________________________________
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56) | | | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57) fffffc0000000000 | -4 TB | fffffdffffffffff | 2 TB | ... unused hole
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58) | | | | vaddr_end for KASLR
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59) fffffe0000000000 | -2 TB | fffffe7fffffffff | 0.5 TB | cpu_entry_area mapping
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60) fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | ... unused hole
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61) ffffff0000000000 | -1 TB | ffffff7fffffffff | 0.5 TB | %esp fixup stacks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) ffffff8000000000 | -512 GB | ffffffeeffffffff | 444 GB | ... unused hole
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63) ffffffef00000000 | -68 GB | fffffffeffffffff | 64 GB | EFI region mapping space
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64) ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | ... unused hole
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65) ffffffff80000000 | -2 GB | ffffffff9fffffff | 512 MB | kernel text mapping, mapped to physical address 0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66) ffffffff80000000 |-2048 MB | | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67) ffffffffa0000000 |-1536 MB | fffffffffeffffff | 1520 MB | module mapping space
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68) ffffffffff000000 | -16 MB | | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69) FIXADDR_START | ~-11 MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70) ffffffffff600000 | -10 MB | ffffffffff600fff | 4 kB | legacy vsyscall ABI
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71) ffffffffffe00000 | -2 MB | ffffffffffffffff | 2 MB | ... unused hole
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72) __________________|____________|__________________|_________|___________________________________________________________
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75) Complete virtual memory map with 5-level page tables
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76) ====================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78) .. note::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80) - With 56-bit addresses, user-space memory gets expanded by a factor of 512x,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81) from 0.125 PB to 64 PB. All kernel mappings shift down to the -64 PB starting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82) offset and many of the regions expand to accommodate the much larger
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83) supported physical memory.
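The numbers in this note follow from the address widths. As a rough check, assuming binary units (1 PB = 2^50 bytes) and that user space gets the lower half of the address space (47 bits of a 48-bit address, 56 bits of a 57-bit one), the variable names below are illustrative only:

```python
# Worked arithmetic for the 4-level vs. 5-level user-space sizes.
PB = 1 << 50

user_4level = 1 << 47  # lower half of a 48-bit virtual address space
user_5level = 1 << 56  # lower half of a 57-bit virtual address space

print(user_4level / PB)             # 0.125 (PB, i.e. 128 TB)
print(user_5level / PB)             # 64.0  (PB)
print(user_5level // user_4level)   # 512   (expansion factor)

# Kernel mappings start 64 PB below the top of the 64-bit address space.
kernel_start = (1 << 64) - user_5level
print(hex(kernel_start))            # 0xff00000000000000
```

The last value matches the first kernel-space row of the 5-level table (the guard hole at -64 PB).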
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87) ========================================================================================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88) Start addr | Offset | End addr | Size | VM area description
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89) ========================================================================================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90) | | | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91) 0000000000000000 | 0 | 00ffffffffffffff | 64 PB | user-space virtual memory, different per mm
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92) __________________|____________|__________________|_________|___________________________________________________________
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93) | | | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94) 0100000000000000 | +64 PB | feffffffffffffff | ~16K PB | ... huge, still almost 64 bits wide hole of non-canonical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95) | | | | virtual memory addresses up to the -64 PB
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96) | | | | starting offset of kernel mappings.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97) __________________|____________|__________________|_________|___________________________________________________________
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99) | Kernel-space virtual memory, shared between all processes:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) ____________________________________________________________|___________________________________________________________
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) | | | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) ff00000000000000 | -64 PB | ff0fffffffffffff | 4 PB | ... guard hole, also reserved for hypervisor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) ff10000000000000 | -60 PB | ff10ffffffffffff | 0.25 PB | LDT remap for PTI
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) ff11000000000000 | -59.75 PB | ff90ffffffffffff | 32 PB | direct mapping of all physical memory (page_offset_base)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) ff91000000000000 | -27.75 PB | ff9fffffffffffff | 3.75 PB | ... unused hole
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) ffa0000000000000 | -24 PB | ffd1ffffffffffff | 12.5 PB | vmalloc/ioremap space (vmalloc_base)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) ffd2000000000000 | -11.5 PB | ffd3ffffffffffff | 0.5 PB | ... unused hole
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) ffd4000000000000 | -11 PB | ffd5ffffffffffff | 0.5 PB | virtual memory map (vmemmap_base)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) ffd6000000000000 | -10.5 PB | ffdeffffffffffff | 2.25 PB | ... unused hole
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) ffdf000000000000 | -8.25 PB | fffffbffffffffff | ~8 PB | KASAN shadow memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) __________________|____________|__________________|_________|____________________________________________________________
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) | Identical layout to the 47-bit one from here on:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) ____________________________________________________________|____________________________________________________________
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) | | | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) fffffc0000000000 | -4 TB | fffffdffffffffff | 2 TB | ... unused hole
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) | | | | vaddr_end for KASLR
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) fffffe0000000000 | -2 TB | fffffe7fffffffff | 0.5 TB | cpu_entry_area mapping
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | ... unused hole
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) ffffff0000000000 | -1 TB | ffffff7fffffffff | 0.5 TB | %esp fixup stacks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) ffffff8000000000 | -512 GB | ffffffeeffffffff | 444 GB | ... unused hole
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) ffffffef00000000 | -68 GB | fffffffeffffffff | 64 GB | EFI region mapping space
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | ... unused hole
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) ffffffff80000000 | -2 GB | ffffffff9fffffff | 512 MB | kernel text mapping, mapped to physical address 0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) ffffffff80000000 |-2048 MB | | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) ffffffffa0000000 |-1536 MB | fffffffffeffffff | 1520 MB | module mapping space
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) ffffffffff000000 | -16 MB | | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) FIXADDR_START | ~-11 MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) ffffffffff600000 | -10 MB | ffffffffff600fff | 4 kB | legacy vsyscall ABI
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) ffffffffffe00000 | -2 MB | ffffffffffffffff | 2 MB | ... unused hole
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) __________________|____________|__________________|_________|___________________________________________________________
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) The architecture defines a 64-bit virtual address. Implementations can support
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) fewer bits. Currently supported are 48- and 57-bit virtual addresses. Bits 63
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) through to the most-significant implemented bit are sign extended.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) This causes a hole between user-space and kernel addresses if you interpret them
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) as unsigned.
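The sign-extension rule can be illustrated with a small user-space sketch. The helpers ``sign_extend`` and ``is_canonical`` are hypothetical names chosen for this example, not kernel functions. An address counts as canonical here when sign-extending its implemented bits reproduces the address itself.

```python
# Illustrative check for canonical addresses under 48- and 57-bit addressing.
MASK64 = (1 << 64) - 1


def sign_extend(addr, va_bits):
    """Copy bit (va_bits - 1) of addr into all higher bits up to bit 63."""
    low_mask = (1 << va_bits) - 1
    low = addr & low_mask
    if low >> (va_bits - 1):               # top implemented bit set: kernel half
        return low | (MASK64 & ~low_mask)
    return low                             # top implemented bit clear: user half


def is_canonical(addr, va_bits):
    """True if addr is unchanged by sign extension from va_bits."""
    return sign_extend(addr, va_bits) == addr


# Boundary addresses taken from the two tables above.
print(is_canonical(0x00007fffffffffff, 48))  # True  (last user-space address)
print(is_canonical(0x0000800000000000, 48))  # False (first address of the hole)
print(is_canonical(0xffff800000000000, 48))  # True  (first kernel address)
print(is_canonical(0x0100000000000000, 57))  # False (first address of the hole)
print(is_canonical(0xff00000000000000, 57))  # True  (first kernel address)
```

Read as unsigned numbers, the non-canonical addresses form the large hole shown between the user-space and kernel-space rows of each table.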
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) The direct mapping covers all memory in the system up to the highest
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) memory address (this means in some cases it can also include PCI memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) holes).
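As a rough illustration of what a direct mapping means, a physical address maps to the virtual address at the same offset from the base of the region. The sketch below assumes the 4-level layout with the non-randomized base from the table (that is, CONFIG_RANDOMIZE_MEMORY disabled); the Python helpers are illustrative stand-ins, not the kernel's own implementation.

```python
# Illustrative model of the direct mapping in the 4-level layout.
# Assumption: page_offset_base has its non-randomized table value.
PAGE_OFFSET_BASE = 0xffff888000000000
DIRECT_MAP_SIZE = 64 << 40  # 64 TB, the size of the region in the table


def phys_to_virt(phys):
    """Virtual address of a physical address inside the direct mapping."""
    if not 0 <= phys < DIRECT_MAP_SIZE:
        raise ValueError("physical address outside the direct mapping")
    return PAGE_OFFSET_BASE + phys


def virt_to_phys(virt):
    """Inverse of phys_to_virt() for addresses inside the direct mapping."""
    phys = virt - PAGE_OFFSET_BASE
    if not 0 <= phys < DIRECT_MAP_SIZE:
        raise ValueError("virtual address outside the direct mapping")
    return phys


print(hex(phys_to_virt(0x1000)))               # 0xffff888000001000
print(hex(phys_to_virt(DIRECT_MAP_SIZE - 1)))  # 0xffffc87fffffffff
print(hex(virt_to_phys(0xffff888000001000)))   # 0x1000
```

The second line of output is the end address of the direct-mapping row in the 4-level table.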
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) vmalloc space is lazily synchronized into the different PML4/PML5 pages of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) the processes using the page fault handler, with init_top_pgt as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) reference.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) We map EFI runtime services in the 'efi_pgd' PGD in a 64 GB large virtual
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) memory window (this size is arbitrary; it can be raised later if needed).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) The mappings are not part of any other kernel PGD and are only available
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150) during EFI runtime calls.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152) Note that if CONFIG_RANDOMIZE_MEMORY is enabled, the direct mapping of all
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153) physical memory, vmalloc/ioremap space and virtual memory map are randomized.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) Their order is preserved but their base will be offset early at boot time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) Be very careful about KASLR when changing anything here. The KASLR address
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) range must not overlap with anything except the KASAN shadow area, which is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158) acceptable because KASAN disables KASLR.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) For both 4- and 5-level layouts, the STACKLEAK_POISON value lies in the last 2 MB
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) hole: ffffffffffff4111
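A quick way to confirm this is to compare the value against the bounds of the last row in the tables. The sketch below uses only the addresses quoted in this document; the function name ``in_last_hole`` is made up for illustration.

```python
# Check that the STACKLEAK_POISON value falls inside the last 2 MB hole.
TOP = 1 << 64
HOLE_START = 0xffffffffffe00000  # -2 MB, start of the last unused hole
HOLE_END = 0xffffffffffffffff    # top of the 64-bit address space
STACKLEAK_POISON = 0xffffffffffff4111


def in_last_hole(addr):
    """True if addr lies in the final 2 MB hole of either layout."""
    return HOLE_START <= addr <= HOLE_END


print(in_last_hole(STACKLEAK_POISON))   # True
print(hex(TOP - STACKLEAK_POISON))      # 0xbeef
```

The second line shows that the value sits 0xbeef bytes below the top, so it reads as -0xBEEF when interpreted as a signed 64-bit integer.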