^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) .. SPDX-License-Identifier: GPL-2.0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) ==============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4) 5-level paging
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) ==============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) Overview
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8) ========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9) Original x86-64 was limited by 4-level paging to 256 TiB of virtual address
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10) space and 64 TiB of physical address space. We are already bumping into
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11) this limit: some vendors offer servers with 64 TiB of memory today.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) To overcome the limitation, upcoming hardware will introduce support for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14) 5-level paging. It is a straightforward extension of the current page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15) table structure, adding one more layer of translation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17) It bumps the limits to 128 PiB of virtual address space and 4 PiB of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18) physical address space. This "ought to be enough for anybody" ©.
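
The limits quoted above follow directly from the address widths: 48-bit
virtual and 46-bit physical addresses with 4-level paging, 57-bit virtual
and 52-bit physical addresses with 5-level paging. The following is a
minimal sketch of that arithmetic; the helper ``addr_space_bytes()`` and
the ``TiB``/``PiB`` macros are illustrative names, not kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

#define TiB (UINT64_C(1) << 40)
#define PiB (UINT64_C(1) << 50)

/* Size in bytes of an address space that is `bits` wide (illustrative helper). */
static uint64_t addr_space_bytes(unsigned int bits)
{
	assert(bits < 64);	/* shifting by 64 or more would overflow uint64_t */
	return UINT64_C(1) << bits;
}

/*
 * 4-level paging: addr_space_bytes(48) is 256 TiB of virtual address space,
 *                 addr_space_bytes(46) is  64 TiB of physical address space.
 * 5-level paging: addr_space_bytes(57) is 128 PiB of virtual address space,
 *                 addr_space_bytes(52) is   4 PiB of physical address space.
 */
```

Each additional page table level translates 9 more address bits, which is
why the virtual address width grows from 48 to 57 bits.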
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20) QEMU 2.9 and later support 5-level paging.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22) Virtual memory layout for 5-level paging is described in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) Documentation/x86/x86_64/mm.rst
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26) Enabling 5-level paging
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27) =======================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28) CONFIG_X86_5LEVEL=y enables the feature.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30) A kernel built with CONFIG_X86_5LEVEL=y is still able to boot on 4-level
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31) hardware. In this case the additional page table level -- p4d -- will be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32) folded at runtime.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34) User-space and large virtual address space
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35) ==========================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) On x86, 5-level paging enables 56-bit userspace virtual address space.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37) Not all user space is ready to handle wide addresses. It's known that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) at least some JIT compilers use the higher bits in pointers to encode their
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39) own information. With 5-level paging, these bits collide with valid
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40) pointers, which leads to crashes.
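
To illustrate the collision, here is a sketch of a hypothetical tagging
scheme that stores a 16-bit tag in bits 63:48 of a pointer-sized word. It
is not taken from any particular JIT; the names ``tag_pointer()``,
``untag_pointer()`` and ``demo_collision()`` and the example addresses are
assumptions made for illustration. Addresses are modelled as ``uint64_t``
values so that nothing is dereferenced.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: tag in bits 63:48, address in bits 47:0. */
#define TAG_SHIFT 48
#define ADDR_MASK ((UINT64_C(1) << TAG_SHIFT) - 1)

/* Pack an address and a tag into one 64-bit word. */
static uint64_t tag_pointer(uint64_t addr, uint16_t tag)
{
	return (addr & ADDR_MASK) | ((uint64_t)tag << TAG_SHIFT);
}

/* Recover the address by discarding the tag bits. */
static uint64_t untag_pointer(uint64_t word)
{
	return word & ADDR_MASK;
}

static void demo_collision(void)
{
	/* Below the 47-bit boundary: the address survives the round trip. */
	uint64_t low = UINT64_C(0x00007f0012345000);
	assert(untag_pointer(tag_pointer(low, 0xabcd)) == low);

	/* Inside the 56-bit user range: bits 55:48 are address bits and get lost. */
	uint64_t high = UINT64_C(0x00ff000012345000);
	assert(untag_pointer(tag_pointer(high, 0xabcd)) != high);
}
```

With such a scheme, a mapping placed above the 47-bit boundary would be
silently truncated to a different address when untagged.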
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42) To mitigate this, we do not allocate virtual address space above 47 bits
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43) by default.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45) But userspace can ask for allocation from the full address space by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) specifying a hint address (with or without MAP_FIXED) above 47 bits.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48) If the hint address is above 47 bits but MAP_FIXED is not specified, we try
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49) to find an unmapped area at the specified address. If it's already
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) occupied, we look for an unmapped area in the *full* address space, rather
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51) than in the 47-bit window.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53) A high hint address only affects the allocation in question, not any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) future mmap()s.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56) Specifying a high hint address on an older kernel or on a machine without
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57) 5-level paging support is safe. The hint will be ignored and the kernel will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58) fall back to allocation from the 47-bit address space.
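
The opt-in described above can be sketched from user space as follows.
This is a minimal example, assuming a 64-bit Linux process; the helper
names ``map_with_high_hint()`` and ``is_above_47bit()`` are hypothetical,
and the hint value ``1 << 48`` is an arbitrary address above the 47-bit
boundary rather than a recommended one.

```c
#define _DEFAULT_SOURCE		/* expose MAP_ANONYMOUS under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* First address that no longer fits in the default 47-bit window. */
#define ADDR_47BIT_LIMIT (UINT64_C(1) << 47)

/*
 * Request an anonymous private mapping with a hint above 47 bits.
 * MAP_FIXED is not used, so the kernel is free to pick another address.
 * Returns MAP_FAILED on error, like mmap() itself.
 */
static void *map_with_high_hint(size_t len)
{
	void *hint = (void *)(uintptr_t)(UINT64_C(1) << 48);	/* arbitrary high hint */

	assert(len > 0);	/* mmap() rejects zero-length mappings */
	return mmap(hint, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

/* Report whether a returned address lies above the 47-bit window. */
static int is_above_47bit(const void *p)
{
	return (uintptr_t)p >= ADDR_47BIT_LIMIT;
}
```

A caller would check the result of ``map_with_high_hint()`` against
``MAP_FAILED`` and can then use ``is_above_47bit()`` to see where the
mapping landed. On a kernel or machine without 5-level paging the mapping
is expected to come back from the 47-bit address space, so the code should
not assume a high address.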
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60) This approach makes it easy to make an application's memory allocator aware
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61) of the large address space without manually tracking allocated virtual
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) address space.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64) One important case we need to handle here is interaction with MPX.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65) MPX (without the MAWA extension) cannot handle addresses above 47 bits, so
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66) we need to make sure that MPX cannot be enabled if we already have a VMA
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67) above the boundary, and forbid creating such VMAs once MPX is enabled.