.. _admin_guide_transhuge:

============================
Transparent Hugepage Support
============================

Objective
=========

Performance-critical computing applications dealing with large memory
working sets are already running on top of libhugetlbfs and in turn
hugetlbfs. Transparent Hugepage Support (THP) is an alternative means
of using huge pages for the backing of virtual memory. It supports the
automatic promotion and demotion of page sizes and does not have the
shortcomings of hugetlbfs.

Currently THP only works for anonymous memory mappings and tmpfs/shmem,
but in the future it can expand to other filesystems.

.. note::
   In the examples below we presume that the basic page size is 4K and
   the huge page size is 2M, although the actual numbers may vary
   depending on the CPU architecture.

Applications run faster because of two factors. The first factor is
almost completely irrelevant and not of significant interest, because
it also has the downside of requiring larger clear-page and copy-page
operations in page faults, which is a potentially negative effect. The
first factor consists in taking a single page fault for each 2M
virtual region touched by userland (thus reducing the enter/exit
kernel frequency by a factor of 512). This only matters the first time
the memory is accessed for the lifetime of a memory mapping. The
second, long-lasting and much more important factor affects all
subsequent accesses to the memory for the whole runtime of the
application. The second factor consists of two components:

1) the TLB miss will run faster (especially with virtualization using
   nested pagetables but almost always also on bare metal without
   virtualization)

2) a single TLB entry will be mapping a much larger amount of virtual
   memory, in turn reducing the number of TLB misses. With
   virtualization and nested pagetables a TLB entry can map a larger
   size only if both KVM and the Linux guest are using hugepages, but
   a significant speedup already happens if only one of the two is
   using hugepages, just because of the fact the TLB miss is going to
   run faster.

THP can be enabled system wide or restricted to certain tasks or even
memory ranges inside a task's address space. Unless THP is completely
disabled, there is a ``khugepaged`` daemon that scans memory and
collapses sequences of basic pages into huge pages.

The THP behaviour is controlled via the :ref:`sysfs <thp_sysfs>`
interface and via the madvise(2) and prctl(2) system calls.

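As a minimal illustration of the per-task control path, the sketch
below opts the calling task out of THP with prctl(2) and then reads
the setting back. It is only a sketch and assumes a kernel and libc
headers that provide PR_SET_THP_DISABLE/PR_GET_THP_DISABLE::

    /* Hedged sketch: opt the current task out of THP via prctl(2). */
    #include <stdio.h>
    #include <sys/prctl.h>

    int main(void)
    {
        /* A non-zero second argument disables THP for this task. */
        if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0) != 0) {
            perror("prctl(PR_SET_THP_DISABLE)");
            return 1;
        }
        /* PR_GET_THP_DISABLE returns the current setting (0 or 1). */
        printf("THP disabled for this task: %d\n",
               prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0));
        return 0;
    }
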
Transparent Hugepage Support maximizes the usefulness of free memory
if compared to the reservation approach of hugetlbfs by allowing all
unused memory to be used as cache or other movable (or even unmovable)
entities. It doesn't require reservation to prevent hugepage
allocation failures from being noticeable from userland. It allows
paging and all other advanced VM features to be available on the
hugepages. It requires no modifications for applications to take
advantage of it.

Applications however can be further optimized to take advantage of
this feature, like for example they've been optimized before to avoid
a flood of mmap system calls for every malloc(4k). Optimizing userland
is by far not mandatory and khugepaged already can take care of long
lived page allocations even for hugepage unaware applications that
deal with large amounts of memory.

In certain cases when hugepages are enabled system wide, an
application may end up allocating more memory resources. An
application may mmap a large region but only touch 1 byte of it; in
that case a 2M page might be allocated instead of a 4k page for no
good reason. This is why it's possible to disable hugepages
system-wide and to only have them inside MADV_HUGEPAGE madvise
regions.

Embedded systems should enable hugepages only inside madvise regions
to eliminate any risk of wasting any precious byte of memory and to
only run faster.

Applications that get a lot of benefit from hugepages and that don't
risk losing memory by using hugepages should use
madvise(MADV_HUGEPAGE) on their critical mmapped regions, as in the
sketch below.

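A minimal sketch of that pattern follows; the 16 MiB anonymous
mapping is an arbitrary illustration, not a recommendation::

    /* Hedged sketch: request THP for one performance-critical region. */
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 16UL << 20;        /* 16 MiB, illustrative only */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        /* Advisory: ask the kernel to back this range with huge pages. */
        if (madvise(buf, len, MADV_HUGEPAGE) != 0)
            perror("madvise(MADV_HUGEPAGE)");

        /* ... use buf as the critical working set ... */
        munmap(buf, len);
        return 0;
    }
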
.. _thp_sysfs:

sysfs
=====

Global THP controls
-------------------

Transparent Hugepage Support for anonymous memory can be entirely disabled
(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
regions (to avoid the risk of consuming more memory resources) or enabled
system wide. This can be achieved with one of::

    echo always >/sys/kernel/mm/transparent_hugepage/enabled
    echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
    echo never >/sys/kernel/mm/transparent_hugepage/enabled

It's also possible to limit the defrag efforts in the VM to generate
anonymous hugepages, in case they're not immediately free, to madvise
regions only, or to never try to defrag memory and simply fall back to
regular pages unless hugepages are immediately available. Clearly if we
spend CPU time to defrag memory, we would expect to gain even more by
the fact we use hugepages later instead of regular pages. This isn't
always guaranteed, but it may be more likely in case the allocation is
for a MADV_HUGEPAGE region.

::

    echo always >/sys/kernel/mm/transparent_hugepage/defrag
    echo defer >/sys/kernel/mm/transparent_hugepage/defrag
    echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
    echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
    echo never >/sys/kernel/mm/transparent_hugepage/defrag

always
    means that an application requesting THP will stall on
    allocation failure and directly reclaim pages and compact
    memory in an effort to allocate a THP immediately. This may be
    desirable for virtual machines that benefit heavily from THP
    use and are willing to delay the VM start to utilise them.

defer
    means that an application will wake kswapd in the background
    to reclaim pages and wake kcompactd to compact memory so that
    THP is available in the near future. It's the responsibility
    of khugepaged to then install the THP pages later.

defer+madvise
    will enter direct reclaim and compaction like ``always``, but
    only for regions that have used madvise(MADV_HUGEPAGE); all
    other regions will wake kswapd in the background to reclaim
    pages and wake kcompactd to compact memory so that THP is
    available in the near future.

madvise
    will enter direct reclaim like ``always`` but only for regions
    that have used madvise(MADV_HUGEPAGE). This is the default
    behaviour.

never
    should be self-explanatory.

By default the kernel tries to use the huge zero page on a read page
fault to an anonymous mapping. It's possible to disable the huge zero
page by writing 0 or enable it back by writing 1::

    echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
    echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page

Some userspace (such as a test program, or an optimized memory allocation
library) may want to know the size (in bytes) of a transparent hugepage::

    cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size

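A library could read the same value programmatically; the sketch below
simply prints it and assumes the sysfs file is present and readable::

    /* Hedged sketch: read the THP size advertised by the kernel. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long hpage_size;
        FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "r");

        if (!f || fscanf(f, "%lu", &hpage_size) != 1) {
            fprintf(stderr, "cannot read hpage_pmd_size\n");
            return 1;
        }
        fclose(f);
        printf("THP size: %lu bytes\n", hpage_size);
        return 0;
    }
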
khugepaged will be automatically started when
transparent_hugepage/enabled is set to "always" or "madvise", and it'll
be automatically shut down if it's set to "never".

Khugepaged controls
-------------------

khugepaged usually runs at low frequency, so while one may not want to
invoke defrag algorithms synchronously during the page faults, it
should be worth invoking defrag at least in khugepaged. However it's
also possible to disable defrag in khugepaged by writing 0 or enable
defrag in khugepaged by writing 1::

    echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
    echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag

You can also control how many pages khugepaged should scan at each
pass::

    /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan

and how many milliseconds to wait in khugepaged between each pass (you
can set this to 0 to run khugepaged at 100% utilization of one core)::

    /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs

and how many milliseconds to wait in khugepaged if there's a hugepage
allocation failure to throttle the next allocation attempt::

    /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs

The khugepaged progress can be seen in the number of pages collapsed::

    /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed

and in the number of full scans performed, one for each pass::

    /sys/kernel/mm/transparent_hugepage/khugepaged/full_scans

``max_ptes_none`` specifies how many extra small pages (that are
not already mapped) can be allocated when collapsing a group
of small pages into one large page::

    /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none

A higher value leads to more additional memory being used by
programs. A lower value leads to less THP performance gain. The
effect of max_ptes_none on CPU time is negligible and can be
ignored.

``max_ptes_swap`` specifies how many pages can be brought in from
swap when collapsing a group of pages into a transparent huge page::

    /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap

A higher value can cause excessive swap IO and waste
memory. A lower value can prevent THPs from being
collapsed, resulting in fewer pages being collapsed into
THPs, and lower memory access performance.

``max_ptes_shared`` specifies how many pages can be shared across multiple
processes. Exceeding the number would block the collapse::

    /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared

A higher value may increase memory footprint for some workloads.

Boot parameter
==============

You can change the sysfs boot time defaults of Transparent Hugepage
Support by passing the parameter ``transparent_hugepage=always`` or
``transparent_hugepage=madvise`` or ``transparent_hugepage=never``
to the kernel command line.

Hugepages in tmpfs/shmem
========================

You can control hugepage allocation policy in tmpfs with the mount
option ``huge=``. It can have the following values:

always
    Attempt to allocate huge pages every time we need a new page;

never
    Do not allocate huge pages;

within_size
    Only allocate huge pages if they will be fully within i_size.
    Also respect fadvise()/madvise() hints;

advise
    Only allocate huge pages if requested with fadvise()/madvise();

The default policy is ``never``.

``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
``huge=never`` will not attempt to break up huge pages at all, just stop more
from being allocated.

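The same policy can also be chosen when mounting a tmpfs instance
programmatically. The sketch below is only an illustration: the
``/mnt/thp-tmpfs`` target and the ``size=1G`` option are assumptions,
and the target directory must already exist::

    /* Hedged sketch: mount a tmpfs instance with huge=within_size,
     * roughly equivalent to:
     * mount -t tmpfs -o huge=within_size,size=1G tmpfs /mnt/thp-tmpfs */
    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        if (mount("tmpfs", "/mnt/thp-tmpfs", "tmpfs", 0,
                  "huge=within_size,size=1G") != 0) {
            perror("mount");
            return 1;
        }
        return 0;
    }
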
There's also a sysfs knob to control hugepage allocation policy for the
internal shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled.
The mount is used for SysV SHM, memfds, shared anonymous mmaps (of
/dev/zero or MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.

In addition to policies listed above, shmem_enabled allows two further
values:

deny
    For use in emergencies, to force the huge option off from
    all mounts;
force
    Force the huge option on for all - very useful for testing;

Need of application restart
===========================

The transparent_hugepage/enabled values and tmpfs mount option only affect
future behavior. So to make them effective you need to restart any
application that could have been using hugepages. This also applies to the
regions registered in khugepaged.

Monitoring usage
================

The number of anonymous transparent huge pages currently used by the
system is available by reading the AnonHugePages field in ``/proc/meminfo``.
To identify what applications are using anonymous transparent huge pages,
it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields
for each mapping.

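A sketch of that per-process accounting follows. It sums the
AnonHugePages fields of the calling process' own ``/proc/self/smaps``;
reading another process' smaps works the same way given sufficient
privileges::

    /* Hedged sketch: sum AnonHugePages over all mappings of this process. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/self/smaps", "r");
        char line[256];
        unsigned long kb, total_kb = 0;

        if (!f) {
            perror("fopen");
            return 1;
        }
        while (fgets(line, sizeof(line), f)) {
            /* Lines look like: "AnonHugePages:      2048 kB" */
            if (sscanf(line, "AnonHugePages: %lu kB", &kb) == 1)
                total_kb += kb;
        }
        fclose(f);
        printf("AnonHugePages total: %lu kB\n", total_kb);
        return 0;
    }
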
The number of file transparent huge pages mapped to userspace is available
by reading the ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
To identify what applications are mapping file transparent huge pages, it
is necessary to read ``/proc/PID/smaps`` and count the FilePmdMapped fields
for each mapping.

Note that reading the smaps file is expensive and reading it
frequently will incur overhead.

There are a number of counters in ``/proc/vmstat`` that may be used to
monitor how successfully the system is providing huge pages for use.

thp_fault_alloc
    is incremented every time a huge page is successfully
    allocated to handle a page fault.

thp_collapse_alloc
    is incremented by khugepaged when it has found
    a range of pages to collapse into one huge page and has
    successfully allocated a new huge page to store the data.

thp_fault_fallback
    is incremented if a page fault fails to allocate
    a huge page and instead falls back to using small pages.

thp_fault_fallback_charge
    is incremented if a page fault fails to charge a huge page and
    instead falls back to using small pages even though the
    allocation was successful.

thp_collapse_alloc_failed
    is incremented if khugepaged found a range
    of pages that should be collapsed into one huge page but failed
    the allocation.

thp_file_alloc
    is incremented every time a file huge page is successfully
    allocated.

thp_file_fallback
    is incremented if a file huge page is attempted to be allocated
    but fails and instead falls back to using small pages.

thp_file_fallback_charge
    is incremented if a file huge page cannot be charged and instead
    falls back to using small pages even though the allocation was
    successful.

thp_file_mapped
    is incremented every time a file huge page is mapped into
    user address space.

thp_split_page
    is incremented every time a huge page is split into base
    pages. This can happen for a variety of reasons but a common
    reason is that a huge page is old and is being reclaimed.
    This action implies splitting all PMDs the page is mapped with.

thp_split_page_failed
    is incremented if the kernel fails to split a huge
    page. This can happen if the page was pinned by somebody.

thp_deferred_split_page
    is incremented when a huge page is put onto the split
    queue. This happens when a huge page is partially unmapped and
    splitting it would free up some memory. Pages on the split queue
    are going to be split under memory pressure.

thp_split_pmd
    is incremented every time a PMD is split into a table of PTEs.
    This can happen, for instance, when an application calls mprotect()
    or munmap() on part of a huge page. It doesn't split the huge page,
    only the page table entry.

thp_zero_page_alloc
    is incremented every time a huge zero page is
    successfully allocated. It includes allocations which were
    dropped due to a race with another allocation. Note, it doesn't
    count every map of the huge zero page, only its allocation.

thp_zero_page_alloc_failed
    is incremented if the kernel fails to allocate a
    huge zero page and falls back to using small pages.

thp_swpout
    is incremented every time a huge page is swapped out in one
    piece without splitting.

thp_swpout_fallback
    is incremented if a huge page has to be split before swapout,
    usually because the kernel failed to allocate some contiguous
    swap space for the huge page.

As the system ages, allocating huge pages may be expensive as the
system uses memory compaction to copy data around memory to free a
huge page for use. There are some counters in ``/proc/vmstat`` to help
monitor this overhead.

compact_stall
    is incremented every time a process stalls to run
    memory compaction so that a huge page is free for use.

compact_success
    is incremented if the system compacted memory and
    freed a huge page for use.

compact_fail
    is incremented if the system tries to compact memory
    but fails.

compact_pages_moved
    is incremented each time a page is moved. If
    this value is increasing rapidly, it implies that the system
    is copying a lot of data to satisfy the huge page allocation.
    It is possible that the cost of copying exceeds any savings
    from reduced TLB misses.

compact_pagemigrate_failed
    is incremented when the underlying mechanism
    for moving a page failed.

compact_blocks_moved
    is incremented each time memory compaction examines
    a huge page aligned range of pages.

It is possible to establish how long the stalls were using the function
tracer to record how long was spent in __alloc_pages_nodemask and
using the mm_page_alloc tracepoint to identify which allocations were
for huge pages.

Optimizing the applications
===========================

To be guaranteed that the kernel will map a 2M page immediately in any
memory region, the mmap region has to be hugepage naturally
aligned. posix_memalign() can provide that guarantee.

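A sketch of such an allocation follows. It aligns the buffer to the
THP size read from sysfs, falling back to an assumed 2 MiB when the
file is unavailable, and then advises the kernel as in the earlier
example; the 64 MiB length is illustrative only::

    /* Hedged sketch: hugepage-aligned allocation plus MADV_HUGEPAGE. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t align = 2UL << 20;          /* fallback: assume 2 MiB */
        size_t len = 64UL << 20;           /* 64 MiB, illustrative only */
        void *buf = NULL;
        int err;
        FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "r");

        if (f) {
            if (fscanf(f, "%zu", &align) != 1)
                align = 2UL << 20;
            fclose(f);
        }
        /* posix_memalign() guarantees a hugepage-aligned start address. */
        err = posix_memalign(&buf, align, len);
        if (err != 0) {
            fprintf(stderr, "posix_memalign: %s\n", strerror(err));
            return 1;
        }
        if (madvise(buf, len, MADV_HUGEPAGE) != 0)
            perror("madvise(MADV_HUGEPAGE)");

        /* ... use buf ... */
        free(buf);
        return 0;
    }
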
Hugetlbfs
=========

You can use hugetlbfs on a kernel that has transparent hugepage
support enabled just fine as always. No difference can be noted in
hugetlbfs other than there will be less overall fragmentation. All
usual features belonging to hugetlbfs are preserved and
unaffected. libhugetlbfs will also work fine as usual.