==================
HugeTLB Controller
==================

The HugeTLB controller can be enabled by mounting the cgroup filesystem with
the hugetlb subsystem::

  # mount -t cgroup -o hugetlb none /sys/fs/cgroup

With the above step, the initial or parent HugeTLB group becomes visible at
/sys/fs/cgroup. At bootup, this group includes all the tasks in the system,
and /sys/fs/cgroup/tasks lists the tasks in this cgroup.
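
For example, assuming the mount above succeeded, the per-size control files
and the root group's task list can be inspected directly (the exact file
names depend on the hugepage sizes the system supports)::

  # ls /sys/fs/cgroup/hugetlb.*
  # wc -l /sys/fs/cgroup/tasks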

New groups can be created under the parent group /sys/fs/cgroup::

  # cd /sys/fs/cgroup
  # mkdir g1
  # echo $$ > g1/tasks

The above steps create a new group g1 and move the current shell
process (bash) into it.
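
To confirm the move, the shell's cgroup membership can be checked through
procfs (the hugetlb line is expected to point at the new group; the
hierarchy number varies between systems)::

  # grep hugetlb /proc/self/cgroup
  # cat g1/tasks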

Brief summary of control files::

  hugetlb.<hugepagesize>.rsvd.limit_in_bytes     # set/show limit of "hugepagesize" hugetlb reservations
  hugetlb.<hugepagesize>.rsvd.max_usage_in_bytes # show max "hugepagesize" hugetlb reservations and no-reserve faults
  hugetlb.<hugepagesize>.rsvd.usage_in_bytes     # show current reservations and no-reserve faults for "hugepagesize" hugetlb
  hugetlb.<hugepagesize>.rsvd.failcnt            # show the number of allocation failures due to the HugeTLB reservation limit
  hugetlb.<hugepagesize>.limit_in_bytes          # set/show limit of "hugepagesize" hugetlb faults
  hugetlb.<hugepagesize>.max_usage_in_bytes      # show max "hugepagesize" hugetlb usage recorded
  hugetlb.<hugepagesize>.usage_in_bytes          # show current usage for "hugepagesize" hugetlb
  hugetlb.<hugepagesize>.failcnt                 # show the number of allocation failures due to the HugeTLB usage limit

For a system supporting three hugepage sizes (64KB, 32MB and 1GB), the control
files include::

  hugetlb.1GB.limit_in_bytes
  hugetlb.1GB.max_usage_in_bytes
  hugetlb.1GB.usage_in_bytes
  hugetlb.1GB.failcnt
  hugetlb.1GB.rsvd.limit_in_bytes
  hugetlb.1GB.rsvd.max_usage_in_bytes
  hugetlb.1GB.rsvd.usage_in_bytes
  hugetlb.1GB.rsvd.failcnt
  hugetlb.64KB.limit_in_bytes
  hugetlb.64KB.max_usage_in_bytes
  hugetlb.64KB.usage_in_bytes
  hugetlb.64KB.failcnt
  hugetlb.64KB.rsvd.limit_in_bytes
  hugetlb.64KB.rsvd.max_usage_in_bytes
  hugetlb.64KB.rsvd.usage_in_bytes
  hugetlb.64KB.rsvd.failcnt
  hugetlb.32MB.limit_in_bytes
  hugetlb.32MB.max_usage_in_bytes
  hugetlb.32MB.usage_in_bytes
  hugetlb.32MB.failcnt
  hugetlb.32MB.rsvd.limit_in_bytes
  hugetlb.32MB.rsvd.max_usage_in_bytes
  hugetlb.32MB.rsvd.usage_in_bytes
  hugetlb.32MB.rsvd.failcnt
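
On such a system, the current and peak usage and the failure counter for any
supported size can be read directly from the corresponding files, for
instance for the 1GB size in the root group (the usage values are reported
in bytes; failcnt is a plain counter)::

  # cat /sys/fs/cgroup/hugetlb.1GB.usage_in_bytes
  # cat /sys/fs/cgroup/hugetlb.1GB.max_usage_in_bytes
  # cat /sys/fs/cgroup/hugetlb.1GB.failcnt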


1. Page fault accounting

hugetlb.<hugepagesize>.limit_in_bytes
hugetlb.<hugepagesize>.max_usage_in_bytes
hugetlb.<hugepagesize>.usage_in_bytes
hugetlb.<hugepagesize>.failcnt

The HugeTLB controller allows users to limit the HugeTLB usage (page fault)
per control group and enforces the limit at page fault time. Since HugeTLB
does not support page reclaim, enforcing the limit at page fault time implies
that the application will get a SIGBUS signal if it tries to fault in HugeTLB
pages beyond its limit. Therefore the application needs to know beforehand
exactly how many HugeTLB pages it uses, and the sysadmin needs to make sure
that enough pages are available on the machine for all users, to avoid
processes getting SIGBUS.
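
As a minimal sketch, assuming the g1 group created earlier and the 1GB
hugepage size from the example above, a fault limit of two 1GB pages can be
set by writing the byte value to the limit file; a task in g1 that tries to
fault in a third 1GB page would then receive SIGBUS, and each rejected
allocation is reflected in failcnt::

  # cd /sys/fs/cgroup/g1
  # echo 2147483648 > hugetlb.1GB.limit_in_bytes   # 2 * 1GB, in bytes
  # cat hugetlb.1GB.limit_in_bytes
  # cat hugetlb.1GB.failcnt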


2. Reservation accounting

hugetlb.<hugepagesize>.rsvd.limit_in_bytes
hugetlb.<hugepagesize>.rsvd.max_usage_in_bytes
hugetlb.<hugepagesize>.rsvd.usage_in_bytes
hugetlb.<hugepagesize>.rsvd.failcnt

The HugeTLB controller allows users to limit HugeTLB reservations per control
group and enforces the controller limit at reservation time, as well as at
fault time for HugeTLB memory for which no reservation exists. Since
reservation limits are enforced at reservation time (on mmap or shmget),
reservation limits never cause the application to receive a SIGBUS signal if
the memory was reserved beforehand. For MAP_NORESERVE allocations, the
reservation limit behaves the same as the fault limit, enforcing memory usage
at fault time and causing the application to receive a SIGBUS if it crosses
its limit.

Reservation limits are superior to the page fault limits described above,
since reservation limits are enforced at reservation time (on mmap or
shmget) and never cause the application to receive a SIGBUS signal if the
memory was reserved beforehand. This allows for easier fallback to
alternatives such as non-HugeTLB memory. With page fault accounting, it is
very hard to avoid processes getting SIGBUS, since the sysadmin would need to
know precisely the HugeTLB usage of all the tasks in the system and make sure
there are enough pages to satisfy all requests. Avoiding tasks getting SIGBUS
on overcommitted systems is practically impossible with page fault
accounting.
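
A corresponding sketch for reservation limits, again assuming the g1 group
and the 1GB size used above: writing a limit to the rsvd file caps how much
the group may reserve, so an mmap() of HugeTLB memory that would push the
group's reservations past the limit is expected to fail at map time instead
of the task receiving SIGBUS later::

  # cd /sys/fs/cgroup/g1
  # echo 2147483648 > hugetlb.1GB.rsvd.limit_in_bytes   # 2 * 1GB, in bytes
  # cat hugetlb.1GB.rsvd.usage_in_bytes
  # cat hugetlb.1GB.rsvd.failcnt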


3. Caveats with shared memory

For shared HugeTLB memory, both HugeTLB reservations and page faults are
charged to the first task that causes the memory to be reserved or faulted,
and all subsequent uses of this reserved or faulted memory are done without
charging.

Shared HugeTLB memory is only uncharged when it is unreserved or deallocated.
This is usually when the HugeTLB file is deleted, and not when the task that
caused the reservation or fault has exited.


4. Caveats with HugeTLB cgroup offline

When a HugeTLB cgroup goes offline with some reservations or faults still
charged to it, the behavior is as follows:

- The fault charges are charged to the parent HugeTLB cgroup (reparented).
- The reservation charges remain on the offline HugeTLB cgroup.

This means that if a HugeTLB cgroup gets offlined while there are still
HugeTLB reservations charged to it, that cgroup persists as a zombie until
all HugeTLB reservations are uncharged. HugeTLB reservations behave in this
manner to match the memory controller, whose cgroups also persist as zombies
until all charged memory is uncharged. Also, the tracking of HugeTLB
reservations is a bit more complex than the tracking of HugeTLB faults, so it
is significantly harder to reparent reservations at offline time.