==========================
Memory Resource Controller
==========================

NOTE:
  This document is hopelessly outdated and needs a complete rewrite.
  It still contains useful information, so we are keeping it here, but
  make sure to check the current code if you need a deeper understanding.

NOTE:
  The Memory Resource Controller is generically referred to as the
  memory controller in this document. Do not confuse the memory
  controller used here with the memory controller used in hardware.

(For editors) In this document:
  When we mention a cgroup (cgroupfs's directory) with the memory
  controller, we call it a "memory cgroup". In git logs and source code,
  patch titles and function names tend to use "memcg"; in this document,
  we avoid that term.

Benefits and Purpose of the memory controller
=============================================

The memory controller isolates the memory behaviour of a group of tasks
from the rest of the system. The article on LWN [12] mentions some probable
uses of the memory controller. The memory controller can be used to

a. Isolate an application or a group of applications.
   Memory-hungry applications can be isolated and limited to a smaller
   amount of memory.
b. Create a cgroup with a limited amount of memory; this can be used
   as a good alternative to booting with mem=XXXX.
c. Virtualization solutions can control the amount of memory they want
   to assign to a virtual machine instance.
d. A CD/DVD burner could control the amount of memory used by the
   rest of the system to ensure that burning does not fail due to lack
   of available memory.
e. There are several other use cases; find one or use the controller just
   for fun (to learn and hack on the VM subsystem).

Current Status: linux-2.6.34-mmotm (development version of April 2010)

Features:

- accounting of anonymous pages, file caches, and swap caches, and
  limiting their usage
- pages are linked to a per-memcg LRU exclusively; there is no global LRU
- optionally, memory+swap usage can be accounted and limited
- hierarchical accounting
- soft limits
- moving (recharging) the account when a task moves between cgroups is
  selectable
- usage threshold notifier
- memory pressure notifier
- oom-killer disable knob and oom-notifier
- the root cgroup has no limit controls

Kernel memory support is a work in progress, and the current version provides
basic functionality. (See Section 2.7.)

Brief summary of control files.

==================================== ==========================================
tasks                                attach a task (thread) and show the list
                                     of threads
cgroup.procs                         show the list of processes
cgroup.event_control                 an interface for event_fd()
memory.usage_in_bytes                show current usage for memory
                                     (See 5.5 for details)
memory.memsw.usage_in_bytes          show current usage for memory+Swap
                                     (See 5.5 for details)
memory.limit_in_bytes                set/show limit of memory usage
memory.memsw.limit_in_bytes          set/show limit of memory+Swap usage
memory.failcnt                       show the number of times memory usage
                                     hit limits
memory.memsw.failcnt                 show the number of times memory+Swap
                                     usage hit limits
memory.max_usage_in_bytes            show max memory usage recorded
memory.memsw.max_usage_in_bytes      show max memory+Swap usage recorded
memory.soft_limit_in_bytes           set/show soft limit of memory usage
memory.stat                          show various statistics
memory.use_hierarchy                 set/show whether hierarchical accounting
                                     is enabled
memory.force_empty                   trigger forced page reclaim
memory.pressure_level                set memory pressure notifications
memory.swappiness                    set/show swappiness parameter of vmscan
                                     (See sysctl's vm.swappiness)
memory.move_charge_at_immigrate      set/show controls of moving charges
memory.oom_control                   set/show oom controls
memory.numa_stat                     show memory usage per NUMA node
memory.kmem.limit_in_bytes           set/show hard limit for kernel memory.
                                     This knob is deprecated and shouldn't be
                                     used; it is planned to be removed in the
                                     foreseeable future.
memory.kmem.usage_in_bytes           show current kernel memory allocation
memory.kmem.failcnt                  show the number of times kernel memory
                                     usage hit limits
memory.kmem.max_usage_in_bytes       show max kernel memory usage recorded

memory.kmem.tcp.limit_in_bytes       set/show hard limit for tcp buf memory
memory.kmem.tcp.usage_in_bytes       show current tcp buf memory allocation
memory.kmem.tcp.failcnt              show the number of times tcp buf memory
                                     usage hit limits
memory.kmem.tcp.max_usage_in_bytes   show max tcp buf memory usage recorded
==================================== ==========================================
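
All of these files live in each memory cgroup's directory. As a quick
sketch of how they are read (the mount point and the group name "0" below
are assumptions; see Section 3 for setup)::

    # ls /sys/fs/cgroup/memory/0/
    # cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes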

1. History
==========

The memory controller has a long history. A request for comments for the
memory controller was posted by Balbir Singh [1]. At the time the RFC was
posted there were several implementations for memory control. The goal of
the RFC was to build consensus and agreement for the minimal features
required for memory control. The first RSS controller was posted by Balbir
Singh [2] in February 2007. Pavel Emelianov [3][4][5] has since posted
three versions of the RSS controller. At OLS, at the resource management
BoF, everyone suggested that we handle both page cache and RSS together.
Another request was raised to allow user-space handling of OOM. The current
memory controller is at version 6; it combines both mapped (RSS) and
unmapped Page Cache Control [11].

2. Memory Control
=================

Memory is a unique resource in the sense that it is present in a limited
amount. If a task requires a lot of CPU processing, the task can spread
its processing over a period of hours, days, months or years, but with
memory, the same physical memory needs to be reused to accomplish the task.

The memory controller implementation has been divided into phases. These
are:

1. Memory controller
2. mlock(2) controller
3. Kernel user memory accounting and slab control
4. user mappings length controller

The memory controller is the first controller developed.

2.1. Design
-----------

The core of the design is a counter called the page_counter. The
page_counter tracks the current memory usage and limit of the group of
processes associated with the controller. Each cgroup has a memory
controller specific data structure (mem_cgroup) associated with it.

2.2. Accounting
---------------

::

                +--------------------+
                |    mem_cgroup      |
                |   (page_counter)   |
                +--------------------+
                 /        ^        \
                /         |         \
   +---------------+      |      +---------------+
   |   mm_struct   |      |....  |   mm_struct   |
   |               |      |      |               |
   +---------------+      |      +---------------+
                          |
                          +---------------+
                                          |
   +---------------+             +--------+------+
   |     page      +------------>|  page_cgroup  |
   |               |             |               |
   +---------------+             +---------------+

(Figure 1: Hierarchy of Accounting)


Figure 1 shows the important aspects of the controller:

1. Accounting happens per cgroup.
2. Each mm_struct knows which cgroup it belongs to.
3. Each page has a pointer to its page_cgroup, which in turn knows the
   cgroup it belongs to.

The accounting is done as follows: mem_cgroup_charge_common() is invoked to
set up the necessary data structures and check if the cgroup that is being
charged is over its limit. If it is, then reclaim is invoked on the cgroup.
More details can be found in the reclaim section of this document.
If everything goes well, a per-page metadata structure called page_cgroup
is updated. page_cgroup has its own LRU on the cgroup.
(*) The page_cgroup structure is allocated at boot/memory-hotplug time.

2.2.1 Accounting details
------------------------

All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
Some pages which are never reclaimable and will not be on the LRU
are not accounted. We only account pages under usual VM management.

RSS pages are accounted at page fault unless they've already been accounted
for earlier. A file page is accounted as Page Cache when it is inserted
into the inode (radix-tree). While it is mapped into the page tables of
processes, duplicate accounting is carefully avoided.

An RSS page is unaccounted when it is fully unmapped. A Page Cache page is
unaccounted when it is removed from the radix-tree. Even if RSS pages are
fully unmapped (by kswapd), they may exist as SwapCache in the system until
they are really freed. Such SwapCaches are also accounted.
A swapped-in page is accounted after it is added to the swapcache.

Note: The kernel does swap-in readahead and reads multiple swap entries
at once. Since the page's memcg is recorded into swap regardless of whether
memsw is enabled, the page will be accounted after swap-in.

At page migration, accounting information is kept.

Note: we only account pages on the LRU because our purpose is to control
the amount of used pages; pages not on the LRU tend to be out of the VM's
control.

2.3 Shared Page Accounting
--------------------------

Shared pages are accounted on the basis of the first-touch approach. The
cgroup that first touches a page is charged for the page. The principle
behind this approach is that a cgroup that aggressively uses a shared
page will eventually get charged for it (once it is uncharged from
the cgroup that brought it in -- this will happen on memory pressure).

But see Section 8.2: when moving a task to another cgroup, its pages may
be recharged to the new cgroup, if move_charge_at_immigrate has been chosen.

2.4 Swap Extension
------------------

Swap usage is always recorded for each cgroup. The Swap Extension allows you
to read and limit it.

When CONFIG_SWAP is enabled, the following files are added:

- memory.memsw.usage_in_bytes
- memory.memsw.limit_in_bytes

memsw means memory+swap. Usage of memory+swap is limited by
memsw.limit_in_bytes.

Example: Assume a system with 4G of swap. A task which allocates 6G of memory
(by mistake) under a 2G memory limit will use all of the swap.
In this case, setting memsw.limit_in_bytes=3G will prevent the bad use of
swap. By using the memsw limit, you can avoid a system OOM which could be
caused by swap shortage.
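
A minimal sketch of that setup (the mount point and the group name "0" are
assumptions; memory.memsw.limit_in_bytes must be kept greater than or equal
to memory.limit_in_bytes, so the memory limit is set first)::

    # echo 2G > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
    # echo 3G > /sys/fs/cgroup/memory/0/memory.memsw.limit_in_bytes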

**Why 'memory+swap' rather than swap?**

The global LRU (kswapd) can swap out arbitrary pages. Swapping out means
moving a charge from memory to swap; there is no change in the usage of
memory+swap. In other words, when we want to limit the usage of swap without
affecting the global LRU, a memory+swap limit is better than just limiting
swap from an OS point of view.

**What happens when a cgroup hits memory.memsw.limit_in_bytes**

When a cgroup hits memory.memsw.limit_in_bytes, doing swap-out in this
cgroup is useless. Swap-out is then not performed by the cgroup routine,
and file caches are dropped instead. But as mentioned above, the global LRU
can still swap out memory from the cgroup for the sanity of the system's
memory management state; you cannot forbid that with a cgroup.

2.5 Reclaim
-----------

Each cgroup maintains a per-cgroup LRU which has the same structure as the
global VM's. When a cgroup goes over its limit, we first try
to reclaim memory from the cgroup so as to make space for the new
pages that the cgroup has touched. If the reclaim is unsuccessful,
an OOM routine is invoked to select and kill the bulkiest task in the
cgroup. (See 10. OOM Control below.)

The reclaim algorithm has not been modified for cgroups, except that
pages that are selected for reclaiming come from the per-cgroup LRU
list.

NOTE:
  Reclaim does not work for the root cgroup, since we cannot set any
  limits on the root cgroup.

NOTE2:
  When panic_on_oom is set to "2", the whole system will panic.

When an oom event notifier is registered, the event will be delivered.
(See the oom_control section.)

2.6 Locking
-----------

lock_page_cgroup()/unlock_page_cgroup() should not be called under
the i_pages lock.

The other lock order is as follows::

    PG_locked.
      mm->page_table_lock
        pgdat->lru_lock
          lock_page_cgroup.

In many cases, just lock_page_cgroup() is called.

The per-zone-per-cgroup LRU (the cgroup's private LRU) is guarded only by
pgdat->lru_lock; it has no lock of its own.

2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
-----------------------------------------------

With the Kernel memory extension, the Memory Controller is able to limit
the amount of kernel memory used by the system. Kernel memory is
fundamentally different from user memory, since it can't be swapped out,
which makes it possible to DoS the system by consuming too much of this
precious resource.

Kernel memory accounting is enabled for all memory cgroups by default. But
it can be disabled system-wide by passing cgroup.memory=nokmem to the kernel
at boot time. In this case, kernel memory will not be accounted at all.

Kernel memory limits are not imposed for the root cgroup. Usage for the root
cgroup may or may not be accounted. The memory used is accumulated into
memory.kmem.usage_in_bytes, or in a separate counter when it makes sense
(currently only for tcp).

The main "kmem" counter is fed into the main counter, so kmem charges will
also be visible from the user counter.

Currently no soft limit is implemented for kernel memory. It is future work
to trigger slab reclaim when those limits are reached.

2.7.1 Current Kernel Memory resources accounted
-----------------------------------------------

stack pages:
  every process consumes some stack pages. By accounting them into
  kernel memory, we prevent new processes from being created when the
  kernel memory usage is too high.

slab pages:
  pages allocated by the SLAB or SLUB allocator are tracked. A copy
  of each kmem_cache is created the first time the cache is touched
  from inside the memcg. The creation is done lazily, so some objects
  can still be skipped while the cache is being created. All objects in
  a slab page should belong to the same memcg. This only fails to hold
  when a task is migrated to a different memcg during the page
  allocation by the cache.

sockets memory pressure:
  some socket protocols have memory pressure
  thresholds. The Memory Controller allows them to be controlled
  individually per cgroup, instead of globally.

tcp memory pressure:
  sockets memory pressure for the tcp protocol.

2.7.2 Common use cases
----------------------

Because the "kmem" counter is fed to the main user counter, kernel memory
can never be limited completely independently of user memory. Say "U" is
the user limit, and "K" the kernel limit. There are three possible ways
limits can be set:

U != 0, K = unlimited:
    This is the standard memcg limitation mechanism already present before
    kmem accounting. Kernel memory is completely ignored.

U != 0, K < U:
    Kernel memory is a subset of the user memory. This setup is useful in
    deployments where the total amount of memory per-cgroup is
    overcommitted. Overcommitting kernel memory limits is definitely not
    recommended, since the box can still run out of non-reclaimable memory.
    In this case, the admin could set up K so that the sum of all groups is
    never greater than the total memory, and freely set U at the cost of
    their QoS.

    WARNING:
        In the current implementation, memory reclaim will NOT be
        triggered for a cgroup when it hits K while staying below U, which
        makes this setup impractical.

U != 0, K >= U:
    Kmem charges are also fed to the user counter, and reclaim will be
    triggered for the cgroup for both kinds of memory. This setup gives the
    admin a unified view of memory, and it is also useful for people who
    just want to track kernel memory usage.
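
As a sketch, the "K >= U" style setup could look like this (the group is
assumed to already exist and to be the current directory; note that the
kmem limit knob is deprecated, as mentioned above)::

    # echo 500M > memory.limit_in_bytes
    # echo 1G > memory.kmem.limit_in_bytes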

3. User Interface
=================

3.0. Configuration
------------------

a. Enable CONFIG_CGROUPS
b. Enable CONFIG_MEMCG
c. Enable CONFIG_MEMCG_SWAP (to use the swap extension)
d. Enable CONFIG_MEMCG_KMEM (to use the kmem extension)
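
The corresponding kernel configuration fragment would look roughly like
this (a sketch; the option names follow the list above, but exact names
may vary by kernel version)::

    CONFIG_CGROUPS=y
    CONFIG_MEMCG=y
    CONFIG_MEMCG_SWAP=y
    CONFIG_MEMCG_KMEM=y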

3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
-------------------------------------------------------------------

::

    # mount -t tmpfs none /sys/fs/cgroup
    # mkdir /sys/fs/cgroup/memory
    # mount -t cgroup none /sys/fs/cgroup/memory -o memory

3.2. Make the new group and move bash into it::

    # mkdir /sys/fs/cgroup/memory/0
    # echo $$ > /sys/fs/cgroup/memory/0/tasks

Now that we're in the 0 cgroup, we can alter the memory limit::

    # echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes

NOTE:
  We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
  mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes,
  Gibibytes.)

NOTE:
  We can write "-1" to reset ``*.limit_in_bytes`` (unlimited).

NOTE:
  We cannot set limits on the root cgroup any more.

::

    # cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
    4194304

We can check the usage::

    # cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
    1216512

A successful write to this file does not guarantee that the limit was set
to the exact value written into the file. It can differ due to a
number of factors, such as rounding up to page boundaries or the total
availability of memory on the system. The user is required to re-read
this file after a write to see the value committed by the kernel::

    # echo 1 > memory.limit_in_bytes
    # cat memory.limit_in_bytes
    4096

The memory.failcnt field gives the number of times that the cgroup limit
was exceeded.

The memory.stat file gives accounting information. As of now, the number of
cached pages, RSS, and active/inactive pages are shown.

4. Testing
==========

For testing features and implementation, see memcg_test.txt.

Performance testing is also important. To see the memory controller's pure
overhead, testing on tmpfs will give you good numbers for its small
overheads. For example, do a kernel make on tmpfs.
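
One way to run that test, as a sketch (the kernel source path and the
group name "0" are assumptions)::

    # mkdir -p /mnt/tmpfs
    # mount -t tmpfs none /mnt/tmpfs
    # cp -a /path/to/linux-src /mnt/tmpfs/
    # echo $$ > /sys/fs/cgroup/memory/0/tasks
    # cd /mnt/tmpfs/linux-src && make -j8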

Page-fault scalability is also important. When measuring a parallel
page-fault test, a multi-process test may be better than a multi-thread
test because the latter has noise from shared objects/status.

But the above two test extreme situations;
running your usual tests under the memory controller is always helpful.

4.1 Troubleshooting
-------------------

Sometimes a user might find that an application under a cgroup is
terminated by the OOM killer. There are several causes for this:

1. The cgroup limit is too low (just too low to do anything useful).
2. The user is using anonymous memory and swap is turned off or too low.

A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of
some of the pages cached in the cgroup (page cache pages).
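
That is, roughly::

    # sync
    # echo 1 > /proc/sys/vm/drop_caches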

To see what is happening, disable the OOM killer as described in
"10. OOM Control" (below) and observe what happens.

4.2 Task migration
------------------

When a task migrates from one cgroup to another, its charge is not
carried forward by default. The pages allocated in the original cgroup
remain charged to it; the charge is dropped when the page is freed or
reclaimed.

You can move the charges of a task along with task migration; see
8. "Move charges at task migration" and the sketch below.
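
A minimal sketch of observing the default behaviour (the group names "A"
and "B" are assumptions)::

    # mkdir /sys/fs/cgroup/memory/A /sys/fs/cgroup/memory/B
    # echo $$ > /sys/fs/cgroup/memory/A/tasks
    # echo $$ > /sys/fs/cgroup/memory/B/tasks
    # cat /sys/fs/cgroup/memory/A/memory.usage_in_bytes

Pages the shell touched while it was in A remain charged to A even after
the move, until they are freed or reclaimed.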

4.3 Removing a cgroup
---------------------

A cgroup can be removed by rmdir, but as discussed in Sections 4.1 and 4.2,
a cgroup might have some charge associated with it, even though all
tasks have migrated away from it. (This is because we charge against pages,
not against tasks.)

We move the stats to the root (if use_hierarchy==0) or to the parent (if
use_hierarchy==1), and there is no change in the charges except uncharging
from the child.

Charges recorded in swap information are not updated at removal of a cgroup.
The recorded information is discarded, and a cgroup which later uses the
swap (swapcache) will be charged as its new owner.

About use_hierarchy, see Section 6.

5. Misc. interfaces
===================

5.1 force_empty
---------------

The memory.force_empty interface is provided to make a cgroup's memory
usage empty. When anything is written to this file::

    # echo 0 > memory.force_empty

memory in the cgroup will be reclaimed, freeing as many pages as possible.

The typical use case for this interface is before calling rmdir().
Though rmdir() offlines the memcg, the memcg may still be pinned by charged
file caches. Some out-of-use page caches may remain charged until memory
pressure happens. If you want to avoid that, force_empty will be useful.

Also, note that when memory.kmem.limit_in_bytes is set, the charges due to
kernel pages will still be seen. This is not considered a failure and the
write will still return success. In this case, it is expected that
memory.kmem.usage_in_bytes == memory.usage_in_bytes.

About use_hierarchy, see Section 6.

5.2 stat file
-------------

The memory.stat file includes the following statistics:

per-memory cgroup local status
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

=============== ===============================================================
cache           # of bytes of page cache memory.
rss             # of bytes of anonymous and swap cache memory (includes
                transparent hugepages).
rss_huge        # of bytes of anonymous transparent hugepages.
mapped_file     # of bytes of mapped file (includes tmpfs/shmem)
pgpgin          # of charging events to the memory cgroup. The charging
                event happens each time a page is accounted as either mapped
                anon page (RSS) or cache page (Page Cache) to the cgroup.
pgpgout         # of uncharging events to the memory cgroup. The uncharging
                event happens each time a page is unaccounted from the
                cgroup.
swap            # of bytes of swap usage
dirty           # of bytes that are waiting to get written back to the disk.
writeback       # of bytes of file/anon cache that are queued for syncing to
                disk.
inactive_anon   # of bytes of anonymous and swap cache memory on inactive
                LRU list.
active_anon     # of bytes of anonymous and swap cache memory on active
                LRU list.
inactive_file   # of bytes of file-backed memory on inactive LRU list.
active_file     # of bytes of file-backed memory on active LRU list.
unevictable     # of bytes of memory that cannot be reclaimed (mlocked etc).
=============== ===============================================================

status considering hierarchy (see memory.use_hierarchy settings)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

========================= ===================================================
hierarchical_memory_limit # of bytes of memory limit with regard to the
                          hierarchy under which the memory cgroup is
hierarchical_memsw_limit  # of bytes of memory+swap limit with regard to
                          the hierarchy under which the memory cgroup is

total_<counter>           hierarchical version of <counter>, which in
                          addition to the cgroup's own value includes the
                          sum of all hierarchical children's values of
                          <counter>, i.e. total_cache
========================= ===================================================

The following additional stats are dependent on CONFIG_DEBUG_VM
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

========================= ========================================
recent_rotated_anon       VM internal parameter. (see mm/vmscan.c)
recent_rotated_file       VM internal parameter. (see mm/vmscan.c)
recent_scanned_anon       VM internal parameter. (see mm/vmscan.c)
recent_scanned_file       VM internal parameter. (see mm/vmscan.c)
========================= ========================================

Memo:
  recent_rotated means the recent frequency of LRU rotation.
  recent_scanned means the recent # of scans of the LRU.
  These are shown for easier debugging; please see the code for the exact
  meanings.

Note:
  Only anonymous and swap cache memory is listed as part of the 'rss' stat.
  This should not be confused with the true 'resident set size' or the
  amount of physical memory used by the cgroup.

  'rss + mapped_file' will give you the resident set size of the cgroup.

  (Note: file and shmem may be shared with other cgroups. In that case,
  mapped_file is accounted only when the memory cgroup is the owner of the
  page cache.)
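
For example, those two counters can be pulled out of the file with a
sketch like::

    # grep -E '^(rss|mapped_file) ' memory.stat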

5.3 swappiness
--------------

Overrides /proc/sys/vm/swappiness for the particular group. The tunable
in the root cgroup corresponds to the global swappiness setting.

Please note that, unlike global reclaim, limit reclaim
enforces that a swappiness of 0 really prevents any swapping even if
swap storage is available. This might lead to the memcg OOM killer
being invoked if there are no file pages to reclaim.
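
For example, to bias reclaim in a group toward dropping file pages rather
than swapping (the value 10 is an arbitrary illustration; the group is
assumed to be the current directory)::

    # echo 10 > memory.swappiness
    # cat memory.swappiness
    10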

5.4 failcnt
-----------

A memory cgroup provides the memory.failcnt and memory.memsw.failcnt files.
This failcnt (== failure count) shows the number of times that the usage
counter hit its limit. When a memory cgroup hits a limit, failcnt increases
and memory under it will be reclaimed.

You can reset failcnt by writing 0 to the failcnt file::

    # echo 0 > .../memory.failcnt

5.5 usage_in_bytes
------------------

For efficiency, like other kernel components, the memory cgroup uses some
optimization to avoid unnecessary cacheline false sharing. usage_in_bytes
is affected by this method and doesn't show the 'exact' value of memory
(and swap) usage; it's a fuzzy value for efficient access. (Of course, it
is synchronized when necessary.) If you want to know the more exact memory
usage, you should use the RSS+CACHE(+SWAP) value in memory.stat (see 5.2).
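
A sketch of deriving that more exact figure from memory.stat::

    # awk '$1=="rss"||$1=="cache"||$1=="swap"{s+=$2} END{print s}' memory.stat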

5.6 numa_stat
-------------

This is similar to numa_maps but operates on a per-memcg basis. This is
useful for providing visibility into the NUMA locality information within
a memcg since the pages are allowed to be allocated from any physical
node. One of the use cases is evaluating application performance by
combining this information with the application's CPU allocation.

Each memcg's numa_stat file includes "total", "file", "anon" and
"unevictable" per-node page counts, including "hierarchical_<counter>"
entries which sum up all hierarchical children's values in addition to
the memcg's own value.

The output format of memory.numa_stat is::

    total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ...
    file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ...
    anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
    unevictable=<total unevictable pages> N0=<node 0 pages> N1=<node 1 pages> ...
    hierarchical_<counter>=<counter pages> N0=<node 0 pages> N1=<node 1 pages> ...

The "total" count is the sum of file + anon + unevictable.

6. Hierarchy support
====================

The memory controller supports a deep hierarchy and hierarchical accounting.
The hierarchy is created by creating the appropriate cgroups in the
cgroup filesystem. Consider, for example, the following cgroup filesystem
hierarchy::

                root
              /  |  \
             /   |   \
            a    b    c
                     / \
                    /   \
                   d     e
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 676)
In the diagram above, with hierarchical accounting enabled, all memory
usage of e is accounted to its ancestors up to the root (i.e. c and root)
that have memory.use_hierarchy enabled. If one of the ancestors goes over
its limit, the reclaim algorithm reclaims from the tasks in that ancestor
and in its children.
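
For example, such a hierarchy can be built with ordinary mkdir calls in the
cgroup filesystem (assuming the memory controller is mounted at
/sys/fs/cgroup/memory, as in the test script of Section 11)::

# cd /sys/fs/cgroup/memory
# mkdir a b c        # a, b and c are children of the root
# mkdir c/d c/e      # d and e are children of c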
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 682)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 683) 6.1 Enabling hierarchical accounting and reclaim
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 684) ------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 685)
The hierarchy feature is disabled by default in each memory cgroup. Support
can be enabled by writing 1 to the memory.use_hierarchy file of the root cgroup::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 688)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 689) # echo 1 > memory.use_hierarchy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 690)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 691) The feature can be disabled by::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 692)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 693) # echo 0 > memory.use_hierarchy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 694)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 695) NOTE1:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 696) Enabling/disabling will fail if either the cgroup already has other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 697) cgroups created below it, or if the parent cgroup has use_hierarchy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 698) enabled.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 699)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 700) NOTE2:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 701) When panic_on_oom is set to "2", the whole system will panic in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 702) case of an OOM event in any cgroup.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 703)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 704) 7. Soft limits
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 705) ==============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 706)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 707) Soft limits allow for greater sharing of memory. The idea behind soft limits
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 708) is to allow control groups to use as much of the memory as needed, provided
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 709)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 710) a. There is no memory contention
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 711) b. They do not exceed their hard limit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 712)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 713) When the system detects memory contention or low memory, control groups
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 714) are pushed back to their soft limits. If the soft limit of each control
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 715) group is very high, they are pushed back as much as possible to make
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 716) sure that one control group does not starve the others of memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 717)
Please note that soft limits are a best-effort feature; they come with
no guarantees, but the kernel does its best to make sure that when memory is
heavily contended for, memory is allocated based on the soft limit
hints/settings. Currently soft-limit-based reclaim is set up such that
it gets invoked from balance_pgdat (kswapd).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 723)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 724) 7.1 Interface
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 725) -------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 726)
Soft limits can be set up by using the following commands (in this example we
assume a soft limit of 256 MiB)::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 729)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 730) # echo 256M > memory.soft_limit_in_bytes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 731)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 732) If we want to change this to 1G, we can at any time use::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 733)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 734) # echo 1G > memory.soft_limit_in_bytes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 735)
NOTE1:
Soft limits take effect over a long period of time, since they involve
reclaiming memory for balancing between memory cgroups.
NOTE2:
It is recommended that the soft limit always be set below the hard limit,
otherwise the hard limit will take precedence.
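
For example, to give a group a hard ceiling of 512 MiB while asking that it
be pushed back to 256 MiB under contention (values chosen purely for
illustration)::

# echo 512M > memory.limit_in_bytes
# echo 256M > memory.soft_limit_in_bytes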
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 742)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 743) 8. Move charges at task migration
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 744) =================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 745)
Users can move charges associated with a task along with task migration,
that is, uncharge the task's pages from the old cgroup and charge them to
the new cgroup. This feature is not supported in !CONFIG_MMU environments
because of the lack of page tables.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 750)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 751) 8.1 Interface
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 752) -------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 753)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 754) This feature is disabled by default. It can be enabled (and disabled again) by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 755) writing to memory.move_charge_at_immigrate of the destination cgroup.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 756)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 757) If you want to enable it::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 758)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 759) # echo (some positive value) > memory.move_charge_at_immigrate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 760)
Note:
Each bit of move_charge_at_immigrate has its own meaning about what type
of charges should be moved. See 8.2 for details.
Note:
Charges are moved only when you move mm->owner, in other words,
the leader of a thread group.
Note:
If we cannot find enough space for the task in the destination cgroup, we
try to make space by reclaiming memory. Task migration may fail if we
cannot make enough space.
Note:
Moving charges can take several seconds if there are many charges to move.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 773)
And if you want to disable it again::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 775)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 776) # echo 0 > memory.move_charge_at_immigrate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 777)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 778) 8.2 Type of charges which can be moved
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 779) --------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 780)
Each bit in move_charge_at_immigrate has its own meaning about what type of
charges should be moved. But in any case, it must be noted that the charge
of a page or a swap entry can be moved only when it is charged to the task's
current (old) memory cgroup.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 785)
+---+--------------------------------------------------------------------------+
|bit| what type of charges would be moved ?                                    |
+===+==========================================================================+
| 0 | A charge of an anonymous page (or swap of it) used by the target task.  |
|   | You must enable Swap Extension (see 2.4) to enable move of swap charges.|
+---+--------------------------------------------------------------------------+
| 1 | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory)|
|   | and swaps of tmpfs file) mmapped by the target task. Unlike the case of |
|   | anonymous pages, file pages (and swaps) in the range mmapped by the task|
|   | will be moved even if the task hasn't faulted them in, i.e. they might  |
|   | not be the task's "RSS", but other task's "RSS" that maps the same file.|
|   | The mapcount of the page is ignored (the page can be moved even if      |
|   | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to   |
|   | enable move of swap charges.                                             |
+---+--------------------------------------------------------------------------+
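
For example, writing 3 (bits 0 and 1 set) requests that both anonymous and
mmapped file charges be moved along with the task::

# echo 3 > memory.move_charge_at_immigrate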
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 801)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 802) 8.3 TODO
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 803) --------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 804)
- All of the moving-charge operations are done under cgroup_mutex. It's not
  good behavior to hold the mutex for too long, so we may need some trick.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 807)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 808) 9. Memory thresholds
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 809) ====================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 810)
The memory cgroup implements memory thresholds using the cgroups notification
API (see cgroups.txt). It allows you to register multiple memory and memsw
thresholds and to receive notifications when a threshold is crossed.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 814)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 815) To register a threshold, an application must:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 816)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 817) - create an eventfd using eventfd(2);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 818) - open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
- write a string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to
cgroup.event_control.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 821)
The application will be notified through eventfd when memory usage crosses
the threshold in either direction.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 824)
This is applicable to both root and non-root cgroups.
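
As a sketch, the cgroup_event_listener helper used in the Section 11 test
script performs exactly these three steps. Assuming it is run from inside the
cgroup's directory, with an illustrative threshold of 5M::

# cgroup_event_listener memory.usage_in_bytes 5M &

Each time usage crosses 5M in either direction, the eventfd fires and the
listener reports the event.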
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 826)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 827) 10. OOM Control
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 828) ===============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 829)
The memory.oom_control file is used for OOM notification and other controls.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 831)
The memory cgroup implements an OOM notifier using the cgroup notification
API (see cgroups.txt). It allows you to register multiple OOM notification
deliveries and to receive a notification when an OOM event occurs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 835)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 836) To register a notifier, an application must:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 837)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 838) - create an eventfd using eventfd(2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 839) - open memory.oom_control file
- write a string like "<event_fd> <fd of memory.oom_control>" to
cgroup.event_control
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 842)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 843) The application will be notified through eventfd when OOM happens.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 844) OOM notification doesn't work for the root cgroup.
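
As a sketch, the same cgroup_event_listener helper can be pointed at
memory.oom_control. The helper requires an argument string; the kernel is
believed to ignore it for OOM registration, so the "dummy" placeholder below
is purely hypothetical::

# cgroup_event_listener memory.oom_control dummy &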
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 845)
You can disable the OOM-killer by writing "1" to memory.oom_control file, as::

# echo 1 > memory.oom_control
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 849)
If the OOM-killer is disabled, tasks in the cgroup will hang/sleep on the
memory cgroup's OOM-waitqueue when they request accountable memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 852)
To let them run again, you have to relax the memory cgroup's OOM status by:

* enlarging the limit or reducing usage.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 856)
To reduce usage:

* kill some tasks.
* move some tasks to another group with charge migration.
* remove some files (on tmpfs?)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 862)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 863) Then, stopped tasks will work again.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 864)
On reading, the current OOM status is shown:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 866)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 867) - oom_kill_disable 0 or 1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 868) (if 1, oom-killer is disabled)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 869) - under_oom 0 or 1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 870) (if 1, the memory cgroup is under OOM, tasks may be stopped.)
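
For illustration, a read might look like the following (made-up state with
the OOM-killer disabled and no OOM in progress; newer kernels may append
further counters)::

# cat memory.oom_control
oom_kill_disable 1
under_oom 0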
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 871)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 872) 11. Memory Pressure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 873) ===================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 874)
The pressure level notifications can be used to monitor the memory
allocation cost; based on the pressure, applications can implement
different strategies for managing their memory resources. The pressure
levels are defined as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 879)
The "low" level means that the system is reclaiming memory for new
allocations. Monitoring this reclaiming activity might be useful for
maintaining cache levels. Upon notification, the program (typically an
"Activity Manager") might analyze vmstat and act in advance (e.g.
prematurely shut down unimportant services).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 885)
The "medium" level means that the system is experiencing medium memory
pressure: it might be swapping, paging out active file caches, etc. Upon
this event, applications may decide to further analyze vmstat/zoneinfo/
memory cgroup statistics or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from disk.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 891)
The "critical" level means that the system is actively thrashing; it is
about to run out of memory (OOM), or the in-kernel OOM killer is about to
trigger. Applications should do whatever they can to help the system. It
might be too late to consult vmstat or any other statistics, so it is
advisable to take immediate action.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 897)
By default, events are propagated upward until the event is handled, i.e.
the events are not pass-through. For example, suppose you have three
cgroups: A->B->C. Now you set up an event listener on cgroups A, B and C,
and group C experiences some pressure. In this situation, only group C will
receive the notification, i.e. groups A and B will not receive it. This is
done to avoid excessive "broadcasting" of messages, which disturbs the
system and which is especially bad if we are low on memory or thrashing.
Group B will receive the notification only if there are no event listeners
for group C.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 906)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 907) There are three optional modes that specify different propagation behavior:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 908)
- "default": this is the default behavior specified above. This mode is the
  same as omitting the optional mode parameter, and is preserved for
  backwards compatibility.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 912)
- "hierarchy": events always propagate up to the root, similar to the
  default behavior, except that propagation continues regardless of whether
  there are event listeners at each level. In the above example, groups A, B
  and C will all receive notification of memory pressure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 917)
- "local": events are pass-through, i.e. a notification is received only
  when memory pressure is experienced in the memory cgroup for which the
  notification is registered. In the above example, group C will receive
  notification if registered for "local" notification and the group
  experiences memory pressure. However, group B will never receive
  notification, regardless of whether there is an event listener for
  group C, if group B is registered for local notification.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 925)
The level and event notification mode ("hierarchy" or "local", if necessary)
are specified by a comma-delimited string, e.g. "low,hierarchy" specifies
hierarchical, pass-through notification for all ancestor memory cgroups. The
default, non pass-through notification does not specify a mode.
"medium,local" specifies pass-through notification for the medium level.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 931)
The file memory.pressure_level is only used to set up an eventfd. To
register a notification, an application must:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 934)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 935) - create an eventfd using eventfd(2);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 936) - open memory.pressure_level;
- write a string like "<event_fd> <fd of memory.pressure_level> <level[,mode]>"
to cgroup.event_control.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 939)
The application will be notified through eventfd when memory pressure is at
the specified level (or higher). Read/write operations on
memory.pressure_level are not implemented.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 943)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 944) Test:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 945)
Here is a small script example that makes a new cgroup, sets up a memory
limit and a notification in it, and then makes the child cgroup experience
critical memory pressure::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 949)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 950) # cd /sys/fs/cgroup/memory/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 951) # mkdir foo
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 952) # cd foo
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 953) # cgroup_event_listener memory.pressure_level low,hierarchy &
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 954) # echo 8000000 > memory.limit_in_bytes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 955) # echo 8000000 > memory.memsw.limit_in_bytes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 956) # echo $$ > tasks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 957) # dd if=/dev/zero | read x
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 958)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 959) (Expect a bunch of notifications, and eventually, the oom-killer will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 960) trigger.)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 961)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 962) 12. TODO
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 963) ========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 964)
1. Make the per-cgroup scanner reclaim not-shared pages first
2. Teach the controller to account for shared pages
3. Start reclamation in the background when the limit is
not yet hit but the usage is getting closer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 969)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 970) Summary
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 971) =======
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 972)
Overall, the memory controller has been a stable controller and has been
commented on and discussed quite extensively in the community.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 975)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 976) References
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 977) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 978)
1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
2. Singh, Balbir. Memory Controller (RSS Control),
http://lwn.net/Articles/222762/
3. Emelianov, Pavel. Resource controllers based on process cgroups,
http://lkml.org/lkml/2007/3/6/198
4. Emelianov, Pavel. RSS controller based on process cgroups (v2),
http://lkml.org/lkml/2007/4/9/78
5. Emelianov, Pavel. RSS controller based on process cgroups (v3),
http://lkml.org/lkml/2007/5/30/244
6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/
7. Vaidyanathan, Srinivasan. Control Groups: Pagecache accounting and control
subsystem (v3), http://lwn.net/Articles/235534/
8. Singh, Balbir. RSS controller v2 test results (lmbench),
http://lkml.org/lkml/2007/5/17/232
9. Singh, Balbir. RSS controller v2 AIM9 results,
http://lkml.org/lkml/2007/5/18/1
10. Singh, Balbir. Memory controller v6 test results,
http://lkml.org/lkml/2007/8/19/36
11. Singh, Balbir. Memory controller introduction (v6),
http://lkml.org/lkml/2007/8/17/69
12. Corbet, Jonathan. Controlling memory use in cgroups,
http://lwn.net/Articles/243795/