Orange Pi 5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

==========================
Memory Resource Controller
==========================

NOTE:
      This document is hopelessly outdated and it asks for a complete
      rewrite. It still contains useful information, so we are keeping it
      here, but make sure to check the current code if you need a deeper
      understanding.

NOTE:
      The Memory Resource Controller has generically been referred to as the
      memory controller in this document. Do not confuse the memory controller
      used here with the memory controller that is used in hardware.

(For editors) In this document:
      When we mention a cgroup (cgroupfs's directory) with the memory
      controller, we call it a "memory cgroup". In git logs and source code,
      you'll see that patch titles and function names tend to use "memcg".
      In this document, we avoid using it.

Benefits and Purpose of the memory controller
=============================================

The memory controller isolates the memory behaviour of a group of tasks
from the rest of the system. The article on LWN [12] mentions some probable
uses of the memory controller. The memory controller can be used to

a. Isolate an application or a group of applications.
   Memory-hungry applications can be isolated and limited to a smaller
   amount of memory.
b. Create a cgroup with a limited amount of memory; this can be used
   as a good alternative to booting with mem=XXXX.
c. Virtualization solutions can control the amount of memory they want
   to assign to a virtual machine instance.
d. A CD/DVD burner could control the amount of memory used by the
   rest of the system to ensure that burning does not fail due to lack
   of available memory.
e. There are several other use cases; find one or use the controller just
   for fun (to learn and hack on the VM subsystem).

Current Status: linux-2.6.34-mmotm (development version of April 2010)

Features:

 - accounting of anonymous pages, file caches, and swap caches, and limiting
   their usage.
 - pages are linked to per-memcg LRU lists exclusively; there is no global LRU.
 - optionally, memory+swap usage can be accounted and limited.
 - hierarchical accounting
 - soft limit
 - moving (recharging) a task's accounting when the task migrates between
   cgroups is selectable.
 - usage threshold notifier
 - memory pressure notifier
 - oom-killer disable knob and oom-notifier
 - Root cgroup has no limit controls.

 Kernel memory support is a work in progress, and the current version provides
 basic functionality. (See Section 2.7.)

Brief summary of control files.

==================================== ==========================================
 tasks                               attach a task (thread) and show the list
                                     of threads
 cgroup.procs                        show the list of processes
 cgroup.event_control                an interface for event_fd()
 memory.usage_in_bytes               show current usage for memory
                                     (See 5.5 for details)
 memory.memsw.usage_in_bytes         show current usage for memory+Swap
                                     (See 5.5 for details)
 memory.limit_in_bytes               set/show limit of memory usage
 memory.memsw.limit_in_bytes         set/show limit of memory+Swap usage
 memory.failcnt                      show the number of times memory usage
                                     hit the limit
 memory.memsw.failcnt                show the number of times memory+Swap
                                     usage hit the limit
 memory.max_usage_in_bytes           show max memory usage recorded
 memory.memsw.max_usage_in_bytes     show max memory+Swap usage recorded
 memory.soft_limit_in_bytes          set/show soft limit of memory usage
 memory.stat                         show various statistics
 memory.use_hierarchy                set/show hierarchical account enabled
 memory.force_empty                  trigger forced page reclaim
 memory.pressure_level               set memory pressure notifications
 memory.swappiness                   set/show swappiness parameter of vmscan
                                     (See sysctl's vm.swappiness)
 memory.move_charge_at_immigrate     set/show controls of moving charges
 memory.oom_control                  set/show oom controls
 memory.numa_stat                    show memory usage per NUMA node
 memory.kmem.limit_in_bytes          set/show hard limit for kernel memory
                                     This knob is deprecated and shouldn't be
                                     used. It is planned to be removed in the
                                     foreseeable future.
 memory.kmem.usage_in_bytes          show current kernel memory allocation
 memory.kmem.failcnt                 show the number of times kernel memory
                                     usage hit the limit
 memory.kmem.max_usage_in_bytes      show max kernel memory usage recorded

 memory.kmem.tcp.limit_in_bytes      set/show hard limit for tcp buf memory
 memory.kmem.tcp.usage_in_bytes      show current tcp buf memory allocation
 memory.kmem.tcp.failcnt             show the number of times tcp buf memory
                                     usage hit the limit
 memory.kmem.tcp.max_usage_in_bytes  show max tcp buf memory usage recorded
==================================== ==========================================
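
Once the controller is mounted (see section 3.1), these files appear in every
memory cgroup's directory. A quick way to explore them; the group name "0" is
simply the example group used later in this document::

  # ls /sys/fs/cgroup/memory/0/
  # cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes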

1. History
==========

The memory controller has a long history. A request for comments for the memory
controller was posted by Balbir Singh [1]. At the time the RFC was posted
there were several implementations for memory control. The goal of the
RFC was to build consensus and agreement for the minimal features required
for memory control. The first RSS controller was posted by Balbir Singh [2]
in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the
RSS controller. At OLS, at the resource management BoF, everyone suggested
that we handle both page cache and RSS together. Another request was raised
to allow user space handling of OOM. The current memory controller is
at version 6; it combines both mapped (RSS) and unmapped Page
Cache Control [11].

2. Memory Control
=================

Memory is a unique resource in the sense that it is present in a limited
amount. If a task requires a lot of CPU processing, the task can spread
its processing over a period of hours, days, months or years, but with
memory, the same physical memory needs to be reused to accomplish the task.

The memory controller implementation has been divided into phases. These
are:

1. Memory controller
2. mlock(2) controller
3. Kernel user memory accounting and slab control
4. user mappings length controller

The memory controller is the first controller developed.

2.1. Design
-----------

The core of the design is a counter called the page_counter. The
page_counter tracks the current memory usage and limit of the group of
processes associated with the controller. Each cgroup has a memory controller
specific data structure (mem_cgroup) associated with it.

2.2. Accounting
---------------

::

                +--------------------+
                |  mem_cgroup        |
                |  (page_counter)    |
                +--------------------+
                 /            ^      \
                /             |       \
           +---------------+  |        +---------------+
           | mm_struct     |  |....    | mm_struct     |
           |               |  |        |               |
           +---------------+  |        +---------------+
                              |
                              + --------------+
                                              |
           +---------------+           +------+--------+
           | page          +---------> | page_cgroup   |
           |               |           |               |
           +---------------+           +---------------+

             (Figure 1: Hierarchy of Accounting)


Figure 1 shows the important aspects of the controller:

1. Accounting happens per cgroup
2. Each mm_struct knows about which cgroup it belongs to
3. Each page has a pointer to the page_cgroup, which in turn knows the
   cgroup it belongs to

The accounting is done as follows: mem_cgroup_charge_common() is invoked to
set up the necessary data structures and check if the cgroup that is being
charged is over its limit. If it is, then reclaim is invoked on the cgroup.
More details can be found in the reclaim section of this document.
If everything goes well, a page meta-data structure called page_cgroup is
updated. page_cgroup has its own LRU on the cgroup.
(*) page_cgroup structures are allocated at boot/memory-hotplug time.

2.2.1 Accounting details
------------------------

All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
Some pages which are never reclaimable and will not be on the LRU
are not accounted. We just account pages under usual VM management.

RSS pages are accounted at page fault unless they've already been accounted
for earlier. A file page is accounted for as Page Cache when it's
inserted into the inode (radix-tree). While it's mapped into the page tables
of processes, duplicate accounting is carefully avoided.

An RSS page is unaccounted when it's fully unmapped. A PageCache page is
unaccounted when it's removed from the radix-tree. Even if RSS pages are fully
unmapped (by kswapd), they may exist as SwapCache in the system until they
are really freed. Such SwapCaches are also accounted.
A swapped-in page is accounted after being added to the swapcache.

Note: The kernel does swapin-readahead and reads multiple swap entries at
once. Since a page's memcg is recorded in the swap information regardless of
whether memsw is enabled, the page will be accounted after swap-in.

At page migration, accounting information is kept.

Note: we just account pages-on-LRU because our purpose is to control the
amount of used pages; from the VM's point of view, not-on-LRU pages tend to
be out of control.

2.3 Shared Page Accounting
--------------------------

Shared pages are accounted on a first-touch basis. The cgroup that first
touches a page is charged for the page. The principle behind this approach
is that a cgroup that aggressively uses a shared page will eventually get
charged for it (once it is uncharged from the cgroup that brought it in --
this will happen on memory pressure).

But see section 8.2: when moving a task to another cgroup, its pages may
be recharged to the new cgroup, if move_charge_at_immigrate has been chosen.

2.4 Swap Extension
------------------

Swap usage is always recorded for each cgroup. The Swap Extension allows you
to read and limit it.

When CONFIG_SWAP is enabled, the following files are added:

 - memory.memsw.usage_in_bytes
 - memory.memsw.limit_in_bytes

memsw means memory+swap. Usage of memory+swap is limited by
memsw.limit_in_bytes.

Example: Assume a system with 4G of swap. A task which allocates 6G of memory
(by mistake) under a 2G memory limit will use all of the swap.
In this case, setting memsw.limit_in_bytes=3G will prevent such bad use of
swap. By using the memsw limit, you can avoid a system OOM caused by swap
shortage.
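
A minimal sketch of the above example, assuming the cgroup-v1 memory
controller is mounted as in section 3.1 and an example group "0" exists;
note that memory.memsw.limit_in_bytes must always be kept >=
memory.limit_in_bytes::

  # echo 2G > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
  # echo 3G > /sys/fs/cgroup/memory/0/memory.memsw.limit_in_bytes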

**why 'memory+swap' rather than swap**

The global LRU (kswapd) can swap out arbitrary pages. Swapping a page out
moves its charge from memory to swap; the usage of memory+swap is unchanged.
In other words, when we want to limit the usage of swap without affecting
the global LRU, a memory+swap limit is better than just limiting swap, from
an OS point of view.

**What happens when a cgroup hits memory.memsw.limit_in_bytes**

When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out
in this cgroup. So swap-out will not be done by the cgroup routine, and file
caches are dropped instead. But as mentioned above, the global LRU can still
swap out memory from the cgroup for the sanity of the system's memory
management state; you cannot forbid that via cgroups.

2.5 Reclaim
-----------

Each cgroup maintains a per-cgroup LRU which has the same structure as the
global VM's. When a cgroup goes over its limit, we first try
to reclaim memory from the cgroup so as to make space for the new
pages that the cgroup has touched. If the reclaim is unsuccessful,
an OOM routine is invoked to select and kill the bulkiest task in the
cgroup. (See 10. OOM Control below.)

The reclaim algorithm has not been modified for cgroups, except that
pages that are selected for reclaiming come from the per-cgroup LRU
list.

NOTE:
  Reclaim does not work for the root cgroup, since we cannot set any
  limits on the root cgroup.

Note2:
  When panic_on_oom is set to "2", the whole system will panic.

When an oom event notifier is registered, the event will be delivered.
(See the oom_control section.)

2.6 Locking
-----------

   lock_page_cgroup()/unlock_page_cgroup() should not be called under
   the i_pages lock.

   The other lock ordering is as follows:

   PG_locked
     mm->page_table_lock
         pgdat->lru_lock
           lock_page_cgroup

  In many cases, just lock_page_cgroup() is called.

  The per-zone-per-cgroup LRU (the cgroup's private LRU) is guarded only by
  pgdat->lru_lock; it has no lock of its own.

2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
-----------------------------------------------

With the Kernel memory extension, the Memory Controller is able to limit
the amount of kernel memory used by the system. Kernel memory is fundamentally
different from user memory, since it can't be swapped out, which makes it
possible to DoS the system by consuming too much of this precious resource.

Kernel memory accounting is enabled for all memory cgroups by default. But
it can be disabled system-wide by passing cgroup.memory=nokmem to the kernel
at boot time. In this case, kernel memory will not be accounted at all.

Kernel memory limits are not imposed for the root cgroup. Usage for the root
cgroup may or may not be accounted. The memory used is accumulated into
memory.kmem.usage_in_bytes, or in a separate counter when it makes sense
(currently only for tcp).

The main "kmem" counter is fed into the main counter, so kmem charges will
also be visible from the user counter.

Currently no soft limit is implemented for kernel memory. Triggering slab
reclaim when those limits are reached is future work.

2.7.1 Current Kernel Memory resources accounted
-----------------------------------------------

stack pages:
  every process consumes some stack pages. By accounting them as
  kernel memory, we prevent new processes from being created when the kernel
  memory usage is too high.

slab pages:
  pages allocated by the SLAB or SLUB allocator are tracked. A copy
  of each kmem_cache is created the first time the cache is touched
  from inside the memcg. The creation is done lazily, so some objects can
  still be skipped while the cache is being created. All objects in a slab
  page should belong to the same memcg. This only fails to hold when a task
  is migrated to a different memcg during the page allocation by the cache.

sockets memory pressure:
  some socket protocols have memory pressure
  thresholds. The Memory Controller allows them to be controlled individually
  per cgroup, instead of globally.

tcp memory pressure:
  sockets memory pressure for the tcp protocol.

2.7.2 Common use cases
----------------------

Because the "kmem" counter is fed into the main user counter, kernel memory
can never be limited completely independently of user memory. Say "U" is the
user limit, and "K" the kernel limit. There are three possible ways limits
can be set:

U != 0, K = unlimited:
    This is the standard memcg limitation mechanism already present before kmem
    accounting. Kernel memory is completely ignored.

U != 0, K < U:
    Kernel memory is a subset of the user memory. This setup is useful in
    deployments where the total amount of memory per-cgroup is overcommitted.
    Overcommitting kernel memory limits is definitely not recommended, since
    the box can still run out of non-reclaimable memory.
    In this case, the admin could set up K so that the sum of all groups is
    never greater than the total memory, and freely set U at the cost of
    QoS.

WARNING:
    In the current implementation, memory reclaim will NOT be
    triggered for a cgroup when it hits K while staying below U, which makes
    this setup impractical.

U != 0, K >= U:
    Since kmem charges are also fed into the user counter, reclaim will be
    triggered for the cgroup for both kinds of memory. This setup gives the
    admin a unified view of memory, and it is also useful for people who just
    want to track kernel memory usage.
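
As a hedged sketch of the "U != 0, K >= U" case, with the example mount and
group "0" from section 3 and illustrative values (keeping in mind that the
memory.kmem.limit_in_bytes knob is deprecated, as noted in the control-file
summary)::

  # echo 500M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes       # U
  # echo 1G > /sys/fs/cgroup/memory/0/memory.kmem.limit_in_bytes    # K >= U
  # cat /sys/fs/cgroup/memory/0/memory.kmem.usage_in_bytes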

3. User Interface
=================

3.0. Configuration
------------------

a. Enable CONFIG_CGROUPS
b. Enable CONFIG_MEMCG
c. Enable CONFIG_MEMCG_SWAP (to use the swap extension)
d. Enable CONFIG_MEMCG_KMEM (to use the kmem extension)
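
If you are unsure whether a running kernel was built with these options, one
way to check (assuming the kernel exposes its config via CONFIG_IKCONFIG_PROC,
or that a config file is installed under /boot)::

  # zgrep -E 'CONFIG_(CGROUPS|MEMCG|MEMCG_SWAP|MEMCG_KMEM)=' /proc/config.gz
  # grep -E 'CONFIG_(CGROUPS|MEMCG)=' /boot/config-$(uname -r)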

3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
-------------------------------------------------------------------

::

	# mount -t tmpfs none /sys/fs/cgroup
	# mkdir /sys/fs/cgroup/memory
	# mount -t cgroup none /sys/fs/cgroup/memory -o memory

3.2. Make the new group and move bash into it::

	# mkdir /sys/fs/cgroup/memory/0
	# echo $$ > /sys/fs/cgroup/memory/0/tasks

Now that we're in the 0 cgroup, we can alter the memory limit::

	# echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes

NOTE:
  We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
  mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes,
  Gibibytes.)

NOTE:
  We can write "-1" to reset ``*.limit_in_bytes`` (unlimited).

NOTE:
  We cannot set limits on the root cgroup any more.

::

  # cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
  4194304

We can check the usage::

  # cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
  1216512

A successful write to this file does not guarantee that the limit was set
to the exact value written into the file. This can be due to a
number of factors, such as rounding up to page boundaries or the total
availability of memory on the system. The user is required to re-read
this file after a write to see the value actually committed by the kernel::

  # echo 1 > memory.limit_in_bytes
  # cat memory.limit_in_bytes
  4096

The memory.failcnt field gives the number of times that the cgroup limit was
exceeded.

The memory.stat file gives accounting information. Currently, the number of
caches, RSS and active/inactive pages is shown.

4. Testing
==========

For testing features and implementation, see memcg_test.txt.

Performance testing is also important. To see the pure memory controller
overhead, testing on tmpfs will give you good numbers for the small overheads
involved. Example: do a kernel make on tmpfs.

Page-fault scalability is also important. When measuring parallel
page-fault performance, a multi-process test may be better than a
multi-threaded one, because the latter adds noise from shared objects/state.

But the above two test extreme situations.
Running your usual tests under the memory controller is always helpful.

4.1 Troubleshooting
-------------------

Sometimes a user might find that an application under a cgroup is
terminated by the OOM killer. There are several causes for this:

1. The cgroup limit is too low (just too low to do anything useful)
2. The user is using anonymous memory and swap is turned off or too low

A sync followed by echo 1 > /proc/sys/vm/drop_caches, as shown below, will
help get rid of some of the pages cached in the cgroup (page cache pages).
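
For example (note that drop_caches acts system-wide, not only on the cgroup)::

  # sync
  # echo 1 > /proc/sys/vm/drop_caches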

To know what happens, disabling the OOM killer as described in "10. OOM
Control" (below) and seeing what happens will be helpful.

4.2 Task migration
------------------

When a task migrates from one cgroup to another, its charge is not
carried forward by default. The pages allocated from the original cgroup
still remain charged to it; the charge is dropped when the page is freed or
reclaimed.

You can move the charges of a task along with task migration.
See 8. "Move charges at task migration"

4.3 Removing a cgroup
---------------------

A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
cgroup might have some charge associated with it, even though all
tasks have migrated away from it. (This is because we charge against pages,
not against tasks.)

We move the stats to the root (if use_hierarchy==0) or the parent (if
use_hierarchy==1), and the charge is unchanged except for being uncharged
from the child.

Charges recorded in swap information are not updated when a cgroup is
removed. The recorded information is discarded, and a cgroup which uses the
swap (swapcache) afterwards will be charged as its new owner.

About use_hierarchy, see Section 6.

5. Misc. interfaces
===================

5.1 force_empty
---------------
  The memory.force_empty interface is provided to make a cgroup's memory
  usage empty. When writing anything to this::

    # echo 0 > memory.force_empty

  reclaim is triggered on the cgroup and as many pages as possible are
  reclaimed.

  The typical use case for this interface is before calling rmdir().
  Though rmdir() offlines the memcg, the memcg may still be pinned by
  charged file caches. Some out-of-use page caches may remain charged until
  memory pressure happens. If you want to avoid that, force_empty is useful.

  Also, note that when memory.kmem.limit_in_bytes is set, the charges due to
  kernel pages will still be seen. This is not considered a failure and the
  write will still return success. In this case, it is expected that
  memory.kmem.usage_in_bytes == memory.usage_in_bytes.

  About use_hierarchy, see Section 6.


5.2 stat file
-------------

The memory.stat file includes the following statistics:

per-memory cgroup local status
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

=============== ===============================================================
cache           # of bytes of page cache memory.
rss             # of bytes of anonymous and swap cache memory (includes
                transparent hugepages).
rss_huge        # of bytes of anonymous transparent hugepages.
mapped_file     # of bytes of mapped file (includes tmpfs/shmem)
pgpgin          # of charging events to the memory cgroup. The charging
                event happens each time a page is accounted as either a
                mapped anon page (RSS) or a cache page (Page Cache) to the
                cgroup.
pgpgout         # of uncharging events to the memory cgroup. The uncharging
                event happens each time a page is unaccounted from the cgroup.
swap            # of bytes of swap usage
dirty           # of bytes that are waiting to get written back to the disk.
writeback       # of bytes of file/anon cache that are queued for syncing to
                disk.
inactive_anon   # of bytes of anonymous and swap cache memory on inactive
                LRU list.
active_anon     # of bytes of anonymous and swap cache memory on active
                LRU list.
inactive_file   # of bytes of file-backed memory on inactive LRU list.
active_file     # of bytes of file-backed memory on active LRU list.
unevictable     # of bytes of memory that cannot be reclaimed (mlocked etc).
=============== ===============================================================

status considering hierarchy (see memory.use_hierarchy settings)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

========================= ===================================================
hierarchical_memory_limit # of bytes of memory limit with regard to the
                          hierarchy under which the memory cgroup falls
hierarchical_memsw_limit  # of bytes of memory+swap limit with regard to the
                          hierarchy under which the memory cgroup falls

total_<counter>           # hierarchical version of <counter>, which in
                          addition to the cgroup's own value includes the
                          sum of all hierarchical children's values of
                          <counter>, e.g. total_cache
========================= ===================================================

The following additional stats are dependent on CONFIG_DEBUG_VM
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

========================= ========================================
recent_rotated_anon       VM internal parameter. (see mm/vmscan.c)
recent_rotated_file       VM internal parameter. (see mm/vmscan.c)
recent_scanned_anon       VM internal parameter. (see mm/vmscan.c)
recent_scanned_file       VM internal parameter. (see mm/vmscan.c)
========================= ========================================

Memo:
	recent_rotated means the recent frequency of LRU rotation.
	recent_scanned means the recent # of scans of the LRU.
	These are shown to aid debugging; please see the code for their exact
	meanings.

Note:
	Only anonymous and swap cache memory is listed as part of the 'rss'
	stat. This should not be confused with the true 'resident set size'
	or the amount of physical memory used by the cgroup.

	'rss + mapped_file' will give you the resident set size of the cgroup.

	(Note: file and shmem may be shared with other cgroups. In that case,
	mapped_file is accounted only when the memory cgroup is the owner of
	the page cache.)
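
As a sketch, the resident set size of the example group "0" could be computed
from memory.stat like this (path as in section 3; memory.stat is a list of
"<name> <value>" lines)::

  # awk '$1 == "rss" || $1 == "mapped_file" {sum += $2}
         END {print sum}' /sys/fs/cgroup/memory/0/memory.stat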

5.3 swappiness
--------------

Overrides /proc/sys/vm/swappiness for the particular group. The tunable
in the root cgroup corresponds to the global swappiness setting.

Please note that, unlike during global reclaim, limit reclaim
enforces that a swappiness of 0 really prevents any swapping even if
swap storage is available. This might invoke the memcg OOM killer
if there are no file pages to reclaim.
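
For example, to avoid swapping entirely during limit reclaim in the example
group "0" (with the caveat above that this may invoke the memcg OOM killer)::

	# echo 0 > /sys/fs/cgroup/memory/0/memory.swappiness
	# cat /sys/fs/cgroup/memory/0/memory.swappiness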

5.4 failcnt
-----------

A memory cgroup provides memory.failcnt and memory.memsw.failcnt files.
This failcnt (== failure count) shows the number of times the usage counter
hit its limit. When a memory cgroup hits a limit, failcnt increases and
memory under it will be reclaimed.

You can reset failcnt by writing 0 to the failcnt file::

	# echo 0 > .../memory.failcnt

5.5 usage_in_bytes
------------------

For efficiency, like other kernel components, the memory cgroup uses some
optimizations to avoid unnecessary cacheline false sharing. usage_in_bytes
is affected by this and doesn't show the 'exact' value of memory (and swap)
usage; it's a fuzz value for efficient access. (Of course, it is synchronized
when necessary.) If you want to know the more exact memory usage, you should
use the RSS+CACHE(+SWAP) value in memory.stat (see 5.2).
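
As a sketch (example group "0" again), compare the fuzz counter against the
sum of rss and cache from memory.stat; the two may differ slightly::

  # cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
  # awk '$1 == "rss" || $1 == "cache" {sum += $2}
         END {print sum}' /sys/fs/cgroup/memory/0/memory.stat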
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  637) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  638) 5.6 numa_stat
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  639) -------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  640) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  641) This is similar to numa_maps but operates on a per-memcg basis.  This is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  642) useful for providing visibility into the numa locality information within
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  643) an memcg since the pages are allowed to be allocated from any physical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  644) node.  One of the use cases is evaluating application performance by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  645) combining this information with the application's CPU allocation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  646) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  647) Each memcg's numa_stat file includes "total", "file", "anon" and "unevictable"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  648) per-node page counts including "hierarchical_<counter>" which sums up all
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  649) hierarchical children's values in addition to the memcg's own value.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  650) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  651) The output format of memory.numa_stat is::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  652) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  653)   total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ...
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  654)   file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ...
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  655)   anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  656)   unevictable=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  657)   hierarchical_<counter>=<counter pages> N0=<node 0 pages> N1=<node 1 pages> ...
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  658) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  659) The "total" count is sum of file + anon + unevictable.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  660) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  661) 6. Hierarchy support
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  662) ====================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  663) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  664) The memory controller supports a deep hierarchy and hierarchical accounting.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  665) The hierarchy is created by creating the appropriate cgroups in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  666) cgroup filesystem. Consider for example, the following cgroup filesystem
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  667) hierarchy::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  668) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  669) 	       root
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  670) 	     /  |   \
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  671)             /	|    \
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  672) 	   a	b     c
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  673) 		      | \
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  674) 		      |  \
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  675) 		      d   e
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  676) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  677) In the diagram above, with hierarchical accounting enabled, all memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  678) usage of e, is accounted to its ancestors up until the root (i.e, c and root),
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  679) that has memory.use_hierarchy enabled. If one of the ancestors goes over its
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  680) limit, the reclaim algorithm reclaims from the tasks in the ancestor and the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  681) children of the ancestor.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  682) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  683) 6.1 Enabling hierarchical accounting and reclaim
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  684) ------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  685) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  686) A memory cgroup by default disables the hierarchy feature. Support
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  687) can be enabled by writing 1 to memory.use_hierarchy file of the root cgroup::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  688) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  689) 	# echo 1 > memory.use_hierarchy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  690) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  691) The feature can be disabled by::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  692) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  693) 	# echo 0 > memory.use_hierarchy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  694) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  695) NOTE1:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  696)        Enabling/disabling will fail if either the cgroup already has other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  697)        cgroups created below it, or if the parent cgroup has use_hierarchy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  698)        enabled.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  699) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  700) NOTE2:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  701)        When panic_on_oom is set to "2", the whole system will panic in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  702)        case of an OOM event in any cgroup.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  703) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  704) 7. Soft limits
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  705) ==============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  706) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  707) Soft limits allow for greater sharing of memory. The idea behind soft limits
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  708) is to allow control groups to use as much of the memory as needed, provided
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  709) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  710) a. There is no memory contention
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  711) b. They do not exceed their hard limit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  712) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  713) When the system detects memory contention or low memory, control groups
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  714) are pushed back to their soft limits. If the soft limit of each control
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  715) group is very high, they are pushed back as much as possible to make
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  716) sure that one control group does not starve the others of memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  717) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  718) Please note that soft limits is a best-effort feature; it comes with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  719) no guarantees, but it does its best to make sure that when memory is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  720) heavily contended for, memory is allocated based on the soft limit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  721) hints/setup. Currently soft limit based reclaim is set up such that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  722) it gets invoked from balance_pgdat (kswapd).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  723) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  724) 7.1 Interface
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  725) -------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  726) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  727) Soft limits can be setup by using the following commands (in this example we
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  728) assume a soft limit of 256 MiB)::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  729) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  730) 	# echo 256M > memory.soft_limit_in_bytes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  731) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  732) If we want to change this to 1G, we can at any time use::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  733) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  734) 	# echo 1G > memory.soft_limit_in_bytes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  735) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  736) NOTE1:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  737)        Soft limits take effect over a long period of time, since they involve
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  738)        reclaiming memory for balancing between memory cgroups
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  739) NOTE2:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  740)        It is recommended to set the soft limit always below the hard limit,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  741)        otherwise the hard limit will take precedence.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  742) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  743) 8. Move charges at task migration
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  744) =================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  745) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  746) Users can move charges associated with a task along with task migration, that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  747) is, uncharge task's pages from the old cgroup and charge them to the new cgroup.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  748) This feature is not supported in !CONFIG_MMU environments because of lack of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  749) page tables.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  750) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  751) 8.1 Interface
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  752) -------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  753) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  754) This feature is disabled by default. It can be enabled (and disabled again) by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  755) writing to memory.move_charge_at_immigrate of the destination cgroup.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  756) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  757) If you want to enable it::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  758) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  759) 	# echo (some positive value) > memory.move_charge_at_immigrate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  760) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  761) Note:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  762)       Each bits of move_charge_at_immigrate has its own meaning about what type
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  763)       of charges should be moved. See 8.2 for details.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  764) Note:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  765)       Charges are moved only when you move mm->owner, in other words,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  766)       a leader of a thread group.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  767) Note:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  768)       If we cannot find enough space for the task in the destination cgroup, we
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  769)       try to make space by reclaiming memory. Task migration may fail if we
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  770)       cannot make enough space.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  771) Note:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  772)       It can take several seconds if you move charges much.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  773) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  774) And if you want disable it again::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  775) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  776) 	# echo 0 > memory.move_charge_at_immigrate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  777) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  778) 8.2 Type of charges which can be moved
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  779) --------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  780) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  781) Each bit in move_charge_at_immigrate has its own meaning about what type of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  782) charges should be moved. But in any case, it must be noted that an account of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  783) a page or a swap can be moved only when it is charged to the task's current
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  784) (old) memory cgroup.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  785) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  786) +---+--------------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  787) |bit| what type of charges would be moved ?                                    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  788) +===+==========================================================================+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  789) | 0 | A charge of an anonymous page (or swap of it) used by the target task.   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  790) |   | You must enable Swap Extension (see 2.4) to enable move of swap charges. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  791) +---+--------------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  792) | 1 | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  793) |   | and swaps of tmpfs file) mmapped by the target task. Unlike the case of  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  794) |   | anonymous pages, file pages (and swaps) in the range mmapped by the task |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  795) |   | will be moved even if the task hasn't done page fault, i.e. they might   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  796) |   | not be the task's "RSS", but other task's "RSS" that maps the same file. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  797) |   | And mapcount of the page is ignored (the page can be moved even if       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  798) |   | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  799) |   | enable move of swap charges.                                             |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  800) +---+--------------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  801) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  802) 8.3 TODO
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  803) --------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  804) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  805) - All of moving charge operations are done under cgroup_mutex. It's not good
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  806)   behavior to hold the mutex too long, so we may need some trick.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  807) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  808) 9. Memory thresholds
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  809) ====================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  810) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  811) Memory cgroup implements memory thresholds using the cgroups notification
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  812) API (see cgroups.txt). It allows to register multiple memory and memsw
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  813) thresholds and gets notifications when it crosses.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  814) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  815) To register a threshold, an application must:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  816) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  817) - create an eventfd using eventfd(2);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  818) - open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  819) - write string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  820)   cgroup.event_control.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  821) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  822) Application will be notified through eventfd when memory usage crosses
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  823) threshold in any direction.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  824) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  825) It's applicable for root and non-root cgroup.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  826) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  827) 10. OOM Control
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  828) ===============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  829) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  830) memory.oom_control file is for OOM notification and other controls.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  831) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  832) Memory cgroup implements OOM notifier using the cgroup notification
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  833) API (See cgroups.txt). It allows to register multiple OOM notification
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  834) delivery and gets notification when OOM happens.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  835) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  836) To register a notifier, an application must:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  837) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  838)  - create an eventfd using eventfd(2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  839)  - open memory.oom_control file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  840)  - write string like "<event_fd> <fd of memory.oom_control>" to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  841)    cgroup.event_control
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  842) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  843) The application will be notified through eventfd when OOM happens.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  844) OOM notification doesn't work for the root cgroup.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  845) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  846) You can disable the OOM-killer by writing "1" to memory.oom_control file, as:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  847) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  848) 	#echo 1 > memory.oom_control
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  849) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  850) If OOM-killer is disabled, tasks under cgroup will hang/sleep
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  851) in memory cgroup's OOM-waitqueue when they request accountable memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  852) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  853) For running them, you have to relax the memory cgroup's OOM status by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  854) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  855) 	* enlarge limit or reduce usage.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  856) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  857) To reduce usage,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  858) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  859) 	* kill some tasks.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  860) 	* move some tasks to other group with account migration.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  861) 	* remove some files (on tmpfs?)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  862) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  863) Then, stopped tasks will work again.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  864) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  865) At reading, current status of OOM is shown.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  866) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  867) 	- oom_kill_disable 0 or 1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  868) 	  (if 1, oom-killer is disabled)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  869) 	- under_oom	   0 or 1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  870) 	  (if 1, the memory cgroup is under OOM, tasks may be stopped.)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  871) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  872) 11. Memory Pressure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  873) ===================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  874) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  875) The pressure level notifications can be used to monitor the memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  876) allocation cost; based on the pressure, applications can implement
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  877) different strategies of managing their memory resources. The pressure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  878) levels are defined as following:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  879) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  880) The "low" level means that the system is reclaiming memory for new
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  881) allocations. Monitoring this reclaiming activity might be useful for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  882) maintaining cache level. Upon notification, the program (typically
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  883) "Activity Manager") might analyze vmstat and act in advance (i.e.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  884) prematurely shutdown unimportant services).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  885) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  886) The "medium" level means that the system is experiencing medium memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  887) pressure, the system might be making swap, paging out active file caches,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  888) etc. Upon this event applications may decide to further analyze
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  889) vmstat/zoneinfo/memcg or internal memory usage statistics and free any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  890) resources that can be easily reconstructed or re-read from a disk.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  891) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  892) The "critical" level means that the system is actively thrashing, it is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  893) about to out of memory (OOM) or even the in-kernel OOM killer is on its
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  894) way to trigger. Applications should do whatever they can to help the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  895) system. It might be too late to consult with vmstat or any other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  896) statistics, so it's advisable to take an immediate action.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  897) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  898) By default, events are propagated upward until the event is handled, i.e. the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  899) events are not pass-through. For example, you have three cgroups: A->B->C. Now
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  900) you set up an event listener on cgroups A, B and C, and suppose group C
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  901) experiences some pressure. In this situation, only group C will receive the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  902) notification, i.e. groups A and B will not receive it. This is done to avoid
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  903) excessive "broadcasting" of messages, which disturbs the system and which is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  904) especially bad if we are low on memory or thrashing. Group B, will receive
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  905) notification only if there are no event listers for group C.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  906) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  907) There are three optional modes that specify different propagation behavior:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  908) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  909)  - "default": this is the default behavior specified above. This mode is the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  910)    same as omitting the optional mode parameter, preserved by backwards
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  911)    compatibility.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  912) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  913)  - "hierarchy": events always propagate up to the root, similar to the default
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  914)    behavior, except that propagation continues regardless of whether there are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  915)    event listeners at each level, with the "hierarchy" mode. In the above
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  916)    example, groups A, B, and C will receive notification of memory pressure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  917) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  918)  - "local": events are pass-through, i.e. they only receive notifications when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  919)    memory pressure is experienced in the memcg for which the notification is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  920)    registered. In the above example, group C will receive notification if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  921)    registered for "local" notification and the group experiences memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  922)    pressure. However, group B will never receive notification, regardless if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  923)    there is an event listener for group C or not, if group B is registered for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  924)    local notification.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  925) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  926) The level and event notification mode ("hierarchy" or "local", if necessary) are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  927) specified by a comma-delimited string, i.e. "low,hierarchy" specifies
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  928) hierarchical, pass-through, notification for all ancestor memcgs. Notification
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  929) that is the default, non pass-through behavior, does not specify a mode.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  930) "medium,local" specifies pass-through notification for the medium level.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  931) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  932) The file memory.pressure_level is only used to setup an eventfd. To
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  933) register a notification, an application must:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  934) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  935) - create an eventfd using eventfd(2);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  936) - open memory.pressure_level;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  937) - write string as "<event_fd> <fd of memory.pressure_level> <level[,mode]>"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  938)   to cgroup.event_control.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  939) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  940) Application will be notified through eventfd when memory pressure is at
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  941) the specific level (or higher). Read/write operations to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  942) memory.pressure_level are no implemented.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  943) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  944) Test:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  945) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  946)    Here is a small script example that makes a new cgroup, sets up a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  947)    memory limit, sets up a notification in the cgroup and then makes child
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  948)    cgroup experience a critical pressure::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  949) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  950) 	# cd /sys/fs/cgroup/memory/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  951) 	# mkdir foo
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  952) 	# cd foo
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  953) 	# cgroup_event_listener memory.pressure_level low,hierarchy &
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  954) 	# echo 8000000 > memory.limit_in_bytes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  955) 	# echo 8000000 > memory.memsw.limit_in_bytes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  956) 	# echo $$ > tasks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  957) 	# dd if=/dev/zero | read x
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  958) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  959)    (Expect a bunch of notifications, and eventually, the oom-killer will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  960)    trigger.)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  961) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  962) 12. TODO
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  963) ========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  964) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  965) 1. Make per-cgroup scanner reclaim not-shared pages first
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  966) 2. Teach controller to account for shared-pages
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  967) 3. Start reclamation in the background when the limit is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  968)    not yet hit but the usage is getting closer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  969) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  970) Summary
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  971) =======
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  972) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  973) Overall, the memory controller has been a stable controller and has been
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  974) commented and discussed quite extensively in the community.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  975) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  976) References
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  977) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  978) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  979) 1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  980) 2. Singh, Balbir. Memory Controller (RSS Control),
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  981)    http://lwn.net/Articles/222762/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  982) 3. Emelianov, Pavel. Resource controllers based on process cgroups
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  983)    http://lkml.org/lkml/2007/3/6/198
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  984) 4. Emelianov, Pavel. RSS controller based on process cgroups (v2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  985)    http://lkml.org/lkml/2007/4/9/78
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  986) 5. Emelianov, Pavel. RSS controller based on process cgroups (v3)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  987)    http://lkml.org/lkml/2007/5/30/244
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  988) 6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  989) 7. Vaidyanathan, Srinivasan, Control Groups: Pagecache accounting and control
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  990)    subsystem (v3), http://lwn.net/Articles/235534/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  991) 8. Singh, Balbir. RSS controller v2 test results (lmbench),
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  992)    http://lkml.org/lkml/2007/5/17/232
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  993) 9. Singh, Balbir. RSS controller v2 AIM9 results
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  994)    http://lkml.org/lkml/2007/5/18/1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  995) 10. Singh, Balbir. Memory controller v6 test results,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  996)     http://lkml.org/lkml/2007/8/19/36
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  997) 11. Singh, Balbir. Memory controller introduction (v6),
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  998)     http://lkml.org/lkml/2007/8/17/69
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  999) 12. Corbet, Jonathan, Controlling memory use in cgroups,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1000)     http://lwn.net/Articles/243795/