Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   1) .. _cpusets:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   2) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   3) =======
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   4) CPUSETS
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   5) =======
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   6) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   7) Copyright (C) 2004 BULL SA.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   8) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   9) Written by Simon.Derr@bull.net
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  10) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  11) - Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  12) - Modified by Paul Jackson <pj@sgi.com>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  13) - Modified by Christoph Lameter <cl@linux.com>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  14) - Modified by Paul Menage <menage@google.com>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  15) - Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  16) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  17) .. CONTENTS:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  18) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  19)    1. Cpusets
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  20)      1.1 What are cpusets ?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  21)      1.2 Why are cpusets needed ?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  22)      1.3 How are cpusets implemented ?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  23)      1.4 What are exclusive cpusets ?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  24)      1.5 What is memory_pressure ?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  25)      1.6 What is memory spread ?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  26)      1.7 What is sched_load_balance ?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  27)      1.8 What is sched_relax_domain_level ?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  28)      1.9 How do I use cpusets ?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  29)    2. Usage Examples and Syntax
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  30)      2.1 Basic Usage
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  31)      2.2 Adding/removing cpus
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  32)      2.3 Setting flags
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  33)      2.4 Attaching processes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  34)    3. Questions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  35)    4. Contact
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  36) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  37) 1. Cpusets
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  38) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  39) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  40) 1.1 What are cpusets ?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  41) ----------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  42) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  43) Cpusets provide a mechanism for assigning a set of CPUs and Memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  44) Nodes to a set of tasks.  In this document "Memory Node" refers to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  45) an on-line node that contains memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  46) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  47) Cpusets constrain the CPU and Memory placement of tasks to only
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  48) the resources within a task's current cpuset.  They form a nested
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  49) hierarchy visible in a virtual file system.  These are the essential
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  50) hooks, beyond what is already present, required to manage dynamic
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  51) job placement on large systems.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  52) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  53) Cpusets use the generic cgroup subsystem described in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  54) Documentation/admin-guide/cgroup-v1/cgroups.rst.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  55) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  56) Requests by a task, using the sched_setaffinity(2) system call to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  57) include CPUs in its CPU affinity mask, and using the mbind(2) and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  58) set_mempolicy(2) system calls to include Memory Nodes in its memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  59) policy, are both filtered through that task's cpuset, filtering out any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  60) CPUs or Memory Nodes not in that cpuset.  The scheduler will not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  61) schedule a task on a CPU that is not allowed in its cpus_allowed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  62) vector, and the kernel page allocator will not allocate a page on a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  63) node that is not allowed in the requesting task's mems_allowed vector.
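
The filtering described above amounts to a set intersection.  The
following Python sketch models it; it is an illustration, not kernel
code.  The helper name filter_by_cpuset is hypothetical, and the error
raised for an empty result only stands in for the EINVAL that
sched_setaffinity(2) returns when no requested CPU is permitted.

```python
import os


def filter_by_cpuset(requested, allowed):
    """Model of the cpuset filter: keep only CPUs (or nodes) that are allowed.

    `requested` is what the task asks for, `allowed` is what its cpuset
    contains.  Both are iterables of integer CPU or node numbers.
    """
    effective = set(requested) & set(allowed)
    if not effective:
        # Stand-in for the system call failing when nothing usable remains.
        raise ValueError("no requested CPU or node is allowed by the cpuset")
    return effective


if hasattr(os, "sched_getaffinity"):
    # On Linux, the affinity mask reported for the current task is
    # already limited to the CPUs of its cpuset.
    print(sorted(os.sched_getaffinity(0)))
```

For example, a task in a cpuset holding CPUs 2-5 that requests CPUs 0-3
would end up with CPUs 2 and 3 in this model.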
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  64) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  65) User level code may create and destroy cpusets by name in the cgroup
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  66) virtual file system, manage the attributes and permissions of these
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  67) cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  68) specify and query to which cpuset a task is assigned, and list the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  69) task pids assigned to a cpuset.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  70) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  71) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  72) 1.2 Why are cpusets needed ?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  73) ----------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  74) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  75) The management of large computer systems, with many processors (CPUs),
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  76) complex memory cache hierarchies and multiple Memory Nodes having
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  77) non-uniform access times (NUMA) presents additional challenges for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  78) the efficient scheduling and memory placement of processes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  79) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  80) Frequently more modest sized systems can be operated with adequate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  81) efficiency just by letting the operating system automatically share
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  82) the available CPU and Memory resources amongst the requesting tasks.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  83) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  84) But larger systems, which benefit more from careful processor and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  85) memory placement to reduce memory access times and contention,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  86) and which typically represent a larger investment for the customer,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  87) can benefit from explicitly placing jobs on properly sized subsets of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  88) the system.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  89) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  90) This can be especially valuable on:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  91) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  92)     * Web Servers running multiple instances of the same web application,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  93)     * Servers running different applications (for instance, a web server
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  94)       and a database), or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  95)     * NUMA systems running large HPC applications with demanding
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  96)       performance characteristics.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  97) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  98) These subsets, or "soft partitions" must be able to be dynamically
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  99) adjusted, as the job mix changes, without impacting other concurrently
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) executing jobs. The location of the running jobs' pages may also be moved
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) when the memory locations are changed.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) The kernel cpuset patch provides the minimum essential kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) mechanisms required to efficiently implement such subsets.  It
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) leverages existing CPU and Memory Placement facilities in the Linux
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) kernel to avoid any additional impact on the critical scheduler or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) memory allocator code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) 1.3 How are cpusets implemented ?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) ---------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) Cpusets provide a Linux kernel mechanism to constrain which CPUs and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) Memory Nodes are used by a process or set of processes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) The Linux kernel already has a pair of mechanisms to specify on which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) CPUs a task may be scheduled (sched_setaffinity) and on which Memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) Nodes it may obtain memory (mbind, set_mempolicy).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) Cpusets extend these two mechanisms as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122)  - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123)    kernel.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124)  - Each task in the system is attached to a cpuset, via a pointer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125)    in the task structure to a reference counted cgroup structure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126)  - Calls to sched_setaffinity are filtered to just those CPUs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127)    allowed in that task's cpuset.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128)  - Calls to mbind and set_mempolicy are filtered to just
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129)    those Memory Nodes allowed in that task's cpuset.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130)  - The root cpuset contains all the system's CPUs and Memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131)    Nodes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132)  - For any cpuset, one can define child cpusets containing a subset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133)    of the parent's CPU and Memory Node resources.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134)  - The hierarchy of cpusets can be mounted at /dev/cpuset, for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135)    browsing and manipulation from user space.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136)  - A cpuset may be marked exclusive, which ensures that no other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137)    cpuset (except direct ancestors and descendants) may contain
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138)    any overlapping CPUs or Memory Nodes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139)  - You can list all the tasks (by pid) attached to any cpuset.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) The implementation of cpusets requires a few, simple hooks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) into the rest of the kernel, none in performance critical paths:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144)  - in init/main.c, to initialize the root cpuset at system boot.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145)  - in fork and exit, to attach and detach a task from its cpuset.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146)  - in sched_setaffinity, to mask the requested CPUs by what's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147)    allowed in that task's cpuset.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148)  - in sched.c migrate_live_tasks(), to keep migrating tasks within
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149)    the CPUs allowed by their cpuset, if possible.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150)  - in the mbind and set_mempolicy system calls, to mask the requested
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151)    Memory Nodes by what's allowed in that task's cpuset.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152)  - in page_alloc.c, to restrict memory to allowed nodes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153)  - in vmscan.c, to restrict page recovery to the current cpuset.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) You should mount the "cgroup" filesystem type in order to enable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) browsing and modifying the cpusets presently known to the kernel.  No
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) new system calls are added for cpusets - all support for querying and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158) modifying cpusets is via this cpuset file system.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) The /proc/<pid>/status file for each task has four added lines,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) displaying the task's cpus_allowed (on which CPUs it may be scheduled)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162) and mems_allowed (on which Memory Nodes it may obtain memory),
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) in the two formats seen in the following example::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165)   Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166)   Cpus_allowed_list:      0-127
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167)   Mems_allowed:   ffffffff,ffffffff
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168)   Mems_allowed_list:      0-63
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) 
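The two formats in the example above carry the same information.  The
following Python sketch converts each into a set of CPU or node numbers;
the function names parse_mask and parse_list are hypothetical, and the
mask parser assumes the comma-separated hexadecimal words are listed
most significant first, as in the example.

```python
def parse_mask(mask):
    """Parse a hex mask such as 'ffffffff,ffffffff' into a set of indices."""
    # Dropping the commas yields one hexadecimal number; bit N set means
    # that CPU (or node) N is allowed.
    value = int(mask.strip().replace(",", ""), 16)
    return {bit for bit in range(value.bit_length()) if (value >> bit) & 1}


def parse_list(text):
    """Parse a range list such as '0-3,8' into a set of indices."""
    result = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        # A single number like '8' has no '-', so it is its own range end.
        result.update(range(int(first), int(last or first) + 1))
    return result
```

With these helpers, parse_mask("ffffffff,ffffffff") and
parse_list("0-63") describe the same 64 Memory Nodes.
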
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) Each cpuset is represented by a directory in the cgroup file system
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171) containing (on top of the standard cgroup files) the following
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) files describing that cpuset:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174)  - cpuset.cpus: list of CPUs in that cpuset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175)  - cpuset.mems: list of Memory Nodes in that cpuset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176)  - cpuset.memory_migrate flag: if set, move pages to cpuset's nodes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177)  - cpuset.cpu_exclusive flag: is cpu placement exclusive?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178)  - cpuset.mem_exclusive flag: is memory placement exclusive?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179)  - cpuset.mem_hardwall flag:  is memory allocation hardwalled?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180)  - cpuset.memory_pressure: measure of how much paging pressure in cpuset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181)  - cpuset.memory_spread_page flag: if set, spread page cache evenly on allowed nodes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182)  - cpuset.memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183)  - cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184)  - cpuset.sched_relax_domain_level: the searching range when migrating tasks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186) In addition, only the root cpuset has the following file:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188)  - cpuset.memory_pressure_enabled flag: compute memory_pressure?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190) New cpusets are created using the mkdir system call or shell
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191) command.  The properties of a cpuset, such as its flags, allowed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192) CPUs and Memory Nodes, and attached tasks, are modified by writing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193) to the appropriate file in that cpuset's directory, as listed above.
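
As a rough illustration of the paragraph above, the following Python
sketch creates a cpuset with mkdir and then writes its CPUs and Memory
Nodes.  It assumes the hierarchy is mounted at /dev/cpuset and that the
caller has permission to modify it; the function name create_cpuset and
the cpuset name "job1" are hypothetical.

```python
import os


def create_cpuset(root, name, cpus, mems):
    """Create a child cpuset under `root` and assign its CPUs and Memory Nodes.

    `cpus` and `mems` are list-format strings such as "2-3" or "0".
    Returns the path of the new cpuset directory.
    """
    path = os.path.join(root, name)
    os.mkdir(path)  # creating the directory creates the cpuset
    for filename, value in (("cpuset.cpus", cpus), ("cpuset.mems", mems)):
        # Each property is changed by writing to its file in the directory.
        with open(os.path.join(path, filename), "w") as f:
            f.write(value + "\n")
    return path
```

A call such as create_cpuset("/dev/cpuset", "job1", "2-3", "0") would,
under these assumptions, give the new cpuset CPUs 2 and 3 and Memory
Node 0.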
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195) The named hierarchical structure of nested cpusets allows partitioning
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196) a large system into nested, dynamically changeable, "soft-partitions".
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198) The attachment of each task, automatically inherited at fork by any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199) children of that task, to a cpuset allows organizing the work load
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200) on a system into related sets of tasks such that each set is constrained
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201) to using the CPUs and Memory Nodes of a particular cpuset.  A task
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202) may be re-attached to any other cpuset, if allowed by the permissions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203) on the necessary cpuset file system directories.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205) Such management of a system "in the large" integrates smoothly with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 206) the detailed placement done on individual tasks and memory regions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 207) using the sched_setaffinity, mbind and set_mempolicy system calls.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 208) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 209) The following rules apply to each cpuset:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 210) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 211)  - Its CPUs and Memory Nodes must be a subset of its parent's.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 212)  - It can't be marked exclusive unless its parent is.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 213)  - If its cpu or memory is exclusive, they may not overlap any sibling.
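
The three rules above can be expressed as a small consistency check.
The following Python sketch is a user-space model only, not the kernel's
validation code.  The Cpuset class and rule_violations function are
hypothetical, and the sketch assumes that the sibling rule applies
whenever either of the two siblings is exclusive.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass(eq=False)
class Cpuset:
    cpus: Set[int]
    mems: Set[int]
    cpu_exclusive: bool = False
    mem_exclusive: bool = False
    parent: Optional["Cpuset"] = None
    children: List["Cpuset"] = field(default_factory=list)

    def add_child(self, child):
        """Link `child` below this cpuset and return it."""
        child.parent = self
        self.children.append(child)
        return child


def rule_violations(cs):
    """Return the names of the rules that `cs` breaks (empty list if none)."""
    parent = cs.parent
    if parent is None:  # the root cpuset has no parent to compare against
        return []
    broken = []
    # Rule 1: CPUs and Memory Nodes must be a subset of the parent's.
    if not (cs.cpus <= parent.cpus and cs.mems <= parent.mems):
        broken.append("subset")
    # Rule 2: a cpuset can't be exclusive unless its parent is.
    if (cs.cpu_exclusive and not parent.cpu_exclusive) or (
        cs.mem_exclusive and not parent.mem_exclusive
    ):
        broken.append("exclusive-parent")
    # Rule 3: exclusive resources may not overlap any sibling.
    for sib in parent.children:
        if sib is cs:
            continue
        cpu_clash = (cs.cpu_exclusive or sib.cpu_exclusive) and cs.cpus & sib.cpus
        mem_clash = (cs.mem_exclusive or sib.mem_exclusive) and cs.mems & sib.mems
        if cpu_clash or mem_clash:
            broken.append("sibling-overlap")
            break
    return broken
```

Only the parent and the siblings of the changed cpuset are inspected,
which mirrors the point that the hierarchy avoids a scan of all cpusets.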
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 214) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 215) These rules, and the natural hierarchy of cpusets, enable efficient
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 216) enforcement of the exclusive guarantee, without having to scan all
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 217) cpusets every time any of them change to ensure nothing overlaps an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 218) exclusive cpuset.  Also, the use of a Linux virtual file system (vfs)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 219) to represent the cpuset hierarchy provides for a familiar permission
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 220) and name space for cpusets, with a minimum of additional kernel code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 221) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 222) The cpus and mems files in the root (top_cpuset) cpuset are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 223) read-only.  The cpus file automatically tracks the value of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 224) cpu_online_mask using a CPU hotplug notifier, and the mems file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 225) automatically tracks the value of node_states[N_MEMORY]--i.e.,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 226) nodes with memory--using the cpuset_track_online_nodes() hook.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 227) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 228) The cpuset.effective_cpus and cpuset.effective_mems files are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 229) normally read-only copies of cpuset.cpus and cpuset.mems files
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 230) respectively.  If the cpuset cgroup filesystem is mounted with the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 231) special "cpuset_v2_mode" option, the behavior of these files will become
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 232) similar to the corresponding files in cpuset v2.  In other words, hotplug
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 233) events will not change cpuset.cpus and cpuset.mems.  Those events will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 234) only affect cpuset.effective_cpus and cpuset.effective_mems which show
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 235) the actual cpus and memory nodes that are currently used by this cpuset.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 236) See Documentation/admin-guide/cgroup-v2.rst for more information about
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 237) cpuset v2 behavior.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 238) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 239) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 240) 1.4 What are exclusive cpusets ?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 241) --------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 242) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 243) If a cpuset is cpu or mem exclusive, no other cpuset, other than
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 244) a direct ancestor or descendant, may share any of the same CPUs or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 245) Memory Nodes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 246) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 247) A cpuset that is cpuset.mem_exclusive *or* cpuset.mem_hardwall is "hardwalled",
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 248) i.e. it restricts kernel allocations for page, buffer and other data
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 249) commonly shared by the kernel across multiple users.  All cpusets,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 250) whether hardwalled or not, restrict allocations of memory for user
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 251) space.  This enables configuring a system so that several independent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 252) jobs can share common kernel data, such as file system pages, while
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 253) isolating each job's user allocation in its own cpuset.  To do this,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 254) construct a large mem_exclusive cpuset to hold all the jobs, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 255) construct child, non-mem_exclusive cpusets for each individual job.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 256) Only a small amount of typical kernel memory, such as requests from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 257) interrupt handlers, is allowed to be taken outside even a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 258) mem_exclusive cpuset.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 259) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 260) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 261) 1.5 What is memory_pressure ?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 262) -----------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 263) The memory_pressure of a cpuset provides a simple per-cpuset metric
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 264) of the rate at which the tasks in a cpuset are attempting to free up
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 265) in-use memory on the nodes of the cpuset to satisfy additional memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 266) requests.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 267) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 268) This enables batch managers monitoring jobs running in dedicated
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 269) cpusets to efficiently detect what level of memory pressure that job
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 270) is causing.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 271) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 272) This is useful both on tightly managed systems running a wide mix of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 273) submitted jobs, which may choose to terminate or re-prioritize jobs that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 274) are trying to use more memory than allowed on the nodes assigned to them,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 275) and with tightly coupled, long running, massively parallel scientific
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 276) computing jobs that will dramatically fail to meet required performance
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 277) goals if they start to use more memory than allowed to them.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 278) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 279) This mechanism provides a very economical way for the batch manager
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 280) to monitor a cpuset for signs of memory pressure.  It's up to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 281) batch manager or other user code to decide what to do about it and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 282) take action.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 283) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 284) ==>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 285)     Unless this feature is enabled by writing "1" to the special file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 286)     /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 287)     code of __alloc_pages() for this metric reduces to simply noticing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 288)     that the cpuset_memory_pressure_enabled flag is zero.  So only
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 289)     systems that enable this feature will compute the metric.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 290) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 291) Why a per-cpuset, running average:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 292) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 293)     Because this meter is per-cpuset, rather than per-task or mm,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 294)     the system load imposed by a batch scheduler monitoring this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 295)     metric is sharply reduced on large systems, because a scan of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 296)     the tasklist can be avoided on each set of queries.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 297) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 298)     Because this meter is a running average, instead of an accumulating
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 299)     counter, a batch scheduler can detect memory pressure with a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 300)     single read, instead of having to read and accumulate results
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 301)     for a period of time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 302) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 303)     Because this meter is per-cpuset rather than per-task or mm,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 304)     the batch scheduler can obtain the key information, memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 305)     pressure in a cpuset, with a single read, rather than having to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 306)     query and accumulate results over all the (dynamically changing)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 307)     set of tasks in the cpuset.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 308) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 309) A per-cpuset simple digital filter (requires a spinlock and 3 words
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 310) of data per-cpuset) is kept, and updated by any task attached to that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 311) cpuset, if it enters the synchronous (direct) page reclaim code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 312) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 313) A per-cpuset file provides an integer number representing the recent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 314) (half-life of 10 seconds) rate of direct page reclaims caused by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 315) the tasks in the cpuset, in units of reclaims attempted per second,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 316) times 1000.
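
To make the units concrete, the following Python sketch models a running
average with a 10 second half-life that reports reclaims per second
times 1000.  It is a floating-point model advanced in one-second steps,
chosen for illustration; it is not the kernel's filter, and the class
name PressureMeter is hypothetical.

```python
class PressureMeter:
    """Running average of events per second, scaled by 1000."""

    def __init__(self, half_life=10.0, scale=1000):
        # Per-second decay factor chosen so that the value halves after
        # `half_life` seconds without events (about 0.933 for 10 seconds).
        self.decay = 0.5 ** (1.0 / half_life)
        self.scale = scale
        self.value = 0.0

    def tick(self, events):
        """Advance one second in which `events` direct reclaims occurred."""
        self.value = (
            self.value * self.decay
            + (1.0 - self.decay) * events * self.scale
        )
        return self.value

    def read(self):
        """Return the metric as an integer, as a per-cpuset file would."""
        return int(self.value)
```

In this model a steady rate of 5 reclaims per second converges toward a
reading of 5000, and the reading halves after 10 seconds with no further
reclaims.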
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 317) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 318) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 319) 1.6 What is memory spread ?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 320) ---------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 321) There are two boolean flag files per cpuset that control where the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 322) kernel allocates pages for the file system buffers and related
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 323) in-kernel data structures.  They are called 'cpuset.memory_spread_page' and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 324) 'cpuset.memory_spread_slab'.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 325) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 326) If the per-cpuset boolean flag file 'cpuset.memory_spread_page' is set, then
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 327) the kernel will spread the file system buffers (page cache) evenly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 328) over all the nodes that the faulting task is allowed to use, instead
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 329) of preferring to put those pages on the node where the task is running.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 330) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 331) If the per-cpuset boolean flag file 'cpuset.memory_spread_slab' is set,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 332) then the kernel will spread some file system related slab caches,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 333) such as for inodes and dentries, evenly over all the nodes that the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 334) faulting task is allowed to use, instead of preferring to put those
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 335) pages on the node where the task is running.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 336) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 337) The setting of these flags does not affect anonymous data segment or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 338) stack segment pages of a task.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 339) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 340) By default, both kinds of memory spreading are off, and memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 341) pages are allocated on the node local to where the task is running,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 342) except perhaps as modified by the task's NUMA mempolicy or cpuset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 343) configuration, so long as sufficient free memory pages are available.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 344) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 345) When new cpusets are created, they inherit the memory spread settings
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 346) of their parent.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 347) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 348) Setting memory spreading causes allocations for the affected page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 349) or slab caches to ignore the task's NUMA mempolicy and be spread
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 350) instead.  Tasks using mbind() or set_mempolicy() calls to set NUMA
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 351) mempolicies will not notice any change in these calls as a result of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 352) their containing task's memory spread settings.  If memory spreading
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 353) is turned off, then the currently specified NUMA mempolicy once again
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 354) applies to memory page allocations.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 355) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 356) Both 'cpuset.memory_spread_page' and 'cpuset.memory_spread_slab' are boolean flag
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 357) files.  By default they contain "0", meaning that the feature is off
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 358) for that cpuset.  If a "1" is written to that file, then that turns
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 359) the named feature on.
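
The flag files described above can be driven from any program that can
write a file.  The following is a minimal Python sketch, not part of the
kernel sources.  The helper names set_flag() and get_flag() are
hypothetical, and the demonstration runs against a scratch directory
rather than a real cpuset mount, so the directory path is an assumption.

```python
import os
import tempfile


def set_flag(cpuset_dir: str, flag: str, enabled: bool) -> None:
    """Write "1" or "0" to a boolean cpuset flag file (hypothetical helper)."""
    with open(os.path.join(cpuset_dir, flag), "w") as f:
        f.write("1" if enabled else "0")


def get_flag(cpuset_dir: str, flag: str) -> bool:
    """Return True if the flag file currently contains "1"."""
    with open(os.path.join(cpuset_dir, flag)) as f:
        return f.read().strip() == "1"


# Scratch directory standing in for a cpuset directory (assumption).
demo_dir = tempfile.mkdtemp()
set_flag(demo_dir, "cpuset.memory_spread_page", True)
set_flag(demo_dir, "cpuset.memory_spread_slab", False)
print(get_flag(demo_dir, "cpuset.memory_spread_page"))
print(get_flag(demo_dir, "cpuset.memory_spread_slab"))
```

On a real system the directory would be the cpuset's directory in the
mounted cpuset hierarchy, and writing to it normally requires root.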
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 360) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 361) The implementation is simple.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 362) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 363) Setting the flag 'cpuset.memory_spread_page' turns on a per-process flag
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 364) PFA_SPREAD_PAGE for each task that is in that cpuset or subsequently
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 365) joins that cpuset.  The page allocation calls for the page cache
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 366) are modified to perform an inline check for this PFA_SPREAD_PAGE task
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 367) flag, and if set, a call to a new routine cpuset_mem_spread_node()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 368) returns the node to prefer for the allocation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 369) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 370) Similarly, setting 'cpuset.memory_spread_slab' turns on the flag
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 371) PFA_SPREAD_SLAB, and appropriately marked slab caches will allocate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 372) pages from the node returned by cpuset_mem_spread_node().
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 373) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 374) The cpuset_mem_spread_node() routine is also simple.  It uses the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 375) value of a per-task rotor cpuset_mem_spread_rotor to select the next
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 376) node in the current task's mems_allowed to prefer for the allocation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 377) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 378) This memory placement policy is also known (in other contexts) as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 379) round-robin or interleave.
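
The rotor behaviour can be modelled in user space.  The sketch below is
an illustration in Python, not the kernel code: the Task class, its
field names, and the initial rotor value of -1 are assumptions made for
the example.  It only shows how a per-task rotor walks the allowed nodes
in round-robin order when the spread flag is set.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    """User-space stand-in for the per-task state described in the text."""
    mems_allowed: List[int]   # node IDs the task may allocate from
    spread_page: bool = False  # models the PFA_SPREAD_PAGE task flag
    rotor: int = -1            # models cpuset_mem_spread_rotor (start value assumed)
    local_node: int = 0        # node the task is currently running on


def mem_spread_node(task: Task) -> int:
    """Advance the rotor and return the next allowed node, wrapping around."""
    nodes = sorted(task.mems_allowed)
    later = [n for n in nodes if n > task.rotor]
    task.rotor = later[0] if later else nodes[0]
    return task.rotor


def page_cache_node(task: Task) -> int:
    """Node preferred for a page cache allocation made by this task."""
    if task.spread_page:
        return mem_spread_node(task)
    return task.local_node  # default: prefer the node local to the task


task = Task(mems_allowed=[0, 2, 3], spread_page=True)
picks = [page_cache_node(task) for _ in range(6)]
print(picks)
```

With the flag clear, page_cache_node() always returns the local node,
which matches the default placement described earlier.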
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 380) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 381) This policy can provide substantial improvements for jobs that need
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 382) to place thread local data on the corresponding node, but that need
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 383) to access large file system data sets that need to be spread across
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 384) the several nodes in the job's cpuset in order to fit.  Without this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 385) policy, especially for jobs that might have one thread reading in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 386) data set, the memory allocation across the nodes in the job's cpuset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 387) can become very uneven.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 388) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 389) 1.7 What is sched_load_balance ?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 390) --------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 391) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 392) The kernel scheduler (kernel/sched/core.c) automatically load balances
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 393) tasks.  If one CPU is underutilized, kernel code running on that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 394) CPU will look for tasks on other more overloaded CPUs and move those
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 395) tasks to itself, within the constraints of such placement mechanisms
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 396) as cpusets and sched_setaffinity.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 397) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 398) The algorithmic cost of load balancing and its impact on key shared
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 399) kernel data structures such as the task list increases more than
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 400) linearly with the number of CPUs being balanced.  So the scheduler
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 401) has support to partition the system's CPUs into a number of sched
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 402) domains such that it only load balances within each sched domain.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 403) Each sched domain covers some subset of the CPUs in the system;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 404) no two sched domains overlap; some CPUs might not be in any sched
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 405) domain and hence won't be load balanced.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 406) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 407) Put simply, it costs less to balance between two smaller sched domains
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 408) than one big one, but doing so means that overloads in one of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 409) two domains won't be load balanced to the other one.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 410) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 411) By default, there is one sched domain covering all CPUs, including those
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 412) marked isolated using the kernel boot time "isolcpus=" argument. However,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 413) the isolated CPUs will not participate in load balancing, and will not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 414) have tasks running on them unless explicitly assigned.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 415) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 416) This default load balancing across all CPUs is not well suited for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 417) the following two situations:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 418) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 419)  1) On large systems, load balancing across many CPUs is expensive.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 420)     If the system is managed using cpusets to place independent jobs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 421)     on separate sets of CPUs, full load balancing is unnecessary.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 422)  2) Systems supporting realtime on some CPUs need to minimize
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 423)     system overhead on those CPUs, including avoiding task load
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 424)     balancing if that is not needed.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 425) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 426) When the per-cpuset flag "cpuset.sched_load_balance" is enabled (the default
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 427) setting), it requests that all the CPUs in that cpuset's allowed 'cpuset.cpus'
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 428) be contained in a single sched domain, ensuring that load balancing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 429) can move a task (not otherwise pinned, as by sched_setaffinity)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 430) from any CPU in that cpuset to any other.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 431) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 432) When the per-cpuset flag "cpuset.sched_load_balance" is disabled, then the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 433) scheduler will avoid load balancing across the CPUs in that cpuset,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 434) --except-- in so far as is necessary because some overlapping cpuset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 435) has "cpuset.sched_load_balance" enabled.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 436) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 437) So, for example, if the top cpuset has the flag "cpuset.sched_load_balance"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 438) enabled, then the scheduler will have one sched domain covering all
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 439) CPUs, and the setting of the "cpuset.sched_load_balance" flag in any other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 440) cpusets won't matter, as we're already fully load balancing.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 441) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 442) Therefore in the above two situations, the top cpuset flag
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 443) "cpuset.sched_load_balance" should be disabled, and only some of the smaller,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 444) child cpusets have this flag enabled.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 445) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 446) When doing this, you don't usually want to leave any unpinned tasks in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 447) the top cpuset that might use non-trivial amounts of CPU, as such tasks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 448) may be artificially constrained to some subset of CPUs, depending on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 449) the particulars of this flag setting in descendant cpusets.  Even if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 450) such a task could use spare CPU cycles in some other CPUs, the kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 451) scheduler might not consider the possibility of load balancing that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 452) task to that underused CPU.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 453) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 454) Of course, tasks pinned to a particular CPU can be left in a cpuset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 455) that disables "cpuset.sched_load_balance" as those tasks aren't going anywhere
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 456) else anyway.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 457) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 458) There is an impedance mismatch here, between cpusets and sched domains.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 459) Cpusets are hierarchical and nest.  Sched domains are flat; they don't
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 460) overlap and each CPU is in at most one sched domain.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 461) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 462) It is necessary for sched domains to be flat because load balancing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 463) across partially overlapping sets of CPUs would risk unstable dynamics
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 464) that would be beyond our understanding.  So if each of two partially
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 465) overlapping cpusets enables the flag 'cpuset.sched_load_balance', then we
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 466) form a single sched domain that is a superset of both.  We won't move
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 467) a task to a CPU outside its cpuset, but the scheduler load balancing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 468) code might waste some compute cycles considering that possibility.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 469) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 470) This mismatch is why there is not a simple one-to-one relation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 471) between which cpusets have the flag "cpuset.sched_load_balance" enabled,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 472) and the sched domain configuration.  If a cpuset enables the flag, it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 473) will get balancing across all its CPUs, but if it disables the flag,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 474) it will only be assured of no load balancing if no other overlapping
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 475) cpuset enables the flag.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 476) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 477) If two cpusets have partially overlapping 'cpuset.cpus' allowed, and only
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 478) one of them has this flag enabled, then the other may find its
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 479) tasks only partially load balanced, just on the overlapping CPUs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 480) This is just the general case of the top_cpuset example given a few
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 481) paragraphs above.  In the general case, as in the top cpuset case,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 482) don't leave tasks that might use non-trivial amounts of CPU in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 483) such partially load balanced cpusets, as they may be artificially
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 484) constrained to some subset of the CPUs allowed to them, for lack of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 485) load balancing to the other CPUs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 486) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 487) CPUs in "cpuset.isolcpus" were excluded from load balancing by the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 488) isolcpus= kernel boot option, and will never be load balanced regardless
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 489) of the value of "cpuset.sched_load_balance" in any cpuset.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 490) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 491) 1.7.1 sched_load_balance implementation details.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 492) ------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 493) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 494) The per-cpuset flag 'cpuset.sched_load_balance' defaults to enabled (contrary
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 495) to most cpuset flags.)  When enabled for a cpuset, the kernel will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 496) ensure that it can load balance across all the CPUs in that cpuset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 497) (makes sure that all the CPUs in the cpus_allowed of that cpuset are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 498) in the same sched domain.)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 499) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 500) If two overlapping cpusets both have 'cpuset.sched_load_balance' enabled,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 501) then they will be (must be) both in the same sched domain.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 502) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 503) If, as is the default, the top cpuset has 'cpuset.sched_load_balance' enabled,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 504) then by the above that means there is a single sched domain covering
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 505) the whole system, regardless of any other cpuset settings.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 506) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 507) The kernel commits to user space that it will avoid load balancing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 508) where it can.  It will pick as fine a granularity partition of sched
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 509) domains as it can while still providing load balancing for any set
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 510) of CPUs allowed to a cpuset having 'cpuset.sched_load_balance' enabled.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 511) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 512) The internal kernel cpuset to scheduler interface passes from the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 513) cpuset code to the scheduler code a partition of the load balanced
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 514) CPUs in the system. This partition is a set of subsets (represented
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 515) as an array of struct cpumask) of CPUs, pairwise disjoint, that cover
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 516) all the CPUs that must be load balanced.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 517) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 518) The cpuset code builds a new such partition and passes it to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 519) scheduler sched domain setup code, to have the sched domains rebuilt
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 520) as necessary, whenever:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 521) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 522)  - the 'cpuset.sched_load_balance' flag of a cpuset with non-empty CPUs changes,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 523)  - or CPUs come or go from a cpuset with this flag enabled,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 524)  - or 'cpuset.sched_relax_domain_level' value of a cpuset with non-empty CPUs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 525)    and with this flag enabled changes,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 526)  - or a cpuset with non-empty CPUs and with this flag enabled is removed,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 527)  - or a cpu is offlined/onlined.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 528) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 529) This partition exactly defines what sched domains the scheduler should
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 530) set up - one sched domain for each element (struct cpumask) in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 531) partition.
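
How overlapping cpusets collapse into a flat partition can be sketched
in a few lines of Python.  This is a user-space model under stated
assumptions, not the kernel algorithm: the function name
build_partition() and the dictionary layout (cpuset name mapped to a
pair of CPU set and load balance flag) are invented for the example.
It merges the CPUs of every load-balanced cpuset that shares a CPU with
another, which yields pairwise disjoint subsets.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def build_partition(
    cpusets: Dict[str, Tuple[Set[int], bool]]
) -> List[FrozenSet[int]]:
    """Merge the CPUs of load-balanced cpusets into disjoint groups."""
    domains: List[Set[int]] = []
    for cpus, load_balance in cpusets.values():
        if not load_balance or not cpus:
            continue  # this cpuset makes no request for balancing
        merged = set(cpus)
        untouched = []
        for dom in domains:
            if dom & merged:
                merged |= dom  # overlap: must share one sched domain
            else:
                untouched.append(dom)
        untouched.append(merged)
        domains = untouched
    return sorted((frozenset(d) for d in domains), key=min)


example = {
    "top": ({0, 1, 2, 3, 4, 5, 6, 7}, False),  # balancing disabled at the top
    "a": ({0, 1, 2, 3}, True),
    "b": ({2, 3, 4}, True),                    # partially overlaps "a"
    "c": ({6, 7}, True),
    "d": ({5}, False),
}
partition = build_partition(example)
print([sorted(d) for d in partition])
```

In this example cpusets "a" and "b" end up in one domain that is a
superset of both, and CPU 5 is in no domain at all.  Enabling the flag
on "top" would instead produce a single domain covering CPUs 0-7.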
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 532) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 533) The scheduler remembers the currently active sched domain partitions.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 534) When the scheduler routine partition_sched_domains() is invoked from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 535) the cpuset code to update these sched domains, it compares the new
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 536) partition requested with the current, and updates its sched domains,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 537) removing the old and adding the new, for each change.
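
The comparison step can be pictured as a set difference between the old
and new partitions.  The Python sketch below is only a model of that
idea; diff_partitions() is a hypothetical name and does not reproduce
the internals of partition_sched_domains().

```python
from typing import FrozenSet, Iterable, List, Tuple

Domain = FrozenSet[int]


def diff_partitions(
    current: Iterable[Domain], new: Iterable[Domain]
) -> Tuple[List[Domain], List[Domain]]:
    """Return (domains to remove, domains to add); unchanged ones are kept."""
    cur, nxt = set(current), set(new)
    to_remove = sorted(cur - nxt, key=min)
    to_add = sorted(nxt - cur, key=min)
    return to_remove, to_add


current = [frozenset({0, 1}), frozenset({2, 3}), frozenset({4, 5})]
new = [frozenset({0, 1}), frozenset({2, 3, 4, 5})]
to_remove, to_add = diff_partitions(current, new)
print([sorted(d) for d in to_remove], [sorted(d) for d in to_add])
```

The domain covering CPUs 0-1 appears in both partitions, so in this
model it is left alone while the other two are replaced by one merged
domain.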
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 538) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 539) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 540) 1.8 What is sched_relax_domain_level ?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 541) --------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 542) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 543) Within a sched domain, the scheduler migrates tasks in two ways: periodic load
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 544) balancing on each tick, and at the time of certain scheduling events.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 545) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 546) When a task is woken up, the scheduler tries to move the task to an idle CPU.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 547) For example, if a task A running on CPU X activates another task B
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 548) on the same CPU X, and if CPU Y is X's sibling and is idle,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 549) then the scheduler migrates task B to CPU Y so that task B can start on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 550) CPU Y without waiting for task A on CPU X.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 551) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 552) And if a CPU runs out of tasks in its runqueue, the CPU tries to pull
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 553) extra tasks from other busy CPUs to help them before it goes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 554) idle.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 555) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 556) Of course, finding movable tasks and/or idle CPUs has a search cost,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 557) so the scheduler might not search all CPUs in the domain
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 558) every time.  In fact, on some architectures, the search range on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 559) events is limited to the same socket or node where the CPU is located,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 560) while the load balance on tick searches all.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 561) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 562) For example, assume CPU Z is relatively far from CPU X.  Even if CPU Z
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 563) is idle while CPU X and its siblings are busy, the scheduler can't migrate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 564) woken task B from X to Z since Z is outside its search range.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 565) As a result, task B on CPU X needs to wait for task A or for the load balance
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 566) on the next tick.  For some applications in special situations, waiting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 567) one tick may be too long.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 568) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 569) The 'cpuset.sched_relax_domain_level' file allows you to request a change to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 570) this search range.  The file takes an integer value that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 571) indicates the size of the search range in levels, ideally as follows;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 572) the initial value of -1 indicates that the cpuset has no request.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 573) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 574) ====== ===========================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 575)   -1   no request. use system default or follow request of others.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 576)    0   no search.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 577)    1   search siblings (hyperthreads in a core).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 578)    2   search cores in a package.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 579)    3   search cpus in a node [= system wide on non-NUMA system]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 580)    4   search nodes in a chunk of node [on NUMA system]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 581)    5   search system wide [on NUMA system]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 582) ====== ===========================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 583) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 584) The system default is architecture dependent.  The system default
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 585) can be changed using the relax_domain_level= boot parameter.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 586) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 587) This file is per-cpuset and affects the sched domain to which the cpuset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 588) belongs.  Therefore if the flag 'cpuset.sched_load_balance' of a cpuset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 589) is disabled, then 'cpuset.sched_relax_domain_level' has no effect since
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 590) there is no sched domain belonging to the cpuset.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 591) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 592) If multiple cpusets are overlapping and hence they form a single sched
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 593) domain, the largest value among those is used.  Be careful, if one
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 594) requests 0 and others are -1 then 0 is used.
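
The "largest value wins" rule, including the caveat about 0 and -1, can
be checked with a short Python sketch.  The function name
effective_relax_level() is hypothetical and the code is a model of the
rule as stated here, not the kernel implementation.

```python
from typing import Iterable


def effective_relax_level(requests: Iterable[int]) -> int:
    """Combine the requests of cpusets that share one sched domain.

    -1 means "no request"; a result of -1 means the system default applies.
    """
    return max(requests, default=-1)


# One cpuset asks for "no search" (0), the others make no request (-1).
level = effective_relax_level([-1, 0, -1])
print(level)
```

Because -1 is the smallest possible value, any explicit request,
including 0, overrides cpusets that made no request.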
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 595) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 596) Note that modifying this file will have both good and bad effects,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 597) and whether it is acceptable or not depends on your situation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 598) Don't modify this file if you are not sure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 599) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 600) If your situation is:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 601) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 602)  - The migration costs between CPUs can be assumed to be considerably
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 603)    small (for you) due to your special application's behavior or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 604)    special hardware support for CPU cache etc.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 605)  - The searching cost has no impact (for you), or you can make
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 606)    the searching cost small enough by keeping the cpuset compact etc.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 607)  - Low latency is required even if it sacrifices cache hit rate etc.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 608)    then increasing 'sched_relax_domain_level' would benefit you.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 609) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 610) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 611) 1.9 How do I use cpusets ?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 612) --------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 613) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 614) In order to minimize the impact of cpusets on critical kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 615) code, such as the scheduler, and due to the fact that the kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 616) does not support one task updating the memory placement of another
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 617) task directly, the impact on a task of changing its cpuset CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 618) or Memory Node placement, or of changing to which cpuset a task
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 619) is attached, is subtle.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 620) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 621) If a cpuset has its Memory Nodes modified, then for each task attached
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 622) to that cpuset, the next time that the kernel attempts to allocate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 623) a page of memory for that task, the kernel will notice the change
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 624) in the task's cpuset, and update its per-task memory placement to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 625) remain within the new cpuset's memory placement.  If the task was using
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 626) mempolicy MPOL_BIND, and the nodes to which it was bound overlap with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 627) its new cpuset, then the task will continue to use whatever subset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 628) of MPOL_BIND nodes are still allowed in the new cpuset.  If the task
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 629) was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 630) in the new cpuset, then the task will be essentially treated as if it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 631) was MPOL_BIND bound to the new cpuset (even though its NUMA placement,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 632) as queried by get_mempolicy(), doesn't change).  If a task is moved
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 633) from one cpuset to another, then the kernel will adjust the task's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 634) memory placement, as above, the next time that the kernel attempts
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 635) to allocate a page of memory for that task.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 636) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 637) If a cpuset has its 'cpuset.cpus' modified, then each task in that cpuset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 638) will have its allowed CPU placement changed immediately.  Similarly,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 639) if a task's pid is written to another cpuset's 'tasks' file, then its
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 640) allowed CPU placement is changed immediately.  If such a task had been
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 641) bound to some subset of its cpuset using the sched_setaffinity() call,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 642) the task will be allowed to run on any CPU allowed in its new cpuset,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 643) negating the effect of the prior sched_setaffinity() call.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 644) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 645) In summary, the memory placement of a task whose cpuset is changed is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 646) updated by the kernel, on the next allocation of a page for that task,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 647) and the processor placement is updated immediately.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 648) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 649) Normally, once a page is allocated (given a physical page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 650) of main memory) then that page stays on whatever node it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 651) was allocated on, so long as it remains allocated, even if the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 652) cpuset's memory placement policy 'cpuset.mems' subsequently changes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 653) If the cpuset flag file 'cpuset.memory_migrate' is set true, then when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 654) tasks are attached to that cpuset, any pages that task had
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 655) allocated to it on nodes in its previous cpuset are migrated
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 656) to the task's new cpuset. The relative placement of the page within
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 657) the cpuset is preserved during these migration operations if possible.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 658) For example if the page was on the second valid node of the prior cpuset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 659) then the page will be placed on the second valid node of the new cpuset.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 660) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 661) Also if 'cpuset.memory_migrate' is set true, then if that cpuset's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 662) 'cpuset.mems' file is modified, pages allocated to tasks in that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 663) cpuset, that were on nodes in the previous setting of 'cpuset.mems',
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 664) will be moved to nodes in the new setting of 'cpuset.mems'.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 665) Pages that were not in the task's prior cpuset, or in the cpuset's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 666) prior 'cpuset.mems' setting, will not be moved.
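
The "relative placement" rule can be illustrated with a small Python
model.  This is a sketch only: remap_node() is a hypothetical helper,
and wrapping around when the new cpuset has fewer nodes than the old
one is an assumption chosen for the example, since the text says only
that placement is preserved "if possible".

```python
from typing import Iterable


def remap_node(node: int, old_mems: Iterable[int], new_mems: Iterable[int]) -> int:
    """Map a page's node to the node at the same position in the new set."""
    old = sorted(old_mems)
    new = sorted(new_mems)
    if node not in old:
        return node  # pages outside the prior 'cpuset.mems' are not moved
    position = old.index(node)
    # Assumption: wrap around if the new set is smaller than the old one.
    return new[position % len(new)]


# A page on the second valid node of the old cpuset (node 1) lands on
# the second valid node of the new cpuset (node 5).
target = remap_node(1, old_mems=[0, 1, 2, 3], new_mems=[4, 5, 6, 7])
print(target)
```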
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 667) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 668) There is an exception to the above.  If hotplug functionality is used
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 669) to remove all the CPUs that are currently assigned to a cpuset,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 670) then all the tasks in that cpuset will be moved to the nearest ancestor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 671) with non-empty cpus.  But the moving of some (or all) tasks might fail if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 672) cpuset is bound with another cgroup subsystem which has some restrictions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 673) on task attaching.  In this failing case, those tasks will stay
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 674) in the original cpuset, and the kernel will automatically update
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 675) their cpus_allowed to allow all online CPUs.  When memory hotplug
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 676) functionality for removing Memory Nodes is available, a similar exception
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 677) is expected to apply there as well.  In general, the kernel prefers to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 678) violate cpuset placement, over starving a task that has had all
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 679) its allowed CPUs or Memory Nodes taken offline.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 680) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 681) There is a second exception to the above.  GFP_ATOMIC requests are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 682) kernel internal allocations that must be satisfied immediately.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 683) The kernel may drop some request, in rare cases even panic, if a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 684) GFP_ATOMIC alloc fails.  If the request cannot be satisfied within
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 685) the current task's cpuset, then we relax the cpuset, and look for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 686) memory anywhere we can find it.  It's better to violate the cpuset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 687) than stress the kernel.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 688) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 689) To start a new job that is to be contained within a cpuset, the steps are:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 690) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 691)  1) mkdir /sys/fs/cgroup/cpuset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 692)  2) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 693)  3) Create the new cpuset by doing mkdir's and write's (or echo's) in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 694)     the /sys/fs/cgroup/cpuset virtual file system.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 695)  4) Start a task that will be the "founding father" of the new job.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 696)  5) Attach that task to the new cpuset by writing its pid to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 697)     /sys/fs/cgroup/cpuset tasks file for that cpuset.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 698)  6) fork, exec or clone the job tasks from this founding father task.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 699) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 700) For example, the following sequence of commands will set up a cpuset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 701) named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 702) and then start a subshell 'sh' in that cpuset::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 703) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 704)   mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 705)   cd /sys/fs/cgroup/cpuset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 706)   mkdir Charlie
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 707)   cd Charlie
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 708)   /bin/echo 2-3 > cpuset.cpus
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 709)   /bin/echo 1 > cpuset.mems
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 710)   /bin/echo $$ > tasks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 711)   sh
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 712)   # The subshell 'sh' is now running in cpuset Charlie
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 713)   # The next line should display '/Charlie'
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 714)   cat /proc/self/cpuset
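
The creation steps above can also be wrapped in a small shell function.
This is only a sketch: the name make_cpuset and its argument order are
hypothetical, and the cpuset hierarchy is assumed to be already mounted
at the directory passed as the first argument.

```shell
# make_cpuset ROOT NAME CPUS MEMS
# Hypothetical helper: create cpuset NAME under ROOT (for example
# /sys/fs/cgroup/cpuset) and set its CPUs and Memory Nodes.
make_cpuset() {
    root=$1 name=$2 cpus=$3 mems=$4
    mkdir "$root/$name" || return 1
    # /bin/echo is used so that a failed write is reported.
    /bin/echo "$cpus" > "$root/$name/cpuset.cpus" || return 1
    /bin/echo "$mems" > "$root/$name/cpuset.mems" || return 1
}
```

With the hierarchy mounted as above, "make_cpuset /sys/fs/cgroup/cpuset
Charlie 2-3 1" would create the same cpuset as the example; attaching a
task is still done by writing its pid to the tasks file.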
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 715) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 716) There are ways to query or modify cpusets:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 717) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 718)  - via the cpuset file system directly, using the various cd, mkdir, echo,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 719)    cat, rmdir commands from the shell, or their equivalent from C.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 720)  - via the C library libcpuset.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 721)  - via the C library libcgroup.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 722)    (http://sourceforge.net/projects/libcg/)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 723)  - via the python application cset.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 724)    (http://code.google.com/p/cpuset/)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 725) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 726) The sched_setaffinity calls can also be done at the shell prompt using
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 727) SGI's runon or Robert Love's taskset.  The mbind and set_mempolicy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 728) calls can be done at the shell prompt using the numactl command
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 729) (part of Andi Kleen's numa package).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 730) 
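
Whichever of these tools is used, the placement that results for a task
can be checked in its /proc status file, which has Cpus_allowed_list and
Mems_allowed_list fields.  The helper below is a minimal sketch; the
name show_allowed is hypothetical, and the file argument exists only so
that a status file other than /proc/self/status can be examined.

```shell
# show_allowed [STATUS_FILE]
# Hypothetical helper: print the CPUs and Memory Nodes a task may use,
# as listed in its /proc/<pid>/status file (default: the caller's own).
show_allowed() {
    status=${1:-/proc/self/status}
    grep -E '^(Cpus|Mems)_allowed_list:' "$status"
}
```

For another task, pass its status file, for example
"show_allowed /proc/1234/status" where 1234 is a placeholder pid.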
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 731) 2. Usage Examples and Syntax
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 732) ============================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 733) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 734) 2.1 Basic Usage
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 735) ---------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 736) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 737) Creating, modifying, and using cpusets can be done through the cpuset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 738) virtual filesystem.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 739) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 740) To mount it, type::

^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 741)   # mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 742) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 743) Then under /sys/fs/cgroup/cpuset you can find a tree that corresponds to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 744) tree of the cpusets in the system. For instance, /sys/fs/cgroup/cpuset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 745) is the cpuset that holds the whole system.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 746) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 747) If you want to create a new cpuset under /sys/fs/cgroup/cpuset::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 748) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 749)   # cd /sys/fs/cgroup/cpuset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 750)   # mkdir my_cpuset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 751) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 752) Now you want to do something with this cpuset::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 753) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 754)   # cd my_cpuset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 755) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 756) In this directory you can find several files::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 757) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 758)   # ls
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 759)   cgroup.clone_children  cpuset.memory_pressure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 760)   cgroup.event_control   cpuset.memory_spread_page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 761)   cgroup.procs           cpuset.memory_spread_slab
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 762)   cpuset.cpu_exclusive   cpuset.mems
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 763)   cpuset.cpus            cpuset.sched_load_balance
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 764)   cpuset.mem_exclusive   cpuset.sched_relax_domain_level
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 765)   cpuset.mem_hardwall    notify_on_release
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 766)   cpuset.memory_migrate  tasks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 767) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 768) Reading them will give you information about the state of this cpuset:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 769) the CPUs and Memory Nodes it can use, the processes that are using
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 770) it, and its properties.  By writing to these files you can manipulate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 771) the cpuset.
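
To read the whole state of a cpuset at once, the cpuset.* files can be
printed in a loop.  A minimal sketch, assuming each of these files holds
a single line; the function name dump_cpuset is hypothetical.

```shell
# dump_cpuset DIR
# Hypothetical helper: print "file: value" for every cpuset.* file in
# the cpuset directory DIR.
dump_cpuset() {
    dir=$1
    for f in "$dir"/cpuset.*; do
        # Skip the unexpanded pattern if DIR has no cpuset.* files.
        [ -f "$f" ] || continue
        printf '%s: %s\n' "${f##*/}" "$(cat "$f")"
    done
}
```

From inside the cpuset directory created above, "dump_cpuset ." lists
every cpuset property with its current value.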
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 772) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 773) Set some flags::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 774) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 775)   # /bin/echo 1 > cpuset.cpu_exclusive
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 776) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 777) Add some cpus::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 778) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 779)   # /bin/echo 0-7 > cpuset.cpus
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 780) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 781) Add some mems::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 782) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 783)   # /bin/echo 0-7 > cpuset.mems
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 784) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 785) Now attach your shell to this cpuset::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 786) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 787)   # /bin/echo $$ > tasks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 788) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 789) You can also create cpusets inside your cpuset by using mkdir in this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 790) directory::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 791) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 792)   # mkdir my_sub_cs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 793) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 794) To remove a cpuset, just use rmdir::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 795) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 796)   # rmdir my_sub_cs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 797) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 798) This will fail if the cpuset is in use (has cpusets inside, or has
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 799) processes attached).
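
A script can test for both conditions before calling rmdir.  The sketch
below uses a hypothetical function name, cpuset_in_use.  It reads the
tasks file rather than testing its size, on the assumption that the
size reported for files in this virtual file system is not meaningful.

```shell
# cpuset_in_use DIR
# Hypothetical helper: succeed (return 0) if the cpuset DIR has child
# cpusets or attached tasks, fail (return 1) if it looks removable.
cpuset_in_use() {
    dir=$1
    # Any subdirectory is a child cpuset.
    for d in "$dir"/*/; do
        [ -d "$d" ] && return 0
    done
    # Any content in the tasks file means a task is still attached.
    [ -n "$(cat "$dir/tasks" 2>/dev/null)" ] && return 0
    return 1
}
```

For example: "cpuset_in_use my_sub_cs || rmdir my_sub_cs".  A task may
still be attached between the check and the rmdir, so the exit status
of rmdir has to be checked in any case.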
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 800) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 801) Note that for legacy reasons, the "cpuset" filesystem exists as a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 802) wrapper around the cgroup filesystem.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 803) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 804) The command::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 805) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 806)   mount -t cpuset X /sys/fs/cgroup/cpuset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 807) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 808) is equivalent to::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 809) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 810)   mount -t cgroup -ocpuset,noprefix X /sys/fs/cgroup/cpuset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 811)   echo "/sbin/cpuset_release_agent" > /sys/fs/cgroup/cpuset/release_agent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 812) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 813) 2.2 Adding/removing cpus
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 814) ------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 815) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 816) This is the syntax to use when writing in the cpus or mems files
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 817) in cpuset directories::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 818) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 819)   # /bin/echo 1-4 > cpuset.cpus		-> set cpus list to cpus 1,2,3,4
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 820)   # /bin/echo 1,2,3,4 > cpuset.cpus	-> set cpus list to cpus 1,2,3,4
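
To make the equivalence of the two forms concrete, the following sketch
expands a list into individual numbers.  The function name
expand_cpulist is hypothetical, and it handles only the two forms shown
above (single numbers and N-M ranges, separated by commas).

```shell
# expand_cpulist LIST
# Hypothetical helper: print the numbers named by a list such as
# "1-4,6", separated by spaces.
expand_cpulist() {
    list=$1
    out=
    old_ifs=$IFS
    IFS=,
    for part in $list; do
        case $part in
            *-*) first=${part%-*}; last=${part#*-} ;;
            *)   first=$part; last=$part ;;
        esac
        i=$first
        while [ "$i" -le "$last" ]; do
            out="$out${out:+ }$i"
            i=$((i + 1))
        done
    done
    IFS=$old_ifs
    printf '%s\n' "$out"
}
```

Both "expand_cpulist 1-4" and "expand_cpulist 1,2,3,4" print the same
four numbers.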
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 821) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 822) To add a CPU to a cpuset, write the new list of CPUs including the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 823) CPU to be added. To add 6 to the above cpuset::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 824) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 825)   # /bin/echo 1-4,6 > cpuset.cpus	-> set cpus list to cpus 1,2,3,4,6
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 826) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 827) Similarly, to remove a CPU from a cpuset, write the new list of CPUs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 828) without the CPU to be removed.
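
Because each write replaces the whole list, a script that adds a CPU
has to read the current list first.  A minimal sketch, with the
hypothetical name add_cpu; it does not check whether the CPU is already
present in the list.

```shell
# add_cpu CPUS_FILE CPU
# Hypothetical helper: append CPU to the list currently stored in
# CPUS_FILE (for example my_cpuset/cpuset.cpus) and write it back.
add_cpu() {
    cpus_file=$1 cpu=$2
    current=$(cat "$cpus_file") || return 1
    if [ -n "$current" ]; then
        new="$current,$cpu"
    else
        new=$cpu
    fi
    /bin/echo "$new" > "$cpus_file"
}
```

For the cpuset above, "add_cpu cpuset.cpus 6" writes the list 1-4,6.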
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 829) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 830) To remove all the CPUs::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 831) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 832)   # /bin/echo "" > cpuset.cpus		-> clear cpus list
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 833) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 834) 2.3 Setting flags
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 835) -----------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 836) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 837) The syntax is very simple::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 838) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 839)   # /bin/echo 1 > cpuset.cpu_exclusive 	-> set flag 'cpuset.cpu_exclusive'
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 840)   # /bin/echo 0 > cpuset.cpu_exclusive 	-> unset flag 'cpuset.cpu_exclusive'
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 841) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 842) 2.4 Attaching processes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 843) -----------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 844) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 845) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 846) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 847)   # /bin/echo PID > tasks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 848) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 849) Note that it is PID, not PIDs. You can only attach ONE task at a time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 850) If you have several tasks to attach, you have to do it one after another::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 851) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 852)   # /bin/echo PID1 > tasks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 853)   # /bin/echo PID2 > tasks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 854) 	...
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 855)   # /bin/echo PIDn > tasks
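
In a script, the same thing is a loop that performs one write per pid.
A minimal sketch; the function name attach_tasks is hypothetical, and
it stops at the first pid that cannot be attached.

```shell
# attach_tasks TASKS_FILE PID...
# Hypothetical helper: attach each PID with its own write to TASKS_FILE
# (for example my_cpuset/tasks).
attach_tasks() {
    tasks_file=$1
    shift
    for pid in "$@"; do
        /bin/echo "$pid" > "$tasks_file" || {
            echo "failed to attach $pid" >&2
            return 1
        }
    done
}
```

For example, "attach_tasks tasks PID1 PID2" from inside the cpuset
directory.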
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 856) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 857) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 858) 3. Questions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 859) ============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 860) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 861) Q:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 862)    What's up with this '/bin/echo'?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 863) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 864) A:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 865)    bash's builtin 'echo' command does not check calls to write() against
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 866)    errors. If you use it in the cpuset file system, you won't be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 867)    able to tell whether a command succeeded or failed.
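
   With '/bin/echo', the exit status can be tested in the usual way.
   A minimal sketch, using a hypothetical wrapper named checked_write:

```shell
# checked_write VALUE FILE
# Hypothetical wrapper: write VALUE to FILE with /bin/echo and report
# on stderr if the write, or opening the file, fails.
checked_write() {
    value=$1 file=$2
    if /bin/echo "$value" > "$file"; then
        return 0
    fi
    echo "write of '$value' to $file failed" >&2
    return 1
}
```

   For example, "checked_write 2-3 cpuset.cpus || exit 1" stops a
   script as soon as a setting is rejected.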
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 868) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 869) Q:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 870)    When I attach processes, only the first one on the line actually gets attached!
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 871) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 872) A:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 873)    We can only return one error code per call to write(), so you should
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 874)    write only ONE pid per call.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 875) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 876) 4. Contact
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 877) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 878) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 879) Web: http://www.bullopensource.org/cpuset