=====================
CFS Bandwidth Control
=====================

[ This document only discusses CPU bandwidth control for SCHED_NORMAL.
  The SCHED_RT case is covered in Documentation/scheduler/sched-rt-group.rst ]

CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
specification of the maximum CPU bandwidth available to a group or hierarchy.

The bandwidth allowed for a group is specified using a quota and period. Within
each given "period" (microseconds), a task group is allocated up to "quota"
microseconds of CPU time. That quota is assigned to per-cpu run queues in
slices as threads in the cgroup become runnable. Once all quota has been
assigned, any additional requests for quota will result in those threads being
throttled. Throttled threads will not be able to run again until the next
period when the quota is replenished.

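For example, a group given a 50ms quota within a 100ms period is limited to
half a CPU of runtime on average; in the microsecond units the interface
uses::

    50000 us quota / 100000 us period = 0.5 CPUs
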
A group's unassigned quota is globally tracked, being refreshed back to
cfs_quota units at each period boundary. As threads consume this bandwidth it
is transferred to cpu-local "silos" on a demand basis. The amount transferred
within each of these updates is tunable and described as the "slice".

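With the default 5ms slice (see "System wide settings" below), a 50ms quota
is handed out to the cpu-local silos in at most::

    50000 us quota / 5000 us slice = 10 transfers per period
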
Management
----------
Quota and period are managed within the cpu subsystem via cgroupfs.

cpu.cfs_quota_us: the total available run-time within a period (in microseconds)
cpu.cfs_period_us: the length of a period (in microseconds)
cpu.stat: exports throttling statistics [explained further below]

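A minimal session, assuming a cgroup v1 hierarchy mounted at
/sys/fs/cgroup/cpu and a hypothetical group named "example", might look
like::

    # cd /sys/fs/cgroup/cpu
    # mkdir example                     /* create the task group */
    # cat example/cpu.cfs_period_us     /* period defaults to 100ms */
    100000
    # cat example/cpu.cfs_quota_us      /* unconstrained by default */
    -1
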
The default values are::

    cpu.cfs_period_us=100ms
    cpu.cfs_quota_us=-1

A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
bandwidth restriction in place; such a group is described as an unconstrained
bandwidth group. This represents the traditional work-conserving behavior for
CFS.

Writing any (valid) positive value(s) will enact the specified bandwidth limit.
The minimum value allowed for either the quota or the period is 1ms. There is
also an upper bound on the period length of 1s. Additional restrictions exist
when bandwidth limits are used in a hierarchical fashion; these are explained
in more detail below.

Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit
and return the group to an unconstrained state once more.

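For example, to lift the limit on the hypothetical "example" group above::

    # echo -1 > example/cpu.cfs_quota_us
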
Any updates to a group's bandwidth specification will result in it becoming
unthrottled if it is in a constrained state.

System wide settings
--------------------
For efficiency, run-time is transferred between the global pool and CPU local
"silos" in a batch fashion. This greatly reduces global accounting pressure
on large systems. The amount transferred each time such an update is required
is described as the "slice".

This is tunable via procfs::

    /proc/sys/kernel/sched_cfs_bandwidth_slice_us (default=5ms)

Larger slice values will reduce transfer overheads, while smaller values allow
for more fine-grained consumption.

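The current slice can be inspected, and with sufficient privileges adjusted,
directly through this file; note that the value is in microseconds::

    # cat /proc/sys/kernel/sched_cfs_bandwidth_slice_us
    5000
    # echo 10000 > /proc/sys/kernel/sched_cfs_bandwidth_slice_us  /* 10ms */
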
Statistics
----------
A group's bandwidth statistics are exported via 3 fields in cpu.stat.

cpu.stat:

- nr_periods: Number of enforcement intervals that have elapsed.
- nr_throttled: Number of times the group has been throttled/limited.
- throttled_time: The total time duration (in nanoseconds) for which entities
  of the group have been throttled.

This interface is read-only.

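Reading the file for a group that has been running under a limit produces one
line per field; the values below are purely illustrative::

    # cat example/cpu.stat
    nr_periods 1000
    nr_throttled 42
    throttled_time 378918743
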
Hierarchical considerations
---------------------------
The interface enforces that an individual entity's bandwidth is always
attainable, that is: max(c_i) <= C. However, over-subscription in the
aggregate case is explicitly allowed to enable work-conserving semantics
within a hierarchy:

e.g. \Sum (c_i) may exceed C

[ Where C is the parent's bandwidth, and c_i its children ]

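For instance, a parent with 100ms of quota per period may have two children
with 80ms each: every individual limit is attainable (80ms <= 100ms), yet in
aggregate::

    80 ms + 80 ms = 160 ms > 100 ms
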
There are two ways in which a group may become throttled:

a. it fully consumes its own quota within a period
b. a parent's quota is fully consumed within its period

In case b) above, even though the child may have runtime remaining it will not
be allowed to run until the parent's runtime is refreshed.

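A sketch of such a hierarchy, again assuming a cgroup v1 mount at
/sys/fs/cgroup/cpu and hypothetical group names, using the numbers from the
example above::

    # cd /sys/fs/cgroup/cpu
    # mkdir parent parent/child1 parent/child2
    # echo 100000 > parent/cpu.cfs_quota_us         /* parent: 100ms/100ms */
    # echo 80000  > parent/child1/cpu.cfs_quota_us  /* child1: 80ms/100ms  */
    # echo 80000  > parent/child2/cpu.cfs_quota_us  /* child2: 80ms/100ms  */

If both children are busy they can exhaust the parent's 100ms mid-period, at
which point both are throttled even though each still has quota of its own
remaining.
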
CFS Bandwidth Quota Caveats
---------------------------
Once a slice is assigned to a cpu it does not expire. However, all but 1ms of
the slice may be returned to the global pool if all threads on that cpu become
unrunnable. This is configured at compile time by the min_cfs_rq_runtime
variable. This is a performance tweak that helps prevent added contention on
the global lock.

The fact that cpu-local slices do not expire results in some interesting corner
cases that should be understood.

For cgroup cpu-constrained applications that are also cpu-bound this is a
relatively moot point because they will naturally consume the entirety of
their quota as well as the entirety of each cpu-local slice in each period. As
a result it is expected that nr_periods will roughly equal nr_throttled, and
that cpuacct.usage will increase by roughly cfs_quota_us in each period.

For highly-threaded, non-cpu-bound applications this non-expiration nuance
allows applications to briefly burst past their quota limits by the amount of
unused slice on each cpu that the task group is running on (typically at most
1ms per cpu or as defined by min_cfs_rq_runtime). This slight burst only
applies if quota had been assigned to a cpu and then not fully used or returned
in previous periods. This burst amount will not be transferred between cores.
As a result, this mechanism still strictly limits the task group to quota
average usage, albeit over a longer time window than a single period. This
also limits the burst ability to no more than 1ms per cpu. This provides a
better, more predictable user experience for highly threaded applications with
small quota limits on high core count machines. It also eliminates the
propensity to throttle these applications while simultaneously using less than
quota amounts of cpu. Another way to say this is that by allowing the unused
portion of a slice to remain valid across periods we have decreased the
possibility of wastefully expiring quota on cpu-local silos that don't need a
full slice's amount of cpu time.

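As a worked example of the worst case, a group with a 20ms quota whose
threads have recently run on 8 cpus could consume up to::

    20 ms quota + 8 cpus * 1 ms leftover slice = 28 ms

in a single period, before the cross-period averaging described above pulls
its usage back down to the 20ms quota.
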
The interaction between cpu-bound and non-cpu-bound interactive applications
should also be considered, especially when single core usage hits 100%. If you
gave each of these applications half of a cpu-core and they both got scheduled
on the same CPU it is theoretically possible that the non-cpu-bound application
will use up to 1ms additional quota in some periods, thereby preventing the
cpu-bound application from fully using its quota by that same amount. In these
instances it will be up to the CFS algorithm (see sched-design-CFS.rst) to
decide which application is chosen to run, as they will both be runnable and
have remaining quota. This runtime discrepancy will be made up in the following
periods when the interactive application idles.

Examples
--------
1. Limit a group to 1 CPU worth of runtime.

   If period is 250ms and quota is also 250ms, the group will get
   1 CPU worth of runtime every 250ms::

      # echo 250000 > cpu.cfs_quota_us /* quota = 250ms */
      # echo 250000 > cpu.cfs_period_us /* period = 250ms */

2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine.

   With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
   runtime every 500ms::

      # echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
      # echo 500000 > cpu.cfs_period_us /* period = 500ms */

   The larger period here allows for increased burst capacity.

3. Limit a group to 20% of 1 CPU.

   With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU::

      # echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
      # echo 50000 > cpu.cfs_period_us /* period = 50ms */

   By using a small period here we are ensuring a consistent latency
   response at the expense of burst capacity.

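In general, the quota for an N-CPU limit is simply N times the period; for
instance, a hypothetical 4-CPU limit with the default 100ms period::

    # echo 400000 > cpu.cfs_quota_us /* quota = 4 * 100ms = 400ms */
    # echo 100000 > cpu.cfs_period_us /* period = 100ms */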