=====================
CFS Bandwidth Control
=====================

[ This document only discusses CPU bandwidth control for SCHED_NORMAL.
  The SCHED_RT case is covered in Documentation/scheduler/sched-rt-group.rst ]

CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
specification of the maximum CPU bandwidth available to a group or hierarchy.

The bandwidth allowed for a group is specified using a quota and period. Within
each given "period" (microseconds), a task group is allocated up to "quota"
microseconds of CPU time. That quota is assigned to per-cpu run queues in
slices as threads in the cgroup become runnable. Once all quota has been
assigned, any additional requests for quota will result in those threads being
throttled. Throttled threads will not be able to run again until the next
period, when the quota is replenished.

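The ratio quota/period therefore gives the average number of CPUs' worth of
runtime the group may consume. As a minimal sketch (the mount point
/sys/fs/cgroup/cpu and the group name "sketch" are assumptions, not part of
this document), limiting a group to half of one CPU on average looks like::

	# cd /sys/fs/cgroup/cpu/sketch     /* hypothetical group directory */
	# echo 100000 > cpu.cfs_period_us  /* period = 100ms */
	# echo 50000 > cpu.cfs_quota_us    /* quota = 50ms */
	/* 50ms of runtime per 100ms period = 0.5 CPU on average */
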
A group's unassigned quota is globally tracked, being refreshed back to
cfs_quota units at each period boundary. As threads consume this bandwidth it
is transferred to cpu-local "silos" on demand. The amount transferred within
each of these updates is tunable and described as the "slice".

Management
----------
Quota and period are managed within the cpu subsystem via cgroupfs.

cpu.cfs_quota_us: the total available run-time within a period (in microseconds)
cpu.cfs_period_us: the length of a period (in microseconds)
cpu.stat: exports throttling statistics [explained further below]

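These files appear in each group's directory under the mounted cpu controller.
As a hedged sketch (the mount point /sys/fs/cgroup/cpu and the group name
"example" are assumptions), creating a group and attaching a task looks like::

	# mkdir /sys/fs/cgroup/cpu/example           /* create a new task group */
	# echo $$ > /sys/fs/cgroup/cpu/example/tasks /* attach the current shell */
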
The default values are::

	cpu.cfs_period_us=100ms
	cpu.cfs_quota_us=-1

A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
bandwidth restriction in place; such a group is described as an unconstrained
bandwidth group. This represents the traditional work-conserving behavior for
CFS.

Writing any (valid) positive value(s) will enact the specified bandwidth limit.
The minimum allowed value for either the quota or the period is 1ms. There is
also an upper bound on the period length of 1s. Additional restrictions exist
when bandwidth limits are used in a hierarchical fashion; these are explained
in more detail below.

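To illustrate the bounds (a sketch only; the exact error text depends on the
shell), out-of-range writes are rejected::

	# echo 500 > cpu.cfs_period_us      /* 0.5ms: below the 1ms minimum */
	-bash: echo: write error: Invalid argument
	# echo 2000000 > cpu.cfs_period_us  /* 2s: above the 1s maximum */
	-bash: echo: write error: Invalid argument
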
Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit
and return the group to an unconstrained state once more.

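For example, from within the group's directory::

	# echo -1 > cpu.cfs_quota_us /* remove the limit; group is unconstrained */
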
Any updates to a group's bandwidth specification will result in it becoming
unthrottled if it is in a constrained state.

System wide settings
--------------------
For efficiency, run-time is transferred between the global pool and CPU local
"silos" in a batch fashion. This greatly reduces global accounting pressure
on large systems. The amount transferred each time such an update is required
is described as the "slice".

This is tunable via procfs::

	/proc/sys/kernel/sched_cfs_bandwidth_slice_us (default=5ms)

Larger slice values will reduce transfer overheads, while smaller values allow
for more fine-grained consumption.

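The slice can be inspected and adjusted like any other procfs tunable; values
are in microseconds::

	# cat /proc/sys/kernel/sched_cfs_bandwidth_slice_us
	5000                                          /* default: 5ms */
	# echo 10000 > /proc/sys/kernel/sched_cfs_bandwidth_slice_us
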
Statistics
----------
A group's bandwidth statistics are exported via 3 fields in cpu.stat.

cpu.stat:

- nr_periods: Number of enforcement intervals that have elapsed.
- nr_throttled: Number of times the group has been throttled/limited.
- throttled_time: The total time duration (in nanoseconds) for which entities
  of the group have been throttled.

This interface is read-only.

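As an illustration only (the numbers below are invented), a group that was
throttled in roughly half of its elapsed periods might report::

	# cat cpu.stat
	nr_periods 1000
	nr_throttled 512
	throttled_time 31415926535  /* total time throttled, in nanoseconds */
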
Hierarchical considerations
---------------------------
The interface enforces that an individual entity's bandwidth is always
attainable, that is: max(c_i) <= C. However, over-subscription in the
aggregate case is explicitly allowed to enable work-conserving semantics
within a hierarchy:

  e.g. \Sum (c_i) may exceed C

[ Where C is the parent's bandwidth, and c_i its children ]

There are two ways in which a group may become throttled:

	a. it fully consumes its own quota within a period
	b. a parent's quota is fully consumed within its period

In case b) above, even though the child may have runtime remaining it will not
be allowed to run until the parent's runtime is refreshed.

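A concrete sketch of the permitted over-subscription (group names and layout
are assumptions): a parent limited to 1 CPU may contain two children that are
each also limited to 1 CPU. Each child individually satisfies c_i <= C, yet
\Sum (c_i) = 2 CPUs exceeds C. Periods are left at the default 100ms::

	# echo 100000 > parent/cpu.cfs_quota_us        /* C: 1 CPU */
	# echo 100000 > parent/child1/cpu.cfs_quota_us /* c_1: 1 CPU */
	# echo 100000 > parent/child2/cpu.cfs_quota_us /* c_2: 1 CPU */
	/* Allowed, but once the parent's 100ms of quota is consumed, both
	   children are throttled (case b) until the parent is refreshed. */
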
CFS Bandwidth Quota Caveats
---------------------------
Once a slice is assigned to a cpu it does not expire. However, all but 1ms of
the slice may be returned to the global pool if all threads on that cpu become
unrunnable. This is configured at compile time by the min_cfs_rq_runtime
variable. This is a performance tweak that helps prevent added contention on
the global lock.

The fact that cpu-local slices do not expire results in some interesting corner
cases that should be understood.

For cgroup cpu constrained applications this is a relatively moot point because
they will naturally consume the entirety of their quota as well as the entirety
of each cpu-local slice in each period. As a result it is expected that
nr_periods will roughly equal nr_throttled, and that cpuacct.usage will
increase by roughly cfs_quota_us in each period.

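One way to check this expectation on a live group (a sketch; run from the
group's cgroupfs directory) is to compute the fraction of periods that ended
throttled::

	# awk '$1 == "nr_periods"   { p = $2 }
	       $1 == "nr_throttled" { t = $2 }
	       END { print t / p }' cpu.stat
	0.98   /* illustrative output: almost every period throttled */
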
For highly-threaded, non-cpu bound applications this non-expiration nuance
allows applications to briefly burst past their quota limits by the amount of
unused slice on each cpu that the task group is running on (typically at most
1ms per cpu or as defined by min_cfs_rq_runtime). This slight burst only
applies if quota had been assigned to a cpu and then not fully used or returned
in previous periods. This burst amount will not be transferred between cores.
As a result, this mechanism still strictly limits the task group to quota
average usage, albeit over a longer time window than a single period. This
also limits the burst ability to no more than 1ms per cpu. This provides
better, more predictable user experience for highly threaded applications with
small quota limits on high core count machines. It also eliminates the
propensity to throttle these applications while simultaneously using less than
quota amounts of cpu. Another way to say this is that by allowing the unused
portion of a slice to remain valid across periods we have decreased the
possibility of wastefully expiring quota on cpu-local silos that don't need a
full slice's amount of cpu time.

The interaction between cpu-bound and non-cpu-bound interactive applications
should also be considered, especially when single core usage hits 100%. If you
gave each of these applications half of a cpu-core and they both got scheduled
on the same CPU it is theoretically possible that the non-cpu bound application
will use up to 1ms additional quota in some periods, thereby preventing the
cpu-bound application from fully using its quota by that same amount. In these
instances it will be up to the CFS algorithm (see sched-design-CFS.rst) to
decide which application is chosen to run, as they will both be runnable and
have remaining quota. This runtime discrepancy will be made up in the following
periods when the interactive application idles.

Examples
--------
1. Limit a group to 1 CPU worth of runtime.

   If period is 250ms and quota is also 250ms, the group will get
   1 CPU worth of runtime every 250ms::

	# echo 250000 > cpu.cfs_quota_us /* quota = 250ms */
	# echo 250000 > cpu.cfs_period_us /* period = 250ms */

2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine.

   With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
   runtime every 500ms::

	# echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
	# echo 500000 > cpu.cfs_period_us /* period = 500ms */

   The larger period here allows for increased burst capacity.

3. Limit a group to 20% of 1 CPU.

   With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU::

	# echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
	# echo 50000 > cpu.cfs_period_us /* period = 50ms */

   By using a small period here we are ensuring a consistent latency
   response at the expense of burst capacity.