Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   1) ======================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   2) NO_HZ: Reducing Scheduling-Clock Ticks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   3) ======================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   4) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   5) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   6) This document describes Kconfig options and boot parameters that can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   7) reduce the number of scheduling-clock interrupts, thereby improving energy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   8) efficiency and reducing OS jitter.  Reducing OS jitter is important for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   9) some types of computationally intensive high-performance computing (HPC)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  10) applications and for real-time applications.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  11) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  12) There are three main ways of managing scheduling-clock interrupts
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  13) (also known as "scheduling-clock ticks" or simply "ticks"):
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  14) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  15) 1.	Never omit scheduling-clock ticks (CONFIG_HZ_PERIODIC=y or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  16) 	CONFIG_NO_HZ=n for older kernels).  You normally will -not-
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  17) 	want to choose this option.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  18) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  19) 2.	Omit scheduling-clock ticks on idle CPUs (CONFIG_NO_HZ_IDLE=y or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  20) 	CONFIG_NO_HZ=y for older kernels).  This is the most common
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  21) 	approach, and should be the default.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  22) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  23) 3.	Omit scheduling-clock ticks on CPUs that are either idle or that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  24) 	have only one runnable task (CONFIG_NO_HZ_FULL=y).  Unless you
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  25) 	are running realtime applications or certain types of HPC
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  26) 	workloads, you will normally -not- want this option.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  27) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  28) These three cases are described in the following three sections, followed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  29) by a fourth section on RCU-specific considerations, a fifth section
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  30) discussing testing, and a sixth and final section listing known issues.
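
As a rough illustration of the three choices listed above, the following
Python sketch classifies the text of a kernel .config file by tick mode.
The function name tick_mode() and its return strings are hypothetical; only
the Kconfig symbol names come from this document.  Anything that selects
neither dyntick-idle nor adaptive ticks is reported as periodic, which is
an assumption of this sketch.

```python
def tick_mode(config_text):
    """Return "full", "idle", or "periodic" for the given .config text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Enabled boolean options appear as CONFIG_FOO=y.  Disabled ones
        # appear as comments ("# CONFIG_FOO is not set") and are skipped.
        if line.startswith("CONFIG_") and line.endswith("=y"):
            enabled.add(line[:-2])
    if "CONFIG_NO_HZ_FULL" in enabled:
        return "full"
    # CONFIG_NO_HZ=y is the spelling used by older kernels.
    if "CONFIG_NO_HZ_IDLE" in enabled or "CONFIG_NO_HZ" in enabled:
        return "idle"
    return "periodic"


print(tick_mode("CONFIG_NO_HZ_COMMON=y\nCONFIG_NO_HZ_IDLE=y\n"))  # idle
```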
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  31) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  32) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  33) Never Omit Scheduling-Clock Ticks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  34) =================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  35) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  36) Very old versions of Linux from the 1990s and the very early 2000s
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  37) are incapable of omitting scheduling-clock ticks.  It turns out that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  38) there are some situations where this old-school approach is still the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  39) right approach, for example, in heavy workloads with lots of tasks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  40) that use short bursts of CPU, where there are very frequent idle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  41) periods, but where these idle periods are also quite short (tens or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  42) hundreds of microseconds).  For these types of workloads,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  43) scheduling-clock interrupts will normally be delivered anyway because there
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  44) will frequently be multiple runnable tasks per CPU.  In these cases,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  45) attempting to turn off the scheduling-clock interrupt will have no effect
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  46) other than increasing the overhead of switching to and from idle and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  47) transitioning between user and kernel execution.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  48) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  49) This mode of operation can be selected using CONFIG_HZ_PERIODIC=y (or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  50) CONFIG_NO_HZ=n for older kernels).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  51) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  52) However, if you are instead running a light workload with long idle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  53) periods, failing to omit scheduling-clock interrupts will result in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  54) excessive power consumption.  This is especially bad on battery-powered
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  55) devices, where it results in extremely short battery lifetimes.  If you
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  56) are running light workloads, you should therefore read the following
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  57) section.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  58) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  59) In addition, if you are running either a real-time workload or an HPC
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  60) workload with short iterations, the scheduling-clock interrupts can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  61) degrade your application's performance.  If this describes your workload,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  62) you should read the following two sections.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  63) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  64) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  65) Omit Scheduling-Clock Ticks For Idle CPUs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  66) =========================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  67) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  68) If a CPU is idle, there is little point in sending it a scheduling-clock
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  69) interrupt.  After all, the primary purpose of a scheduling-clock interrupt
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  70) is to force a busy CPU to shift its attention among multiple duties,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  71) and an idle CPU has no duties to shift its attention among.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  72) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  73) The CONFIG_NO_HZ_IDLE=y Kconfig option causes the kernel to avoid sending
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  74) scheduling-clock interrupts to idle CPUs, which is critically important
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  75) both to battery-powered devices and to highly virtualized mainframes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  76) A battery-powered device running a CONFIG_HZ_PERIODIC=y kernel would
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  77) drain its battery very quickly, easily 2-3 times as fast as would the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  78) same device running a CONFIG_NO_HZ_IDLE=y kernel.  A mainframe running
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  79) 1,500 OS instances might find that half of its CPU time was consumed by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  80) unnecessary scheduling-clock interrupts.  In these situations, there
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  81) is strong motivation to avoid sending scheduling-clock interrupts to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  82) idle CPUs.  That said, dyntick-idle mode is not free:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  83) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  84) 1.	It increases the number of instructions executed on the path
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  85) 	to and from the idle loop.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  86) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  87) 2.	On many architectures, dyntick-idle mode also increases the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  88) 	number of expensive clock-reprogramming operations.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  89) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  90) Therefore, systems with aggressive real-time response constraints often
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  91) run CONFIG_HZ_PERIODIC=y kernels (or CONFIG_NO_HZ=n for older kernels)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  92) in order to avoid degrading from-idle transition latencies.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  93) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  94) An idle CPU that is not receiving scheduling-clock interrupts is said to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  95) be "dyntick-idle", "in dyntick-idle mode", "in nohz mode", or "running
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  96) tickless".  The remainder of this document will use "dyntick-idle mode".
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  97) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  98) There is also a boot parameter "nohz=" that can be used to disable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  99) dyntick-idle mode in CONFIG_NO_HZ_IDLE=y kernels by specifying "nohz=off".
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) By default, CONFIG_NO_HZ_IDLE=y kernels boot with "nohz=on", enabling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) dyntick-idle mode.
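
The effect of the "nohz=" parameter described above can be sketched in
Python as a check against a kernel command line.  The function name
dyntick_idle_enabled() is hypothetical, and letting the last "nohz="
occurrence win is an assumption of this sketch rather than something this
document states.

```python
def dyntick_idle_enabled(cmdline):
    """Return False if the command line disables dyntick-idle mode."""
    enabled = True  # documented default for CONFIG_NO_HZ_IDLE=y kernels
    for param in cmdline.split():
        # Exact matches only, so "nohz_full=..." is not confused with "nohz=".
        if param == "nohz=off":
            enabled = False
        elif param == "nohz=on":
            enabled = True
    return enabled


print(dyntick_idle_enabled("console=ttyS2 nohz=off"))  # False
```

On a running system, the string to pass in can be read from /proc/cmdline.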
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) Omit Scheduling-Clock Ticks For CPUs With Only One Runnable Task
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) ================================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) If a CPU has only one runnable task, there is little point in sending it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) a scheduling-clock interrupt because there is no other task to switch to.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) Note that omitting scheduling-clock ticks for CPUs with only one runnable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) task implies also omitting them for idle CPUs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to avoid
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) sending scheduling-clock interrupts to CPUs with a single runnable task,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) and such CPUs are said to be "adaptive-ticks CPUs".  This is important
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) for applications with aggressive real-time response constraints because
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) it allows them to improve their worst-case response times by the maximum
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) duration of a scheduling-clock interrupt.  It is also important for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) computationally intensive short-iteration workloads:  If any CPU is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) delayed during a given iteration, all the other CPUs will be forced to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) wait idle while the delayed CPU finishes.  Thus, the delay is multiplied
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) by one less than the number of CPUs.  In these situations, there is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) again strong motivation to avoid sending scheduling-clock interrupts.
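
The multiplication described above can be made concrete with a small
worked example.  The function name and the numbers (a 10-microsecond delay
on a 64-CPU system) are purely illustrative assumptions.

```python
def idle_cpu_time_us(delay_us, num_cpus):
    """CPU time wasted when one CPU's delay stalls all of the others.

    Every CPU except the delayed one waits idle for delay_us, so the
    delay is multiplied by one less than the number of CPUs.
    """
    return delay_us * (num_cpus - 1)


print(idle_cpu_time_us(10, 64))  # 630
```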
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) By default, no CPU will be an adaptive-ticks CPU.  The "nohz_full="
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) boot parameter specifies the adaptive-ticks CPUs.  For example,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) "nohz_full=1,6-8" says that CPUs 1, 6, 7, and 8 are to be adaptive-ticks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) CPUs.  Note that you are prohibited from marking all of the CPUs as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) adaptive-tick CPUs:  At least one non-adaptive-tick CPU must remain
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) online to handle timekeeping tasks in order to ensure that system
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) calls like gettimeofday() return accurate values on adaptive-tick CPUs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) (This is not an issue for CONFIG_NO_HZ_IDLE=y because there are no running
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) user processes to observe slight drifts in clock rate.)  Therefore, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) boot CPU is prohibited from entering adaptive-ticks mode.  Specifying a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) "nohz_full=" mask that includes the boot CPU will result in a boot-time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) error message, and the boot CPU will be removed from the mask.  Note that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) this means that your system must have at least two CPUs in order for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) CONFIG_NO_HZ_FULL=y to do anything for you.
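
The CPU-list syntax and the boot-CPU rule above can be sketched in Python
as follows.  This is a simplified model, not the kernel's parser: it
handles only plain CPU numbers and ranges, and it assumes that the boot
CPU is CPU 0.  Both function names are hypothetical.

```python
def parse_cpu_list(spec):
    """Expand a list such as "1,6-8" into a sorted list of CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def adaptive_tick_cpus(spec, boot_cpu=0):
    """Model the removal of the boot CPU from a "nohz_full=" mask."""
    return [cpu for cpu in parse_cpu_list(spec) if cpu != boot_cpu]


print(parse_cpu_list("1,6-8"))    # [1, 6, 7, 8]
print(adaptive_tick_cpus("0-3"))  # [1, 2, 3]
```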
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) Finally, adaptive-ticks CPUs must have their RCU callbacks offloaded.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) This is covered in the "RCU Implications" section below.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) Normally, a CPU remains in adaptive-ticks mode as long as possible.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) In particular, transitioning to kernel mode does not automatically change
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) the mode.  Instead, the CPU will exit adaptive-ticks mode only if needed,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) for example, if that CPU enqueues an RCU callback.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) Just as with dyntick-idle mode, the benefits of adaptive-tick mode do
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) not come for free:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150) 1.	CONFIG_NO_HZ_FULL selects CONFIG_NO_HZ_COMMON, so you cannot run
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151) 	adaptive ticks without also running dyntick idle.  This dependency
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152) 	extends down into the implementation, so that all of the costs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153) 	of CONFIG_NO_HZ_IDLE are also incurred by CONFIG_NO_HZ_FULL.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) 2.	The user/kernel transitions are slightly more expensive due
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) 	to the need to inform kernel subsystems (such as RCU) about
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) 	the change in mode.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159) 3.	POSIX CPU timers prevent CPUs from entering adaptive-tick mode.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) 	Real-time applications needing to take actions based on CPU time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) 	consumption need to use other means of doing so.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) 4.	If there are more perf events pending than the hardware can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164) 	accommodate, they are normally round-robined so as to collect
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) 	all of them over time.  Adaptive-tick mode may prevent this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) 	round-robining from happening.  This will likely be fixed by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) 	preventing CPUs with large numbers of perf events pending from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) 	entering adaptive-tick mode.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) 5.	Scheduler statistics for adaptive-tick CPUs may be computed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171) 	slightly differently than those for non-adaptive-tick CPUs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) 	This might in turn perturb load-balancing of real-time tasks.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174) Although improvements are expected over time, adaptive ticks is quite
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) useful for many types of real-time and compute-intensive applications.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) However, the drawbacks listed above mean that adaptive ticks should not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177) (yet) be enabled by default.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180) RCU Implications
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181) ================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183) There are situations in which idle CPUs cannot be permitted to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184) enter either dyntick-idle mode or adaptive-tick mode, the most
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185) common being when that CPU has RCU callbacks pending.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187) The CONFIG_RCU_FAST_NO_HZ=y Kconfig option may be used to cause such CPUs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188) to enter dyntick-idle mode or adaptive-tick mode anyway.  In this case,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189) a timer will awaken these CPUs every four jiffies in order to ensure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190) that the RCU callbacks are processed in a timely fashion.
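
How long four jiffies lasts depends on the kernel's CONFIG_HZ setting,
which this document does not specify.  The following sketch converts
jiffies to milliseconds for a few commonly used HZ values, chosen here
only as examples.

```python
def jiffies_to_ms(jiffies, hz):
    """Convert a jiffy count to milliseconds for a given tick rate (HZ)."""
    return jiffies * 1000.0 / hz


for hz in (100, 250, 1000):
    # Wakeup interval for the four-jiffy timer at each tick rate.
    print(hz, jiffies_to_ms(4, hz))
```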
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192) Another approach is to offload RCU callback processing to "rcuo" kthreads
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193) using the CONFIG_RCU_NOCB_CPU=y Kconfig option.  The specific CPUs to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194) offload may be selected using the "rcu_nocbs=" kernel boot parameter,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195) which takes a comma-separated list of CPUs and CPU ranges, for example,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196) "1,3-5" selects CPUs 1, 3, 4, and 5.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198) The offloaded CPUs will never queue RCU callbacks, and therefore RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199) never prevents offloaded CPUs from entering either dyntick-idle mode
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200) or adaptive-tick mode.  That said, note that it is up to userspace to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201) pin the "rcuo" kthreads to specific CPUs if desired.  Otherwise, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202) scheduler will decide where to run them, which might or might not be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203) where you want them to run.
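
One way that userspace might pin the "rcuo" kthreads is sketched below in
Python.  It assumes that these kthreads can be recognized by a command
name beginning with "rcuo" in /proc/<pid>/comm, and that the caller has
sufficient privileges to change their affinity.  The function names are
hypothetical, and the choice of housekeeping CPUs is left to the caller.

```python
import os


def find_rcuo_pids(proc_root="/proc"):
    """Return the PIDs of tasks whose command name starts with "rcuo"."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                name = f.read().strip()
        except OSError:
            continue  # the task exited while we were scanning
        if name.startswith("rcuo"):
            pids.append(int(entry))
    return sorted(pids)


def pin_rcuo_kthreads(cpus, proc_root="/proc"):
    """Restrict every rcuo kthread to the given set of CPUs."""
    for pid in find_rcuo_pids(proc_root):
        os.sched_setaffinity(pid, cpus)
```

For example, pin_rcuo_kthreads({0}) would confine the kthreads to CPU 0,
leaving the remaining CPUs free of RCU callback processing.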
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 206) Testing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 207) =======
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 208) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 209) So you enable all the OS-jitter features described in this document,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 210) but do not see any change in your workload's behavior.  Is this because
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 211) your workload isn't affected that much by OS jitter, or is it because
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 212) something else is in the way?  This section helps answer this question
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 213) by providing a simple OS-jitter test suite, which is available on branch
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 214) master of the following git archive:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 215) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 216) git://git.kernel.org/pub/scm/linux/kernel/git/frederic/dynticks-testing.git
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 217) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 218) Clone this archive and follow the instructions in the README file.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 219) This test procedure will produce a trace that will allow you to evaluate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 220) whether or not you have succeeded in removing OS jitter from your system.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 221) If this trace shows that you have removed OS jitter as much as is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 222) possible, then you can conclude that your workload is not all that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 223) sensitive to OS jitter.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 224) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 225) Note: this test requires that your system have at least two CPUs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 226) We do not currently have a good way to remove OS jitter from single-CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 227) systems.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 228) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 229) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 230) Known Issues
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 231) ============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 232) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 233) *	Dyntick-idle slows transitions to and from idle slightly.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 234) 	In practice, this has not been a problem except for the most
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 235) 	aggressive real-time workloads, which have the option of disabling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 236) 	dyntick-idle mode, an option that most of them take.  However,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 237) 	some workloads will no doubt want to use adaptive ticks to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 238) 	eliminate scheduling-clock interrupt latencies.  Here are some
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 239) 	options for these workloads:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 240) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 241) 	a.	Use PM QoS from userspace to inform the kernel of your
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 242) 		latency requirements (preferred).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 243) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 244) 	b.	On x86 systems, use the "idle=mwait" boot parameter.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 245) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 246) 	c.	On x86 systems, use the "intel_idle.max_cstate=" boot parameter
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 247) 		to limit the maximum C-state depth.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 248) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 249) 	d.	On x86 systems, use the "idle=poll" boot parameter.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 250) 		However, please note that use of this parameter can cause
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 251) 		your CPU to overheat, which may cause thermal throttling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 252) 		to degrade your latencies -- and that this degradation can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 253) 		be even worse than that of dyntick-idle.  Furthermore,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 254) 		this parameter effectively disables Turbo Mode on Intel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 255) 		CPUs, which can significantly reduce maximum performance.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 256) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 257) *	Adaptive-ticks slows user/kernel transitions slightly.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 258) 	This is not expected to be a problem for computationally intensive
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 259) 	workloads, which have few such transitions.  Careful benchmarking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 260) 	will be required to determine whether or not other workloads
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 261) 	are significantly affected by this effect.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 262) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 263) *	Adaptive-ticks does not do anything unless there is only one
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 264) 	runnable task for a given CPU, even though there are a number
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 265) 	of other situations where the scheduling-clock tick is not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 266) 	needed.  To give but one example, consider a CPU that has one
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 267) 	runnable high-priority SCHED_FIFO task and an arbitrary number
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 268) 	of low-priority SCHED_OTHER tasks.  In this case, the CPU is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 269) 	required to run the SCHED_FIFO task until it either blocks or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 270) 	some other higher-priority task awakens on (or is assigned to)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 271) 	this CPU, so there is no point in sending a scheduling-clock
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 272) 	interrupt to this CPU.	However, the current implementation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 273) 	nevertheless sends scheduling-clock interrupts to CPUs having a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 274) 	single runnable SCHED_FIFO task and multiple runnable SCHED_OTHER
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 275) 	tasks, even though these interrupts are unnecessary.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 276) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 277) 	And even when there are multiple runnable tasks on a given CPU,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 278) 	there is little point in interrupting that CPU until the current
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 279) 	running task's timeslice expires, which is almost always way
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 280) 	longer than the time of the next scheduling-clock interrupt.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 281) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 282) 	Better handling of these sorts of situations is future work.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 283) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 284) *	A reboot is required to reconfigure both adaptive ticks and RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 285) 	callback offloading.  Runtime reconfiguration could be provided
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 286) 	if needed.  However, due to the complexity of reconfiguring RCU at
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 287) 	runtime, there would need to be an earthshakingly good reason.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 288) 	Especially given that you have the straightforward option of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 289) 	simply offloading RCU callbacks from all CPUs and pinning them
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 290) 	where you want them whenever you want them pinned.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 291) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 292) *	Additional configuration is required to deal with other sources
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 293) 	of OS jitter, including interrupts and system-utility tasks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 294) 	and processes.  This configuration normally involves binding
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 295) 	interrupts and tasks to particular CPUs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 296) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 297) *	Some sources of OS jitter can currently be eliminated only by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 298) 	constraining the workload.  For example, the only way to eliminate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 299) 	OS jitter due to global TLB shootdowns is to avoid the unmapping
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 300) 	operations (such as kernel module unload operations) that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 301) 	result in these shootdowns.  For another example, page faults
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 302) 	and TLB misses can be reduced (and in some cases eliminated) by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 303) 	using huge pages and by constraining the amount of memory used
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 304) 	by the application.  Pre-faulting the working set can also be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 305) 	helpful, especially when combined with the mlock() and mlockall()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 306) 	system calls.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 307) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 308) *	Unless all CPUs are idle, at least one CPU must keep the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 309) 	scheduling-clock interrupt going in order to support accurate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 310) 	timekeeping.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 311) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 312) *	If there might potentially be some adaptive-ticks CPUs, there
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 313) 	will be at least one CPU keeping the scheduling-clock interrupt
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 314) 	going, even if all CPUs are otherwise idle.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 315) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 316) 	Better handling of this situation is ongoing work.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 317) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 318) *	Some process-handling operations still require the occasional
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 319) 	scheduling-clock tick.	These operations include calculating CPU
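
Returning to option (a) under the first known issue, the following Python
sketch shows one way a userspace process might register a CPU latency
request through PM QoS.  The /dev/cpu_dma_latency device node is not
described in this document; the sketch assumes the usual PM QoS userspace
interface, in which a process writes a 32-bit latency value in microseconds
and the request remains in force for as long as the file is held open.
The function name is hypothetical.

```python
import os
import struct


def hold_cpu_latency(max_latency_us, path="/dev/cpu_dma_latency"):
    """Request a CPU wakeup-latency bound; return the open file descriptor.

    The caller must keep the descriptor open for as long as the request
    should apply, and close it with os.close() to drop the request.
    """
    fd = os.open(path, os.O_WRONLY)
    # "=i" packs a 4-byte signed integer in native byte order.
    os.write(fd, struct.pack("=i", max_latency_us))
    return fd
```

A latency-sensitive application might call hold_cpu_latency(20) at startup
and close the returned descriptor on exit.  Writing to the device node
normally requires elevated privileges.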
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 320) 	load, maintaining sched average, computing CFS entity vruntime,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 321) 	computing avenrun, and carrying out load balancing.  They are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 322) 	currently accommodated by a scheduling-clock tick every second
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 323) 	or so.  Ongoing work will eliminate the need even for these
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 324) 	infrequent scheduling-clock ticks.