^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) ======================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2) NO_HZ: Reducing Scheduling-Clock Ticks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) ======================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6) This document describes Kconfig options and boot parameters that can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) reduce the number of scheduling-clock interrupts, thereby improving energy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8) efficiency and reducing OS jitter. Reducing OS jitter is important for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9) some types of computationally intensive high-performance computing (HPC)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10) applications and for real-time applications.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12) There are three main ways of managing scheduling-clock interrupts
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) (also known as "scheduling-clock ticks" or simply "ticks"):
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15) 1. Never omit scheduling-clock ticks (CONFIG_HZ_PERIODIC=y or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16) CONFIG_NO_HZ=n for older kernels). You normally will -not-
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17) want to choose this option.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) 2. Omit scheduling-clock ticks on idle CPUs (CONFIG_NO_HZ_IDLE=y or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20) CONFIG_NO_HZ=y for older kernels). This is the most common
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21) approach, and should be the default.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) 3. Omit scheduling-clock ticks on CPUs that are either idle or that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24) have only one runnable task (CONFIG_NO_HZ_FULL=y). Unless you
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25) are running realtime applications or certain types of HPC
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26) workloads, you will normally -not- want this option.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28) These three cases are described in the following three sections, followed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29) by a fourth section on RCU-specific considerations, a fifth section
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30) discussing testing, and a sixth and final section listing known issues.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) Never Omit Scheduling-Clock Ticks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34) =================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) Very old versions of Linux from the 1990s and the very early 2000s
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37) are incapable of omitting scheduling-clock ticks. It turns out that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) there are some situations where this old-school approach is still the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39) right approach, for example, in heavy workloads with lots of tasks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40) that use short bursts of CPU, where there are very frequent idle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41) periods, but where these idle periods are also quite short (tens or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42) hundreds of microseconds). For these types of workloads, scheduling-clock
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43) interrupts will normally be delivered anyway because there
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) will frequently be multiple runnable tasks per CPU. In these cases,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45) attempting to turn off the scheduling-clock interrupt will have no effect
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) other than increasing the overhead of switching to and from idle and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47) transitioning between user and kernel execution.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49) This mode of operation can be selected using CONFIG_HZ_PERIODIC=y (or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) CONFIG_NO_HZ=n for older kernels).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52) However, if you are instead running a light workload with long idle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53) periods, failing to omit scheduling-clock interrupts will result in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) excessive power consumption. This is especially bad on battery-powered
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) devices, where it results in extremely short battery lifetimes. If you
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56) are running light workloads, you should therefore read the following
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57) section.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59) In addition, if you are running either a real-time workload or an HPC
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60) workload with short iterations, the scheduling-clock interrupts can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61) degrade your application's performance. If this describes your workload,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) you should read the following two sections.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65) Omit Scheduling-Clock Ticks For Idle CPUs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66) =========================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68) If a CPU is idle, there is little point in sending it a scheduling-clock
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69) interrupt. After all, the primary purpose of a scheduling-clock interrupt
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70) is to force a busy CPU to shift its attention among multiple duties,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71) and an idle CPU has no duties to shift its attention among.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73) The CONFIG_NO_HZ_IDLE=y Kconfig option causes the kernel to avoid sending
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74) scheduling-clock interrupts to idle CPUs, which is critically important
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75) both to battery-powered devices and to highly virtualized mainframes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76) A battery-powered device running a CONFIG_HZ_PERIODIC=y kernel would
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77) drain its battery very quickly, easily 2-3 times as fast as would the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78) same device running a CONFIG_NO_HZ_IDLE=y kernel. A mainframe running
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79) 1,500 OS instances might find that half of its CPU time was consumed by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80) unnecessary scheduling-clock interrupts. In these situations, there
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81) is strong motivation to avoid sending scheduling-clock interrupts to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82) idle CPUs. That said, dyntick-idle mode is not free:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84) 1. It increases the number of instructions executed on the path
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85) to and from the idle loop.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87) 2. On many architectures, dyntick-idle mode also increases the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88) number of expensive clock-reprogramming operations.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90) Therefore, systems with aggressive real-time response constraints often
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91) run CONFIG_HZ_PERIODIC=y kernels (or CONFIG_NO_HZ=n for older kernels)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92) in order to avoid degrading from-idle transition latencies.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94) An idle CPU that is not receiving scheduling-clock interrupts is said to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95) be "dyntick-idle", "in dyntick-idle mode", "in nohz mode", or "running
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96) tickless". The remainder of this document will use "dyntick-idle mode".
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98) There is also a boot parameter "nohz=" that can be used to disable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99) dyntick-idle mode in CONFIG_NO_HZ_IDLE=y kernels by specifying "nohz=off".
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) By default, CONFIG_NO_HZ_IDLE=y kernels boot with "nohz=on", enabling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) dyntick-idle mode.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) Omit Scheduling-Clock Ticks For CPUs With Only One Runnable Task
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) ================================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) If a CPU has only one runnable task, there is little point in sending it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) a scheduling-clock interrupt because there is no other task to switch to.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) Note that omitting scheduling-clock ticks for CPUs with only one runnable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) task implies also omitting them for idle CPUs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to avoid
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) sending scheduling-clock interrupts to CPUs with a single runnable task,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) and such CPUs are said to be "adaptive-ticks CPUs". This is important
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) for applications with aggressive real-time response constraints because
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) it allows them to improve their worst-case response times by the maximum
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) duration of a scheduling-clock interrupt. It is also important for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) computationally intensive short-iteration workloads: If any CPU is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) delayed during a given iteration, all the other CPUs will be forced to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) wait idle while the delayed CPU finishes. Thus, the delay is multiplied
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) by one less than the number of CPUs. In these situations, there is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) again strong motivation to avoid sending scheduling-clock interrupts.
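
As a rough illustration of the multiplication described above, the
following Python sketch computes the CPU time lost when one CPU in a
barrier-synchronized iteration is delayed. The function name and the
example figures are hypothetical and are not measurements.

```python
def idle_time_lost(delay_us, num_cpus):
    """Total CPU time (microseconds) spent waiting when one CPU is delayed.

    Assumes every other CPU sits idle for the full duration of the delay.
    """
    if num_cpus < 1:
        raise ValueError("num_cpus must be at least 1")
    return delay_us * (num_cpus - 1)


# Hypothetical figures: a 10-microsecond delay on one of 64 CPUs.
print(idle_time_lost(10, 64))  # 630
```

On a single-CPU system the result is zero, since there is no other CPU
left waiting.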
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) By default, no CPU will be an adaptive-ticks CPU. The "nohz_full="
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) boot parameter specifies the adaptive-ticks CPUs. For example,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) "nohz_full=1,6-8" says that CPUs 1, 6, 7, and 8 are to be adaptive-ticks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) CPUs. Note that you are prohibited from marking all of the CPUs as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) adaptive-tick CPUs: At least one non-adaptive-tick CPU must remain
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) online to handle timekeeping tasks in order to ensure that system
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) calls like gettimeofday() return accurate values on adaptive-tick CPUs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) (This is not an issue for CONFIG_NO_HZ_IDLE=y because there are no running
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) user processes to observe slight drifts in clock rate.) Therefore, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) boot CPU is prohibited from entering adaptive-ticks mode. Specifying a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) "nohz_full=" mask that includes the boot CPU will result in a boot-time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) error message, and the boot CPU will be removed from the mask. Note that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) this means that your system must have at least two CPUs in order for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) CONFIG_NO_HZ_FULL=y to do anything for you.
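
The list syntax and the boot-CPU rule described above can be sketched
in Python as follows. This is only an illustration, not the kernel's
parser: it handles just the simple comma-and-range form shown in the
example, the function names are hypothetical, and treating CPU 0 as
the boot CPU is an assumption.

```python
def parse_cpu_list(spec):
    """Expand a list such as "1,6-8" into a sorted list of CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = (int(x) for x in part.split("-", 1))
            if first > last:
                raise ValueError(f"bad range: {part!r}")
            cpus.update(range(first, last + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def effective_nohz_full(spec, boot_cpu=0):
    """Return (adaptive-ticks CPUs, whether the boot CPU was dropped).

    boot_cpu=0 is an assumption made for this sketch.
    """
    cpus = parse_cpu_list(spec)
    dropped = boot_cpu in cpus
    return [c for c in cpus if c != boot_cpu], dropped


print(parse_cpu_list("1,6-8"))     # [1, 6, 7, 8]
print(effective_nohz_full("0-3"))  # ([1, 2, 3], True)
```

The second call mirrors the case in which the mask includes the boot
CPU: that CPU is removed and the caller is told that it was dropped.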
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) Finally, adaptive-ticks CPUs must have their RCU callbacks offloaded.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) This is covered in the "RCU Implications" section below.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) Normally, a CPU remains in adaptive-ticks mode as long as possible.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) In particular, transitioning to kernel mode does not automatically change
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) the mode. Instead, the CPU will exit adaptive-ticks mode only if needed,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) for example, if that CPU enqueues an RCU callback.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) Just as with dyntick-idle mode, the benefits of adaptive-tick mode do
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) not come for free:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150) 1. CONFIG_NO_HZ_FULL selects CONFIG_NO_HZ_COMMON, so you cannot run
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151) adaptive ticks without also running dyntick idle. This dependency
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152) extends down into the implementation, so that all of the costs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153) of CONFIG_NO_HZ_IDLE are also incurred by CONFIG_NO_HZ_FULL.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) 2. The user/kernel transitions are slightly more expensive due
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) to the need to inform kernel subsystems (such as RCU) about
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) the change in mode.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159) 3. POSIX CPU timers prevent CPUs from entering adaptive-tick mode.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) Real-time applications needing to take actions based on CPU time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) consumption need to use other means of doing so.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) 4. If there are more perf events pending than the hardware can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164) accommodate, they are normally round-robined so as to collect
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) all of them over time. Adaptive-tick mode may prevent this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) round-robining from happening. This will likely be fixed by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) preventing CPUs with large numbers of perf events pending from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) entering adaptive-tick mode.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) 5. Scheduler statistics for adaptive-tick CPUs may be computed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171) slightly differently than those for non-adaptive-tick CPUs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) This might in turn perturb load-balancing of real-time tasks.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174) Although improvements are expected over time, adaptive ticks is quite
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) useful for many types of real-time and compute-intensive applications.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) However, the drawbacks listed above mean that adaptive ticks should not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177) (yet) be enabled by default.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180) RCU Implications
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181) ================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183) There are situations in which idle CPUs cannot be permitted to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184) enter either dyntick-idle mode or adaptive-tick mode, the most
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185) common being when that CPU has RCU callbacks pending.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187) The CONFIG_RCU_FAST_NO_HZ=y Kconfig option may be used to cause such CPUs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188) to enter dyntick-idle mode or adaptive-tick mode anyway. In this case,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189) a timer will awaken these CPUs every four jiffies in order to ensure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190) that the RCU callbacks are processed in a timely fashion.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192) Another approach is to offload RCU callback processing to "rcuo" kthreads
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193) using the CONFIG_RCU_NOCB_CPU=y Kconfig option. The specific CPUs to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194) offload may be selected using the "rcu_nocbs=" kernel boot parameter,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195) which takes a comma-separated list of CPUs and CPU ranges, for example,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196) "1,3-5" selects CPUs 1, 3, 4, and 5.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198) The offloaded CPUs will never queue RCU callbacks, and therefore RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199) never prevents offloaded CPUs from entering either dyntick-idle mode
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200) or adaptive-tick mode. That said, note that it is up to userspace to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201) pin the "rcuo" kthreads to specific CPUs if desired. Otherwise, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202) scheduler will decide where to run them, which might or might not be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203) where you want them to run.
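
One possible way for userspace to pin the "rcuo" kthreads is sketched
below in Python. It assumes that these kthreads can be recognized by a
task name beginning with "rcuo" in /proc/<pid>/comm and that the caller
has sufficient privilege to change their affinity. The function names
are hypothetical.

```python
import os


def find_rcuo_kthreads(proc_root="/proc"):
    """Return {pid: name} for tasks whose name starts with "rcuo"."""
    found = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                name = f.read().strip()
        except OSError:
            continue  # the task exited (or is unreadable) mid-scan
        if name.startswith("rcuo"):
            found[int(entry)] = name
    return found


def pin_rcuo_kthreads(cpus, proc_root="/proc"):
    """Pin every rcuo kthread to the given set of CPUs; return what was pinned."""
    pinned = {}
    for pid, name in find_rcuo_kthreads(proc_root).items():
        os.sched_setaffinity(pid, cpus)  # normally requires root
        pinned[pid] = name
    return pinned
```

For example, calling pin_rcuo_kthreads({0}) as root would confine all
of the kthreads that were found to CPU 0, which is an arbitrary choice
of housekeeping CPU made for this illustration.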
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 206) Testing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 207) =======
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 208)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 209) So you enable all the OS-jitter features described in this document,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 210) but do not see any change in your workload's behavior. Is this because
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 211) your workload isn't affected that much by OS jitter, or is it because
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 212) something else is in the way? This section helps answer this question
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 213) by providing a simple OS-jitter test suite, which is available on branch
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 214) master of the following git archive:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 215)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 216) git://git.kernel.org/pub/scm/linux/kernel/git/frederic/dynticks-testing.git
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 217)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 218) Clone this archive and follow the instructions in the README file.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 219) This test procedure will produce a trace that will allow you to evaluate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 220) whether or not you have succeeded in removing OS jitter from your system.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 221) If this trace shows that you have removed OS jitter as much as is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 222) possible, then you can conclude that your workload is not all that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 223) sensitive to OS jitter.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 224)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 225) Note: this test requires that your system have at least two CPUs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 226) We do not currently have a good way to remove OS jitter from single-CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 227) systems.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 228)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 229)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 230) Known Issues
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 231) ============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 232)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 233) * Dyntick-idle slows transitions to and from idle slightly.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 234) In practice, this has not been a problem except for the most
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 235) aggressive real-time workloads, which have the option of disabling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 236) dyntick-idle mode, an option that most of them take. However,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 237) some workloads will no doubt want to use adaptive ticks to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 238) eliminate scheduling-clock interrupt latencies. Here are some
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 239) options for these workloads:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 240)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 241) a. Use PMQOS from userspace to inform the kernel of your
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 242) latency requirements (preferred).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 243)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 244) b. On x86 systems, use the "idle=mwait" boot parameter.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 245)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 246) c. On x86 systems, use the "intel_idle.max_cstate=" boot parameter
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 247) to limit the maximum C-state depth.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 248)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 249) d. On x86 systems, use the "idle=poll" boot parameter.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 250) However, please note that use of this parameter can cause
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 251) your CPU to overheat, which may cause thermal throttling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 252) to degrade your latencies -- and that this degradation can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 253) be even worse than that of dyntick-idle. Furthermore,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 254) this parameter effectively disables Turbo Mode on Intel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 255) CPUs, which can significantly reduce maximum performance.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 256)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 257) * Adaptive-ticks slows user/kernel transitions slightly.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 258) This is not expected to be a problem for computationally intensive
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 259) workloads, which have few such transitions. Careful benchmarking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 260) will be required to determine whether or not other workloads
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 261) are significantly affected by this effect.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 262)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 263) * Adaptive-ticks does not do anything unless there is only one
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 264) runnable task for a given CPU, even though there are a number
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 265) of other situations where the scheduling-clock tick is not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 266) needed. To give but one example, consider a CPU that has one
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 267) runnable high-priority SCHED_FIFO task and an arbitrary number
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 268) of low-priority SCHED_OTHER tasks. In this case, the CPU is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 269) required to run the SCHED_FIFO task until it either blocks or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 270) some other higher-priority task awakens on (or is assigned to)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 271) this CPU, so there is no point in sending a scheduling-clock
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 272) interrupt to this CPU. However, the current implementation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 273) nevertheless sends scheduling-clock interrupts to CPUs having a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 274) single runnable SCHED_FIFO task and multiple runnable SCHED_OTHER
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 275) tasks, even though these interrupts are unnecessary.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 276)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 277) And even when there are multiple runnable tasks on a given CPU,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 278) there is little point in interrupting that CPU until the current
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 279) running task's timeslice expires, which is almost always way
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 280) longer than the time of the next scheduling-clock interrupt.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 281)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 282) Better handling of these sorts of situations is future work.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 283)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 284) * A reboot is required to reconfigure both adaptive ticks and RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 285) callback offloading. Runtime reconfiguration could be provided
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 286) if needed; however, due to the complexity of reconfiguring RCU at
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 287) runtime, there would need to be an earthshakingly good reason.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 288) Especially given that you have the straightforward option of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 289) simply offloading RCU callbacks from all CPUs and pinning them
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 290) where you want them whenever you want them pinned.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 291)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 292) * Additional configuration is required to deal with other sources
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 293) of OS jitter, including interrupts and system-utility tasks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 294) and processes. This configuration normally involves binding
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 295) interrupts and tasks to particular CPUs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 296)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 297) * Some sources of OS jitter can currently be eliminated only by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 298) constraining the workload. For example, the only way to eliminate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 299) OS jitter due to global TLB shootdowns is to avoid the unmapping
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 300) operations (such as kernel module unload operations) that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 301) result in these shootdowns. For another example, page faults
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 302) and TLB misses can be reduced (and in some cases eliminated) by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 303) using huge pages and by constraining the amount of memory used
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 304) by the application. Pre-faulting the working set can also be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 305) helpful, especially when combined with the mlock() and mlockall()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 306) system calls.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 307)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 308) * Unless all CPUs are idle, at least one CPU must keep the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 309) scheduling-clock interrupt going in order to support accurate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 310) timekeeping.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 311)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 312) * If there might potentially be some adaptive-ticks CPUs, there
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 313) will be at least one CPU keeping the scheduling-clock interrupt
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 314) going, even if all CPUs are otherwise idle.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 315)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 316) Better handling of this situation is ongoing work.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 317)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 318) * Some process-handling operations still require the occasional
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 319) scheduling-clock tick. These operations include calculating CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 320) load, maintaining sched average, computing CFS entity vruntime,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 321) computing avenrun, and carrying out load balancing. They are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 322) currently accommodated by a scheduling-clock tick every second
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 323) or so. On-going work will eliminate the need even for these
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 324) infrequent scheduling-clock ticks.
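
The pre-faulting and mlockall() advice in the list above can be sketched
in Python as follows. This is a minimal illustration under stated
assumptions: the MCL_CURRENT and MCL_FUTURE values are those of the
generic Linux headers (some architectures use different values), the
function names are hypothetical, and a real-time application would more
likely do this from C.

```python
import ctypes
import ctypes.util
import mmap

# Assumed values from the generic Linux headers; some architectures differ.
MCL_CURRENT = 1
MCL_FUTURE = 2


def lock_all_memory():
    """Call mlockall(MCL_CURRENT | MCL_FUTURE); return True on success."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    return libc.mlockall(MCL_CURRENT | MCL_FUTURE) == 0


def prefault(buf):
    """Write one byte per page so that each page is faulted in.

    Returns the number of pages touched.
    """
    pages = 0
    for offset in range(0, len(buf), mmap.PAGESIZE):
        buf[offset] = 0
        pages += 1
    return pages


working_set = mmap.mmap(-1, 16 * mmap.PAGESIZE)  # anonymous mapping
print(prefault(working_set))  # 16
```

lock_all_memory() is deliberately not called here because it can fail
when the locked-memory resource limit is too low. An application would
typically call it once at startup, before pre-faulting its working set.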