Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   1) .. SPDX-License-Identifier: GPL-2.0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   2) .. include:: <isonum.txt>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   3) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   4) .. |struct cpuidle_state| replace:: :c:type:`struct cpuidle_state <cpuidle_state>`
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   5) .. |cpufreq| replace:: :doc:`CPU Performance Scaling <cpufreq>`
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   6) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   7) ========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   8) CPU Idle Time Management
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   9) ========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  10) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  11) :Copyright: |copy| 2018 Intel Corporation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  12) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  13) :Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  14) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  15) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  16) Concepts
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  17) ========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  18) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  19) Modern processors are generally able to enter states in which the execution of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  20) a program is suspended and instructions belonging to it are not fetched from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  21) memory or executed.  Those states are the *idle* states of the processor.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  22) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  23) Since part of the processor hardware is not used in idle states, entering them
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  24) generally allows power drawn by the processor to be reduced and, in consequence,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  25) it is an opportunity to save energy.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  26) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  27) CPU idle time management is an energy-efficiency feature concerned with using
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  28) the idle states of processors for this purpose.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  29) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  30) Logical CPUs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  31) ------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  32) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  33) CPU idle time management operates on CPUs as seen by the *CPU scheduler* (that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  34) is the part of the kernel responsible for the distribution of computational
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  35) work in the system).  In its view, CPUs are *logical* units.  That is, they need
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  36) not be separate physical entities and may just be interfaces appearing to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  37) software as individual single-core processors.  In other words, a CPU is an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  38) entity which appears to be fetching instructions that belong to one sequence
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  39) (program) from memory and executing them, but it need not work this way
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  40) physically.  Generally, three different cases can be considered here.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  41) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  42) First, if the whole processor can only follow one sequence of instructions (one
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  43) program) at a time, it is a CPU.  In that case, if the hardware is asked to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  44) enter an idle state, that applies to the processor as a whole.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  45) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  46) Second, if the processor is multi-core, each core in it is able to follow at
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  47) least one program at a time.  The cores need not be entirely independent of each
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  48) other (for example, they may share caches), but still most of the time they
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  49) work physically in parallel with each other, so if each of them executes only
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  50) one program, those programs run mostly independently of each other at the same
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  51) time.  The entire cores are CPUs in that case and if the hardware is asked to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  52) enter an idle state, that applies to the core that asked for it in the first
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  53) place, but it also may apply to a larger unit (say a "package" or a "cluster")
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  54) that the core belongs to (in fact, it may apply to an entire hierarchy of larger
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  55) units containing the core).  Namely, if all of the cores in the larger unit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  56) except for one have been put into idle states at the "core level" and the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  57) remaining core asks the processor to enter an idle state, that may trigger it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  58) to put the whole larger unit into an idle state which also will affect the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  59) other cores in that unit.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  60) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  61) Finally, each core in a multi-core processor may be able to follow more than one
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  62) program in the same time frame (that is, each core may be able to fetch
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  63) instructions from multiple locations in memory and execute them in the same time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  64) frame, but not necessarily entirely in parallel with each other).  In that case
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  65) the cores present themselves to software as "bundles" each consisting of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  66) multiple individual single-core "processors", referred to as *hardware threads*
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  67) (or hyper-threads specifically on Intel hardware), that each can follow one
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  68) sequence of instructions.  Then, the hardware threads are CPUs from the CPU idle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  69) time management perspective and if the processor is asked to enter an idle state
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  70) by one of them, the hardware thread (or CPU) that asked for it is stopped, but
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  71) nothing more happens, unless all of the other hardware threads within the same
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  72) core also have asked the processor to enter an idle state.  In that situation,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  73) the core may be put into an idle state individually or a larger unit containing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  74) it may be put into an idle state as a whole (if the other cores within the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  75) larger unit are in idle states already).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  76) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  77) Idle CPUs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  78) ---------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  79) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  80) Logical CPUs, simply referred to as "CPUs" in what follows, are regarded as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  81) *idle* by the Linux kernel when there are no tasks to run on them except for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  82) special "idle" task.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  83) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  84) Tasks are the CPU scheduler's representation of work.  Each task consists of a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  85) sequence of instructions to execute, or code, data to be manipulated while
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  86) running that code, and some context information that needs to be loaded into the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  87) processor every time the task's code is run by a CPU.  The CPU scheduler
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  88) distributes work by assigning tasks to run to the CPUs present in the system.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  89) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  90) Tasks can be in various states.  In particular, they are *runnable* if there are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  91) no specific conditions preventing their code from being run by a CPU as long as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  92) there is a CPU available for that (for example, they are not waiting for any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  93) events to occur or similar).  When a task becomes runnable, the CPU scheduler
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  94) assigns it to one of the available CPUs to run and if there are no more runnable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  95) tasks assigned to it, the CPU will load the given task's context and run its
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  96) code (from the instruction following the last one executed so far, possibly by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  97) another CPU).  [If there are multiple runnable tasks assigned to one CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  98) simultaneously, they will be subject to prioritization and time sharing in order
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  99) to allow them to make some progress over time.]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) The special "idle" task becomes runnable if there are no other runnable tasks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) assigned to the given CPU and the CPU is then regarded as idle.  In other words,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) in Linux idle CPUs run the code of the "idle" task called *the idle loop*.  That
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) code may cause the processor to be put into one of its idle states, if they are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) supported, in order to save energy, but if the processor does not support any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) idle states, or there is not enough time to spend in an idle state before the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) next wakeup event, or there are strict latency constraints preventing any of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) available idle states from being used, the CPU will simply execute more or less
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) useless instructions in a loop until it is assigned a new task to run.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) .. _idle-loop:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) The Idle Loop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) =============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) The idle loop code takes two major steps in every iteration of it.  First, it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) calls into a code module referred to as the *governor* that belongs to the CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) idle time management subsystem called ``CPUIdle`` to select an idle state for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) the CPU to ask the hardware to enter.  Second, it invokes another code module
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) from the ``CPUIdle`` subsystem, called the *driver*, to actually ask the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) processor hardware to enter the idle state selected by the governor.
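
The two steps above can be pictured with a minimal sketch.  It is not kernel
code: the class names, the ``select()`` and ``enter()`` methods and the state
labels are all hypothetical and only illustrate the split of responsibilities
between a governor and a driver.

```python
class ShallowestGovernor:
    """Hypothetical governor that always picks the first (shallowest) state."""

    def select(self, states):
        # A real governor would weigh timers, history and latency limits here.
        return 0


class RecordingDriver:
    """Hypothetical driver that records requests instead of touching hardware."""

    def __init__(self):
        self.entered = []

    def enter(self, states, index):
        # A real driver would ask the processor to enter states[index].
        self.entered.append(states[index])


def idle_loop_iteration(governor, driver, states):
    """One iteration of the idle loop: select a state, then request it."""
    index = governor.select(states)   # step 1: the governor chooses
    driver.enter(states, index)       # step 2: the driver asks the hardware
    return index
```

Keeping the governor unaware of how ``enter()`` works is what lets the same
selection policy be paired with any driver.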
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) The role of the governor is to find an idle state most suitable for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) conditions at hand.  For this purpose, idle states that the hardware can be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) asked to enter by logical CPUs are represented in an abstract way independent of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) the platform or the processor architecture and organized in a one-dimensional
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) (linear) array.  That array has to be prepared and supplied by the ``CPUIdle``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) driver matching the platform the kernel is running on at initialization
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) time.  This allows ``CPUIdle`` governors to be independent of the underlying
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) hardware and to work with any platforms that the Linux kernel can run on.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) Each idle state present in that array is characterized by two parameters to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) taken into account by the governor, the *target residency* and the (worst-case)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) *exit latency*.  The target residency is the minimum time the hardware must
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) spend in the given state, including the time needed to enter it (which may be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) substantial), in order to save more energy than it would save by entering one of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) the shallower idle states instead.  [The "depth" of an idle state roughly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) corresponds to the power drawn by the processor in that state.]  The exit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) latency, in turn, is the maximum time it will take a CPU asking the processor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) hardware to enter an idle state to start executing the first instruction after a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) wakeup from that state.  Note that in general the exit latency also must cover
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) the time needed to enter the given state in case the wakeup occurs when the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) hardware is entering it and it must be entered completely to be exited in an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) ordered manner.
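
As a rough illustration of how the two parameters can drive a selection, the
sketch below walks a linear array of idle states and keeps the deepest one
whose target residency fits the expected idle time and whose exit latency
fits a latency limit.  The state names, the numbers and the assumption that
both parameters grow with depth are invented for the example; actual
governors are more elaborate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdleState:
    name: str
    target_residency_us: int   # minimum worthwhile time in the state
    exit_latency_us: int       # worst-case time to resume execution


# Hypothetical table, ordered from the shallowest to the deepest state.
STATES = [
    IdleState("poll", 0, 0),
    IdleState("shallow", 20, 2),
    IdleState("medium", 200, 50),
    IdleState("deep", 2000, 300),
]


def select_state(states, predicted_idle_us, latency_limit_us):
    """Return the index of the deepest state that satisfies both limits.

    Assumes residency and latency do not decrease with depth, so the scan
    can stop at the first state that violates either limit.
    """
    chosen = 0
    for index, state in enumerate(states):
        if state.target_residency_us > predicted_idle_us:
            break   # would not stay idle long enough to pay off
        if state.exit_latency_us > latency_limit_us:
            break   # would take too long to wake up
        chosen = index
    return chosen
```

For instance, with 500 us of expected idle time and a 100 us latency limit
the sketch settles on the "medium" entry, because "deep" needs 2000 us to
pay off.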
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) There are two types of information that can influence the governor's decisions.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) First of all, the governor knows the time until the closest timer event.  That
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) time is known exactly, because the kernel programs timers and it knows exactly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150) when they will trigger, and it is the maximum time the hardware that the given
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151) CPU depends on can spend in an idle state, including the time necessary to enter
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152) and exit it.  However, the CPU may be woken up by a non-timer event at any time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153) (in particular, before the closest timer triggers) and it generally is not known
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) when that may happen.  The governor can only see how much time the CPU actually
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) was idle after it has been woken up (that time will be referred to as the *idle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) duration* from now on) and it can use that information somehow along with the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) time until the closest timer to estimate the idle duration in future.  How the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158) governor uses that information depends on what algorithm is implemented by it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159) and that is the primary reason for having more than one governor in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) ``CPUIdle`` subsystem.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162) There are four ``CPUIdle`` governors available, ``menu``, `TEO <teo-gov_>`_,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) ``ladder`` and ``haltpoll``.  Which of them is used by default depends on the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164) configuration of the kernel and in particular on whether or not the scheduler
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) tick can be `stopped by the idle loop <idle-cpus-and-tick_>`_.  Available
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) governors can be read from the :file:`available_governors` file, and the governor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) can be changed at runtime.  The name of the ``CPUIdle`` governor currently
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) used by the kernel can be read from the :file:`current_governor_ro` or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) :file:`current_governor` file under :file:`/sys/devices/system/cpu/cpuidle/`
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) in ``sysfs``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) Which ``CPUIdle`` driver is used, on the other hand, usually depends on the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) platform the kernel is running on, but there are platforms with more than one
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174) matching driver.  For example, there are two drivers that can work with the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) majority of Intel platforms, ``intel_idle`` and ``acpi_idle``, one with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) hardcoded idle states information and the other able to read that information
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177) from the system's ACPI tables, respectively.  Still, even in those cases, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178) driver chosen at the system initialization time cannot be replaced later, so the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179) decision on which one of them to use has to be made early (on Intel platforms
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180) the ``acpi_idle`` driver will be used if ``intel_idle`` is disabled for some
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181) reason or if it does not recognize the processor).  The name of the ``CPUIdle``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182) driver currently used by the kernel can be read from the :file:`current_driver`
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183) file under :file:`/sys/devices/system/cpu/cpuidle/` in ``sysfs``.
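
The ``sysfs`` files mentioned above can be inspected with a few lines of
user-space code.  The following is only a sketch: the helper names are made
up for the example, and a file that is missing on a given kernel is reported
as ``None`` rather than treated as an error.

```python
from pathlib import Path

CPUIDLE_SYSFS = Path("/sys/devices/system/cpu/cpuidle")


def read_cpuidle_attr(name, base=CPUIDLE_SYSFS):
    """Return the stripped content of one cpuidle sysfs file, or None."""
    try:
        return (Path(base) / name).read_text().strip()
    except OSError:
        return None   # file absent or unreadable on this system


def cpuidle_summary(base=CPUIDLE_SYSFS):
    """Collect the driver and governor names exposed under `base`."""
    available = read_cpuidle_attr("available_governors", base)
    governor = (read_cpuidle_attr("current_governor", base)
                or read_cpuidle_attr("current_governor_ro", base))
    return {
        "driver": read_cpuidle_attr("current_driver", base),
        "governor": governor,
        "available_governors": available.split() if available else [],
    }


if __name__ == "__main__":
    print(cpuidle_summary())
```

The ``base`` parameter exists only so that the helpers can be pointed at a
different directory, for example when testing.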
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186) .. _idle-cpus-and-tick:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188) Idle CPUs and The Scheduler Tick
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189) ================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191) The scheduler tick is a timer that triggers periodically in order to implement
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192) the time sharing strategy of the CPU scheduler.  Of course, if there are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193) multiple runnable tasks assigned to one CPU at the same time, the only way to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194) allow them to make reasonable progress in a given time frame is to make them
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195) share the available CPU time.  Namely, in rough approximation, each task is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196) given a slice of the CPU time to run its code, subject to the scheduling class,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197) prioritization and so on and when that time slice is used up, the CPU should be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198) switched over to running (the code of) another task.  The currently running task
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199) may not want to give the CPU away voluntarily, however, and the scheduler tick
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200) is there to make the switch happen regardless.  That is not the only role of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201) tick, but it is the primary reason for using it.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203) The scheduler tick is problematic from the CPU idle time management perspective,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204) because it triggers periodically and relatively often (depending on the kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205) configuration, the length of the tick period is between 1 ms and 10 ms).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 206) Thus, if the tick is allowed to trigger on idle CPUs, it will not make sense
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 207) for them to ask the hardware to enter idle states with target residencies above
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 208) the tick period length.  Moreover, in that case the idle duration of any CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 209) will never exceed the tick period length and the energy used for entering and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 210) exiting idle states due to the tick wakeups on idle CPUs will be wasted.
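
The effect of a running tick on idle state selection can be shown with a
small calculation.  The sketch assumes the tick frequency is given in Hz (for
example 100, 250 or 1000, which correspond to tick periods from 10 ms down to
1 ms) and uses invented target residency values.

```python
def tick_period_us(hz):
    """Length of one tick period in microseconds for a tick rate in Hz."""
    return 1_000_000 // hz


def states_usable_with_tick(target_residencies_us, hz):
    """Keep only the target residencies that fit within one tick period."""
    period = tick_period_us(hz)
    return [r for r in target_residencies_us if r <= period]
```

With a 250 Hz tick the period is 4000 us, so of the hypothetical residencies
``[1, 20, 800, 5000, 20000]`` only the first three remain worth considering
while the tick keeps firing.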
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 211) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 212) Fortunately, it is not really necessary to allow the tick to trigger on idle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 213) CPUs, because (by definition) they have no tasks to run except for the special
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 214) "idle" one.  In other words, from the CPU scheduler perspective, the only user
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 215) of the CPU time on them is the idle loop.  Since the time of an idle CPU need
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 216) not be shared between multiple runnable tasks, the primary reason for using the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 217) tick goes away if the given CPU is idle.  Consequently, it is possible to stop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 218) the scheduler tick entirely on idle CPUs in principle, even though that may not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 219) always be worth the effort.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 220) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 221) Whether or not it makes sense to stop the scheduler tick in the idle loop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 222) depends on what is expected by the governor.  First, if there is another
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 223) (non-tick) timer due to trigger within the tick range, stopping the tick clearly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 224) would be a waste of time, even though the timer hardware may not need to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 225) reprogrammed in that case.  Second, if the governor is expecting a non-timer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 226) wakeup within the tick range, stopping the tick is not necessary and it may even
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 227) be harmful.  Namely, in that case the governor will select an idle state with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 228) the target residency within the time until the expected wakeup, so that state is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 229) going to be relatively shallow.  The governor really cannot select a deep idle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 230) state then, as that would contradict its own expectation of a wakeup in short
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 231) order.  Now, if the wakeup really occurs shortly, stopping the tick would be a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 232) waste of time and in this case the timer hardware would need to be reprogrammed,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 233) which is expensive.  On the other hand, if the tick is stopped and the wakeup
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 234) does not occur any time soon, the hardware may spend an indefinite amount of time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 235) in the shallow idle state selected by the governor, which will be a waste of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 236) energy.  Hence, if the governor is expecting a wakeup of any kind within the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 237) tick range, it is better to allow the tick to trigger.  Otherwise, however, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 238) governor will select a relatively deep idle state, so the tick should be stopped
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 239) so that it does not wake up the CPU too early.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 240) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 241) In any case, the governor knows what it is expecting and the decision on whether
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 242) or not to stop the scheduler tick belongs to it.  Still, if the tick has been
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 243) stopped already (in one of the previous iterations of the loop), it is better
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 244) to leave it as is and the governor needs to take that into account.
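
The reasoning of the last two paragraphs can be condensed into a small
decision function.  This is a simplified sketch with hypothetical names, not
the logic of any particular governor: it stops the tick only when no wakeup
of any kind is expected within the tick range, and it leaves an already
stopped tick alone.

```python
def should_stop_tick(expected_wakeup_us, tick_period_us,
                     tick_already_stopped=False):
    """Decide whether the scheduler tick should be stopped before idling.

    expected_wakeup_us is the governor's estimate of the time until the
    next wakeup, timer or non-timer.
    """
    if tick_already_stopped:
        # Stopped in a previous iteration: better to leave it as is.
        return True
    # A wakeup expected within the tick range means a shallow state will
    # be selected anyway, so the tick is allowed to keep triggering.
    return expected_wakeup_us >= tick_period_us
```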
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 245) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 246) The kernel can be configured to disable stopping the scheduler tick in the idle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 247) loop altogether.  That can be done through the build-time configuration of it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 248) (by unsetting the ``CONFIG_NO_HZ_IDLE`` configuration option) or by passing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 249) ``nohz=off`` to it in the command line.  In both cases, as the stopping of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 250) scheduler tick is disabled, the governor's decisions regarding it are simply
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 251) ignored by the idle loop code and the tick is never stopped.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 252) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 253) The systems that run kernels configured to allow the scheduler tick to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 254) stopped on idle CPUs are referred to as *tickless* systems and they are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 255) generally regarded as more energy-efficient than the systems running kernels in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 256) which the tick cannot be stopped.  If the given system is tickless, it will use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 257) the ``menu`` governor by default and if it is not tickless, the default
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 258) ``CPUIdle`` governor on it will be ``ladder``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 259) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 260) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 261) .. _menu-gov:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 262) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 263) The ``menu`` Governor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 264) =====================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 265) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 266) The ``menu`` governor is the default ``CPUIdle`` governor for tickless systems.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 267) It is quite complex, but the basic principle of its design is straightforward.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 268) Namely, when invoked to select an idle state for a CPU (i.e. an idle state that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 269) the CPU will ask the processor hardware to enter), it attempts to predict the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 270) idle duration and uses the predicted value for idle state selection.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 271) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 272) It first obtains the time until the closest timer event with the assumption
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 273) that the scheduler tick will be stopped.  That time, referred to as the *sleep
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 274) length* in what follows, is the upper bound on the time before the next CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 275) wakeup.  It is used to determine the sleep length range, which in turn is needed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 276) to get the sleep length correction factor.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 277) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 278) The ``menu`` governor maintains two arrays of sleep length correction factors.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 279) One of them is used when tasks previously running on the given CPU are waiting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 280) for some I/O operations to complete and the other one is used when that is not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 281) the case.  Each array contains several correction factor values that correspond
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 282) to different sleep length ranges organized so that each range represented in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 283) array is approximately 10 times wider than the previous one.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 284) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 285) The correction factor for the given sleep length range (determined before
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 286) selecting the idle state for the CPU) is updated after the CPU has been woken
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 287) up and the closer the sleep length is to the observed idle duration, the closer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 288) to 1 the correction factor becomes (it must fall between 0 and 1 inclusive).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 289) The sleep length is multiplied by the correction factor for the range that it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 290) falls into to obtain the first approximation of the predicted idle duration.
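
A simplified model of the correction factors might look as follows.  It is a
sketch under several assumptions that are not taken from the governor's code:
six decimal ranges, floating-point factors starting at 1, and an update rule
in the form of a moving average with a made-up ``decay`` parameter.

```python
NUM_RANGES = 6   # assumed number of sleep length ranges


def range_index(sleep_length_us):
    """Map a sleep length to its range; each range is 10 times wider."""
    index = 0
    bound = 10
    while sleep_length_us >= bound and index < NUM_RANGES - 1:
        bound *= 10
        index += 1
    return index


class CorrectionFactors:
    def __init__(self, decay=8):
        self.decay = decay
        # Two arrays: row 0 without I/O waiters, row 1 with I/O waiters.
        self.factors = [[1.0] * NUM_RANGES for _ in range(2)]

    def predict(self, sleep_length_us, io_waiters):
        """First approximation of the idle duration."""
        row = self.factors[1 if io_waiters else 0]
        return sleep_length_us * row[range_index(sleep_length_us)]

    def update(self, sleep_length_us, observed_idle_us, io_waiters):
        """Move the factor toward observed/sleep, clamped to [0, 1]."""
        if sleep_length_us <= 0:
            return
        row = self.factors[1 if io_waiters else 0]
        i = range_index(sleep_length_us)
        ratio = min(max(observed_idle_us / sleep_length_us, 0.0), 1.0)
        row[i] += (ratio - row[i]) / self.decay
```

After one wakeup that arrived at half of a 1000 us sleep length, the factor
for that range drops from 1 to 0.9375 in this model, while the array used
with I/O waiters is left untouched.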
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 291) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 292) Next, the governor uses a simple pattern recognition algorithm to refine its
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 293) idle duration prediction.  Namely, it saves the last 8 observed idle duration
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 294) values and, when predicting the idle duration next time, it computes the average
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 295) and variance of them.  If the variance is small (smaller than 400 square
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 296) microseconds) or it is small relative to the average (the average is greater
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 297) than 6 times the standard deviation), the average is regarded as the "typical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 298) interval" value.  Otherwise, the longest of the saved observed idle duration
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 299) values is discarded and the computation is repeated for the remaining ones.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 300) Again, if the variance of them is small (in the above sense), the average is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 301) taken as the "typical interval" value and so on, until either the "typical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 302) interval" is determined or too many data points are disregarded, in which case
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 303) the "typical interval" is assumed to equal "infinity" (the maximum unsigned
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 304) integer value).  The "typical interval" computed this way is compared with the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 305) sleep length multiplied by the correction factor and the minimum of the two is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 306) taken as the predicted idle duration.
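
The pattern recognition step described above can be sketched as follows.  This
is only an illustration, not the kernel code: the function names, the
``min_kept`` cutoff standing in for "too many data points are disregarded" and
the use of floating point arithmetic are all assumptions made for the example.

.. code-block:: python

    import math

    UINT_MAX = 2**32 - 1  # stands in for "infinity" (maximum unsigned integer)


    def typical_interval(samples, min_kept=6, variance_limit=400):
        """Return the "typical interval" for the saved idle durations.

        min_kept is an assumed cutoff: the text only says that the search
        stops when too many data points have been disregarded.
        variance_limit is expressed in squared units of the samples.
        """
        kept = sorted(samples)
        while kept and len(kept) >= min_kept:
            avg = sum(kept) / len(kept)
            variance = sum((x - avg) ** 2 for x in kept) / len(kept)
            # "Small" variance: below the absolute limit, or the average is
            # greater than 6 times the standard deviation.
            if variance < variance_limit or avg > 6 * math.sqrt(variance):
                return avg
            kept.pop()  # discard the longest remaining value and retry
        return UINT_MAX


    def predict_idle_duration(sleep_length, correction_factor, samples):
        """Take the minimum of the corrected sleep length and the typical interval."""
        first_approximation = sleep_length * correction_factor
        return min(first_approximation, typical_interval(samples))

With eight identical samples the variance is zero, so their common value is the
"typical interval".  With samples that alternate between very short and very
long values no subset passes the test and the corrected sleep length is used
instead.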
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 307) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 308) Then, the governor computes an extra latency limit to help "interactive"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 309) workloads.  It uses the observation that if the exit latency of the selected
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 310) idle state is comparable with the predicted idle duration, the total time spent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 311) in that state probably will be very short and the amount of energy to save by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 312) entering it will be relatively small, so likely it is better to avoid the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 313) overhead related to entering that state and exiting it.  Thus selecting a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 314) shallower state is likely to be a better option then.   The first approximation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 315) of the extra latency limit is the predicted idle duration itself which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 316) additionally is divided by a value depending on the number of tasks that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 317) previously ran on the given CPU and are now waiting for I/O operations to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 318) complete.  The result of that division is compared with the latency limit coming
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 319) from the power management quality of service, or `PM QoS <cpu-pm-qos_>`_,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 320) framework and the minimum of the two is taken as the limit for the idle states'
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 321) exit latency.
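
As a rough sketch of that computation, the snippet below divides the predicted
idle duration by a value that grows with the number of I/O waiters and then
takes the minimum of the result and the PM QoS limit.  The function name and
the particular divisor formula (``1 + 10 * n``) are assumptions made for the
example; the text above only states that the divisor depends on the number of
tasks waiting for I/O.

.. code-block:: python

    def exit_latency_limit(predicted_idle, iowait_tasks, pm_qos_limit,
                           divisor_for=lambda n: 1 + 10 * n):
        """Return the limit on the idle states' exit latency.

        divisor_for is a hypothetical stand-in for the "value depending on
        the number of tasks" waiting for I/O; all times share one unit.
        """
        extra_limit = predicted_idle // divisor_for(iowait_tasks)
        return min(extra_limit, pm_qos_limit)

With no tasks waiting for I/O the divisor is 1 here, so the limit is simply the
smaller of the predicted idle duration and the PM QoS constraint.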
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 322) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 323) Now, the governor is ready to walk the list of idle states and choose one of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 324) them.  For this purpose, it compares the target residency of each state with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 325) the predicted idle duration and the exit latency of it with the computed latency
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 326) limit.  It selects the state with the target residency closest to the predicted
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 327) idle duration, but still below it, and exit latency that does not exceed the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 328) limit.
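
The walk over the idle states could look roughly like the sketch below.  The
``IdleState`` type and the ``select_state`` name are hypothetical.  The sketch
assumes that the states are ordered from the shallowest to the deepest, that a
target residency equal to the predicted idle duration is acceptable, and that
the shallowest state is used when no state qualifies.

.. code-block:: python

    from typing import NamedTuple


    class IdleState(NamedTuple):
        name: str
        target_residency: int  # same time unit as predicted_idle
        exit_latency: int      # same time unit as latency_limit


    def select_state(states, predicted_idle, latency_limit):
        """Return the index of the deepest state that fits both limits."""
        chosen = 0  # assumption: fall back to the shallowest state
        for index, state in enumerate(states):
            if state.target_residency > predicted_idle:
                continue  # the CPU is not expected to stay idle long enough
            if state.exit_latency > latency_limit:
                continue  # waking up from this state would take too long
            chosen = index
        return chosen

For example, with states whose target residencies are 1, 100 and 1000 and whose
exit latencies are 1, 50 and 500, a predicted idle duration of 5000 and a
latency limit of 100 lead to the second state, because the third one exceeds
the latency limit.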
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 329) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 330) In the final step the governor may still need to refine the idle state selection
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 331) if it has not decided to `stop the scheduler tick <idle-cpus-and-tick_>`_.  That
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 332) happens if the idle duration predicted by it is less than the tick period and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 333) the tick has not been stopped already (in a previous iteration of the idle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 334) loop).  Then, the sleep length used in the previous computations may not reflect
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 335) the real time until the closest timer event and if it really is greater than
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 336) that time, the governor may need to select a shallower state with a suitable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 337) target residency.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 338) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 339) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 340) .. _teo-gov:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 341) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 342) The Timer Events Oriented (TEO) Governor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 343) ========================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 344) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 345) The timer events oriented (TEO) governor is an alternative ``CPUIdle`` governor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 346) for tickless systems.  It follows the same basic strategy as the ``menu`` `one
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 347) <menu-gov_>`_: it always tries to find the deepest idle state suitable for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 348) given conditions.  However, it applies a different approach to that problem.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 349) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 350) First, it does not use sleep length correction factors, but instead it attempts
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 351) to correlate the observed idle duration values with the available idle states
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 352) and use that information to pick the idle state that is most likely to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 353) "match" the upcoming CPU idle interval.  Second, it does not take into account
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 354) the tasks that were running on the given CPU in the past and are now waiting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 355) for some I/O operations to complete (there is no guarantee that they will run on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 356) the same CPU when they become runnable again) and the pattern detection code in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 357) it avoids taking timer wakeups into account.  It also only uses idle duration
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 358) values less than the current time till the closest timer (with the scheduler
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 359) tick excluded) for that purpose.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 360) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 361) Like in the ``menu`` governor `case <menu-gov_>`_, the first step is to obtain
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 362) the *sleep length*, which is the time until the closest timer event with the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 363) assumption that the scheduler tick will be stopped (that also is the upper bound
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 364) on the time until the next CPU wakeup).  That value is then used to preselect an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 365) idle state on the basis of three metrics maintained for each idle state provided
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 366) by the ``CPUIdle`` driver: ``hits``, ``misses`` and ``early_hits``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 367) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 368) The ``hits`` and ``misses`` metrics measure the likelihood that a given idle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 369) state will "match" the observed (post-wakeup) idle duration if it "matches" the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 370) sleep length.  They both are subject to decay (after a CPU wakeup) every time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 371) the target residency of the idle state corresponding to them is less than or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 372) equal to the sleep length and the target residency of the next idle state is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 373) greater than the sleep length (that is, when the idle state corresponding to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 374) them "matches" the sleep length).  The ``hits`` metric is increased if the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 375) former condition is satisfied and the target residency of the given idle state
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 376) is less than or equal to the observed idle duration and the target residency of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 377) the next idle state is greater than the observed idle duration at the same time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 378) (that is, it is increased when the given idle state "matches" both the sleep
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 379) length and the observed idle duration).  In turn, the ``misses`` metric is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 380) increased when the given idle state "matches" the sleep length only and the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 381) observed idle duration is too short for its target residency.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 382) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 383) The ``early_hits`` metric measures the likelihood that a given idle state will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 384) "match" the observed (post-wakeup) idle duration if it does not "match" the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 385) sleep length.  It is subject to decay on every CPU wakeup and it is increased
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 386) when the idle state corresponding to it "matches" the observed (post-wakeup)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 387) idle duration and the target residency of the next idle state is less than or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 388) equal to the sleep length (i.e. the idle state "matching" the sleep length is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 389) deeper than the given one).
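
The bookkeeping for the three metrics can be sketched as below.  This is an
illustration of the rules given above and not the governor's code: the
``PULSE`` increment, the ``DECAY_SHIFT`` decay rate, the helper names and the
representation of the metrics as a list of dictionaries are assumptions.

.. code-block:: python

    PULSE = 1024     # assumed amount by which a metric is increased
    DECAY_SHIFT = 3  # assumed decay: a metric loses 1/8 of its value


    def matching_state(residencies, duration):
        """Index of the deepest state with target residency <= duration, or -1."""
        index = -1
        for i, residency in enumerate(residencies):
            if residency <= duration:
                index = i
        return index


    def update_metrics(metrics, residencies, sleep_length, observed):
        """Update hits, misses and early_hits after a CPU wakeup."""
        # early_hits decays for every state on every wakeup.
        for m in metrics:
            m["early_hits"] -= m["early_hits"] >> DECAY_SHIFT

        idx_timer = matching_state(residencies, sleep_length)
        idx_hit = matching_state(residencies, observed)
        if idx_timer < 0:
            return

        # hits and misses decay only for the state "matching" the sleep length.
        timer_state = metrics[idx_timer]
        timer_state["hits"] -= timer_state["hits"] >> DECAY_SHIFT
        timer_state["misses"] -= timer_state["misses"] >> DECAY_SHIFT

        if idx_hit == idx_timer:
            timer_state["hits"] += PULSE
        elif idx_hit < idx_timer:
            # The observed idle duration was too short for that state.
            timer_state["misses"] += PULSE
            if idx_hit >= 0:
                metrics[idx_hit]["early_hits"] += PULSE

A wakeup that happens long before the next timer thus counts as a miss for the
state "matching" the sleep length and as an early hit for the shallower state
"matching" the observed idle duration.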
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 390) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 391) The governor walks the list of idle states provided by the ``CPUIdle`` driver
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 392) and finds the last (deepest) one with the target residency less than or equal
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 393) to the sleep length.  Then, the ``hits`` and ``misses`` metrics of that idle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 394) state are compared with each other and it is preselected if the ``hits`` one is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 395) greater (which means that that idle state is likely to "match" the observed idle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 396) duration after CPU wakeup).  If the ``misses`` one is greater, the governor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 397) preselects the shallower idle state with the maximum ``early_hits`` metric
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 398) (or if there are multiple shallower idle states with equal ``early_hits``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 399) metric which also is the maximum, the shallowest of them will be preselected).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 400) [If there is a wakeup latency constraint coming from the `PM QoS framework
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 401) <cpu-pm-qos_>`_ which is hit before reaching the deepest idle state with the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 402) target residency within the sleep length, the deepest idle state with the exit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 403) latency within the constraint is preselected without consulting the ``hits``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 404) ``misses`` and ``early_hits`` metrics.]
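
A minimal sketch of this preselection step is shown below.  The function name
and the data layout are hypothetical, and treating equal ``hits`` and
``misses`` values like the "misses is greater" case is an assumption, as the
text above does not cover that case.

.. code-block:: python

    def preselect(metrics, residencies, exit_latencies, sleep_length,
                  latency_limit=None):
        """Return the index of the preselected idle state."""
        deepest = -1
        for i, residency in enumerate(residencies):
            if residency > sleep_length:
                break
            if latency_limit is not None and exit_latencies[i] > latency_limit:
                # The PM QoS constraint is hit first: take the deepest state
                # within the constraint without consulting the metrics.
                return max(deepest, 0)
            deepest = i
        if deepest < 0:
            return 0  # assumption: nothing fits, use the shallowest state

        if metrics[deepest]["hits"] > metrics[deepest]["misses"]:
            return deepest

        # Otherwise pick the shallower state with the maximum early_hits;
        # the strict comparison keeps the shallowest one on ties.
        best = None
        for i in range(deepest):
            if best is None or metrics[i]["early_hits"] > metrics[best]["early_hits"]:
                best = i
        return deepest if best is None else best
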
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 405) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 406) Next, the governor takes several idle duration values observed most recently
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 407) into consideration and if at least half of them are greater than or equal to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 408) the target residency of the preselected idle state, that idle state becomes the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 409) final candidate to ask for.  Otherwise, the average of the most recent idle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 410) duration values below the target residency of the preselected idle state is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 411) computed and the governor walks the idle states shallower than the preselected
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 412) one and finds the deepest of them with the target residency within that average.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 413) That idle state is then taken as the final candidate to ask for.
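
The refinement based on the most recent idle durations might be sketched like
this.  The function name is hypothetical, the number of recent values to keep
is left to the caller, and falling back to the shallowest state when no
shallower state fits the average is an assumption.

.. code-block:: python

    def refine_with_recent(preselected, residencies, recent):
        """Return the final candidate state index."""
        target = residencies[preselected]
        long_enough = sum(1 for d in recent if d >= target)
        if 2 * long_enough >= len(recent):
            return preselected  # at least half of the values fit the state

        # Average of the recent values below the preselected target residency.
        short = [d for d in recent if d < target]
        average = sum(short) / len(short)

        # Deepest shallower state with the target residency within the average.
        candidate = 0
        for i in range(preselected):
            if residencies[i] <= average:
                candidate = i
        return candidate
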
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 414) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 415) Still, at this point the governor may need to refine the idle state selection if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 416) it has not decided to `stop the scheduler tick <idle-cpus-and-tick_>`_.  That
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 417) generally happens if the target residency of the idle state selected so far is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 418) less than the tick period and the tick has not been stopped already (in a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 419) previous iteration of the idle loop).  Then, like in the ``menu`` governor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 420) `case <menu-gov_>`_, the sleep length used in the previous computations may not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 421) reflect the real time until the closest timer event and if it really is greater
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 422) than that time, a shallower state with a suitable target residency may need to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 423) be selected.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 424) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 425) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 426) .. _idle-states-representation:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 427) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 428) Representation of Idle States
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 429) =============================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 430) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 431) For the CPU idle time management purposes all of the physical idle states
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 432) supported by the processor have to be represented as a one-dimensional array of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 433) |struct cpuidle_state| objects each allowing an individual (logical) CPU to ask
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 434) the processor hardware to enter an idle state of certain properties.  If there
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 435) is a hierarchy of units in the processor, one |struct cpuidle_state| object can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 436) cover a combination of idle states supported by the units at different levels of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 437) the hierarchy.  In that case, the `target residency and exit latency parameters
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 438) of it <idle-loop_>`_ must reflect the properties of the idle state at the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 439) deepest level (i.e. the idle state of the unit containing all of the other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 440) units).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 441) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 442) For example, take a processor with two cores in a larger unit referred to as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 443) a "module" and suppose that asking the hardware to enter a specific idle state
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 444) (say "X") at the "core" level by one core will trigger the module to try to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 445) enter a specific idle state of its own (say "MX") if the other core is in idle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 446) state "X" already.  In other words, asking for idle state "X" at the "core"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 447) level gives the hardware a license to go as deep as to idle state "MX" at the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 448) "module" level, but there is no guarantee that this is going to happen (the core
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 449) asking for idle state "X" may just end up in that state by itself instead).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 450) Then, the target residency of the |struct cpuidle_state| object representing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 451) idle state "X" must reflect the minimum time to spend in idle state "MX" of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 452) the module (including the time needed to enter it), because that is the minimum
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 453) time the CPU needs to be idle to save any energy in case the hardware enters
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 454) that state.  Analogously, the exit latency parameter of that object must cover
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 455) the exit time of idle state "MX" of the module (and usually its entry time too),
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 456) because that is the maximum delay between a wakeup signal and the time the CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 457) will start to execute the first new instruction (assuming that both cores in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 458) module will always be ready to execute instructions as soon as the module
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 459) becomes operational as a whole).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 460) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 461) There are processors without direct coordination between different levels of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 462) hierarchy of units inside them, however.  In those cases asking for an idle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 463) state at the "core" level does not automatically affect the "module" level, for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 464) example, in any way and the ``CPUIdle`` driver is responsible for the entire
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 465) handling of the hierarchy.  Then, the definition of the idle state objects is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 466) entirely up to the driver, but still the physical properties of the idle state
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 467) that the processor hardware finally goes into must always follow the parameters
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 468) used by the governor for idle state selection (for instance, the actual exit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 469) latency of that idle state must not exceed the exit latency parameter of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 470) idle state object selected by the governor).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 471) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 472) In addition to the target residency and exit latency idle state parameters
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 473) discussed above, the objects representing idle states each contain a few other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 474) parameters describing the idle state and a pointer to the function to run in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 475) order to ask the hardware to enter that state.  Also, for each
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 476) |struct cpuidle_state| object, there is a corresponding
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 477) :c:type:`struct cpuidle_state_usage <cpuidle_state_usage>` one containing usage
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 478) statistics of the given idle state.  That information is exposed by the kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 479) via ``sysfs``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 480) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 481) For each CPU in the system, there is a :file:`/sys/devices/system/cpu/cpu<N>/cpuidle/`
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 482) directory in ``sysfs``, where the number ``<N>`` is assigned to the given
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 483) CPU at the initialization time.  That directory contains a set of subdirectories
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 484) called :file:`state0`, :file:`state1` and so on, up to the number of idle state
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 485) objects defined for the given CPU minus one.  Each of these directories
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 486) corresponds to one idle state object and the larger the number in its name, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 487) deeper the (effective) idle state represented by it.  Each of them contains
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 488) a number of files (attributes) representing the properties of the idle state
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 489) object corresponding to it, as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 490) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 491) ``above``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 492) 	Total number of times this idle state had been asked for, but the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 493) 	observed idle duration was certainly too short to match its target
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 494) 	residency.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 495) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 496) ``below``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 497) 	Total number of times this idle state had been asked for, but certainly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 498) 	a deeper idle state would have been a better match for the observed idle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 499) 	duration.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 500) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 501) ``desc``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 502) 	Description of the idle state.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 503) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 504) ``disable``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 505) 	Whether or not this idle state is disabled.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 506) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 507) ``default_status``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 508) 	The default status of this state, "enabled" or "disabled".
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 509) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 510) ``latency``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 511) 	Exit latency of the idle state in microseconds.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 512) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 513) ``name``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 514) 	Name of the idle state.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 515) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 516) ``power``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 517) 	Power drawn by hardware in this idle state in milliwatts (if specified,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 518) 	0 otherwise).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 519) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 520) ``residency``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 521) 	Target residency of the idle state in microseconds.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 522) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 523) ``time``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 524) 	Total time spent in this idle state by the given CPU (as measured by the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 525) 	kernel) in microseconds.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 526) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 527) ``usage``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 528) 	Total number of times the hardware has been asked by the given CPU to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 529) 	enter this idle state.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 530) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 531) ``rejected``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 532) 	Total number of times a request to enter this idle state on the given
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 533) 	CPU was rejected.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 534) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 535) The :file:`desc` and :file:`name` files both contain strings.  The difference
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 536) between them is that the name is expected to be more concise, while the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 537) description may be longer and it may contain white space or special characters.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 538) Apart from :file:`default_status`, the other files listed above contain integers.
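
For instance, the attributes of all idle states of one CPU could be collected
with a short script along the following lines.  The ``read_idle_states`` helper
is hypothetical.  It takes the path of a :file:`cpuidle` directory as an
argument and keeps any value that does not parse as an integer as a string.

.. code-block:: python

    import os


    def read_idle_states(cpuidle_dir):
        """Map "stateN" directory names to dictionaries of their attributes."""
        names = [
            entry for entry in os.listdir(cpuidle_dir)
            if entry.startswith("state") and entry[5:].isdigit()
        ]
        states = {}
        for name in sorted(names, key=lambda n: int(n[5:])):
            state_dir = os.path.join(cpuidle_dir, name)
            if not os.path.isdir(state_dir):
                continue
            attrs = {}
            for attr in os.listdir(state_dir):
                path = os.path.join(state_dir, attr)
                if not os.path.isfile(path):
                    continue
                try:
                    with open(path) as f:
                        text = f.read().strip()
                except OSError:
                    continue  # skip attributes that cannot be read
                try:
                    attrs[attr] = int(text)
                except ValueError:
                    attrs[attr] = text  # e.g. name and desc
            states[name] = attrs
        return states

Calling it with :file:`/sys/devices/system/cpu/cpu0/cpuidle` as the argument
returns one dictionary per idle state object defined for CPU 0.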
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 539) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 540) The :file:`disable` attribute is the only writeable one.  If it contains 1, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 541) given idle state is disabled for this particular CPU, which means that the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 542) governor will never select it for this particular CPU and the ``CPUIdle``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 543) driver will never ask the hardware to enter it for that CPU as a result.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 544) However, disabling an idle state for one CPU does not prevent it from being
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 545) asked for by the other CPUs, so it must be disabled for all of them in order to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 546) never be asked for by any of them.  [Note that, due to the way the ``ladder``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 547) governor is implemented, disabling an idle state prevents that governor from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 548) selecting any idle states deeper than the disabled one too.]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 549) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 550) If the :file:`disable` attribute contains 0, the given idle state is enabled for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 551) this particular CPU, but it still may be disabled for some or all of the other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 552) CPUs in the system at the same time.  Writing 1 to it causes the idle state to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 553) be disabled for this particular CPU and writing 0 to it allows the governor to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 554) take it into consideration for the given CPU and the driver to ask for it,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 555) unless that state was disabled globally in the driver (in which case it cannot
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 556) be used at all).
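
Because the :file:`disable` attribute is per CPU, disabling an idle state
system-wide means writing to it for every CPU.  The sketch below does that with
a hypothetical ``set_state_disabled`` helper.  It assumes that the CPU
directories are named :file:`cpu<N>` and that the caller has sufficient
privileges to write to ``sysfs``.

.. code-block:: python

    import glob
    import os


    def set_state_disabled(state_index, disabled,
                           cpu_root="/sys/devices/system/cpu"):
        """Write 1 or 0 to the disable attribute of one state on every CPU.

        Returns the list of files that were written.
        """
        pattern = os.path.join(cpu_root, "cpu[0-9]*", "cpuidle",
                               f"state{state_index}", "disable")
        paths = sorted(glob.glob(pattern))
        for path in paths:
            with open(path, "w") as f:
                f.write("1" if disabled else "0")
        return paths

For example, ``set_state_disabled(2, True)`` asks for :file:`state2` to be
disabled on all CPUs that expose it, and ``set_state_disabled(2, False)``
undoes that.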
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 557) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 558) The :file:`power` attribute is not defined very well, especially for idle state
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 559) objects representing combinations of idle states at different levels of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 560) hierarchy of units in the processor, and it generally is hard to obtain idle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 561) state power numbers for complex hardware, so :file:`power` often contains 0 (not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 562) available) and if it contains a nonzero number, that number may not be very
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 563) accurate and it should not be relied on for anything meaningful.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 564) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 565) The number in the :file:`time` file generally may be greater than the total time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 566) really spent by the given CPU in the given idle state, because it is measured by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 567) the kernel and it may not cover the cases in which the hardware refused to enter
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 568) this idle state and entered a shallower one instead (or did not even
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 569) enter any idle state at all).  The kernel can only measure the time span between
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 570) asking the hardware to enter an idle state and the subsequent wakeup of the CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 571) and it cannot say what really happened in the meantime at the hardware level.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 572) Moreover, if the idle state object in question represents a combination of idle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 573) states at different levels of the hierarchy of units in the processor,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 574) the kernel can never say how deep the hardware went down the hierarchy in any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 575) particular case.  For these reasons, the only reliable way to find out how
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 576) much time has been spent by the hardware in different idle states supported by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 577) it is to use idle state residency counters in the hardware, if available.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 578) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 579) Generally, an interrupt received when trying to enter an idle state causes the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 580) idle state entry request to be rejected, in which case the ``CPUIdle`` driver
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 581) may return an error code to indicate that this was the case. The :file:`usage`
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 582) and :file:`rejected` files report the number of times the given idle state
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 583) was entered successfully or rejected, respectively.
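
As an illustration of how these counters may be combined, the sketch below
derives a few ratios for one idle state.  The ``selection_quality`` name and
the returned keys are made up for the example.  It follows the reading given
above, in which :file:`usage` counts successful entries and :file:`rejected`
counts rejected ones, and it takes a dictionary of integer attribute values as
input.

.. code-block:: python

    def selection_quality(attrs):
        """Return ratios derived from usage, above, below and rejected."""
        usage = attrs.get("usage", 0)
        rejected = attrs.get("rejected", 0)
        if usage == 0:
            return None  # nothing to compare against yet
        return {
            # idle duration too short for this state's target residency
            "too_deep": attrs.get("above", 0) / usage,
            # a deeper state would have been a better match
            "too_shallow": attrs.get("below", 0) / usage,
            # share of entry requests that were rejected
            "rejected_share": rejected / (usage + rejected),
        }
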
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 584) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 585) .. _cpu-pm-qos:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 586) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 587) Power Management Quality of Service for CPUs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 588) ============================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 589) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 590) The power management quality of service (PM QoS) framework in the Linux kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 591) allows kernel code and user space processes to set constraints on various
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 592) energy-efficiency features of the kernel to prevent performance from dropping
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 593) below a required level.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 594) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 595) CPU idle time management can be affected by PM QoS in two ways, through the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 596) global CPU latency limit and through the resume latency constraints for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 597) individual CPUs.  Kernel code (e.g. device drivers) can set both of them with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 598) the help of special internal interfaces provided by the PM QoS framework.  User
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 599) space can modify the former by opening the :file:`cpu_dma_latency` special
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 600) device file under :file:`/dev/` and writing a binary value (interpreted as a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 601) signed 32-bit integer) to it.  In turn, the resume latency constraint for a CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 602) can be modified from user space by writing a string (representing a signed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 603) 32-bit integer) to the :file:`power/pm_qos_resume_latency_us` file under
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 604) :file:`/sys/devices/system/cpu/cpu<N>/` in ``sysfs``, where the CPU number
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 605) ``<N>`` is allocated at the system initialization time.  Negative values
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 606) will be rejected in both cases and, also in both cases, the written integer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 607) number will be interpreted as a requested PM QoS constraint in microseconds.
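
Both user space interfaces can be exercised with a few lines of code.  The
helper names below are hypothetical and the sketch assumes that the binary
value is expected in the native byte order.  The first helper returns the open
file descriptor instead of closing it, because that descriptor represents the
request made through :file:`cpu_dma_latency`.

.. code-block:: python

    import os
    import struct


    def request_cpu_latency_limit(limit_us, path="/dev/cpu_dma_latency"):
        """Write a signed 32-bit limit (microseconds) and return the open fd."""
        if limit_us < 0:
            raise ValueError("negative latency limits are rejected")
        fd = os.open(path, os.O_WRONLY)
        try:
            # "=i" packs a 4-byte signed integer in native byte order.
            os.write(fd, struct.pack("=i", limit_us))
        except OSError:
            os.close(fd)
            raise
        return fd


    def set_resume_latency(cpu, limit_us, cpu_root="/sys/devices/system/cpu"):
        """Write the limit as a decimal string to the per-CPU sysfs file."""
        if limit_us < 0:
            raise ValueError("negative latency limits are rejected")
        path = os.path.join(cpu_root, f"cpu{cpu}", "power",
                            "pm_qos_resume_latency_us")
        with open(path, "w") as f:
            f.write(str(limit_us))

Note the difference between the two interfaces: the device file takes four raw
bytes, whereas the ``sysfs`` attribute takes a number written as text.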

The requested value is not automatically applied as a new constraint, however,
as it may be less restrictive (greater in this particular case) than another
constraint previously requested by someone else.  For this reason, the PM QoS
framework maintains a list of requests that have been made so far for the
global CPU latency limit and for each individual CPU, aggregates them and
applies the effective (minimum in this particular case) value as the new
constraint.
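
The aggregation rule can be modeled in a few lines of Python.  This is only a
sketch of the "minimum wins" behavior described above, not the kernel's data
structure, and ``NO_CONSTRAINT_US`` is a hypothetical placeholder for the value
used when the list of requests is empty.

```python
from typing import Iterable

# Hypothetical stand-in for "no constraint"; the real default is kernel-defined.
NO_CONSTRAINT_US = 2**31 - 1


def effective_latency_limit(requests_us: Iterable[int]) -> int:
    """Aggregate latency requests: the smallest (most restrictive) one wins."""
    return min(requests_us, default=NO_CONSTRAINT_US)
```

With requests of 100, 20 and 50 microseconds the effective limit is 20, and
adding a further request of 200 leaves it unchanged.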

In fact, opening the :file:`cpu_dma_latency` special device file causes a new
PM QoS request to be created and added to a global priority list of CPU latency
limit requests and the file descriptor coming from the "open" operation
represents that request.  If that file descriptor is then used for writing, the
number written to it will be associated with the PM QoS request represented by
it as a new requested limit value.  Next, the priority list mechanism will be
used to determine the new effective value of the entire list of requests and
that effective value will be set as a new CPU latency limit.  Thus requesting a
new limit value will only change the real limit if the effective "list" value is
affected by it, which is the case if it is the minimum of the requested values
in the list.

The process holding a file descriptor obtained by opening the
:file:`cpu_dma_latency` special device file controls the PM QoS request
associated with that file descriptor, but it controls this particular PM QoS
request only.

Closing the :file:`cpu_dma_latency` special device file or, more precisely, the
file descriptor obtained while opening it, causes the PM QoS request associated
with that file descriptor to be removed from the global priority list of CPU
latency limit requests and destroyed.  If that happens, the priority list
mechanism will be used again, to determine the new effective value for the whole
list and that value will become the new limit.
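
Because the request lives exactly as long as the file descriptor, a natural way
to use the interface from user space is to tie it to a scope.  The context
manager below is a minimal sketch: its name and the ``path`` parameter (added so
that the code can be tried against an ordinary file) are assumptions, as is the
native byte order of the written value.  Opening the real device file usually
requires sufficient privileges.

```python
import os
import struct
from contextlib import contextmanager


@contextmanager
def cpu_latency_request(latency_us: int, path: str = "/dev/cpu_dma_latency"):
    """Hold a CPU latency limit request for the duration of a ``with`` block."""
    if latency_us < 0:
        raise ValueError("negative values are rejected")
    fd = os.open(path, os.O_WRONLY)  # opening creates the PM QoS request
    try:
        # Writing associates the requested limit with this request.
        os.write(fd, struct.pack("=i", latency_us))
        yield fd
    finally:
        os.close(fd)  # closing removes the request from the list
```

A latency-sensitive section could then be wrapped as
``with cpu_latency_request(20): ...``; the request disappears when the block
exits, even if an exception is raised inside it.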

In turn, for each CPU there is one resume latency PM QoS request associated with
the :file:`power/pm_qos_resume_latency_us` file under
:file:`/sys/devices/system/cpu/cpu<N>/` in ``sysfs`` and writing to it causes
this single PM QoS request to be updated regardless of which user space
process does that.  In other words, this PM QoS request is shared by the entire
user space, so access to the file associated with it needs to be arbitrated
to avoid confusion.  [Arguably, the only legitimate use of this mechanism in
practice is to pin a process to the CPU in question and let it use the
``sysfs`` interface to control the resume latency constraint for it.]  It is
still only a request, however.  It is an entry in a priority list used to
determine the effective value to be set as the resume latency constraint for the
CPU in question every time the list of requests is updated one way or another
(there may be other requests coming from kernel code in that list).
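
Updating the per-CPU request from user space amounts to writing a decimal string
to the attribute.  The function below is a minimal sketch; its name and the
``sysfs_root`` parameter (which lets the code run against a scratch directory)
are hypothetical.

```python
import os


def set_resume_latency_us(cpu: int, latency_us: int,
                          sysfs_root: str = "/sys/devices/system/cpu") -> str:
    """Write a resume latency request for one CPU and return the path used."""
    if latency_us < 0:
        raise ValueError("negative values are rejected")
    path = os.path.join(sysfs_root, f"cpu{cpu}", "power",
                        "pm_qos_resume_latency_us")
    # The request is shared by all of user space: the last writer wins.
    with open(path, "w") as attr:
        attr.write(str(latency_us))
    return path
```

A process following the bracketed advice above could first pin itself to the
CPU in question, for example with ``os.sched_setaffinity(0, {cpu})`` on Linux,
and then call this function for the same CPU.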

CPU idle time governors are expected to regard the minimum of the global
(effective) CPU latency limit and the effective resume latency constraint for
the given CPU as the upper limit for the exit latency of the idle states that
they are allowed to select for that CPU.  They should never select any idle
states with exit latency beyond that limit.
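
The rule that governors are expected to follow can be sketched as a simple
filter.  The function name, the state names and the exit latencies used here
are made up for illustration and do not describe any particular governor or
platform.

```python
from typing import List, Tuple


def allowed_states(states: List[Tuple[str, int]],
                   global_limit_us: int,
                   cpu_resume_limit_us: int) -> List[str]:
    """Names of idle states whose exit latency does not exceed the limit.

    ``states`` holds (name, exit latency in microseconds) pairs.
    """
    limit_us = min(global_limit_us, cpu_resume_limit_us)
    return [name for name, exit_latency_us in states
            if exit_latency_us <= limit_us]


# Hypothetical idle states, ordered from shallowest to deepest.
EXAMPLE_STATES = [("shallow", 1), ("medium", 50), ("deep", 200)]
```

For a global limit of 100 microseconds and a per-CPU resume latency constraint
of 60 microseconds, only ``shallow`` and ``medium`` remain eligible in this
example.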


Idle States Control Via Kernel Command Line
===========================================

In addition to the ``sysfs`` interface allowing individual idle states to be
`disabled for individual CPUs <idle-states-representation_>`_, there are kernel
command line parameters affecting CPU idle time management.

The ``cpuidle.off=1`` kernel command line option can be used to disable the
CPU idle time management entirely.  It does not prevent the idle loop from
running on idle CPUs, but it prevents the CPU idle time governors and drivers
from being invoked.  If it is added to the kernel command line, the idle loop
will ask the hardware to enter idle states on idle CPUs via the CPU architecture
support code that is expected to provide a default mechanism for this purpose.
That default mechanism usually is the least common denominator for all of the
processors implementing the architecture (i.e. CPU instruction set) in question,
however, so it is rather crude and not very energy-efficient.  For this reason,
it is not recommended for production use.

The ``cpuidle.governor=`` kernel command line switch specifies which ``CPUIdle``
governor to use.  It has to be appended with a string matching
the name of an available governor (e.g. ``cpuidle.governor=menu``) and that
governor will be used instead of the default one.  For example, it is possible
to force the ``menu`` governor to be used this way on systems that use the
``ladder`` governor by default.
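
To check which of these parameters a running system was booted with, the kernel
command line can be inspected from user space.  The parser below is a simplified
sketch with a hypothetical name: it splits on whitespace only and ignores
quoting.

```python
from typing import Dict


def cpuidle_cmdline_options(cmdline: str) -> Dict[str, str]:
    """Extract ``cpuidle.*`` parameters from a kernel command line string."""
    options = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key.startswith("cpuidle."):
            options[key] = value
    return options
```

On Linux the string to parse can be read from :file:`/proc/cmdline`, for example
``cpuidle_cmdline_options(open("/proc/cmdline").read())``.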

The other kernel command line parameters controlling CPU idle time management
described below are only relevant for the *x86* architecture and some of
them affect Intel processors only.

The *x86* architecture support code recognizes three kernel command line
options related to CPU idle time management: ``idle=poll``, ``idle=halt``,
and ``idle=nomwait``.  The first two of them disable the ``acpi_idle`` and
``intel_idle`` drivers altogether, which effectively causes the entire
``CPUIdle`` subsystem to be disabled and makes the idle loop invoke the
architecture support code to deal with idle CPUs.  How it does that depends on
which of the two parameters is added to the kernel command line.  In the
``idle=halt`` case, the architecture support code will use the ``HLT``
instruction of the CPUs (which, as a rule, suspends the execution of the program
and causes the hardware to attempt to enter the shallowest available idle state)
for this purpose, and if ``idle=poll`` is used, idle CPUs will execute a
more or less "lightweight" sequence of instructions in a tight loop.  [Note
that using ``idle=poll`` is somewhat drastic in many cases, as preventing idle
CPUs from saving almost any energy at all may not be the only effect of it.
For example, on Intel hardware it effectively prevents CPUs from using
P-states (see |cpufreq|) that require any number of CPUs in a package to be
idle, so it may well hurt single-thread performance as well as
energy-efficiency.  Thus using it for performance reasons may not be a good idea
at all.]

The ``idle=nomwait`` option disables the ``intel_idle`` driver and causes
``acpi_idle`` to be used (as long as all of the information needed by it is
there in the system's ACPI tables), but it is not allowed to use the
``MWAIT`` instruction of the CPUs to ask the hardware to enter idle states.
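
The effects of the three ``idle=`` options can be summarized as a small lookup
table.  The sketch below only restates the description above in code; the type
and function names are hypothetical, and only the first ``idle=`` token found on
the command line is considered.

```python
from typing import NamedTuple, Optional, Tuple


class IdleEffect(NamedTuple):
    disabled_drivers: Tuple[str, ...]
    idle_mechanism: str


X86_IDLE_OPTIONS = {
    "poll": IdleEffect(("acpi_idle", "intel_idle"), "tight polling loop"),
    "halt": IdleEffect(("acpi_idle", "intel_idle"), "HLT instruction"),
    "nomwait": IdleEffect(("intel_idle",), "acpi_idle without MWAIT"),
}


def x86_idle_effect(cmdline: str) -> Optional[IdleEffect]:
    """Return the documented effect of an ``idle=`` option, if one is present."""
    for token in cmdline.split():
        if token.startswith("idle="):
            return X86_IDLE_OPTIONS.get(token[len("idle="):])
    return None
```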

In addition to the architecture-level kernel command line options affecting CPU
idle time management, there are parameters affecting individual ``CPUIdle``
drivers that can be passed to them via the kernel command line.  Specifically,
the ``intel_idle.max_cstate=<n>`` and ``processor.max_cstate=<n>`` parameters,
where ``<n>`` is an idle state index also used in the name of the given
state's directory in ``sysfs`` (see
`Representation of Idle States <idle-states-representation_>`_), cause the
``intel_idle`` and ``acpi_idle`` drivers, respectively, to discard all of the
idle states deeper than idle state ``<n>``.  In that case, they will never ask
for any of those idle states or expose them to the governor.  [The behavior of
the two drivers is different for ``<n>`` equal to ``0``.  Adding
``intel_idle.max_cstate=0`` to the kernel command line disables the
``intel_idle`` driver and allows ``acpi_idle`` to be used, whereas
``processor.max_cstate=0`` is equivalent to ``processor.max_cstate=1``.
Also, the ``acpi_idle`` driver is part of the ``processor`` kernel module that
can be loaded separately and ``max_cstate=<n>`` can be passed to it as a module
parameter when it is loaded.]
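
The effect of ``max_cstate=<n>``, including the special case of ``<n>`` equal to
``0``, can be modeled as follows.  This is a sketch of the behavior described
above rather than driver code; the function name and the state names in the
usage note are hypothetical.

```python
from typing import List, Optional


def usable_states(driver: str, states: List[str],
                  max_cstate: int) -> Optional[List[str]]:
    """Idle states kept by a driver for a given ``max_cstate`` value.

    ``states[i]`` is the state with index ``i``.  ``None`` means that the
    driver itself is disabled.
    """
    if driver not in ("intel_idle", "acpi_idle"):
        raise ValueError(f"unknown driver: {driver}")
    if max_cstate < 0:
        raise ValueError("max_cstate must not be negative in this sketch")
    if max_cstate == 0:
        if driver == "intel_idle":
            return None  # driver disabled; acpi_idle may be used instead
        max_cstate = 1   # processor.max_cstate=0 behaves like =1
    # Discard every state deeper than state <n>.
    return states[:max_cstate + 1]
```

For a hypothetical state list ``["POLL", "C1", "C1E", "C6"]``, passing
``max_cstate=2`` keeps the first three states for either driver, whereas
``max_cstate=0`` yields ``None`` for ``intel_idle`` and the first two states for
``acpi_idle``.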