^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) ===========================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2) Clock sources, Clock events, sched_clock() and delay timers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) ===========================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) This document tries to briefly explain some basic kernel timekeeping
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6) abstractions. It partly pertains to the drivers usually found in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) drivers/clocksource in the kernel tree, but the code may be spread out
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8) across the kernel.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10) If you grep through the kernel source you will find a number of architecture-
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11) specific implementations of clock sources and clock events, as well as several
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12) architecture-specific overrides of the sched_clock() function and some
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) delay timers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15) To provide timekeeping for your platform, the clock source provides
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16) the basic timeline, whereas clock events fire interrupts at certain points
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17) on this timeline, providing facilities such as high-resolution timers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18) sched_clock() is used for scheduling and timestamping, and delay timers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) provide an accurate delay source using hardware counters.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22) Clock sources
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) -------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25) The purpose of the clock source is to provide a timeline for the system that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26) tells you where you are in time. For example issuing the command 'date' on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27) a Linux system will eventually read the clock source to determine exactly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28) what time it is.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30) Typically the clock source is a monotonic, atomic counter which will provide
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31) n bits which count from 0 to (2^n)-1 and then wrap around to 0 and start over.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32) It will ideally NEVER stop ticking as long as the system is running. It
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) may stop during system suspend.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35) The clock source shall have as high resolution as possible, and the frequency
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) shall be as stable and correct as possible as compared to a real-world wall
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37) clock. It should not move unpredictably back and forth in time or miss a few
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) cycles here and there.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40) It must be immune to the kind of effects that occur in hardware where, e.g.,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41) the counter register is read in two phases on the bus, the lowest 16 bits
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42) first and the higher 16 bits in a second bus cycle, with the counter bits
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43) potentially being updated in between, leading to the risk of very strange
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) values from the counter.
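
One common way to guard against such torn reads is to read the high half,
then the low half, then the high half again, and to retry if the high half
changed in between. The following is only a userspace sketch of that idea, not
code from any driver: struct split_counter and all function names are
hypothetical, and the simulated register advances by 'step' ticks on every
low-half access to mimic a counter that keeps running between bus cycles.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simulated 32-bit counter exposed as two 16-bit halves. */
struct split_counter {
	uint32_t value;	/* current counter value */
	uint32_t step;	/* ticks that elapse on every low-half access */
};

/* Read the low half; the counter keeps ticking after the access. */
static uint16_t read_lo16(struct split_counter *c)
{
	uint16_t lo = (uint16_t)(c->value & 0xffffu);

	c->value += c->step;
	return lo;
}

/* Read the high half without side effects. */
static uint16_t read_hi16(const struct split_counter *c)
{
	return (uint16_t)(c->value >> 16);
}

/* Naive two-phase read: may combine halves of two different values. */
static uint32_t read_counter32_naive(struct split_counter *c)
{
	uint16_t lo = read_lo16(c);
	uint16_t hi = read_hi16(c);

	return ((uint32_t)hi << 16) | lo;
}

/* High-low-high read: retry until the high half is stable around the low read. */
static uint32_t read_counter32(struct split_counter *c)
{
	uint16_t hi, lo, hi2;

	/* The loop only terminates if the high half can stay stable for a pass. */
	assert(c->step < 0x10000u);

	do {
		hi = read_hi16(c);
		lo = read_lo16(c);
		hi2 = read_hi16(c);
	} while (hi != hi2);

	return ((uint32_t)hi << 16) | lo;
}
```

With the counter sitting just below a 16-bit carry, the naive variate of the
read returns a value far in the future, whereas the retrying variant returns a
value the counter actually held.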
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) When the wall-clock accuracy of the clock source isn't satisfactory, there
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47) are various quirks and layers in the timekeeping code for e.g. synchronizing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48) the user-visible time to RTC clocks in the system or against networked time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49) servers using NTP, but all they do basically is update an offset against
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) the clock source, which provides the fundamental timeline for the system.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51) These measures do not affect the clock source per se; they only adapt the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52) system to its shortcomings.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) The clock source struct shall provide means to translate the provided counter
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) into a nanosecond value as an unsigned long long (unsigned 64 bit) number.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56) Since this operation may be invoked very often, doing this in a strict
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57) mathematical sense is not desirable: instead the number is taken as close as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58) possible to a nanosecond value using only the arithmetic operations
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59) multiply and shift, so in clocksource_cyc2ns() you find:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61) ns ~= (clocksource * mult) >> shift
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63) You will find a number of helper functions in the clock source code intended
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64) to aid in providing these mult and shift values, such as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65) clocksource_khz2mult(), clocksource_hz2mult() that help determine the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66) mult factor from a fixed shift, and clocksource_register_hz() and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67) clocksource_register_khz() which will help out assigning both shift and mult
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68) factors using the frequency of the clock source as the only input.
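
To make the mult and shift arithmetic concrete, here is a small userspace
sketch. The names hz2mult_sketch() and cyc2ns_sketch() are made up for this
example and only mimic the idea behind the kernel helpers; the rounding and
the limits asserted below are assumptions of the sketch.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Derive mult for a counter running at hz with a fixed shift, so that
 * (cycles * mult) >> shift approximates nanoseconds:
 *   mult = (NSEC_PER_SEC << shift) / hz, rounded to nearest.
 */
static uint32_t hz2mult_sketch(uint32_t hz, uint32_t shift)
{
	uint64_t tmp;

	/* Keeps NSEC_PER_SEC << shift inside 64 bits and avoids dividing by 0. */
	assert(hz != 0 && shift <= 32);

	tmp = NSEC_PER_SEC << shift;
	tmp += hz / 2;	/* round to nearest instead of truncating */
	return (uint32_t)(tmp / hz);
}

/*
 * Convert a cycle count to nanoseconds with one multiply and one shift.
 * The caller must keep cycles small enough that cycles * mult fits in 64 bits.
 */
static uint64_t cyc2ns_sketch(uint64_t cycles, uint32_t mult, uint32_t shift)
{
	return (cycles * mult) >> shift;
}
```

For a 1 MHz counter and a shift of 20, the sketch yields a mult of 1048576000,
so 1000 cycles convert to 1000000 ns, i.e. one millisecond. A larger shift
gives a more precise mult but leaves less headroom before the multiplication
overflows.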
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70) For really simple clock sources accessed from a single I/O memory location
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71) there is nowadays even clocksource_mmio_init() which will take a memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72) location, bit width, a parameter telling whether the counter in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73) register counts up or down, and the timer clock rate, and then conjure all
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74) necessary parameters.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76) Since a 32-bit counter at say 100 MHz will wrap around to zero after some 43
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77) seconds, the code handling the clock source will have to compensate for this.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78) That is the reason why the clock source struct also contains a 'mask'
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79) member telling how many bits of the source are valid. This way the timekeeping
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80) code knows when the counter will wrap around and can insert the necessary
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81) compensation code on both sides of the wrap point so that the system timeline
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82) remains monotonic.
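
The role of the mask can be illustrated with a few lines of userspace C. This
is a sketch under the assumption that the counter is sampled at least once per
wrap period; counter_mask() and cycles_delta() are hypothetical names chosen
for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Build a mask with the low 'bits' bits set, for 1 <= bits <= 64. */
static uint64_t counter_mask(unsigned int bits)
{
	assert(bits >= 1 && bits <= 64);
	return bits < 64 ? (1ULL << bits) - 1 : ~0ULL;
}

/*
 * Cycles elapsed between two raw counter reads. The unsigned subtraction
 * followed by the mask gives the right answer even if the counter wrapped
 * once between 'last' and 'now'.
 */
static uint64_t cycles_delta(uint64_t now, uint64_t last, uint64_t mask)
{
	return (now - last) & mask;
}
```

For a 32-bit counter that was last read at 0xfffffff0 and now reads 0x10, the
masked difference is 0x20 cycles, as if the counter had never wrapped. If more
than one full wrap period passes between two reads, the lost wraps cannot be
recovered this way.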
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85) Clock events
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86) ------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88) Clock events are the conceptual reverse of clock sources: they take a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89) desired time specification value and calculate the values to poke into
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90) hardware timer registers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92) Clock events are orthogonal to clock sources. The same hardware
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93) and register range may be used for the clock event, but it is essentially
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94) a different thing. The hardware driving clock events has to be able to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95) fire interrupts, so as to trigger events on the system timeline. On an SMP
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96) system, it is ideal (and customary) to have one such event-driving timer per
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97) CPU core, so that each core can trigger events independently of any other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98) core.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) You will notice that the clock event device code is based on the same basic
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) idea about translating counters to nanoseconds using mult and shift
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) arithmetic, and you find the same family of helper functions again for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) assigning these values. The clock event driver does not need a 'mask'
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) attribute however: the system will not try to plan events beyond the time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) horizon of the clock event.
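
The reverse conversion, from a requested delay in nanoseconds to the number of
timer ticks to program, can be sketched as follows. This is an illustration
only: struct event_dev_sketch and both functions are hypothetical, and the
min/max delta fields are an assumed way of modelling the time horizon of the
device.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical stand-in for the conversion data of a clock event device. */
struct event_dev_sketch {
	uint32_t mult;		/* ns -> ticks multiplier */
	uint32_t shift;		/* ns -> ticks shift */
	uint64_t min_delta_ns;	/* shortest delay the timer can handle */
	uint64_t max_delta_ns;	/* time horizon of the timer */
};

/* mult such that (ns * mult) >> shift approximates ticks at 'hz'. */
static uint32_t ns2cyc_mult(uint32_t hz, uint32_t shift)
{
	/* Keeps hz << shift in 64 bits and the result in 32 bits. */
	assert(hz > 0 && hz < NSEC_PER_SEC && shift <= 32);
	return (uint32_t)(((uint64_t)hz << shift) / NSEC_PER_SEC);
}

/* Clamp the requested delay to the device limits, then convert to ticks. */
static uint64_t event_delta_to_cycles(const struct event_dev_sketch *dev,
				      uint64_t delta_ns)
{
	if (delta_ns < dev->min_delta_ns)
		delta_ns = dev->min_delta_ns;
	if (delta_ns > dev->max_delta_ns)
		delta_ns = dev->max_delta_ns;

	/* Truncates, so the result may be one tick short of the exact value. */
	return (delta_ns * dev->mult) >> dev->shift;
}
```

Because requests are clamped to max_delta_ns before conversion, the tick count
never exceeds what the hardware register can hold, which is why no wrap
handling is needed on this side.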
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) sched_clock()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) -------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) In addition to the clock sources and clock events there is a special weak
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) function in the kernel called sched_clock(). This function shall return the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) number of nanoseconds since the system was started. An architecture may or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) may not provide an implementation of sched_clock() on its own. If a local
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) implementation is not provided, the system jiffy counter will be used as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) sched_clock().
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) As the name suggests, sched_clock() is used for scheduling the system,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) determining the absolute timeslice for a certain process in the CFS scheduler
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) for example. It is also used for printk timestamps when you have selected to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) include time information in printk for things like bootcharts.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) Compared to clock sources, sched_clock() has to be very fast: it is called
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) much more often, especially by the scheduler. If you have to make a trade-off
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) between speed and accuracy, you may sacrifice some of the accuracy expected of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) a clock source for speed in sched_clock(). It however requires some of the same
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) basic characteristics as the clock source, i.e. it should be monotonic.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) The sched_clock() function may wrap only on unsigned long long boundaries,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) i.e. after 64 bits. Since this is a nanosecond value this will mean it wraps
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) after circa 585 years. (For most practical systems this means "never".)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) If an architecture does not provide its own implementation of this function,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) it will fall back to using jiffies, limiting its resolution to 1/HZ seconds,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) where HZ is the jiffy frequency for the architecture. This will affect
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) scheduling accuracy and will likely show up in system benchmarks.
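
The coarse granularity of a jiffies-based sched_clock() is easy to see in a
userspace sketch. The function name below is hypothetical, and the jiffy count
and HZ are passed in as plain parameters instead of being read from the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Nanoseconds since boot derived purely from a jiffy count.
 * The result can only move in steps of NSEC_PER_SEC / hz.
 */
static uint64_t jiffies_to_sched_ns(uint64_t jiffies_since_boot,
				    unsigned int hz)
{
	assert(hz > 0 && hz <= NSEC_PER_SEC);
	return jiffies_since_boot * (NSEC_PER_SEC / hz);
}
```

With HZ set to 100 the returned value advances in steps of 10 ms, and with HZ
set to 1000 in steps of 1 ms, so two events closer together than one jiffy get
the same timestamp.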
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) The clock driving sched_clock() may stop or reset to zero during system
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) suspend/sleep. This does not matter for its purpose of scheduling events on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) the system. However, it may result in interesting timestamps in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) printk().
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) The sched_clock() function should be callable in any context, be IRQ- and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) NMI-safe, and return a sane value in every such context.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146) Some architectures may have a limited set of time sources and lack a nice
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) counter to derive a 64-bit nanosecond value, so for example on the ARM
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) architecture, special helper functions have been created to provide a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) sched_clock() nanosecond base from a 16- or 32-bit counter. Sometimes the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150) same counter that is also used as clock source is used for this purpose.
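
The general idea behind extending a narrow counter to a 64-bit nanosecond
value can be sketched like this: remember an epoch (a raw counter value and the
nanoseconds accumulated up to it) and refresh that epoch more often than the
counter wraps. This is a simplified userspace illustration with hypothetical
names; it deliberately omits the locking that concurrent readers and NMI
context would require.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical state for extending an n-bit counter to 64-bit nanoseconds. */
struct narrow_clock {
	uint64_t epoch_cyc;	/* raw counter value at the last update */
	uint64_t epoch_ns;	/* nanoseconds accumulated up to epoch_cyc */
	uint64_t mask;		/* valid bits of the raw counter */
	uint32_t mult;		/* cycles -> ns multiplier */
	uint32_t shift;		/* cycles -> ns shift */
};

static void narrow_clock_init(struct narrow_clock *nc, unsigned int bits,
			      uint32_t mult, uint32_t shift)
{
	assert(bits >= 1 && bits <= 64);
	nc->epoch_cyc = 0;
	nc->epoch_ns = 0;
	nc->mask = bits < 64 ? (1ULL << bits) - 1 : ~0ULL;
	nc->mult = mult;
	nc->shift = shift;
}

/* Nanoseconds for a raw reading taken less than one wrap after the epoch. */
static uint64_t narrow_clock_ns(const struct narrow_clock *nc, uint64_t raw)
{
	uint64_t delta = (raw - nc->epoch_cyc) & nc->mask;

	return nc->epoch_ns + ((delta * nc->mult) >> nc->shift);
}

/* Move the epoch forward; must run at least once per wrap period. */
static void narrow_clock_update(struct narrow_clock *nc, uint64_t raw)
{
	nc->epoch_ns = narrow_clock_ns(nc, raw);
	nc->epoch_cyc = raw & nc->mask;
}
```

For a 16-bit counter at 1 MHz (mult 1000, shift 0), updating the epoch at raw
value 0xfff0 and then reading 0x0010 after the wrap yields 65552000 ns, the
same as if the counter had 64 bits.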
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152) On SMP systems, it is crucial for performance that sched_clock() can be called
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153) independently on each CPU without any synchronization performance hits.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) Some hardware (such as the x86 TSC) will cause the sched_clock() function to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) drift between the CPUs on the system. The kernel can work around this by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) enabling the CONFIG_HAVE_UNSTABLE_SCHED_CLOCK option. This is another aspect
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) that makes sched_clock() different from the ordinary clock source.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) Delay timers (some architectures only)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) --------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) On systems with variable CPU frequency, the various kernel delay() functions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164) will sometimes behave strangely. Basically these delays usually use a hard
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) loop to delay a certain number of jiffy fractions using a "lpj" (loops per
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) jiffy) value, calibrated on boot.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) Let's hope that your system is running on maximum frequency when this value
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) is calibrated: as a consequence, when the frequency is geared down to half the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) full frequency, any delay() will be twice as long. Usually this does not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171) hurt, as you're commonly requesting that amount of delay *or more*. But
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) basically the semantics are quite unpredictable on such systems.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174) Enter timer-based delays. Using these, a timer read may be used instead of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) a hard-coded loop for providing the desired delay.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177) This is done by declaring a struct delay_timer and assigning the appropriate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178) function pointers and rate settings for this delay timer.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180) This is available on some architectures like OpenRISC or ARM.
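
To show why a timer read makes the delay independent of the CPU frequency,
here is a userspace sketch of a timer-based busy-wait. The struct
delay_timer_sketch type, the fake timer and timer_udelay() are all hypothetical
names for this illustration and are not the kernel's definitions; the fake
timer simply advances by three ticks on every read.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical delay timer description: a read callback and its rate. */
struct delay_timer_sketch {
	uint64_t (*read_current_timer)(void);
	uint64_t freq;	/* timer frequency in Hz */
};

/* Fake free-running timer for the example: 3 ticks pass per read. */
static uint64_t fake_now;

static uint64_t fake_read_timer(void)
{
	fake_now += 3;
	return fake_now;
}

/*
 * Busy-wait until at least 'usecs' microseconds worth of timer ticks have
 * elapsed. Returns the number of ticks actually waited. How fast the CPU
 * spins in the loop does not change the length of the delay.
 */
static uint64_t timer_udelay(const struct delay_timer_sketch *t,
			     uint64_t usecs)
{
	uint64_t cycles, start, now;

	assert(t->freq > 0);

	/* Round up so the delay is never shorter than requested. */
	cycles = (usecs * t->freq + 999999) / 1000000;
	start = t->read_current_timer();

	do {
		now = t->read_current_timer();
	} while (now - start < cycles);

	return now - start;
}
```

With the fake timer running at 1 MHz, a 10 microsecond request waits for at
least 10 ticks and overshoots by less than one read interval. A calibrated
loop, in contrast, would take twice as long after the CPU clock is halved.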