^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) ======================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2) hrtimers - subsystem for high-resolution kernel timers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) ======================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) This patch introduces a new subsystem for high-resolution kernel timers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) One might ask the question: we already have a timer subsystem
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8) (kernel/timers.c), why do we need two timer subsystems? After a lot of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9) back and forth trying to integrate high-resolution and high-precision
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10) features into the existing timer framework, and after testing various
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11) such high-resolution timer implementations in practice, we came to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12) conclusion that the timer wheel code is fundamentally not suitable for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) such an approach. We initially didn't believe this ('there must be a way
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14) to solve this'), and spent a considerable effort trying to integrate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15) things into the timer wheel, but we failed. In hindsight, there are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16) several reasons why such integration is hard/impossible:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18) - the forced handling of low-resolution and high-resolution timers in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) the same way leads to a lot of compromises, macro magic and #ifdef
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20) mess. The timers.c code is very "tightly coded" around jiffies and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21) 32-bitness assumptions, and has been honed and micro-optimized for a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22) relatively narrow use case (jiffies in a relatively narrow HZ range)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) for many years - and thus even small extensions to it easily break
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24) the wheel concept, leading to even worse compromises. The timer wheel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25) code is very good and tight code, there's zero problems with it in its
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26) current usage - but it is simply not suitable to be extended for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27) high-res timers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29) - the unpredictable [O(N)] overhead of cascading leads to delays which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30) necessitate a more complex handling of high resolution timers, which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31) in turn decreases robustness. Such a design still leads to rather large
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32) timing inaccuracies. Cascading is a fundamental property of the timer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) wheel concept, it cannot be 'designed out' without inevitably
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34) degrading other portions of the timers.c code in an unacceptable way.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) - the implementation of the current posix-timer subsystem on top of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37) the timer wheel has already introduced a quite complex handling of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) the required readjusting of absolute CLOCK_REALTIME timers at
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39) settimeofday or NTP time - further underlying our experience by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40) example: that the timer wheel data structure is too rigid for high-res
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41) timers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43) - the timer wheel code is most optimal for use cases which can be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) identified as "timeouts". Such timeouts are usually set up to cover
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45) error conditions in various I/O paths, such as networking and block
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) I/O. The vast majority of those timers never expire and are rarely
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47) recascaded because the expected correct event arrives in time so they
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48) can be removed from the timer wheel before any further processing of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49) them becomes necessary. Thus the users of these timeouts can accept
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) the granularity and precision tradeoffs of the timer wheel, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51) largely expect the timer subsystem to have near-zero overhead.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52) Accurate timing for them is not a core purpose - in fact most of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53) timeout values used are ad-hoc. For them it is at most a necessary
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) evil to guarantee the processing of actual timeout completions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) (because most of the timeouts are deleted before completion), which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56) should thus be as cheap and unintrusive as possible.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58) The primary users of precision timers are user-space applications that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59) utilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60) users like drivers and subsystems which require precise timed events
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61) (e.g. multimedia) can benefit from the availability of a separate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) high-resolution timer subsystem as well.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64) While this subsystem does not offer high-resolution clock sources just
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65) yet, the hrtimer subsystem can be easily extended with high-resolution
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66) clock capabilities, and patches for that exist and are maturing quickly.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67) The increasing demand for realtime and multimedia applications along
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68) with other potential users for precise timers gives another reason to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69) separate the "timeout" and "precise timer" subsystems.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71) Another potential benefit is that such a separation allows even more
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72) special-purpose optimization of the existing timer wheel for the low
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73) resolution and low precision use cases - once the precision-sensitive
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74) APIs are separated from the timer wheel and are migrated over to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75) hrtimers. E.g. we could decrease the frequency of the timeout subsystem
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76) from 250 Hz to 100 HZ (or even smaller).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78) hrtimer subsystem implementation details
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79) ----------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81) the basic design considerations were:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83) - simplicity
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85) - data structure not bound to jiffies or any other granularity. All the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86) kernel logic works at 64-bit nanoseconds resolution - no compromises.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88) - simplification of existing, timing related kernel code
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90) another basic requirement was the immediate enqueueing and ordering of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91) timers at activation time. After looking at several possible solutions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92) such as radix trees and hashes, we chose the red black tree as the basic
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93) data structure. Rbtrees are available as a library in the kernel and are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94) used in various performance-critical areas of e.g. memory management and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95) file systems. The rbtree is solely used for time sorted ordering, while
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96) a separate list is used to give the expiry code fast access to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97) queued timers, without having to walk the rbtree.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99) (This separate list is also useful for later when we'll introduce
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) high-resolution clocks, where we need separate pending and expired
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) queues while keeping the time-order intact.)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) Time-ordered enqueueing is not purely for the purposes of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) high-resolution clocks though, it also simplifies the handling of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) absolute timers based on a low-resolution CLOCK_REALTIME. The existing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) implementation needed to keep an extra list of all armed absolute
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) CLOCK_REALTIME timers along with complex locking. In case of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) settimeofday and NTP, all the timers (!) had to be dequeued, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) time-changing code had to fix them up one by one, and all of them had to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) be enqueued again. The time-ordered enqueueing and the storage of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) expiry time in absolute time units removes all this complex and poorly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) scaling code from the posix-timer implementation - the clock can simply
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) be set without having to touch the rbtree. This also makes the handling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) of posix-timers simpler in general.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) The locking and per-CPU behavior of hrtimers was mostly taken from the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) existing timer wheel code, as it is mature and well suited. Sharing code
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) was not really a win, due to the different data structures. Also, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) hrtimer functions now have clearer behavior and clearer names - such as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) hrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) equivalent to del_timer() and del_timer_sync()] - so there's no direct
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) 1:1 mapping between them on the algorithmic level, and thus no real
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) potential for code sharing either.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) Basic data types: every time value, absolute or relative, is in a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) special nanosecond-resolution type: ktime_t. The kernel-internal
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) representation of ktime_t values and operations is implemented via
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) macros and inline functions, and can be switched between a "hybrid
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) union" type and a plain "scalar" 64bit nanoseconds representation (at
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) compile time). The hybrid union type optimizes time conversions on 32bit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) CPUs. This build-time-selectable ktime_t storage format was implemented
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) to avoid the performance impact of 64-bit multiplications and divisions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) on 32bit CPUs. Such operations are frequently necessary to convert
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) between the storage formats provided by kernel and userspace interfaces
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) and the internal time format. (See include/linux/ktime.h for further
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) details.)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) hrtimers - rounding of timer values
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) -----------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) the hrtimer code will round timer events to lower-resolution clocks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) because it has to. Otherwise it will do no artificial rounding at all.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) one question is, what resolution value should be returned to the user by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) the clock_getres() interface. This will return whatever real resolution
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146) a given clock has - be it low-res, high-res, or artificially-low-res.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) hrtimers - testing and verification
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) -----------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151) We used the high-resolution clock subsystem ontop of hrtimers to verify
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152) the hrtimer implementation details in praxis, and we also ran the posix
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153) timer tests in order to ensure specification compliance. We also ran
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) tests on low-resolution clocks.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) The hrtimer patch converts the following kernel functionality to use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) hrtimers:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159) - nanosleep
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) - itimers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) - posix-timers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) The conversion of nanosleep and posix-timers enabled the unification of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164) nanosleep and clock_nanosleep.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) The code was successfully compiled for the following platforms:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) i386, x86_64, ARM, PPC, PPC64, IA64
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) The code was run-tested on the following platforms:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) i386(UP/SMP), x86_64(UP/SMP), ARM, PPC
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174) hrtimers were also integrated into the -rt tree, along with a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) hrtimers-based high-resolution clock implementation, so the hrtimers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) code got a healthy amount of testing and use in practice.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178) Thomas Gleixner, Ingo Molnar