Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

3 Commits   0 Branches   0 Tags
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   1) ======================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   2) hrtimers - subsystem for high-resolution kernel timers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   3) ======================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   4) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   5) This patch introduces a new subsystem for high-resolution kernel timers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   6) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   7) One might ask the question: we already have a timer subsystem
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   8) (kernel/timers.c), why do we need two timer subsystems? After a lot of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   9) back and forth trying to integrate high-resolution and high-precision
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  10) features into the existing timer framework, and after testing various
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  11) such high-resolution timer implementations in practice, we came to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  12) conclusion that the timer wheel code is fundamentally not suitable for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  13) such an approach. We initially didn't believe this ('there must be a way
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  14) to solve this'), and spent a considerable effort trying to integrate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  15) things into the timer wheel, but we failed. In hindsight, there are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  16) several reasons why such integration is hard/impossible:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  17) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  18) - the forced handling of low-resolution and high-resolution timers in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  19)   the same way leads to a lot of compromises, macro magic and #ifdef
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  20)   mess. The timers.c code is very "tightly coded" around jiffies and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  21)   32-bitness assumptions, and has been honed and micro-optimized for a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  22)   relatively narrow use case (jiffies in a relatively narrow HZ range)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  23)   for many years - and thus even small extensions to it easily break
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  24)   the wheel concept, leading to even worse compromises. The timer wheel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  25)   code is very good and tight code, there's zero problems with it in its
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  26)   current usage - but it is simply not suitable to be extended for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  27)   high-res timers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  28) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  29) - the unpredictable [O(N)] overhead of cascading leads to delays which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  30)   necessitate a more complex handling of high resolution timers, which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  31)   in turn decreases robustness. Such a design still leads to rather large
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  32)   timing inaccuracies. Cascading is a fundamental property of the timer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  33)   wheel concept, it cannot be 'designed out' without inevitably
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  34)   degrading other portions of the timers.c code in an unacceptable way.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  35) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  36) - the implementation of the current posix-timer subsystem on top of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  37)   the timer wheel has already introduced a quite complex handling of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  38)   the required readjusting of absolute CLOCK_REALTIME timers at
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  39)   settimeofday or NTP time - further underlying our experience by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  40)   example: that the timer wheel data structure is too rigid for high-res
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  41)   timers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  42) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  43) - the timer wheel code is most optimal for use cases which can be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  44)   identified as "timeouts". Such timeouts are usually set up to cover
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  45)   error conditions in various I/O paths, such as networking and block
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  46)   I/O. The vast majority of those timers never expire and are rarely
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  47)   recascaded because the expected correct event arrives in time so they
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  48)   can be removed from the timer wheel before any further processing of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  49)   them becomes necessary. Thus the users of these timeouts can accept
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  50)   the granularity and precision tradeoffs of the timer wheel, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  51)   largely expect the timer subsystem to have near-zero overhead.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  52)   Accurate timing for them is not a core purpose - in fact most of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  53)   timeout values used are ad-hoc. For them it is at most a necessary
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  54)   evil to guarantee the processing of actual timeout completions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  55)   (because most of the timeouts are deleted before completion), which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  56)   should thus be as cheap and unintrusive as possible.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  57) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  58) The primary users of precision timers are user-space applications that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  59) utilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  60) users like drivers and subsystems which require precise timed events
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  61) (e.g. multimedia) can benefit from the availability of a separate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  62) high-resolution timer subsystem as well.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  63) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  64) While this subsystem does not offer high-resolution clock sources just
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  65) yet, the hrtimer subsystem can be easily extended with high-resolution
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  66) clock capabilities, and patches for that exist and are maturing quickly.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  67) The increasing demand for realtime and multimedia applications along
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  68) with other potential users for precise timers gives another reason to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  69) separate the "timeout" and "precise timer" subsystems.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  70) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  71) Another potential benefit is that such a separation allows even more
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  72) special-purpose optimization of the existing timer wheel for the low
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  73) resolution and low precision use cases - once the precision-sensitive
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  74) APIs are separated from the timer wheel and are migrated over to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  75) hrtimers. E.g. we could decrease the frequency of the timeout subsystem
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  76) from 250 Hz to 100 HZ (or even smaller).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  77) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  78) hrtimer subsystem implementation details
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  79) ----------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  80) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  81) the basic design considerations were:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  82) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  83) - simplicity
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  84) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  85) - data structure not bound to jiffies or any other granularity. All the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  86)   kernel logic works at 64-bit nanoseconds resolution - no compromises.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  87) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  88) - simplification of existing, timing related kernel code
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  89) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  90) another basic requirement was the immediate enqueueing and ordering of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  91) timers at activation time. After looking at several possible solutions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  92) such as radix trees and hashes, we chose the red black tree as the basic
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  93) data structure. Rbtrees are available as a library in the kernel and are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  94) used in various performance-critical areas of e.g. memory management and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  95) file systems. The rbtree is solely used for time sorted ordering, while
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  96) a separate list is used to give the expiry code fast access to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  97) queued timers, without having to walk the rbtree.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  98) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  99) (This separate list is also useful for later when we'll introduce
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) high-resolution clocks, where we need separate pending and expired
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) queues while keeping the time-order intact.)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) Time-ordered enqueueing is not purely for the purposes of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) high-resolution clocks though, it also simplifies the handling of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) absolute timers based on a low-resolution CLOCK_REALTIME. The existing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) implementation needed to keep an extra list of all armed absolute
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) CLOCK_REALTIME timers along with complex locking. In case of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) settimeofday and NTP, all the timers (!) had to be dequeued, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) time-changing code had to fix them up one by one, and all of them had to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) be enqueued again. The time-ordered enqueueing and the storage of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) expiry time in absolute time units removes all this complex and poorly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) scaling code from the posix-timer implementation - the clock can simply
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) be set without having to touch the rbtree. This also makes the handling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) of posix-timers simpler in general.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) The locking and per-CPU behavior of hrtimers was mostly taken from the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) existing timer wheel code, as it is mature and well suited. Sharing code
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) was not really a win, due to the different data structures. Also, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) hrtimer functions now have clearer behavior and clearer names - such as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) hrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) equivalent to del_timer() and del_timer_sync()] - so there's no direct
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) 1:1 mapping between them on the algorithmic level, and thus no real
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) potential for code sharing either.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) Basic data types: every time value, absolute or relative, is in a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) special nanosecond-resolution type: ktime_t. The kernel-internal
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) representation of ktime_t values and operations is implemented via
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) macros and inline functions, and can be switched between a "hybrid
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) union" type and a plain "scalar" 64bit nanoseconds representation (at
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) compile time). The hybrid union type optimizes time conversions on 32bit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) CPUs. This build-time-selectable ktime_t storage format was implemented
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) to avoid the performance impact of 64-bit multiplications and divisions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) on 32bit CPUs. Such operations are frequently necessary to convert
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) between the storage formats provided by kernel and userspace interfaces
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) and the internal time format. (See include/linux/ktime.h for further
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) details.)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) hrtimers - rounding of timer values
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) -----------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) the hrtimer code will round timer events to lower-resolution clocks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) because it has to. Otherwise it will do no artificial rounding at all.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) one question is, what resolution value should be returned to the user by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) the clock_getres() interface. This will return whatever real resolution
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146) a given clock has - be it low-res, high-res, or artificially-low-res.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) hrtimers - testing and verification
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) -----------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151) We used the high-resolution clock subsystem ontop of hrtimers to verify
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152) the hrtimer implementation details in praxis, and we also ran the posix
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153) timer tests in order to ensure specification compliance. We also ran
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) tests on low-resolution clocks.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) The hrtimer patch converts the following kernel functionality to use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) hrtimers:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159)  - nanosleep
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160)  - itimers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161)  - posix-timers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) The conversion of nanosleep and posix-timers enabled the unification of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164) nanosleep and clock_nanosleep.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) The code was successfully compiled for the following platforms:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168)  i386, x86_64, ARM, PPC, PPC64, IA64
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) The code was run-tested on the following platforms:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172)  i386(UP/SMP), x86_64(UP/SMP), ARM, PPC
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174) hrtimers were also integrated into the -rt tree, along with a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) hrtimers-based high-resolution clock implementation, so the hrtimers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) code got a healthy amount of testing and use in practice.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178) 	Thomas Gleixner, Ingo Molnar