Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

======================
Lightweight PI-futexes
======================

We are calling them lightweight for 3 reasons:

 - in the user-space fastpath a PI-enabled futex involves no kernel work
   (or any other PI complexity) at all. No registration, no extra kernel
   calls - just pure fast atomic ops in userspace.

 - even in the slowpath, the system call and scheduling pattern is very
   similar to normal futexes.

 - the in-kernel PI implementation is streamlined around the mutex
   abstraction, with strict rules that keep the implementation
   relatively simple: only a single owner may own a lock (i.e. no
   read-write lock support), only the owner may unlock a lock, no
   recursive locking, etc.


Priority Inheritance - why?
---------------------------

The short reply: user-space PI helps achieve or improve determinism for
user-space applications. In the best case, it can help achieve
determinism and well-bounded latencies. Even in the worst case, PI will
improve the statistical distribution of locking-related application
delays.


The longer reply
----------------

Firstly, sharing locks between multiple tasks is a common programming
technique that often cannot be replaced with lockless algorithms. As we
can see in the kernel [which is a quite complex program in itself],
lockless structures are the exception rather than the norm - the current
ratio of lockless vs. locky code for shared data structures is somewhere
between 1:10 and 1:100. Lockless is hard, and the complexity of lockless
algorithms often endangers the ability to do robust reviews of said code.
I.e. critical RT apps often choose lock structures to protect critical
data structures, instead of lockless algorithms. Furthermore, there are
cases (like shared hardware, or other resource limits) where lockless
access is mathematically impossible.

Media players (such as Jack) are an example of reasonable application
design with multiple tasks (with multiple priority levels) sharing
short-held locks: for example, a high-prio audio playback thread is
combined with medium-prio construct-audio-data threads and low-prio
display-colory-stuff threads. Add video and decoding to the mix and
we've got even more priority levels.

So once we accept that synchronization objects (locks) are an
unavoidable fact of life, and once we accept that multi-task userspace
apps have a very fair expectation of being able to use locks, we've got
to think about how to offer the option of a deterministic locking
implementation to user-space.

Most of the technical counter-arguments against doing priority
inheritance only apply to kernel-space locks. But user-space locks are
different: there we cannot disable interrupts or make the task
non-preemptible in a critical section, so the 'use spinlocks' argument
does not apply (user-space spinlocks have the same priority inversion
problems as other user-space locking constructs). Fact is, pretty much
the only technique that currently enables good determinism for userspace
locks (such as futex-based pthread mutexes) is priority inheritance:

Currently (without PI), if a high-prio and a low-prio task share a lock
[this is a quite common scenario for most non-trivial RT applications],
even if all critical sections are coded carefully to be deterministic
(i.e. all critical sections are short in duration and only execute a
limited number of instructions), the kernel cannot guarantee any
deterministic execution of the high-prio task: any medium-priority task
could preempt the low-prio task while it holds the shared lock and
executes the critical section, and could thereby delay it, and with it
the high-prio task waiting for the lock, indefinitely.


Implementation
--------------

As mentioned before, the userspace fastpath of PI-enabled pthread
mutexes involves no kernel work at all - they behave quite similarly to
normal futex-based locks: a 0 value means unlocked, and a value==TID
means locked. (This is the same method as used by list-based robust
futexes.) Userspace uses atomic ops to lock/unlock these mutexes without
entering the kernel.

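
The fastpath described above could be sketched in C11 roughly as follows.
This is only an illustration, not the glibc implementation: the helper
names ``pi_trylock_fast`` and ``pi_unlock_fast`` are made up for this
example, and the caller is assumed to pass in its own TID (for instance
as returned by ``gettid()``).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helpers. The futex word is 0 when unlocked and holds the
 * owner's TID when locked.
 */

/* Try the 0 -> TID transition; returns true if the lock was taken. */
static bool pi_trylock_fast(_Atomic uint32_t *futex, uint32_t tid)
{
	uint32_t expected = 0;

	/* A TID of 0 would be indistinguishable from "unlocked". */
	assert(tid != 0);
	return atomic_compare_exchange_strong(futex, &expected, tid);
}

/*
 * Try the TID -> 0 transition; returns false if the word is not exactly
 * our TID (e.g. another owner, or extra bits set by the kernel).
 */
static bool pi_unlock_fast(_Atomic uint32_t *futex, uint32_t tid)
{
	uint32_t expected = tid;

	return atomic_compare_exchange_strong(futex, &expected, 0);
}
```

Neither helper enters the kernel. A ``false`` return only signals that
the caller has to take the slowpath.
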

To handle the slowpath, we have added two new futex ops:

  - FUTEX_LOCK_PI
  - FUTEX_UNLOCK_PI

If the lock-acquire fastpath fails [i.e. an atomic transition from 0 to
TID fails], then FUTEX_LOCK_PI is called. The kernel does all the
remaining work: if there is no futex-queue attached to the futex address
yet then the code looks up the task that owns the futex [it has put its
own TID into the futex value], and attaches a 'PI state' structure to
the futex-queue. The pi_state includes an rt-mutex, which is a PI-aware,
kernel-based synchronization object. The 'other' task (the current lock
owner) is made the owner of the rt-mutex, and the FUTEX_WAITERS bit is
atomically set in the futex value. Then the calling task tries to lock
the rt-mutex, on which it blocks. Once it returns, it has the mutex
acquired, and it sets the futex value to its own TID and returns.
Userspace has no other work to perform - it now owns the lock, and the
futex value contains FUTEX_WAITERS|TID.

If the unlock side fastpath succeeds [i.e. userspace manages to do a
TID -> 0 atomic transition of the futex value], then no kernel work is
triggered.

If the unlock fastpath fails (because the FUTEX_WAITERS bit is set),
then FUTEX_UNLOCK_PI is called, and the kernel unlocks the futex on
behalf of userspace - and it also unlocks the attached
pi_state->rt_mutex and thus wakes up any potential waiters.

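
Putting the lock and unlock paths together, a userspace lock built
directly on the two futex ops might look like the sketch below. It
assumes Linux with the ``<linux/futex.h>`` UAPI header and uses raw
``syscall(2)``, since glibc provides no futex wrapper. It does not
interpret error codes such as ``EDEADLK`` and ignores the
``FUTEX_OWNER_DIED`` bit. The names ``futex_op``, ``self_tid``,
``pi_lock`` and ``pi_unlock`` are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Issue a futex op that takes no value, timeout or second address. */
static long futex_op(_Atomic uint32_t *uaddr, int op)
{
	return syscall(SYS_futex, uaddr, op, 0, NULL, NULL, 0);
}

static uint32_t self_tid(void)
{
	return (uint32_t)syscall(SYS_gettid);
}

/* Returns 0 on success, -1 with errno set if the kernel call failed. */
static int pi_lock(_Atomic uint32_t *futex)
{
	uint32_t expected = 0;

	/* Fastpath: 0 -> TID without entering the kernel. */
	if (atomic_compare_exchange_strong(futex, &expected, self_tid()))
		return 0;

	/*
	 * Slowpath: the kernel queues us on the rt-mutex and, once we
	 * own the lock, has written our TID (possibly together with
	 * FUTEX_WAITERS) into the futex word.
	 */
	return (int)futex_op(futex, FUTEX_LOCK_PI);
}

/* Returns 0 on success, -1 with errno set if the kernel call failed. */
static int pi_unlock(_Atomic uint32_t *futex)
{
	uint32_t expected = self_tid();

	/* Only the owner may unlock a lock. */
	assert((atomic_load(futex) & FUTEX_TID_MASK) == expected);

	/* Fastpath: TID -> 0 succeeds only while FUTEX_WAITERS is clear. */
	if (atomic_compare_exchange_strong(futex, &expected, 0))
		return 0;

	/* Slowpath: let the kernel release the lock and wake a waiter. */
	return (int)futex_op(futex, FUTEX_UNLOCK_PI);
}
```

In the uncontended case both functions return after a single atomic
operation. Real applications would normally get this behaviour through
pthread mutexes rather than by open-coding it.
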

Note that under this approach, contrary to previous PI-futex approaches,
there is no prior 'registration' of a PI-futex [which is not quite
possible anyway, due to existing ABI properties of pthread mutexes].

Also, under this scheme, 'robustness' and 'PI' are two orthogonal
properties of futexes, and all four combinations are possible: futex,
robust-futex, PI-futex, robust+PI-futex.

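
From an application's point of view, these two properties are normally
selected through POSIX mutex attributes rather than raw futex calls. The
sketch below uses a hypothetical helper name and prepares an attribute
object for any of the four combinations. It assumes a glibc-style libc in
which ``PTHREAD_PRIO_INHERIT`` mutexes are backed by PI-futexes and
``PTHREAD_MUTEX_ROBUST`` mutexes by robust futexes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: initialise *attr and select the 'robust' and
 * 'PI' properties independently. Returns 0 or a POSIX error number.
 */
static int mutexattr_init_flavour(pthread_mutexattr_t *attr,
				  bool robust, bool pi)
{
	int err;

	assert(attr != NULL);

	err = pthread_mutexattr_init(attr);
	if (err)
		return err;

	err = pthread_mutexattr_setrobust(attr,
			robust ? PTHREAD_MUTEX_ROBUST : PTHREAD_MUTEX_STALLED);
	if (!err)
		err = pthread_mutexattr_setprotocol(attr,
			pi ? PTHREAD_PRIO_INHERIT : PTHREAD_PRIO_NONE);

	/* Do not leave a half-configured attribute object behind. */
	if (err)
		pthread_mutexattr_destroy(attr);
	return err;
}
```

The resulting attribute object would then be passed to
``pthread_mutex_init()`` and released with
``pthread_mutexattr_destroy()``.
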

More details about priority inheritance can be found in
Documentation/locking/rt-mutex.rst.