Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

================
Futex Requeue PI
================

Requeueing of tasks from a non-PI futex to a PI futex requires
special handling in order to ensure the underlying rt_mutex is never
left without an owner if it has waiters; leaving it ownerless would
break the PI boosting logic [see rt-mutex-design.rst].  For the
purposes of brevity, this action will be referred to as "requeue_pi"
throughout this document.  Priority inheritance is abbreviated
throughout as "PI".

Motivation
----------

Without requeue_pi, the glibc implementation of
pthread_cond_broadcast() must resort to waking all the tasks waiting
on a pthread_condvar and letting them try to sort out which task
gets to run first in classic thundering-herd formation.  An ideal
implementation would wake the highest-priority waiter, and leave the
rest to the natural wakeup inherent in unlocking the mutex
associated with the condvar.

Consider the simplified glibc calls::

	/* caller must lock mutex */
	pthread_cond_wait(cond, mutex)
	{
		lock(cond->__data.__lock);
		unlock(mutex);
		do {
			unlock(cond->__data.__lock);
			futex_wait(cond->__data.__futex);
			lock(cond->__data.__lock);
		} while(...)
		unlock(cond->__data.__lock);
		lock(mutex);
	}

	pthread_cond_broadcast(cond)
	{
		lock(cond->__data.__lock);
		unlock(cond->__data.__lock);
		futex_requeue(cond->__data.__futex, cond->mutex);
	}

Once pthread_cond_broadcast() requeues the tasks, the cond->mutex
has waiters.  Note that pthread_cond_wait() attempts to lock the
mutex only after it has returned to user space.  This will leave the
underlying rt_mutex with waiters but no owner, breaking the
previously mentioned PI-boosting algorithms.

In order to support PI-aware pthread_condvars, the kernel needs to
be able to requeue tasks to PI futexes.  This support implies that
upon a successful futex_wait system call, the caller would return to
user space already holding the PI futex.  The glibc implementation
would be modified as follows::

	/* caller must lock mutex */
	pthread_cond_wait_pi(cond, mutex)
	{
		lock(cond->__data.__lock);
		unlock(mutex);
		do {
			unlock(cond->__data.__lock);
			futex_wait_requeue_pi(cond->__data.__futex);
			lock(cond->__data.__lock);
		} while(...)
		unlock(cond->__data.__lock);
		/* the kernel acquired the mutex for us */
	}

	pthread_cond_broadcast_pi(cond)
	{
		lock(cond->__data.__lock);
		unlock(cond->__data.__lock);
		futex_requeue_pi(cond->__data.__futex, cond->mutex);
	}

The actual glibc implementation will likely test for PI and make the
necessary changes inside the existing calls rather than creating new
calls for the PI cases.  Similar changes are needed for
pthread_cond_timedwait() and pthread_cond_signal().
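
The "test for PI" dispatch described above could look roughly like
the sketch below.  The helper names are hypothetical, and the use of
FUTEX_CMP_REQUEUE on the non-PI wake path is an assumption made for
illustration only; the op constants come from <linux/futex.h>.  This
is not taken from glibc.

```c
#include <linux/futex.h>
#include <stdbool.h>

/*
 * Hypothetical helper: futex op a condvar wait would issue, depending
 * on whether the associated mutex uses priority inheritance.
 */
static int cond_wait_op(bool mutex_is_pi)
{
	return mutex_is_pi ? FUTEX_WAIT_REQUEUE_PI : FUTEX_WAIT;
}

/*
 * Hypothetical helper: futex op a condvar signal/broadcast would
 * issue.  FUTEX_CMP_REQUEUE for the non-PI case is an assumption.
 */
static int cond_wake_op(bool mutex_is_pi)
{
	return mutex_is_pi ? FUTEX_CMP_REQUEUE_PI : FUTEX_CMP_REQUEUE;
}
```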

Implementation
--------------

In order to ensure the rt_mutex has an owner if it has waiters, it
is necessary for both the requeue code, as well as the waiting code,
to be able to acquire the rt_mutex before returning to user space.
The requeue code cannot simply wake the waiter and leave it to
acquire the rt_mutex as it would open a race window between the
requeue call returning to user space and the waiter waking and
starting to run.  This is especially true in the uncontended case.

The solution involves two new rt_mutex helper routines,
rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which
allow the requeue code to acquire an uncontended rt_mutex on behalf
of the waiter and to enqueue the waiter on a contended rt_mutex.
Two new futex operations provide the kernel<->user interface to
requeue_pi: FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI.

FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait()
and pthread_cond_timedwait()) to block on the initial futex and wait
to be requeued to a PI-aware futex.  The implementation is the
result of a high-speed collision between futex_wait() and
futex_lock_pi(), with some extra logic to check for the additional
wake-up scenarios.
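
From user space, the waiter side might be reached through a thin
wrapper such as the sketch below.  The wrapper name and parameter
names are hypothetical (this is neither a glibc nor a kernel
function); it only forwards its arguments to the raw futex system
call in the order documented in futex(2).  If the futex word does not
hold the expected value, the call is expected to fail immediately
rather than block.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: block on the non-PI futex word `cond` while it
 * still holds `expected`, and wait to be requeued to the PI futex word
 * `pi_mutex`.  `abs_timeout` may be NULL to wait without a timeout.
 * Returns 0 on success, or -1 with errno set on failure.
 */
static long futex_wait_requeue_pi(uint32_t *cond, uint32_t expected,
				  const struct timespec *abs_timeout,
				  uint32_t *pi_mutex)
{
	return syscall(SYS_futex, cond, FUTEX_WAIT_REQUEUE_PI, expected,
		       abs_timeout, pi_mutex, 0);
}
```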

FUTEX_CMP_REQUEUE_PI is called by the waker
(pthread_cond_broadcast() and pthread_cond_signal()) to requeue and
possibly wake the waiting tasks.  Internally, this operation is
still handled by futex_requeue() (by passing requeue_pi=1).  Before
requeueing, futex_requeue() attempts to acquire the requeue target
PI futex on behalf of the top waiter.  If it can, this waiter is
woken.  futex_requeue() then proceeds to requeue the remaining
nr_wake+nr_requeue tasks to the PI futex, calling
rt_mutex_start_proxy_lock() prior to each requeue to prepare the
task as a waiter on the underlying rt_mutex.  It is possible that
the lock can be acquired at this stage as well; if so, the next
waiter is woken to finish the acquisition of the lock.
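
The flow above can be illustrated with a toy, single-threaded model.
It is not the kernel's futex_requeue(): all names are hypothetical,
and it deliberately ignores the case where the lock becomes free
while later waiters are being requeued.  It only counts how many
waiters would be woken versus requeued for a given starting state.

```c
#include <limits.h>
#include <stdbool.h>

/* Outcome of one simulated requeue_pi call. */
struct toy_requeue_result {
	int woken;	/* waiters handed the lock and woken */
	int requeued;	/* waiters queued on the PI futex */
};

/*
 * Toy model: `waiters` tasks sit on the condvar futex and `lock_owned`
 * says whether the target PI futex already has an owner.
 */
static struct toy_requeue_result
toy_requeue_pi(int waiters, bool lock_owned, int nr_wake, int nr_requeue)
{
	struct toy_requeue_result res = { 0, 0 };
	/* Widen before adding: 1 + INT_MAX would overflow an int. */
	long long budget = (long long)nr_wake + nr_requeue;

	/* Step 1: try to take the lock on behalf of the top waiter. */
	if (waiters > 0 && budget > 0 && !lock_owned) {
		res.woken = 1;
		waiters--;
		budget--;
	}

	/* Step 2: queue the rest as rt_mutex waiters, up to the budget. */
	while (waiters > 0 && budget > 0) {
		res.requeued++;
		waiters--;
		budget--;
	}
	return res;
}
```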

FUTEX_CMP_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but
their sum is all that really matters.  futex_requeue() will wake or
requeue up to nr_wake + nr_requeue tasks.  It will wake only as many
tasks as it can acquire the lock for, which in the majority of cases
should be 0 as good programming practice dictates that the caller of
either pthread_cond_broadcast() or pthread_cond_signal() acquire the
mutex prior to making the call.  FUTEX_CMP_REQUEUE_PI requires that
nr_wake=1.  nr_requeue should be INT_MAX for broadcast and 0 for
signal.
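
A waker-side wrapper that applies these argument rules might look
like the sketch below.  The function and parameter names are
hypothetical; the argument order follows the raw futex system call as
documented in futex(2), where nr_requeue is passed in the slot other
operations use for the timeout and the final argument is the value
the condvar futex word is expected to hold.

```c
#define _GNU_SOURCE
#include <limits.h>
#include <linux/futex.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: nr_requeue for broadcast (all) or signal (none). */
static int requeue_pi_nr_requeue(bool broadcast)
{
	return broadcast ? INT_MAX : 0;
}

/*
 * Hypothetical wrapper: requeue waiters from the non-PI futex word
 * `cond` to the PI futex word `pi_mutex`, provided `cond` still holds
 * `expected`.  Returns the syscall result, or -1 with errno set.
 */
static long futex_cmp_requeue_pi(uint32_t *cond, uint32_t expected,
				 bool broadcast, uint32_t *pi_mutex)
{
	const int nr_wake = 1;	/* the only value the operation accepts */
	int nr_requeue = requeue_pi_nr_requeue(broadcast);

	return syscall(SYS_futex, cond, FUTEX_CMP_REQUEUE_PI, nr_wake,
		       (unsigned long)nr_requeue, pi_mutex, expected);
}
```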