Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

==============================
RT-mutex implementation design
==============================

Copyright (c) 2006 Steven Rostedt

Licensed under the GNU Free Documentation License, Version 1.2


This document tries to describe the design of the rtmutex.c implementation.
It doesn't describe the reasons why rtmutex.c exists. For that please see
Documentation/locking/rt-mutex.rst.  This document does explain the problems
that arise without this code, but only as context for understanding what
the code is actually doing.

The goal of this document is to help others understand the priority
inheritance (PI) algorithm that is used, as well as the reasons for the
decisions that were made to implement PI in the manner that was done.


Unbounded Priority Inversion
----------------------------

Priority inversion is when a lower priority process executes while a higher
priority process wants to run.  This happens for several reasons, and
most of the time it can't be helped.  Any time a high priority process wants
to use a resource that a lower priority process has (a mutex for example),
the high priority process must wait until the lower priority process is done
with the resource.  This is a priority inversion.  What we want to prevent
is something called unbounded priority inversion.  That is when the high
priority process is prevented from running by a lower priority process for
an undetermined amount of time.

The classic example of unbounded priority inversion is where you have three
processes, let's call them processes A, B, and C, where A is the highest
priority process, C is the lowest, and B is in between. A tries to grab a lock
that C owns, so A must wait and let C run to release the lock. But in the
meantime, B executes, and since B is of a higher priority than C, it preempts C.
By doing so, it is in fact preempting A, which is a higher priority process.
Now there's no way of knowing how long A will be sleeping waiting for C
to release the lock, because for all we know, B is a CPU hog and will
never give C a chance to release the lock.  This is called unbounded priority
inversion.

Here's a little ASCII art to show the problem::

     grab lock L1 (owned by C)
       |
  A ---+
          C preempted by B
            |
  C    +----+

  B         +-------->
                  B now keeps A from running.


Priority Inheritance (PI)
-------------------------

There are several ways to solve this issue, but the other ways are out of
scope for this document.  Here we only discuss PI.

PI is where a process inherits the priority of another process if the other
process blocks on a lock owned by the current process.  To make this easier
to understand, let's use the previous example, with processes A, B, and C again.

This time, when A blocks on the lock owned by C, C would inherit the priority
of A.  So now if B becomes runnable, it would not preempt C, since C now has
the high priority of A.  As soon as C releases the lock, it loses its
inherited priority, and A can then continue with the resource that C had.
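
The A, B, C scenario can be sketched in a few lines of C.  This is only a
toy model, not the rtmutex.c code: the struct and the pi_toy_* helpers are
hypothetical names invented for this illustration, a lower number means a
higher priority, and the owner is assumed to hold a single mutex so that
releasing it simply restores the normal priority.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy task: a lower number means a higher priority. */
struct pi_toy_task {
	int normal_prio;	/* priority the task was given */
	int prio;		/* effective priority, possibly inherited */
};

/* waiter blocks on a mutex held by owner: owner inherits if waiter is higher. */
static void pi_toy_block_on(struct pi_toy_task *owner,
			    const struct pi_toy_task *waiter)
{
	assert(owner != NULL && waiter != NULL);
	if (waiter->prio < owner->prio)
		owner->prio = waiter->prio;
}

/* owner releases its only mutex and drops any inherited priority. */
static void pi_toy_release(struct pi_toy_task *owner)
{
	assert(owner != NULL);
	owner->prio = owner->normal_prio;
}

/* Returns nonzero if candidate would preempt the running task. */
static int pi_toy_preempts(const struct pi_toy_task *candidate,
			   const struct pi_toy_task *running)
{
	return candidate->prio < running->prio;
}
```

With A at priority 1, B at 5 and C at 10, B preempts C before A blocks.
After pi_toy_block_on(&c, &a), C runs at priority 1 and B no longer
preempts it, which is what bounds the inversion.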

Terminology
-----------

Here I explain some terminology that is used in this document to help describe
the design that is used to implement PI.

PI chain
         - The PI chain is an ordered series of locks and processes that cause
           processes to inherit priorities from a previous process that is
           blocked on one of its locks.  This is described in more detail
           later in this document.

mutex
         - In this document, to differentiate between locks that implement
           PI and spin locks that are used in the PI code, from now on
           the PI locks will be called mutexes.

lock
         - In this document from now on, I will use the term lock when
           referring to spin locks that are used to protect parts of the PI
           algorithm.  These locks disable preemption on UP (when
           CONFIG_PREEMPT is enabled) and on SMP prevent multiple CPUs from
           entering critical sections simultaneously.

spin lock
         - Same as lock above.

waiter
         - A waiter is a struct that is stored on the stack of a blocked
           process.  Since the scope of the waiter is within the code for
           a process being blocked on the mutex, it is fine to allocate
           the waiter on the process's stack (local variable).  This
           structure holds a pointer to the task, as well as the mutex that
           the task is blocked on.  It also has rbtree node structures to
           place the task in the waiters rbtree of a mutex as well as the
           pi_waiters rbtree of a mutex owner task (described below).

           The term waiter is sometimes used to refer to the task that is
           waiting on a mutex. This is the same as waiter->task.

waiters
         - A list of processes that are blocked on a mutex.

top waiter
         - The highest priority process waiting on a specific mutex.

top pi waiter
         - The highest priority process waiting on one of the mutexes
           that a specific process owns.

Note:
       task and process are used interchangeably in this document, mostly to
       differentiate between two processes that are being described together.


PI chain
--------

The PI chain is a list of processes and mutexes that may cause priority
inheritance to take place.  Multiple chains may converge, but a chain
would never diverge, since a process can't be blocked on more than one
mutex at a time.

Example::

   Processes:  A, B, C, D, E
   Mutexes:  L1, L2, L3, L4

   A owns: L1
           B blocked on L1
           B owns L2
                  C blocked on L2
                  C owns L3
                         D blocked on L3
                         D owns L4
                                E blocked on L4

The chain would be::

   E->L4->D->L3->C->L2->B->L1->A

To show where two chains merge, we could add another process F and
another mutex L5 where B owns L5 and F is blocked on mutex L5.

The chain for F would be::

   F->L5->B->L1->A

Since a process may own more than one mutex, but never be blocked on more than
one, the chains merge.

Here we show both chains::

   E->L4->D->L3->C->L2-+
                       |
                       +->B->L1->A
                       |
                 F->L5-+

For PI to work, the processes at the right end of these chains (or we may
also call it the Top of the chain) must be equal to or higher in priority
than the processes to the left or below in the chain.

Also, since a mutex may have more than one process blocked on it, we can
have multiple chains merge at mutexes.  If we add another process G that is
blocked on mutex L2::

  G->L2->B->L1->A

And once again, to show how this can grow, I will show the merging chains
again::

   E->L4->D->L3->C-+
                   +->L2-+
                   |     |
                 G-+     +->B->L1->A
                         |
                   F->L5-+

If process G has the highest priority in the chain, then all the tasks up
the chain (A and B in this example) must have their priorities increased
to that of G.
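
The propagation from G up to B and A can be sketched as a simple walk.
This is a minimal illustration and not the kernel's chain walk: the
chain_* structs and chain_boost() are hypothetical names, a lower number
means a higher priority, no locking is done, and the chain is assumed to
be free of cycles (deadlock detection is left out).

```c
#include <assert.h>
#include <stddef.h>

struct chain_mutex;

/* Hypothetical task: a lower prio number means a higher priority. */
struct chain_task {
	int prio;
	struct chain_mutex *blocked_on;	/* at most one, so chains never diverge */
};

struct chain_mutex {
	struct chain_task *owner;	/* NULL when the mutex is free */
};

/*
 * Walk from a blocked task toward the top of its chain, raising every
 * owner that has a lower priority than the task blocked behind it.
 * Returns the number of tasks that were boosted.
 */
static int chain_boost(struct chain_task *task)
{
	int boosted = 0;

	assert(task != NULL);
	while (task->blocked_on && task->blocked_on->owner) {
		struct chain_task *owner = task->blocked_on->owner;

		if (task->prio < owner->prio) {
			owner->prio = task->prio;
			boosted++;
		}
		task = owner;
	}
	return boosted;
}
```

For the chain G->L2->B->L1->A with G at priority 1, B at 5 and A at 3,
chain_boost(&g) raises both B and A to priority 1.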

Mutex Waiters Tree
------------------

Every mutex keeps track of all the waiters that are blocked on it. The
mutex has an rbtree to store these waiters by priority.  This tree is protected
by a spin lock that is located in the struct of the mutex. This lock is called
wait_lock.


Task PI Tree
------------

To keep track of the PI chains, each process has its own PI rbtree.  This is
a tree of all top waiters of the mutexes that are owned by the process.
Note that this tree only holds the top waiters and not all waiters that are
blocked on mutexes owned by the process.

The top of the task's PI tree is always the highest priority task that
is waiting on a mutex that is owned by the task.  So if the task has
inherited a priority, it will always be the priority of the task that is
at the top of this tree.

This tree is stored in the task structure of a process as an rbtree called
pi_waiters.  It is protected by a spin lock also in the task structure,
called pi_lock.  This lock may also be taken in interrupt context, so when
locking the pi_lock, interrupts must be disabled.
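
The relationship between the two trees can be illustrated with plain
arrays standing in for the rbtrees.  This is a sketch under that
assumption only: the tree_* names are hypothetical, a lower number means
a higher priority, and no locking (wait_lock or pi_lock) is modeled.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define TREE_NO_WAITER INT_MAX	/* hypothetical marker for "nobody waiting" */

/* Hypothetical mutex: an array of waiter priorities replaces the rbtree. */
struct tree_mutex {
	const int *waiter_prios;
	size_t nr_waiters;
};

/* Top waiter of one mutex: the highest priority (lowest number) waiter. */
static int tree_top_waiter(const struct tree_mutex *m)
{
	int top = TREE_NO_WAITER;

	assert(m != NULL);
	for (size_t i = 0; i < m->nr_waiters; i++)
		if (m->waiter_prios[i] < top)
			top = m->waiter_prios[i];
	return top;
}

/*
 * Top pi waiter of a task: only the top waiter of each owned mutex is
 * considered, mirroring what the task's pi_waiters tree holds.
 */
static int tree_top_pi_waiter(const struct tree_mutex *owned, size_t nr_owned)
{
	int top = TREE_NO_WAITER;

	for (size_t i = 0; i < nr_owned; i++) {
		int t = tree_top_waiter(&owned[i]);

		if (t < top)
			top = t;
	}
	return top;
}
```

If a task owns one mutex with waiters at priorities 7 and 3 and another
with a single waiter at priority 4, only 3 and 4 would appear in its PI
tree, and 3 is the top pi waiter.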


Depth of the PI Chain
---------------------

The maximum depth of the PI chain is not dynamic, and could actually be
defined.  But it is very complex to figure out, since it depends on all
the nesting of mutexes.  Let's look at the example where we have 3 mutexes,
L1, L2, and L3, and four separate functions func1, func2, func3 and func4.
The following shows a locking order of L1->L2->L3, but may not actually
be directly nested that way::

  void func1(void)
  {
	mutex_lock(L1);

	/* do anything */

	mutex_unlock(L1);
  }

  void func2(void)
  {
	mutex_lock(L1);
	mutex_lock(L2);

	/* do something */

	mutex_unlock(L2);
	mutex_unlock(L1);
  }

  void func3(void)
  {
	mutex_lock(L2);
	mutex_lock(L3);

	/* do something else */

	mutex_unlock(L3);
	mutex_unlock(L2);
  }

  void func4(void)
  {
	mutex_lock(L3);

	/* do something again */

	mutex_unlock(L3);
  }

Now we add 4 processes that run each of these functions separately.
Processes A, B, C, and D run functions func1, func2, func3 and func4
respectively, such that D runs first and A last.  With D being preempted
in func4 in the "do something again" area, we have a locking that follows::

  D owns L3
         C blocked on L3
         C owns L2
                B blocked on L2
                B owns L1
                       A blocked on L1

And thus we have the chain A->L1->B->L2->C->L3->D.

This gives us a PI depth of 4 (four processes), but looking at any of the
functions individually, it seems as though they only have at most a locking
depth of two.  So, although the locking depth is defined at compile time,
it is still very difficult to determine what that depth can be.

Now since mutexes can be defined by user-land applications, we don't want a
denial-of-service (DoS) type of application that nests large numbers of
mutexes to create a large PI chain, and have the code holding spin locks
while looking at a large amount of data.  So to prevent this, the
implementation not only implements a maximum lock depth, but also only holds
at most two different locks at a time, as it walks the PI chain.  More about
this below.
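
The idea of a maximum lock depth can be sketched as a bounded walk.  This
is an illustration only: the walk_* names and the limit of 4 are
assumptions chosen to keep the example small, not the values or the code
used by the kernel, and no locks are taken during the walk.

```c
#include <assert.h>
#include <stddef.h>

#define WALK_MAX_DEPTH 4	/* hypothetical limit, kept tiny for the example */

struct walk_mutex;

struct walk_task {
	struct walk_mutex *blocked_on;	/* NULL when the task is not blocked */
};

struct walk_mutex {
	struct walk_task *owner;	/* NULL when the mutex is free */
};

/*
 * Count how many owners are reached when walking up from task.
 * Returns -1 as soon as the walk exceeds WALK_MAX_DEPTH, so a very long
 * (or circular) chain cannot keep the walker busy without bound.
 */
static int walk_chain_depth(const struct walk_task *task)
{
	int depth = 0;

	assert(task != NULL);
	while (task->blocked_on && task->blocked_on->owner) {
		if (++depth > WALK_MAX_DEPTH)
			return -1;
		task = task->blocked_on->owner;
	}
	return depth;
}
```

For the chain A->L1->B->L2->C->L3->D, the walk from A reaches three owners
(B, C and D).  If D were then blocked on a mutex owned by A, the walk would
never terminate on its own, and the depth limit is what stops it.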


Mutex owner and flags
---------------------

The mutex structure contains a pointer to the owner of the mutex.  If the
mutex is not owned, this owner is set to NULL.  Since all architectures
have the task structure on at least a two-byte alignment (and if this is
not true, the rtmutex.c code will be broken!), this allows for the least
significant bit to be used as a flag.  Bit 0 is used as the "Has Waiters"
flag. It's set whenever there are waiters on a mutex.

See Documentation/locking/rt-mutex.rst for further details.
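
Packing a flag into bit 0 of an aligned pointer can be sketched as
follows.  The owner_* names are hypothetical and chosen for this example;
the sketch assumes a platform where a pointer survives a round trip
through uintptr_t and where NULL converts to 0, which holds on mainstream
systems but is not guaranteed by the C standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OWNER_HAS_WAITERS ((uintptr_t)1)	/* hypothetical name for bit 0 */

/* Hypothetical task; the int member gives it at least two-byte alignment. */
struct owner_task {
	int prio;
};

/* Combine an owner pointer and the "Has Waiters" flag into one word. */
static uintptr_t owner_pack(struct owner_task *task, int has_waiters)
{
	uintptr_t v = (uintptr_t)task;

	/* The trick only works if bit 0 of the pointer is always clear. */
	assert((v & OWNER_HAS_WAITERS) == 0);
	return has_waiters ? (v | OWNER_HAS_WAITERS) : v;
}

/* Recover the owner pointer by masking the flag bit off. */
static struct owner_task *owner_task_of(uintptr_t v)
{
	return (struct owner_task *)(v & ~OWNER_HAS_WAITERS);
}

/* Test the flag without touching the pointer bits. */
static int owner_has_waiters(uintptr_t v)
{
	return (v & OWNER_HAS_WAITERS) != 0;
}
```

The benefit of this layout is that the owner and the flag can be read or
replaced together as a single word.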

cmpxchg Tricks
--------------

Some architectures implement an atomic cmpxchg (Compare and Exchange).  This
is used (when applicable) to keep the fast path of grabbing and releasing
mutexes short.

cmpxchg is basically the following function performed atomically::

  unsigned long _cmpxchg(unsigned long *A, unsigned long *B, unsigned long *C)
  {
	unsigned long T = *A;	/* remember the old value of A */
	if (T == *B) {		/* update A only if it holds the expected value */
		*A = *C;
	}
	return T;
  }
  #define cmpxchg(a,b,c) _cmpxchg(&(a),&(b),&(c))

This is really nice to have, since it allows you to only update a variable
if the variable is what you expect it to be.  You know it succeeded if
the return value (the old value of A) is equal to B.

The macro rt_mutex_cmpxchg is used to try to lock and unlock mutexes. If
the architecture does not support cmpxchg, then this macro is simply set
to fail every time.  But if cmpxchg is supported, then it helps greatly
in keeping the fast path short.

The use of rt_mutex_cmpxchg with the flags in the owner field helps optimize
the system for architectures that support it.  This will also be explained
later in this document.
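
A user-space analogue of such a fast path can be written with C11 atomics.
This is a sketch only, not the rt_mutex_cmpxchg macro: struct fast_mutex
and the fast_* functions are hypothetical, the owner is represented as a
plain uintptr_t value, and bit 0 is assumed to be the "Has Waiters" flag
described above.  A failed attempt stands for "fall back to the slow path".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical mutex: the owner word is 0 when the mutex is free. */
struct fast_mutex {
	_Atomic uintptr_t owner;
};

/*
 * Fast path lock: succeeds only if the mutex is free and has no waiters,
 * i.e. the owner word is exactly 0.  Returns nonzero on success.
 */
static int fast_trylock(struct fast_mutex *m, uintptr_t me)
{
	uintptr_t expected = 0;

	assert(me != 0 && (me & 1) == 0);
	return atomic_compare_exchange_strong(&m->owner, &expected, me);
}

/*
 * Fast path unlock: succeeds only if the owner word is exactly "me".
 * If the waiters bit is set the comparison fails, and the caller would
 * have to take the slow path to hand the mutex over.
 */
static int fast_unlock(struct fast_mutex *m, uintptr_t me)
{
	uintptr_t expected = me;

	return atomic_compare_exchange_strong(&m->owner, &expected,
					      (uintptr_t)0);
}
```

Because the flag lives in the same word as the owner, one compare and
exchange checks both "who owns this" and "is anybody waiting" at once.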


Priority adjustments
--------------------

The implementation of the PI code in rtmutex.c has several places where a
process must adjust its priority.  With the help of the pi_waiters of a
process, it is rather easy to know what needs to be adjusted.

The functions implementing the task adjustments are rt_mutex_adjust_prio
and rt_mutex_setprio. rt_mutex_setprio is only used in rt_mutex_adjust_prio.

rt_mutex_adjust_prio examines the priority of the task, and the highest
priority process that is waiting on any of the mutexes owned by the task.
Since the pi_waiters of a task holds an order by priority of all the top
waiters of all the mutexes that the task owns, we simply need to compare the
top pi waiter to its own normal/deadline priority and take the higher one.
Then rt_mutex_setprio is called to adjust the priority of the task to the
new priority. Note that rt_mutex_setprio is defined in kernel/sched/core.c
to implement the actual change in priority.

Note:
	For the "prio" field in task_struct, the lower the number, the
	higher the priority. A "prio" of 5 is of higher priority than a
	"prio" of 10.

It is interesting to note that rt_mutex_adjust_prio can either increase
or decrease the priority of the task.  In the case that a higher priority
process has just blocked on a mutex owned by the task, rt_mutex_adjust_prio
would increase/boost the task's priority.  But if a higher priority task
were for some reason to leave the mutex (timeout or signal), this same function
would decrease/unboost the priority of the task.  That is because the pi_waiters
always contains the highest priority task that is waiting on a mutex owned
by the task, so we only need to compare the priority of that top pi waiter
to the normal priority of the given task.
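
The comparison itself is small enough to sketch directly.  The helper
below is hypothetical and is not rt_mutex_adjust_prio: it ignores deadline
scheduling, takes the top pi waiter's priority as a plain argument, and
uses the "prio" convention from the note above (lower number, higher
priority).

```c
#include <assert.h>

/*
 * Hypothetical helper: compute the priority a task should run at.
 * has_pi_waiter is nonzero when the task's pi_waiters is not empty, in
 * which case top_pi_waiter_prio is the priority of its top pi waiter.
 */
static int adj_effective_prio(int normal_prio, int has_pi_waiter,
			      int top_pi_waiter_prio)
{
	assert(normal_prio >= 0);
	if (has_pi_waiter && top_pi_waiter_prio < normal_prio)
		return top_pi_waiter_prio;	/* boosted by the waiter */
	return normal_prio;			/* not boosted, or unboosted */
}
```

The same call covers both directions: a task with normal priority 10 and a
top pi waiter at 5 gets 5, and once that waiter leaves and no other waiter
remains, the result falls back to 10.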
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 379) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 380) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 381) High level overview of the PI chain walk
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 382) ----------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 383) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 384) The PI chain walk is implemented by the function rt_mutex_adjust_prio_chain.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 385) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 386) The implementation has gone through several iterations, and has ended up
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 387) with what we believe is the best.  It walks the PI chain by only grabbing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 388) at most two locks at a time, and is very efficient.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 389) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 390) The rt_mutex_adjust_prio_chain can be used either to boost or lower process
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 391) priorities.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 392) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 393) rt_mutex_adjust_prio_chain is called with a task to be checked for PI
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 394) (de)boosting (the owner of a mutex that a process is blocking on), a flag to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 395) check for deadlocking, the mutex that the task owns, a pointer to the waiter
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 396) struct of the process that is blocked on that mutex (although this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 397) parameter may be NULL for deboosting), a pointer to the mutex on which the task
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 398) is blocked, and a top_task as the top waiter of the mutex.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 399) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 400) For this explanation, I will not mention deadlock detection. This explanation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 401) will try to stay at a high level.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 402) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 403) When this function is called, there are no locks held.  That also means
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 404) that the state of the owner and lock can change once this function is entered.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 405) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 406) Before this function is called, the task has already had rt_mutex_adjust_prio
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 407) performed on it.  This means that the task is set to the priority that it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 408) should be at, but the rbtree nodes of the task's waiter have not been updated
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 409) with the new priorities, and this task may not be in the proper locations
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 410) in the pi_waiters and waiters trees that the task is blocked on. This function
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 411) solves all that.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 412) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 413) The main operation of this function is summarized by Thomas Gleixner in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 414) rtmutex.c. See the 'Chain walk basics and protection scope' comment for further
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 415) details.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 416) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 417) Taking of a mutex (The walk through)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 418) ------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 419) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 420) OK, now let's take a look at the detailed walk through of what happens when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 421) taking a mutex.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 422) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 423) The first thing that is tried is the fast taking of the mutex.  This is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 424) done when we have CMPXCHG enabled (otherwise the fast taking automatically
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 425) fails).  Only when the owner field of the mutex is NULL can the lock be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 426) taken with the CMPXCHG and nothing else needs to be done.
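
As a rough user-space illustration of the fast path, the sketch below uses C11 atomics in place of the kernel's CMPXCHG helpers. The names struct sketch_mutex and sketch_fast_trylock are hypothetical, and the owner field is modelled as a plain integer word that holds the owner's task pointer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the mutex: only the owner word is modelled. */
struct sketch_mutex {
	_Atomic uintptr_t owner; /* 0 means unowned, otherwise the owner's task pointer */
};

/*
 * Fast path: atomically swap the owner word from NULL to the caller.
 * Returns false when the mutex is already owned, in which case the
 * caller would have to fall back to the slow path.
 */
static bool sketch_fast_trylock(struct sketch_mutex *m, const void *task)
{
	uintptr_t expected = 0;

	assert(task != NULL);
	return atomic_compare_exchange_strong(&m->owner, &expected, (uintptr_t)task);
}
```

A single compare-and-exchange is all the uncontended case needs, which is why no waiter structure or wait_lock is involved here.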
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 427) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 428) If there is contention on the lock, we go about the slow path
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 429) (rt_mutex_slowlock).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 430) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 431) The slow path function is where the task's waiter structure is created on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 432) the stack.  This is because the waiter structure is only needed for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 433) scope of this function.  The waiter structure holds the nodes to store
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 434) the task on the waiters tree of the mutex, and if need be, the pi_waiters
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 435) tree of the owner.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 436) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 437) The wait_lock of the mutex is taken since the slow path of unlocking the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 438) mutex also takes this lock.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 439) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 440) We then call try_to_take_rt_mutex.  This is where architectures that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 441) do not implement CMPXCHG always grab the lock (if there's no
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 442) contention).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 443) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 444) try_to_take_rt_mutex is used every time the task tries to grab a mutex in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 445) slow path.  The first thing that is done here is an atomic setting of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 446) the "Has Waiters" flag of the mutex's owner field. By setting this flag
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 447) now, the current owner of the mutex being contended for can't release the mutex
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 448) without going into the slow unlock path, and it would then need to grab the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 449) wait_lock, which this code currently holds. So setting the "Has Waiters" flag
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 450) forces the current owner to synchronize with this code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 451) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 452) The lock is taken if the following are true:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 453) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 454)    1) The lock has no owner
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 455)    2) The current task is the highest priority against all other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 456)       waiters of the lock
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 457) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 458) If the task succeeds in acquiring the lock, then the task is set as the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 459) owner of the lock, and if the lock still has waiters, the top_waiter
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 460) (highest priority task waiting on the lock) is added to this task's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 461) pi_waiters tree.
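
The two conditions listed above can be written as a small predicate. This is a sketch under stated assumptions, not the logic of try_to_take_rt_mutex itself: struct sketch_rtlock and sketch_can_take are hypothetical, the waiters tree is reduced to the priority of the best other waiter, and refusing the lock on a priority tie is a choice made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a lock as seen by a task trying to take it. */
struct sketch_rtlock {
	const void *owner;      /* NULL when the lock has no owner */
	bool has_other_waiters; /* true if tasks other than the caller are waiting */
	int top_other_prio;     /* best priority among them (lower value = higher priority) */
};

/* Return true if a task running at 'prio' may take the lock right now. */
static bool sketch_can_take(const struct sketch_rtlock *l, int prio)
{
	assert(l != NULL);

	/* 1) The lock must have no owner. */
	if (l->owner != NULL)
		return false;

	/* 2) No other waiter may outrank the caller (ties refused: an assumption). */
	if (l->has_other_waiters && l->top_other_prio <= prio)
		return false;

	return true;
}
```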
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 462) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 463) If the lock is not taken by try_to_take_rt_mutex(), then the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 464) task_blocks_on_rt_mutex() function is called. This will add the task to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 465) the lock's waiter tree and propagate the pi chain of the lock as well
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 466) as the lock's owner's pi_waiters tree. This is described in the next
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 467) section.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 468) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 469) Task blocks on mutex
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 470) --------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 471) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 472) The accounting of a mutex and process is done with the waiter structure of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 473) the process.  The "task" field is set to the process, and the "lock" field
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 474) to the mutex.  The rbtree nodes of the waiter are initialized to the process's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 475) current priority.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 476) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 477) Since the wait_lock was taken at the entry of the slow lock, we can safely
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 478) add the waiter to the mutex's waiters tree.  If the current process is the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 479) highest priority process currently waiting on this mutex, then we remove the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 480) previous top waiter process (if it exists) from the pi_waiters of the owner,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 481) and add the current process to that tree.  Since the pi_waiters of the owner
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 482) has changed, we call rt_mutex_adjust_prio on the owner to see if the owner
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 483) should adjust its priority accordingly.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 484) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 485) If the owner is also blocked on a lock, and had its pi_waiters changed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 486) (or deadlock checking is on), we unlock the wait_lock of the mutex and go ahead
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 487) and run rt_mutex_adjust_prio_chain on the owner, as described earlier.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 488) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 489) Now all locks are released, and if the current process is still blocked on a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 490) mutex (waiter "task" field is not NULL), then we go to sleep (call schedule).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 491) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 492) Waking up in the loop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 493) ---------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 494) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 495) The task can then wake up for a couple of reasons:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 496)   1) The previous lock owner released the lock, and the task is now the top_waiter
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 497)   2) The task received a signal or timed out
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 498) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 499) In both cases, the task will try again to acquire the lock. If it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 500) does, then it will take itself off the waiters tree and set itself back
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 501) to the TASK_RUNNING state.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 502) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 503) In the first case, if the lock was acquired by another task before this task
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 504) could get the lock, then it will go back to sleep and wait to be woken again.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 505) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 506) The second case is only applicable for tasks that are grabbing a mutex
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 507) that can wake up before getting the lock, either due to a signal or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 508) a timeout (e.g. rt_mutex_timed_futex_lock()). When woken, the task will try to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 509) take the lock again.  If it succeeds, the task will return with the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 510) lock held; otherwise it will return -EINTR if the task was woken
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 511) by a signal, or -ETIMEDOUT if it timed out.
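
The outcome of one wake-up, as described in this section, can be summarized by the following sketch. The enum and the function sketch_after_wakeup are hypothetical names used only for illustration. The return value 1, meaning "go back to sleep", is a convention of this example and not something the kernel returns.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical reasons for a waiter being woken. */
enum sketch_wake_reason {
	SKETCH_WAKE_UNLOCK,  /* previous owner released the lock */
	SKETCH_WAKE_SIGNAL,  /* a signal arrived */
	SKETCH_WAKE_TIMEOUT  /* the timeout expired */
};

/*
 * Decide what happens after one wake-up in the wait loop.
 * Returns 0 if the lock was acquired, a negative errno if the task
 * gives up, or 1 if the task should sleep again.
 */
static int sketch_after_wakeup(bool got_lock, enum sketch_wake_reason why)
{
	assert(why == SKETCH_WAKE_UNLOCK || why == SKETCH_WAKE_SIGNAL ||
	       why == SKETCH_WAKE_TIMEOUT);

	if (got_lock)
		return 0;       /* retry succeeded: return with the lock held */

	switch (why) {
	case SKETCH_WAKE_SIGNAL:
		return -EINTR;
	case SKETCH_WAKE_TIMEOUT:
		return -ETIMEDOUT;
	default:
		return 1;       /* another task got the lock first: sleep again */
	}
}
```

Note that a successful retry wins over the wake-up reason: a task woken by a signal that still manages to take the lock returns with the lock held.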
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 512) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 513) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 514) Unlocking the Mutex
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 515) -------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 516) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 517) The unlocking of a mutex also has a fast path for those architectures with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 518) CMPXCHG.  Since the taking of a mutex on contention always sets the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 519) "Has Waiters" flag of the mutex's owner, we use this to know if we need to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 520) take the slow path when unlocking the mutex.  If the mutex doesn't have any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 521) waiters, the owner field of the mutex would equal the current process and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 522) the mutex can be unlocked by just replacing the owner field with NULL.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 523) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 524) If the owner field has the "Has Waiters" bit set (or CMPXCHG is not available),
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 525) the slow unlock path is taken.
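
The interaction between the "Has Waiters" flag and the fast unlock path can be illustrated with C11 atomics. This is a user-space sketch, not the kernel implementation: struct sketch_lock, SKETCH_HAS_WAITERS and both functions are hypothetical, and placing the flag in the low bit of the owner word is an assumption of the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_HAS_WAITERS ((uintptr_t)1) /* assumed: low bit of the owner word */

/* Hypothetical stand-in for the mutex: only the owner word is modelled. */
struct sketch_lock {
	_Atomic uintptr_t owner; /* task pointer, possibly with the flag bit set */
};

/* What a contending task does in the slow lock path: mark the lock contended. */
static void sketch_mark_waiters(struct sketch_lock *l)
{
	atomic_fetch_or(&l->owner, SKETCH_HAS_WAITERS);
}

/*
 * Fast unlock: succeeds only if the owner word equals the caller exactly,
 * i.e. the flag bit is clear.  On failure the owner word is left untouched
 * and the caller would have to take the slow unlock path.
 */
static bool sketch_fast_unlock(struct sketch_lock *l, const void *task)
{
	uintptr_t expected = (uintptr_t)task;

	/* The task pointer must be aligned so the flag bit is free. */
	assert((expected & SKETCH_HAS_WAITERS) == 0);
	return atomic_compare_exchange_strong(&l->owner, &expected, (uintptr_t)0);
}
```

Since the flag changes the value of the owner word, the compare-and-exchange fails automatically whenever a waiter has announced itself, with no separate check needed in the fast path.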
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 526) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 527) The first thing done in the slow unlock path is to take the wait_lock of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 528) mutex.  This synchronizes the locking and unlocking of the mutex.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 529) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 530) A check is made to see if the mutex has waiters or not.  On architectures that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 531) do not have CMPXCHG, this is the location that the owner of the mutex will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 532) determine if a waiter needs to be awoken or not.  On architectures that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 533) do have CMPXCHG, that check is done in the fast path, but it is still needed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 534) in the slow path too.  If a waiter of a mutex woke up because of a signal
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 535) or timeout between the time the owner failed the fast path CMPXCHG check and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 536) the grabbing of the wait_lock, the mutex may not have any waiters, thus the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 537) owner still needs to make this check. If there are no waiters then the mutex
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 538) owner field is set to NULL, the wait_lock is released and nothing more is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 539) needed.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 540) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 541) If there are waiters, then we need to wake one up.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 542) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 543) In the wake-up code, the pi_lock of the current owner is taken.  The top
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 544) waiter of the lock is found and removed from the waiters tree of the mutex
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 545) as well as the pi_waiters tree of the current owner. The "Has Waiters" bit is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 546) marked to prevent lower priority tasks from stealing the lock.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 547) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 548) Finally we unlock the pi_lock of the pending owner and wake it up.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 549) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 550) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 551) Contact
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 552) -------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 553) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 554) For updates on this document, please email Steven Rostedt <rostedt@goodmis.org>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 555) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 556) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 557) Credits
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 558) -------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 559) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 560) Author:  Steven Rostedt <rostedt@goodmis.org>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 561) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 562) Updated: Alex Shi <alex.shi@linaro.org>	- 7/6/2017
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 563) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 564) Original Reviewers:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 565) 		     Ingo Molnar, Thomas Gleixner, Thomas Duetsch, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 566) 		     Randy Dunlap
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 567) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 568) Update (7/6/2017) Reviewers: Steven Rostedt and Sebastian Siewior
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 569) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 570) Updates
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 571) -------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 572) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 573) This document was originally written for 2.6.17-rc3-mm1 and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 574) was updated for 4.12.