=================================
A Tour Through RCU's Requirements
=================================

Copyright IBM Corporation, 2015

Author: Paul E. McKenney

The initial version of this document appeared on
`LWN <https://lwn.net/>`_ in the following articles:
`part 1 <https://lwn.net/Articles/652156/>`_,
`part 2 <https://lwn.net/Articles/652677/>`_, and
`part 3 <https://lwn.net/Articles/653326/>`_.

Introduction
------------

Read-copy update (RCU) is a synchronization mechanism that is often used
as a replacement for reader-writer locking. RCU is unusual in that
updaters do not block readers, which means that RCU's read-side
primitives can be exceedingly fast and scalable. In addition, updaters
can make useful forward progress concurrently with readers. However, all
this concurrency between RCU readers and updaters does raise the
question of exactly what RCU readers are doing, which in turn raises the
question of exactly what RCU's requirements are.

This document therefore summarizes RCU's requirements, and can be
thought of as an informal, high-level specification for RCU. It is
important to understand that RCU's specification is primarily empirical
in nature; in fact, I learned about many of these requirements the hard
way. This situation might cause some consternation. However, not only
has this learning process been a lot of fun, but it has also been a
great privilege to work with so many people willing to apply
technologies in interesting new ways.

All that aside, here are the categories of currently known RCU
requirements:

#. `Fundamental Requirements`_
#. `Fundamental Non-Requirements`_
#. `Parallelism Facts of Life`_
#. `Quality-of-Implementation Requirements`_
#. `Linux Kernel Complications`_
#. `Software-Engineering Requirements`_
#. `Other RCU Flavors`_
#. `Possible Future Changes`_

This is followed by a `summary <#Summary>`__. The answer to each quick
quiz immediately follows the quiz.

Fundamental Requirements
------------------------

RCU's fundamental requirements are the closest thing RCU has to hard
mathematical requirements. These are:

#. `Grace-Period Guarantee`_
#. `Publish/Subscribe Guarantee`_
#. `Memory-Barrier Guarantees`_
#. `RCU Primitives Guaranteed to Execute Unconditionally`_
#. `Guaranteed Read-to-Write Upgrade`_

Grace-Period Guarantee
~~~~~~~~~~~~~~~~~~~~~~

RCU's grace-period guarantee is unusual in being premeditated: Jack
Slingwine and I had this guarantee firmly in mind when we started work
on RCU (then called “rclock”) in the early 1990s. That said, the past
two decades of experience with RCU have produced a much more detailed
understanding of this guarantee.

RCU's grace-period guarantee allows updaters to wait for the completion
of all pre-existing RCU read-side critical sections. An RCU read-side
critical section begins with the marker ``rcu_read_lock()`` and ends
with the marker ``rcu_read_unlock()``. These markers may be nested, and
RCU treats a nested set as one big RCU read-side critical section.
Production-quality implementations of ``rcu_read_lock()`` and
``rcu_read_unlock()`` are extremely lightweight, and in fact have
exactly zero overhead in Linux kernels built for production use with
``CONFIG_PREEMPT=n``.

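For readers wondering how a lock can possibly be free, the following
minimal sketch suggests what a ``CONFIG_PREEMPT=n`` build boils down to;
``rcu_read_lock_sketch()`` and ``rcu_read_unlock_sketch()`` are
illustrative names, not the kernel's actual implementation, which must
also handle preemptible and debugging configurations:

::

   /* Sketch only: in a non-preemptible kernel, the read-side markers
    * can degenerate to compiler barriers, which emit no instructions
    * but still confine the enclosed memory accesses. */
   static inline void rcu_read_lock_sketch(void)
   {
           barrier();
   }

   static inline void rcu_read_unlock_sketch(void)
   {
           barrier();
   }
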
This guarantee allows ordering to be enforced with extremely low
overhead to readers, for example:

::

    1 int x, y;
    2
    3 void thread0(void)
    4 {
    5   rcu_read_lock();
    6   r1 = READ_ONCE(x);
    7   r2 = READ_ONCE(y);
    8   rcu_read_unlock();
    9 }
   10
   11 void thread1(void)
   12 {
   13   WRITE_ONCE(x, 1);
   14   synchronize_rcu();
   15   WRITE_ONCE(y, 1);
   16 }

Because the ``synchronize_rcu()`` on line 14 waits for all pre-existing
readers, any instance of ``thread0()`` that loads a value of zero from
``x`` must complete before ``thread1()`` stores to ``y``, so that
instance must also load a value of zero from ``y``. Similarly, any
instance of ``thread0()`` that loads a value of one from ``y`` must have
started after the ``synchronize_rcu()`` started, and must therefore also
load a value of one from ``x``. Therefore, the outcome:

::

   (r1 == 0 && r2 == 1)

cannot happen.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Wait a minute! You said that updaters can make useful forward         |
| progress concurrently with readers, but pre-existing readers will     |
| block ``synchronize_rcu()``!!!                                        |
| Just who are you trying to fool???                                    |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| First, if updaters do not wish to be blocked by readers, they can use |
| ``call_rcu()`` or ``kfree_rcu()``, which will be discussed later.     |
| Second, even when using ``synchronize_rcu()``, the other update-side  |
| code does run concurrently with readers, whether pre-existing or not. |
+-----------------------------------------------------------------------+

This scenario resembles one of the first uses of RCU in
`DYNIX/ptx <https://en.wikipedia.org/wiki/DYNIX>`__, which managed a
distributed lock manager's transition into a state suitable for handling
recovery from node failure, more or less as follows:

::

    1 #define STATE_NORMAL        0
    2 #define STATE_WANT_RECOVERY 1
    3 #define STATE_RECOVERING    2
    4 #define STATE_WANT_NORMAL   3
    5
    6 int state = STATE_NORMAL;
    7
    8 void do_something_dlm(void)
    9 {
   10   int state_snap;
   11
   12   rcu_read_lock();
   13   state_snap = READ_ONCE(state);
   14   if (state_snap == STATE_NORMAL)
   15     do_something();
   16   else
   17     do_something_carefully();
   18   rcu_read_unlock();
   19 }
   20
   21 void start_recovery(void)
   22 {
   23   WRITE_ONCE(state, STATE_WANT_RECOVERY);
   24   synchronize_rcu();
   25   WRITE_ONCE(state, STATE_RECOVERING);
   26   recovery();
   27   WRITE_ONCE(state, STATE_WANT_NORMAL);
   28   synchronize_rcu();
   29   WRITE_ONCE(state, STATE_NORMAL);
   30 }

The RCU read-side critical section in ``do_something_dlm()`` works with
the ``synchronize_rcu()`` in ``start_recovery()`` to guarantee that
``do_something()`` never runs concurrently with ``recovery()``, but with
little or no synchronization overhead in ``do_something_dlm()``.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Why is the ``synchronize_rcu()`` on line 28 needed?                   |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Without that extra grace period, memory reordering could result in    |
| ``do_something_dlm()`` executing ``do_something()`` concurrently with |
| the last bits of ``recovery()``.                                      |
+-----------------------------------------------------------------------+

In order to avoid fatal problems such as deadlocks, an RCU read-side
critical section must not contain calls to ``synchronize_rcu()``.
Similarly, an RCU read-side critical section must not contain anything
that waits, directly or indirectly, on completion of an invocation of
``synchronize_rcu()``.

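To see why, consider the following sketch of the forbidden pattern,
which is not from the Linux kernel and exists only to illustrate the
self-deadlock:

::

   /* BUGGY: do not do this.  The synchronize_rcu() waits for all
    * pre-existing readers, including the reader containing it, so
    * it can never return. */
   void buggy_reader(void)
   {
           rcu_read_lock();
           synchronize_rcu();  /* Deadlocks waiting on itself. */
           rcu_read_unlock();
   }
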
Although RCU's grace-period guarantee is useful in and of itself, with
`quite a few use cases <https://lwn.net/Articles/573497/>`__, it would
be good to be able to use RCU to coordinate read-side access to linked
data structures. For this, the grace-period guarantee is not sufficient,
as can be seen in function ``add_gp_buggy()`` below. We will look at the
reader's code later, but in the meantime, just think of the reader as
locklessly picking up the ``gp`` pointer, and, if the value loaded is
non-\ ``NULL``, locklessly accessing the ``->a`` and ``->b`` fields.

::

    1 bool add_gp_buggy(int a, int b)
    2 {
    3   p = kmalloc(sizeof(*p), GFP_KERNEL);
    4   if (!p)
    5     return false;
    6   spin_lock(&gp_lock);
    7   if (rcu_access_pointer(gp)) {
    8     spin_unlock(&gp_lock);
    9     return false;
   10   }
   11   p->a = a;
   12   p->b = b;
   13   gp = p; /* ORDERING BUG */
   14   spin_unlock(&gp_lock);
   15   return true;
   16 }

The problem is that both the compiler and weakly ordered CPUs are within
their rights to reorder this code as follows:

::

    1 bool add_gp_buggy_optimized(int a, int b)
    2 {
    3   p = kmalloc(sizeof(*p), GFP_KERNEL);
    4   if (!p)
    5     return false;
    6   spin_lock(&gp_lock);
    7   if (rcu_access_pointer(gp)) {
    8     spin_unlock(&gp_lock);
    9     return false;
   10   }
   11   gp = p; /* ORDERING BUG */
   12   p->a = a;
   13   p->b = b;
   14   spin_unlock(&gp_lock);
   15   return true;
   16 }

If an RCU reader fetches ``gp`` just after ``add_gp_buggy_optimized()``
executes line 11, it will see garbage in the ``->a`` and ``->b`` fields.
And this is but one of many ways in which compiler and hardware
optimizations could cause trouble. Therefore, we clearly need some way
to prevent the compiler and the CPU from reordering in this manner,
which brings us to the publish-subscribe guarantee discussed in the next
section.

Publish/Subscribe Guarantee
~~~~~~~~~~~~~~~~~~~~~~~~~~~

RCU's publish-subscribe guarantee allows data to be inserted into a
linked data structure without disrupting RCU readers. The updater uses
``rcu_assign_pointer()`` to insert the new data, and readers use
``rcu_dereference()`` to access data, whether new or old. The following
shows an example of insertion:

::

    1 bool add_gp(int a, int b)
    2 {
    3   p = kmalloc(sizeof(*p), GFP_KERNEL);
    4   if (!p)
    5     return false;
    6   spin_lock(&gp_lock);
    7   if (rcu_access_pointer(gp)) {
    8     spin_unlock(&gp_lock);
    9     return false;
   10   }
   11   p->a = a;
   12   p->b = b;
   13   rcu_assign_pointer(gp, p);
   14   spin_unlock(&gp_lock);
   15   return true;
   16 }

The ``rcu_assign_pointer()`` on line 13 is conceptually equivalent to a
simple assignment statement, but also guarantees that its assignment
will happen after the two assignments in lines 11 and 12, similar to the
C11 ``memory_order_release`` store operation. It also prevents any
number of “interesting” compiler optimizations, for example, the use of
``gp`` as a scratch location immediately preceding the assignment.

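For those more familiar with C11 atomics than with the Linux kernel's
primitives, the following userspace sketch shows roughly the same
publication step; ``gp_c11`` and ``publish()`` are illustrative names
invented for this example:

::

   #include <stdatomic.h>

   struct foo { int a; int b; };
   static _Atomic(struct foo *) gp_c11;

   /* Publish p with release semantics: the initializing stores to
    * p->a and p->b cannot be reordered to follow the pointer store. */
   void publish(struct foo *p)
   {
           atomic_store_explicit(&gp_c11, p, memory_order_release);
   }
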
+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| But ``rcu_assign_pointer()`` does nothing to prevent the two          |
| assignments to ``p->a`` and ``p->b`` from being reordered. Can't that |
| also cause problems?                                                  |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| No, it cannot. The readers cannot see either of these two fields      |
| until the assignment to ``gp``, by which time both fields are fully   |
| initialized. So reordering the assignments to ``p->a`` and ``p->b``   |
| cannot possibly cause any problems.                                   |
+-----------------------------------------------------------------------+

It is tempting to assume that the reader need not do anything special to
control its accesses to the RCU-protected data, as shown in
``do_something_gp_buggy()`` below:

::

    1 bool do_something_gp_buggy(void)
    2 {
    3   rcu_read_lock();
    4   p = gp; /* OPTIMIZATIONS GALORE!!! */
    5   if (p) {
    6     do_something(p->a, p->b);
    7     rcu_read_unlock();
    8     return true;
    9   }
   10   rcu_read_unlock();
   11   return false;
   12 }

However, this temptation must be resisted because there are a
surprisingly large number of ways that the compiler (to say nothing of
`DEC Alpha CPUs <https://h71000.www7.hp.com/wizard/wiz_2637.html>`__)
can trip this code up. For but one example, if the compiler were short
of registers, it might choose to refetch from ``gp`` rather than keeping
a separate copy in ``p`` as follows:

::

    1 bool do_something_gp_buggy_optimized(void)
    2 {
    3   rcu_read_lock();
    4   if (gp) { /* OPTIMIZATIONS GALORE!!! */
    5     do_something(gp->a, gp->b);
    6     rcu_read_unlock();
    7     return true;
    8   }
    9   rcu_read_unlock();
   10   return false;
   11 }

If this function ran concurrently with a series of updates that replaced
the current structure with a new one, the fetches of ``gp->a`` and
``gp->b`` might well come from two different structures, which could
cause serious confusion. To prevent this (and much else besides),
``do_something_gp()`` uses ``rcu_dereference()`` to fetch from ``gp``:

::

    1 bool do_something_gp(void)
    2 {
    3   rcu_read_lock();
    4   p = rcu_dereference(gp);
    5   if (p) {
    6     do_something(p->a, p->b);
    7     rcu_read_unlock();
    8     return true;
    9   }
   10   rcu_read_unlock();
   11   return false;
   12 }

The ``rcu_dereference()`` uses volatile casts and (for DEC Alpha) memory
barriers in the Linux kernel. Should a `high-quality implementation of
C11 ``memory_order_consume``
[PDF] <http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf>`__
ever appear, then ``rcu_dereference()`` could be implemented as a
``memory_order_consume`` load. Regardless of the exact implementation, a
pointer fetched by ``rcu_dereference()`` may not be used outside of the
outermost RCU read-side critical section containing that
``rcu_dereference()``, unless protection of the corresponding data
element has been passed from RCU to some other synchronization
mechanism, most commonly locking or `reference
counting <https://www.kernel.org/doc/Documentation/RCU/rcuref.txt>`__.

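One way of passing such protection is a sketch like the following, which
assumes (purely for illustration) that ``struct foo`` contains a
``refcount_t refcnt`` field that is decremented by a release function
elsewhere; ``get_gp_ref()`` is an invented name:

::

   struct foo *get_gp_ref(void)
   {
           struct foo *p;

           rcu_read_lock();
           p = rcu_dereference(gp);
           if (p && !refcount_inc_not_zero(&p->refcnt))
                   p = NULL;  /* Being freed; act as if not found. */
           rcu_read_unlock();
           return p;  /* If non-NULL, usable outside the reader. */
   }
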
In short, updaters use ``rcu_assign_pointer()`` and readers use
``rcu_dereference()``, and these two RCU API elements work together to
ensure that readers have a consistent view of newly added data elements.

Of course, it is also necessary to remove elements from RCU-protected
data structures, for example, using the following process:

#. Remove the data element from the enclosing structure.
#. Wait for all pre-existing RCU read-side critical sections to complete
   (because only pre-existing readers can possibly have a reference to
   the newly removed data element).
#. At this point, only the updater has a reference to the newly removed
   data element, so it can safely reclaim the data element, for example,
   by passing it to ``kfree()``.

This process is implemented by ``remove_gp_synchronous()``:

::

    1 bool remove_gp_synchronous(void)
    2 {
    3   struct foo *p;
    4
    5   spin_lock(&gp_lock);
    6   p = rcu_access_pointer(gp);
    7   if (!p) {
    8     spin_unlock(&gp_lock);
    9     return false;
   10   }
   11   rcu_assign_pointer(gp, NULL);
   12   spin_unlock(&gp_lock);
   13   synchronize_rcu();
   14   kfree(p);
   15   return true;
   16 }

This function is straightforward, with line 13 waiting for a grace
period before line 14 frees the old data element. This waiting ensures
that readers will reach line 7 of ``do_something_gp()`` before the data
element referenced by ``p`` is freed. The ``rcu_access_pointer()`` on
line 6 is similar to ``rcu_dereference()``, except that:

#. The value returned by ``rcu_access_pointer()`` cannot be
   dereferenced. If you want to access the value pointed to as well as
   the pointer itself, use ``rcu_dereference()`` instead of
   ``rcu_access_pointer()``.
#. The call to ``rcu_access_pointer()`` need not be protected. In
   contrast, ``rcu_dereference()`` must either be within an RCU
   read-side critical section or in a code segment where the pointer
   cannot change, for example, in code protected by the corresponding
   update-side lock, as the sketch following this list illustrates.

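The update-side case can be made explicit with
``rcu_dereference_protected()``, which takes the condition under which
the access is known to be safe; the accessor below is a sketch using
invented names, assuming that ``gp`` is modified only while ``gp_lock``
is held:

::

   /* Safe to call only with gp_lock held: gp cannot change out from
    * under us, so no memory ordering is needed, and lockdep will
    * complain if the lock is not actually held. */
   static struct foo *get_gp_locked(void)
   {
           return rcu_dereference_protected(gp,
                                            lockdep_is_held(&gp_lock));
   }
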
+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Without the ``rcu_dereference()`` or the ``rcu_access_pointer()``,    |
| what destructive optimizations might the compiler make use of?        |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Let's start with what happens to ``do_something_gp()`` if it fails to |
| use ``rcu_dereference()``. It could reuse a value formerly fetched    |
| from this same pointer. It could also fetch the pointer from ``gp``   |
| in a byte-at-a-time manner, resulting in *load tearing*, in turn      |
| resulting in a bytewise mash-up of two distinct pointer values. It    |
| might even use value-speculation optimizations, where it makes a      |
| wrong guess, but by the time it gets around to checking the value,    |
| an update has changed the pointer to match the wrong guess. Too bad   |
| about any dereferences that returned pre-initialization garbage in    |
| the meantime!                                                         |
| For ``remove_gp_synchronous()``, as long as all modifications to      |
| ``gp`` are carried out while holding ``gp_lock``, the above           |
| optimizations are harmless. However, ``sparse`` will complain if you  |
| define ``gp`` with ``__rcu`` and then access it without using either  |
| ``rcu_access_pointer()`` or ``rcu_dereference()``.                    |
+-----------------------------------------------------------------------+

In short, RCU's publish-subscribe guarantee is provided by the
combination of ``rcu_assign_pointer()`` and ``rcu_dereference()``. This
guarantee allows data elements to be safely added to RCU-protected
linked data structures without disrupting RCU readers. This guarantee
can be used in combination with the grace-period guarantee to also allow
data elements to be removed from RCU-protected linked data structures,
again without disrupting RCU readers.

This guarantee was only partially premeditated. DYNIX/ptx used an
explicit memory barrier for publication, but had nothing resembling
``rcu_dereference()`` for subscription, nor did it have anything
resembling the dependency-ordering barrier that was later subsumed
into ``rcu_dereference()`` and later still into ``READ_ONCE()``. The
need for these operations made itself known quite suddenly at a
late-1990s meeting with the DEC Alpha architects, back in the days when
DEC was still a free-standing company. It took the Alpha architects a
good hour to convince me that any sort of barrier would ever be needed,
and it then took me a good *two* hours to convince them that their
documentation did not make this point clear. More recent work with the C
and C++ standards committees has provided much education on tricks and
traps from the compiler. In short, compilers were much less tricky in
the early 1990s, but in 2015, don't even think about omitting
``rcu_dereference()``!

Memory-Barrier Guarantees
~~~~~~~~~~~~~~~~~~~~~~~~~

The previous section's simple linked-data-structure scenario clearly
demonstrates the need for RCU's stringent memory-ordering guarantees on
systems with more than one CPU:

#. Each CPU that has an RCU read-side critical section that begins
   before ``synchronize_rcu()`` starts is guaranteed to execute a full
   memory barrier between the time that the RCU read-side critical
   section ends and the time that ``synchronize_rcu()`` returns. Without
   this guarantee, a pre-existing RCU read-side critical section might
   hold a reference to the newly removed ``struct foo`` after the
   ``kfree()`` on line 14 of ``remove_gp_synchronous()``.
#. Each CPU that has an RCU read-side critical section that ends after
   ``synchronize_rcu()`` returns is guaranteed to execute a full memory
   barrier between the time that ``synchronize_rcu()`` begins and the
   time that the RCU read-side critical section begins. Without this
   guarantee, a later RCU read-side critical section running after the
   ``kfree()`` on line 14 of ``remove_gp_synchronous()`` might later run
   ``do_something_gp()`` and find the newly deleted ``struct foo``.
#. If the task invoking ``synchronize_rcu()`` remains on a given CPU,
   then that CPU is guaranteed to execute a full memory barrier sometime
   during the execution of ``synchronize_rcu()``. This guarantee ensures
   that the ``kfree()`` on line 14 of ``remove_gp_synchronous()`` really
   does execute after the removal on line 11.
#. If the task invoking ``synchronize_rcu()`` migrates among a group of
   CPUs during that invocation, then each of the CPUs in that group is
   guaranteed to execute a full memory barrier sometime during the
   execution of ``synchronize_rcu()``. This guarantee ensures that the
   ``kfree()`` on line 14 of ``remove_gp_synchronous()`` really does
   execute after the removal on line 11, even when the thread executing
   ``synchronize_rcu()`` migrates in the meantime.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Given that multiple CPUs can start RCU read-side critical sections at |
| any time without any ordering whatsoever, how can RCU possibly tell   |
| whether or not a given RCU read-side critical section starts before a |
| given instance of ``synchronize_rcu()``?                              |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| If RCU cannot tell whether or not a given RCU read-side critical      |
| section starts before a given instance of ``synchronize_rcu()``, then |
| it must assume that the RCU read-side critical section started first. |
| In other words, a given instance of ``synchronize_rcu()`` can avoid   |
| waiting on a given RCU read-side critical section only if it can      |
| prove that ``synchronize_rcu()`` started first.                       |
| A related question is “When ``rcu_read_lock()`` doesn't generate any  |
| code, why does it matter how it relates to a grace period?” The       |
| answer is that it is not the relationship of ``rcu_read_lock()``      |
| itself that is important, but rather the relationship of the code     |
| within the enclosed RCU read-side critical section to the code        |
| preceding and following the grace period. If we take this viewpoint,  |
| then a given RCU read-side critical section begins before a given     |
| grace period when some access preceding the grace period observes the |
| effect of some access within the critical section, in which case none |
| of the accesses within the critical section may observe the effects   |
| of any access following the grace period.                             |
|                                                                       |
| As of late 2016, mathematical models of RCU take this viewpoint, for  |
| example, see slides 62 and 63 of the `2016 LinuxCon                   |
| EU <http://www2.rdrop.com/users/paulmck/scalability/paper/LinuxMM.201 |
| 6.10.04c.LCE.pdf>`__                                                  |
| presentation.                                                         |
+-----------------------------------------------------------------------+

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| The first and second guarantees require unbelievably strict ordering! |
| Are all these memory barriers *really* required?                      |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Yes, they really are required. To see why the first guarantee is      |
| required, consider the following sequence of events:                  |
|                                                                       |
| #. CPU 1: ``rcu_read_lock()``                                         |
| #. CPU 1: ``q = rcu_dereference(gp); /* Very likely to return p. */`` |
| #. CPU 0: ``list_del_rcu(p);``                                        |
| #. CPU 0: ``synchronize_rcu()`` starts.                               |
| #. CPU 1: ``do_something_with(q->a);``                                |
|    ``/* No smp_mb(), so might happen after kfree(). */``              |
| #. CPU 1: ``rcu_read_unlock()``                                       |
| #. CPU 0: ``synchronize_rcu()`` returns.                              |
| #. CPU 0: ``kfree(p);``                                               |
|                                                                       |
| Therefore, there absolutely must be a full memory barrier between the |
| end of the RCU read-side critical section and the end of the grace    |
| period.                                                               |
|                                                                       |
| The sequence of events demonstrating the necessity of the second rule |
| is roughly similar:                                                   |
|                                                                       |
| #. CPU 0: ``list_del_rcu(p);``                                        |
| #. CPU 0: ``synchronize_rcu()`` starts.                               |
| #. CPU 1: ``rcu_read_lock()``                                         |
| #. CPU 1: ``q = rcu_dereference(gp);``                                |
|    ``/* Might return p if no memory barrier. */``                     |
| #. CPU 0: ``synchronize_rcu()`` returns.                              |
| #. CPU 0: ``kfree(p);``                                               |
| #. CPU 1: ``do_something_with(q->a); /* Boom!!! */``                  |
| #. CPU 1: ``rcu_read_unlock()``                                       |
|                                                                       |
| And similarly, without a memory barrier between the beginning of the  |
| grace period and the beginning of the RCU read-side critical section, |
| CPU 1 might end up accessing the freelist.                            |
|                                                                       |
| The “as if” rule of course applies, so that any implementation that   |
| acts as if the appropriate memory barriers were in place is a correct |
| implementation. That said, it is much easier to fool yourself into    |
| believing that you have adhered to the as-if rule than it is to       |
| actually adhere to it!                                                |
+-----------------------------------------------------------------------+

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| You claim that ``rcu_read_lock()`` and ``rcu_read_unlock()`` generate |
| absolutely no code in some kernel builds. This means that the         |
| compiler might arbitrarily rearrange consecutive RCU read-side        |
| critical sections. Given such rearrangement, if a given RCU read-side |
| critical section is done, how can you be sure that all prior RCU      |
| read-side critical sections are done? Won't the compiler              |
| rearrangements make that impossible to determine?                     |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| In cases where ``rcu_read_lock()`` and ``rcu_read_unlock()`` generate |
| absolutely no code, RCU infers quiescent states only at special       |
| locations, for example, within the scheduler. Because calls to        |
| ``schedule()`` had better prevent calling-code accesses to shared     |
| variables from being rearranged across the call to ``schedule()``, if |
| RCU detects the end of a given RCU read-side critical section, it     |
| will necessarily detect the end of all prior RCU read-side critical   |
| sections, no matter how aggressively the compiler scrambles the code. |
| Again, this all assumes that the compiler cannot scramble code across |
| calls to the scheduler, out of interrupt handlers, into the idle      |
| loop, into user-mode code, and so on. But if your kernel build allows |
| that sort of scrambling, you have broken far more than just RCU!      |
+-----------------------------------------------------------------------+

Note that these memory-barrier requirements do not replace the
fundamental RCU requirement that a grace period wait for all
pre-existing readers. On the contrary, the memory barriers called out in
this section must operate in such a way as to *enforce* this fundamental
requirement. Of course, different implementations enforce this
requirement in different ways, but enforce it they must.

RCU Primitives Guaranteed to Execute Unconditionally
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The common-case RCU primitives are unconditional. They are invoked, they
do their job, and they return, with no possibility of error, and no need
to retry. This is a key RCU design philosophy.

However, this philosophy is pragmatic rather than pigheaded. If someone
comes up with a good justification for a particular conditional RCU
primitive, it might well be implemented and added. After all, this
guarantee was reverse-engineered, not premeditated. The unconditional
nature of the RCU primitives was initially an accident of
implementation, and later experience with conditional synchronization
primitives caused me to elevate this accident to a guarantee.
Therefore, the justification for adding a conditional primitive to RCU
would need to be based on detailed and compelling use cases.

^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 649) Guaranteed Read-to-Write Upgrade
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 650) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 651)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 652) As far as RCU is concerned, it is always possible to carry out an update
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 653) within an RCU read-side critical section. For example, that RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 654) read-side critical section might search for a given data element, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 655) then might acquire the update-side spinlock in order to update that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 656) element, all while remaining in that RCU read-side critical section. Of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 657) course, it is necessary to exit the RCU read-side critical section
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 658) before invoking ``synchronize_rcu()``, however, this inconvenience can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 659) be avoided through use of the ``call_rcu()`` and ``kfree_rcu()`` API
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 660) members described later in this document.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 661)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 662) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 663) | **Quick Quiz**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 664) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 665) | But how does the upgrade-to-write operation exclude other readers? |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 666) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 667) | **Answer**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 668) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 669) | It doesn't, just like normal RCU updates, which also do not exclude |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 670) | RCU readers. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 671) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 672)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 673) This guarantee allows lookup code to be shared between read-side and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 674) update-side code, and was premeditated, appearing in the earliest
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 675) DYNIX/ptx RCU documentation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 676)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 677) Fundamental Non-Requirements
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 678) ----------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 679)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 680) RCU provides extremely lightweight readers, and its read-side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 681) guarantees, though quite useful, are correspondingly lightweight. It is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 682) therefore all too easy to assume that RCU is guaranteeing more than it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 683) really is. Of course, the list of things that RCU does not guarantee is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 684) infinitely long, however, the following sections list a few
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 685) non-guarantees that have caused confusion. Except where otherwise noted,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 686) these non-guarantees were premeditated.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 687)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 688) #. `Readers Impose Minimal Ordering`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 689) #. `Readers Do Not Exclude Updaters`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 690) #. `Updaters Only Wait For Old Readers`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 691) #. `Grace Periods Don't Partition Read-Side Critical Sections`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 692) #. `Read-Side Critical Sections Don't Partition Grace Periods`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 693)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 694) Readers Impose Minimal Ordering
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 695) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 696)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 697) Reader-side markers such as ``rcu_read_lock()`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 698) ``rcu_read_unlock()`` provide absolutely no ordering guarantees except
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 699) through their interaction with the grace-period APIs such as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 700) ``synchronize_rcu()``. To see this, consider the following pair of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 701) threads:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 702)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 703) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 704)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 705) 1 void thread0(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 706) 2 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 707) 3 rcu_read_lock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 708) 4 WRITE_ONCE(x, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 709) 5 rcu_read_unlock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 710) 6 rcu_read_lock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 711) 7 WRITE_ONCE(y, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 712) 8 rcu_read_unlock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 713) 9 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 714) 10
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 715) 11 void thread1(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 716) 12 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 717) 13 rcu_read_lock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 718) 14 r1 = READ_ONCE(y);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 719) 15 rcu_read_unlock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 720) 16 rcu_read_lock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 721) 17 r2 = READ_ONCE(x);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 722) 18 rcu_read_unlock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 723) 19 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 724)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 725) After ``thread0()`` and ``thread1()`` execute concurrently, it is quite
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 726) possible to have
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 727)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 728) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 729)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 730) (r1 == 1 && r2 == 0)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 731)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 732) (that is, ``y`` appears to have been assigned before ``x``), which would
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 733) not be possible if ``rcu_read_lock()`` and ``rcu_read_unlock()`` had
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 734) much in the way of ordering properties. But they do not, so the CPU is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 735) within its rights to do significant reordering. This is by design: Any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 736) significant ordering constraints would slow down these fast-path APIs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 737)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 738) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 739) | **Quick Quiz**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 740) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 741) | Can't the compiler also reorder this code? |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 742) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 743) | **Answer**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 744) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 745) | No, the volatile casts in ``READ_ONCE()`` and ``WRITE_ONCE()`` |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 746) | prevent the compiler from reordering in this particular case. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 747) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 748)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 749) Readers Do Not Exclude Updaters
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 750) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 751)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 752) Neither ``rcu_read_lock()`` nor ``rcu_read_unlock()`` exclude updates.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 753) All they do is to prevent grace periods from ending. The following
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 754) example illustrates this:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 755)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 756) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 757)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 758) 1 void thread0(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 759) 2 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 760) 3 rcu_read_lock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 761) 4 r1 = READ_ONCE(y);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 762) 5 if (r1) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 763) 6 do_something_with_nonzero_x();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 764) 7 r2 = READ_ONCE(x);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 765) 8 WARN_ON(!r2); /* BUG!!! */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 766) 9 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 767) 10 rcu_read_unlock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 768) 11 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 769) 12
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 770) 13 void thread1(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 771) 14 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 772) 15 spin_lock(&my_lock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 773) 16 WRITE_ONCE(x, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 774) 17 WRITE_ONCE(y, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 775) 18 spin_unlock(&my_lock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 776) 19 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 777)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 778) If the ``thread0()`` function's ``rcu_read_lock()`` excluded the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 779) ``thread1()`` function's update, the ``WARN_ON()`` could never fire. But
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 780) the fact is that ``rcu_read_lock()`` does not exclude much of anything
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 781) aside from subsequent grace periods, of which ``thread1()`` has none, so
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 782) the ``WARN_ON()`` can and does fire.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 783)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 784) Updaters Only Wait For Old Readers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 785) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 786)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 787) It might be tempting to assume that after ``synchronize_rcu()``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 788) completes, there are no readers executing. This temptation must be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 789) avoided because new readers can start immediately after
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 790) ``synchronize_rcu()`` starts, and ``synchronize_rcu()`` is under no
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 791) obligation to wait for these new readers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 792)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 793) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 794) | **Quick Quiz**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 795) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 796) | Suppose that synchronize_rcu() did wait until *all* readers had |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 797) | completed instead of waiting only on pre-existing readers. For how |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 798) | long would the updater be able to rely on there being no readers? |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 799) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 800) | **Answer**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 801) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 802) | For no time at all. Even if ``synchronize_rcu()`` were to wait until |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 803) | all readers had completed, a new reader might start immediately after |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 804) | ``synchronize_rcu()`` completed. Therefore, the code following |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 805) | ``synchronize_rcu()`` can *never* rely on there being no readers. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 806) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 807)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 808) Grace Periods Don't Partition Read-Side Critical Sections
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 809) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 810)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 811) It is tempting to assume that if any part of one RCU read-side critical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 812) section precedes a given grace period, and if any part of another RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 813) read-side critical section follows that same grace period, then all of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 814) the first RCU read-side critical section must precede all of the second.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 815) However, this just isn't the case: A single grace period does not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 816) partition the set of RCU read-side critical sections. An example of this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 817) situation can be illustrated as follows, where ``x``, ``y``, and ``z``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 818) are initially all zero:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 819)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 820) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 821)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 822) 1 void thread0(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 823) 2 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 824) 3 rcu_read_lock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 825) 4 WRITE_ONCE(a, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 826) 5 WRITE_ONCE(b, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 827) 6 rcu_read_unlock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 828) 7 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 829) 8
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 830) 9 void thread1(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 831) 10 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 832) 11 r1 = READ_ONCE(a);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 833) 12 synchronize_rcu();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 834) 13 WRITE_ONCE(c, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 835) 14 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 836) 15
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 837) 16 void thread2(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 838) 17 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 839) 18 rcu_read_lock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 840) 19 r2 = READ_ONCE(b);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 841) 20 r3 = READ_ONCE(c);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 842) 21 rcu_read_unlock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 843) 22 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 844)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 845) It turns out that the outcome:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 846)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 847) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 848)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 849) (r1 == 1 && r2 == 0 && r3 == 1)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 850)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 851) is entirely possible. The following figure show how this can happen,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 852) with each circled ``QS`` indicating the point at which RCU recorded a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 853) *quiescent state* for each thread, that is, a state in which RCU knows
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 854) that the thread cannot be in the midst of an RCU read-side critical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 855) section that started before the current grace period:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 856)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 857) .. kernel-figure:: GPpartitionReaders1.svg
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 858)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 859) If it is necessary to partition RCU read-side critical sections in this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 860) manner, it is necessary to use two grace periods, where the first grace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 861) period is known to end before the second grace period starts:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 862)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 863) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 864)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 865) 1 void thread0(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 866) 2 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 867) 3 rcu_read_lock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 868) 4 WRITE_ONCE(a, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 869) 5 WRITE_ONCE(b, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 870) 6 rcu_read_unlock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 871) 7 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 872) 8
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 873) 9 void thread1(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 874) 10 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 875) 11 r1 = READ_ONCE(a);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 876) 12 synchronize_rcu();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 877) 13 WRITE_ONCE(c, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 878) 14 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 879) 15
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 880) 16 void thread2(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 881) 17 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 882) 18 r2 = READ_ONCE(c);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 883) 19 synchronize_rcu();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 884) 20 WRITE_ONCE(d, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 885) 21 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 886) 22
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 887) 23 void thread3(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 888) 24 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 889) 25 rcu_read_lock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 890) 26 r3 = READ_ONCE(b);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 891) 27 r4 = READ_ONCE(d);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 892) 28 rcu_read_unlock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 893) 29 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 894)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 895) Here, if ``(r1 == 1)``, then ``thread0()``'s write to ``b`` must happen
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 896) before the end of ``thread1()``'s grace period. If in addition
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 897) ``(r4 == 1)``, then ``thread3()``'s read from ``b`` must happen after
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 898) the beginning of ``thread2()``'s grace period. If it is also the case
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 899) that ``(r2 == 1)``, then the end of ``thread1()``'s grace period must
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 900) precede the beginning of ``thread2()``'s grace period. This mean that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 901) the two RCU read-side critical sections cannot overlap, guaranteeing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 902) that ``(r3 == 1)``. As a result, the outcome:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 903)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 904) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 905)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 906) (r1 == 1 && r2 == 1 && r3 == 0 && r4 == 1)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 907)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 908) cannot happen.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 909)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 910) This non-requirement was also non-premeditated, but became apparent when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 911) studying RCU's interaction with memory ordering.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 912)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 913) Read-Side Critical Sections Don't Partition Grace Periods
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 914) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 915)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 916) It is also tempting to assume that if an RCU read-side critical section
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 917) happens between a pair of grace periods, then those grace periods cannot
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 918) overlap. However, this temptation leads nowhere good, as can be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 919) illustrated by the following, with all variables initially zero:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 920)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 921) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 922)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 923) 1 void thread0(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 924) 2 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 925) 3 rcu_read_lock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 926) 4 WRITE_ONCE(a, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 927) 5 WRITE_ONCE(b, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 928) 6 rcu_read_unlock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 929) 7 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 930) 8
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 931) 9 void thread1(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 932) 10 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 933) 11 r1 = READ_ONCE(a);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 934) 12 synchronize_rcu();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 935) 13 WRITE_ONCE(c, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 936) 14 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 937) 15
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 938) 16 void thread2(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 939) 17 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 940) 18 rcu_read_lock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 941) 19 WRITE_ONCE(d, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 942) 20 r2 = READ_ONCE(c);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 943) 21 rcu_read_unlock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 944) 22 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 945) 23
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 946) 24 void thread3(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 947) 25 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 948) 26 r3 = READ_ONCE(d);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 949) 27 synchronize_rcu();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 950) 28 WRITE_ONCE(e, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 951) 29 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 952) 30
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 953) 31 void thread4(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 954) 32 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 955) 33 rcu_read_lock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 956) 34 r4 = READ_ONCE(b);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 957) 35 r5 = READ_ONCE(e);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 958) 36 rcu_read_unlock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 959) 37 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 960)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 961) In this case, the outcome:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 962)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 963) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 964)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 965) (r1 == 1 && r2 == 1 && r3 == 1 && r4 == 0 && r5 == 1)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 966)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 967) is entirely possible, as illustrated below:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 968)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 969) .. kernel-figure:: ReadersPartitionGP1.svg
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 970)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 971) Again, an RCU read-side critical section can overlap almost all of a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 972) given grace period, just so long as it does not overlap the entire grace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 973) period. As a result, an RCU read-side critical section cannot partition
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 974) a pair of RCU grace periods.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 975)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 976) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 977) | **Quick Quiz**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 978) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 979) | How long a sequence of grace periods, each separated by an RCU |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 980) | read-side critical section, would be required to partition the RCU |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 981) | read-side critical sections at the beginning and end of the chain? |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 982) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 983) | **Answer**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 984) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 985) | In theory, an infinite number. In practice, an unknown number that is |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 986) | sensitive to both implementation details and timing considerations. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 987) | Therefore, even in practice, RCU users must abide by the theoretical |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 988) | rather than the practical answer. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 989) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 990)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 991) Parallelism Facts of Life
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 992) -------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 993)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 994) These parallelism facts of life are by no means specific to RCU, but the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 995) RCU implementation must abide by them. They therefore bear repeating:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 996)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 997) #. Any CPU or task may be delayed at any time, and any attempts to avoid
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 998) these delays by disabling preemption, interrupts, or whatever are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 999) completely futile. This is most obvious in preemptible user-level
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1000) environments and in virtualized environments (where a given guest
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1001) OS's VCPUs can be preempted at any time by the underlying
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1002) hypervisor), but can also happen in bare-metal environments due to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1003) ECC errors, NMIs, and other hardware events. Although a delay of more
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1004) than about 20 seconds can result in splats, the RCU implementation is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1005) obligated to use algorithms that can tolerate extremely long delays,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1006) but where “extremely long” is not long enough to allow wrap-around
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1007) when incrementing a 64-bit counter.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1008) #. Both the compiler and the CPU can reorder memory accesses. Where it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1009) matters, RCU must use compiler directives and memory-barrier
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1010) instructions to preserve ordering.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1011) #. Conflicting writes to memory locations in any given cache line will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1012) result in expensive cache misses. Greater numbers of concurrent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1013) writes and more-frequent concurrent writes will result in more
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1014) dramatic slowdowns. RCU is therefore obligated to use algorithms that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1015) have sufficient locality to avoid significant performance and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1016) scalability problems.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1017) #. As a rough rule of thumb, only one CPU's worth of processing may be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1018) carried out under the protection of any given exclusive lock. RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1019) must therefore use scalable locking designs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1020) #. Counters are finite, especially on 32-bit systems. RCU's use of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1021) counters must therefore tolerate counter wrap, or be designed such
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1022) that counter wrap would take way more time than a single system is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1023) likely to run. An uptime of ten years is quite possible, a runtime of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1024) a century much less so. As an example of the latter, RCU's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1025) dyntick-idle nesting counter allows 54 bits for interrupt nesting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1026) level (this counter is 64 bits even on a 32-bit system). Overflowing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1027) this counter requires 2\ :sup:`54` half-interrupts on a given CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1028) without that CPU ever going idle. If a half-interrupt happened every
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1029) microsecond, it would take 570 years of runtime to overflow this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1030) counter, which is currently believed to be an acceptably long time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1031) #. Linux systems can have thousands of CPUs running a single Linux
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1032) kernel in a single shared-memory environment. RCU must therefore pay
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1033) close attention to high-end scalability.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1034)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1035) This last parallelism fact of life means that RCU must pay special
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1036) attention to the preceding facts of life. The idea that Linux might
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1037) scale to systems with thousands of CPUs would have been met with some
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1038) skepticism in the 1990s, but these requirements would have otherwise
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1039) have been unsurprising, even in the early 1990s.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1040)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1041) Quality-of-Implementation Requirements
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1042) --------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1043)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1044) These sections list quality-of-implementation requirements. Although an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1045) RCU implementation that ignores these requirements could still be used,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1046) it would likely be subject to limitations that would make it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1047) inappropriate for industrial-strength production use. Classes of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1048) quality-of-implementation requirements are as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1049)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1050) #. `Specialization`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1051) #. `Performance and Scalability`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1052) #. `Forward Progress`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1053) #. `Composability`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1054) #. `Corner Cases`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1055)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1056) These classes is covered in the following sections.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1057)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1058) Specialization
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1059) ~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1060)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1061) RCU is and always has been intended primarily for read-mostly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1062) situations, which means that RCU's read-side primitives are optimized,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1063) often at the expense of its update-side primitives. Experience thus far
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1064) is captured by the following list of situations:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1065)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1066) #. Read-mostly data, where stale and inconsistent data is not a problem:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1067) RCU works great!
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1068) #. Read-mostly data, where data must be consistent: RCU works well.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1069) #. Read-write data, where data must be consistent: RCU *might* work OK.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1070) Or not.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1071) #. Write-mostly data, where data must be consistent: RCU is very
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1072) unlikely to be the right tool for the job, with the following
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1073) exceptions, where RCU can provide:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1074)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1075) a. Existence guarantees for update-friendly mechanisms.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1076) b. Wait-free read-side primitives for real-time use.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1077)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1078) This focus on read-mostly situations means that RCU must interoperate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1079) with other synchronization primitives. For example, the ``add_gp()`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1080) ``remove_gp_synchronous()`` examples discussed earlier use RCU to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1081) protect readers and locking to coordinate updaters. However, the need
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1082) extends much farther, requiring that a variety of synchronization
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1083) primitives be legal within RCU read-side critical sections, including
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1084) spinlocks, sequence locks, atomic operations, reference counters, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1085) memory barriers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1086)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1087) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1088) | **Quick Quiz**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1089) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1090) | What about sleeping locks? |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1091) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1092) | **Answer**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1093) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1094) | These are forbidden within Linux-kernel RCU read-side critical |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1095) | sections because it is not legal to place a quiescent state (in this |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1096) | case, voluntary context switch) within an RCU read-side critical |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1097) | section. However, sleeping locks may be used within userspace RCU |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1098) | read-side critical sections, and also within Linux-kernel sleepable |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1099) | RCU `(SRCU) <#Sleepable%20RCU>`__ read-side critical sections. In |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1100) | addition, the -rt patchset turns spinlocks into a sleeping locks so |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1101) | that the corresponding critical sections can be preempted, which also |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1102) | means that these sleeplockified spinlocks (but not other sleeping |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1103) | locks!) may be acquire within -rt-Linux-kernel RCU read-side critical |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1104) | sections. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1105) | Note that it *is* legal for a normal RCU read-side critical section |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1106) | to conditionally acquire a sleeping locks (as in |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1107) | ``mutex_trylock()``), but only as long as it does not loop |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1108) | indefinitely attempting to conditionally acquire that sleeping locks. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1109) | The key point is that things like ``mutex_trylock()`` either return |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1110) | with the mutex held, or return an error indication if the mutex was |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1111) | not immediately available. Either way, ``mutex_trylock()`` returns |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1112) | immediately without sleeping. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1113) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1114)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1115) It often comes as a surprise that many algorithms do not require a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1116) consistent view of data, but many can function in that mode, with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1117) network routing being the poster child. Internet routing algorithms take
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1118) significant time to propagate updates, so that by the time an update
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1119) arrives at a given system, that system has been sending network traffic
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1120) the wrong way for a considerable length of time. Having a few threads
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1121) continue to send traffic the wrong way for a few more milliseconds is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1122) clearly not a problem: In the worst case, TCP retransmissions will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1123) eventually get the data where it needs to go. In general, when tracking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1124) the state of the universe outside of the computer, some level of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1125) inconsistency must be tolerated due to speed-of-light delays if nothing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1126) else.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1127)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1128) Furthermore, uncertainty about external state is inherent in many cases.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1129) For example, a pair of veterinarians might use heartbeat to determine
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1130) whether or not a given cat was alive. But how long should they wait
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1131) after the last heartbeat to decide that the cat is in fact dead? Waiting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1132) less than 400 milliseconds makes no sense because this would mean that a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1133) relaxed cat would be considered to cycle between death and life more
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1134) than 100 times per minute. Moreover, just as with human beings, a cat's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1135) heart might stop for some period of time, so the exact wait period is a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1136) judgment call. One of our pair of veterinarians might wait 30 seconds
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1137) before pronouncing the cat dead, while the other might insist on waiting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1138) a full minute. The two veterinarians would then disagree on the state of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1139) the cat during the final 30 seconds of the minute following the last
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1140) heartbeat.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1141)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1142) Interestingly enough, this same situation applies to hardware. When push
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1143) comes to shove, how do we tell whether or not some external server has
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1144) failed? We send messages to it periodically, and declare it failed if we
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1145) don't receive a response within a given period of time. Policy decisions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1146) can usually tolerate short periods of inconsistency. The policy was
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1147) decided some time ago, and is only now being put into effect, so a few
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1148) milliseconds of delay is normally inconsequential.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1149)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1150) However, there are algorithms that absolutely must see consistent data.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1151) For example, the translation between a user-level SystemV semaphore ID
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1152) to the corresponding in-kernel data structure is protected by RCU, but
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1153) it is absolutely forbidden to update a semaphore that has just been
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1154) removed. In the Linux kernel, this need for consistency is accommodated
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1155) by acquiring spinlocks located in the in-kernel data structure from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1156) within the RCU read-side critical section, and this is indicated by the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1157) green box in the figure above. Many other techniques may be used, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1158) are in fact used within the Linux kernel.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1159)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1160) In short, RCU is not required to maintain consistency, and other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1161) mechanisms may be used in concert with RCU when consistency is required.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1162) RCU's specialization allows it to do its job extremely well, and its
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1163) ability to interoperate with other synchronization mechanisms allows the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1164) right mix of synchronization tools to be used for a given job.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1165)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1166) Performance and Scalability
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1167) ~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1168)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1169) Energy efficiency is a critical component of performance today, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1170) Linux-kernel RCU implementations must therefore avoid unnecessarily
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1171) awakening idle CPUs. I cannot claim that this requirement was
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1172) premeditated. In fact, I learned of it during a telephone conversation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1173) in which I was given “frank and open” feedback on the importance of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1174) energy efficiency in battery-powered systems and on specific
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1175) energy-efficiency shortcomings of the Linux-kernel RCU implementation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1176) In my experience, the battery-powered embedded community will consider
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1177) any unnecessary wakeups to be extremely unfriendly acts. So much so that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1178) mere Linux-kernel-mailing-list posts are insufficient to vent their ire.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1179)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1180) Memory consumption is not particularly important for in most situations,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1181) and has become decreasingly so as memory sizes have expanded and memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1182) costs have plummeted. However, as I learned from Matt Mackall's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1183) `bloatwatch <http://elinux.org/Linux_Tiny-FAQ>`__ efforts, memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1184) footprint is critically important on single-CPU systems with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1185) non-preemptible (``CONFIG_PREEMPT=n``) kernels, and thus `tiny
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1186) RCU <https://lkml.kernel.org/g/20090113221724.GA15307@linux.vnet.ibm.com>`__
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1187) was born. Josh Triplett has since taken over the small-memory banner
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1188) with his `Linux kernel tinification <https://tiny.wiki.kernel.org/>`__
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1189) project, which resulted in `SRCU <#Sleepable%20RCU>`__ becoming optional
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1190) for those kernels not needing it.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1191)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1192) The remaining performance requirements are, for the most part,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1193) unsurprising. For example, in keeping with RCU's read-side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1194) specialization, ``rcu_dereference()`` should have negligible overhead
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1195) (for example, suppression of a few minor compiler optimizations).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1196) Similarly, in non-preemptible environments, ``rcu_read_lock()`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1197) ``rcu_read_unlock()`` should have exactly zero overhead.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1198)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1199) In preemptible environments, in the case where the RCU read-side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1200) critical section was not preempted (as will be the case for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1201) highest-priority real-time process), ``rcu_read_lock()`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1202) ``rcu_read_unlock()`` should have minimal overhead. In particular, they
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1203) should not contain atomic read-modify-write operations, memory-barrier
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1204) instructions, preemption disabling, interrupt disabling, or backwards
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1205) branches. However, in the case where the RCU read-side critical section
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1206) was preempted, ``rcu_read_unlock()`` may acquire spinlocks and disable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1207) interrupts. This is why it is better to nest an RCU read-side critical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1208) section within a preempt-disable region than vice versa, at least in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1209) cases where that critical section is short enough to avoid unduly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1210) degrading real-time latencies.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1211)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1212) The ``synchronize_rcu()`` grace-period-wait primitive is optimized for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1213) throughput. It may therefore incur several milliseconds of latency in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1214) addition to the duration of the longest RCU read-side critical section.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1215) On the other hand, multiple concurrent invocations of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1216) ``synchronize_rcu()`` are required to use batching optimizations so that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1217) they can be satisfied by a single underlying grace-period-wait
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1218) operation. For example, in the Linux kernel, it is not unusual for a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1219) single grace-period-wait operation to serve more than `1,000 separate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1220) invocations <https://www.usenix.org/conference/2004-usenix-annual-technical-conference/making-rcu-safe-deep-sub-millisecond-response>`__
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1221) of ``synchronize_rcu()``, thus amortizing the per-invocation overhead
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1222) down to nearly zero. However, the grace-period optimization is also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1223) required to avoid measurable degradation of real-time scheduling and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1224) interrupt latencies.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1225)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1226) In some cases, the multi-millisecond ``synchronize_rcu()`` latencies are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1227) unacceptable. In these cases, ``synchronize_rcu_expedited()`` may be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1228) used instead, reducing the grace-period latency down to a few tens of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1229) microseconds on small systems, at least in cases where the RCU read-side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1230) critical sections are short. There are currently no special latency
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1231) requirements for ``synchronize_rcu_expedited()`` on large systems, but,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1232) consistent with the empirical nature of the RCU specification, that is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1233) subject to change. However, there most definitely are scalability
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1234) requirements: A storm of ``synchronize_rcu_expedited()`` invocations on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1235) 4096 CPUs should at least make reasonable forward progress. In return
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1236) for its shorter latencies, ``synchronize_rcu_expedited()`` is permitted
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1237) to impose modest degradation of real-time latency on non-idle online
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1238) CPUs. Here, “modest” means roughly the same latency degradation as a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1239) scheduling-clock interrupt.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1240)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1241) There are a number of situations where even
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1242) ``synchronize_rcu_expedited()``'s reduced grace-period latency is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1243) unacceptable. In these situations, the asynchronous ``call_rcu()`` can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1244) be used in place of ``synchronize_rcu()`` as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1245)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1246) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1247)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1248) 1 struct foo {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1249) 2 int a;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1250) 3 int b;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1251) 4 struct rcu_head rh;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1252) 5 };
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1253) 6
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1254) 7 static void remove_gp_cb(struct rcu_head *rhp)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1255) 8 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1256) 9 struct foo *p = container_of(rhp, struct foo, rh);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1257) 10
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1258) 11 kfree(p);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1259) 12 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1260) 13
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1261) 14 bool remove_gp_asynchronous(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1262) 15 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1263) 16 struct foo *p;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1264) 17
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1265) 18 spin_lock(&gp_lock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1266) 19 p = rcu_access_pointer(gp);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1267) 20 if (!p) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1268) 21 spin_unlock(&gp_lock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1269) 22 return false;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1270) 23 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1271) 24 rcu_assign_pointer(gp, NULL);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1272) 25 call_rcu(&p->rh, remove_gp_cb);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1273) 26 spin_unlock(&gp_lock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1274) 27 return true;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1275) 28 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1276)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1277) A definition of ``struct foo`` is finally needed, and appears on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1278) lines 1-5. The function ``remove_gp_cb()`` is passed to ``call_rcu()``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1279) on line 25, and will be invoked after the end of a subsequent grace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1280) period. This gets the same effect as ``remove_gp_synchronous()``, but
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1281) without forcing the updater to wait for a grace period to elapse. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1282) ``call_rcu()`` function may be used in a number of situations where
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1283) neither ``synchronize_rcu()`` nor ``synchronize_rcu_expedited()`` would
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1284) be legal, including within preempt-disable code, ``local_bh_disable()``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1285) code, interrupt-disable code, and interrupt handlers. However, even
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1286) ``call_rcu()`` is illegal within NMI handlers and from idle and offline
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1287) CPUs. The callback function (``remove_gp_cb()`` in this case) will be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1288) executed within softirq (software interrupt) environment within the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1289) Linux kernel, either within a real softirq handler or under the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1290) protection of ``local_bh_disable()``. In both the Linux kernel and in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1291) userspace, it is bad practice to write an RCU callback function that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1292) takes too long. Long-running operations should be relegated to separate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1293) threads or (in the Linux kernel) workqueues.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1294)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1295) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1296) | **Quick Quiz**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1297) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1298) | Why does line 19 use ``rcu_access_pointer()``? After all, |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1299) | ``call_rcu()`` on line 25 stores into the structure, which would |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1300) | interact badly with concurrent insertions. Doesn't this mean that |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1301) | ``rcu_dereference()`` is required? |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1302) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1303) | **Answer**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1304) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1305) | Presumably the ``->gp_lock`` acquired on line 18 excludes any |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1306) | changes, including any insertions that ``rcu_dereference()`` would |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1307) | protect against. Therefore, any insertions will be delayed until |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1308) | after ``->gp_lock`` is released on line 25, which in turn means that |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1309) | ``rcu_access_pointer()`` suffices. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1310) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1311)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1312) However, all that ``remove_gp_cb()`` is doing is invoking ``kfree()`` on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1313) the data element. This is a common idiom, and is supported by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1314) ``kfree_rcu()``, which allows “fire and forget” operation as shown
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1315) below:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1316)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1317) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1318)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1319) 1 struct foo {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1320) 2 int a;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1321) 3 int b;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1322) 4 struct rcu_head rh;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1323) 5 };
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1324) 6
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1325) 7 bool remove_gp_faf(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1326) 8 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1327) 9 struct foo *p;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1328) 10
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1329) 11 spin_lock(&gp_lock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1330) 12 p = rcu_dereference(gp);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1331) 13 if (!p) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1332) 14 spin_unlock(&gp_lock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1333) 15 return false;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1334) 16 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1335) 17 rcu_assign_pointer(gp, NULL);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1336) 18 kfree_rcu(p, rh);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1337) 19 spin_unlock(&gp_lock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1338) 20 return true;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1339) 21 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1340)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1341) Note that ``remove_gp_faf()`` simply invokes ``kfree_rcu()`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1342) proceeds, without any need to pay any further attention to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1343) subsequent grace period and ``kfree()``. It is permissible to invoke
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1344) ``kfree_rcu()`` from the same environments as for ``call_rcu()``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1345) Interestingly enough, DYNIX/ptx had the equivalents of ``call_rcu()``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1346) and ``kfree_rcu()``, but not ``synchronize_rcu()``. This was due to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1347) fact that RCU was not heavily used within DYNIX/ptx, so the very few
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1348) places that needed something like ``synchronize_rcu()`` simply
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1349) open-coded it.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1350)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1351) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1352) | **Quick Quiz**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1353) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1354) | Earlier it was claimed that ``call_rcu()`` and ``kfree_rcu()`` |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1355) | allowed updaters to avoid being blocked by readers. But how can that |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1356) | be correct, given that the invocation of the callback and the freeing |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1357) | of the memory (respectively) must still wait for a grace period to |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1358) | elapse? |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1359) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1360) | **Answer**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1361) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1362) | We could define things this way, but keep in mind that this sort of |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1363) | definition would say that updates in garbage-collected languages |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1364) | cannot complete until the next time the garbage collector runs, which |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1365) | does not seem at all reasonable. The key point is that in most cases, |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1366) | an updater using either ``call_rcu()`` or ``kfree_rcu()`` can proceed |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1367) | to the next update as soon as it has invoked ``call_rcu()`` or |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1368) | ``kfree_rcu()``, without having to wait for a subsequent grace |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1369) | period. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1370) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1371)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1372) But what if the updater must wait for the completion of code to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1373) executed after the end of the grace period, but has other tasks that can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1374) be carried out in the meantime? The polling-style
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1375) ``get_state_synchronize_rcu()`` and ``cond_synchronize_rcu()`` functions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1376) may be used for this purpose, as shown below:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1377)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1378) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1379)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1380) 1 bool remove_gp_poll(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1381) 2 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1382) 3 struct foo *p;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1383) 4 unsigned long s;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1384) 5
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1385) 6 spin_lock(&gp_lock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1386) 7 p = rcu_access_pointer(gp);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1387) 8 if (!p) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1388) 9 spin_unlock(&gp_lock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1389) 10 return false;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1390) 11 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1391) 12 rcu_assign_pointer(gp, NULL);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1392) 13 spin_unlock(&gp_lock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1393) 14 s = get_state_synchronize_rcu();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1394) 15 do_something_while_waiting();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1395) 16 cond_synchronize_rcu(s);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1396) 17 kfree(p);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1397) 18 return true;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1398) 19 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1399)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1400) On line 14, ``get_state_synchronize_rcu()`` obtains a “cookie” from RCU,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1401) then line 15 carries out other tasks, and finally, line 16 returns
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1402) immediately if a grace period has elapsed in the meantime, but otherwise
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1403) waits as required. The need for ``get_state_synchronize_rcu`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1404) ``cond_synchronize_rcu()`` has appeared quite recently, so it is too
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1405) early to tell whether they will stand the test of time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1406)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1407) RCU thus provides a range of tools to allow updaters to strike the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1408) required tradeoff between latency, flexibility and CPU overhead.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1409)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1410) Forward Progress
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1411) ~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1412)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1413) In theory, delaying grace-period completion and callback invocation is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1414) harmless. In practice, not only are memory sizes finite but also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1415) callbacks sometimes do wakeups, and sufficiently deferred wakeups can be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1416) difficult to distinguish from system hangs. Therefore, RCU must provide
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1417) a number of mechanisms to promote forward progress.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1418)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1419) These mechanisms are not foolproof, nor can they be. For one simple
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1420) example, an infinite loop in an RCU read-side critical section must by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1421) definition prevent later grace periods from ever completing. For a more
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1422) involved example, consider a 64-CPU system built with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1423) ``CONFIG_RCU_NOCB_CPU=y`` and booted with ``rcu_nocbs=1-63``, where
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1424) CPUs 1 through 63 spin in tight loops that invoke ``call_rcu()``. Even
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1425) if these tight loops also contain calls to ``cond_resched()`` (thus
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1426) allowing grace periods to complete), CPU 0 simply will not be able to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1427) invoke callbacks as fast as the other 63 CPUs can register them, at
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1428) least not until the system runs out of memory. In both of these
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1429) examples, the Spiderman principle applies: With great power comes great
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1430) responsibility. However, short of this level of abuse, RCU is required
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1431) to ensure timely completion of grace periods and timely invocation of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1432) callbacks.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1433)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1434) RCU takes the following steps to encourage timely completion of grace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1435) periods:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1436)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1437) #. If a grace period fails to complete within 100 milliseconds, RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1438) causes future invocations of ``cond_resched()`` on the holdout CPUs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1439) to provide an RCU quiescent state. RCU also causes those CPUs'
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1440) ``need_resched()`` invocations to return ``true``, but only after the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1441) corresponding CPU's next scheduling-clock.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1442) #. CPUs mentioned in the ``nohz_full`` kernel boot parameter can run
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1443) indefinitely in the kernel without scheduling-clock interrupts, which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1444) defeats the above ``need_resched()`` strategem. RCU will therefore
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1445) invoke ``resched_cpu()`` on any ``nohz_full`` CPUs still holding out
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1446) after 109 milliseconds.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1447) #. In kernels built with ``CONFIG_RCU_BOOST=y``, if a given task that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1448) has been preempted within an RCU read-side critical section is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1449) holding out for more than 500 milliseconds, RCU will resort to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1450) priority boosting.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1451) #. If a CPU is still holding out 10 seconds into the grace period, RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1452) will invoke ``resched_cpu()`` on it regardless of its ``nohz_full``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1453) state.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1454)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1455) The above values are defaults for systems running with ``HZ=1000``. They
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1456) will vary as the value of ``HZ`` varies, and can also be changed using
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1457) the relevant Kconfig options and kernel boot parameters. RCU currently
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1458) does not do much sanity checking of these parameters, so please use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1459) caution when changing them. Note that these forward-progress measures
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1460) are provided only for RCU, not for `SRCU <#Sleepable%20RCU>`__ or `Tasks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1461) RCU <#Tasks%20RCU>`__.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1462)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1463) RCU takes the following steps in ``call_rcu()`` to encourage timely
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1464) invocation of callbacks when any given non-\ ``rcu_nocbs`` CPU has
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1465) 10,000 callbacks, or has 10,000 more callbacks than it had the last time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1466) encouragement was provided:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1467)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1468) #. Starts a grace period, if one is not already in progress.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1469) #. Forces immediate checking for quiescent states, rather than waiting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1470) for three milliseconds to have elapsed since the beginning of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1471) grace period.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1472) #. Immediately tags the CPU's callbacks with their grace period
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1473) completion numbers, rather than waiting for the ``RCU_SOFTIRQ``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1474) handler to get around to it.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1475) #. Lifts callback-execution batch limits, which speeds up callback
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1476) invocation at the expense of degrading realtime response.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1477)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1478) Again, these are default values when running at ``HZ=1000``, and can be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1479) overridden. Again, these forward-progress measures are provided only for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1480) RCU, not for `SRCU <#Sleepable%20RCU>`__ or `Tasks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1481) RCU <#Tasks%20RCU>`__. Even for RCU, callback-invocation forward
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1482) progress for ``rcu_nocbs`` CPUs is much less well-developed, in part
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1483) because workloads benefiting from ``rcu_nocbs`` CPUs tend to invoke
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1484) ``call_rcu()`` relatively infrequently. If workloads emerge that need
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1485) both ``rcu_nocbs`` CPUs and high ``call_rcu()`` invocation rates, then
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1486) additional forward-progress work will be required.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1487)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1488) Composability
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1489) ~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1490)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1491) Composability has received much attention in recent years, perhaps in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1492) part due to the collision of multicore hardware with object-oriented
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1493) techniques designed in single-threaded environments for single-threaded
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1494) use. And in theory, RCU read-side critical sections may be composed, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1495) in fact may be nested arbitrarily deeply. In practice, as with all
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1496) real-world implementations of composable constructs, there are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1497) limitations.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1498)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1499) Implementations of RCU for which ``rcu_read_lock()`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1500) ``rcu_read_unlock()`` generate no code, such as Linux-kernel RCU when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1501) ``CONFIG_PREEMPT=n``, can be nested arbitrarily deeply. After all, there
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1502) is no overhead. Except that if all these instances of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1503) ``rcu_read_lock()`` and ``rcu_read_unlock()`` are visible to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1504) compiler, compilation will eventually fail due to exhausting memory,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1505) mass storage, or user patience, whichever comes first. If the nesting is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1506) not visible to the compiler, as is the case with mutually recursive
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1507) functions each in its own translation unit, stack overflow will result.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1508) If the nesting takes the form of loops, perhaps in the guise of tail
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1509) recursion, either the control variable will overflow or (in the Linux
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1510) kernel) you will get an RCU CPU stall warning. Nevertheless, this class
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1511) of RCU implementations is one of the most composable constructs in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1512) existence.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1513)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1514) RCU implementations that explicitly track nesting depth are limited by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1515) the nesting-depth counter. For example, the Linux kernel's preemptible
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1516) RCU limits nesting to ``INT_MAX``. This should suffice for almost all
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1517) practical purposes. That said, a consecutive pair of RCU read-side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1518) critical sections between which there is an operation that waits for a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1519) grace period cannot be enclosed in another RCU read-side critical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1520) section. This is because it is not legal to wait for a grace period
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1521) within an RCU read-side critical section: To do so would result either
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1522) in deadlock or in RCU implicitly splitting the enclosing RCU read-side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1523) critical section, neither of which is conducive to a long-lived and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1524) prosperous kernel.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1525)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1526) It is worth noting that RCU is not alone in limiting composability. For
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1527) example, many transactional-memory implementations prohibit composing a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1528) pair of transactions separated by an irrevocable operation (for example,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1529) a network receive operation). For another example, lock-based critical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1530) sections can be composed surprisingly freely, but only if deadlock is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1531) avoided.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1532)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1533) In short, although RCU read-side critical sections are highly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1534) composable, care is required in some situations, just as is the case for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1535) any other composable synchronization mechanism.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1536)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1537) Corner Cases
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1538) ~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1539)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1540) A given RCU workload might have an endless and intense stream of RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1541) read-side critical sections, perhaps even so intense that there was
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1542) never a point in time during which there was not at least one RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1543) read-side critical section in flight. RCU cannot allow this situation to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1544) block grace periods: As long as all the RCU read-side critical sections
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1545) are finite, grace periods must also be finite.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1546)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1547) That said, preemptible RCU implementations could potentially result in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1548) RCU read-side critical sections being preempted for long durations,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1549) which has the effect of creating a long-duration RCU read-side critical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1550) section. This situation can arise only in heavily loaded systems, but
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1551) systems using real-time priorities are of course more vulnerable.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1552) Therefore, RCU priority boosting is provided to help deal with this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1553) case. That said, the exact requirements on RCU priority boosting will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1554) likely evolve as more experience accumulates.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1555)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1556) Other workloads might have very high update rates. Although one can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1557) argue that such workloads should instead use something other than RCU,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1558) the fact remains that RCU must handle such workloads gracefully. This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1559) requirement is another factor driving batching of grace periods, but it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1560) is also the driving force behind the checks for large numbers of queued
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1561) RCU callbacks in the ``call_rcu()`` code path. Finally, high update
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1562) rates should not delay RCU read-side critical sections, although some
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1563) small read-side delays can occur when using
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1564) ``synchronize_rcu_expedited()``, courtesy of this function's use of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1565) ``smp_call_function_single()``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1566)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1567) Although all three of these corner cases were understood in the early
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1568) 1990s, a simple user-level test consisting of ``close(open(path))`` in a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1569) tight loop in the early 2000s suddenly provided a much deeper
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1570) appreciation of the high-update-rate corner case. This test also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1571) motivated addition of some RCU code to react to high update rates, for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1572) example, if a given CPU finds itself with more than 10,000 RCU callbacks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1573) queued, it will cause RCU to take evasive action by more aggressively
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1574) starting grace periods and more aggressively forcing completion of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1575) grace-period processing. This evasive action causes the grace period to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1576) complete more quickly, but at the cost of restricting RCU's batching
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1577) optimizations, thus increasing the CPU overhead incurred by that grace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1578) period.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1579)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1580) Software-Engineering Requirements
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1581) ---------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1582)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1583) Between Murphy's Law and “To err is human”, it is necessary to guard
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1584) against mishaps and misuse:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1585)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1586) #. It is all too easy to forget to use ``rcu_read_lock()`` everywhere
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1587) that it is needed, so kernels built with ``CONFIG_PROVE_RCU=y`` will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1588) splat if ``rcu_dereference()`` is used outside of an RCU read-side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1589) critical section. Update-side code can use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1590) ``rcu_dereference_protected()``, which takes a `lockdep
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1591) expression <https://lwn.net/Articles/371986/>`__ to indicate what is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1592) providing the protection. If the indicated protection is not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1593) provided, a lockdep splat is emitted.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1594) Code shared between readers and updaters can use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1595) ``rcu_dereference_check()``, which also takes a lockdep expression,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1596) and emits a lockdep splat if neither ``rcu_read_lock()`` nor the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1597) indicated protection is in place. In addition,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1598) ``rcu_dereference_raw()`` is used in those (hopefully rare) cases
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1599) where the required protection cannot be easily described. Finally,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1600) ``rcu_read_lock_held()`` is provided to allow a function to verify
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1601) that it has been invoked within an RCU read-side critical section. I
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1602) was made aware of this set of requirements shortly after Thomas
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1603) Gleixner audited a number of RCU uses.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1604) #. A given function might wish to check for RCU-related preconditions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1605) upon entry, before using any other RCU API. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1606) ``rcu_lockdep_assert()`` does this job, asserting the expression in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1607) kernels having lockdep enabled and doing nothing otherwise.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1608) #. It is also easy to forget to use ``rcu_assign_pointer()`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1609) ``rcu_dereference()``, perhaps (incorrectly) substituting a simple
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1610) assignment. To catch this sort of error, a given RCU-protected
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1611) pointer may be tagged with ``__rcu``, after which sparse will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1612) complain about simple-assignment accesses to that pointer. Arnd
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1613) Bergmann made me aware of this requirement, and also supplied the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1614) needed `patch series <https://lwn.net/Articles/376011/>`__.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1615) #. Kernels built with ``CONFIG_DEBUG_OBJECTS_RCU_HEAD=y`` will splat if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1616) a data element is passed to ``call_rcu()`` twice in a row, without a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1617) grace period in between. (This error is similar to a double free.)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1618) The corresponding ``rcu_head`` structures that are dynamically
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1619) allocated are automatically tracked, but ``rcu_head`` structures
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1620) allocated on the stack must be initialized with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1621) ``init_rcu_head_on_stack()`` and cleaned up with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1622) ``destroy_rcu_head_on_stack()``. Similarly, statically allocated
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1623) non-stack ``rcu_head`` structures must be initialized with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1624) ``init_rcu_head()`` and cleaned up with ``destroy_rcu_head()``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1625) Mathieu Desnoyers made me aware of this requirement, and also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1626) supplied the needed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1627) `patch <https://lkml.kernel.org/g/20100319013024.GA28456@Krystal>`__.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1628) #. An infinite loop in an RCU read-side critical section will eventually
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1629) trigger an RCU CPU stall warning splat, with the duration of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1630) “eventually” being controlled by the ``RCU_CPU_STALL_TIMEOUT``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1631) ``Kconfig`` option, or, alternatively, by the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1632) ``rcupdate.rcu_cpu_stall_timeout`` boot/sysfs parameter. However, RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1633) is not obligated to produce this splat unless there is a grace period
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1634) waiting on that particular RCU read-side critical section.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1635)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1636) Some extreme workloads might intentionally delay RCU grace periods,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1637) and systems running those workloads can be booted with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1638) ``rcupdate.rcu_cpu_stall_suppress`` to suppress the splats. This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1639) kernel parameter may also be set via ``sysfs``. Furthermore, RCU CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1640) stall warnings are counter-productive during sysrq dumps and during
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1641) panics. RCU therefore supplies the ``rcu_sysrq_start()`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1642) ``rcu_sysrq_end()`` API members to be called before and after long
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1643) sysrq dumps. RCU also supplies the ``rcu_panic()`` notifier that is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1644) automatically invoked at the beginning of a panic to suppress further
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1645) RCU CPU stall warnings.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1646)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1647) This requirement made itself known in the early 1990s, pretty much
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1648) the first time that it was necessary to debug a CPU stall. That said,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1649) the initial implementation in DYNIX/ptx was quite generic in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1650) comparison with that of Linux.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1651)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1652) #. Although it would be very good to detect pointers leaking out of RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1653) read-side critical sections, there is currently no good way of doing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1654) this. One complication is the need to distinguish between pointers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1655) leaking and pointers that have been handed off from RCU to some other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1656) synchronization mechanism, for example, reference counting.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1657) #. In kernels built with ``CONFIG_RCU_TRACE=y``, RCU-related information
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1658) is provided via event tracing.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1659) #. Open-coded use of ``rcu_assign_pointer()`` and ``rcu_dereference()``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1660) to create typical linked data structures can be surprisingly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1661) error-prone. Therefore, RCU-protected `linked
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1662) lists <https://lwn.net/Articles/609973/#RCU%20List%20APIs>`__ and,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1663) more recently, RCU-protected `hash
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1664) tables <https://lwn.net/Articles/612100/>`__ are available. Many
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1665) other special-purpose RCU-protected data structures are available in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1666) the Linux kernel and the userspace RCU library.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1667) #. Some linked structures are created at compile time, but still require
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1668) ``__rcu`` checking. The ``RCU_POINTER_INITIALIZER()`` macro serves
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1669) this purpose.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1670) #. It is not necessary to use ``rcu_assign_pointer()`` when creating
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1671) linked structures that are to be published via a single external
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1672) pointer. The ``RCU_INIT_POINTER()`` macro is provided for this task
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1673) and also for assigning ``NULL`` pointers at runtime.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1674)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1675) This not a hard-and-fast list: RCU's diagnostic capabilities will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1676) continue to be guided by the number and type of usage bugs found in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1677) real-world RCU usage.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1678)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1679) Linux Kernel Complications
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1680) --------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1681)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1682) The Linux kernel provides an interesting environment for all kinds of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1683) software, including RCU. Some of the relevant points of interest are as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1684) follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1685)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1686) #. `Configuration`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1687) #. `Firmware Interface`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1688) #. `Early Boot`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1689) #. `Interrupts and NMIs`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1690) #. `Loadable Modules`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1691) #. `Hotplug CPU`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1692) #. `Scheduler and RCU`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1693) #. `Tracing and RCU`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1694) #. `Accesses to User Memory and RCU`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1695) #. `Energy Efficiency`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1696) #. `Scheduling-Clock Interrupts and RCU`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1697) #. `Memory Efficiency`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1698) #. `Performance, Scalability, Response Time, and Reliability`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1699)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1700) This list is probably incomplete, but it does give a feel for the most
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1701) notable Linux-kernel complications. Each of the following sections
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1702) covers one of the above topics.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1703)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1704) Configuration
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1705) ~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1706)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1707) RCU's goal is automatic configuration, so that almost nobody needs to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1708) worry about RCU's ``Kconfig`` options. And for almost all users, RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1709) does in fact work well “out of the box.”
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1710)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1711) However, there are specialized use cases that are handled by kernel boot
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1712) parameters and ``Kconfig`` options. Unfortunately, the ``Kconfig``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1713) system will explicitly ask users about new ``Kconfig`` options, which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1714) requires almost all of them be hidden behind a ``CONFIG_RCU_EXPERT``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1715) ``Kconfig`` option.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1716)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1717) This all should be quite obvious, but the fact remains that Linus
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1718) Torvalds recently had to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1719) `remind <https://lkml.kernel.org/g/CA+55aFy4wcCwaL4okTs8wXhGZ5h-ibecy_Meg9C4MNQrUnwMcg@mail.gmail.com>`__
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1720) me of this requirement.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1721)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1722) Firmware Interface
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1723) ~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1724)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1725) In many cases, kernel obtains information about the system from the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1726) firmware, and sometimes things are lost in translation. Or the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1727) translation is accurate, but the original message is bogus.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1728)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1729) For example, some systems' firmware overreports the number of CPUs,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1730) sometimes by a large factor. If RCU naively believed the firmware, as it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1731) used to do, it would create too many per-CPU kthreads. Although the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1732) resulting system will still run correctly, the extra kthreads needlessly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1733) consume memory and can cause confusion when they show up in ``ps``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1734) listings.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1735)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1736) RCU must therefore wait for a given CPU to actually come online before
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1737) it can allow itself to believe that the CPU actually exists. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1738) resulting “ghost CPUs” (which are never going to come online) cause a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1739) number of `interesting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1740) complications <https://paulmck.livejournal.com/37494.html>`__.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1741)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1742) Early Boot
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1743) ~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1744)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1745) The Linux kernel's boot sequence is an interesting process, and RCU is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1746) used early, even before ``rcu_init()`` is invoked. In fact, a number of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1747) RCU's primitives can be used as soon as the initial task's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1748) ``task_struct`` is available and the boot CPU's per-CPU variables are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1749) set up. The read-side primitives (``rcu_read_lock()``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1750) ``rcu_read_unlock()``, ``rcu_dereference()``, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1751) ``rcu_access_pointer()``) will operate normally very early on, as will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1752) ``rcu_assign_pointer()``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1753)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1754) Although ``call_rcu()`` may be invoked at any time during boot,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1755) callbacks are not guaranteed to be invoked until after all of RCU's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1756) kthreads have been spawned, which occurs at ``early_initcall()`` time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1757) This delay in callback invocation is due to the fact that RCU does not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1758) invoke callbacks until it is fully initialized, and this full
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1759) initialization cannot occur until after the scheduler has initialized
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1760) itself to the point where RCU can spawn and run its kthreads. In theory,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1761) it would be possible to invoke callbacks earlier, however, this is not a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1762) panacea because there would be severe restrictions on what operations
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1763) those callbacks could invoke.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1764)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1765) Perhaps surprisingly, ``synchronize_rcu()`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1766) ``synchronize_rcu_expedited()``, will operate normally during very early
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1767) boot, the reason being that there is only one CPU and preemption is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1768) disabled. This means that the call ``synchronize_rcu()`` (or friends)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1769) itself is a quiescent state and thus a grace period, so the early-boot
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1770) implementation can be a no-op.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1771)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1772) However, once the scheduler has spawned its first kthread, this early
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1773) boot trick fails for ``synchronize_rcu()`` (as well as for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1774) ``synchronize_rcu_expedited()``) in ``CONFIG_PREEMPT=y`` kernels. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1775) reason is that an RCU read-side critical section might be preempted,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1776) which means that a subsequent ``synchronize_rcu()`` really does have to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1777) wait for something, as opposed to simply returning immediately.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1778) Unfortunately, ``synchronize_rcu()`` can't do this until all of its
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1779) kthreads are spawned, which doesn't happen until some time during
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1780) ``early_initcalls()`` time. But this is no excuse: RCU is nevertheless
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1781) required to correctly handle synchronous grace periods during this time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1782) period. Once all of its kthreads are up and running, RCU starts running
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1783) normally.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1784)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1785) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1786) | **Quick Quiz**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1787) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1788) | How can RCU possibly handle grace periods before all of its kthreads |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1789) | have been spawned??? |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1790) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1791) | **Answer**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1792) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1793) | Very carefully! |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1794) | During the “dead zone” between the time that the scheduler spawns the |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1795) | first task and the time that all of RCU's kthreads have been spawned, |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1796) | all synchronous grace periods are handled by the expedited |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1797) | grace-period mechanism. At runtime, this expedited mechanism relies |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1798) | on workqueues, but during the dead zone the requesting task itself |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1799) | drives the desired expedited grace period. Because dead-zone |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1800) | execution takes place within task context, everything works. Once the |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1801) | dead zone ends, expedited grace periods go back to using workqueues, |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1802) | as is required to avoid problems that would otherwise occur when a |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1803) | user task received a POSIX signal while driving an expedited grace |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1804) | period. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1805) | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1806) | And yes, this does mean that it is unhelpful to send POSIX signals to |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1807) | random tasks between the time that the scheduler spawns its first |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1808) | kthread and the time that RCU's kthreads have all been spawned. If |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1809) | there ever turns out to be a good reason for sending POSIX signals |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1810) | during that time, appropriate adjustments will be made. (If it turns |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1811) | out that POSIX signals are sent during this time for no good reason, |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1812) | other adjustments will be made, appropriate or otherwise.) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1813) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1814)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1815) I learned of these boot-time requirements as a result of a series of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1816) system hangs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1817)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1818) Interrupts and NMIs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1819) ~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1820)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1821) The Linux kernel has interrupts, and RCU read-side critical sections are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1822) legal within interrupt handlers and within interrupt-disabled regions of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1823) code, as are invocations of ``call_rcu()``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1824)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1825) Some Linux-kernel architectures can enter an interrupt handler from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1826) non-idle process context, and then just never leave it, instead
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1827) stealthily transitioning back to process context. This trick is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1828) sometimes used to invoke system calls from inside the kernel. These
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1829) “half-interrupts” mean that RCU has to be very careful about how it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1830) counts interrupt nesting levels. I learned of this requirement the hard
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1831) way during a rewrite of RCU's dyntick-idle code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1832)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1833) The Linux kernel has non-maskable interrupts (NMIs), and RCU read-side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1834) critical sections are legal within NMI handlers. Thankfully, RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1835) update-side primitives, including ``call_rcu()``, are prohibited within
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1836) NMI handlers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1837)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1838) The name notwithstanding, some Linux-kernel architectures can have
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1839) nested NMIs, which RCU must handle correctly. Andy Lutomirski `surprised
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1840) me <https://lkml.kernel.org/r/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com>`__
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1841) with this requirement; he also kindly surprised me with `an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1842) algorithm <https://lkml.kernel.org/r/CALCETrXSY9JpW3uE6H8WYk81sg56qasA2aqmjMPsq5dOtzso=g@mail.gmail.com>`__
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1843) that meets this requirement.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1844)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1845) Furthermore, NMI handlers can be interrupted by what appear to RCU to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1846) normal interrupts. One way that this can happen is for code that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1847) directly invokes ``rcu_irq_enter()`` and ``rcu_irq_exit()`` to be called
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1848) from an NMI handler. This astonishing fact of life prompted the current
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1849) code structure, which has ``rcu_irq_enter()`` invoking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1850) ``rcu_nmi_enter()`` and ``rcu_irq_exit()`` invoking ``rcu_nmi_exit()``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1851) And yes, I also learned of this requirement the hard way.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1852)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1853) Loadable Modules
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1854) ~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1855)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1856) The Linux kernel has loadable modules, and these modules can also be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1857) unloaded. After a given module has been unloaded, any attempt to call
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1858) one of its functions results in a segmentation fault. The module-unload
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1859) functions must therefore cancel any delayed calls to loadable-module
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1860) functions, for example, any outstanding ``mod_timer()`` must be dealt
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1861) with via ``del_timer_sync()`` or similar.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1862)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1863) Unfortunately, there is no way to cancel an RCU callback; once you
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1864) invoke ``call_rcu()``, the callback function is eventually going to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1865) invoked, unless the system goes down first. Because it is normally
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1866) considered socially irresponsible to crash the system in response to a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1867) module unload request, we need some other way to deal with in-flight RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1868) callbacks.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1869)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1870) RCU therefore provides ``rcu_barrier()``, which waits until all
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1871) in-flight RCU callbacks have been invoked. If a module uses
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1872) ``call_rcu()``, its exit function should therefore prevent any future
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1873) invocation of ``call_rcu()``, then invoke ``rcu_barrier()``. In theory,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1874) the underlying module-unload code could invoke ``rcu_barrier()``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1875) unconditionally, but in practice this would incur unacceptable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1876) latencies.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1877)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1878) Nikita Danilov noted this requirement for an analogous
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1879) filesystem-unmount situation, and Dipankar Sarma incorporated
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1880) ``rcu_barrier()`` into RCU. The need for ``rcu_barrier()`` for module
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1881) unloading became apparent later.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1882)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1883) .. important::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1884)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1885) The ``rcu_barrier()`` function is not, repeat,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1886) *not*, obligated to wait for a grace period. It is instead only required
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1887) to wait for RCU callbacks that have already been posted. Therefore, if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1888) there are no RCU callbacks posted anywhere in the system,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1889) ``rcu_barrier()`` is within its rights to return immediately. Even if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1890) there are callbacks posted, ``rcu_barrier()`` does not necessarily need
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1891) to wait for a grace period.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1892)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1893) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1894) | **Quick Quiz**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1895) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1896) | Wait a minute! Each RCU callbacks must wait for a grace period to |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1897) | complete, and ``rcu_barrier()`` must wait for each pre-existing |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1898) | callback to be invoked. Doesn't ``rcu_barrier()`` therefore need to |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1899) | wait for a full grace period if there is even one callback posted |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1900) | anywhere in the system? |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1901) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1902) | **Answer**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1903) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1904) | Absolutely not!!! |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1905) | Yes, each RCU callbacks must wait for a grace period to complete, but |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1906) | it might well be partly (or even completely) finished waiting by the |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1907) | time ``rcu_barrier()`` is invoked. In that case, ``rcu_barrier()`` |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1908) | need only wait for the remaining portion of the grace period to |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1909) | elapse. So even if there are quite a few callbacks posted, |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1910) | ``rcu_barrier()`` might well return quite quickly. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1911) | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1912) | So if you need to wait for a grace period as well as for all |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1913) | pre-existing callbacks, you will need to invoke both |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1914) | ``synchronize_rcu()`` and ``rcu_barrier()``. If latency is a concern, |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1915) | you can always use workqueues to invoke them concurrently. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1916) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1917)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1918) Hotplug CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1919) ~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1920)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1921) The Linux kernel supports CPU hotplug, which means that CPUs can come
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1922) and go. It is of course illegal to use any RCU API member from an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1923) offline CPU, with the exception of `SRCU <#Sleepable%20RCU>`__ read-side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1924) critical sections. This requirement was present from day one in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1925) DYNIX/ptx, but on the other hand, the Linux kernel's CPU-hotplug
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1926) implementation is “interesting.”
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1927)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1928) The Linux-kernel CPU-hotplug implementation has notifiers that are used
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1929) to allow the various kernel subsystems (including RCU) to respond
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1930) appropriately to a given CPU-hotplug operation. Most RCU operations may
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1931) be invoked from CPU-hotplug notifiers, including even synchronous
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1932) grace-period operations such as ``synchronize_rcu()`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1933) ``synchronize_rcu_expedited()``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1934)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1935) However, all-callback-wait operations such as ``rcu_barrier()`` are also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1936) not supported, due to the fact that there are phases of CPU-hotplug
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1937) operations where the outgoing CPU's callbacks will not be invoked until
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1938) after the CPU-hotplug operation ends, which could also result in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1939) deadlock. Furthermore, ``rcu_barrier()`` blocks CPU-hotplug operations
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1940) during its execution, which results in another type of deadlock when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1941) invoked from a CPU-hotplug notifier.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1942)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1943) Scheduler and RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1944) ~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1945)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1946) RCU makes use of kthreads, and it is necessary to avoid excessive CPU-time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1947) accumulation by these kthreads. This requirement was no surprise, but
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1948) RCU's violation of it when running context-switch-heavy workloads when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1949) built with ``CONFIG_NO_HZ_FULL=y`` `did come as a surprise
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1950) [PDF] <http://www.rdrop.com/users/paulmck/scalability/paper/BareMetal.2015.01.15b.pdf>`__.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1951) RCU has made good progress towards meeting this requirement, even for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1952) context-switch-heavy ``CONFIG_NO_HZ_FULL=y`` workloads, but there is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1953) room for further improvement.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1954)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1955) There is no longer any prohibition against holding any of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1956) scheduler's runqueue or priority-inheritance spinlocks across an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1957) ``rcu_read_unlock()``, even if interrupts and preemption were enabled
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1958) somewhere within the corresponding RCU read-side critical section.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1959) Therefore, it is now perfectly legal to execute ``rcu_read_lock()``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1960) with preemption enabled, acquire one of the scheduler locks, and hold
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1961) that lock across the matching ``rcu_read_unlock()``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1962)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1963) Similarly, the RCU flavor consolidation has removed the need for negative
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1964) nesting. The fact that interrupt-disabled regions of code act as RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1965) read-side critical sections implicitly avoids earlier issues that used
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1966) to result in destructive recursion via interrupt handler's use of RCU.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1967)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1968) Tracing and RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1969) ~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1970)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1971) It is possible to use tracing on RCU code, but tracing itself uses RCU.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1972) For this reason, ``rcu_dereference_raw_check()`` is provided for use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1973) by tracing, which avoids the destructive recursion that could otherwise
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1974) ensue. This API is also used by virtualization in some architectures,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1975) where RCU readers execute in environments in which tracing cannot be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1976) used. The tracing folks both located the requirement and provided the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1977) needed fix, so this surprise requirement was relatively painless.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1978)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1979) Accesses to User Memory and RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1980) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1981)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1982) The kernel needs to access user-space memory, for example, to access data
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1983) referenced by system-call parameters. The ``get_user()`` macro does this job.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1984)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1985) However, user-space memory might well be paged out, which means that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1986) ``get_user()`` might well page-fault and thus block while waiting for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1987) resulting I/O to complete. It would be a very bad thing for the compiler to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1988) reorder a ``get_user()`` invocation into an RCU read-side critical section.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1989)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1990) For example, suppose that the source code looked like this:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1991)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1992) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1993)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1994) 1 rcu_read_lock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1995) 2 p = rcu_dereference(gp);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1996) 3 v = p->value;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1997) 4 rcu_read_unlock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1998) 5 get_user(user_v, user_p);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1999) 6 do_something_with(v, user_v);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2000)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2001) The compiler must not be permitted to transform this source code into
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2002) the following:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2003)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2004) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2005)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2006) 1 rcu_read_lock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2007) 2 p = rcu_dereference(gp);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2008) 3 get_user(user_v, user_p); // BUG: POSSIBLE PAGE FAULT!!!
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2009) 4 v = p->value;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2010) 5 rcu_read_unlock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2011) 6 do_something_with(v, user_v);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2012)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2013) If the compiler did make this transformation in a ``CONFIG_PREEMPT=n`` kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2014) build, and if ``get_user()`` did page fault, the result would be a quiescent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2015) state in the middle of an RCU read-side critical section. This misplaced
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2016) quiescent state could result in line 4 being a use-after-free access,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2017) which could be bad for your kernel's actuarial statistics. Similar examples
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2018) can be constructed with the call to ``get_user()`` preceding the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2019) ``rcu_read_lock()``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2020)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2021) Unfortunately, ``get_user()`` doesn't have any particular ordering properties,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2022) and in some architectures the underlying ``asm`` isn't even marked
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2023) ``volatile``. And even if it was marked ``volatile``, the above access to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2024) ``p->value`` is not volatile, so the compiler would not have any reason to keep
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2025) those two accesses in order.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2026)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2027) Therefore, the Linux-kernel definitions of ``rcu_read_lock()`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2028) ``rcu_read_unlock()`` must act as compiler barriers, at least for outermost
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2029) instances of ``rcu_read_lock()`` and ``rcu_read_unlock()`` within a nested set
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2030) of RCU read-side critical sections.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2031)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2032) Energy Efficiency
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2033) ~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2034)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2035) Interrupting idle CPUs is considered socially unacceptable, especially
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2036) by people with battery-powered embedded systems. RCU therefore conserves
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2037) energy by detecting which CPUs are idle, including tracking CPUs that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2038) have been interrupted from idle. This is a large part of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2039) energy-efficiency requirement, so I learned of this via an irate phone
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2040) call.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2041)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2042) Because RCU avoids interrupting idle CPUs, it is illegal to execute an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2043) RCU read-side critical section on an idle CPU. (Kernels built with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2044) ``CONFIG_PROVE_RCU=y`` will splat if you try it.) The ``RCU_NONIDLE()``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2045) macro and ``_rcuidle`` event tracing is provided to work around this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2046) restriction. In addition, ``rcu_is_watching()`` may be used to test
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2047) whether or not it is currently legal to run RCU read-side critical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2048) sections on this CPU. I learned of the need for diagnostics on the one
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2049) hand and ``RCU_NONIDLE()`` on the other while inspecting idle-loop code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2050) Steven Rostedt supplied ``_rcuidle`` event tracing, which is used quite
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2051) heavily in the idle loop. However, there are some restrictions on the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2052) code placed within ``RCU_NONIDLE()``:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2053)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2054) #. Blocking is prohibited. In practice, this is not a serious
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2055) restriction given that idle tasks are prohibited from blocking to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2056) begin with.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2057) #. Although nesting ``RCU_NONIDLE()`` is permitted, they cannot nest
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2058) indefinitely deeply. However, given that they can be nested on the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2059) order of a million deep, even on 32-bit systems, this should not be a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2060) serious restriction. This nesting limit would probably be reached
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2061) long after the compiler OOMed or the stack overflowed.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2062) #. Any code path that enters ``RCU_NONIDLE()`` must sequence out of that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2063) same ``RCU_NONIDLE()``. For example, the following is grossly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2064) illegal:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2065)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2066) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2067)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2068) 1 RCU_NONIDLE({
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2069) 2 do_something();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2070) 3 goto bad_idea; /* BUG!!! */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2071) 4 do_something_else();});
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2072) 5 bad_idea:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2073)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2074)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2075) It is just as illegal to transfer control into the middle of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2076) ``RCU_NONIDLE()``'s argument. Yes, in theory, you could transfer in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2077) as long as you also transferred out, but in practice you could also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2078) expect to get sharply worded review comments.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2079)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2080) It is similarly socially unacceptable to interrupt an ``nohz_full`` CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2081) running in userspace. RCU must therefore track ``nohz_full`` userspace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2082) execution. RCU must therefore be able to sample state at two points in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2083) time, and be able to determine whether or not some other CPU spent any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2084) time idle and/or executing in userspace.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2085)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2086) These energy-efficiency requirements have proven quite difficult to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2087) understand and to meet, for example, there have been more than five
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2088) clean-sheet rewrites of RCU's energy-efficiency code, the last of which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2089) was finally able to demonstrate `real energy savings running on real
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2090) hardware
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2091) [PDF] <http://www.rdrop.com/users/paulmck/realtime/paper/AMPenergy.2013.04.19a.pdf>`__.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2092) As noted earlier, I learned of many of these requirements via angry
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2093) phone calls: Flaming me on the Linux-kernel mailing list was apparently
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2094) not sufficient to fully vent their ire at RCU's energy-efficiency bugs!
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2095)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2096) Scheduling-Clock Interrupts and RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2097) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2098)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2099) The kernel transitions between in-kernel non-idle execution, userspace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2100) execution, and the idle loop. Depending on kernel configuration, RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2101) handles these states differently:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2102)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2103) +-----------------+------------------+------------------+-----------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2104) | ``HZ`` Kconfig | In-Kernel | Usermode | Idle |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2105) +=================+==================+==================+=================+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2106) | ``HZ_PERIODIC`` | Can rely on | Can rely on | Can rely on |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2107) | | scheduling-clock | scheduling-clock | RCU's |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2108) | | interrupt. | interrupt and | dyntick-idle |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2109) | | | its detection | detection. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2110) | | | of interrupt | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2111) | | | from usermode. | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2112) +-----------------+------------------+------------------+-----------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2113) | ``NO_HZ_IDLE`` | Can rely on | Can rely on | Can rely on |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2114) | | scheduling-clock | scheduling-clock | RCU's |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2115) | | interrupt. | interrupt and | dyntick-idle |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2116) | | | its detection | detection. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2117) | | | of interrupt | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2118) | | | from usermode. | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2119) +-----------------+------------------+------------------+-----------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2120) | ``NO_HZ_FULL`` | Can only | Can rely on | Can rely on |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2121) | | sometimes rely | RCU's | RCU's |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2122) | | on | dyntick-idle | dyntick-idle |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2123) | | scheduling-clock | detection. | detection. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2124) | | interrupt. In | | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2125) | | other cases, it | | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2126) | | is necessary to | | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2127) | | bound kernel | | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2128) | | execution times | | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2129) | | and/or use | | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2130) | | IPIs. | | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2131) +-----------------+------------------+------------------+-----------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2132)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2133) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2134) | **Quick Quiz**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2135) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2136) | Why can't ``NO_HZ_FULL`` in-kernel execution rely on the |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2137) | scheduling-clock interrupt, just like ``HZ_PERIODIC`` and |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2138) | ``NO_HZ_IDLE`` do? |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2139) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2140) | **Answer**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2141) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2142) | Because, as a performance optimization, ``NO_HZ_FULL`` does not |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2143) | necessarily re-enable the scheduling-clock interrupt on entry to each |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2144) | and every system call. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2145) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2146)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2147) However, RCU must be reliably informed as to whether any given CPU is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2148) currently in the idle loop, and, for ``NO_HZ_FULL``, also whether that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2149) CPU is executing in usermode, as discussed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2150) `earlier <#Energy%20Efficiency>`__. It also requires that the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2151) scheduling-clock interrupt be enabled when RCU needs it to be:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2152)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2153) #. If a CPU is either idle or executing in usermode, and RCU believes it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2154) is non-idle, the scheduling-clock tick had better be running.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2155) Otherwise, you will get RCU CPU stall warnings. Or at best, very long
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2156) (11-second) grace periods, with a pointless IPI waking the CPU from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2157) time to time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2158) #. If a CPU is in a portion of the kernel that executes RCU read-side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2159) critical sections, and RCU believes this CPU to be idle, you will get
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2160) random memory corruption. **DON'T DO THIS!!!**
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2161) This is one reason to test with lockdep, which will complain about
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2162) this sort of thing.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2163) #. If a CPU is in a portion of the kernel that is absolutely positively
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2164) no-joking guaranteed to never execute any RCU read-side critical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2165) sections, and RCU believes this CPU to be idle, no problem. This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2166) sort of thing is used by some architectures for light-weight
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2167) exception handlers, which can then avoid the overhead of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2168) ``rcu_irq_enter()`` and ``rcu_irq_exit()`` at exception entry and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2169) exit, respectively. Some go further and avoid the entireties of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2170) ``irq_enter()`` and ``irq_exit()``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2171) Just make very sure you are running some of your tests with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2172) ``CONFIG_PROVE_RCU=y``, just in case one of your code paths was in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2173) fact joking about not doing RCU read-side critical sections.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2174) #. If a CPU is executing in the kernel with the scheduling-clock
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2175) interrupt disabled and RCU believes this CPU to be non-idle, and if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2176) the CPU goes idle (from an RCU perspective) every few jiffies, no
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2177) problem. It is usually OK for there to be the occasional gap between
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2178) idle periods of up to a second or so.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2179) If the gap grows too long, you get RCU CPU stall warnings.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2180) #. If a CPU is either idle or executing in usermode, and RCU believes it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2181) to be idle, of course no problem.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2182) #. If a CPU is executing in the kernel, the kernel code path is passing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2183) through quiescent states at a reasonable frequency (preferably about
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2184) once per few jiffies, but the occasional excursion to a second or so
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2185) is usually OK) and the scheduling-clock interrupt is enabled, of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2186) course no problem.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2187) If the gap between a successive pair of quiescent states grows too
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2188) long, you get RCU CPU stall warnings.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2189)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2190) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2191) | **Quick Quiz**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2192) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2193) | But what if my driver has a hardware interrupt handler that can run |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2194) | for many seconds? I cannot invoke ``schedule()`` from an hardware |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2195) | interrupt handler, after all! |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2196) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2197) | **Answer**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2198) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2199) | One approach is to do ``rcu_irq_exit();rcu_irq_enter();`` every so |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2200) | often. But given that long-running interrupt handlers can cause other |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2201) | problems, not least for response time, shouldn't you work to keep |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2202) | your interrupt handler's runtime within reasonable bounds? |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2203) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2204)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2205) But as long as RCU is properly informed of kernel state transitions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2206) between in-kernel execution, usermode execution, and idle, and as long
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2207) as the scheduling-clock interrupt is enabled when RCU needs it to be,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2208) you can rest assured that the bugs you encounter will be in some other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2209) part of RCU or some other part of the kernel!
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2210)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2211) Memory Efficiency
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2212) ~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2213)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2214) Although small-memory non-realtime systems can simply use Tiny RCU, code
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2215) size is only one aspect of memory efficiency. Another aspect is the size
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2216) of the ``rcu_head`` structure used by ``call_rcu()`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2217) ``kfree_rcu()``. Although this structure contains nothing more than a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2218) pair of pointers, it does appear in many RCU-protected data structures,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2219) including some that are size critical. The ``page`` structure is a case
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2220) in point, as evidenced by the many occurrences of the ``union`` keyword
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2221) within that structure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2222)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2223) This need for memory efficiency is one reason that RCU uses hand-crafted
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2224) singly linked lists to track the ``rcu_head`` structures that are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2225) waiting for a grace period to elapse. It is also the reason why
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2226) ``rcu_head`` structures do not contain debug information, such as fields
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2227) tracking the file and line of the ``call_rcu()`` or ``kfree_rcu()`` that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2228) posted them. Although this information might appear in debug-only kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2229) builds at some point, in the meantime, the ``->func`` field will often
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2230) provide the needed debug information.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2231)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2232) However, in some cases, the need for memory efficiency leads to even
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2233) more extreme measures. Returning to the ``page`` structure, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2234) ``rcu_head`` field shares storage with a great many other structures
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2235) that are used at various points in the corresponding page's lifetime. In
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2236) order to correctly resolve certain `race
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2237) conditions <https://lkml.kernel.org/g/1439976106-137226-1-git-send-email-kirill.shutemov@linux.intel.com>`__,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2238) the Linux kernel's memory-management subsystem needs a particular bit to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2239) remain zero during all phases of grace-period processing, and that bit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2240) happens to map to the bottom bit of the ``rcu_head`` structure's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2241) ``->next`` field. RCU makes this guarantee as long as ``call_rcu()`` is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2242) used to post the callback, as opposed to ``kfree_rcu()`` or some future
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2243) “lazy” variant of ``call_rcu()`` that might one day be created for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2244) energy-efficiency purposes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2245)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2246) That said, there are limits. RCU requires that the ``rcu_head``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2247) structure be aligned to a two-byte boundary, and passing a misaligned
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2248) ``rcu_head`` structure to one of the ``call_rcu()`` family of functions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2249) will result in a splat. It is therefore necessary to exercise caution
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2250) when packing structures containing fields of type ``rcu_head``. Why not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2251) a four-byte or even eight-byte alignment requirement? Because the m68k
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2252) architecture provides only two-byte alignment, and thus acts as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2253) alignment's least common denominator.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2254)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2255) The reason for reserving the bottom bit of pointers to ``rcu_head``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2256) structures is to leave the door open to “lazy” callbacks whose
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2257) invocations can safely be deferred. Deferring invocation could
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2258) potentially have energy-efficiency benefits, but only if the rate of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2259) non-lazy callbacks decreases significantly for some important workload.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2260) In the meantime, reserving the bottom bit keeps this option open in case
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2261) it one day becomes useful.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2262)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2263) Performance, Scalability, Response Time, and Reliability
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2264) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2265)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2266) Expanding on the `earlier
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2267) discussion <#Performance%20and%20Scalability>`__, RCU is used heavily by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2268) hot code paths in performance-critical portions of the Linux kernel's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2269) networking, security, virtualization, and scheduling code paths. RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2270) must therefore use efficient implementations, especially in its
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2271) read-side primitives. To that end, it would be good if preemptible RCU's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2272) implementation of ``rcu_read_lock()`` could be inlined, however, doing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2273) this requires resolving ``#include`` issues with the ``task_struct``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2274) structure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2275)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2276) The Linux kernel supports hardware configurations with up to 4096 CPUs,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2277) which means that RCU must be extremely scalable. Algorithms that involve
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2278) frequent acquisitions of global locks or frequent atomic operations on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2279) global variables simply cannot be tolerated within the RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2280) implementation. RCU therefore makes heavy use of a combining tree based
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2281) on the ``rcu_node`` structure. RCU is required to tolerate all CPUs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2282) continuously invoking any combination of RCU's runtime primitives with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2283) minimal per-operation overhead. In fact, in many cases, increasing load
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2284) must *decrease* the per-operation overhead, witness the batching
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2285) optimizations for ``synchronize_rcu()``, ``call_rcu()``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2286) ``synchronize_rcu_expedited()``, and ``rcu_barrier()``. As a general
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2287) rule, RCU must cheerfully accept whatever the rest of the Linux kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2288) decides to throw at it.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2289)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2290) The Linux kernel is used for real-time workloads, especially in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2291) conjunction with the `-rt
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2292) patchset <https://rt.wiki.kernel.org/index.php/Main_Page>`__. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2293) real-time-latency response requirements are such that the traditional
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2294) approach of disabling preemption across RCU read-side critical sections
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2295) is inappropriate. Kernels built with ``CONFIG_PREEMPT=y`` therefore use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2296) an RCU implementation that allows RCU read-side critical sections to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2297) preempted. This requirement made its presence known after users made it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2298) clear that an earlier `real-time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2299) patch <https://lwn.net/Articles/107930/>`__ did not meet their needs, in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2300) conjunction with some `RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2301) issues <https://lkml.kernel.org/g/20050318002026.GA2693@us.ibm.com>`__
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2302) encountered by a very early version of the -rt patchset.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2303)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2304) In addition, RCU must make do with a sub-100-microsecond real-time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2305) latency budget. In fact, on smaller systems with the -rt patchset, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2306) Linux kernel provides sub-20-microsecond real-time latencies for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2307) whole kernel, including RCU. RCU's scalability and latency must
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2308) therefore be sufficient for these sorts of configurations. To my
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2309) surprise, the sub-100-microsecond real-time latency budget `applies to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2310) even the largest systems
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2311) [PDF] <http://www.rdrop.com/users/paulmck/realtime/paper/bigrt.2013.01.31a.LCA.pdf>`__,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2312) up to and including systems with 4096 CPUs. This real-time requirement
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2313) motivated the grace-period kthread, which also simplified handling of a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2314) number of race conditions.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2315)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2316) RCU must avoid degrading real-time response for CPU-bound threads,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2317) whether executing in usermode (which is one use case for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2318) ``CONFIG_NO_HZ_FULL=y``) or in the kernel. That said, CPU-bound loops in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2319) the kernel must execute ``cond_resched()`` at least once per few tens of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2320) milliseconds in order to avoid receiving an IPI from RCU.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2321)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2322) Finally, RCU's status as a synchronization primitive means that any RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2323) failure can result in arbitrary memory corruption that can be extremely
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2324) difficult to debug. This means that RCU must be extremely reliable,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2325) which in practice also means that RCU must have an aggressive
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2326) stress-test suite. This stress-test suite is called ``rcutorture``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2327)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2328) Although the need for ``rcutorture`` was no surprise, the current
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2329) immense popularity of the Linux kernel is posing interesting—and perhaps
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2330) unprecedented—validation challenges. To see this, keep in mind that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2331) there are well over one billion instances of the Linux kernel running
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2332) today, given Android smartphones, Linux-powered televisions, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2333) servers. This number can be expected to increase sharply with the advent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2334) of the celebrated Internet of Things.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2335)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2336) Suppose that RCU contains a race condition that manifests on average
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2337) once per million years of runtime. This bug will be occurring about
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2338) three times per *day* across the installed base. RCU could simply hide
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2339) behind hardware error rates, given that no one should really expect
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2340) their smartphone to last for a million years. However, anyone taking too
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2341) much comfort from this thought should consider the fact that in most
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2342) jurisdictions, a successful multi-year test of a given mechanism, which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2343) might include a Linux kernel, suffices for a number of types of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2344) safety-critical certifications. In fact, rumor has it that the Linux
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2345) kernel is already being used in production for safety-critical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2346) applications. I don't know about you, but I would feel quite bad if a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2347) bug in RCU killed someone. Which might explain my recent focus on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2348) validation and verification.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2349)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2350) Other RCU Flavors
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2351) -----------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2352)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2353) One of the more surprising things about RCU is that there are now no
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2354) fewer than five *flavors*, or API families. In addition, the primary
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2355) flavor that has been the sole focus up to this point has two different
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2356) implementations, non-preemptible and preemptible. The other four flavors
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2357) are listed below, with requirements for each described in a separate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2358) section.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2359)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2360) #. `Bottom-Half Flavor (Historical)`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2361) #. `Sched Flavor (Historical)`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2362) #. `Sleepable RCU`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2363) #. `Tasks RCU`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2364)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2365) Bottom-Half Flavor (Historical)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2366) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2367)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2368) The RCU-bh flavor of RCU has since been expressed in terms of the other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2369) RCU flavors as part of a consolidation of the three flavors into a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2370) single flavor. The read-side API remains, and continues to disable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2371) softirq and to be accounted for by lockdep. Much of the material in this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2372) section is therefore strictly historical in nature.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2373)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2374) The softirq-disable (AKA “bottom-half”, hence the “_bh” abbreviations)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2375) flavor of RCU, or *RCU-bh*, was developed by Dipankar Sarma to provide a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2376) flavor of RCU that could withstand the network-based denial-of-service
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2377) attacks researched by Robert Olsson. These attacks placed so much
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2378) networking load on the system that some of the CPUs never exited softirq
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2379) execution, which in turn prevented those CPUs from ever executing a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2380) context switch, which, in the RCU implementation of that time, prevented
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2381) grace periods from ever ending. The result was an out-of-memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2382) condition and a system hang.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2383)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2384) The solution was the creation of RCU-bh, which does
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2385) ``local_bh_disable()`` across its read-side critical sections, and which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2386) uses the transition from one type of softirq processing to another as a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2387) quiescent state in addition to context switch, idle, user mode, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2388) offline. This means that RCU-bh grace periods can complete even when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2389) some of the CPUs execute in softirq indefinitely, thus allowing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2390) algorithms based on RCU-bh to withstand network-based denial-of-service
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2391) attacks.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2392)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2393) Because ``rcu_read_lock_bh()`` and ``rcu_read_unlock_bh()`` disable and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2394) re-enable softirq handlers, any attempt to start a softirq handlers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2395) during the RCU-bh read-side critical section will be deferred. In this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2396) case, ``rcu_read_unlock_bh()`` will invoke softirq processing, which can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2397) take considerable time. One can of course argue that this softirq
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2398) overhead should be associated with the code following the RCU-bh
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2399) read-side critical section rather than ``rcu_read_unlock_bh()``, but the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2400) fact is that most profiling tools cannot be expected to make this sort
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2401) of fine distinction. For example, suppose that a three-millisecond-long
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2402) RCU-bh read-side critical section executes during a time of heavy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2403) networking load. There will very likely be an attempt to invoke at least
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2404) one softirq handler during that three milliseconds, but any such
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2405) invocation will be delayed until the time of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2406) ``rcu_read_unlock_bh()``. This can of course make it appear at first
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2407) glance as if ``rcu_read_unlock_bh()`` was executing very slowly.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2408)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2409) The `RCU-bh
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2410) API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2411) includes ``rcu_read_lock_bh()``, ``rcu_read_unlock_bh()``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2412) ``rcu_dereference_bh()``, ``rcu_dereference_bh_check()``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2413) ``synchronize_rcu_bh()``, ``synchronize_rcu_bh_expedited()``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2414) ``call_rcu_bh()``, ``rcu_barrier_bh()``, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2415) ``rcu_read_lock_bh_held()``. However, the update-side APIs are now
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2416) simple wrappers for other RCU flavors, namely RCU-sched in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2417) CONFIG_PREEMPT=n kernels and RCU-preempt otherwise.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2418)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2419) Sched Flavor (Historical)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2420) ~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2421)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2422) The RCU-sched flavor of RCU has since been expressed in terms of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2423) other RCU flavors as part of a consolidation of the three flavors into a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2424) single flavor. The read-side API remains, and continues to disable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2425) preemption and to be accounted for by lockdep. Much of the material in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2426) this section is therefore strictly historical in nature.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2427)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2428) Before preemptible RCU, waiting for an RCU grace period had the side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2429) effect of also waiting for all pre-existing interrupt and NMI handlers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2430) However, there are legitimate preemptible-RCU implementations that do
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2431) not have this property, given that any point in the code outside of an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2432) RCU read-side critical section can be a quiescent state. Therefore,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2433) *RCU-sched* was created, which follows “classic” RCU in that an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2434) RCU-sched grace period waits for pre-existing interrupt and NMI
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2435) handlers. In kernels built with ``CONFIG_PREEMPT=n``, the RCU and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2436) RCU-sched APIs have identical implementations, while kernels built with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2437) ``CONFIG_PREEMPT=y`` provide a separate implementation for each.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2438)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2439) Note well that in ``CONFIG_PREEMPT=y`` kernels,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2440) ``rcu_read_lock_sched()`` and ``rcu_read_unlock_sched()`` disable and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2441) re-enable preemption, respectively. This means that if there was a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2442) preemption attempt during the RCU-sched read-side critical section,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2443) ``rcu_read_unlock_sched()`` will enter the scheduler, with all the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2444) latency and overhead entailed. Just as with ``rcu_read_unlock_bh()``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2445) this can make it look as if ``rcu_read_unlock_sched()`` was executing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2446) very slowly. However, the highest-priority task won't be preempted, so
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2447) that task will enjoy low-overhead ``rcu_read_unlock_sched()``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2448) invocations.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2449)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2450) The `RCU-sched
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2451) API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2452) includes ``rcu_read_lock_sched()``, ``rcu_read_unlock_sched()``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2453) ``rcu_read_lock_sched_notrace()``, ``rcu_read_unlock_sched_notrace()``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2454) ``rcu_dereference_sched()``, ``rcu_dereference_sched_check()``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2455) ``synchronize_sched()``, ``synchronize_rcu_sched_expedited()``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2456) ``call_rcu_sched()``, ``rcu_barrier_sched()``, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2457) ``rcu_read_lock_sched_held()``. However, anything that disables
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2458) preemption also marks an RCU-sched read-side critical section, including
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2459) ``preempt_disable()`` and ``preempt_enable()``, ``local_irq_save()`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2460) ``local_irq_restore()``, and so on.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2461)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2462) Sleepable RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2463) ~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2464)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2465) For well over a decade, someone saying “I need to block within an RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2466) read-side critical section” was a reliable indication that this someone
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2467) did not understand RCU. After all, if you are always blocking in an RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2468) read-side critical section, you can probably afford to use a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2469) higher-overhead synchronization mechanism. However, that changed with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2470) the advent of the Linux kernel's notifiers, whose RCU read-side critical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2471) sections almost never sleep, but sometimes need to. This resulted in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2472) introduction of `sleepable RCU <https://lwn.net/Articles/202847/>`__, or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2473) *SRCU*.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2474)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2475) SRCU allows different domains to be defined, with each such domain
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2476) defined by an instance of an ``srcu_struct`` structure. A pointer to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2477) this structure must be passed in to each SRCU function, for example,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2478) ``synchronize_srcu(&ss)``, where ``ss`` is the ``srcu_struct``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2479) structure. The key benefit of these domains is that a slow SRCU reader
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2480) in one domain does not delay an SRCU grace period in some other domain.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2481) That said, one consequence of these domains is that read-side code must
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2482) pass a “cookie” from ``srcu_read_lock()`` to ``srcu_read_unlock()``, for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2483) example, as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2484)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2485) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2486)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2487) 1 int idx;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2488) 2
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2489) 3 idx = srcu_read_lock(&ss);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2490) 4 do_something();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2491) 5 srcu_read_unlock(&ss, idx);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2492)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2493) As noted above, it is legal to block within SRCU read-side critical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2494) sections, however, with great power comes great responsibility. If you
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2495) block forever in one of a given domain's SRCU read-side critical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2496) sections, then that domain's grace periods will also be blocked forever.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2497) Of course, one good way to block forever is to deadlock, which can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2498) happen if any operation in a given domain's SRCU read-side critical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2499) section can wait, either directly or indirectly, for that domain's grace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2500) period to elapse. For example, this results in a self-deadlock:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2501)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2502) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2503)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2504) 1 int idx;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2505) 2
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2506) 3 idx = srcu_read_lock(&ss);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2507) 4 do_something();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2508) 5 synchronize_srcu(&ss);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2509) 6 srcu_read_unlock(&ss, idx);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2510)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2511) However, if line 5 acquired a mutex that was held across a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2512) ``synchronize_srcu()`` for domain ``ss``, deadlock would still be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2513) possible. Furthermore, if line 5 acquired a mutex that was held across a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2514) ``synchronize_srcu()`` for some other domain ``ss1``, and if an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2515) ``ss1``-domain SRCU read-side critical section acquired another mutex
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2516) that was held across as ``ss``-domain ``synchronize_srcu()``, deadlock
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2517) would again be possible. Such a deadlock cycle could extend across an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2518) arbitrarily large number of different SRCU domains. Again, with great
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2519) power comes great responsibility.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2520)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2521) Unlike the other RCU flavors, SRCU read-side critical sections can run
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2522) on idle and even offline CPUs. This ability requires that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2523) ``srcu_read_lock()`` and ``srcu_read_unlock()`` contain memory barriers,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2524) which means that SRCU readers will run a bit slower than would RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2525) readers. It also motivates the ``smp_mb__after_srcu_read_unlock()`` API,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2526) which, in combination with ``srcu_read_unlock()``, guarantees a full
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2527) memory barrier.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2528)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2529) Also unlike other RCU flavors, ``synchronize_srcu()`` may **not** be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2530) invoked from CPU-hotplug notifiers, due to the fact that SRCU grace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2531) periods make use of timers and the possibility of timers being
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2532) temporarily “stranded” on the outgoing CPU. This stranding of timers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2533) means that timers posted to the outgoing CPU will not fire until late in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2534) the CPU-hotplug process. The problem is that if a notifier is waiting on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2535) an SRCU grace period, that grace period is waiting on a timer, and that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2536) timer is stranded on the outgoing CPU, then the notifier will never be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2537) awakened, in other words, deadlock has occurred. This same situation of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2538) course also prohibits ``srcu_barrier()`` from being invoked from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2539) CPU-hotplug notifiers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2540)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2541) SRCU also differs from other RCU flavors in that SRCU's expedited and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2542) non-expedited grace periods are implemented by the same mechanism. This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2543) means that in the current SRCU implementation, expediting a future grace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2544) period has the side effect of expediting all prior grace periods that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2545) have not yet completed. (But please note that this is a property of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2546) current implementation, not necessarily of future implementations.) In
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2547) addition, if SRCU has been idle for longer than the interval specified
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2548) by the ``srcutree.exp_holdoff`` kernel boot parameter (25 microseconds
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2549) by default), and if a ``synchronize_srcu()`` invocation ends this idle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2550) period, that invocation will be automatically expedited.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2551)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2552) As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2553) locking bottleneck present in prior kernel versions. Although this will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2554) allow users to put much heavier stress on ``call_srcu()``, it is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2555) important to note that SRCU does not yet take any special steps to deal
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2556) with callback flooding. So if you are posting (say) 10,000 SRCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2557) callbacks per second per CPU, you are probably totally OK, but if you
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2558) intend to post (say) 1,000,000 SRCU callbacks per second per CPU, please
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2559) run some tests first. SRCU just might need a few adjustment to deal with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2560) that sort of load. Of course, your mileage may vary based on the speed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2561) of your CPUs and the size of your memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2562)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2563) The `SRCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2564) API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2565) includes ``srcu_read_lock()``, ``srcu_read_unlock()``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2566) ``srcu_dereference()``, ``srcu_dereference_check()``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2567) ``synchronize_srcu()``, ``synchronize_srcu_expedited()``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2568) ``call_srcu()``, ``srcu_barrier()``, and ``srcu_read_lock_held()``. It
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2569) also includes ``DEFINE_SRCU()``, ``DEFINE_STATIC_SRCU()``, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2570) ``init_srcu_struct()`` APIs for defining and initializing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2571) ``srcu_struct`` structures.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2572)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2573) Tasks RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2574) ~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2575)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2576) Some forms of tracing use “trampolines” to handle the binary rewriting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2577) required to install different types of probes. It would be good to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2578) able to free old trampolines, which sounds like a job for some form of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2579) RCU. However, because it is necessary to be able to install a trace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2580) anywhere in the code, it is not possible to use read-side markers such
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2581) as ``rcu_read_lock()`` and ``rcu_read_unlock()``. In addition, it does
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2582) not work to have these markers in the trampoline itself, because there
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2583) would need to be instructions following ``rcu_read_unlock()``. Although
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2584) ``synchronize_rcu()`` would guarantee that execution reached the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2585) ``rcu_read_unlock()``, it would not be able to guarantee that execution
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2586) had completely left the trampoline. Worse yet, in some situations
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2587) the trampoline's protection must extend a few instructions *prior* to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2588) execution reaching the trampoline. For example, these few instructions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2589) might calculate the address of the trampoline, so that entering the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2590) trampoline would be pre-ordained a surprisingly long time before execution
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2591) actually reached the trampoline itself.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2592)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2593) The solution, in the form of `Tasks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2594) RCU <https://lwn.net/Articles/607117/>`__, is to have implicit read-side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2595) critical sections that are delimited by voluntary context switches, that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2596) is, calls to ``schedule()``, ``cond_resched()``, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2597) ``synchronize_rcu_tasks()``. In addition, transitions to and from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2598) userspace execution also delimit tasks-RCU read-side critical sections.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2599)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2600) The tasks-RCU API is quite compact, consisting only of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2601) ``call_rcu_tasks()``, ``synchronize_rcu_tasks()``, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2602) ``rcu_barrier_tasks()``. In ``CONFIG_PREEMPT=n`` kernels, trampolines
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2603) cannot be preempted, so these APIs map to ``call_rcu()``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2604) ``synchronize_rcu()``, and ``rcu_barrier()``, respectively. In
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2605) ``CONFIG_PREEMPT=y`` kernels, trampolines can be preempted, and these
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2606) three APIs are therefore implemented by separate functions that check
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2607) for voluntary context switches.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2608)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2609) Possible Future Changes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2610) -----------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2611)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2612) One of the tricks that RCU uses to attain update-side scalability is to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2613) increase grace-period latency with increasing numbers of CPUs. If this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2614) becomes a serious problem, it will be necessary to rework the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2615) grace-period state machine so as to avoid the need for the additional
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2616) latency.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2617)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2618) RCU disables CPU hotplug in a few places, perhaps most notably in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2619) ``rcu_barrier()`` operations. If there is a strong reason to use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2620) ``rcu_barrier()`` in CPU-hotplug notifiers, it will be necessary to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2621) avoid disabling CPU hotplug. This would introduce some complexity, so
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2622) there had better be a *very* good reason.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2623)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2624) The tradeoff between grace-period latency on the one hand and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2625) interruptions of other CPUs on the other hand may need to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2626) re-examined. The desire is of course for zero grace-period latency as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2627) well as zero interprocessor interrupts undertaken during an expedited
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2628) grace period operation. While this ideal is unlikely to be achievable,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2629) it is quite possible that further improvements can be made.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2630)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2631) The multiprocessor implementations of RCU use a combining tree that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2632) groups CPUs so as to reduce lock contention and increase cache locality.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2633) However, this combining tree does not spread its memory across NUMA
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2634) nodes nor does it align the CPU groups with hardware features such as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2635) sockets or cores. Such spreading and alignment is currently believed to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2636) be unnecessary because the hotpath read-side primitives do not access
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2637) the combining tree, nor does ``call_rcu()`` in the common case. If you
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2638) believe that your architecture needs such spreading and alignment, then
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2639) your architecture should also benefit from the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2640) ``rcutree.rcu_fanout_leaf`` boot parameter, which can be set to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2641) number of CPUs in a socket, NUMA node, or whatever. If the number of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2642) CPUs is too large, use a fraction of the number of CPUs. If the number
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2643) of CPUs is a large prime number, well, that certainly is an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2644) “interesting” architectural choice! More flexible arrangements might be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2645) considered, but only if ``rcutree.rcu_fanout_leaf`` has proven
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2646) inadequate, and only if the inadequacy has been demonstrated by a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2647) carefully run and realistic system-level workload.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2648)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2649) Please note that arrangements that require RCU to remap CPU numbers will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2650) require extremely good demonstration of need and full exploration of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2651) alternatives.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2652)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2653) RCU's various kthreads are reasonably recent additions. It is quite
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2654) likely that adjustments will be required to more gracefully handle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2655) extreme loads. It might also be necessary to be able to relate CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2656) utilization by RCU's kthreads and softirq handlers to the code that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2657) instigated this CPU utilization. For example, RCU callback overhead
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2658) might be charged back to the originating ``call_rcu()`` instance, though
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2659) probably not in production kernels.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2660)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2661) Additional work may be required to provide reasonable forward-progress
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2662) guarantees under heavy load for grace periods and for callback
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2663) invocation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2664)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2665) Summary
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2666) -------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2667)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2668) This document has presented more than two decade's worth of RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2669) requirements. Given that the requirements keep changing, this will not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2670) be the last word on this subject, but at least it serves to get an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2671) important subset of the requirements set forth.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2672)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2673) Acknowledgments
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2674) ---------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2675)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2676) I am grateful to Steven Rostedt, Lai Jiangshan, Ingo Molnar, Oleg
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2677) Nesterov, Borislav Petkov, Peter Zijlstra, Boqun Feng, and Andy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2678) Lutomirski for their help in rendering this article human readable, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2679) to Michelle Rankin for her support of this effort. Other contributions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2680) are acknowledged in the Linux kernel's git archive.