=================================
A Tour Through RCU's Requirements
=================================

Copyright IBM Corporation, 2015

Author: Paul E. McKenney

The initial version of this document appeared on
`LWN <https://lwn.net/>`_ in these articles:
`part 1 <https://lwn.net/Articles/652156/>`_,
`part 2 <https://lwn.net/Articles/652677/>`_, and
`part 3 <https://lwn.net/Articles/653326/>`_.

Introduction
------------

Read-copy update (RCU) is a synchronization mechanism that is often used
as a replacement for reader-writer locking. RCU is unusual in that
updaters do not block readers, which means that RCU's read-side
primitives can be exceedingly fast and scalable. In addition, updaters
can make useful forward progress concurrently with readers. However, all
this concurrency between RCU readers and updaters does raise the
question of exactly what RCU readers are doing, which in turn raises the
question of exactly what RCU's requirements are.

This document therefore summarizes RCU's requirements, and can be
thought of as an informal, high-level specification for RCU. It is
important to understand that RCU's specification is primarily empirical
in nature; in fact, I learned about many of these requirements the hard
way. This situation might cause some consternation; however, not only
has this learning process been a lot of fun, but it has also been a
great privilege to work with so many people willing to apply
technologies in interesting new ways.

All that aside, here are the categories of currently known RCU
requirements:

#. `Fundamental Requirements`_
#. `Fundamental Non-Requirements`_
#. `Parallelism Facts of Life`_
#. `Quality-of-Implementation Requirements`_
#. `Linux Kernel Complications`_
#. `Software-Engineering Requirements`_
#. `Other RCU Flavors`_
#. `Possible Future Changes`_

This is followed by a `summary <#Summary>`__. Note that the answer to
each quick quiz immediately follows the quiz.

Fundamental Requirements
------------------------

RCU's fundamental requirements are the closest thing RCU has to hard
mathematical requirements. These are:

#. `Grace-Period Guarantee`_
#. `Publish/Subscribe Guarantee`_
#. `Memory-Barrier Guarantees`_
#. `RCU Primitives Guaranteed to Execute Unconditionally`_
#. `Guaranteed Read-to-Write Upgrade`_

Grace-Period Guarantee
~~~~~~~~~~~~~~~~~~~~~~

RCU's grace-period guarantee is unusual in being premeditated: Jack
Slingwine and I had this guarantee firmly in mind when we started work
on RCU (then called “rclock”) in the early 1990s. That said, the past
two decades of experience with RCU have produced a much more detailed
understanding of this guarantee.

RCU's grace-period guarantee allows updaters to wait for the completion
of all pre-existing RCU read-side critical sections. An RCU read-side
critical section begins with the marker ``rcu_read_lock()`` and ends
with the marker ``rcu_read_unlock()``. These markers may be nested, and
RCU treats a nested set as one big RCU read-side critical section.
Production-quality implementations of ``rcu_read_lock()`` and
``rcu_read_unlock()`` are extremely lightweight, and in fact have
exactly zero overhead in Linux kernels built for production use with
``CONFIG_PREEMPT=n``.

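For example, in the following sketch (where ``do_one_lookup()`` and
friends are hypothetical placeholders), the inner ``rcu_read_lock()``
and ``rcu_read_unlock()`` pair neither begins nor ends a critical
section of its own; the single critical section extends from the first
``rcu_read_lock()`` to the last ``rcu_read_unlock()``:

   ::

       void nested_reader(void)
       {
         rcu_read_lock();    /* Outermost critical section begins. */
         do_one_lookup();
         rcu_read_lock();    /* Nesting is legal... */
         do_another_lookup();
         rcu_read_unlock();  /* ...and this does not end the section... */
         do_yet_another_lookup();
         rcu_read_unlock();  /* ...which ends only here. */
       }
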
This guarantee allows ordering to be enforced with extremely low
overhead to readers, for example:

   ::

       1 int x, y;
       2
       3 void thread0(void)
       4 {
       5   rcu_read_lock();
       6   r1 = READ_ONCE(x);
       7   r2 = READ_ONCE(y);
       8   rcu_read_unlock();
       9 }
      10
      11 void thread1(void)
      12 {
      13   WRITE_ONCE(x, 1);
      14   synchronize_rcu();
      15   WRITE_ONCE(y, 1);
      16 }

Because the ``synchronize_rcu()`` on line 14 waits for all pre-existing
readers, any instance of ``thread0()`` that loads a value of zero from
``x`` must complete before ``thread1()`` stores to ``y``, so that
instance must also load a value of zero from ``y``. Similarly, any
instance of ``thread0()`` that loads a value of one from ``y`` must have
started after the ``synchronize_rcu()`` started, and must therefore also
load a value of one from ``x``. Therefore, the outcome:

   ::

      (r1 == 0 && r2 == 1)

cannot happen.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Wait a minute! You said that updaters can make useful forward         |
| progress concurrently with readers, but pre-existing readers will     |
| block ``synchronize_rcu()``!!!                                        |
| Just who are you trying to fool???                                    |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| First, if updaters do not wish to be blocked by readers, they can use |
| ``call_rcu()`` or ``kfree_rcu()``, which will be discussed later.     |
| Second, even when using ``synchronize_rcu()``, the other update-side  |
| code does run concurrently with readers, whether pre-existing or not. |
+-----------------------------------------------------------------------+

This scenario resembles one of the first uses of RCU in
`DYNIX/ptx <https://en.wikipedia.org/wiki/DYNIX>`__, which managed a
distributed lock manager's transition into a state suitable for handling
recovery from node failure, more or less as follows:

   ::

       1 #define STATE_NORMAL        0
       2 #define STATE_WANT_RECOVERY 1
       3 #define STATE_RECOVERING    2
       4 #define STATE_WANT_NORMAL   3
       5
       6 int state = STATE_NORMAL;
       7
       8 void do_something_dlm(void)
       9 {
      10   int state_snap;
      11
      12   rcu_read_lock();
      13   state_snap = READ_ONCE(state);
      14   if (state_snap == STATE_NORMAL)
      15     do_something();
      16   else
      17     do_something_carefully();
      18   rcu_read_unlock();
      19 }
      20
      21 void start_recovery(void)
      22 {
      23   WRITE_ONCE(state, STATE_WANT_RECOVERY);
      24   synchronize_rcu();
      25   WRITE_ONCE(state, STATE_RECOVERING);
      26   recovery();
      27   WRITE_ONCE(state, STATE_WANT_NORMAL);
      28   synchronize_rcu();
      29   WRITE_ONCE(state, STATE_NORMAL);
      30 }

The RCU read-side critical section in ``do_something_dlm()`` works with
the ``synchronize_rcu()`` in ``start_recovery()`` to guarantee that
``do_something()`` never runs concurrently with ``recovery()``, but with
little or no synchronization overhead in ``do_something_dlm()``.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Why is the ``synchronize_rcu()`` on line 28 needed?                   |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Without that extra grace period, memory reordering could result in    |
| ``do_something_dlm()`` executing ``do_something()`` concurrently with |
| the last bits of ``recovery()``.                                      |
+-----------------------------------------------------------------------+

In order to avoid fatal problems such as deadlocks, an RCU read-side
critical section must not contain calls to ``synchronize_rcu()``.
Similarly, an RCU read-side critical section must not contain anything
that waits, directly or indirectly, on completion of an invocation of
``synchronize_rcu()``.

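For example, this minimal sketch (reusing the hypothetical
``do_something()`` from the earlier examples) deadlocks, because the
``synchronize_rcu()`` waits for all pre-existing readers, including the
very read-side critical section it is executing within:

   ::

       void buggy_reader(void)
       {
         rcu_read_lock();
         do_something();
         synchronize_rcu();  /* BUG: waits for this critical section
                                to end, so it can never return. */
         rcu_read_unlock();
       }
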
Although RCU's grace-period guarantee is useful in and of itself, with
`quite a few use cases <https://lwn.net/Articles/573497/>`__, it would
be good to be able to use RCU to coordinate read-side access to linked
data structures. For this, the grace-period guarantee is not sufficient,
as can be seen in function ``add_gp_buggy()`` below. We will look at the
reader's code later, but in the meantime, just think of the reader as
locklessly picking up the ``gp`` pointer, and, if the value loaded is
non-\ ``NULL``, locklessly accessing the ``->a`` and ``->b`` fields.

   ::

       1 bool add_gp_buggy(int a, int b)
       2 {
       3   p = kmalloc(sizeof(*p), GFP_KERNEL);
       4   if (!p)
       5     return false;
       6   spin_lock(&gp_lock);
       7   if (rcu_access_pointer(gp)) {
       8     spin_unlock(&gp_lock);
       9     return false;
      10   }
      11   p->a = a;
      12   p->b = b;
      13   gp = p; /* ORDERING BUG */
      14   spin_unlock(&gp_lock);
      15   return true;
      16 }

The problem is that both the compiler and weakly ordered CPUs are within
their rights to reorder this code as follows:

   ::

       1 bool add_gp_buggy_optimized(int a, int b)
       2 {
       3   p = kmalloc(sizeof(*p), GFP_KERNEL);
       4   if (!p)
       5     return false;
       6   spin_lock(&gp_lock);
       7   if (rcu_access_pointer(gp)) {
       8     spin_unlock(&gp_lock);
       9     return false;
      10   }
      11   gp = p; /* ORDERING BUG */
      12   p->a = a;
      13   p->b = b;
      14   spin_unlock(&gp_lock);
      15   return true;
      16 }

If an RCU reader fetches ``gp`` just after ``add_gp_buggy_optimized()``
executes line 11, it will see garbage in the ``->a`` and ``->b`` fields.
And this is but one of many ways in which compiler and hardware
optimizations could cause trouble. Therefore, we clearly need some way
to prevent the compiler and the CPU from reordering in this manner,
which brings us to the publish-subscribe guarantee discussed in the next
section.

Publish/Subscribe Guarantee
~~~~~~~~~~~~~~~~~~~~~~~~~~~

RCU's publish-subscribe guarantee allows data to be inserted into a
linked data structure without disrupting RCU readers. The updater uses
``rcu_assign_pointer()`` to insert the new data, and readers use
``rcu_dereference()`` to access data, whether new or old. The following
shows an example of insertion:

   ::

       1 bool add_gp(int a, int b)
       2 {
       3   p = kmalloc(sizeof(*p), GFP_KERNEL);
       4   if (!p)
       5     return false;
       6   spin_lock(&gp_lock);
       7   if (rcu_access_pointer(gp)) {
       8     spin_unlock(&gp_lock);
       9     return false;
      10   }
      11   p->a = a;
      12   p->b = b;
      13   rcu_assign_pointer(gp, p);
      14   spin_unlock(&gp_lock);
      15   return true;
      16 }

The ``rcu_assign_pointer()`` on line 13 is conceptually equivalent to a
simple assignment statement, but also guarantees that its assignment
will happen after the two assignments in lines 11 and 12, similar to the
C11 ``memory_order_release`` store operation. It also prevents any
number of “interesting” compiler optimizations, for example, the use of
``gp`` as a scratch location immediately preceding the assignment.

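In recent Linux kernels, ``rcu_assign_pointer()`` is in fact built on a
release store. Ignoring ``sparse`` annotations and the special handling
of compile-time-constant ``NULL`` pointers, the publication on line 13
behaves roughly like this sketch:

   ::

       p->a = a;                   /* Initialize the new element... */
       p->b = b;
       smp_store_release(&gp, p);  /* ...then publish it: the prior stores
                                      cannot be reordered past this one. */
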
+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| But ``rcu_assign_pointer()`` does nothing to prevent the two          |
| assignments to ``p->a`` and ``p->b`` from being reordered. Can't that |
| also cause problems?                                                  |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| No, it cannot. The readers cannot see either of these two fields      |
| until the assignment to ``gp``, by which time both fields are fully   |
| initialized. So reordering the assignments to ``p->a`` and ``p->b``   |
| cannot possibly cause any problems.                                   |
+-----------------------------------------------------------------------+

It is tempting to assume that the reader need not do anything special to
control its accesses to the RCU-protected data, as shown in
``do_something_gp_buggy()`` below:

   ::

       1 bool do_something_gp_buggy(void)
       2 {
       3   rcu_read_lock();
       4   p = gp;  /* OPTIMIZATIONS GALORE!!! */
       5   if (p) {
       6     do_something(p->a, p->b);
       7     rcu_read_unlock();
       8     return true;
       9   }
      10   rcu_read_unlock();
      11   return false;
      12 }

However, this temptation must be resisted because there are a
surprisingly large number of ways that the compiler (to say nothing of
`DEC Alpha CPUs <https://h71000.www7.hp.com/wizard/wiz_2637.html>`__)
can trip this code up. For but one example, if the compiler were short
of registers, it might choose to refetch from ``gp`` rather than keeping
a separate copy in ``p`` as follows:

   ::

       1 bool do_something_gp_buggy_optimized(void)
       2 {
       3   rcu_read_lock();
       4   if (gp) { /* OPTIMIZATIONS GALORE!!! */
       5     do_something(gp->a, gp->b);
       6     rcu_read_unlock();
       7     return true;
       8   }
       9   rcu_read_unlock();
      10   return false;
      11 }

If this function ran concurrently with a series of updates that replaced
the current structure with a new one, the fetches of ``gp->a`` and
``gp->b`` might well come from two different structures, which could
cause serious confusion. To prevent this (and much else besides),
``do_something_gp()`` uses ``rcu_dereference()`` to fetch from ``gp``:

   ::

       1 bool do_something_gp(void)
       2 {
       3   rcu_read_lock();
       4   p = rcu_dereference(gp);
       5   if (p) {
       6     do_something(p->a, p->b);
       7     rcu_read_unlock();
       8     return true;
       9   }
      10   rcu_read_unlock();
      11   return false;
      12 }

The ``rcu_dereference()`` uses volatile casts and (for DEC Alpha) memory
barriers in the Linux kernel. Should a `high-quality implementation of
C11 ``memory_order_consume``
[PDF] <http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf>`__
ever appear, then ``rcu_dereference()`` could be implemented as a
``memory_order_consume`` load. Regardless of the exact implementation, a
pointer fetched by ``rcu_dereference()`` may not be used outside of the
outermost RCU read-side critical section containing that
``rcu_dereference()``, unless protection of the corresponding data
element has been passed from RCU to some other synchronization
mechanism, most commonly locking or `reference
counting <https://www.kernel.org/doc/Documentation/RCU/rcuref.txt>`__.

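For example, the following sketch (the ``get_gp_ref()`` name and the
``->refcnt`` field are hypothetical) hands protection off from RCU to a
reference count, after which the returned pointer may safely be used
outside the read-side critical section:

   ::

       struct foo *get_gp_ref(void)
       {
         struct foo *p;

         rcu_read_lock();
         p = rcu_dereference(gp);
         if (p && !atomic_inc_not_zero(&p->refcnt))
           p = NULL;  /* Object is already being torn down. */
         rcu_read_unlock();
         return p;    /* Caller now holds a reference (or got NULL). */
       }
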
In short, updaters use ``rcu_assign_pointer()`` and readers use
``rcu_dereference()``, and these two RCU API elements work together to
ensure that readers have a consistent view of newly added data elements.

Of course, it is also necessary to remove elements from RCU-protected
data structures, for example, using the following process:

#. Remove the data element from the enclosing structure.
#. Wait for all pre-existing RCU read-side critical sections to complete
   (because only pre-existing readers can possibly have a reference to
   the newly removed data element).
#. At this point, only the updater has a reference to the newly removed
   data element, so it can safely reclaim the data element, for example,
   by passing it to ``kfree()``.

This process is implemented by ``remove_gp_synchronous()``:

   ::

       1 bool remove_gp_synchronous(void)
       2 {
       3   struct foo *p;
       4
       5   spin_lock(&gp_lock);
       6   p = rcu_access_pointer(gp);
       7   if (!p) {
       8     spin_unlock(&gp_lock);
       9     return false;
      10   }
      11   rcu_assign_pointer(gp, NULL);
      12   spin_unlock(&gp_lock);
      13   synchronize_rcu();
      14   kfree(p);
      15   return true;
      16 }

This function is straightforward, with line 13 waiting for a grace
period before line 14 frees the old data element. This waiting ensures
that readers will reach line 7 of ``do_something_gp()`` before the data
element referenced by ``p`` is freed. The ``rcu_access_pointer()`` on
line 6 is similar to ``rcu_dereference()``, except that:

#. The value returned by ``rcu_access_pointer()`` cannot be
   dereferenced. If you want to access the value pointed to as well as
   the pointer itself, use ``rcu_dereference()`` instead of
   ``rcu_access_pointer()``.
#. The call to ``rcu_access_pointer()`` need not be protected. In
   contrast, ``rcu_dereference()`` must either be within an RCU
   read-side critical section or in a code segment where the pointer
   cannot change, for example, in code protected by the corresponding
   update-side lock. (See the sketch following this list.)

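For example, this hypothetical helper checks whether ``gp`` is currently
set. Because the pointer is never dereferenced, no read-side protection
is needed:

   ::

       bool gp_is_set(void)
       {
         return rcu_access_pointer(gp) != NULL;  /* No rcu_read_lock(). */
       }
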
+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Without the ``rcu_dereference()`` or the ``rcu_access_pointer()``,    |
| what destructive optimizations might the compiler make use of?        |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Let's start with what happens to ``do_something_gp()`` if it fails to |
| use ``rcu_dereference()``. It could reuse a value formerly fetched    |
| from this same pointer. It could also fetch the pointer from ``gp``   |
| in a byte-at-a-time manner, resulting in *load tearing*, in turn      |
| resulting in a bytewise mash-up of two distinct pointer values. It    |
| might even use value-speculation optimizations, where it makes a      |
| wrong guess, but by the time it gets around to checking the value, an |
| update has changed the pointer to match the wrong guess. Too bad      |
| about any dereferences that returned pre-initialization garbage in    |
| the meantime!                                                         |
| For ``remove_gp_synchronous()``, as long as all modifications to      |
| ``gp`` are carried out while holding ``gp_lock``, the above           |
| optimizations are harmless. However, ``sparse`` will complain if you  |
| define ``gp`` with ``__rcu`` and then access it without using either  |
| ``rcu_access_pointer()`` or ``rcu_dereference()``.                    |
+-----------------------------------------------------------------------+

In short, RCU's publish-subscribe guarantee is provided by the
combination of ``rcu_assign_pointer()`` and ``rcu_dereference()``. This
guarantee allows data elements to be safely added to RCU-protected
linked data structures without disrupting RCU readers. This guarantee
can be used in combination with the grace-period guarantee to also allow
data elements to be removed from RCU-protected linked data structures,
again without disrupting RCU readers.

This guarantee was only partially premeditated. DYNIX/ptx used an
explicit memory barrier for publication, but had nothing resembling
``rcu_dereference()`` for subscription, nor did it have anything
resembling the dependency-ordering barrier that was later subsumed
into ``rcu_dereference()`` and later still into ``READ_ONCE()``. The
need for these operations made itself known quite suddenly at a
late-1990s meeting with the DEC Alpha architects, back in the days when
DEC was still a free-standing company. It took the Alpha architects a
good hour to convince me that any sort of barrier would ever be needed,
and it then took me a good *two* hours to convince them that their
documentation did not make this point clear. More recent work with the
C and C++ standards committees has provided much education on tricks
and traps from the compiler. In short, compilers were much less tricky
in the early 1990s, but in 2015, don't even think about omitting
``rcu_dereference()``!

Memory-Barrier Guarantees
~~~~~~~~~~~~~~~~~~~~~~~~~

The previous section's simple linked-data-structure scenario clearly
demonstrates the need for RCU's stringent memory-ordering guarantees on
systems with more than one CPU:

#. Each CPU that has an RCU read-side critical section that begins
   before ``synchronize_rcu()`` starts is guaranteed to execute a full
   memory barrier between the time that the RCU read-side critical
   section ends and the time that ``synchronize_rcu()`` returns. Without
   this guarantee, a pre-existing RCU read-side critical section might
   hold a reference to the newly removed ``struct foo`` after the
   ``kfree()`` on line 14 of ``remove_gp_synchronous()``.
#. Each CPU that has an RCU read-side critical section that ends after
   ``synchronize_rcu()`` returns is guaranteed to execute a full memory
   barrier between the time that ``synchronize_rcu()`` begins and the
   time that the RCU read-side critical section begins. Without this
   guarantee, a later RCU read-side critical section running after the
   ``kfree()`` on line 14 of ``remove_gp_synchronous()`` might later run
   ``do_something_gp()`` and find the newly deleted ``struct foo``.
#. If the task invoking ``synchronize_rcu()`` remains on a given CPU,
   then that CPU is guaranteed to execute a full memory barrier sometime
   during the execution of ``synchronize_rcu()``. This guarantee ensures
   that the ``kfree()`` on line 14 of ``remove_gp_synchronous()`` really
   does execute after the removal on line 11.
#. If the task invoking ``synchronize_rcu()`` migrates among a group of
   CPUs during that invocation, then each of the CPUs in that group is
   guaranteed to execute a full memory barrier sometime during the
   execution of ``synchronize_rcu()``. This guarantee ensures that the
   ``kfree()`` on line 14 of ``remove_gp_synchronous()`` really does
   execute after the removal on line 11, even in the case where the
   thread executing the ``synchronize_rcu()`` migrates in the meantime.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Given that multiple CPUs can start RCU read-side critical sections at |
| any time without any ordering whatsoever, how can RCU possibly tell   |
| whether or not a given RCU read-side critical section starts before a |
| given instance of ``synchronize_rcu()``?                              |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| If RCU cannot tell whether or not a given RCU read-side critical      |
| section starts before a given instance of ``synchronize_rcu()``, then |
| it must assume that the RCU read-side critical section started first. |
| In other words, a given instance of ``synchronize_rcu()`` can avoid   |
| waiting on a given RCU read-side critical section only if it can      |
| prove that ``synchronize_rcu()`` started first.                       |
| A related question is “When ``rcu_read_lock()`` doesn't generate any  |
| code, why does it matter how it relates to a grace period?” The       |
| answer is that it is not the relationship of ``rcu_read_lock()``      |
| itself that is important, but rather the relationship of the code     |
| within the enclosed RCU read-side critical section to the code        |
| preceding and following the grace period. If we take this viewpoint,  |
| then a given RCU read-side critical section begins before a given     |
| grace period when some access preceding the grace period observes the |
| effect of some access within the critical section, in which case none |
| of the accesses within the critical section may observe the effects   |
| of any access following the grace period.                             |
|                                                                       |
| As of late 2016, mathematical models of RCU take this viewpoint, for  |
| example, see slides 62 and 63 of the `2016 LinuxCon                   |
| EU <http://www2.rdrop.com/users/paulmck/scalability/paper/LinuxMM.201 |
| 6.10.04c.LCE.pdf>`__                                                  |
| presentation.                                                         |
+-----------------------------------------------------------------------+

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| The first and second guarantees require unbelievably strict ordering! |
| Are all these memory barriers *really* required?                      |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Yes, they really are required. To see why the first guarantee is      |
| required, consider the following sequence of events:                  |
|                                                                       |
| #. CPU 1: ``rcu_read_lock()``                                         |
| #. CPU 1: ``q = rcu_dereference(gp); /* Very likely to return p. */`` |
| #. CPU 0: ``list_del_rcu(p);``                                        |
| #. CPU 0: ``synchronize_rcu()`` starts.                               |
| #. CPU 1: ``do_something_with(q->a);``                                |
|    ``/* No smp_mb(), so might happen after kfree(). */``              |
| #. CPU 1: ``rcu_read_unlock()``                                       |
| #. CPU 0: ``synchronize_rcu()`` returns.                              |
| #. CPU 0: ``kfree(p);``                                               |
|                                                                       |
| Therefore, there absolutely must be a full memory barrier between the |
| end of the RCU read-side critical section and the end of the grace    |
| period.                                                               |
|                                                                       |
| The sequence of events demonstrating the necessity of the second rule |
| is roughly similar:                                                   |
|                                                                       |
| #. CPU 0: ``list_del_rcu(p);``                                        |
| #. CPU 0: ``synchronize_rcu()`` starts.                               |
| #. CPU 1: ``rcu_read_lock()``                                         |
| #. CPU 1: ``q = rcu_dereference(gp);``                                |
|    ``/* Might return p if no memory barrier. */``                     |
| #. CPU 0: ``synchronize_rcu()`` returns.                              |
| #. CPU 0: ``kfree(p);``                                               |
| #. CPU 1: ``do_something_with(q->a); /* Boom!!! */``                  |
| #. CPU 1: ``rcu_read_unlock()``                                       |
|                                                                       |
| And similarly, without a memory barrier between the beginning of the  |
| grace period and the beginning of the RCU read-side critical section, |
| CPU 1 might end up accessing the freelist.                            |
|                                                                       |
| The “as if” rule of course applies, so that any implementation that   |
| acts as if the appropriate memory barriers were in place is a correct |
| implementation. That said, it is much easier to fool yourself into    |
| believing that you have adhered to the as-if rule than it is to       |
| actually adhere to it!                                                |
+-----------------------------------------------------------------------+

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| You claim that ``rcu_read_lock()`` and ``rcu_read_unlock()`` generate |
| absolutely no code in some kernel builds. This means that the         |
| compiler might arbitrarily rearrange consecutive RCU read-side        |
| critical sections. Given such rearrangement, if a given RCU read-side |
| critical section is done, how can you be sure that all prior RCU      |
| read-side critical sections are done? Won't the compiler              |
| rearrangements make that impossible to determine?                     |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| In cases where ``rcu_read_lock()`` and ``rcu_read_unlock()`` generate |
| absolutely no code, RCU infers quiescent states only at special       |
| locations, for example, within the scheduler. Because calls to        |
| ``schedule()`` had better prevent calling-code accesses to shared     |
| variables from being rearranged across the call to ``schedule()``, if |
| RCU detects the end of a given RCU read-side critical section, it     |
| will necessarily detect the end of all prior RCU read-side critical   |
| sections, no matter how aggressively the compiler scrambles the code. |
| Again, this all assumes that the compiler cannot scramble code across |
| calls to the scheduler, out of interrupt handlers, into the idle      |
| loop, into user-mode code, and so on. But if your kernel build allows |
| that sort of scrambling, you have broken far more than just RCU!      |
+-----------------------------------------------------------------------+

Note that these memory-barrier requirements do not replace the
fundamental RCU requirement that a grace period wait for all
pre-existing readers. On the contrary, the memory barriers called out in
this section must operate in such a way as to *enforce* this fundamental
requirement. Of course, different implementations enforce this
requirement in different ways, but enforce it they must.

RCU Primitives Guaranteed to Execute Unconditionally
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  634) The common-case RCU primitives are unconditional. They are invoked, they
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  635) do their job, and they return, with no possibility of error, and no need
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  636) to retry. This is a key RCU design philosophy.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  637) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  638) However, this philosophy is pragmatic rather than pigheaded. If someone
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  639) comes up with a good justification for a particular conditional RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  640) primitive, it might well be implemented and added. After all, this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  641) guarantee was reverse-engineered, not premeditated. The unconditional
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  642) nature of the RCU primitives was initially an accident of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  643) implementation, and later experience with synchronization primitives
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  644) with conditional primitives caused me to elevate this accident to a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  645) guarantee. Therefore, the justification for adding a conditional
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  646) primitive to RCU would need to be based on detailed and compelling use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  647) cases.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  648) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  649) Guaranteed Read-to-Write Upgrade
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  650) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  651) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  652) As far as RCU is concerned, it is always possible to carry out an update
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  653) within an RCU read-side critical section. For example, that RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  654) read-side critical section might search for a given data element, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  655) then might acquire the update-side spinlock in order to update that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  656) element, all while remaining in that RCU read-side critical section. Of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  657) course, it is necessary to exit the RCU read-side critical section
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  658) before invoking ``synchronize_rcu()``, however, this inconvenience can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  659) be avoided through use of the ``call_rcu()`` and ``kfree_rcu()`` API
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  660) members described later in this document.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  661) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  662) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  663) | **Quick Quiz**:                                                       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  664) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  665) | But how does the upgrade-to-write operation exclude other readers?    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  666) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  667) | **Answer**:                                                           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  668) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  669) | It doesn't, just like normal RCU updates, which also do not exclude   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  670) | RCU readers.                                                          |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  671) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  672) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  673) This guarantee allows lookup code to be shared between read-side and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  674) update-side code, and was premeditated, appearing in the earliest
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  675) DYNIX/ptx RCU documentation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  676) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  677) Fundamental Non-Requirements
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  678) ----------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  679) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  680) RCU provides extremely lightweight readers, and its read-side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  681) guarantees, though quite useful, are correspondingly lightweight. It is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  682) therefore all too easy to assume that RCU is guaranteeing more than it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  683) really is. Of course, the list of things that RCU does not guarantee is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  684) infinitely long, however, the following sections list a few
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  685) non-guarantees that have caused confusion. Except where otherwise noted,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  686) these non-guarantees were premeditated.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  687) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  688) #. `Readers Impose Minimal Ordering`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  689) #. `Readers Do Not Exclude Updaters`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  690) #. `Updaters Only Wait For Old Readers`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  691) #. `Grace Periods Don't Partition Read-Side Critical Sections`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  692) #. `Read-Side Critical Sections Don't Partition Grace Periods`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  693) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  694) Readers Impose Minimal Ordering
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  695) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  696) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  697) Reader-side markers such as ``rcu_read_lock()`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  698) ``rcu_read_unlock()`` provide absolutely no ordering guarantees except
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  699) through their interaction with the grace-period APIs such as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  700) ``synchronize_rcu()``. To see this, consider the following pair of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  701) threads:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  702) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  703)    ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  704) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  705)        1 void thread0(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  706)        2 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  707)        3   rcu_read_lock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  708)        4   WRITE_ONCE(x, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  709)        5   rcu_read_unlock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  710)        6   rcu_read_lock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  711)        7   WRITE_ONCE(y, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  712)        8   rcu_read_unlock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  713)        9 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  714)       10
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  715)       11 void thread1(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  716)       12 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  717)       13   rcu_read_lock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  718)       14   r1 = READ_ONCE(y);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  719)       15   rcu_read_unlock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  720)       16   rcu_read_lock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  721)       17   r2 = READ_ONCE(x);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  722)       18   rcu_read_unlock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  723)       19 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  724) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  725) After ``thread0()`` and ``thread1()`` execute concurrently, it is quite
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  726) possible to have
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  727) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  728)    ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  729) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  730)       (r1 == 1 && r2 == 0)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  731) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  732) (that is, ``y`` appears to have been assigned before ``x``), which would
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  733) not be possible if ``rcu_read_lock()`` and ``rcu_read_unlock()`` had
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  734) much in the way of ordering properties. But they do not, so the CPU is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  735) within its rights to do significant reordering. This is by design: Any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  736) significant ordering constraints would slow down these fast-path APIs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  737) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  738) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  739) | **Quick Quiz**:                                                       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  740) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  741) | Can't the compiler also reorder this code?                            |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  742) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  743) | **Answer**:                                                           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  744) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  745) | No, the volatile casts in ``READ_ONCE()`` and ``WRITE_ONCE()``        |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  746) | prevent the compiler from reordering in this particular case.         |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  747) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  748) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  749) Readers Do Not Exclude Updaters
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  750) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  751) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  752) Neither ``rcu_read_lock()`` nor ``rcu_read_unlock()`` exclude updates.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  753) All they do is to prevent grace periods from ending. The following
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  754) example illustrates this:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  755) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  756)    ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  757) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  758)        1 void thread0(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  759)        2 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  760)        3   rcu_read_lock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  761)        4   r1 = READ_ONCE(y);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  762)        5   if (r1) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  763)        6     do_something_with_nonzero_x();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  764)        7     r2 = READ_ONCE(x);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  765)        8     WARN_ON(!r2); /* BUG!!! */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  766)        9   }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  767)       10   rcu_read_unlock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  768)       11 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  769)       12
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  770)       13 void thread1(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  771)       14 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  772)       15   spin_lock(&my_lock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  773)       16   WRITE_ONCE(x, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  774)       17   WRITE_ONCE(y, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  775)       18   spin_unlock(&my_lock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  776)       19 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  777) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  778) If the ``thread0()`` function's ``rcu_read_lock()`` excluded the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  779) ``thread1()`` function's update, the ``WARN_ON()`` could never fire. But
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  780) the fact is that ``rcu_read_lock()`` does not exclude much of anything
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  781) aside from subsequent grace periods, of which ``thread1()`` has none, so
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  782) the ``WARN_ON()`` can and does fire.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  783) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  784) Updaters Only Wait For Old Readers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  785) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  786) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  787) It might be tempting to assume that after ``synchronize_rcu()``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  788) completes, there are no readers executing. This temptation must be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  789) avoided because new readers can start immediately after
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  790) ``synchronize_rcu()`` starts, and ``synchronize_rcu()`` is under no
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  791) obligation to wait for these new readers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  792) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  793) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  794) | **Quick Quiz**:                                                       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  795) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  796) | Suppose that synchronize_rcu() did wait until *all* readers had       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  797) | completed instead of waiting only on pre-existing readers. For how    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  798) | long would the updater be able to rely on there being no readers?     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  799) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  800) | **Answer**:                                                           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  801) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  802) | For no time at all. Even if ``synchronize_rcu()`` were to wait until  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  803) | all readers had completed, a new reader might start immediately after |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  804) | ``synchronize_rcu()`` completed. Therefore, the code following        |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  805) | ``synchronize_rcu()`` can *never* rely on there being no readers.     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  806) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  807) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  808) Grace Periods Don't Partition Read-Side Critical Sections
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  809) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  810) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  811) It is tempting to assume that if any part of one RCU read-side critical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  812) section precedes a given grace period, and if any part of another RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  813) read-side critical section follows that same grace period, then all of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  814) the first RCU read-side critical section must precede all of the second.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  815) However, this just isn't the case: A single grace period does not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  816) partition the set of RCU read-side critical sections. An example of this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  817) situation can be illustrated as follows, where ``x``, ``y``, and ``z``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  818) are initially all zero:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  819) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  820)    ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  821) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  822)        1 void thread0(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  823)        2 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  824)        3   rcu_read_lock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  825)        4   WRITE_ONCE(a, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  826)        5   WRITE_ONCE(b, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  827)        6   rcu_read_unlock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  828)        7 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  829)        8
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  830)        9 void thread1(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  831)       10 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  832)       11   r1 = READ_ONCE(a);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  833)       12   synchronize_rcu();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  834)       13   WRITE_ONCE(c, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  835)       14 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  836)       15
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  837)       16 void thread2(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  838)       17 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  839)       18   rcu_read_lock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  840)       19   r2 = READ_ONCE(b);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  841)       20   r3 = READ_ONCE(c);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  842)       21   rcu_read_unlock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  843)       22 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  844) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  845) It turns out that the outcome:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  846) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  847)    ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  848) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  849)       (r1 == 1 && r2 == 0 && r3 == 1)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  850) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  851) is entirely possible. The following figure show how this can happen,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  852) with each circled ``QS`` indicating the point at which RCU recorded a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  853) *quiescent state* for each thread, that is, a state in which RCU knows
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  854) that the thread cannot be in the midst of an RCU read-side critical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  855) section that started before the current grace period:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  856) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  857) .. kernel-figure:: GPpartitionReaders1.svg
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  858) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  859) If it is necessary to partition RCU read-side critical sections in this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  860) manner, it is necessary to use two grace periods, where the first grace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  861) period is known to end before the second grace period starts:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  862) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  863)    ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  864) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  865)        1 void thread0(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  866)        2 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  867)        3   rcu_read_lock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  868)        4   WRITE_ONCE(a, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  869)        5   WRITE_ONCE(b, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  870)        6   rcu_read_unlock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  871)        7 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  872)        8
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  873)        9 void thread1(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  874)       10 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  875)       11   r1 = READ_ONCE(a);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  876)       12   synchronize_rcu();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  877)       13   WRITE_ONCE(c, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  878)       14 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  879)       15
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  880)       16 void thread2(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  881)       17 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  882)       18   r2 = READ_ONCE(c);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  883)       19   synchronize_rcu();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  884)       20   WRITE_ONCE(d, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  885)       21 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  886)       22
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  887)       23 void thread3(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  888)       24 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  889)       25   rcu_read_lock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  890)       26   r3 = READ_ONCE(b);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  891)       27   r4 = READ_ONCE(d);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  892)       28   rcu_read_unlock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  893)       29 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  894) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  895) Here, if ``(r1 == 1)``, then ``thread0()``'s write to ``b`` must happen
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  896) before the end of ``thread1()``'s grace period. If in addition
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  897) ``(r4 == 1)``, then ``thread3()``'s read from ``b`` must happen after
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  898) the beginning of ``thread2()``'s grace period. If it is also the case
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  899) that ``(r2 == 1)``, then the end of ``thread1()``'s grace period must
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  900) precede the beginning of ``thread2()``'s grace period. This mean that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  901) the two RCU read-side critical sections cannot overlap, guaranteeing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  902) that ``(r3 == 1)``. As a result, the outcome:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  903) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  904)    ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  905) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  906)       (r1 == 1 && r2 == 1 && r3 == 0 && r4 == 1)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  907) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  908) cannot happen.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  909) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  910) This non-requirement was also non-premeditated, but became apparent when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  911) studying RCU's interaction with memory ordering.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  912) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  913) Read-Side Critical Sections Don't Partition Grace Periods
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  914) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  915) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  916) It is also tempting to assume that if an RCU read-side critical section
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  917) happens between a pair of grace periods, then those grace periods cannot
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  918) overlap. However, this temptation leads nowhere good, as can be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  919) illustrated by the following, with all variables initially zero:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  920) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  921)    ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  922) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  923)        1 void thread0(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  924)        2 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  925)        3   rcu_read_lock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  926)        4   WRITE_ONCE(a, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  927)        5   WRITE_ONCE(b, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  928)        6   rcu_read_unlock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  929)        7 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  930)        8
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  931)        9 void thread1(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  932)       10 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  933)       11   r1 = READ_ONCE(a);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  934)       12   synchronize_rcu();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  935)       13   WRITE_ONCE(c, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  936)       14 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  937)       15
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  938)       16 void thread2(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  939)       17 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  940)       18   rcu_read_lock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  941)       19   WRITE_ONCE(d, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  942)       20   r2 = READ_ONCE(c);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  943)       21   rcu_read_unlock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  944)       22 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  945)       23
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  946)       24 void thread3(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  947)       25 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  948)       26   r3 = READ_ONCE(d);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  949)       27   synchronize_rcu();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  950)       28   WRITE_ONCE(e, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  951)       29 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  952)       30
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  953)       31 void thread4(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  954)       32 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  955)       33   rcu_read_lock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  956)       34   r4 = READ_ONCE(b);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  957)       35   r5 = READ_ONCE(e);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  958)       36   rcu_read_unlock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  959)       37 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  960) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  961) In this case, the outcome:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  962) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  963)    ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  964) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  965)       (r1 == 1 && r2 == 1 && r3 == 1 && r4 == 0 && r5 == 1)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  966) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  967) is entirely possible, as illustrated below:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  968) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  969) .. kernel-figure:: ReadersPartitionGP1.svg
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  970) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  971) Again, an RCU read-side critical section can overlap almost all of a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  972) given grace period, just so long as it does not overlap the entire grace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  973) period. As a result, an RCU read-side critical section cannot partition
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  974) a pair of RCU grace periods.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  975) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  976) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  977) | **Quick Quiz**:                                                       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  978) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  979) | How long a sequence of grace periods, each separated by an RCU        |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  980) | read-side critical section, would be required to partition the RCU    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  981) | read-side critical sections at the beginning and end of the chain?    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  982) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  983) | **Answer**:                                                           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  984) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  985) | In theory, an infinite number. In practice, an unknown number that is |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  986) | sensitive to both implementation details and timing considerations.   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  987) | Therefore, even in practice, RCU users must abide by the theoretical  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  988) | rather than the practical answer.                                     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  989) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  990) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  991) Parallelism Facts of Life
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  992) -------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  993) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  994) These parallelism facts of life are by no means specific to RCU, but the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  995) RCU implementation must abide by them. They therefore bear repeating:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  996) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  997) #. Any CPU or task may be delayed at any time, and any attempts to avoid
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  998)    these delays by disabling preemption, interrupts, or whatever are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  999)    completely futile. This is most obvious in preemptible user-level
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1000)    environments and in virtualized environments (where a given guest
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1001)    OS's VCPUs can be preempted at any time by the underlying
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1002)    hypervisor), but can also happen in bare-metal environments due to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1003)    ECC errors, NMIs, and other hardware events. Although a delay of more
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1004)    than about 20 seconds can result in splats, the RCU implementation is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1005)    obligated to use algorithms that can tolerate extremely long delays,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1006)    but where “extremely long” is not long enough to allow wrap-around
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1007)    when incrementing a 64-bit counter.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1008) #. Both the compiler and the CPU can reorder memory accesses. Where it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1009)    matters, RCU must use compiler directives and memory-barrier
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1010)    instructions to preserve ordering.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1011) #. Conflicting writes to memory locations in any given cache line will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1012)    result in expensive cache misses. Greater numbers of concurrent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1013)    writes and more-frequent concurrent writes will result in more
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1014)    dramatic slowdowns. RCU is therefore obligated to use algorithms that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1015)    have sufficient locality to avoid significant performance and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1016)    scalability problems.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1017) #. As a rough rule of thumb, only one CPU's worth of processing may be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1018)    carried out under the protection of any given exclusive lock. RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1019)    must therefore use scalable locking designs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1020) #. Counters are finite, especially on 32-bit systems. RCU's use of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1021)    counters must therefore tolerate counter wrap, or be designed such
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1022)    that counter wrap would take way more time than a single system is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1023)    likely to run. An uptime of ten years is quite possible, a runtime of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1024)    a century much less so. As an example of the latter, RCU's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1025)    dyntick-idle nesting counter allows 54 bits for interrupt nesting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1026)    level (this counter is 64 bits even on a 32-bit system). Overflowing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1027)    this counter requires 2\ :sup:`54` half-interrupts on a given CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1028)    without that CPU ever going idle. If a half-interrupt happened every
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1029)    microsecond, it would take 570 years of runtime to overflow this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1030)    counter, which is currently believed to be an acceptably long time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1031) #. Linux systems can have thousands of CPUs running a single Linux
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1032)    kernel in a single shared-memory environment. RCU must therefore pay
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1033)    close attention to high-end scalability.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1034) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1035) This last parallelism fact of life means that RCU must pay special
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1036) attention to the preceding facts of life. The idea that Linux might
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1037) scale to systems with thousands of CPUs would have been met with some
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1038) skepticism in the 1990s, but these requirements would have otherwise
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1039) have been unsurprising, even in the early 1990s.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1040) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1041) Quality-of-Implementation Requirements
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1042) --------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1043) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1044) These sections list quality-of-implementation requirements. Although an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1045) RCU implementation that ignores these requirements could still be used,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1046) it would likely be subject to limitations that would make it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1047) inappropriate for industrial-strength production use. Classes of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1048) quality-of-implementation requirements are as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1049) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1050) #. `Specialization`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1051) #. `Performance and Scalability`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1052) #. `Forward Progress`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1053) #. `Composability`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1054) #. `Corner Cases`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1055) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1056) These classes is covered in the following sections.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1057) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1058) Specialization
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1059) ~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1060) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1061) RCU is and always has been intended primarily for read-mostly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1062) situations, which means that RCU's read-side primitives are optimized,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1063) often at the expense of its update-side primitives. Experience thus far
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1064) is captured by the following list of situations:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1065) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1066) #. Read-mostly data, where stale and inconsistent data is not a problem:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1067)    RCU works great!
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1068) #. Read-mostly data, where data must be consistent: RCU works well.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1069) #. Read-write data, where data must be consistent: RCU *might* work OK.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1070)    Or not.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1071) #. Write-mostly data, where data must be consistent: RCU is very
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1072)    unlikely to be the right tool for the job, with the following
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1073)    exceptions, where RCU can provide:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1074) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1075)    a. Existence guarantees for update-friendly mechanisms.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1076)    b. Wait-free read-side primitives for real-time use.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1077) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1078) This focus on read-mostly situations means that RCU must interoperate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1079) with other synchronization primitives. For example, the ``add_gp()`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1080) ``remove_gp_synchronous()`` examples discussed earlier use RCU to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1081) protect readers and locking to coordinate updaters. However, the need
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1082) extends much farther, requiring that a variety of synchronization
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1083) primitives be legal within RCU read-side critical sections, including
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1084) spinlocks, sequence locks, atomic operations, reference counters, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1085) memory barriers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1086) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1087) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1088) | **Quick Quiz**:                                                       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1089) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1090) | What about sleeping locks?                                            |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1091) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1092) | **Answer**:                                                           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1093) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1094) | These are forbidden within Linux-kernel RCU read-side critical        |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1095) | sections because it is not legal to place a quiescent state (in this  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1096) | case, voluntary context switch) within an RCU read-side critical      |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1097) | section. However, sleeping locks may be used within userspace RCU     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1098) | read-side critical sections, and also within Linux-kernel sleepable   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1099) | RCU `(SRCU) <#Sleepable%20RCU>`__ read-side critical sections. In     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1100) | addition, the -rt patchset turns spinlocks into a sleeping locks so   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1101) | that the corresponding critical sections can be preempted, which also |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1102) | means that these sleeplockified spinlocks (but not other sleeping     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1103) | locks!) may be acquire within -rt-Linux-kernel RCU read-side critical |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1104) | sections.                                                             |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1105) | Note that it *is* legal for a normal RCU read-side critical section   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1106) | to conditionally acquire a sleeping locks (as in                      |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1107) | ``mutex_trylock()``), but only as long as it does not loop            |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1108) | indefinitely attempting to conditionally acquire that sleeping locks. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1109) | The key point is that things like ``mutex_trylock()`` either return   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1110) | with the mutex held, or return an error indication if the mutex was   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1111) | not immediately available. Either way, ``mutex_trylock()`` returns    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1112) | immediately without sleeping.                                         |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1113) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1114) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1115) It often comes as a surprise that many algorithms do not require a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1116) consistent view of data, but many can function in that mode, with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1117) network routing being the poster child. Internet routing algorithms take
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1118) significant time to propagate updates, so that by the time an update
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1119) arrives at a given system, that system has been sending network traffic
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1120) the wrong way for a considerable length of time. Having a few threads
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1121) continue to send traffic the wrong way for a few more milliseconds is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1122) clearly not a problem: In the worst case, TCP retransmissions will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1123) eventually get the data where it needs to go. In general, when tracking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1124) the state of the universe outside of the computer, some level of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1125) inconsistency must be tolerated due to speed-of-light delays if nothing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1126) else.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1127) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1128) Furthermore, uncertainty about external state is inherent in many cases.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1129) For example, a pair of veterinarians might use heartbeat to determine
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1130) whether or not a given cat was alive. But how long should they wait
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1131) after the last heartbeat to decide that the cat is in fact dead? Waiting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1132) less than 400 milliseconds makes no sense because this would mean that a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1133) relaxed cat would be considered to cycle between death and life more
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1134) than 100 times per minute. Moreover, just as with human beings, a cat's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1135) heart might stop for some period of time, so the exact wait period is a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1136) judgment call. One of our pair of veterinarians might wait 30 seconds
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1137) before pronouncing the cat dead, while the other might insist on waiting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1138) a full minute. The two veterinarians would then disagree on the state of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1139) the cat during the final 30 seconds of the minute following the last
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1140) heartbeat.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1141) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1142) Interestingly enough, this same situation applies to hardware. When push
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1143) comes to shove, how do we tell whether or not some external server has
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1144) failed? We send messages to it periodically, and declare it failed if we
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1145) don't receive a response within a given period of time. Policy decisions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1146) can usually tolerate short periods of inconsistency. The policy was
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1147) decided some time ago, and is only now being put into effect, so a few
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1148) milliseconds of delay is normally inconsequential.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1149) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1150) However, there are algorithms that absolutely must see consistent data.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1151) For example, the translation between a user-level SystemV semaphore ID
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1152) to the corresponding in-kernel data structure is protected by RCU, but
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1153) it is absolutely forbidden to update a semaphore that has just been
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1154) removed. In the Linux kernel, this need for consistency is accommodated
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1155) by acquiring spinlocks located in the in-kernel data structure from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1156) within the RCU read-side critical section, and this is indicated by the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1157) green box in the figure above. Many other techniques may be used, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1158) are in fact used within the Linux kernel.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1159) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1160) In short, RCU is not required to maintain consistency, and other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1161) mechanisms may be used in concert with RCU when consistency is required.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1162) RCU's specialization allows it to do its job extremely well, and its
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1163) ability to interoperate with other synchronization mechanisms allows the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1164) right mix of synchronization tools to be used for a given job.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1165) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1166) Performance and Scalability
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1167) ~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1168) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1169) Energy efficiency is a critical component of performance today, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1170) Linux-kernel RCU implementations must therefore avoid unnecessarily
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1171) awakening idle CPUs. I cannot claim that this requirement was
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1172) premeditated. In fact, I learned of it during a telephone conversation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1173) in which I was given “frank and open” feedback on the importance of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1174) energy efficiency in battery-powered systems and on specific
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1175) energy-efficiency shortcomings of the Linux-kernel RCU implementation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1176) In my experience, the battery-powered embedded community will consider
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1177) any unnecessary wakeups to be extremely unfriendly acts. So much so that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1178) mere Linux-kernel-mailing-list posts are insufficient to vent their ire.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1179) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1180) Memory consumption is not particularly important for in most situations,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1181) and has become decreasingly so as memory sizes have expanded and memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1182) costs have plummeted. However, as I learned from Matt Mackall's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1183) `bloatwatch <http://elinux.org/Linux_Tiny-FAQ>`__ efforts, memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1184) footprint is critically important on single-CPU systems with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1185) non-preemptible (``CONFIG_PREEMPT=n``) kernels, and thus `tiny
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1186) RCU <https://lkml.kernel.org/g/20090113221724.GA15307@linux.vnet.ibm.com>`__
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1187) was born. Josh Triplett has since taken over the small-memory banner
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1188) with his `Linux kernel tinification <https://tiny.wiki.kernel.org/>`__
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1189) project, which resulted in `SRCU <#Sleepable%20RCU>`__ becoming optional
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1190) for those kernels not needing it.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1191) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1192) The remaining performance requirements are, for the most part,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1193) unsurprising. For example, in keeping with RCU's read-side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1194) specialization, ``rcu_dereference()`` should have negligible overhead
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1195) (for example, suppression of a few minor compiler optimizations).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1196) Similarly, in non-preemptible environments, ``rcu_read_lock()`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1197) ``rcu_read_unlock()`` should have exactly zero overhead.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1198) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1199) In preemptible environments, in the case where the RCU read-side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1200) critical section was not preempted (as will be the case for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1201) highest-priority real-time process), ``rcu_read_lock()`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1202) ``rcu_read_unlock()`` should have minimal overhead. In particular, they
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1203) should not contain atomic read-modify-write operations, memory-barrier
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1204) instructions, preemption disabling, interrupt disabling, or backwards
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1205) branches. However, in the case where the RCU read-side critical section
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1206) was preempted, ``rcu_read_unlock()`` may acquire spinlocks and disable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1207) interrupts. This is why it is better to nest an RCU read-side critical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1208) section within a preempt-disable region than vice versa, at least in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1209) cases where that critical section is short enough to avoid unduly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1210) degrading real-time latencies.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1211) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1212) The ``synchronize_rcu()`` grace-period-wait primitive is optimized for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1213) throughput. It may therefore incur several milliseconds of latency in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1214) addition to the duration of the longest RCU read-side critical section.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1215) On the other hand, multiple concurrent invocations of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1216) ``synchronize_rcu()`` are required to use batching optimizations so that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1217) they can be satisfied by a single underlying grace-period-wait
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1218) operation. For example, in the Linux kernel, it is not unusual for a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1219) single grace-period-wait operation to serve more than `1,000 separate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1220) invocations <https://www.usenix.org/conference/2004-usenix-annual-technical-conference/making-rcu-safe-deep-sub-millisecond-response>`__
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1221) of ``synchronize_rcu()``, thus amortizing the per-invocation overhead
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1222) down to nearly zero. However, the grace-period optimization is also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1223) required to avoid measurable degradation of real-time scheduling and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1224) interrupt latencies.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1225) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1226) In some cases, the multi-millisecond ``synchronize_rcu()`` latencies are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1227) unacceptable. In these cases, ``synchronize_rcu_expedited()`` may be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1228) used instead, reducing the grace-period latency down to a few tens of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1229) microseconds on small systems, at least in cases where the RCU read-side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1230) critical sections are short. There are currently no special latency
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1231) requirements for ``synchronize_rcu_expedited()`` on large systems, but,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1232) consistent with the empirical nature of the RCU specification, that is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1233) subject to change. However, there most definitely are scalability
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1234) requirements: A storm of ``synchronize_rcu_expedited()`` invocations on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1235) 4096 CPUs should at least make reasonable forward progress. In return
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1236) for its shorter latencies, ``synchronize_rcu_expedited()`` is permitted
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1237) to impose modest degradation of real-time latency on non-idle online
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1238) CPUs. Here, “modest” means roughly the same latency degradation as a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1239) scheduling-clock interrupt.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1240) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1241) There are a number of situations where even
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1242) ``synchronize_rcu_expedited()``'s reduced grace-period latency is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1243) unacceptable. In these situations, the asynchronous ``call_rcu()`` can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1244) be used in place of ``synchronize_rcu()`` as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1245) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1246)    ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1247) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1248)        1 struct foo {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1249)        2   int a;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1250)        3   int b;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1251)        4   struct rcu_head rh;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1252)        5 };
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1253)        6
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1254)        7 static void remove_gp_cb(struct rcu_head *rhp)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1255)        8 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1256)        9   struct foo *p = container_of(rhp, struct foo, rh);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1257)       10
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1258)       11   kfree(p);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1259)       12 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1260)       13
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1261)       14 bool remove_gp_asynchronous(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1262)       15 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1263)       16   struct foo *p;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1264)       17
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1265)       18   spin_lock(&gp_lock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1266)       19   p = rcu_access_pointer(gp);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1267)       20   if (!p) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1268)       21     spin_unlock(&gp_lock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1269)       22     return false;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1270)       23   }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1271)       24   rcu_assign_pointer(gp, NULL);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1272)       25   call_rcu(&p->rh, remove_gp_cb);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1273)       26   spin_unlock(&gp_lock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1274)       27   return true;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1275)       28 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1276) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1277) A definition of ``struct foo`` is finally needed, and appears on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1278) lines 1-5. The function ``remove_gp_cb()`` is passed to ``call_rcu()``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1279) on line 25, and will be invoked after the end of a subsequent grace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1280) period. This gets the same effect as ``remove_gp_synchronous()``, but
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1281) without forcing the updater to wait for a grace period to elapse. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1282) ``call_rcu()`` function may be used in a number of situations where
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1283) neither ``synchronize_rcu()`` nor ``synchronize_rcu_expedited()`` would
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1284) be legal, including within preempt-disable code, ``local_bh_disable()``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1285) code, interrupt-disable code, and interrupt handlers. However, even
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1286) ``call_rcu()`` is illegal within NMI handlers and from idle and offline
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1287) CPUs. The callback function (``remove_gp_cb()`` in this case) will be
executed within a softirq (software interrupt) environment in the
Linux kernel, either within a real softirq handler or under the
protection of ``local_bh_disable()``. In both the Linux kernel and
userspace, it is bad practice to write an RCU callback function that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1292) takes too long. Long-running operations should be relegated to separate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1293) threads or (in the Linux kernel) workqueues.
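
For example, here is a hedged sketch of the workqueue approach, in
which ``struct bar``'s payload and ``expensive_cleanup()`` are
hypothetical:

   ::

       #include <linux/kernel.h>
       #include <linux/rcupdate.h>
       #include <linux/slab.h>
       #include <linux/workqueue.h>

       struct bar {
               struct rcu_head rh;
               struct work_struct work;
               /* ... payload ... */
       };

       static void expensive_cleanup(struct bar *p);  /* Hypothetical. */

       static void bar_teardown_work(struct work_struct *wp)
       {
               struct bar *p = container_of(wp, struct bar, work);

               expensive_cleanup(p);  /* Hypothetical long-running work. */
               kfree(p);
       }

       static void bar_rcu_cb(struct rcu_head *rhp)
       {
               struct bar *p = container_of(rhp, struct bar, rh);

               /* Keep the softirq-context callback short: just hand
                * the heavy lifting off to process context. */
               INIT_WORK(&p->work, bar_teardown_work);
               schedule_work(&p->work);
       }

An updater would then pass ``bar_rcu_cb`` to ``call_rcu()``, just as
``remove_gp_cb()`` was passed above.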
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1294) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1295) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1296) | **Quick Quiz**:                                                       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1297) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1298) | Why does line 19 use ``rcu_access_pointer()``? After all,             |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1299) | ``call_rcu()`` on line 25 stores into the structure, which would      |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1300) | interact badly with concurrent insertions. Doesn't this mean that     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1301) | ``rcu_dereference()`` is required?                                    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1302) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1303) | **Answer**:                                                           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1304) +-----------------------------------------------------------------------+
| Presumably the ``gp_lock`` acquired on line 18 excludes any           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1306) | changes, including any insertions that ``rcu_dereference()`` would    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1307) | protect against. Therefore, any insertions will be delayed until      |
| after ``gp_lock`` is released on line 26, which in turn means that    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1309) | ``rcu_access_pointer()`` suffices.                                    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1310) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1311) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1312) However, all that ``remove_gp_cb()`` is doing is invoking ``kfree()`` on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1313) the data element. This is a common idiom, and is supported by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1314) ``kfree_rcu()``, which allows “fire and forget” operation as shown
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1315) below:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1316) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1317)    ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1318) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1319)        1 struct foo {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1320)        2   int a;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1321)        3   int b;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1322)        4   struct rcu_head rh;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1323)        5 };
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1324)        6
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1325)        7 bool remove_gp_faf(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1326)        8 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1327)        9   struct foo *p;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1328)       10
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1329)       11   spin_lock(&gp_lock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1330)       12   p = rcu_dereference(gp);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1331)       13   if (!p) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1332)       14     spin_unlock(&gp_lock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1333)       15     return false;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1334)       16   }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1335)       17   rcu_assign_pointer(gp, NULL);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1336)       18   kfree_rcu(p, rh);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1337)       19   spin_unlock(&gp_lock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1338)       20   return true;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1339)       21 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1340) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1341) Note that ``remove_gp_faf()`` simply invokes ``kfree_rcu()`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1342) proceeds, without any need to pay any further attention to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1343) subsequent grace period and ``kfree()``. It is permissible to invoke
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1344) ``kfree_rcu()`` from the same environments as for ``call_rcu()``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1345) Interestingly enough, DYNIX/ptx had the equivalents of ``call_rcu()``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1346) and ``kfree_rcu()``, but not ``synchronize_rcu()``. This was due to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1347) fact that RCU was not heavily used within DYNIX/ptx, so the very few
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1348) places that needed something like ``synchronize_rcu()`` simply
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1349) open-coded it.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1350) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1351) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1352) | **Quick Quiz**:                                                       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1353) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1354) | Earlier it was claimed that ``call_rcu()`` and ``kfree_rcu()``        |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1355) | allowed updaters to avoid being blocked by readers. But how can that  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1356) | be correct, given that the invocation of the callback and the freeing |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1357) | of the memory (respectively) must still wait for a grace period to    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1358) | elapse?                                                               |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1359) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1360) | **Answer**:                                                           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1361) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1362) | We could define things this way, but keep in mind that this sort of   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1363) | definition would say that updates in garbage-collected languages      |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1364) | cannot complete until the next time the garbage collector runs, which |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1365) | does not seem at all reasonable. The key point is that in most cases, |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1366) | an updater using either ``call_rcu()`` or ``kfree_rcu()`` can proceed |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1367) | to the next update as soon as it has invoked ``call_rcu()`` or        |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1368) | ``kfree_rcu()``, without having to wait for a subsequent grace        |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1369) | period.                                                               |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1370) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1371) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1372) But what if the updater must wait for the completion of code to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1373) executed after the end of the grace period, but has other tasks that can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1374) be carried out in the meantime? The polling-style
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1375) ``get_state_synchronize_rcu()`` and ``cond_synchronize_rcu()`` functions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1376) may be used for this purpose, as shown below:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1377) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1378)    ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1379) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1380)        1 bool remove_gp_poll(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1381)        2 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1382)        3   struct foo *p;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1383)        4   unsigned long s;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1384)        5
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1385)        6   spin_lock(&gp_lock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1386)        7   p = rcu_access_pointer(gp);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1387)        8   if (!p) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1388)        9     spin_unlock(&gp_lock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1389)       10     return false;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1390)       11   }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1391)       12   rcu_assign_pointer(gp, NULL);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1392)       13   spin_unlock(&gp_lock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1393)       14   s = get_state_synchronize_rcu();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1394)       15   do_something_while_waiting();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1395)       16   cond_synchronize_rcu(s);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1396)       17   kfree(p);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1397)       18   return true;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1398)       19 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1399) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1400) On line 14, ``get_state_synchronize_rcu()`` obtains a “cookie” from RCU,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1401) then line 15 carries out other tasks, and finally, line 16 returns
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1402) immediately if a grace period has elapsed in the meantime, but otherwise
waits as required. The need for ``get_state_synchronize_rcu()`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1404) ``cond_synchronize_rcu()`` has appeared quite recently, so it is too
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1405) early to tell whether they will stand the test of time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1406) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1407) RCU thus provides a range of tools to allow updaters to strike the
required tradeoff between latency, flexibility, and CPU overhead.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1409) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1410) Forward Progress
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1411) ~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1412) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1413) In theory, delaying grace-period completion and callback invocation is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1414) harmless. In practice, not only are memory sizes finite but also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1415) callbacks sometimes do wakeups, and sufficiently deferred wakeups can be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1416) difficult to distinguish from system hangs. Therefore, RCU must provide
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1417) a number of mechanisms to promote forward progress.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1418) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1419) These mechanisms are not foolproof, nor can they be. For one simple
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1420) example, an infinite loop in an RCU read-side critical section must by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1421) definition prevent later grace periods from ever completing. For a more
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1422) involved example, consider a 64-CPU system built with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1423) ``CONFIG_RCU_NOCB_CPU=y`` and booted with ``rcu_nocbs=1-63``, where
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1424) CPUs 1 through 63 spin in tight loops that invoke ``call_rcu()``. Even
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1425) if these tight loops also contain calls to ``cond_resched()`` (thus
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1426) allowing grace periods to complete), CPU 0 simply will not be able to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1427) invoke callbacks as fast as the other 63 CPUs can register them, at
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1428) least not until the system runs out of memory. In both of these
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1429) examples, the Spiderman principle applies: With great power comes great
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1430) responsibility. However, short of this level of abuse, RCU is required
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1431) to ensure timely completion of grace periods and timely invocation of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1432) callbacks.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1433) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1434) RCU takes the following steps to encourage timely completion of grace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1435) periods:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1436) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1437) #. If a grace period fails to complete within 100 milliseconds, RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1438)    causes future invocations of ``cond_resched()`` on the holdout CPUs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1439)    to provide an RCU quiescent state. RCU also causes those CPUs'
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1440)    ``need_resched()`` invocations to return ``true``, but only after the
   corresponding CPU's next scheduling-clock interrupt.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1442) #. CPUs mentioned in the ``nohz_full`` kernel boot parameter can run
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1443)    indefinitely in the kernel without scheduling-clock interrupts, which
   defeats the above ``need_resched()`` stratagem. RCU will therefore
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1445)    invoke ``resched_cpu()`` on any ``nohz_full`` CPUs still holding out
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1446)    after 109 milliseconds.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1447) #. In kernels built with ``CONFIG_RCU_BOOST=y``, if a given task that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1448)    has been preempted within an RCU read-side critical section is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1449)    holding out for more than 500 milliseconds, RCU will resort to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1450)    priority boosting.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1451) #. If a CPU is still holding out 10 seconds into the grace period, RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1452)    will invoke ``resched_cpu()`` on it regardless of its ``nohz_full``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1453)    state.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1454) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1455) The above values are defaults for systems running with ``HZ=1000``. They
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1456) will vary as the value of ``HZ`` varies, and can also be changed using
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1457) the relevant Kconfig options and kernel boot parameters. RCU currently
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1458) does not do much sanity checking of these parameters, so please use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1459) caution when changing them. Note that these forward-progress measures
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1460) are provided only for RCU, not for `SRCU <#Sleepable%20RCU>`__ or `Tasks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1461) RCU <#Tasks%20RCU>`__.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1462) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1463) RCU takes the following steps in ``call_rcu()`` to encourage timely
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1464) invocation of callbacks when any given non-\ ``rcu_nocbs`` CPU has
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1465) 10,000 callbacks, or has 10,000 more callbacks than it had the last time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1466) encouragement was provided:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1467) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1468) #. Starts a grace period, if one is not already in progress.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1469) #. Forces immediate checking for quiescent states, rather than waiting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1470)    for three milliseconds to have elapsed since the beginning of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1471)    grace period.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1472) #. Immediately tags the CPU's callbacks with their grace period
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1473)    completion numbers, rather than waiting for the ``RCU_SOFTIRQ``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1474)    handler to get around to it.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1475) #. Lifts callback-execution batch limits, which speeds up callback
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1476)    invocation at the expense of degrading realtime response.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1477) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1478) Again, these are default values when running at ``HZ=1000``, and can be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1479) overridden. Again, these forward-progress measures are provided only for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1480) RCU, not for `SRCU <#Sleepable%20RCU>`__ or `Tasks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1481) RCU <#Tasks%20RCU>`__. Even for RCU, callback-invocation forward
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1482) progress for ``rcu_nocbs`` CPUs is much less well-developed, in part
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1483) because workloads benefiting from ``rcu_nocbs`` CPUs tend to invoke
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1484) ``call_rcu()`` relatively infrequently. If workloads emerge that need
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1485) both ``rcu_nocbs`` CPUs and high ``call_rcu()`` invocation rates, then
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1486) additional forward-progress work will be required.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1487) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1488) Composability
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1489) ~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1490) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1491) Composability has received much attention in recent years, perhaps in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1492) part due to the collision of multicore hardware with object-oriented
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1493) techniques designed in single-threaded environments for single-threaded
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1494) use. And in theory, RCU read-side critical sections may be composed, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1495) in fact may be nested arbitrarily deeply. In practice, as with all
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1496) real-world implementations of composable constructs, there are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1497) limitations.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1498) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1499) Implementations of RCU for which ``rcu_read_lock()`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1500) ``rcu_read_unlock()`` generate no code, such as Linux-kernel RCU when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1501) ``CONFIG_PREEMPT=n``, can be nested arbitrarily deeply. After all, there
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1502) is no overhead. Except that if all these instances of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1503) ``rcu_read_lock()`` and ``rcu_read_unlock()`` are visible to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1504) compiler, compilation will eventually fail due to exhausting memory,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1505) mass storage, or user patience, whichever comes first. If the nesting is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1506) not visible to the compiler, as is the case with mutually recursive
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1507) functions each in its own translation unit, stack overflow will result.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1508) If the nesting takes the form of loops, perhaps in the guise of tail
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1509) recursion, either the control variable will overflow or (in the Linux
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1510) kernel) you will get an RCU CPU stall warning. Nevertheless, this class
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1511) of RCU implementations is one of the most composable constructs in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1512) existence.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1513) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1514) RCU implementations that explicitly track nesting depth are limited by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1515) the nesting-depth counter. For example, the Linux kernel's preemptible
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1516) RCU limits nesting to ``INT_MAX``. This should suffice for almost all
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1517) practical purposes. That said, a consecutive pair of RCU read-side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1518) critical sections between which there is an operation that waits for a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1519) grace period cannot be enclosed in another RCU read-side critical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1520) section. This is because it is not legal to wait for a grace period
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1521) within an RCU read-side critical section: To do so would result either
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1522) in deadlock or in RCU implicitly splitting the enclosing RCU read-side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1523) critical section, neither of which is conducive to a long-lived and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1524) prosperous kernel.
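
For concreteness, the following minimal sketch shows the forbidden
pattern, with the two ``do_*()`` functions standing in for hypothetical
reader-side work:

   ::

       #include <linux/rcupdate.h>

       static void do_first_read_side_thing(void);   /* Hypothetical. */
       static void do_second_read_side_thing(void);  /* Hypothetical. */

       /* BUGGY: waits for a grace period inside an RCU read-side
        * critical section.  In a non-preemptible kernel this
        * self-deadlocks, because the grace period cannot end while
        * this reader is still running. */
       static void broken_composition(void)
       {
               rcu_read_lock();
               do_first_read_side_thing();
               synchronize_rcu();  /* Illegal within a reader! */
               do_second_read_side_thing();
               rcu_read_unlock();
       }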
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1525) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1526) It is worth noting that RCU is not alone in limiting composability. For
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1527) example, many transactional-memory implementations prohibit composing a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1528) pair of transactions separated by an irrevocable operation (for example,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1529) a network receive operation). For another example, lock-based critical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1530) sections can be composed surprisingly freely, but only if deadlock is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1531) avoided.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1532) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1533) In short, although RCU read-side critical sections are highly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1534) composable, care is required in some situations, just as is the case for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1535) any other composable synchronization mechanism.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1536) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1537) Corner Cases
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1538) ~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1539) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1540) A given RCU workload might have an endless and intense stream of RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1541) read-side critical sections, perhaps even so intense that there was
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1542) never a point in time during which there was not at least one RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1543) read-side critical section in flight. RCU cannot allow this situation to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1544) block grace periods: As long as all the RCU read-side critical sections
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1545) are finite, grace periods must also be finite.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1546) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1547) That said, preemptible RCU implementations could potentially result in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1548) RCU read-side critical sections being preempted for long durations,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1549) which has the effect of creating a long-duration RCU read-side critical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1550) section. This situation can arise only in heavily loaded systems, but
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1551) systems using real-time priorities are of course more vulnerable.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1552) Therefore, RCU priority boosting is provided to help deal with this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1553) case. That said, the exact requirements on RCU priority boosting will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1554) likely evolve as more experience accumulates.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1555) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1556) Other workloads might have very high update rates. Although one can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1557) argue that such workloads should instead use something other than RCU,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1558) the fact remains that RCU must handle such workloads gracefully. This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1559) requirement is another factor driving batching of grace periods, but it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1560) is also the driving force behind the checks for large numbers of queued
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1561) RCU callbacks in the ``call_rcu()`` code path. Finally, high update
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1562) rates should not delay RCU read-side critical sections, although some
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1563) small read-side delays can occur when using
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1564) ``synchronize_rcu_expedited()``, courtesy of this function's use of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1565) ``smp_call_function_single()``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1566) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1567) Although all three of these corner cases were understood in the early
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1568) 1990s, a simple user-level test consisting of ``close(open(path))`` in a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1569) tight loop in the early 2000s suddenly provided a much deeper
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1570) appreciation of the high-update-rate corner case. This test also
motivated addition of some RCU code to react to high update rates. For
example, if a given CPU finds itself with more than 10,000 RCU callbacks
queued, RCU will take evasive action by more aggressively
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1574) starting grace periods and more aggressively forcing completion of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1575) grace-period processing. This evasive action causes the grace period to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1576) complete more quickly, but at the cost of restricting RCU's batching
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1577) optimizations, thus increasing the CPU overhead incurred by that grace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1578) period.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1579) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1580) Software-Engineering Requirements
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1581) ---------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1582) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1583) Between Murphy's Law and “To err is human”, it is necessary to guard
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1584) against mishaps and misuse:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1585) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1586) #. It is all too easy to forget to use ``rcu_read_lock()`` everywhere
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1587)    that it is needed, so kernels built with ``CONFIG_PROVE_RCU=y`` will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1588)    splat if ``rcu_dereference()`` is used outside of an RCU read-side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1589)    critical section. Update-side code can use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1590)    ``rcu_dereference_protected()``, which takes a `lockdep
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1591)    expression <https://lwn.net/Articles/371986/>`__ to indicate what is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1592)    providing the protection. If the indicated protection is not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1593)    provided, a lockdep splat is emitted.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1594)    Code shared between readers and updaters can use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1595)    ``rcu_dereference_check()``, which also takes a lockdep expression,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1596)    and emits a lockdep splat if neither ``rcu_read_lock()`` nor the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1597)    indicated protection is in place. In addition,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1598)    ``rcu_dereference_raw()`` is used in those (hopefully rare) cases
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1599)    where the required protection cannot be easily described. Finally,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1600)    ``rcu_read_lock_held()`` is provided to allow a function to verify
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1601)    that it has been invoked within an RCU read-side critical section. I
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1602)    was made aware of this set of requirements shortly after Thomas
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1603)    Gleixner audited a number of RCU uses.
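   A sketch pulling several of these lockdep-based interfaces together
   appears just after this list.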
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1604) #. A given function might wish to check for RCU-related preconditions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1605)    upon entry, before using any other RCU API. The
   ``rcu_lockdep_assert()`` macro does this job, asserting the expression in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1607)    kernels having lockdep enabled and doing nothing otherwise.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1608) #. It is also easy to forget to use ``rcu_assign_pointer()`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1609)    ``rcu_dereference()``, perhaps (incorrectly) substituting a simple
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1610)    assignment. To catch this sort of error, a given RCU-protected
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1611)    pointer may be tagged with ``__rcu``, after which sparse will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1612)    complain about simple-assignment accesses to that pointer. Arnd
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1613)    Bergmann made me aware of this requirement, and also supplied the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1614)    needed `patch series <https://lwn.net/Articles/376011/>`__.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1615) #. Kernels built with ``CONFIG_DEBUG_OBJECTS_RCU_HEAD=y`` will splat if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1616)    a data element is passed to ``call_rcu()`` twice in a row, without a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1617)    grace period in between. (This error is similar to a double free.)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1618)    The corresponding ``rcu_head`` structures that are dynamically
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1619)    allocated are automatically tracked, but ``rcu_head`` structures
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1620)    allocated on the stack must be initialized with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1621)    ``init_rcu_head_on_stack()`` and cleaned up with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1622)    ``destroy_rcu_head_on_stack()``. Similarly, statically allocated
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1623)    non-stack ``rcu_head`` structures must be initialized with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1624)    ``init_rcu_head()`` and cleaned up with ``destroy_rcu_head()``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1625)    Mathieu Desnoyers made me aware of this requirement, and also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1626)    supplied the needed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1627)    `patch <https://lkml.kernel.org/g/20100319013024.GA28456@Krystal>`__.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1628) #. An infinite loop in an RCU read-side critical section will eventually
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1629)    trigger an RCU CPU stall warning splat, with the duration of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1630)    “eventually” being controlled by the ``RCU_CPU_STALL_TIMEOUT``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1631)    ``Kconfig`` option, or, alternatively, by the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1632)    ``rcupdate.rcu_cpu_stall_timeout`` boot/sysfs parameter. However, RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1633)    is not obligated to produce this splat unless there is a grace period
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1634)    waiting on that particular RCU read-side critical section.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1635) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1636)    Some extreme workloads might intentionally delay RCU grace periods,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1637)    and systems running those workloads can be booted with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1638)    ``rcupdate.rcu_cpu_stall_suppress`` to suppress the splats. This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1639)    kernel parameter may also be set via ``sysfs``. Furthermore, RCU CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1640)    stall warnings are counter-productive during sysrq dumps and during
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1641)    panics. RCU therefore supplies the ``rcu_sysrq_start()`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1642)    ``rcu_sysrq_end()`` API members to be called before and after long
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1643)    sysrq dumps. RCU also supplies the ``rcu_panic()`` notifier that is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1644)    automatically invoked at the beginning of a panic to suppress further
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1645)    RCU CPU stall warnings.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1646) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1647)    This requirement made itself known in the early 1990s, pretty much
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1648)    the first time that it was necessary to debug a CPU stall. That said,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1649)    the initial implementation in DYNIX/ptx was quite generic in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1650)    comparison with that of Linux.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1651) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1652) #. Although it would be very good to detect pointers leaking out of RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1653)    read-side critical sections, there is currently no good way of doing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1654)    this. One complication is the need to distinguish between pointers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1655)    leaking and pointers that have been handed off from RCU to some other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1656)    synchronization mechanism, for example, reference counting.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1657) #. In kernels built with ``CONFIG_RCU_TRACE=y``, RCU-related information
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1658)    is provided via event tracing.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1659) #. Open-coded use of ``rcu_assign_pointer()`` and ``rcu_dereference()``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1660)    to create typical linked data structures can be surprisingly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1661)    error-prone. Therefore, RCU-protected `linked
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1662)    lists <https://lwn.net/Articles/609973/#RCU%20List%20APIs>`__ and,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1663)    more recently, RCU-protected `hash
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1664)    tables <https://lwn.net/Articles/612100/>`__ are available. Many
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1665)    other special-purpose RCU-protected data structures are available in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1666)    the Linux kernel and the userspace RCU library.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1667) #. Some linked structures are created at compile time, but still require
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1668)    ``__rcu`` checking. The ``RCU_POINTER_INITIALIZER()`` macro serves
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1669)    this purpose.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1670) #. It is not necessary to use ``rcu_assign_pointer()`` when creating
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1671)    linked structures that are to be published via a single external
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1672)    pointer. The ``RCU_INIT_POINTER()`` macro is provided for this task
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1673)    and also for assigning ``NULL`` pointers at runtime.
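
The following hedged sketch pulls several of these diagnostic
interfaces together; ``struct my_data``, ``my_lock``, and ``my_ptr``
are hypothetical:

   ::

       #include <linux/rcupdate.h>
       #include <linux/spinlock.h>

       struct my_data;  /* Hypothetical opaque payload type. */

       static DEFINE_SPINLOCK(my_lock);
       static struct my_data __rcu *my_ptr;  /* __rcu: sparse-checked. */

       /* Update-side access: lockdep splats unless my_lock is held. */
       static struct my_data *get_for_update(void)
       {
               return rcu_dereference_protected(my_ptr,
                                       lockdep_is_held(&my_lock));
       }

       /* Shared access: lockdep splats unless either rcu_read_lock()
        * or my_lock is in effect. */
       static struct my_data *get_shared(void)
       {
               return rcu_dereference_check(my_ptr,
                                       lockdep_is_held(&my_lock));
       }

       /* Initial publication before readers can possibly see the
        * pointer: no memory ordering needed. */
       static void publish_initial(struct my_data *p)
       {
               RCU_INIT_POINTER(my_ptr, p);
       }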
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1674) 
This is not a hard-and-fast list: RCU's diagnostic capabilities will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1676) continue to be guided by the number and type of usage bugs found in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1677) real-world RCU usage.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1678) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1679) Linux Kernel Complications
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1680) --------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1681) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1682) The Linux kernel provides an interesting environment for all kinds of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1683) software, including RCU. Some of the relevant points of interest are as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1684) follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1685) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1686) #. `Configuration`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1687) #. `Firmware Interface`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1688) #. `Early Boot`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1689) #. `Interrupts and NMIs`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1690) #. `Loadable Modules`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1691) #. `Hotplug CPU`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1692) #. `Scheduler and RCU`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1693) #. `Tracing and RCU`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1694) #. `Accesses to User Memory and RCU`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1695) #. `Energy Efficiency`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1696) #. `Scheduling-Clock Interrupts and RCU`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1697) #. `Memory Efficiency`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1698) #. `Performance, Scalability, Response Time, and Reliability`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1699) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1700) This list is probably incomplete, but it does give a feel for the most
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1701) notable Linux-kernel complications. Each of the following sections
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1702) covers one of the above topics.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1703) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1704) Configuration
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1705) ~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1706) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1707) RCU's goal is automatic configuration, so that almost nobody needs to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1708) worry about RCU's ``Kconfig`` options. And for almost all users, RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1709) does in fact work well “out of the box.”
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1710) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1711) However, there are specialized use cases that are handled by kernel boot
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1712) parameters and ``Kconfig`` options. Unfortunately, the ``Kconfig``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1713) system will explicitly ask users about new ``Kconfig`` options, which
requires that almost all of them be hidden behind a ``CONFIG_RCU_EXPERT``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1715) ``Kconfig`` option.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1716) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1717) This all should be quite obvious, but the fact remains that Linus
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1718) Torvalds recently had to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1719) `remind <https://lkml.kernel.org/g/CA+55aFy4wcCwaL4okTs8wXhGZ5h-ibecy_Meg9C4MNQrUnwMcg@mail.gmail.com>`__
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1720) me of this requirement.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1721) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1722) Firmware Interface
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1723) ~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1724) 
In many cases, the kernel obtains information about the system from the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1726) firmware, and sometimes things are lost in translation. Or the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1727) translation is accurate, but the original message is bogus.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1728) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1729) For example, some systems' firmware overreports the number of CPUs,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1730) sometimes by a large factor. If RCU naively believed the firmware, as it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1731) used to do, it would create too many per-CPU kthreads. Although the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1732) resulting system will still run correctly, the extra kthreads needlessly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1733) consume memory and can cause confusion when they show up in ``ps``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1734) listings.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1735) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1736) RCU must therefore wait for a given CPU to actually come online before
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1737) it can allow itself to believe that the CPU actually exists. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1738) resulting “ghost CPUs” (which are never going to come online) cause a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1739) number of `interesting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1740) complications <https://paulmck.livejournal.com/37494.html>`__.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1741) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1742) Early Boot
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1743) ~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1744) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1745) The Linux kernel's boot sequence is an interesting process, and RCU is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1746) used early, even before ``rcu_init()`` is invoked. In fact, a number of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1747) RCU's primitives can be used as soon as the initial task's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1748) ``task_struct`` is available and the boot CPU's per-CPU variables are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1749) set up. The read-side primitives (``rcu_read_lock()``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1750) ``rcu_read_unlock()``, ``rcu_dereference()``, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1751) ``rcu_access_pointer()``) will operate normally very early on, as will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1752) ``rcu_assign_pointer()``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1753) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1754) Although ``call_rcu()`` may be invoked at any time during boot,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1755) callbacks are not guaranteed to be invoked until after all of RCU's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1756) kthreads have been spawned, which occurs at ``early_initcall()`` time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1757) This delay in callback invocation is due to the fact that RCU does not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1758) invoke callbacks until it is fully initialized, and this full
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1759) initialization cannot occur until after the scheduler has initialized
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1760) itself to the point where RCU can spawn and run its kthreads. In theory,
it would be possible to invoke callbacks earlier; however, this is not a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1762) panacea because there would be severe restrictions on what operations
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1763) those callbacks could invoke.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1764) 
Perhaps surprisingly, ``synchronize_rcu()`` and
``synchronize_rcu_expedited()`` will operate normally during very early
boot, the reason being that there is only one CPU and preemption is
disabled. This means that a call to ``synchronize_rcu()`` (or friends)
is itself a quiescent state and thus a grace period, so the early-boot
implementation can be a no-op.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1771) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1772) However, once the scheduler has spawned its first kthread, this early
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1773) boot trick fails for ``synchronize_rcu()`` (as well as for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1774) ``synchronize_rcu_expedited()``) in ``CONFIG_PREEMPT=y`` kernels. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1775) reason is that an RCU read-side critical section might be preempted,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1776) which means that a subsequent ``synchronize_rcu()`` really does have to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1777) wait for something, as opposed to simply returning immediately.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1778) Unfortunately, ``synchronize_rcu()`` can't do this until all of its
kthreads are spawned, which doesn't happen until ``early_initcall()``
time. But this is no excuse: RCU is nevertheless
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1781) required to correctly handle synchronous grace periods during this time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1782) period. Once all of its kthreads are up and running, RCU starts running
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1783) normally.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1784) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1785) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1786) | **Quick Quiz**:                                                       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1787) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1788) | How can RCU possibly handle grace periods before all of its kthreads  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1789) | have been spawned???                                                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1790) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1791) | **Answer**:                                                           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1792) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1793) | Very carefully!                                                       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1794) | During the “dead zone” between the time that the scheduler spawns the |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1795) | first task and the time that all of RCU's kthreads have been spawned, |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1796) | all synchronous grace periods are handled by the expedited            |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1797) | grace-period mechanism. At runtime, this expedited mechanism relies   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1798) | on workqueues, but during the dead zone the requesting task itself    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1799) | drives the desired expedited grace period. Because dead-zone          |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1800) | execution takes place within task context, everything works. Once the |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1801) | dead zone ends, expedited grace periods go back to using workqueues,  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1802) | as is required to avoid problems that would otherwise occur when a    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1803) | user task received a POSIX signal while driving an expedited grace    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1804) | period.                                                               |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1805) |                                                                       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1806) | And yes, this does mean that it is unhelpful to send POSIX signals to |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1807) | random tasks between the time that the scheduler spawns its first     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1808) | kthread and the time that RCU's kthreads have all been spawned. If    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1809) | there ever turns out to be a good reason for sending POSIX signals    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1810) | during that time, appropriate adjustments will be made. (If it turns  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1811) | out that POSIX signals are sent during this time for no good reason,  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1812) | other adjustments will be made, appropriate or otherwise.)            |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1813) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1814) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1815) I learned of these boot-time requirements as a result of a series of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1816) system hangs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1817) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1818) Interrupts and NMIs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1819) ~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1820) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1821) The Linux kernel has interrupts, and RCU read-side critical sections are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1822) legal within interrupt handlers and within interrupt-disabled regions of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1823) code, as are invocations of ``call_rcu()``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1824) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1825) Some Linux-kernel architectures can enter an interrupt handler from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1826) non-idle process context, and then just never leave it, instead
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1827) stealthily transitioning back to process context. This trick is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1828) sometimes used to invoke system calls from inside the kernel. These
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1829) “half-interrupts” mean that RCU has to be very careful about how it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1830) counts interrupt nesting levels. I learned of this requirement the hard
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1831) way during a rewrite of RCU's dyntick-idle code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1832) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1833) The Linux kernel has non-maskable interrupts (NMIs), and RCU read-side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1834) critical sections are legal within NMI handlers. Thankfully, RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1835) update-side primitives, including ``call_rcu()``, are prohibited within
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1836) NMI handlers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1837) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1838) The name notwithstanding, some Linux-kernel architectures can have
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1839) nested NMIs, which RCU must handle correctly. Andy Lutomirski `surprised
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1840) me <https://lkml.kernel.org/r/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com>`__
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1841) with this requirement; he also kindly surprised me with `an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1842) algorithm <https://lkml.kernel.org/r/CALCETrXSY9JpW3uE6H8WYk81sg56qasA2aqmjMPsq5dOtzso=g@mail.gmail.com>`__
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1843) that meets this requirement.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1844) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1845) Furthermore, NMI handlers can be interrupted by what appear to RCU to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1846) normal interrupts. One way that this can happen is for code that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1847) directly invokes ``rcu_irq_enter()`` and ``rcu_irq_exit()`` to be called
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1848) from an NMI handler. This astonishing fact of life prompted the current
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1849) code structure, which has ``rcu_irq_enter()`` invoking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1850) ``rcu_nmi_enter()`` and ``rcu_irq_exit()`` invoking ``rcu_nmi_exit()``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1851) And yes, I also learned of this requirement the hard way.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1852) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1853) Loadable Modules
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1854) ~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1855) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1856) The Linux kernel has loadable modules, and these modules can also be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1857) unloaded. After a given module has been unloaded, any attempt to call
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1858) one of its functions results in a segmentation fault. The module-unload
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1859) functions must therefore cancel any delayed calls to loadable-module
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1860) functions, for example, any outstanding ``mod_timer()`` must be dealt
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1861) with via ``del_timer_sync()`` or similar.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1862) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1863) Unfortunately, there is no way to cancel an RCU callback; once you
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1864) invoke ``call_rcu()``, the callback function is eventually going to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1865) invoked, unless the system goes down first. Because it is normally
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1866) considered socially irresponsible to crash the system in response to a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1867) module unload request, we need some other way to deal with in-flight RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1868) callbacks.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1869) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1870) RCU therefore provides ``rcu_barrier()``, which waits until all
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1871) in-flight RCU callbacks have been invoked. If a module uses
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1872) ``call_rcu()``, its exit function should therefore prevent any future
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1873) invocation of ``call_rcu()``, then invoke ``rcu_barrier()``. In theory,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1874) the underlying module-unload code could invoke ``rcu_barrier()``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1875) unconditionally, but in practice this would incur unacceptable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1876) latencies.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1877) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1878) Nikita Danilov noted this requirement for an analogous
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1879) filesystem-unmount situation, and Dipankar Sarma incorporated
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1880) ``rcu_barrier()`` into RCU. The need for ``rcu_barrier()`` for module
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1881) unloading became apparent later.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1882) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1883) .. important::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1884) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1885)    The ``rcu_barrier()`` function is not, repeat,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1886)    *not*, obligated to wait for a grace period. It is instead only required
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1887)    to wait for RCU callbacks that have already been posted. Therefore, if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1888)    there are no RCU callbacks posted anywhere in the system,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1889)    ``rcu_barrier()`` is within its rights to return immediately. Even if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1890)    there are callbacks posted, ``rcu_barrier()`` does not necessarily need
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1891)    to wait for a grace period.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1892) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1893) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1894) | **Quick Quiz**:                                                       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1895) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1896) | Wait a minute! Each RCU callbacks must wait for a grace period to     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1897) | complete, and ``rcu_barrier()`` must wait for each pre-existing       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1898) | callback to be invoked. Doesn't ``rcu_barrier()`` therefore need to   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1899) | wait for a full grace period if there is even one callback posted     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1900) | anywhere in the system?                                               |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1901) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1902) | **Answer**:                                                           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1903) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1904) | Absolutely not!!!                                                     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1905) | Yes, each RCU callbacks must wait for a grace period to complete, but |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1906) | it might well be partly (or even completely) finished waiting by the  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1907) | time ``rcu_barrier()`` is invoked. In that case, ``rcu_barrier()``    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1908) | need only wait for the remaining portion of the grace period to       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1909) | elapse. So even if there are quite a few callbacks posted,            |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1910) | ``rcu_barrier()`` might well return quite quickly.                    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1911) |                                                                       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1912) | So if you need to wait for a grace period as well as for all          |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1913) | pre-existing callbacks, you will need to invoke both                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1914) | ``synchronize_rcu()`` and ``rcu_barrier()``. If latency is a concern, |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1915) | you can always use workqueues to invoke them concurrently.            |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1916) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1917) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1918) Hotplug CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1919) ~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1920) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1921) The Linux kernel supports CPU hotplug, which means that CPUs can come
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1922) and go. It is of course illegal to use any RCU API member from an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1923) offline CPU, with the exception of `SRCU <#Sleepable%20RCU>`__ read-side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1924) critical sections. This requirement was present from day one in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1925) DYNIX/ptx, but on the other hand, the Linux kernel's CPU-hotplug
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1926) implementation is “interesting.”
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1927) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1928) The Linux-kernel CPU-hotplug implementation has notifiers that are used
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1929) to allow the various kernel subsystems (including RCU) to respond
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1930) appropriately to a given CPU-hotplug operation. Most RCU operations may
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1931) be invoked from CPU-hotplug notifiers, including even synchronous
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1932) grace-period operations such as ``synchronize_rcu()`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1933) ``synchronize_rcu_expedited()``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1934) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1935) However, all-callback-wait operations such as ``rcu_barrier()`` are also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1936) not supported, due to the fact that there are phases of CPU-hotplug
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1937) operations where the outgoing CPU's callbacks will not be invoked until
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1938) after the CPU-hotplug operation ends, which could also result in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1939) deadlock. Furthermore, ``rcu_barrier()`` blocks CPU-hotplug operations
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1940) during its execution, which results in another type of deadlock when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1941) invoked from a CPU-hotplug notifier.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1942) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1943) Scheduler and RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1944) ~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1945) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1946) RCU makes use of kthreads, and it is necessary to avoid excessive CPU-time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1947) accumulation by these kthreads. This requirement was no surprise, but
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1948) RCU's violation of it when running context-switch-heavy workloads when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1949) built with ``CONFIG_NO_HZ_FULL=y`` `did come as a surprise
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1950) [PDF] <http://www.rdrop.com/users/paulmck/scalability/paper/BareMetal.2015.01.15b.pdf>`__.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1951) RCU has made good progress towards meeting this requirement, even for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1952) context-switch-heavy ``CONFIG_NO_HZ_FULL=y`` workloads, but there is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1953) room for further improvement.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1954) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1955) There is no longer any prohibition against holding any of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1956) scheduler's runqueue or priority-inheritance spinlocks across an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1957) ``rcu_read_unlock()``, even if interrupts and preemption were enabled
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1958) somewhere within the corresponding RCU read-side critical section.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1959) Therefore, it is now perfectly legal to execute ``rcu_read_lock()``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1960) with preemption enabled, acquire one of the scheduler locks, and hold
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1961) that lock across the matching ``rcu_read_unlock()``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1962) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1963) Similarly, the RCU flavor consolidation has removed the need for negative
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1964) nesting.  The fact that interrupt-disabled regions of code act as RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1965) read-side critical sections implicitly avoids earlier issues that used
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1966) to result in destructive recursion via interrupt handler's use of RCU.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1967) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1968) Tracing and RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1969) ~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1970) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1971) It is possible to use tracing on RCU code, but tracing itself uses RCU.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1972) For this reason, ``rcu_dereference_raw_check()`` is provided for use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1973) by tracing, which avoids the destructive recursion that could otherwise
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1974) ensue. This API is also used by virtualization in some architectures,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1975) where RCU readers execute in environments in which tracing cannot be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1976) used. The tracing folks both located the requirement and provided the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1977) needed fix, so this surprise requirement was relatively painless.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1978) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1979) Accesses to User Memory and RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1980) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1981) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1982) The kernel needs to access user-space memory, for example, to access data
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1983) referenced by system-call parameters.  The ``get_user()`` macro does this job.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1984) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1985) However, user-space memory might well be paged out, which means that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1986) ``get_user()`` might well page-fault and thus block while waiting for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1987) resulting I/O to complete.  It would be a very bad thing for the compiler to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1988) reorder a ``get_user()`` invocation into an RCU read-side critical section.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1989) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1990) For example, suppose that the source code looked like this:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1991) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1992)   ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1993) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1994)        1 rcu_read_lock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1995)        2 p = rcu_dereference(gp);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1996)        3 v = p->value;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1997)        4 rcu_read_unlock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1998)        5 get_user(user_v, user_p);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1999)        6 do_something_with(v, user_v);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2000) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2001) The compiler must not be permitted to transform this source code into
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2002) the following:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2003) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2004)   ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2005) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2006)        1 rcu_read_lock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2007)        2 p = rcu_dereference(gp);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2008)        3 get_user(user_v, user_p); // BUG: POSSIBLE PAGE FAULT!!!
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2009)        4 v = p->value;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2010)        5 rcu_read_unlock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2011)        6 do_something_with(v, user_v);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2012) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2013) If the compiler did make this transformation in a ``CONFIG_PREEMPT=n`` kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2014) build, and if ``get_user()`` did page fault, the result would be a quiescent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2015) state in the middle of an RCU read-side critical section.  This misplaced
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2016) quiescent state could result in line 4 being a use-after-free access,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2017) which could be bad for your kernel's actuarial statistics.  Similar examples
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2018) can be constructed with the call to ``get_user()`` preceding the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2019) ``rcu_read_lock()``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2020) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2021) Unfortunately, ``get_user()`` doesn't have any particular ordering properties,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2022) and in some architectures the underlying ``asm`` isn't even marked
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2023) ``volatile``.  And even if it was marked ``volatile``, the above access to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2024) ``p->value`` is not volatile, so the compiler would not have any reason to keep
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2025) those two accesses in order.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2026) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2027) Therefore, the Linux-kernel definitions of ``rcu_read_lock()`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2028) ``rcu_read_unlock()`` must act as compiler barriers, at least for outermost
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2029) instances of ``rcu_read_lock()`` and ``rcu_read_unlock()`` within a nested set
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2030) of RCU read-side critical sections.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2031) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2032) Energy Efficiency
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2033) ~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2034) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2035) Interrupting idle CPUs is considered socially unacceptable, especially
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2036) by people with battery-powered embedded systems. RCU therefore conserves
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2037) energy by detecting which CPUs are idle, including tracking CPUs that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2038) have been interrupted from idle. This is a large part of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2039) energy-efficiency requirement, so I learned of this via an irate phone
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2040) call.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2041) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2042) Because RCU avoids interrupting idle CPUs, it is illegal to execute an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2043) RCU read-side critical section on an idle CPU. (Kernels built with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2044) ``CONFIG_PROVE_RCU=y`` will splat if you try it.) The ``RCU_NONIDLE()``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2045) macro and ``_rcuidle`` event tracing is provided to work around this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2046) restriction. In addition, ``rcu_is_watching()`` may be used to test
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2047) whether or not it is currently legal to run RCU read-side critical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2048) sections on this CPU. I learned of the need for diagnostics on the one
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2049) hand and ``RCU_NONIDLE()`` on the other while inspecting idle-loop code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2050) Steven Rostedt supplied ``_rcuidle`` event tracing, which is used quite
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2051) heavily in the idle loop. However, there are some restrictions on the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2052) code placed within ``RCU_NONIDLE()``:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2053) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2054) #. Blocking is prohibited. In practice, this is not a serious
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2055)    restriction given that idle tasks are prohibited from blocking to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2056)    begin with.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2057) #. Although nesting ``RCU_NONIDLE()`` is permitted, they cannot nest
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2058)    indefinitely deeply. However, given that they can be nested on the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2059)    order of a million deep, even on 32-bit systems, this should not be a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2060)    serious restriction. This nesting limit would probably be reached
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2061)    long after the compiler OOMed or the stack overflowed.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2062) #. Any code path that enters ``RCU_NONIDLE()`` must sequence out of that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2063)    same ``RCU_NONIDLE()``. For example, the following is grossly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2064)    illegal:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2065) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2066)       ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2067) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2068) 	  1     RCU_NONIDLE({
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2069) 	  2       do_something();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2070) 	  3       goto bad_idea;  /* BUG!!! */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2071) 	  4       do_something_else();});
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2072) 	  5   bad_idea:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2073) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2074) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2075)    It is just as illegal to transfer control into the middle of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2076)    ``RCU_NONIDLE()``'s argument. Yes, in theory, you could transfer in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2077)    as long as you also transferred out, but in practice you could also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2078)    expect to get sharply worded review comments.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2079) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2080) It is similarly socially unacceptable to interrupt an ``nohz_full`` CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2081) running in userspace. RCU must therefore track ``nohz_full`` userspace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2082) execution. RCU must therefore be able to sample state at two points in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2083) time, and be able to determine whether or not some other CPU spent any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2084) time idle and/or executing in userspace.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2085) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2086) These energy-efficiency requirements have proven quite difficult to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2087) understand and to meet, for example, there have been more than five
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2088) clean-sheet rewrites of RCU's energy-efficiency code, the last of which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2089) was finally able to demonstrate `real energy savings running on real
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2090) hardware
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2091) [PDF] <http://www.rdrop.com/users/paulmck/realtime/paper/AMPenergy.2013.04.19a.pdf>`__.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2092) As noted earlier, I learned of many of these requirements via angry
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2093) phone calls: Flaming me on the Linux-kernel mailing list was apparently
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2094) not sufficient to fully vent their ire at RCU's energy-efficiency bugs!
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2095) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2096) Scheduling-Clock Interrupts and RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2097) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2098) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2099) The kernel transitions between in-kernel non-idle execution, userspace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2100) execution, and the idle loop. Depending on kernel configuration, RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2101) handles these states differently:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2102) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2103) +-----------------+------------------+------------------+-----------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2104) | ``HZ`` Kconfig  | In-Kernel        | Usermode         | Idle            |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2105) +=================+==================+==================+=================+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2106) | ``HZ_PERIODIC`` | Can rely on      | Can rely on      | Can rely on     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2107) |                 | scheduling-clock | scheduling-clock | RCU's           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2108) |                 | interrupt.       | interrupt and    | dyntick-idle    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2109) |                 |                  | its detection    | detection.      |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2110) |                 |                  | of interrupt     |                 |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2111) |                 |                  | from usermode.   |                 |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2112) +-----------------+------------------+------------------+-----------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2113) | ``NO_HZ_IDLE``  | Can rely on      | Can rely on      | Can rely on     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2114) |                 | scheduling-clock | scheduling-clock | RCU's           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2115) |                 | interrupt.       | interrupt and    | dyntick-idle    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2116) |                 |                  | its detection    | detection.      |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2117) |                 |                  | of interrupt     |                 |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2118) |                 |                  | from usermode.   |                 |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2119) +-----------------+------------------+------------------+-----------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2120) | ``NO_HZ_FULL``  | Can only         | Can rely on      | Can rely on     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2121) |                 | sometimes rely   | RCU's            | RCU's           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2122) |                 | on               | dyntick-idle     | dyntick-idle    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2123) |                 | scheduling-clock | detection.       | detection.      |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2124) |                 | interrupt. In    |                  |                 |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2125) |                 | other cases, it  |                  |                 |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2126) |                 | is necessary to  |                  |                 |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2127) |                 | bound kernel     |                  |                 |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2128) |                 | execution times  |                  |                 |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2129) |                 | and/or use       |                  |                 |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2130) |                 | IPIs.            |                  |                 |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2131) +-----------------+------------------+------------------+-----------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2132) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2133) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2134) | **Quick Quiz**:                                                       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2135) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2136) | Why can't ``NO_HZ_FULL`` in-kernel execution rely on the              |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2137) | scheduling-clock interrupt, just like ``HZ_PERIODIC`` and             |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2138) | ``NO_HZ_IDLE`` do?                                                    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2139) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2140) | **Answer**:                                                           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2141) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2142) | Because, as a performance optimization, ``NO_HZ_FULL`` does not       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2143) | necessarily re-enable the scheduling-clock interrupt on entry to each |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2144) | and every system call.                                                |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2145) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2146) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2147) However, RCU must be reliably informed as to whether any given CPU is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2148) currently in the idle loop, and, for ``NO_HZ_FULL``, also whether that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2149) CPU is executing in usermode, as discussed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2150) `earlier <#Energy%20Efficiency>`__. It also requires that the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2151) scheduling-clock interrupt be enabled when RCU needs it to be:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2152) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2153) #. If a CPU is either idle or executing in usermode, and RCU believes it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2154)    is non-idle, the scheduling-clock tick had better be running.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2155)    Otherwise, you will get RCU CPU stall warnings. Or at best, very long
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2156)    (11-second) grace periods, with a pointless IPI waking the CPU from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2157)    time to time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2158) #. If a CPU is in a portion of the kernel that executes RCU read-side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2159)    critical sections, and RCU believes this CPU to be idle, you will get
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2160)    random memory corruption. **DON'T DO THIS!!!**
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2161)    This is one reason to test with lockdep, which will complain about
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2162)    this sort of thing.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2163) #. If a CPU is in a portion of the kernel that is absolutely positively
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2164)    no-joking guaranteed to never execute any RCU read-side critical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2165)    sections, and RCU believes this CPU to be idle, no problem. This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2166)    sort of thing is used by some architectures for light-weight
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2167)    exception handlers, which can then avoid the overhead of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2168)    ``rcu_irq_enter()`` and ``rcu_irq_exit()`` at exception entry and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2169)    exit, respectively. Some go further and avoid the entireties of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2170)    ``irq_enter()`` and ``irq_exit()``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2171)    Just make very sure you are running some of your tests with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2172)    ``CONFIG_PROVE_RCU=y``, just in case one of your code paths was in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2173)    fact joking about not doing RCU read-side critical sections.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2174) #. If a CPU is executing in the kernel with the scheduling-clock
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2175)    interrupt disabled and RCU believes this CPU to be non-idle, and if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2176)    the CPU goes idle (from an RCU perspective) every few jiffies, no
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2177)    problem. It is usually OK for there to be the occasional gap between
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2178)    idle periods of up to a second or so.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2179)    If the gap grows too long, you get RCU CPU stall warnings.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2180) #. If a CPU is either idle or executing in usermode, and RCU believes it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2181)    to be idle, of course no problem.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2182) #. If a CPU is executing in the kernel, the kernel code path is passing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2183)    through quiescent states at a reasonable frequency (preferably about
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2184)    once per few jiffies, but the occasional excursion to a second or so
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2185)    is usually OK) and the scheduling-clock interrupt is enabled, of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2186)    course no problem.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2187)    If the gap between a successive pair of quiescent states grows too
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2188)    long, you get RCU CPU stall warnings.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2189) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2190) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2191) | **Quick Quiz**:                                                       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2192) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2193) | But what if my driver has a hardware interrupt handler that can run   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2194) | for many seconds? I cannot invoke ``schedule()`` from an hardware     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2195) | interrupt handler, after all!                                         |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2196) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2197) | **Answer**:                                                           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2198) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2199) | One approach is to do ``rcu_irq_exit();rcu_irq_enter();`` every so    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2200) | often. But given that long-running interrupt handlers can cause other |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2201) | problems, not least for response time, shouldn't you work to keep     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2202) | your interrupt handler's runtime within reasonable bounds?            |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2203) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2204) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2205) But as long as RCU is properly informed of kernel state transitions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2206) between in-kernel execution, usermode execution, and idle, and as long
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2207) as the scheduling-clock interrupt is enabled when RCU needs it to be,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2208) you can rest assured that the bugs you encounter will be in some other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2209) part of RCU or some other part of the kernel!
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2210) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2211) Memory Efficiency
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2212) ~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2213) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2214) Although small-memory non-realtime systems can simply use Tiny RCU, code
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2215) size is only one aspect of memory efficiency. Another aspect is the size
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2216) of the ``rcu_head`` structure used by ``call_rcu()`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2217) ``kfree_rcu()``. Although this structure contains nothing more than a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2218) pair of pointers, it does appear in many RCU-protected data structures,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2219) including some that are size critical. The ``page`` structure is a case
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2220) in point, as evidenced by the many occurrences of the ``union`` keyword
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2221) within that structure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2222) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2223) This need for memory efficiency is one reason that RCU uses hand-crafted
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2224) singly linked lists to track the ``rcu_head`` structures that are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2225) waiting for a grace period to elapse. It is also the reason why
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2226) ``rcu_head`` structures do not contain debug information, such as fields
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2227) tracking the file and line of the ``call_rcu()`` or ``kfree_rcu()`` that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2228) posted them. Although this information might appear in debug-only kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2229) builds at some point, in the meantime, the ``->func`` field will often
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2230) provide the needed debug information.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2231) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2232) However, in some cases, the need for memory efficiency leads to even
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2233) more extreme measures. Returning to the ``page`` structure, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2234) ``rcu_head`` field shares storage with a great many other structures
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2235) that are used at various points in the corresponding page's lifetime. In
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2236) order to correctly resolve certain `race
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2237) conditions <https://lkml.kernel.org/g/1439976106-137226-1-git-send-email-kirill.shutemov@linux.intel.com>`__,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2238) the Linux kernel's memory-management subsystem needs a particular bit to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2239) remain zero during all phases of grace-period processing, and that bit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2240) happens to map to the bottom bit of the ``rcu_head`` structure's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2241) ``->next`` field. RCU makes this guarantee as long as ``call_rcu()`` is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2242) used to post the callback, as opposed to ``kfree_rcu()`` or some future
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2243) “lazy” variant of ``call_rcu()`` that might one day be created for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2244) energy-efficiency purposes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2245) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2246) That said, there are limits. RCU requires that the ``rcu_head``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2247) structure be aligned to a two-byte boundary, and passing a misaligned
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2248) ``rcu_head`` structure to one of the ``call_rcu()`` family of functions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2249) will result in a splat. It is therefore necessary to exercise caution
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2250) when packing structures containing fields of type ``rcu_head``. Why not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2251) a four-byte or even eight-byte alignment requirement? Because the m68k
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2252) architecture provides only two-byte alignment, and thus acts as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2253) alignment's least common denominator.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2254) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2255) The reason for reserving the bottom bit of pointers to ``rcu_head``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2256) structures is to leave the door open to “lazy” callbacks whose
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2257) invocations can safely be deferred. Deferring invocation could
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2258) potentially have energy-efficiency benefits, but only if the rate of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2259) non-lazy callbacks decreases significantly for some important workload.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2260) In the meantime, reserving the bottom bit keeps this option open in case
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2261) it one day becomes useful.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2262) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2263) Performance, Scalability, Response Time, and Reliability
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2264) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2265) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2266) Expanding on the `earlier
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2267) discussion <#Performance%20and%20Scalability>`__, RCU is used heavily by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2268) hot code paths in performance-critical portions of the Linux kernel's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2269) networking, security, virtualization, and scheduling code paths. RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2270) must therefore use efficient implementations, especially in its
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2271) read-side primitives. To that end, it would be good if preemptible RCU's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2272) implementation of ``rcu_read_lock()`` could be inlined, however, doing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2273) this requires resolving ``#include`` issues with the ``task_struct``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2274) structure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2275) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2276) The Linux kernel supports hardware configurations with up to 4096 CPUs,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2277) which means that RCU must be extremely scalable. Algorithms that involve
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2278) frequent acquisitions of global locks or frequent atomic operations on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2279) global variables simply cannot be tolerated within the RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2280) implementation. RCU therefore makes heavy use of a combining tree based
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2281) on the ``rcu_node`` structure. RCU is required to tolerate all CPUs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2282) continuously invoking any combination of RCU's runtime primitives with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2283) minimal per-operation overhead. In fact, in many cases, increasing load
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2284) must *decrease* the per-operation overhead, witness the batching
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2285) optimizations for ``synchronize_rcu()``, ``call_rcu()``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2286) ``synchronize_rcu_expedited()``, and ``rcu_barrier()``. As a general
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2287) rule, RCU must cheerfully accept whatever the rest of the Linux kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2288) decides to throw at it.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2289) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2290) The Linux kernel is used for real-time workloads, especially in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2291) conjunction with the `-rt
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2292) patchset <https://rt.wiki.kernel.org/index.php/Main_Page>`__. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2293) real-time-latency response requirements are such that the traditional
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2294) approach of disabling preemption across RCU read-side critical sections
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2295) is inappropriate. Kernels built with ``CONFIG_PREEMPT=y`` therefore use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2296) an RCU implementation that allows RCU read-side critical sections to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2297) preempted. This requirement made its presence known after users made it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2298) clear that an earlier `real-time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2299) patch <https://lwn.net/Articles/107930/>`__ did not meet their needs, in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2300) conjunction with some `RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2301) issues <https://lkml.kernel.org/g/20050318002026.GA2693@us.ibm.com>`__
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2302) encountered by a very early version of the -rt patchset.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2303) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2304) In addition, RCU must make do with a sub-100-microsecond real-time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2305) latency budget. In fact, on smaller systems with the -rt patchset, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2306) Linux kernel provides sub-20-microsecond real-time latencies for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2307) whole kernel, including RCU. RCU's scalability and latency must
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2308) therefore be sufficient for these sorts of configurations. To my
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2309) surprise, the sub-100-microsecond real-time latency budget `applies to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2310) even the largest systems
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2311) [PDF] <http://www.rdrop.com/users/paulmck/realtime/paper/bigrt.2013.01.31a.LCA.pdf>`__,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2312) up to and including systems with 4096 CPUs. This real-time requirement
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2313) motivated the grace-period kthread, which also simplified handling of a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2314) number of race conditions.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2315) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2316) RCU must avoid degrading real-time response for CPU-bound threads,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2317) whether executing in usermode (which is one use case for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2318) ``CONFIG_NO_HZ_FULL=y``) or in the kernel. That said, CPU-bound loops in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2319) the kernel must execute ``cond_resched()`` at least once per few tens of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2320) milliseconds in order to avoid receiving an IPI from RCU.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2321) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2322) Finally, RCU's status as a synchronization primitive means that any RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2323) failure can result in arbitrary memory corruption that can be extremely
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2324) difficult to debug. This means that RCU must be extremely reliable,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2325) which in practice also means that RCU must have an aggressive
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2326) stress-test suite. This stress-test suite is called ``rcutorture``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2327) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2328) Although the need for ``rcutorture`` was no surprise, the current
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2329) immense popularity of the Linux kernel is posing interesting—and perhaps
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2330) unprecedented—validation challenges. To see this, keep in mind that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2331) there are well over one billion instances of the Linux kernel running
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2332) today, given Android smartphones, Linux-powered televisions, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2333) servers. This number can be expected to increase sharply with the advent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2334) of the celebrated Internet of Things.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2335) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2336) Suppose that RCU contains a race condition that manifests on average
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2337) once per million years of runtime. This bug will be occurring about
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2338) three times per *day* across the installed base. RCU could simply hide
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2339) behind hardware error rates, given that no one should really expect
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2340) their smartphone to last for a million years. However, anyone taking too
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2341) much comfort from this thought should consider the fact that in most
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2342) jurisdictions, a successful multi-year test of a given mechanism, which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2343) might include a Linux kernel, suffices for a number of types of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2344) safety-critical certifications. In fact, rumor has it that the Linux
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2345) kernel is already being used in production for safety-critical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2346) applications. I don't know about you, but I would feel quite bad if a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2347) bug in RCU killed someone. Which might explain my recent focus on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2348) validation and verification.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2349) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2350) Other RCU Flavors
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2351) -----------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2352) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2353) One of the more surprising things about RCU is that there are now no
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2354) fewer than five *flavors*, or API families. In addition, the primary
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2355) flavor that has been the sole focus up to this point has two different
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2356) implementations, non-preemptible and preemptible. The other four flavors
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2357) are listed below, with requirements for each described in a separate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2358) section.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2359) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2360) #. `Bottom-Half Flavor (Historical)`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2361) #. `Sched Flavor (Historical)`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2362) #. `Sleepable RCU`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2363) #. `Tasks RCU`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2364) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2365) Bottom-Half Flavor (Historical)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2366) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2367) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2368) The RCU-bh flavor of RCU has since been expressed in terms of the other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2369) RCU flavors as part of a consolidation of the three flavors into a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2370) single flavor. The read-side API remains, and continues to disable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2371) softirq and to be accounted for by lockdep. Much of the material in this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2372) section is therefore strictly historical in nature.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2373) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2374) The softirq-disable (AKA “bottom-half”, hence the “_bh” abbreviations)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2375) flavor of RCU, or *RCU-bh*, was developed by Dipankar Sarma to provide a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2376) flavor of RCU that could withstand the network-based denial-of-service
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2377) attacks researched by Robert Olsson. These attacks placed so much
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2378) networking load on the system that some of the CPUs never exited softirq
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2379) execution, which in turn prevented those CPUs from ever executing a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2380) context switch, which, in the RCU implementation of that time, prevented
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2381) grace periods from ever ending. The result was an out-of-memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2382) condition and a system hang.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2383) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2384) The solution was the creation of RCU-bh, which does
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2385) ``local_bh_disable()`` across its read-side critical sections, and which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2386) uses the transition from one type of softirq processing to another as a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2387) quiescent state in addition to context switch, idle, user mode, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2388) offline. This means that RCU-bh grace periods can complete even when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2389) some of the CPUs execute in softirq indefinitely, thus allowing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2390) algorithms based on RCU-bh to withstand network-based denial-of-service
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2391) attacks.
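
A minimal sketch of an RCU-bh reader, assuming a hypothetical
RCU-protected pointer ``route_cache`` and a hypothetical
``use_route()`` helper, might look as follows:

   ::

       struct route *r;

       rcu_read_lock_bh();                  /* Defers softirq handlers. */
       r = rcu_dereference_bh(route_cache); /* Fetch RCU-protected pointer. */
       if (r)
               use_route(r);                /* Must not sleep. */
       rcu_read_unlock_bh();                /* May run deferred softirqs. */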
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2392) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2393) Because ``rcu_read_lock_bh()`` and ``rcu_read_unlock_bh()`` disable and
re-enable softirq handlers, any attempt to start a softirq handler
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2395) during the RCU-bh read-side critical section will be deferred. In this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2396) case, ``rcu_read_unlock_bh()`` will invoke softirq processing, which can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2397) take considerable time. One can of course argue that this softirq
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2398) overhead should be associated with the code following the RCU-bh
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2399) read-side critical section rather than ``rcu_read_unlock_bh()``, but the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2400) fact is that most profiling tools cannot be expected to make this sort
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2401) of fine distinction. For example, suppose that a three-millisecond-long
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2402) RCU-bh read-side critical section executes during a time of heavy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2403) networking load. There will very likely be an attempt to invoke at least
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2404) one softirq handler during that three milliseconds, but any such
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2405) invocation will be delayed until the time of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2406) ``rcu_read_unlock_bh()``. This can of course make it appear at first
glance as if ``rcu_read_unlock_bh()`` were executing very slowly.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2408) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2409) The `RCU-bh
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2410) API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2411) includes ``rcu_read_lock_bh()``, ``rcu_read_unlock_bh()``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2412) ``rcu_dereference_bh()``, ``rcu_dereference_bh_check()``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2413) ``synchronize_rcu_bh()``, ``synchronize_rcu_bh_expedited()``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2414) ``call_rcu_bh()``, ``rcu_barrier_bh()``, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2415) ``rcu_read_lock_bh_held()``. However, the update-side APIs are now
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2416) simple wrappers for other RCU flavors, namely RCU-sched in
``CONFIG_PREEMPT=n`` kernels and RCU-preempt otherwise.
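
For example, the transitional post-consolidation wrappers were roughly
of the following form (a sketch, not the exact kernel source):

   ::

       /* RCU-bh update-side APIs as wrappers for the consolidated flavor. */
       static inline void synchronize_rcu_bh(void)
       {
               synchronize_rcu();
       }

       static inline void call_rcu_bh(struct rcu_head *head, rcu_callback_t func)
       {
               call_rcu(head, func);
       }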
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2418) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2419) Sched Flavor (Historical)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2420) ~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2421) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2422) The RCU-sched flavor of RCU has since been expressed in terms of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2423) other RCU flavors as part of a consolidation of the three flavors into a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2424) single flavor. The read-side API remains, and continues to disable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2425) preemption and to be accounted for by lockdep. Much of the material in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2426) this section is therefore strictly historical in nature.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2427) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2428) Before preemptible RCU, waiting for an RCU grace period had the side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2429) effect of also waiting for all pre-existing interrupt and NMI handlers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2430) However, there are legitimate preemptible-RCU implementations that do
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2431) not have this property, given that any point in the code outside of an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2432) RCU read-side critical section can be a quiescent state. Therefore,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2433) *RCU-sched* was created, which follows “classic” RCU in that an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2434) RCU-sched grace period waits for pre-existing interrupt and NMI
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2435) handlers. In kernels built with ``CONFIG_PREEMPT=n``, the RCU and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2436) RCU-sched APIs have identical implementations, while kernels built with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2437) ``CONFIG_PREEMPT=y`` provide a separate implementation for each.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2438) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2439) Note well that in ``CONFIG_PREEMPT=y`` kernels,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2440) ``rcu_read_lock_sched()`` and ``rcu_read_unlock_sched()`` disable and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2441) re-enable preemption, respectively. This means that if there was a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2442) preemption attempt during the RCU-sched read-side critical section,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2443) ``rcu_read_unlock_sched()`` will enter the scheduler, with all the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2444) latency and overhead entailed. Just as with ``rcu_read_unlock_bh()``,
this can make it look as if ``rcu_read_unlock_sched()`` were executing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2446) very slowly. However, the highest-priority task won't be preempted, so
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2447) that task will enjoy low-overhead ``rcu_read_unlock_sched()``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2448) invocations.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2449) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2450) The `RCU-sched
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2451) API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2452) includes ``rcu_read_lock_sched()``, ``rcu_read_unlock_sched()``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2453) ``rcu_read_lock_sched_notrace()``, ``rcu_read_unlock_sched_notrace()``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2454) ``rcu_dereference_sched()``, ``rcu_dereference_sched_check()``,
``synchronize_sched()``, ``synchronize_sched_expedited()``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2456) ``call_rcu_sched()``, ``rcu_barrier_sched()``, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2457) ``rcu_read_lock_sched_held()``. However, anything that disables
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2458) preemption also marks an RCU-sched read-side critical section, including
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2459) ``preempt_disable()`` and ``preempt_enable()``, ``local_irq_save()`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2460) ``local_irq_restore()``, and so on.
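
For example, both of the following forms mark an RCU-sched read-side
critical section, where ``gp`` is a hypothetical RCU-protected pointer
and ``do_something_with()`` is a hypothetical helper:

   ::

       /* Explicit RCU-sched reader. */
       rcu_read_lock_sched();
       do_something_with(rcu_dereference_sched(gp));
       rcu_read_unlock_sched();

       /* Disabling preemption also marks an RCU-sched reader. */
       preempt_disable();
       do_something_with(rcu_dereference_sched(gp));
       preempt_enable();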
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2461) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2462) Sleepable RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2463) ~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2464) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2465) For well over a decade, someone saying “I need to block within an RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2466) read-side critical section” was a reliable indication that this someone
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2467) did not understand RCU. After all, if you are always blocking in an RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2468) read-side critical section, you can probably afford to use a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2469) higher-overhead synchronization mechanism. However, that changed with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2470) the advent of the Linux kernel's notifiers, whose RCU read-side critical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2471) sections almost never sleep, but sometimes need to. This resulted in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2472) introduction of `sleepable RCU <https://lwn.net/Articles/202847/>`__, or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2473) *SRCU*.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2474) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2475) SRCU allows different domains to be defined, with each such domain
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2476) defined by an instance of an ``srcu_struct`` structure. A pointer to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2477) this structure must be passed in to each SRCU function, for example,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2478) ``synchronize_srcu(&ss)``, where ``ss`` is the ``srcu_struct``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2479) structure. The key benefit of these domains is that a slow SRCU reader
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2480) in one domain does not delay an SRCU grace period in some other domain.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2481) That said, one consequence of these domains is that read-side code must
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2482) pass a “cookie” from ``srcu_read_lock()`` to ``srcu_read_unlock()``, for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2483) example, as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2484) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2485)    ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2486) 
       int idx;

       idx = srcu_read_lock(&ss);
       do_something();
       srcu_read_unlock(&ss, idx);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2492) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2493) As noted above, it is legal to block within SRCU read-side critical
sections; however, with great power comes great responsibility. If you
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2495) block forever in one of a given domain's SRCU read-side critical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2496) sections, then that domain's grace periods will also be blocked forever.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2497) Of course, one good way to block forever is to deadlock, which can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2498) happen if any operation in a given domain's SRCU read-side critical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2499) section can wait, either directly or indirectly, for that domain's grace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2500) period to elapse. For example, this results in a self-deadlock:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2501) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2502)    ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2503) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2504)        1 int idx;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2505)        2
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2506)        3 idx = srcu_read_lock(&ss);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2507)        4 do_something();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2508)        5 synchronize_srcu(&ss);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2509)        6 srcu_read_unlock(&ss, idx);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2510) 
However, even if line 5 instead acquired a mutex that some other thread
held across an ``ss``-domain ``synchronize_srcu()``, deadlock would
still be possible. Furthermore, if line 5 acquired a mutex that was held
across a ``synchronize_srcu()`` for some other domain ``ss1``, and if an
``ss1``-domain SRCU read-side critical section acquired another mutex
that was held across an ``ss``-domain ``synchronize_srcu()``, deadlock
would again be possible. Such a deadlock cycle could extend across an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2518) arbitrarily large number of different SRCU domains. Again, with great
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2519) power comes great responsibility.
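
For concreteness, here is a hedged two-thread sketch of the mutex-based
variant of this deadlock, using a hypothetical mutex ``m``:

   ::

       /* Thread A: SRCU reader in domain ss. */
       idx = srcu_read_lock(&ss);
       mutex_lock(&m);          /* Blocks: thread B holds m. */
       do_something();
       mutex_unlock(&m);
       srcu_read_unlock(&ss, idx);

       /* Thread B: updater holding m across an ss-domain grace period. */
       mutex_lock(&m);
       synchronize_srcu(&ss);   /* Blocks: thread A's reader never completes. */
       mutex_unlock(&m);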
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2520) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2521) Unlike the other RCU flavors, SRCU read-side critical sections can run
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2522) on idle and even offline CPUs. This ability requires that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2523) ``srcu_read_lock()`` and ``srcu_read_unlock()`` contain memory barriers,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2524) which means that SRCU readers will run a bit slower than would RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2525) readers. It also motivates the ``smp_mb__after_srcu_read_unlock()`` API,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2526) which, in combination with ``srcu_read_unlock()``, guarantees a full
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2527) memory barrier.
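
For example, this hedged sketch obtains full-barrier ordering at the
end of the read-side critical section:

   ::

       idx = srcu_read_lock(&ss);
       do_something();
       srcu_read_unlock(&ss, idx);
       smp_mb__after_srcu_read_unlock(); /* Full barrier with the unlock. */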
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2528) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2529) Also unlike other RCU flavors, ``synchronize_srcu()`` may **not** be
invoked from CPU-hotplug notifiers, because SRCU grace periods make use
of timers, and because those timers can be temporarily “stranded” on the
outgoing CPU. This stranding of timers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2533) means that timers posted to the outgoing CPU will not fire until late in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2534) the CPU-hotplug process. The problem is that if a notifier is waiting on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2535) an SRCU grace period, that grace period is waiting on a timer, and that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2536) timer is stranded on the outgoing CPU, then the notifier will never be
awakened; in other words, deadlock has occurred. This same situation of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2538) course also prohibits ``srcu_barrier()`` from being invoked from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2539) CPU-hotplug notifiers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2540) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2541) SRCU also differs from other RCU flavors in that SRCU's expedited and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2542) non-expedited grace periods are implemented by the same mechanism. This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2543) means that in the current SRCU implementation, expediting a future grace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2544) period has the side effect of expediting all prior grace periods that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2545) have not yet completed. (But please note that this is a property of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2546) current implementation, not necessarily of future implementations.) In
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2547) addition, if SRCU has been idle for longer than the interval specified
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2548) by the ``srcutree.exp_holdoff`` kernel boot parameter (25 microseconds
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2549) by default), and if a ``synchronize_srcu()`` invocation ends this idle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2550) period, that invocation will be automatically expedited.
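
For example, booting with the following setting (the interval is
specified in nanoseconds; the value shown is purely illustrative)
raises the holdoff to 100 microseconds, so that only invocations ending
longer idle periods are automatically expedited:

   ::

       srcutree.exp_holdoff=100000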
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2551) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2552) As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2553) locking bottleneck present in prior kernel versions. Although this will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2554) allow users to put much heavier stress on ``call_srcu()``, it is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2555) important to note that SRCU does not yet take any special steps to deal
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2556) with callback flooding. So if you are posting (say) 10,000 SRCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2557) callbacks per second per CPU, you are probably totally OK, but if you
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2558) intend to post (say) 1,000,000 SRCU callbacks per second per CPU, please
run some tests first. SRCU just might need a few adjustments to deal with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2560) that sort of load. Of course, your mileage may vary based on the speed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2561) of your CPUs and the size of your memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2562) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2563) The `SRCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2564) API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2565) includes ``srcu_read_lock()``, ``srcu_read_unlock()``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2566) ``srcu_dereference()``, ``srcu_dereference_check()``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2567) ``synchronize_srcu()``, ``synchronize_srcu_expedited()``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2568) ``call_srcu()``, ``srcu_barrier()``, and ``srcu_read_lock_held()``. It
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2569) also includes ``DEFINE_SRCU()``, ``DEFINE_STATIC_SRCU()``, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2570) ``init_srcu_struct()`` APIs for defining and initializing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2571) ``srcu_struct`` structures.
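
As a hedged sketch of these APIs in combination, assuming a
hypothetical ``struct foo`` with an ``rh`` field reserved for SRCU's
use:

   ::

       #include <linux/slab.h>
       #include <linux/srcu.h>

       DEFINE_STATIC_SRCU(my_srcu);   /* Statically allocated domain. */

       struct foo {
               struct rcu_head rh;
               int data;
       };

       static void free_foo(struct rcu_head *rhp)
       {
               kfree(container_of(rhp, struct foo, rh));
       }

       /* After making p unreachable to new my_srcu readers: */
       call_srcu(&my_srcu, &p->rh, free_foo);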
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2572) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2573) Tasks RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2574) ~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2575) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2576) Some forms of tracing use “trampolines” to handle the binary rewriting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2577) required to install different types of probes. It would be good to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2578) able to free old trampolines, which sounds like a job for some form of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2579) RCU. However, because it is necessary to be able to install a trace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2580) anywhere in the code, it is not possible to use read-side markers such
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2581) as ``rcu_read_lock()`` and ``rcu_read_unlock()``. In addition, it does
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2582) not work to have these markers in the trampoline itself, because there
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2583) would need to be instructions following ``rcu_read_unlock()``. Although
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2584) ``synchronize_rcu()`` would guarantee that execution reached the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2585) ``rcu_read_unlock()``, it would not be able to guarantee that execution
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2586) had completely left the trampoline. Worse yet, in some situations
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2587) the trampoline's protection must extend a few instructions *prior* to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2588) execution reaching the trampoline.  For example, these few instructions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2589) might calculate the address of the trampoline, so that entering the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2590) trampoline would be pre-ordained a surprisingly long time before execution
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2591) actually reached the trampoline itself.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2592) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2593) The solution, in the form of `Tasks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2594) RCU <https://lwn.net/Articles/607117/>`__, is to have implicit read-side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2595) critical sections that are delimited by voluntary context switches, that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2596) is, calls to ``schedule()``, ``cond_resched()``, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2597) ``synchronize_rcu_tasks()``. In addition, transitions to and from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2598) userspace execution also delimit tasks-RCU read-side critical sections.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2599) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2600) The tasks-RCU API is quite compact, consisting only of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2601) ``call_rcu_tasks()``, ``synchronize_rcu_tasks()``, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2602) ``rcu_barrier_tasks()``. In ``CONFIG_PREEMPT=n`` kernels, trampolines
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2603) cannot be preempted, so these APIs map to ``call_rcu()``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2604) ``synchronize_rcu()``, and ``rcu_barrier()``, respectively. In
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2605) ``CONFIG_PREEMPT=y`` kernels, trampolines can be preempted, and these
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2606) three APIs are therefore implemented by separate functions that check
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2607) for voluntary context switches.
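
As a hedged sketch, a tracer might free an old trampoline as follows,
where ``struct trampoline``, its ``rh`` field, and
``free_trampoline_memory()`` are all hypothetical:

   ::

       static void trampoline_free_cb(struct rcu_head *rhp)
       {
               struct trampoline *tp;

               tp = container_of(rhp, struct trampoline, rh);
               free_trampoline_memory(tp);
       }

       /* After unhooking tp from the probe site: */
       call_rcu_tasks(&tp->rh, trampoline_free_cb);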
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2608) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2609) Possible Future Changes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2610) -----------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2611) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2612) One of the tricks that RCU uses to attain update-side scalability is to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2613) increase grace-period latency with increasing numbers of CPUs. If this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2614) becomes a serious problem, it will be necessary to rework the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2615) grace-period state machine so as to avoid the need for the additional
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2616) latency.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2617) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2618) RCU disables CPU hotplug in a few places, perhaps most notably in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2619) ``rcu_barrier()`` operations. If there is a strong reason to use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2620) ``rcu_barrier()`` in CPU-hotplug notifiers, it will be necessary to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2621) avoid disabling CPU hotplug. This would introduce some complexity, so
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2622) there had better be a *very* good reason.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2623) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2624) The tradeoff between grace-period latency on the one hand and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2625) interruptions of other CPUs on the other hand may need to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2626) re-examined. The desire is of course for zero grace-period latency as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2627) well as zero interprocessor interrupts undertaken during an expedited
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2628) grace period operation. While this ideal is unlikely to be achievable,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2629) it is quite possible that further improvements can be made.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2630) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2631) The multiprocessor implementations of RCU use a combining tree that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2632) groups CPUs so as to reduce lock contention and increase cache locality.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2633) However, this combining tree does not spread its memory across NUMA
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2634) nodes nor does it align the CPU groups with hardware features such as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2635) sockets or cores. Such spreading and alignment is currently believed to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2636) be unnecessary because the hotpath read-side primitives do not access
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2637) the combining tree, nor does ``call_rcu()`` in the common case. If you
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2638) believe that your architecture needs such spreading and alignment, then
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2639) your architecture should also benefit from the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2640) ``rcutree.rcu_fanout_leaf`` boot parameter, which can be set to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2641) number of CPUs in a socket, NUMA node, or whatever. If the number of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2642) CPUs is too large, use a fraction of the number of CPUs. If the number
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2643) of CPUs is a large prime number, well, that certainly is an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2644) “interesting” architectural choice! More flexible arrangements might be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2645) considered, but only if ``rcutree.rcu_fanout_leaf`` has proven
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2646) inadequate, and only if the inadequacy has been demonstrated by a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2647) carefully run and realistic system-level workload.
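
For example, on a system with 16 CPUs per socket, one might boot with
the following (an illustrative value, not a recommendation):

   ::

       rcutree.rcu_fanout_leaf=16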
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2648) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2649) Please note that arrangements that require RCU to remap CPU numbers will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2650) require extremely good demonstration of need and full exploration of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2651) alternatives.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2652) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2653) RCU's various kthreads are reasonably recent additions. It is quite
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2654) likely that adjustments will be required to more gracefully handle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2655) extreme loads. It might also be necessary to be able to relate CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2656) utilization by RCU's kthreads and softirq handlers to the code that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2657) instigated this CPU utilization. For example, RCU callback overhead
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2658) might be charged back to the originating ``call_rcu()`` instance, though
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2659) probably not in production kernels.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2660) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2661) Additional work may be required to provide reasonable forward-progress
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2662) guarantees under heavy load for grace periods and for callback
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2663) invocation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2664) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2665) Summary
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2666) -------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2667) 
This document has presented more than two decades' worth of RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2669) requirements. Given that the requirements keep changing, this will not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2670) be the last word on this subject, but at least it serves to get an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2671) important subset of the requirements set forth.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2672) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2673) Acknowledgments
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2674) ---------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2675) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2676) I am grateful to Steven Rostedt, Lai Jiangshan, Ingo Molnar, Oleg
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2677) Nesterov, Borislav Petkov, Peter Zijlstra, Boqun Feng, and Andy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2678) Lutomirski for their help in rendering this article human readable, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2679) to Michelle Rankin for her support of this effort. Other contributions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2680) are acknowledged in the Linux kernel's git archive.