Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

3 Commits   0 Branches   0 Tags   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300    1) ===================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300    2) A Tour Through TREE_RCU's Data Structures [LWN.net]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300    3) ===================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300    4) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300    5) December 18, 2016
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300    6) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300    7) This article was contributed by Paul E. McKenney
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300    8) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300    9) Introduction
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   10) ============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   11) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   12) This document describes RCU's major data structures and their relationship
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   13) to each other.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   14) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   15) Data-Structure Relationships
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   16) ============================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   17) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   18) RCU is for all intents and purposes a large state machine, and its
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   19) data structures maintain the state in such a way as to allow RCU readers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   20) to execute extremely quickly, while also processing the RCU grace periods
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   21) requested by updaters in an efficient and extremely scalable fashion.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   22) The efficiency and scalability of RCU updaters is provided primarily
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   23) by a combining tree, as shown below:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   24) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   25) .. kernel-figure:: BigTreeClassicRCU.svg
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   26) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   27) This diagram shows an enclosing ``rcu_state`` structure containing a tree
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   28) of ``rcu_node`` structures. Each leaf node of the ``rcu_node`` tree has up
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   29) to 16 ``rcu_data`` structures associated with it, so that there are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   30) ``NR_CPUS`` number of ``rcu_data`` structures, one for each possible CPU.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   31) This structure is adjusted at boot time, if needed, to handle the common
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   32) case where ``nr_cpu_ids`` is much less than ``NR_CPUs``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   33) For example, a number of Linux distributions set ``NR_CPUs=4096``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   34) which results in a three-level ``rcu_node`` tree.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   35) If the actual hardware has only 16 CPUs, RCU will adjust itself
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   36) at boot time, resulting in an ``rcu_node`` tree with only a single node.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   37) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   38) The purpose of this combining tree is to allow per-CPU events
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   39) such as quiescent states, dyntick-idle transitions,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   40) and CPU hotplug operations to be processed efficiently
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   41) and scalably.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   42) Quiescent states are recorded by the per-CPU ``rcu_data`` structures,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   43) and other events are recorded by the leaf-level ``rcu_node``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   44) structures.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   45) All of these events are combined at each level of the tree until finally
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   46) grace periods are completed at the tree's root ``rcu_node``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   47) structure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   48) A grace period can be completed at the root once every CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   49) (or, in the case of ``CONFIG_PREEMPT_RCU``, task)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   50) has passed through a quiescent state.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   51) Once a grace period has completed, record of that fact is propagated
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   52) back down the tree.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   53) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   54) As can be seen from the diagram, on a 64-bit system
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   55) a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   56) of 64 at the root and a fanout of 16 at the leaves.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   57) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   58) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   59) | **Quick Quiz**:                                                       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   60) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   61) | Why isn't the fanout at the leaves also 64?                           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   62) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   63) | **Answer**:                                                           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   64) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   65) | Because there are more types of events that affect the leaf-level     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   66) | ``rcu_node`` structures than further up the tree. Therefore, if the   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   67) | leaf ``rcu_node`` structures have fanout of 64, the contention on     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   68) | these structures' ``->structures`` becomes excessive. Experimentation |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   69) | on a wide variety of systems has shown that a fanout of 16 works well |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   70) | for the leaves of the ``rcu_node`` tree.                              |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   71) |                                                                       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   72) | Of course, further experience with systems having hundreds or         |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   73) | thousands of CPUs may demonstrate that the fanout for the non-leaf    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   74) | ``rcu_node`` structures must also be reduced. Such reduction can be   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   75) | easily carried out when and if it proves necessary. In the meantime,  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   76) | if you are using such a system and running into contention problems   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   77) | on the non-leaf ``rcu_node`` structures, you may use the              |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   78) | ``CONFIG_RCU_FANOUT`` kernel configuration parameter to reduce the    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   79) | non-leaf fanout as needed.                                            |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   80) |                                                                       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   81) | Kernels built for systems with strong NUMA characteristics might      |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   82) | also need to adjust ``CONFIG_RCU_FANOUT`` so that the domains of      |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   83) | the ``rcu_node`` structures align with hardware boundaries.           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   84) | However, there has thus far been no need for this.                    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   85) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   86) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   87) If your system has more than 1,024 CPUs (or more than 512 CPUs on a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   88) 32-bit system), then RCU will automatically add more levels to the tree.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   89) For example, if you are crazy enough to build a 64-bit system with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   90) 65,536 CPUs, RCU would configure the ``rcu_node`` tree as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   91) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   92) .. kernel-figure:: HugeTreeClassicRCU.svg
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   93) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   94) RCU currently permits up to a four-level tree, which on a 64-bit system
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   95) accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   96) 32-bit systems. On the other hand, you can set both
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   97) ``CONFIG_RCU_FANOUT`` and ``CONFIG_RCU_FANOUT_LEAF`` to be as small as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   98) 2, which would result in a 16-CPU test using a 4-level tree. This can be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   99) useful for testing large-system capabilities on small test machines.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  100) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  101) This multi-level combining tree allows us to get most of the performance
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  102) and scalability benefits of partitioning, even though RCU grace-period
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  103) detection is inherently a global operation. The trick here is that only
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  104) the last CPU to report a quiescent state into a given ``rcu_node``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  105) structure need advance to the ``rcu_node`` structure at the next level
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  106) up the tree. This means that at the leaf-level ``rcu_node`` structure,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  107) only one access out of sixteen will progress up the tree. For the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  108) internal ``rcu_node`` structures, the situation is even more extreme:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  109) Only one access out of sixty-four will progress up the tree. Because the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  110) vast majority of the CPUs do not progress up the tree, the lock
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  111) contention remains roughly constant up the tree. No matter how many CPUs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  112) there are in the system, at most 64 quiescent-state reports per grace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  113) period will progress all the way to the root ``rcu_node`` structure,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  114) thus ensuring that the lock contention on that root ``rcu_node``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  115) structure remains acceptably low.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  116) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  117) In effect, the combining tree acts like a big shock absorber, keeping
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  118) lock contention under control at all tree levels regardless of the level
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  119) of loading on the system.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  120) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  121) RCU updaters wait for normal grace periods by registering RCU callbacks,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  122) either directly via ``call_rcu()`` or indirectly via
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  123) ``synchronize_rcu()`` and friends. RCU callbacks are represented by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  124) ``rcu_head`` structures, which are queued on ``rcu_data`` structures
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  125) while they are waiting for a grace period to elapse, as shown in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  126) following figure:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  127) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  128) .. kernel-figure:: BigTreePreemptRCUBHdyntickCB.svg
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  129) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  130) This figure shows how ``TREE_RCU``'s and ``PREEMPT_RCU``'s major data
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  131) structures are related. Lesser data structures will be introduced with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  132) the algorithms that make use of them.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  133) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  134) Note that each of the data structures in the above figure has its own
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  135) synchronization:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  136) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  137) #. Each ``rcu_state`` structures has a lock and a mutex, and some fields
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  138)    are protected by the corresponding root ``rcu_node`` structure's lock.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  139) #. Each ``rcu_node`` structure has a spinlock.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  140) #. The fields in ``rcu_data`` are private to the corresponding CPU,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  141)    although a few can be read and written by other CPUs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  142) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  143) It is important to note that different data structures can have very
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  144) different ideas about the state of RCU at any given time. For but one
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  145) example, awareness of the start or end of a given RCU grace period
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  146) propagates slowly through the data structures. This slow propagation is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  147) absolutely necessary for RCU to have good read-side performance. If this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  148) balkanized implementation seems foreign to you, one useful trick is to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  149) consider each instance of these data structures to be a different
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  150) person, each having the usual slightly different view of reality.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  151) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  152) The general role of each of these data structures is as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  153) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  154) #. ``rcu_state``: This structure forms the interconnection between the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  155)    ``rcu_node`` and ``rcu_data`` structures, tracks grace periods,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  156)    serves as short-term repository for callbacks orphaned by CPU-hotplug
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  157)    events, maintains ``rcu_barrier()`` state, tracks expedited
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  158)    grace-period state, and maintains state used to force quiescent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  159)    states when grace periods extend too long,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  160) #. ``rcu_node``: This structure forms the combining tree that propagates
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  161)    quiescent-state information from the leaves to the root, and also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  162)    propagates grace-period information from the root to the leaves. It
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  163)    provides local copies of the grace-period state in order to allow
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  164)    this information to be accessed in a synchronized manner without
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  165)    suffering the scalability limitations that would otherwise be imposed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  166)    by global locking. In ``CONFIG_PREEMPT_RCU`` kernels, it manages the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  167)    lists of tasks that have blocked while in their current RCU read-side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  168)    critical section. In ``CONFIG_PREEMPT_RCU`` with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  169)    ``CONFIG_RCU_BOOST``, it manages the per-\ ``rcu_node``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  170)    priority-boosting kernel threads (kthreads) and state. Finally, it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  171)    records CPU-hotplug state in order to determine which CPUs should be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  172)    ignored during a given grace period.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  173) #. ``rcu_data``: This per-CPU structure is the focus of quiescent-state
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  174)    detection and RCU callback queuing. It also tracks its relationship
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  175)    to the corresponding leaf ``rcu_node`` structure to allow
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  176)    more-efficient propagation of quiescent states up the ``rcu_node``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  177)    combining tree. Like the ``rcu_node`` structure, it provides a local
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  178)    copy of the grace-period information to allow for-free synchronized
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  179)    access to this information from the corresponding CPU. Finally, this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  180)    structure records past dyntick-idle state for the corresponding CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  181)    and also tracks statistics.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  182) #. ``rcu_head``: This structure represents RCU callbacks, and is the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  183)    only structure allocated and managed by RCU users. The ``rcu_head``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  184)    structure is normally embedded within the RCU-protected data
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  185)    structure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  186) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  187) If all you wanted from this article was a general notion of how RCU's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  188) data structures are related, you are done. Otherwise, each of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  189) following sections give more details on the ``rcu_state``, ``rcu_node``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  190) and ``rcu_data`` data structures.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  191) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  192) The ``rcu_state`` Structure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  193) ~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  194) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  195) The ``rcu_state`` structure is the base structure that represents the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  196) state of RCU in the system. This structure forms the interconnection
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  197) between the ``rcu_node`` and ``rcu_data`` structures, tracks grace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  198) periods, contains the lock used to synchronize with CPU-hotplug events,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  199) and maintains state used to force quiescent states when grace periods
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  200) extend too long,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  201) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  202) A few of the ``rcu_state`` structure's fields are discussed, singly and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  203) in groups, in the following sections. The more specialized fields are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  204) covered in the discussion of their use.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  205) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  206) Relationship to rcu_node and rcu_data Structures
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  207) ''''''''''''''''''''''''''''''''''''''''''''''''
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  208) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  209) This portion of the ``rcu_state`` structure is declared as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  210) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  211) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  212) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  213)      1   struct rcu_node node[NUM_RCU_NODES];
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  214)      2   struct rcu_node *level[NUM_RCU_LVLS + 1];
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  215)      3   struct rcu_data __percpu *rda;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  216) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  217) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  218) | **Quick Quiz**:                                                       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  219) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  220) | Wait a minute! You said that the ``rcu_node`` structures formed a     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  221) | tree, but they are declared as a flat array! What gives?              |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  222) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  223) | **Answer**:                                                           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  224) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  225) | The tree is laid out in the array. The first node In the array is the |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  226) | head, the next set of nodes in the array are children of the head     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  227) | node, and so on until the last set of nodes in the array are the      |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  228) | leaves.                                                               |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  229) | See the following diagrams to see how this works.                     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  230) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  231) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  232) The ``rcu_node`` tree is embedded into the ``->node[]`` array as shown
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  233) in the following figure:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  234) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  235) .. kernel-figure:: TreeMapping.svg
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  236) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  237) One interesting consequence of this mapping is that a breadth-first
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  238) traversal of the tree is implemented as a simple linear scan of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  239) array, which is in fact what the ``rcu_for_each_node_breadth_first()``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  240) macro does. This macro is used at the beginning and ends of grace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  241) periods.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  242) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  243) Each entry of the ``->level`` array references the first ``rcu_node``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  244) structure on the corresponding level of the tree, for example, as shown
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  245) below:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  246) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  247) .. kernel-figure:: TreeMappingLevel.svg
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  248) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  249) The zero\ :sup:`th` element of the array references the root
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  250) ``rcu_node`` structure, the first element references the first child of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  251) the root ``rcu_node``, and finally the second element references the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  252) first leaf ``rcu_node`` structure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  253) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  254) For whatever it is worth, if you draw the tree to be tree-shaped rather
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  255) than array-shaped, it is easy to draw a planar representation:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  256) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  257) .. kernel-figure:: TreeLevel.svg
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  258) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  259) Finally, the ``->rda`` field references a per-CPU pointer to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  260) corresponding CPU's ``rcu_data`` structure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  261) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  262) All of these fields are constant once initialization is complete, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  263) therefore need no protection.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  264) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  265) Grace-Period Tracking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  266) '''''''''''''''''''''
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  267) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  268) This portion of the ``rcu_state`` structure is declared as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  269) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  270) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  271) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  272)      1   unsigned long gp_seq;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  273) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  274) RCU grace periods are numbered, and the ``->gp_seq`` field contains the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  275) current grace-period sequence number. The bottom two bits are the state
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  276) of the current grace period, which can be zero for not yet started or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  277) one for in progress. In other words, if the bottom two bits of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  278) ``->gp_seq`` are zero, then RCU is idle. Any other value in the bottom
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  279) two bits indicates that something is broken. This field is protected by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  280) the root ``rcu_node`` structure's ``->lock`` field.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  281) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  282) There are ``->gp_seq`` fields in the ``rcu_node`` and ``rcu_data``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  283) structures as well. The fields in the ``rcu_state`` structure represent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  284) the most current value, and those of the other structures are compared
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  285) in order to detect the beginnings and ends of grace periods in a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  286) distributed fashion. The values flow from ``rcu_state`` to ``rcu_node``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  287) (down the tree from the root to the leaves) to ``rcu_data``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  288) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  289) Miscellaneous
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  290) '''''''''''''
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  291) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  292) This portion of the ``rcu_state`` structure is declared as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  293) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  294) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  295) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  296)      1   unsigned long gp_max;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  297)      2   char abbr;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  298)      3   char *name;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  299) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  300) The ``->gp_max`` field tracks the duration of the longest grace period
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  301) in jiffies. It is protected by the root ``rcu_node``'s ``->lock``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  302) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  303) The ``->name`` and ``->abbr`` fields distinguish between preemptible RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  304) (“rcu_preempt” and “p”) and non-preemptible RCU (“rcu_sched” and “s”).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  305) These fields are used for diagnostic and tracing purposes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  306) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  307) The ``rcu_node`` Structure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  308) ~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  309) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  310) The ``rcu_node`` structures form the combining tree that propagates
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  311) quiescent-state information from the leaves to the root and also that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  312) propagates grace-period information from the root down to the leaves.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  313) They provides local copies of the grace-period state in order to allow
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  314) this information to be accessed in a synchronized manner without
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  315) suffering the scalability limitations that would otherwise be imposed by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  316) global locking. In ``CONFIG_PREEMPT_RCU`` kernels, they manage the lists
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  317) of tasks that have blocked while in their current RCU read-side critical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  318) section. In ``CONFIG_PREEMPT_RCU`` with ``CONFIG_RCU_BOOST``, they
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  319) manage the per-\ ``rcu_node`` priority-boosting kernel threads
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  320) (kthreads) and state. Finally, they record CPU-hotplug state in order to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  321) determine which CPUs should be ignored during a given grace period.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  322) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  323) The ``rcu_node`` structure's fields are discussed, singly and in groups,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  324) in the following sections.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  325) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  326) Connection to Combining Tree
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  327) ''''''''''''''''''''''''''''
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  328) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  329) This portion of the ``rcu_node`` structure is declared as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  330) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  331) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  332) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  333)      1   struct rcu_node *parent;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  334)      2   u8 level;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  335)      3   u8 grpnum;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  336)      4   unsigned long grpmask;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  337)      5   int grplo;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  338)      6   int grphi;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  339) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  340) The ``->parent`` pointer references the ``rcu_node`` one level up in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  341) tree, and is ``NULL`` for the root ``rcu_node``. The RCU implementation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  342) makes heavy use of this field to push quiescent states up the tree. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  343) ``->level`` field gives the level in the tree, with the root being at
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  344) level zero, its children at level one, and so on. The ``->grpnum`` field
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  345) gives this node's position within the children of its parent, so this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  346) number can range between 0 and 31 on 32-bit systems and between 0 and 63
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  347) on 64-bit systems. The ``->level`` and ``->grpnum`` fields are used only
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  348) during initialization and for tracing. The ``->grpmask`` field is the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  349) bitmask counterpart of ``->grpnum``, and therefore always has exactly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  350) one bit set. This mask is used to clear the bit corresponding to this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  351) ``rcu_node`` structure in its parent's bitmasks, which are described
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  352) later. Finally, the ``->grplo`` and ``->grphi`` fields contain the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  353) lowest and highest numbered CPU served by this ``rcu_node`` structure,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  354) respectively.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  355) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  356) All of these fields are constant, and thus do not require any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  357) synchronization.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  358) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  359) Synchronization
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  360) '''''''''''''''
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  361) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  362) This field of the ``rcu_node`` structure is declared as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  363) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  364) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  365) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  366)      1   raw_spinlock_t lock;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  367) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  368) This field is used to protect the remaining fields in this structure,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  369) unless otherwise stated. That said, all of the fields in this structure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  370) can be accessed without locking for tracing purposes. Yes, this can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  371) result in confusing traces, but better some tracing confusion than to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  372) heisenbugged out of existence.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  373) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  374) .. _grace-period-tracking-1:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  375) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  376) Grace-Period Tracking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  377) '''''''''''''''''''''
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  378) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  379) This portion of the ``rcu_node`` structure is declared as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  380) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  381) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  382) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  383)      1   unsigned long gp_seq;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  384)      2   unsigned long gp_seq_needed;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  385) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  386) The ``rcu_node`` structures' ``->gp_seq`` fields are the counterparts of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  387) the field of the same name in the ``rcu_state`` structure. They each may
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  388) lag up to one step behind their ``rcu_state`` counterpart. If the bottom
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  389) two bits of a given ``rcu_node`` structure's ``->gp_seq`` field is zero,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  390) then this ``rcu_node`` structure believes that RCU is idle.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  391) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  392) The ``>gp_seq`` field of each ``rcu_node`` structure is updated at the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  393) beginning and the end of each grace period.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  394) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  395) The ``->gp_seq_needed`` fields record the furthest-in-the-future grace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  396) period request seen by the corresponding ``rcu_node`` structure. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  397) request is considered fulfilled when the value of the ``->gp_seq`` field
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  398) equals or exceeds that of the ``->gp_seq_needed`` field.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  399) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  400) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  401) | **Quick Quiz**:                                                       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  402) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  403) | Suppose that this ``rcu_node`` structure doesn't see a request for a  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  404) | very long time. Won't wrapping of the ``->gp_seq`` field cause        |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  405) | problems?                                                             |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  406) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  407) | **Answer**:                                                           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  408) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  409) | No, because if the ``->gp_seq_needed`` field lags behind the          |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  410) | ``->gp_seq`` field, the ``->gp_seq_needed`` field will be updated at  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  411) | the end of the grace period. Modulo-arithmetic comparisons therefore  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  412) | will always get the correct answer, even with wrapping.               |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  413) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  414) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  415) Quiescent-State Tracking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  416) ''''''''''''''''''''''''
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  417) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  418) These fields manage the propagation of quiescent states up the combining
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  419) tree.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  420) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  421) This portion of the ``rcu_node`` structure has fields as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  422) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  423) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  424) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  425)      1   unsigned long qsmask;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  426)      2   unsigned long expmask;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  427)      3   unsigned long qsmaskinit;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  428)      4   unsigned long expmaskinit;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  429) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  430) The ``->qsmask`` field tracks which of this ``rcu_node`` structure's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  431) children still need to report quiescent states for the current normal
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  432) grace period. Such children will have a value of 1 in their
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  433) corresponding bit. Note that the leaf ``rcu_node`` structures should be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  434) thought of as having ``rcu_data`` structures as their children.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  435) Similarly, the ``->expmask`` field tracks which of this ``rcu_node``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  436) structure's children still need to report quiescent states for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  437) current expedited grace period. An expedited grace period has the same
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  438) conceptual properties as a normal grace period, but the expedited
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  439) implementation accepts extreme CPU overhead to obtain much lower
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  440) grace-period latency, for example, consuming a few tens of microseconds
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  441) worth of CPU time to reduce grace-period duration from milliseconds to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  442) tens of microseconds. The ``->qsmaskinit`` field tracks which of this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  443) ``rcu_node`` structure's children cover for at least one online CPU.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  444) This mask is used to initialize ``->qsmask``, and ``->expmaskinit`` is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  445) used to initialize ``->expmask`` and the beginning of the normal and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  446) expedited grace periods, respectively.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  447) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  448) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  449) | **Quick Quiz**:                                                       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  450) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  451) | Why are these bitmasks protected by locking? Come on, haven't you     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  452) | heard of atomic instructions???                                       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  453) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  454) | **Answer**:                                                           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  455) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  456) | Lockless grace-period computation! Such a tantalizing possibility!    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  457) | But consider the following sequence of events:                        |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  458) |                                                                       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  459) | #. CPU 0 has been in dyntick-idle mode for quite some time. When it   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  460) |    wakes up, it notices that the current RCU grace period needs it to |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  461) |    report in, so it sets a flag where the scheduling clock interrupt  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  462) |    will find it.                                                      |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  463) | #. Meanwhile, CPU 1 is running ``force_quiescent_state()``, and       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  464) |    notices that CPU 0 has been in dyntick idle mode, which qualifies  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  465) |    as an extended quiescent state.                                    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  466) | #. CPU 0's scheduling clock interrupt fires in the middle of an RCU   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  467) |    read-side critical section, and notices that the RCU core needs    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  468) |    something, so commences RCU softirq processing.                    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  469) | #. CPU 0's softirq handler executes and is just about ready to report |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  470) |    its quiescent state up the ``rcu_node`` tree.                      |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  471) | #. But CPU 1 beats it to the punch, completing the current grace      |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  472) |    period and starting a new one.                                     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  473) | #. CPU 0 now reports its quiescent state for the wrong grace period.  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  474) |    That grace period might now end before the RCU read-side critical  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  475) |    section. If that happens, disaster will ensue.                     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  476) |                                                                       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  477) | So the locking is absolutely required in order to coordinate clearing |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  478) | of the bits with updating of the grace-period sequence number in      |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  479) | ``->gp_seq``.                                                         |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  480) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  481) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  482) Blocked-Task Management
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  483) '''''''''''''''''''''''
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  484) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  485) ``PREEMPT_RCU`` allows tasks to be preempted in the midst of their RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  486) read-side critical sections, and these tasks must be tracked explicitly.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  487) The details of exactly why and how they are tracked will be covered in a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  488) separate article on RCU read-side processing. For now, it is enough to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  489) know that the ``rcu_node`` structure tracks them.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  490) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  491) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  492) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  493)      1   struct list_head blkd_tasks;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  494)      2   struct list_head *gp_tasks;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  495)      3   struct list_head *exp_tasks;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  496)      4   bool wait_blkd_tasks;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  497) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  498) The ``->blkd_tasks`` field is a list header for the list of blocked and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  499) preempted tasks. As tasks undergo context switches within RCU read-side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  500) critical sections, their ``task_struct`` structures are enqueued (via
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  501) the ``task_struct``'s ``->rcu_node_entry`` field) onto the head of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  502) ``->blkd_tasks`` list for the leaf ``rcu_node`` structure corresponding
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  503) to the CPU on which the outgoing context switch executed. As these tasks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  504) later exit their RCU read-side critical sections, they remove themselves
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  505) from the list. This list is therefore in reverse time order, so that if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  506) one of the tasks is blocking the current grace period, all subsequent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  507) tasks must also be blocking that same grace period. Therefore, a single
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  508) pointer into this list suffices to track all tasks blocking a given
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  509) grace period. That pointer is stored in ``->gp_tasks`` for normal grace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  510) periods and in ``->exp_tasks`` for expedited grace periods. These last
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  511) two fields are ``NULL`` if either there is no grace period in flight or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  512) if there are no blocked tasks preventing that grace period from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  513) completing. If either of these two pointers is referencing a task that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  514) removes itself from the ``->blkd_tasks`` list, then that task must
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  515) advance the pointer to the next task on the list, or set the pointer to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  516) ``NULL`` if there are no subsequent tasks on the list.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  517) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  518) For example, suppose that tasks T1, T2, and T3 are all hard-affinitied
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  519) to the largest-numbered CPU in the system. Then if task T1 blocked in an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  520) RCU read-side critical section, then an expedited grace period started,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  521) then task T2 blocked in an RCU read-side critical section, then a normal
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  522) grace period started, and finally task 3 blocked in an RCU read-side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  523) critical section, then the state of the last leaf ``rcu_node``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  524) structure's blocked-task list would be as shown below:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  525) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  526) .. kernel-figure:: blkd_task.svg
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  527) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  528) Task T1 is blocking both grace periods, task T2 is blocking only the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  529) normal grace period, and task T3 is blocking neither grace period. Note
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  530) that these tasks will not remove themselves from this list immediately
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  531) upon resuming execution. They will instead remain on the list until they
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  532) execute the outermost ``rcu_read_unlock()`` that ends their RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  533) read-side critical section.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  534) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  535) The ``->wait_blkd_tasks`` field indicates whether or not the current
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  536) grace period is waiting on a blocked task.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  537) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  538) Sizing the ``rcu_node`` Array
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  539) '''''''''''''''''''''''''''''
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  540) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  541) The ``rcu_node`` array is sized via a series of C-preprocessor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  542) expressions as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  543) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  544) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  545) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  546)     1 #ifdef CONFIG_RCU_FANOUT
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  547)     2 #define RCU_FANOUT CONFIG_RCU_FANOUT
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  548)     3 #else
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  549)     4 # ifdef CONFIG_64BIT
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  550)     5 # define RCU_FANOUT 64
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  551)     6 # else
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  552)     7 # define RCU_FANOUT 32
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  553)     8 # endif
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  554)     9 #endif
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  555)    10
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  556)    11 #ifdef CONFIG_RCU_FANOUT_LEAF
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  557)    12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  558)    13 #else
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  559)    14 # ifdef CONFIG_64BIT
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  560)    15 # define RCU_FANOUT_LEAF 64
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  561)    16 # else
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  562)    17 # define RCU_FANOUT_LEAF 32
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  563)    18 # endif
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  564)    19 #endif
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  565)    20
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  566)    21 #define RCU_FANOUT_1        (RCU_FANOUT_LEAF)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  567)    22 #define RCU_FANOUT_2        (RCU_FANOUT_1 * RCU_FANOUT)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  568)    23 #define RCU_FANOUT_3        (RCU_FANOUT_2 * RCU_FANOUT)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  569)    24 #define RCU_FANOUT_4        (RCU_FANOUT_3 * RCU_FANOUT)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  570)    25
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  571)    26 #if NR_CPUS <= RCU_FANOUT_1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  572)    27 #  define RCU_NUM_LVLS        1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  573)    28 #  define NUM_RCU_LVL_0        1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  574)    29 #  define NUM_RCU_NODES        NUM_RCU_LVL_0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  575)    30 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  576)    31 #  define RCU_NODE_NAME_INIT  { "rcu_node_0" }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  577)    32 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0" }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  578)    33 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0" }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  579)    34 #elif NR_CPUS <= RCU_FANOUT_2
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  580)    35 #  define RCU_NUM_LVLS        2
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  581)    36 #  define NUM_RCU_LVL_0        1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  582)    37 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  583)    38 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  584)    39 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  585)    40 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1" }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  586)    41 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1" }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  587)    42 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1" }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  588)    43 #elif NR_CPUS <= RCU_FANOUT_3
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  589)    44 #  define RCU_NUM_LVLS        3
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  590)    45 #  define NUM_RCU_LVL_0        1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  591)    46 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  592)    47 #  define NUM_RCU_LVL_2        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  593)    48 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  594)    49 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  595)    50 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1", "rcu_node_2" }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  596)    51 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  597)    52 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  598)    53 #elif NR_CPUS <= RCU_FANOUT_4
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  599)    54 #  define RCU_NUM_LVLS        4
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  600)    55 #  define NUM_RCU_LVL_0        1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  601)    56 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  602)    57 #  define NUM_RCU_LVL_2        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  603)    58 #  define NUM_RCU_LVL_3        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  604)    59 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  605)    60 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  606)    61 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  607)    62 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  608)    63 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  609)    64 #else
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  610)    65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  611)    66 #endif
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  612) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  613) The maximum number of levels in the ``rcu_node`` structure is currently
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  614) limited to four, as specified by lines 21-24 and the structure of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  615) subsequent “if” statement. For 32-bit systems, this allows
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  616) 16*32*32*32=524,288 CPUs, which should be sufficient for the next few
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  617) years at least. For 64-bit systems, 16*64*64*64=4,194,304 CPUs is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  618) allowed, which should see us through the next decade or so. This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  619) four-level tree also allows kernels built with ``CONFIG_RCU_FANOUT=8``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  620) to support up to 4096 CPUs, which might be useful in very large systems
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  621) having eight CPUs per socket (but please note that no one has yet shown
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  622) any measurable performance degradation due to misaligned socket and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  623) ``rcu_node`` boundaries). In addition, building kernels with a full four
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  624) levels of ``rcu_node`` tree permits better testing of RCU's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  625) combining-tree code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  626) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  627) The ``RCU_FANOUT`` symbol controls how many children are permitted at
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  628) each non-leaf level of the ``rcu_node`` tree. If the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  629) ``CONFIG_RCU_FANOUT`` Kconfig option is not specified, it is set based
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  630) on the word size of the system, which is also the Kconfig default.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  631) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  632) The ``RCU_FANOUT_LEAF`` symbol controls how many CPUs are handled by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  633) each leaf ``rcu_node`` structure. Experience has shown that allowing a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  634) given leaf ``rcu_node`` structure to handle 64 CPUs, as permitted by the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  635) number of bits in the ``->qsmask`` field on a 64-bit system, results in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  636) excessive contention for the leaf ``rcu_node`` structures' ``->lock``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  637) fields. The number of CPUs per leaf ``rcu_node`` structure is therefore
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  638) limited to 16 given the default value of ``CONFIG_RCU_FANOUT_LEAF``. If
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  639) ``CONFIG_RCU_FANOUT_LEAF`` is unspecified, the value selected is based
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  640) on the word size of the system, just as for ``CONFIG_RCU_FANOUT``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  641) Lines 11-19 perform this computation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  642) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  643) Lines 21-24 compute the maximum number of CPUs supported by a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  644) single-level (which contains a single ``rcu_node`` structure),
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  645) two-level, three-level, and four-level ``rcu_node`` tree, respectively,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  646) given the fanout specified by ``RCU_FANOUT`` and ``RCU_FANOUT_LEAF``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  647) These numbers of CPUs are retained in the ``RCU_FANOUT_1``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  648) ``RCU_FANOUT_2``, ``RCU_FANOUT_3``, and ``RCU_FANOUT_4`` C-preprocessor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  649) variables, respectively.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  650) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  651) These variables are used to control the C-preprocessor ``#if`` statement
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  652) spanning lines 26-66 that computes the number of ``rcu_node`` structures
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  653) required for each level of the tree, as well as the number of levels
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  654) required. The number of levels is placed in the ``NUM_RCU_LVLS``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  655) C-preprocessor variable by lines 27, 35, 44, and 54. The number of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  656) ``rcu_node`` structures for the topmost level of the tree is always
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  657) exactly one, and this value is unconditionally placed into
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  658) ``NUM_RCU_LVL_0`` by lines 28, 36, 45, and 55. The rest of the levels
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  659) (if any) of the ``rcu_node`` tree are computed by dividing the maximum
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  660) number of CPUs by the fanout supported by the number of levels from the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  661) current level down, rounding up. This computation is performed by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  662) lines 37, 46-47, and 56-58. Lines 31-33, 40-42, 50-52, and 62-63 create
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  663) initializers for lockdep lock-class names. Finally, lines 64-66 produce
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  664) an error if the maximum number of CPUs is too large for the specified
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  665) fanout.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  666) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  667) The ``rcu_segcblist`` Structure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  668) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  669) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  670) The ``rcu_segcblist`` structure maintains a segmented list of callbacks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  671) as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  672) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  673) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  674) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  675)     1 #define RCU_DONE_TAIL        0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  676)     2 #define RCU_WAIT_TAIL        1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  677)     3 #define RCU_NEXT_READY_TAIL  2
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  678)     4 #define RCU_NEXT_TAIL        3
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  679)     5 #define RCU_CBLIST_NSEGS     4
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  680)     6
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  681)     7 struct rcu_segcblist {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  682)     8   struct rcu_head *head;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  683)     9   struct rcu_head **tails[RCU_CBLIST_NSEGS];
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  684)    10   unsigned long gp_seq[RCU_CBLIST_NSEGS];
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  685)    11   long len;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  686)    12   long len_lazy;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  687)    13 };
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  688) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  689) The segments are as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  690) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  691) #. ``RCU_DONE_TAIL``: Callbacks whose grace periods have elapsed. These
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  692)    callbacks are ready to be invoked.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  693) #. ``RCU_WAIT_TAIL``: Callbacks that are waiting for the current grace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  694)    period. Note that different CPUs can have different ideas about which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  695)    grace period is current, hence the ``->gp_seq`` field.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  696) #. ``RCU_NEXT_READY_TAIL``: Callbacks waiting for the next grace period
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  697)    to start.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  698) #. ``RCU_NEXT_TAIL``: Callbacks that have not yet been associated with a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  699)    grace period.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  700) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  701) The ``->head`` pointer references the first callback or is ``NULL`` if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  702) the list contains no callbacks (which is *not* the same as being empty).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  703) Each element of the ``->tails[]`` array references the ``->next``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  704) pointer of the last callback in the corresponding segment of the list,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  705) or the list's ``->head`` pointer if that segment and all previous
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  706) segments are empty. If the corresponding segment is empty but some
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  707) previous segment is not empty, then the array element is identical to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  708) its predecessor. Older callbacks are closer to the head of the list, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  709) new callbacks are added at the tail. This relationship between the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  710) ``->head`` pointer, the ``->tails[]`` array, and the callbacks is shown
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  711) in this diagram:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  712) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  713) .. kernel-figure:: nxtlist.svg
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  714) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  715) In this figure, the ``->head`` pointer references the first RCU callback
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  716) in the list. The ``->tails[RCU_DONE_TAIL]`` array element references the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  717) ``->head`` pointer itself, indicating that none of the callbacks is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  718) ready to invoke. The ``->tails[RCU_WAIT_TAIL]`` array element references
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  719) callback CB 2's ``->next`` pointer, which indicates that CB 1 and CB 2
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  720) are both waiting on the current grace period, give or take possible
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  721) disagreements about exactly which grace period is the current one. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  722) ``->tails[RCU_NEXT_READY_TAIL]`` array element references the same RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  723) callback that ``->tails[RCU_WAIT_TAIL]`` does, which indicates that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  724) there are no callbacks waiting on the next RCU grace period. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  725) ``->tails[RCU_NEXT_TAIL]`` array element references CB 4's ``->next``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  726) pointer, indicating that all the remaining RCU callbacks have not yet
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  727) been assigned to an RCU grace period. Note that the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  728) ``->tails[RCU_NEXT_TAIL]`` array element always references the last RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  729) callback's ``->next`` pointer unless the callback list is empty, in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  730) which case it references the ``->head`` pointer.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  731) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  732) There is one additional important special case for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  733) ``->tails[RCU_NEXT_TAIL]`` array element: It can be ``NULL`` when this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  734) list is *disabled*. Lists are disabled when the corresponding CPU is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  735) offline or when the corresponding CPU's callbacks are offloaded to a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  736) kthread, both of which are described elsewhere.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  737) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  738) CPUs advance their callbacks from the ``RCU_NEXT_TAIL`` to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  739) ``RCU_NEXT_READY_TAIL`` to the ``RCU_WAIT_TAIL`` to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  740) ``RCU_DONE_TAIL`` list segments as grace periods advance.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  741) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  742) The ``->gp_seq[]`` array records grace-period numbers corresponding to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  743) the list segments. This is what allows different CPUs to have different
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  744) ideas as to which is the current grace period while still avoiding
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  745) premature invocation of their callbacks. In particular, this allows CPUs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  746) that go idle for extended periods to determine which of their callbacks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  747) are ready to be invoked after reawakening.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  748) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  749) The ``->len`` counter contains the number of callbacks in ``->head``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  750) and the ``->len_lazy`` contains the number of those callbacks that are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  751) known to only free memory, and whose invocation can therefore be safely
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  752) deferred.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  753) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  754) .. important::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  755) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  756)    It is the ``->len`` field that determines whether or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  757)    not there are callbacks associated with this ``rcu_segcblist``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  758)    structure, *not* the ``->head`` pointer. The reason for this is that all
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  759)    the ready-to-invoke callbacks (that is, those in the ``RCU_DONE_TAIL``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  760)    segment) are extracted all at once at callback-invocation time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  761)    (``rcu_do_batch``), due to which ``->head`` may be set to NULL if there
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  762)    are no not-done callbacks remaining in the ``rcu_segcblist``. If
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  763)    callback invocation must be postponed, for example, because a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  764)    high-priority process just woke up on this CPU, then the remaining
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  765)    callbacks are placed back on the ``RCU_DONE_TAIL`` segment and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  766)    ``->head`` once again points to the start of the segment. In short, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  767)    head field can briefly be ``NULL`` even though the CPU has callbacks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  768)    present the entire time. Therefore, it is not appropriate to test the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  769)    ``->head`` pointer for ``NULL``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  770) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  771) In contrast, the ``->len`` and ``->len_lazy`` counts are adjusted only
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  772) after the corresponding callbacks have been invoked. This means that the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  773) ``->len`` count is zero only if the ``rcu_segcblist`` structure really
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  774) is devoid of callbacks. Of course, off-CPU sampling of the ``->len``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  775) count requires careful use of appropriate synchronization, for example,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  776) memory barriers. This synchronization can be a bit subtle, particularly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  777) in the case of ``rcu_barrier()``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  778) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  779) The ``rcu_data`` Structure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  780) ~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  781) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  782) The ``rcu_data`` maintains the per-CPU state for the RCU subsystem. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  783) fields in this structure may be accessed only from the corresponding CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  784) (and from tracing) unless otherwise stated. This structure is the focus
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  785) of quiescent-state detection and RCU callback queuing. It also tracks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  786) its relationship to the corresponding leaf ``rcu_node`` structure to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  787) allow more-efficient propagation of quiescent states up the ``rcu_node``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  788) combining tree. Like the ``rcu_node`` structure, it provides a local
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  789) copy of the grace-period information to allow for-free synchronized
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  790) access to this information from the corresponding CPU. Finally, this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  791) structure records past dyntick-idle state for the corresponding CPU and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  792) also tracks statistics.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  793) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  794) The ``rcu_data`` structure's fields are discussed, singly and in groups,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  795) in the following sections.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  796) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  797) Connection to Other Data Structures
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  798) '''''''''''''''''''''''''''''''''''
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  799) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  800) This portion of the ``rcu_data`` structure is declared as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  801) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  802) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  803) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  804)      1   int cpu;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  805)      2   struct rcu_node *mynode;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  806)      3   unsigned long grpmask;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  807)      4   bool beenonline;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  808) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  809) The ``->cpu`` field contains the number of the corresponding CPU and the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  810) ``->mynode`` field references the corresponding ``rcu_node`` structure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  811) The ``->mynode`` is used to propagate quiescent states up the combining
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  812) tree. These two fields are constant and therefore do not require
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  813) synchronization.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  814) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  815) The ``->grpmask`` field indicates the bit in the ``->mynode->qsmask``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  816) corresponding to this ``rcu_data`` structure, and is also used when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  817) propagating quiescent states. The ``->beenonline`` flag is set whenever
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  818) the corresponding CPU comes online, which means that the debugfs tracing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  819) need not dump out any ``rcu_data`` structure for which this flag is not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  820) set.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  821) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  822) Quiescent-State and Grace-Period Tracking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  823) '''''''''''''''''''''''''''''''''''''''''
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  824) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  825) This portion of the ``rcu_data`` structure is declared as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  826) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  827) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  828) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  829)      1   unsigned long gp_seq;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  830)      2   unsigned long gp_seq_needed;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  831)      3   bool cpu_no_qs;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  832)      4   bool core_needs_qs;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  833)      5   bool gpwrap;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  834) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  835) The ``->gp_seq`` field is the counterpart of the field of the same name
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  836) in the ``rcu_state`` and ``rcu_node`` structures. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  837) ``->gp_seq_needed`` field is the counterpart of the field of the same
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  838) name in the rcu_node structure. They may each lag up to one behind their
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  839) ``rcu_node`` counterparts, but in ``CONFIG_NO_HZ_IDLE`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  840) ``CONFIG_NO_HZ_FULL`` kernels can lag arbitrarily far behind for CPUs in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  841) dyntick-idle mode (but these counters will catch up upon exit from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  842) dyntick-idle mode). If the lower two bits of a given ``rcu_data``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  843) structure's ``->gp_seq`` are zero, then this ``rcu_data`` structure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  844) believes that RCU is idle.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  845) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  846) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  847) | **Quick Quiz**:                                                       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  848) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  849) | All this replication of the grace period numbers can only cause       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  850) | massive confusion. Why not just keep a global sequence number and be  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  851) | done with it???                                                       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  852) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  853) | **Answer**:                                                           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  854) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  855) | Because if there was only a single global sequence numbers, there     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  856) | would need to be a single global lock to allow safely accessing and   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  857) | updating it. And if we are not going to have a single global lock, we |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  858) | need to carefully manage the numbers on a per-node basis. Recall from |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  859) | the answer to a previous Quick Quiz that the consequences of applying |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  860) | a previously sampled quiescent state to the wrong grace period are    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  861) | quite severe.                                                         |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  862) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  863) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  864) The ``->cpu_no_qs`` flag indicates that the CPU has not yet passed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  865) through a quiescent state, while the ``->core_needs_qs`` flag indicates
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  866) that the RCU core needs a quiescent state from the corresponding CPU.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  867) The ``->gpwrap`` field indicates that the corresponding CPU has remained
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  868) idle for so long that the ``gp_seq`` counter is in danger of overflow,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  869) which will cause the CPU to disregard the values of its counters on its
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  870) next exit from idle.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  871) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  872) RCU Callback Handling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  873) '''''''''''''''''''''
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  874) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  875) In the absence of CPU-hotplug events, RCU callbacks are invoked by the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  876) same CPU that registered them. This is strictly a cache-locality
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  877) optimization: callbacks can and do get invoked on CPUs other than the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  878) one that registered them. After all, if the CPU that registered a given
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  879) callback has gone offline before the callback can be invoked, there
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  880) really is no other choice.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  881) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  882) This portion of the ``rcu_data`` structure is declared as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  883) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  884) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  885) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  886)     1 struct rcu_segcblist cblist;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  887)     2 long qlen_last_fqs_check;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  888)     3 unsigned long n_cbs_invoked;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  889)     4 unsigned long n_nocbs_invoked;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  890)     5 unsigned long n_cbs_orphaned;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  891)     6 unsigned long n_cbs_adopted;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  892)     7 unsigned long n_force_qs_snap;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  893)     8 long blimit;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  894) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  895) The ``->cblist`` structure is the segmented callback list described
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  896) earlier. The CPU advances the callbacks in its ``rcu_data`` structure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  897) whenever it notices that another RCU grace period has completed. The CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  898) detects the completion of an RCU grace period by noticing that the value
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  899) of its ``rcu_data`` structure's ``->gp_seq`` field differs from that of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  900) its leaf ``rcu_node`` structure. Recall that each ``rcu_node``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  901) structure's ``->gp_seq`` field is updated at the beginnings and ends of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  902) each grace period.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  903) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  904) The ``->qlen_last_fqs_check`` and ``->n_force_qs_snap`` coordinate the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  905) forcing of quiescent states from ``call_rcu()`` and friends when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  906) callback lists grow excessively long.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  907) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  908) The ``->n_cbs_invoked``, ``->n_cbs_orphaned``, and ``->n_cbs_adopted``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  909) fields count the number of callbacks invoked, sent to other CPUs when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  910) this CPU goes offline, and received from other CPUs when those other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  911) CPUs go offline. The ``->n_nocbs_invoked`` is used when the CPU's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  912) callbacks are offloaded to a kthread.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  913) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  914) Finally, the ``->blimit`` counter is the maximum number of RCU callbacks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  915) that may be invoked at a given time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  916) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  917) Dyntick-Idle Handling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  918) '''''''''''''''''''''
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  919) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  920) This portion of the ``rcu_data`` structure is declared as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  921) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  922) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  923) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  924)      1   int dynticks_snap;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  925)      2   unsigned long dynticks_fqs;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  926) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  927) The ``->dynticks_snap`` field is used to take a snapshot of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  928) corresponding CPU's dyntick-idle state when forcing quiescent states,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  929) and is therefore accessed from other CPUs. Finally, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  930) ``->dynticks_fqs`` field is used to count the number of times this CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  931) is determined to be in dyntick-idle state, and is used for tracing and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  932) debugging purposes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  933) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  934) This portion of the rcu_data structure is declared as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  935) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  936) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  937) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  938)      1   long dynticks_nesting;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  939)      2   long dynticks_nmi_nesting;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  940)      3   atomic_t dynticks;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  941)      4   bool rcu_need_heavy_qs;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  942)      5   bool rcu_urgent_qs;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  943) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  944) These fields in the rcu_data structure maintain the per-CPU dyntick-idle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  945) state for the corresponding CPU. The fields may be accessed only from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  946) the corresponding CPU (and from tracing) unless otherwise stated.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  947) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  948) The ``->dynticks_nesting`` field counts the nesting depth of process
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  949) execution, so that in normal circumstances this counter has value zero
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  950) or one. NMIs, irqs, and tracers are counted by the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  951) ``->dynticks_nmi_nesting`` field. Because NMIs cannot be masked, changes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  952) to this variable have to be undertaken carefully using an algorithm
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  953) provided by Andy Lutomirski. The initial transition from idle adds one,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  954) and nested transitions add two, so that a nesting level of five is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  955) represented by a ``->dynticks_nmi_nesting`` value of nine. This counter
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  956) can therefore be thought of as counting the number of reasons why this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  957) CPU cannot be permitted to enter dyntick-idle mode, aside from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  958) process-level transitions.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  959) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  960) However, it turns out that when running in non-idle kernel context, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  961) Linux kernel is fully capable of entering interrupt handlers that never
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  962) exit and perhaps also vice versa. Therefore, whenever the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  963) ``->dynticks_nesting`` field is incremented up from zero, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  964) ``->dynticks_nmi_nesting`` field is set to a large positive number, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  965) whenever the ``->dynticks_nesting`` field is decremented down to zero,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  966) the ``->dynticks_nmi_nesting`` field is set to zero. Assuming that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  967) the number of misnested interrupts is not sufficient to overflow the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  968) counter, this approach corrects the ``->dynticks_nmi_nesting`` field
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  969) every time the corresponding CPU enters the idle loop from process
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  970) context.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  971) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  972) The ``->dynticks`` field counts the corresponding CPU's transitions to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  973) and from either dyntick-idle or user mode, so that this counter has an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  974) even value when the CPU is in dyntick-idle mode or user mode and an odd
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  975) value otherwise. The transitions to/from user mode need to be counted
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  976) for user mode adaptive-ticks support (see timers/NO_HZ.txt).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  977) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  978) The ``->rcu_need_heavy_qs`` field is used to record the fact that the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  979) RCU core code would really like to see a quiescent state from the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  980) corresponding CPU, so much so that it is willing to call for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  981) heavy-weight dyntick-counter operations. This flag is checked by RCU's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  982) context-switch and ``cond_resched()`` code, which provide a momentary
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  983) idle sojourn in response.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  984) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  985) Finally, the ``->rcu_urgent_qs`` field is used to record the fact that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  986) the RCU core code would really like to see a quiescent state from the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  987) corresponding CPU, with the various other fields indicating just how
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  988) badly RCU wants this quiescent state. This flag is checked by RCU's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  989) context-switch path (``rcu_note_context_switch``) and the cond_resched
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  990) code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  991) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  992) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  993) | **Quick Quiz**:                                                       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  994) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  995) | Why not simply combine the ``->dynticks_nesting`` and                 |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  996) | ``->dynticks_nmi_nesting`` counters into a single counter that just   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  997) | counts the number of reasons that the corresponding CPU is non-idle?  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  998) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  999) | **Answer**:                                                           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1000) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1001) | Because this would fail in the presence of interrupts whose handlers  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1002) | never return and of handlers that manage to return from a made-up     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1003) | interrupt.                                                            |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1004) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1005) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1006) Additional fields are present for some special-purpose builds, and are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1007) discussed separately.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1008) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1009) The ``rcu_head`` Structure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1010) ~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1011) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1012) Each ``rcu_head`` structure represents an RCU callback. These structures
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1013) are normally embedded within RCU-protected data structures whose
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1014) algorithms use asynchronous grace periods. In contrast, when using
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1015) algorithms that block waiting for RCU grace periods, RCU users need not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1016) provide ``rcu_head`` structures.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1017) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1018) The ``rcu_head`` structure has fields as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1019) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1020) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1021) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1022)      1   struct rcu_head *next;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1023)      2   void (*func)(struct rcu_head *head);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1024) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1025) The ``->next`` field is used to link the ``rcu_head`` structures
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1026) together in the lists within the ``rcu_data`` structures. The ``->func``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1027) field is a pointer to the function to be called when the callback is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1028) ready to be invoked, and this function is passed a pointer to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1029) ``rcu_head`` structure. However, ``kfree_rcu()`` uses the ``->func``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1030) field to record the offset of the ``rcu_head`` structure within the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1031) enclosing RCU-protected data structure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1032) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1033) Both of these fields are used internally by RCU. From the viewpoint of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1034) RCU users, this structure is an opaque “cookie”.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1035) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1036) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1037) | **Quick Quiz**:                                                       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1038) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1039) | Given that the callback function ``->func`` is passed a pointer to    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1040) | the ``rcu_head`` structure, how is that function supposed to find the |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1041) | beginning of the enclosing RCU-protected data structure?              |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1042) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1043) | **Answer**:                                                           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1044) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1045) | In actual practice, there is a separate callback function per type of |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1046) | RCU-protected data structure. The callback function can therefore use |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1047) | the ``container_of()`` macro in the Linux kernel (or other            |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1048) | pointer-manipulation facilities in other software environments) to    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1049) | find the beginning of the enclosing structure.                        |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1050) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1051) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1052) RCU-Specific Fields in the ``task_struct`` Structure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1053) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1054) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1055) The ``CONFIG_PREEMPT_RCU`` implementation uses some additional fields in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1056) the ``task_struct`` structure:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1057) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1058) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1059) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1060)     1 #ifdef CONFIG_PREEMPT_RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1061)     2   int rcu_read_lock_nesting;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1062)     3   union rcu_special rcu_read_unlock_special;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1063)     4   struct list_head rcu_node_entry;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1064)     5   struct rcu_node *rcu_blocked_node;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1065)     6 #endif /* #ifdef CONFIG_PREEMPT_RCU */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1066)     7 #ifdef CONFIG_TASKS_RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1067)     8   unsigned long rcu_tasks_nvcsw;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1068)     9   bool rcu_tasks_holdout;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1069)    10   struct list_head rcu_tasks_holdout_list;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1070)    11   int rcu_tasks_idle_cpu;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1071)    12 #endif /* #ifdef CONFIG_TASKS_RCU */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1072) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1073) The ``->rcu_read_lock_nesting`` field records the nesting level for RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1074) read-side critical sections, and the ``->rcu_read_unlock_special`` field
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1075) is a bitmask that records special conditions that require
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1076) ``rcu_read_unlock()`` to do additional work. The ``->rcu_node_entry``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1077) field is used to form lists of tasks that have blocked within
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1078) preemptible-RCU read-side critical sections and the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1079) ``->rcu_blocked_node`` field references the ``rcu_node`` structure whose
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1080) list this task is a member of, or ``NULL`` if it is not blocked within a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1081) preemptible-RCU read-side critical section.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1082) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1083) The ``->rcu_tasks_nvcsw`` field tracks the number of voluntary context
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1084) switches that this task had undergone at the beginning of the current
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1085) tasks-RCU grace period, ``->rcu_tasks_holdout`` is set if the current
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1086) tasks-RCU grace period is waiting on this task,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1087) ``->rcu_tasks_holdout_list`` is a list element enqueuing this task on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1088) the holdout list, and ``->rcu_tasks_idle_cpu`` tracks which CPU this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1089) idle task is running, but only if the task is currently running, that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1090) is, if the CPU is currently idle.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1091) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1092) Accessor Functions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1093) ~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1094) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1095) The following listing shows the ``rcu_get_root()``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1096) ``rcu_for_each_node_breadth_first`` and ``rcu_for_each_leaf_node()``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1097) function and macros:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1098) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1099) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1100) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1101)      1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1102)      2 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1103)      3   return &rsp->node[0];
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1104)      4 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1105)      5
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1106)      6 #define rcu_for_each_node_breadth_first(rsp, rnp) \
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1107)      7   for ((rnp) = &(rsp)->node[0]; \
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1108)      8        (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1109)      9
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1110)     10 #define rcu_for_each_leaf_node(rsp, rnp) \
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1111)     11   for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1112)     12        (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1113) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1114) The ``rcu_get_root()`` simply returns a pointer to the first element of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1115) the specified ``rcu_state`` structure's ``->node[]`` array, which is the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1116) root ``rcu_node`` structure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1117) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1118) As noted earlier, the ``rcu_for_each_node_breadth_first()`` macro takes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1119) advantage of the layout of the ``rcu_node`` structures in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1120) ``rcu_state`` structure's ``->node[]`` array, performing a breadth-first
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1121) traversal by simply traversing the array in order. Similarly, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1122) ``rcu_for_each_leaf_node()`` macro traverses only the last part of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1123) array, thus traversing only the leaf ``rcu_node`` structures.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1124) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1125) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1126) | **Quick Quiz**:                                                       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1127) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1128) | What does ``rcu_for_each_leaf_node()`` do if the ``rcu_node`` tree    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1129) | contains only a single node?                                          |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1130) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1131) | **Answer**:                                                           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1132) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1133) | In the single-node case, ``rcu_for_each_leaf_node()`` traverses the   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1134) | single node.                                                          |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1135) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1136) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1137) Summary
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1138) ~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1139) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1140) So the state of RCU is represented by an ``rcu_state`` structure, which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1141) contains a combining tree of ``rcu_node`` and ``rcu_data`` structures.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1142) Finally, in ``CONFIG_NO_HZ_IDLE`` kernels, each CPU's dyntick-idle state
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1143) is tracked by dynticks-related fields in the ``rcu_data`` structure. If
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1144) you made it this far, you are well prepared to read the code
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1145) walkthroughs in the other articles in this series.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1146) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1147) Acknowledgments
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1148) ~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1149) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1150) I owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1151) Turner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1152) helping me get this document into a more human-readable state.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1153) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1154) Legal Statement
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1155) ~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1156) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1157) This work represents the view of the author and does not necessarily
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1158) represent the view of IBM.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1159) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1160) Linux is a registered trademark of Linus Torvalds.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1161) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1162) Other company, product, and service names may be trademarks or service
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1163) marks of others.