^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) ===================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2) A Tour Through TREE_RCU's Data Structures [LWN.net]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) ===================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) December 18, 2016
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) This article was contributed by Paul E. McKenney
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9) Introduction
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10) ============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12) This document describes RCU's major data structures and their relationship
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) to each other.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15) Data-Structure Relationships
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16) ============================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18) RCU is for all intents and purposes a large state machine, and its
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) data structures maintain the state in such a way as to allow RCU readers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20) to execute extremely quickly, while also processing the RCU grace periods
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21) requested by updaters in an efficient and extremely scalable fashion.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22) The efficiency and scalability of RCU updaters is provided primarily
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) by a combining tree, as shown below:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25) .. kernel-figure:: BigTreeClassicRCU.svg
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27) This diagram shows an enclosing ``rcu_state`` structure containing a tree
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28) of ``rcu_node`` structures. Each leaf node of the ``rcu_node`` tree has up
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29) to 16 ``rcu_data`` structures associated with it, so that there are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30) ``NR_CPUS`` number of ``rcu_data`` structures, one for each possible CPU.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31) This structure is adjusted at boot time, if needed, to handle the common
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32) case where ``nr_cpu_ids`` is much less than ``NR_CPUs``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) For example, a number of Linux distributions set ``NR_CPUs=4096``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34) which results in a three-level ``rcu_node`` tree.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35) If the actual hardware has only 16 CPUs, RCU will adjust itself
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) at boot time, resulting in an ``rcu_node`` tree with only a single node.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) The purpose of this combining tree is to allow per-CPU events
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39) such as quiescent states, dyntick-idle transitions,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40) and CPU hotplug operations to be processed efficiently
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41) and scalably.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42) Quiescent states are recorded by the per-CPU ``rcu_data`` structures,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43) and other events are recorded by the leaf-level ``rcu_node``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) structures.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45) All of these events are combined at each level of the tree until finally
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) grace periods are completed at the tree's root ``rcu_node``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47) structure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48) A grace period can be completed at the root once every CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49) (or, in the case of ``CONFIG_PREEMPT_RCU``, task)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) has passed through a quiescent state.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51) Once a grace period has completed, record of that fact is propagated
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52) back down the tree.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) As can be seen from the diagram, on a 64-bit system
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56) of 64 at the root and a fanout of 16 at the leaves.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59) | **Quick Quiz**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61) | Why isn't the fanout at the leaves also 64? |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63) | **Answer**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65) | Because there are more types of events that affect the leaf-level |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66) | ``rcu_node`` structures than further up the tree. Therefore, if the |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67) | leaf ``rcu_node`` structures have fanout of 64, the contention on |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68) | these structures' ``->structures`` becomes excessive. Experimentation |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69) | on a wide variety of systems has shown that a fanout of 16 works well |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70) | for the leaves of the ``rcu_node`` tree. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71) | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72) | Of course, further experience with systems having hundreds or |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73) | thousands of CPUs may demonstrate that the fanout for the non-leaf |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74) | ``rcu_node`` structures must also be reduced. Such reduction can be |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75) | easily carried out when and if it proves necessary. In the meantime, |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76) | if you are using such a system and running into contention problems |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77) | on the non-leaf ``rcu_node`` structures, you may use the |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78) | ``CONFIG_RCU_FANOUT`` kernel configuration parameter to reduce the |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79) | non-leaf fanout as needed. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80) | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81) | Kernels built for systems with strong NUMA characteristics might |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82) | also need to adjust ``CONFIG_RCU_FANOUT`` so that the domains of |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83) | the ``rcu_node`` structures align with hardware boundaries. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84) | However, there has thus far been no need for this. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87) If your system has more than 1,024 CPUs (or more than 512 CPUs on a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88) 32-bit system), then RCU will automatically add more levels to the tree.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89) For example, if you are crazy enough to build a 64-bit system with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90) 65,536 CPUs, RCU would configure the ``rcu_node`` tree as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92) .. kernel-figure:: HugeTreeClassicRCU.svg
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94) RCU currently permits up to a four-level tree, which on a 64-bit system
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95) accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96) 32-bit systems. On the other hand, you can set both
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97) ``CONFIG_RCU_FANOUT`` and ``CONFIG_RCU_FANOUT_LEAF`` to be as small as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98) 2, which would result in a 16-CPU test using a 4-level tree. This can be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99) useful for testing large-system capabilities on small test machines.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) This multi-level combining tree allows us to get most of the performance
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) and scalability benefits of partitioning, even though RCU grace-period
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) detection is inherently a global operation. The trick here is that only
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) the last CPU to report a quiescent state into a given ``rcu_node``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) structure need advance to the ``rcu_node`` structure at the next level
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) up the tree. This means that at the leaf-level ``rcu_node`` structure,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) only one access out of sixteen will progress up the tree. For the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) internal ``rcu_node`` structures, the situation is even more extreme:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) Only one access out of sixty-four will progress up the tree. Because the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) vast majority of the CPUs do not progress up the tree, the lock
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) contention remains roughly constant up the tree. No matter how many CPUs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) there are in the system, at most 64 quiescent-state reports per grace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) period will progress all the way to the root ``rcu_node`` structure,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) thus ensuring that the lock contention on that root ``rcu_node``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) structure remains acceptably low.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) In effect, the combining tree acts like a big shock absorber, keeping
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) lock contention under control at all tree levels regardless of the level
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) of loading on the system.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) RCU updaters wait for normal grace periods by registering RCU callbacks,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) either directly via ``call_rcu()`` or indirectly via
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) ``synchronize_rcu()`` and friends. RCU callbacks are represented by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) ``rcu_head`` structures, which are queued on ``rcu_data`` structures
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) while they are waiting for a grace period to elapse, as shown in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) following figure:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) .. kernel-figure:: BigTreePreemptRCUBHdyntickCB.svg
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) This figure shows how ``TREE_RCU``'s and ``PREEMPT_RCU``'s major data
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) structures are related. Lesser data structures will be introduced with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) the algorithms that make use of them.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) Note that each of the data structures in the above figure has its own
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) synchronization:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) #. Each ``rcu_state`` structures has a lock and a mutex, and some fields
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) are protected by the corresponding root ``rcu_node`` structure's lock.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) #. Each ``rcu_node`` structure has a spinlock.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) #. The fields in ``rcu_data`` are private to the corresponding CPU,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) although a few can be read and written by other CPUs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) It is important to note that different data structures can have very
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) different ideas about the state of RCU at any given time. For but one
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) example, awareness of the start or end of a given RCU grace period
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146) propagates slowly through the data structures. This slow propagation is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) absolutely necessary for RCU to have good read-side performance. If this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) balkanized implementation seems foreign to you, one useful trick is to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) consider each instance of these data structures to be a different
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150) person, each having the usual slightly different view of reality.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152) The general role of each of these data structures is as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) #. ``rcu_state``: This structure forms the interconnection between the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) ``rcu_node`` and ``rcu_data`` structures, tracks grace periods,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) serves as short-term repository for callbacks orphaned by CPU-hotplug
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) events, maintains ``rcu_barrier()`` state, tracks expedited
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158) grace-period state, and maintains state used to force quiescent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159) states when grace periods extend too long,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) #. ``rcu_node``: This structure forms the combining tree that propagates
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) quiescent-state information from the leaves to the root, and also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162) propagates grace-period information from the root to the leaves. It
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) provides local copies of the grace-period state in order to allow
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164) this information to be accessed in a synchronized manner without
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) suffering the scalability limitations that would otherwise be imposed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) by global locking. In ``CONFIG_PREEMPT_RCU`` kernels, it manages the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) lists of tasks that have blocked while in their current RCU read-side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) critical section. In ``CONFIG_PREEMPT_RCU`` with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) ``CONFIG_RCU_BOOST``, it manages the per-\ ``rcu_node``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) priority-boosting kernel threads (kthreads) and state. Finally, it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171) records CPU-hotplug state in order to determine which CPUs should be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) ignored during a given grace period.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) #. ``rcu_data``: This per-CPU structure is the focus of quiescent-state
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174) detection and RCU callback queuing. It also tracks its relationship
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) to the corresponding leaf ``rcu_node`` structure to allow
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) more-efficient propagation of quiescent states up the ``rcu_node``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177) combining tree. Like the ``rcu_node`` structure, it provides a local
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178) copy of the grace-period information to allow for-free synchronized
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179) access to this information from the corresponding CPU. Finally, this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180) structure records past dyntick-idle state for the corresponding CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181) and also tracks statistics.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182) #. ``rcu_head``: This structure represents RCU callbacks, and is the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183) only structure allocated and managed by RCU users. The ``rcu_head``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184) structure is normally embedded within the RCU-protected data
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185) structure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187) If all you wanted from this article was a general notion of how RCU's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188) data structures are related, you are done. Otherwise, each of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189) following sections give more details on the ``rcu_state``, ``rcu_node``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190) and ``rcu_data`` data structures.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192) The ``rcu_state`` Structure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193) ~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195) The ``rcu_state`` structure is the base structure that represents the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196) state of RCU in the system. This structure forms the interconnection
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197) between the ``rcu_node`` and ``rcu_data`` structures, tracks grace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198) periods, contains the lock used to synchronize with CPU-hotplug events,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199) and maintains state used to force quiescent states when grace periods
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200) extend too long,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202) A few of the ``rcu_state`` structure's fields are discussed, singly and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203) in groups, in the following sections. The more specialized fields are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204) covered in the discussion of their use.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 206) Relationship to rcu_node and rcu_data Structures
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 207) ''''''''''''''''''''''''''''''''''''''''''''''''
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 208)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 209) This portion of the ``rcu_state`` structure is declared as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 210)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 211) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 212)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 213) 1 struct rcu_node node[NUM_RCU_NODES];
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 214) 2 struct rcu_node *level[NUM_RCU_LVLS + 1];
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 215) 3 struct rcu_data __percpu *rda;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 216)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 217) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 218) | **Quick Quiz**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 219) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 220) | Wait a minute! You said that the ``rcu_node`` structures formed a |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 221) | tree, but they are declared as a flat array! What gives? |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 222) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 223) | **Answer**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 224) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 225) | The tree is laid out in the array. The first node In the array is the |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 226) | head, the next set of nodes in the array are children of the head |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 227) | node, and so on until the last set of nodes in the array are the |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 228) | leaves. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 229) | See the following diagrams to see how this works. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 230) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 231)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 232) The ``rcu_node`` tree is embedded into the ``->node[]`` array as shown
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 233) in the following figure:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 234)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 235) .. kernel-figure:: TreeMapping.svg
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 236)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 237) One interesting consequence of this mapping is that a breadth-first
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 238) traversal of the tree is implemented as a simple linear scan of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 239) array, which is in fact what the ``rcu_for_each_node_breadth_first()``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 240) macro does. This macro is used at the beginning and ends of grace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 241) periods.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 242)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 243) Each entry of the ``->level`` array references the first ``rcu_node``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 244) structure on the corresponding level of the tree, for example, as shown
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 245) below:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 246)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 247) .. kernel-figure:: TreeMappingLevel.svg
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 248)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 249) The zero\ :sup:`th` element of the array references the root
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 250) ``rcu_node`` structure, the first element references the first child of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 251) the root ``rcu_node``, and finally the second element references the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 252) first leaf ``rcu_node`` structure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 253)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 254) For whatever it is worth, if you draw the tree to be tree-shaped rather
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 255) than array-shaped, it is easy to draw a planar representation:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 256)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 257) .. kernel-figure:: TreeLevel.svg
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 258)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 259) Finally, the ``->rda`` field references a per-CPU pointer to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 260) corresponding CPU's ``rcu_data`` structure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 261)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 262) All of these fields are constant once initialization is complete, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 263) therefore need no protection.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 264)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 265) Grace-Period Tracking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 266) '''''''''''''''''''''
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 267)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 268) This portion of the ``rcu_state`` structure is declared as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 269)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 270) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 271)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 272) 1 unsigned long gp_seq;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 273)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 274) RCU grace periods are numbered, and the ``->gp_seq`` field contains the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 275) current grace-period sequence number. The bottom two bits are the state
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 276) of the current grace period, which can be zero for not yet started or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 277) one for in progress. In other words, if the bottom two bits of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 278) ``->gp_seq`` are zero, then RCU is idle. Any other value in the bottom
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 279) two bits indicates that something is broken. This field is protected by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 280) the root ``rcu_node`` structure's ``->lock`` field.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 281)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 282) There are ``->gp_seq`` fields in the ``rcu_node`` and ``rcu_data``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 283) structures as well. The fields in the ``rcu_state`` structure represent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 284) the most current value, and those of the other structures are compared
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 285) in order to detect the beginnings and ends of grace periods in a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 286) distributed fashion. The values flow from ``rcu_state`` to ``rcu_node``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 287) (down the tree from the root to the leaves) to ``rcu_data``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 288)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 289) Miscellaneous
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 290) '''''''''''''
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 291)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 292) This portion of the ``rcu_state`` structure is declared as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 293)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 294) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 295)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 296) 1 unsigned long gp_max;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 297) 2 char abbr;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 298) 3 char *name;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 299)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 300) The ``->gp_max`` field tracks the duration of the longest grace period
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 301) in jiffies. It is protected by the root ``rcu_node``'s ``->lock``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 302)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 303) The ``->name`` and ``->abbr`` fields distinguish between preemptible RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 304) (“rcu_preempt” and “p”) and non-preemptible RCU (“rcu_sched” and “s”).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 305) These fields are used for diagnostic and tracing purposes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 306)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 307) The ``rcu_node`` Structure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 308) ~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 309)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 310) The ``rcu_node`` structures form the combining tree that propagates
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 311) quiescent-state information from the leaves to the root and also that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 312) propagates grace-period information from the root down to the leaves.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 313) They provides local copies of the grace-period state in order to allow
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 314) this information to be accessed in a synchronized manner without
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 315) suffering the scalability limitations that would otherwise be imposed by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 316) global locking. In ``CONFIG_PREEMPT_RCU`` kernels, they manage the lists
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 317) of tasks that have blocked while in their current RCU read-side critical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 318) section. In ``CONFIG_PREEMPT_RCU`` with ``CONFIG_RCU_BOOST``, they
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 319) manage the per-\ ``rcu_node`` priority-boosting kernel threads
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 320) (kthreads) and state. Finally, they record CPU-hotplug state in order to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 321) determine which CPUs should be ignored during a given grace period.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 322)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 323) The ``rcu_node`` structure's fields are discussed, singly and in groups,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 324) in the following sections.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 325)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 326) Connection to Combining Tree
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 327) ''''''''''''''''''''''''''''
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 328)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 329) This portion of the ``rcu_node`` structure is declared as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 330)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 331) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 332)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 333) 1 struct rcu_node *parent;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 334) 2 u8 level;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 335) 3 u8 grpnum;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 336) 4 unsigned long grpmask;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 337) 5 int grplo;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 338) 6 int grphi;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 339)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 340) The ``->parent`` pointer references the ``rcu_node`` one level up in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 341) tree, and is ``NULL`` for the root ``rcu_node``. The RCU implementation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 342) makes heavy use of this field to push quiescent states up the tree. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 343) ``->level`` field gives the level in the tree, with the root being at
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 344) level zero, its children at level one, and so on. The ``->grpnum`` field
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 345) gives this node's position within the children of its parent, so this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 346) number can range between 0 and 31 on 32-bit systems and between 0 and 63
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 347) on 64-bit systems. The ``->level`` and ``->grpnum`` fields are used only
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 348) during initialization and for tracing. The ``->grpmask`` field is the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 349) bitmask counterpart of ``->grpnum``, and therefore always has exactly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 350) one bit set. This mask is used to clear the bit corresponding to this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 351) ``rcu_node`` structure in its parent's bitmasks, which are described
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 352) later. Finally, the ``->grplo`` and ``->grphi`` fields contain the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 353) lowest and highest numbered CPU served by this ``rcu_node`` structure,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 354) respectively.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 355)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 356) All of these fields are constant, and thus do not require any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 357) synchronization.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 358)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 359) Synchronization
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 360) '''''''''''''''
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 361)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 362) This field of the ``rcu_node`` structure is declared as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 363)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 364) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 365)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 366) 1 raw_spinlock_t lock;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 367)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 368) This field is used to protect the remaining fields in this structure,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 369) unless otherwise stated. That said, all of the fields in this structure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 370) can be accessed without locking for tracing purposes. Yes, this can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 371) result in confusing traces, but better some tracing confusion than to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 372) heisenbugged out of existence.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 373)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 374) .. _grace-period-tracking-1:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 375)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 376) Grace-Period Tracking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 377) '''''''''''''''''''''
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 378)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 379) This portion of the ``rcu_node`` structure is declared as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 380)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 381) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 382)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 383) 1 unsigned long gp_seq;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 384) 2 unsigned long gp_seq_needed;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 385)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 386) The ``rcu_node`` structures' ``->gp_seq`` fields are the counterparts of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 387) the field of the same name in the ``rcu_state`` structure. They each may
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 388) lag up to one step behind their ``rcu_state`` counterpart. If the bottom
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 389) two bits of a given ``rcu_node`` structure's ``->gp_seq`` field is zero,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 390) then this ``rcu_node`` structure believes that RCU is idle.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 391)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 392) The ``>gp_seq`` field of each ``rcu_node`` structure is updated at the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 393) beginning and the end of each grace period.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 394)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 395) The ``->gp_seq_needed`` fields record the furthest-in-the-future grace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 396) period request seen by the corresponding ``rcu_node`` structure. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 397) request is considered fulfilled when the value of the ``->gp_seq`` field
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 398) equals or exceeds that of the ``->gp_seq_needed`` field.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 399)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 400) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 401) | **Quick Quiz**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 402) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 403) | Suppose that this ``rcu_node`` structure doesn't see a request for a |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 404) | very long time. Won't wrapping of the ``->gp_seq`` field cause |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 405) | problems? |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 406) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 407) | **Answer**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 408) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 409) | No, because if the ``->gp_seq_needed`` field lags behind the |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 410) | ``->gp_seq`` field, the ``->gp_seq_needed`` field will be updated at |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 411) | the end of the grace period. Modulo-arithmetic comparisons therefore |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 412) | will always get the correct answer, even with wrapping. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 413) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 414)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 415) Quiescent-State Tracking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 416) ''''''''''''''''''''''''
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 417)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 418) These fields manage the propagation of quiescent states up the combining
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 419) tree.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 420)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 421) This portion of the ``rcu_node`` structure has fields as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 422)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 423) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 424)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 425) 1 unsigned long qsmask;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 426) 2 unsigned long expmask;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 427) 3 unsigned long qsmaskinit;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 428) 4 unsigned long expmaskinit;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 429)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 430) The ``->qsmask`` field tracks which of this ``rcu_node`` structure's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 431) children still need to report quiescent states for the current normal
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 432) grace period. Such children will have a value of 1 in their
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 433) corresponding bit. Note that the leaf ``rcu_node`` structures should be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 434) thought of as having ``rcu_data`` structures as their children.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 435) Similarly, the ``->expmask`` field tracks which of this ``rcu_node``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 436) structure's children still need to report quiescent states for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 437) current expedited grace period. An expedited grace period has the same
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 438) conceptual properties as a normal grace period, but the expedited
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 439) implementation accepts extreme CPU overhead to obtain much lower
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 440) grace-period latency, for example, consuming a few tens of microseconds
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 441) worth of CPU time to reduce grace-period duration from milliseconds to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 442) tens of microseconds. The ``->qsmaskinit`` field tracks which of this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 443) ``rcu_node`` structure's children cover for at least one online CPU.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 444) This mask is used to initialize ``->qsmask``, and ``->expmaskinit`` is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 445) used to initialize ``->expmask`` and the beginning of the normal and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 446) expedited grace periods, respectively.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 447)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 448) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 449) | **Quick Quiz**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 450) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 451) | Why are these bitmasks protected by locking? Come on, haven't you |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 452) | heard of atomic instructions??? |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 453) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 454) | **Answer**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 455) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 456) | Lockless grace-period computation! Such a tantalizing possibility! |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 457) | But consider the following sequence of events: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 458) | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 459) | #. CPU 0 has been in dyntick-idle mode for quite some time. When it |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 460) | wakes up, it notices that the current RCU grace period needs it to |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 461) | report in, so it sets a flag where the scheduling clock interrupt |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 462) | will find it. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 463) | #. Meanwhile, CPU 1 is running ``force_quiescent_state()``, and |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 464) | notices that CPU 0 has been in dyntick idle mode, which qualifies |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 465) | as an extended quiescent state. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 466) | #. CPU 0's scheduling clock interrupt fires in the middle of an RCU |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 467) | read-side critical section, and notices that the RCU core needs |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 468) | something, so commences RCU softirq processing. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 469) | #. CPU 0's softirq handler executes and is just about ready to report |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 470) | its quiescent state up the ``rcu_node`` tree. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 471) | #. But CPU 1 beats it to the punch, completing the current grace |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 472) | period and starting a new one. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 473) | #. CPU 0 now reports its quiescent state for the wrong grace period. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 474) | That grace period might now end before the RCU read-side critical |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 475) | section. If that happens, disaster will ensue. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 476) | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 477) | So the locking is absolutely required in order to coordinate clearing |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 478) | of the bits with updating of the grace-period sequence number in |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 479) | ``->gp_seq``. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 480) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 481)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 482) Blocked-Task Management
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 483) '''''''''''''''''''''''
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 484)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 485) ``PREEMPT_RCU`` allows tasks to be preempted in the midst of their RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 486) read-side critical sections, and these tasks must be tracked explicitly.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 487) The details of exactly why and how they are tracked will be covered in a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 488) separate article on RCU read-side processing. For now, it is enough to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 489) know that the ``rcu_node`` structure tracks them.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 490)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 491) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 492)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 493) 1 struct list_head blkd_tasks;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 494) 2 struct list_head *gp_tasks;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 495) 3 struct list_head *exp_tasks;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 496) 4 bool wait_blkd_tasks;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 497)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 498) The ``->blkd_tasks`` field is a list header for the list of blocked and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 499) preempted tasks. As tasks undergo context switches within RCU read-side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 500) critical sections, their ``task_struct`` structures are enqueued (via
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 501) the ``task_struct``'s ``->rcu_node_entry`` field) onto the head of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 502) ``->blkd_tasks`` list for the leaf ``rcu_node`` structure corresponding
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 503) to the CPU on which the outgoing context switch executed. As these tasks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 504) later exit their RCU read-side critical sections, they remove themselves
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 505) from the list. This list is therefore in reverse time order, so that if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 506) one of the tasks is blocking the current grace period, all subsequent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 507) tasks must also be blocking that same grace period. Therefore, a single
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 508) pointer into this list suffices to track all tasks blocking a given
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 509) grace period. That pointer is stored in ``->gp_tasks`` for normal grace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 510) periods and in ``->exp_tasks`` for expedited grace periods. These last
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 511) two fields are ``NULL`` if either there is no grace period in flight or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 512) if there are no blocked tasks preventing that grace period from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 513) completing. If either of these two pointers is referencing a task that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 514) removes itself from the ``->blkd_tasks`` list, then that task must
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 515) advance the pointer to the next task on the list, or set the pointer to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 516) ``NULL`` if there are no subsequent tasks on the list.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 517)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 518) For example, suppose that tasks T1, T2, and T3 are all hard-affinitied
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 519) to the largest-numbered CPU in the system. Then if task T1 blocked in an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 520) RCU read-side critical section, then an expedited grace period started,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 521) then task T2 blocked in an RCU read-side critical section, then a normal
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 522) grace period started, and finally task 3 blocked in an RCU read-side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 523) critical section, then the state of the last leaf ``rcu_node``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 524) structure's blocked-task list would be as shown below:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 525)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 526) .. kernel-figure:: blkd_task.svg
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 527)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 528) Task T1 is blocking both grace periods, task T2 is blocking only the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 529) normal grace period, and task T3 is blocking neither grace period. Note
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 530) that these tasks will not remove themselves from this list immediately
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 531) upon resuming execution. They will instead remain on the list until they
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 532) execute the outermost ``rcu_read_unlock()`` that ends their RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 533) read-side critical section.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 534)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 535) The ``->wait_blkd_tasks`` field indicates whether or not the current
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 536) grace period is waiting on a blocked task.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 537)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 538) Sizing the ``rcu_node`` Array
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 539) '''''''''''''''''''''''''''''
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 540)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 541) The ``rcu_node`` array is sized via a series of C-preprocessor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 542) expressions as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 543)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 544) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 545)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 546) 1 #ifdef CONFIG_RCU_FANOUT
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 547) 2 #define RCU_FANOUT CONFIG_RCU_FANOUT
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 548) 3 #else
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 549) 4 # ifdef CONFIG_64BIT
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 550) 5 # define RCU_FANOUT 64
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 551) 6 # else
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 552) 7 # define RCU_FANOUT 32
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 553) 8 # endif
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 554) 9 #endif
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 555) 10
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 556) 11 #ifdef CONFIG_RCU_FANOUT_LEAF
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 557) 12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 558) 13 #else
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 559) 14 # ifdef CONFIG_64BIT
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 560) 15 # define RCU_FANOUT_LEAF 64
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 561) 16 # else
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 562) 17 # define RCU_FANOUT_LEAF 32
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 563) 18 # endif
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 564) 19 #endif
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 565) 20
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 566) 21 #define RCU_FANOUT_1 (RCU_FANOUT_LEAF)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 567) 22 #define RCU_FANOUT_2 (RCU_FANOUT_1 * RCU_FANOUT)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 568) 23 #define RCU_FANOUT_3 (RCU_FANOUT_2 * RCU_FANOUT)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 569) 24 #define RCU_FANOUT_4 (RCU_FANOUT_3 * RCU_FANOUT)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 570) 25
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 571) 26 #if NR_CPUS <= RCU_FANOUT_1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 572) 27 # define RCU_NUM_LVLS 1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 573) 28 # define NUM_RCU_LVL_0 1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 574) 29 # define NUM_RCU_NODES NUM_RCU_LVL_0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 575) 30 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 576) 31 # define RCU_NODE_NAME_INIT { "rcu_node_0" }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 577) 32 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0" }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 578) 33 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0" }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 579) 34 #elif NR_CPUS <= RCU_FANOUT_2
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 580) 35 # define RCU_NUM_LVLS 2
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 581) 36 # define NUM_RCU_LVL_0 1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 582) 37 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 583) 38 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 584) 39 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 585) 40 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1" }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 586) 41 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1" }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 587) 42 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1" }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 588) 43 #elif NR_CPUS <= RCU_FANOUT_3
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 589) 44 # define RCU_NUM_LVLS 3
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 590) 45 # define NUM_RCU_LVL_0 1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 591) 46 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 592) 47 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 593) 48 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 594) 49 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 595) 50 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2" }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 596) 51 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 597) 52 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 598) 53 #elif NR_CPUS <= RCU_FANOUT_4
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 599) 54 # define RCU_NUM_LVLS 4
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 600) 55 # define NUM_RCU_LVL_0 1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 601) 56 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 602) 57 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 603) 58 # define NUM_RCU_LVL_3 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 604) 59 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 605) 60 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 606) 61 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 607) 62 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 608) 63 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 609) 64 #else
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 610) 65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 611) 66 #endif
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 612)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 613) The maximum number of levels in the ``rcu_node`` structure is currently
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 614) limited to four, as specified by lines 21-24 and the structure of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 615) subsequent “if” statement. For 32-bit systems, this allows
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 616) 16*32*32*32=524,288 CPUs, which should be sufficient for the next few
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 617) years at least. For 64-bit systems, 16*64*64*64=4,194,304 CPUs is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 618) allowed, which should see us through the next decade or so. This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 619) four-level tree also allows kernels built with ``CONFIG_RCU_FANOUT=8``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 620) to support up to 4096 CPUs, which might be useful in very large systems
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 621) having eight CPUs per socket (but please note that no one has yet shown
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 622) any measurable performance degradation due to misaligned socket and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 623) ``rcu_node`` boundaries). In addition, building kernels with a full four
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 624) levels of ``rcu_node`` tree permits better testing of RCU's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 625) combining-tree code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 626)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 627) The ``RCU_FANOUT`` symbol controls how many children are permitted at
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 628) each non-leaf level of the ``rcu_node`` tree. If the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 629) ``CONFIG_RCU_FANOUT`` Kconfig option is not specified, it is set based
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 630) on the word size of the system, which is also the Kconfig default.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 631)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 632) The ``RCU_FANOUT_LEAF`` symbol controls how many CPUs are handled by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 633) each leaf ``rcu_node`` structure. Experience has shown that allowing a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 634) given leaf ``rcu_node`` structure to handle 64 CPUs, as permitted by the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 635) number of bits in the ``->qsmask`` field on a 64-bit system, results in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 636) excessive contention for the leaf ``rcu_node`` structures' ``->lock``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 637) fields. The number of CPUs per leaf ``rcu_node`` structure is therefore
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 638) limited to 16 given the default value of ``CONFIG_RCU_FANOUT_LEAF``. If
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 639) ``CONFIG_RCU_FANOUT_LEAF`` is unspecified, the value selected is based
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 640) on the word size of the system, just as for ``CONFIG_RCU_FANOUT``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 641) Lines 11-19 perform this computation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 642)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 643) Lines 21-24 compute the maximum number of CPUs supported by a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 644) single-level (which contains a single ``rcu_node`` structure),
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 645) two-level, three-level, and four-level ``rcu_node`` tree, respectively,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 646) given the fanout specified by ``RCU_FANOUT`` and ``RCU_FANOUT_LEAF``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 647) These numbers of CPUs are retained in the ``RCU_FANOUT_1``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 648) ``RCU_FANOUT_2``, ``RCU_FANOUT_3``, and ``RCU_FANOUT_4`` C-preprocessor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 649) variables, respectively.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 650)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 651) These variables are used to control the C-preprocessor ``#if`` statement
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 652) spanning lines 26-66 that computes the number of ``rcu_node`` structures
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 653) required for each level of the tree, as well as the number of levels
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 654) required. The number of levels is placed in the ``NUM_RCU_LVLS``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 655) C-preprocessor variable by lines 27, 35, 44, and 54. The number of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 656) ``rcu_node`` structures for the topmost level of the tree is always
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 657) exactly one, and this value is unconditionally placed into
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 658) ``NUM_RCU_LVL_0`` by lines 28, 36, 45, and 55. The rest of the levels
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 659) (if any) of the ``rcu_node`` tree are computed by dividing the maximum
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 660) number of CPUs by the fanout supported by the number of levels from the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 661) current level down, rounding up. This computation is performed by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 662) lines 37, 46-47, and 56-58. Lines 31-33, 40-42, 50-52, and 62-63 create
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 663) initializers for lockdep lock-class names. Finally, lines 64-66 produce
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 664) an error if the maximum number of CPUs is too large for the specified
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 665) fanout.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 666)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 667) The ``rcu_segcblist`` Structure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 668) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 669)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 670) The ``rcu_segcblist`` structure maintains a segmented list of callbacks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 671) as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 672)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 673) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 674)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 675) 1 #define RCU_DONE_TAIL 0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 676) 2 #define RCU_WAIT_TAIL 1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 677) 3 #define RCU_NEXT_READY_TAIL 2
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 678) 4 #define RCU_NEXT_TAIL 3
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 679) 5 #define RCU_CBLIST_NSEGS 4
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 680) 6
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 681) 7 struct rcu_segcblist {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 682) 8 struct rcu_head *head;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 683) 9 struct rcu_head **tails[RCU_CBLIST_NSEGS];
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 684) 10 unsigned long gp_seq[RCU_CBLIST_NSEGS];
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 685) 11 long len;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 686) 12 long len_lazy;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 687) 13 };
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 688)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 689) The segments are as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 690)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 691) #. ``RCU_DONE_TAIL``: Callbacks whose grace periods have elapsed. These
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 692) callbacks are ready to be invoked.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 693) #. ``RCU_WAIT_TAIL``: Callbacks that are waiting for the current grace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 694) period. Note that different CPUs can have different ideas about which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 695) grace period is current, hence the ``->gp_seq`` field.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 696) #. ``RCU_NEXT_READY_TAIL``: Callbacks waiting for the next grace period
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 697) to start.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 698) #. ``RCU_NEXT_TAIL``: Callbacks that have not yet been associated with a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 699) grace period.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 700)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 701) The ``->head`` pointer references the first callback or is ``NULL`` if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 702) the list contains no callbacks (which is *not* the same as being empty).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 703) Each element of the ``->tails[]`` array references the ``->next``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 704) pointer of the last callback in the corresponding segment of the list,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 705) or the list's ``->head`` pointer if that segment and all previous
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 706) segments are empty. If the corresponding segment is empty but some
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 707) previous segment is not empty, then the array element is identical to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 708) its predecessor. Older callbacks are closer to the head of the list, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 709) new callbacks are added at the tail. This relationship between the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 710) ``->head`` pointer, the ``->tails[]`` array, and the callbacks is shown
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 711) in this diagram:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 712)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 713) .. kernel-figure:: nxtlist.svg
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 714)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 715) In this figure, the ``->head`` pointer references the first RCU callback
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 716) in the list. The ``->tails[RCU_DONE_TAIL]`` array element references the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 717) ``->head`` pointer itself, indicating that none of the callbacks is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 718) ready to invoke. The ``->tails[RCU_WAIT_TAIL]`` array element references
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 719) callback CB 2's ``->next`` pointer, which indicates that CB 1 and CB 2
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 720) are both waiting on the current grace period, give or take possible
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 721) disagreements about exactly which grace period is the current one. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 722) ``->tails[RCU_NEXT_READY_TAIL]`` array element references the same RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 723) callback that ``->tails[RCU_WAIT_TAIL]`` does, which indicates that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 724) there are no callbacks waiting on the next RCU grace period. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 725) ``->tails[RCU_NEXT_TAIL]`` array element references CB 4's ``->next``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 726) pointer, indicating that all the remaining RCU callbacks have not yet
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 727) been assigned to an RCU grace period. Note that the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 728) ``->tails[RCU_NEXT_TAIL]`` array element always references the last RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 729) callback's ``->next`` pointer unless the callback list is empty, in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 730) which case it references the ``->head`` pointer.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 731)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 732) There is one additional important special case for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 733) ``->tails[RCU_NEXT_TAIL]`` array element: It can be ``NULL`` when this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 734) list is *disabled*. Lists are disabled when the corresponding CPU is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 735) offline or when the corresponding CPU's callbacks are offloaded to a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 736) kthread, both of which are described elsewhere.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 737)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 738) CPUs advance their callbacks from the ``RCU_NEXT_TAIL`` to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 739) ``RCU_NEXT_READY_TAIL`` to the ``RCU_WAIT_TAIL`` to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 740) ``RCU_DONE_TAIL`` list segments as grace periods advance.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 741)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 742) The ``->gp_seq[]`` array records grace-period numbers corresponding to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 743) the list segments. This is what allows different CPUs to have different
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 744) ideas as to which is the current grace period while still avoiding
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 745) premature invocation of their callbacks. In particular, this allows CPUs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 746) that go idle for extended periods to determine which of their callbacks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 747) are ready to be invoked after reawakening.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 748)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 749) The ``->len`` counter contains the number of callbacks in ``->head``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 750) and the ``->len_lazy`` contains the number of those callbacks that are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 751) known to only free memory, and whose invocation can therefore be safely
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 752) deferred.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 753)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 754) .. important::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 755)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 756) It is the ``->len`` field that determines whether or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 757) not there are callbacks associated with this ``rcu_segcblist``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 758) structure, *not* the ``->head`` pointer. The reason for this is that all
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 759) the ready-to-invoke callbacks (that is, those in the ``RCU_DONE_TAIL``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 760) segment) are extracted all at once at callback-invocation time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 761) (``rcu_do_batch``), due to which ``->head`` may be set to NULL if there
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 762) are no not-done callbacks remaining in the ``rcu_segcblist``. If
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 763) callback invocation must be postponed, for example, because a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 764) high-priority process just woke up on this CPU, then the remaining
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 765) callbacks are placed back on the ``RCU_DONE_TAIL`` segment and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 766) ``->head`` once again points to the start of the segment. In short, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 767) head field can briefly be ``NULL`` even though the CPU has callbacks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 768) present the entire time. Therefore, it is not appropriate to test the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 769) ``->head`` pointer for ``NULL``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 770)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 771) In contrast, the ``->len`` and ``->len_lazy`` counts are adjusted only
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 772) after the corresponding callbacks have been invoked. This means that the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 773) ``->len`` count is zero only if the ``rcu_segcblist`` structure really
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 774) is devoid of callbacks. Of course, off-CPU sampling of the ``->len``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 775) count requires careful use of appropriate synchronization, for example,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 776) memory barriers. This synchronization can be a bit subtle, particularly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 777) in the case of ``rcu_barrier()``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 778)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 779) The ``rcu_data`` Structure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 780) ~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 781)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 782) The ``rcu_data`` maintains the per-CPU state for the RCU subsystem. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 783) fields in this structure may be accessed only from the corresponding CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 784) (and from tracing) unless otherwise stated. This structure is the focus
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 785) of quiescent-state detection and RCU callback queuing. It also tracks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 786) its relationship to the corresponding leaf ``rcu_node`` structure to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 787) allow more-efficient propagation of quiescent states up the ``rcu_node``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 788) combining tree. Like the ``rcu_node`` structure, it provides a local
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 789) copy of the grace-period information to allow for-free synchronized
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 790) access to this information from the corresponding CPU. Finally, this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 791) structure records past dyntick-idle state for the corresponding CPU and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 792) also tracks statistics.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 793)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 794) The ``rcu_data`` structure's fields are discussed, singly and in groups,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 795) in the following sections.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 796)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 797) Connection to Other Data Structures
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 798) '''''''''''''''''''''''''''''''''''
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 799)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 800) This portion of the ``rcu_data`` structure is declared as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 801)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 802) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 803)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 804) 1 int cpu;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 805) 2 struct rcu_node *mynode;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 806) 3 unsigned long grpmask;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 807) 4 bool beenonline;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 808)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 809) The ``->cpu`` field contains the number of the corresponding CPU and the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 810) ``->mynode`` field references the corresponding ``rcu_node`` structure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 811) The ``->mynode`` is used to propagate quiescent states up the combining
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 812) tree. These two fields are constant and therefore do not require
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 813) synchronization.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 814)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 815) The ``->grpmask`` field indicates the bit in the ``->mynode->qsmask``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 816) corresponding to this ``rcu_data`` structure, and is also used when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 817) propagating quiescent states. The ``->beenonline`` flag is set whenever
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 818) the corresponding CPU comes online, which means that the debugfs tracing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 819) need not dump out any ``rcu_data`` structure for which this flag is not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 820) set.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 821)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 822) Quiescent-State and Grace-Period Tracking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 823) '''''''''''''''''''''''''''''''''''''''''
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 824)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 825) This portion of the ``rcu_data`` structure is declared as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 826)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 827) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 828)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 829) 1 unsigned long gp_seq;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 830) 2 unsigned long gp_seq_needed;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 831) 3 bool cpu_no_qs;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 832) 4 bool core_needs_qs;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 833) 5 bool gpwrap;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 834)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 835) The ``->gp_seq`` field is the counterpart of the field of the same name
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 836) in the ``rcu_state`` and ``rcu_node`` structures. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 837) ``->gp_seq_needed`` field is the counterpart of the field of the same
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 838) name in the rcu_node structure. They may each lag up to one behind their
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 839) ``rcu_node`` counterparts, but in ``CONFIG_NO_HZ_IDLE`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 840) ``CONFIG_NO_HZ_FULL`` kernels can lag arbitrarily far behind for CPUs in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 841) dyntick-idle mode (but these counters will catch up upon exit from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 842) dyntick-idle mode). If the lower two bits of a given ``rcu_data``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 843) structure's ``->gp_seq`` are zero, then this ``rcu_data`` structure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 844) believes that RCU is idle.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 845)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 846) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 847) | **Quick Quiz**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 848) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 849) | All this replication of the grace period numbers can only cause |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 850) | massive confusion. Why not just keep a global sequence number and be |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 851) | done with it??? |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 852) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 853) | **Answer**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 854) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 855) | Because if there was only a single global sequence numbers, there |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 856) | would need to be a single global lock to allow safely accessing and |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 857) | updating it. And if we are not going to have a single global lock, we |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 858) | need to carefully manage the numbers on a per-node basis. Recall from |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 859) | the answer to a previous Quick Quiz that the consequences of applying |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 860) | a previously sampled quiescent state to the wrong grace period are |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 861) | quite severe. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 862) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 863)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 864) The ``->cpu_no_qs`` flag indicates that the CPU has not yet passed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 865) through a quiescent state, while the ``->core_needs_qs`` flag indicates
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 866) that the RCU core needs a quiescent state from the corresponding CPU.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 867) The ``->gpwrap`` field indicates that the corresponding CPU has remained
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 868) idle for so long that the ``gp_seq`` counter is in danger of overflow,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 869) which will cause the CPU to disregard the values of its counters on its
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 870) next exit from idle.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 871)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 872) RCU Callback Handling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 873) '''''''''''''''''''''
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 874)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 875) In the absence of CPU-hotplug events, RCU callbacks are invoked by the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 876) same CPU that registered them. This is strictly a cache-locality
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 877) optimization: callbacks can and do get invoked on CPUs other than the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 878) one that registered them. After all, if the CPU that registered a given
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 879) callback has gone offline before the callback can be invoked, there
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 880) really is no other choice.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 881)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 882) This portion of the ``rcu_data`` structure is declared as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 883)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 884) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 885)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 886) 1 struct rcu_segcblist cblist;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 887) 2 long qlen_last_fqs_check;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 888) 3 unsigned long n_cbs_invoked;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 889) 4 unsigned long n_nocbs_invoked;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 890) 5 unsigned long n_cbs_orphaned;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 891) 6 unsigned long n_cbs_adopted;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 892) 7 unsigned long n_force_qs_snap;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 893) 8 long blimit;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 894)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 895) The ``->cblist`` structure is the segmented callback list described
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 896) earlier. The CPU advances the callbacks in its ``rcu_data`` structure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 897) whenever it notices that another RCU grace period has completed. The CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 898) detects the completion of an RCU grace period by noticing that the value
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 899) of its ``rcu_data`` structure's ``->gp_seq`` field differs from that of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 900) its leaf ``rcu_node`` structure. Recall that each ``rcu_node``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 901) structure's ``->gp_seq`` field is updated at the beginnings and ends of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 902) each grace period.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 903)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 904) The ``->qlen_last_fqs_check`` and ``->n_force_qs_snap`` coordinate the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 905) forcing of quiescent states from ``call_rcu()`` and friends when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 906) callback lists grow excessively long.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 907)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 908) The ``->n_cbs_invoked``, ``->n_cbs_orphaned``, and ``->n_cbs_adopted``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 909) fields count the number of callbacks invoked, sent to other CPUs when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 910) this CPU goes offline, and received from other CPUs when those other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 911) CPUs go offline. The ``->n_nocbs_invoked`` is used when the CPU's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 912) callbacks are offloaded to a kthread.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 913)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 914) Finally, the ``->blimit`` counter is the maximum number of RCU callbacks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 915) that may be invoked at a given time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 916)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 917) Dyntick-Idle Handling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 918) '''''''''''''''''''''
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 919)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 920) This portion of the ``rcu_data`` structure is declared as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 921)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 922) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 923)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 924) 1 int dynticks_snap;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 925) 2 unsigned long dynticks_fqs;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 926)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 927) The ``->dynticks_snap`` field is used to take a snapshot of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 928) corresponding CPU's dyntick-idle state when forcing quiescent states,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 929) and is therefore accessed from other CPUs. Finally, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 930) ``->dynticks_fqs`` field is used to count the number of times this CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 931) is determined to be in dyntick-idle state, and is used for tracing and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 932) debugging purposes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 933)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 934) This portion of the rcu_data structure is declared as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 935)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 936) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 937)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 938) 1 long dynticks_nesting;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 939) 2 long dynticks_nmi_nesting;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 940) 3 atomic_t dynticks;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 941) 4 bool rcu_need_heavy_qs;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 942) 5 bool rcu_urgent_qs;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 943)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 944) These fields in the rcu_data structure maintain the per-CPU dyntick-idle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 945) state for the corresponding CPU. The fields may be accessed only from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 946) the corresponding CPU (and from tracing) unless otherwise stated.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 947)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 948) The ``->dynticks_nesting`` field counts the nesting depth of process
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 949) execution, so that in normal circumstances this counter has value zero
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 950) or one. NMIs, irqs, and tracers are counted by the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 951) ``->dynticks_nmi_nesting`` field. Because NMIs cannot be masked, changes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 952) to this variable have to be undertaken carefully using an algorithm
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 953) provided by Andy Lutomirski. The initial transition from idle adds one,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 954) and nested transitions add two, so that a nesting level of five is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 955) represented by a ``->dynticks_nmi_nesting`` value of nine. This counter
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 956) can therefore be thought of as counting the number of reasons why this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 957) CPU cannot be permitted to enter dyntick-idle mode, aside from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 958) process-level transitions.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 959)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 960) However, it turns out that when running in non-idle kernel context, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 961) Linux kernel is fully capable of entering interrupt handlers that never
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 962) exit and perhaps also vice versa. Therefore, whenever the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 963) ``->dynticks_nesting`` field is incremented up from zero, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 964) ``->dynticks_nmi_nesting`` field is set to a large positive number, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 965) whenever the ``->dynticks_nesting`` field is decremented down to zero,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 966) the ``->dynticks_nmi_nesting`` field is set to zero. Assuming that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 967) the number of misnested interrupts is not sufficient to overflow the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 968) counter, this approach corrects the ``->dynticks_nmi_nesting`` field
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 969) every time the corresponding CPU enters the idle loop from process
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 970) context.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 971)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 972) The ``->dynticks`` field counts the corresponding CPU's transitions to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 973) and from either dyntick-idle or user mode, so that this counter has an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 974) even value when the CPU is in dyntick-idle mode or user mode and an odd
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 975) value otherwise. The transitions to/from user mode need to be counted
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 976) for user mode adaptive-ticks support (see timers/NO_HZ.txt).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 977)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 978) The ``->rcu_need_heavy_qs`` field is used to record the fact that the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 979) RCU core code would really like to see a quiescent state from the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 980) corresponding CPU, so much so that it is willing to call for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 981) heavy-weight dyntick-counter operations. This flag is checked by RCU's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 982) context-switch and ``cond_resched()`` code, which provide a momentary
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 983) idle sojourn in response.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 984)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 985) Finally, the ``->rcu_urgent_qs`` field is used to record the fact that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 986) the RCU core code would really like to see a quiescent state from the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 987) corresponding CPU, with the various other fields indicating just how
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 988) badly RCU wants this quiescent state. This flag is checked by RCU's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 989) context-switch path (``rcu_note_context_switch``) and the cond_resched
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 990) code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 991)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 992) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 993) | **Quick Quiz**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 994) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 995) | Why not simply combine the ``->dynticks_nesting`` and |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 996) | ``->dynticks_nmi_nesting`` counters into a single counter that just |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 997) | counts the number of reasons that the corresponding CPU is non-idle? |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 998) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 999) | **Answer**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1000) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1001) | Because this would fail in the presence of interrupts whose handlers |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1002) | never return and of handlers that manage to return from a made-up |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1003) | interrupt. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1004) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1005)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1006) Additional fields are present for some special-purpose builds, and are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1007) discussed separately.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1008)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1009) The ``rcu_head`` Structure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1010) ~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1011)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1012) Each ``rcu_head`` structure represents an RCU callback. These structures
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1013) are normally embedded within RCU-protected data structures whose
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1014) algorithms use asynchronous grace periods. In contrast, when using
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1015) algorithms that block waiting for RCU grace periods, RCU users need not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1016) provide ``rcu_head`` structures.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1017)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1018) The ``rcu_head`` structure has fields as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1019)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1020) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1021)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1022) 1 struct rcu_head *next;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1023) 2 void (*func)(struct rcu_head *head);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1024)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1025) The ``->next`` field is used to link the ``rcu_head`` structures
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1026) together in the lists within the ``rcu_data`` structures. The ``->func``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1027) field is a pointer to the function to be called when the callback is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1028) ready to be invoked, and this function is passed a pointer to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1029) ``rcu_head`` structure. However, ``kfree_rcu()`` uses the ``->func``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1030) field to record the offset of the ``rcu_head`` structure within the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1031) enclosing RCU-protected data structure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1032)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1033) Both of these fields are used internally by RCU. From the viewpoint of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1034) RCU users, this structure is an opaque “cookie”.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1035)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1036) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1037) | **Quick Quiz**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1038) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1039) | Given that the callback function ``->func`` is passed a pointer to |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1040) | the ``rcu_head`` structure, how is that function supposed to find the |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1041) | beginning of the enclosing RCU-protected data structure? |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1042) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1043) | **Answer**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1044) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1045) | In actual practice, there is a separate callback function per type of |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1046) | RCU-protected data structure. The callback function can therefore use |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1047) | the ``container_of()`` macro in the Linux kernel (or other |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1048) | pointer-manipulation facilities in other software environments) to |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1049) | find the beginning of the enclosing structure. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1050) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1051)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1052) RCU-Specific Fields in the ``task_struct`` Structure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1053) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1054)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1055) The ``CONFIG_PREEMPT_RCU`` implementation uses some additional fields in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1056) the ``task_struct`` structure:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1057)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1058) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1059)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1060) 1 #ifdef CONFIG_PREEMPT_RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1061) 2 int rcu_read_lock_nesting;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1062) 3 union rcu_special rcu_read_unlock_special;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1063) 4 struct list_head rcu_node_entry;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1064) 5 struct rcu_node *rcu_blocked_node;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1065) 6 #endif /* #ifdef CONFIG_PREEMPT_RCU */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1066) 7 #ifdef CONFIG_TASKS_RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1067) 8 unsigned long rcu_tasks_nvcsw;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1068) 9 bool rcu_tasks_holdout;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1069) 10 struct list_head rcu_tasks_holdout_list;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1070) 11 int rcu_tasks_idle_cpu;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1071) 12 #endif /* #ifdef CONFIG_TASKS_RCU */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1072)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1073) The ``->rcu_read_lock_nesting`` field records the nesting level for RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1074) read-side critical sections, and the ``->rcu_read_unlock_special`` field
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1075) is a bitmask that records special conditions that require
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1076) ``rcu_read_unlock()`` to do additional work. The ``->rcu_node_entry``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1077) field is used to form lists of tasks that have blocked within
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1078) preemptible-RCU read-side critical sections and the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1079) ``->rcu_blocked_node`` field references the ``rcu_node`` structure whose
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1080) list this task is a member of, or ``NULL`` if it is not blocked within a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1081) preemptible-RCU read-side critical section.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1082)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1083) The ``->rcu_tasks_nvcsw`` field tracks the number of voluntary context
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1084) switches that this task had undergone at the beginning of the current
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1085) tasks-RCU grace period, ``->rcu_tasks_holdout`` is set if the current
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1086) tasks-RCU grace period is waiting on this task,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1087) ``->rcu_tasks_holdout_list`` is a list element enqueuing this task on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1088) the holdout list, and ``->rcu_tasks_idle_cpu`` tracks which CPU this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1089) idle task is running, but only if the task is currently running, that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1090) is, if the CPU is currently idle.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1091)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1092) Accessor Functions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1093) ~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1094)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1095) The following listing shows the ``rcu_get_root()``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1096) ``rcu_for_each_node_breadth_first`` and ``rcu_for_each_leaf_node()``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1097) function and macros:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1098)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1099) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1100)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1101) 1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1102) 2 {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1103) 3 return &rsp->node[0];
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1104) 4 }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1105) 5
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1106) 6 #define rcu_for_each_node_breadth_first(rsp, rnp) \
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1107) 7 for ((rnp) = &(rsp)->node[0]; \
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1108) 8 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1109) 9
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1110) 10 #define rcu_for_each_leaf_node(rsp, rnp) \
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1111) 11 for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1112) 12 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1113)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1114) The ``rcu_get_root()`` simply returns a pointer to the first element of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1115) the specified ``rcu_state`` structure's ``->node[]`` array, which is the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1116) root ``rcu_node`` structure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1117)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1118) As noted earlier, the ``rcu_for_each_node_breadth_first()`` macro takes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1119) advantage of the layout of the ``rcu_node`` structures in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1120) ``rcu_state`` structure's ``->node[]`` array, performing a breadth-first
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1121) traversal by simply traversing the array in order. Similarly, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1122) ``rcu_for_each_leaf_node()`` macro traverses only the last part of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1123) array, thus traversing only the leaf ``rcu_node`` structures.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1124)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1125) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1126) | **Quick Quiz**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1127) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1128) | What does ``rcu_for_each_leaf_node()`` do if the ``rcu_node`` tree |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1129) | contains only a single node? |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1130) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1131) | **Answer**: |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1132) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1133) | In the single-node case, ``rcu_for_each_leaf_node()`` traverses the |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1134) | single node. |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1135) +-----------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1136)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1137) Summary
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1138) ~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1139)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1140) So the state of RCU is represented by an ``rcu_state`` structure, which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1141) contains a combining tree of ``rcu_node`` and ``rcu_data`` structures.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1142) Finally, in ``CONFIG_NO_HZ_IDLE`` kernels, each CPU's dyntick-idle state
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1143) is tracked by dynticks-related fields in the ``rcu_data`` structure. If
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1144) you made it this far, you are well prepared to read the code
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1145) walkthroughs in the other articles in this series.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1146)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1147) Acknowledgments
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1148) ~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1149)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1150) I owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1151) Turner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1152) helping me get this document into a more human-readable state.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1153)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1154) Legal Statement
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1155) ~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1156)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1157) This work represents the view of the author and does not necessarily
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1158) represent the view of IBM.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1159)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1160) Linux is a registered trademark of Linus Torvalds.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1161)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1162) Other company, product, and service names may be trademarks or service
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1163) marks of others.