^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) =========================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2) Cluster-wide Power-up/power-down race avoidance algorithm
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) =========================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) This file documents the algorithm which is used to coordinate CPU and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6) cluster setup and teardown operations and to manage hardware coherency
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) controls safely.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9) The section "Rationale" explains what the algorithm is for and why it is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10) needed. "Basic model" explains general concepts using a simplified view
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11) of the system. The other sections explain the actual details of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12) algorithm in use.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15) Rationale
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16) ---------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18) In a system containing multiple CPUs, it is desirable to have the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) ability to turn off individual CPUs when the system is idle, reducing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20) power consumption and thermal dissipation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22) In a system containing multiple clusters of CPUs, it is also desirable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) to have the ability to turn off entire clusters.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25) Turning entire clusters off and on is a risky business, because it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26) involves performing potentially destructive operations affecting a group
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27) of independently running CPUs, while the OS continues to run. This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28) means that we need some coordination in order to ensure that critical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29) cluster-level operations are only performed when it is truly safe to do
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30) so.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32) Simple locking may not be sufficient to solve this problem, because
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) mechanisms like Linux spinlocks may rely on coherency mechanisms which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34) are not immediately enabled when a cluster powers up. Since enabling or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35) disabling those mechanisms may itself be a non-atomic operation (such as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) writing some hardware registers and invalidating large caches), other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37) methods of coordination are required in order to guarantee safe
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) power-down and power-up at the cluster level.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40) This document presents a coherent-memory-based protocol for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41) performing the needed coordination. It aims to be as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42) lightweight as possible, while providing the required safety properties.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45) Basic model
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) -----------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48) Each cluster and CPU is assigned a state, as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) - DOWN
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51) - COMING_UP
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52) - UP
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53) - GOING_DOWN
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57)     +---------> UP ----------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58)     |                        v
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60)     COMING_UP                GOING_DOWN
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62)     ^                        |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63)     +--------- DOWN <--------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66) DOWN:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67) The CPU or cluster is not coherent, and is either powered off or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68) suspended, or is ready to be powered off or suspended.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70) COMING_UP:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71) The CPU or cluster has committed to moving to the UP state.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72) It may be part way through the process of initialisation and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73) enabling coherency.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75) UP:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76) The CPU or cluster is active and coherent at the hardware
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77) level. A CPU in this state is not necessarily being used
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78) actively by the kernel.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80) GOING_DOWN:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81) The CPU or cluster has committed to moving to the DOWN
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82) state. It may be part way through the process of teardown and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83) coherency exit.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86) Each CPU has one of these states assigned to it at any point in time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87) The CPU states are described in the "CPU state" section, below.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89) Each cluster is also assigned a state, but it is necessary to split the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90) state value into two parts (the "cluster" state and "inbound" state) and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91) to introduce additional states in order to avoid races between different
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92) CPUs in the cluster simultaneously modifying the state. The cluster-
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93) level states are described in the "Cluster state" section.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95) To help distinguish the CPU states from cluster states in this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96) discussion, the state names are given a `CPU_` prefix for the CPU states,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97) and a `CLUSTER_` or `INBOUND_` prefix for the cluster states.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) CPU state
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) ---------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) In this algorithm, each individual core in a multi-core processor is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) referred to as a "CPU". CPUs are assumed to be single-threaded:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) therefore, a CPU can only be doing one thing at a single point in time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) This means that CPUs fit the basic model closely.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) The algorithm defines the following states for each CPU in the system:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) - CPU_DOWN
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) - CPU_COMING_UP
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) - CPU_UP
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) - CPU_GOING_DOWN
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118)                      cluster setup and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119)                     CPU setup complete          policy decision
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120)           +-----------> CPU_UP ------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121)           |                                v
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123)     CPU_COMING_UP                   CPU_GOING_DOWN
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125)           ^                                |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126)           +----------- CPU_DOWN <----------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127)      policy decision                  CPU teardown complete
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128)     or hardware event
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) The definitions of the four states correspond closely to the states of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) the basic model.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) Transitions between states occur as follows.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) A trigger event (spontaneous) means that the CPU can transition to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) next state as a result of making local progress only, with no
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) requirement for any external event to happen.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) CPU_DOWN:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) A CPU reaches the CPU_DOWN state when it is ready for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) power-down. On reaching this state, the CPU will typically
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) power itself down or suspend itself, via a WFI instruction or a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) firmware call.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) Next state:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) CPU_COMING_UP
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) Conditions:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150) none
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152) Trigger events:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153) a) an explicit hardware power-up operation, resulting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) from a policy decision on another CPU;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) b) a hardware event, such as an interrupt.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159) CPU_COMING_UP:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) A CPU cannot start participating in hardware coherency until the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) cluster is set up and coherent. If the cluster is not ready,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162) then the CPU will wait in the CPU_COMING_UP state until the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) cluster has been set up.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) Next state:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) CPU_UP
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) Conditions:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) The CPU's parent cluster must be in CLUSTER_UP.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) Trigger events:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) Transition of the parent cluster to CLUSTER_UP.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) Refer to the "Cluster state" section for a description of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) CLUSTER_UP state.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) CPU_UP:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177) When a CPU reaches the CPU_UP state, it is safe for the CPU to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178) start participating in local coherency.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180) This is done by jumping to the kernel's CPU resume code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182) Note that the definition of this state is slightly different
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183) from the basic model definition: CPU_UP does not mean that the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184) CPU is coherent yet, but it does mean that it is safe to resume
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185) the kernel. The kernel handles the rest of the resume
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186) procedure, so the remaining steps are not visible as part of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187) race avoidance algorithm.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189) The CPU remains in this state until an explicit policy decision
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190) is made to shut down or suspend the CPU.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192) Next state:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193) CPU_GOING_DOWN
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194) Conditions:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195) none
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196) Trigger events:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197) explicit policy decision
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200) CPU_GOING_DOWN:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201) While in this state, the CPU exits coherency, including any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202) operations required to achieve this (such as cleaning data
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203) caches).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205) Next state:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 206) CPU_DOWN
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 207) Conditions:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 208) local CPU teardown complete
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 209) Trigger events:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 210) (spontaneous)
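
The CPU state rules above can be summarised in a short C sketch. It is
an illustration only, not the kernel's implementation: the enum values
are arbitrary, and the helper names cpu_next_state() and
cpu_may_advance(), together with their boolean parameters, are
assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* State names follow this document; the numeric values are arbitrary. */
enum cpu_state { CPU_DOWN, CPU_COMING_UP, CPU_UP, CPU_GOING_DOWN };

/* Hypothetical helper: the single successor of each CPU state. */
enum cpu_state cpu_next_state(enum cpu_state s)
{
    switch (s) {
    case CPU_DOWN:       return CPU_COMING_UP;
    case CPU_COMING_UP:  return CPU_UP;
    case CPU_UP:         return CPU_GOING_DOWN;
    case CPU_GOING_DOWN: return CPU_DOWN;
    }
    assert(!"invalid CPU state");
    return s;
}

/*
 * Hypothetical helper: are the "Conditions" for leaving state s met?
 *   cluster_up    - the parent cluster is in CLUSTER_UP
 *   teardown_done - local CPU teardown is complete
 * CPU_DOWN and CPU_UP have no conditions; they only wait for a trigger
 * event, which is not modelled here.
 */
bool cpu_may_advance(enum cpu_state s, bool cluster_up, bool teardown_done)
{
    switch (s) {
    case CPU_COMING_UP:  return cluster_up;
    case CPU_GOING_DOWN: return teardown_done;
    default:             return true;
    }
}
```

Keeping the successor and the condition in separate functions mirrors
the "Next state" and "Conditions" entries used in the descriptions
above.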
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 211)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 212)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 213) Cluster state
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 214) -------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 215)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 216) A cluster is a group of connected CPUs with some common resources.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 217) Because a cluster contains multiple CPUs, it can be doing multiple
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 218) things at the same time. This has some implications. In particular, a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 219) CPU can start up while another CPU is tearing the cluster down.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 220)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 221) In this discussion, the "outbound side" is the view of the cluster state
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 222) as seen by a CPU tearing the cluster down. The "inbound side" is the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 223) view of the cluster state as seen by a CPU setting the cluster up.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 224)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 225) In order to enable safe coordination in such situations, it is important
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 226) that a CPU which is setting up the cluster can advertise its state
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 227) independently of the CPU which is tearing down the cluster. For this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 228) reason, the cluster state is split into two parts:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 229)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 230) "cluster" state: The global state of the cluster; or the state
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 231) on the outbound side:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 232)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 233) - CLUSTER_DOWN
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 234) - CLUSTER_UP
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 235) - CLUSTER_GOING_DOWN
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 236)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 237) "inbound" state: The state of the cluster on the inbound side.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 238)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 239) - INBOUND_NOT_COMING_UP
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 240) - INBOUND_COMING_UP
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 241)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 242)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 243) The different pairings of these states result in six possible
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 244) states for the cluster as a whole::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 245)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 246)                       CLUSTER_UP
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 247)     +==========> INBOUND_NOT_COMING_UP -------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 248)     #                                               |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 249)                                                     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 250)          CLUSTER_UP     <----+                      |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 251)       INBOUND_COMING_UP      |                      v
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 252)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 253)     ^             CLUSTER_GOING_DOWN       CLUSTER_GOING_DOWN
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 254)     #              INBOUND_COMING_UP <=== INBOUND_NOT_COMING_UP
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 255)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 256)         CLUSTER_DOWN         |                      |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 257)       INBOUND_COMING_UP <----+                      |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 258)                                                     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 259)     ^                                               |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 260)     +===========     CLUSTER_DOWN      <------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 261)                  INBOUND_NOT_COMING_UP
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 262)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 263) Transitions -----> can only be made by the outbound CPU, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 264) only involve changes to the "cluster" state.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 265)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 266) Transitions ===##> can only be made by the inbound CPU, and only
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 267) involve changes to the "inbound" state, except where there is no
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 268) further transition possible on the outbound side (i.e., the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 269) outbound CPU has put the cluster into the CLUSTER_DOWN state).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 270)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 271) The race avoidance algorithm does not provide a way to determine
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 272) which exact CPUs within the cluster play these roles. This must
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 273) be decided in advance by some other means. Refer to the section
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 274) "Last man and first man selection" for more explanation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 275)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 276)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 277) CLUSTER_DOWN/INBOUND_NOT_COMING_UP is the only state where the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 278) cluster can actually be powered down.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 279)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 280) The parallelism of the inbound and outbound CPUs is observed by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 281) the existence of two different paths from CLUSTER_GOING_DOWN/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 282) INBOUND_NOT_COMING_UP (corresponding to GOING_DOWN in the basic
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 283) model) to CLUSTER_DOWN/INBOUND_COMING_UP (corresponding to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 284) COMING_UP in the basic model). The second path avoids cluster
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 285) teardown completely.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 286)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 287) CLUSTER_UP/INBOUND_COMING_UP is equivalent to UP in the basic
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 288) model. The final transition to CLUSTER_UP/INBOUND_NOT_COMING_UP
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 289) is trivial and merely resets the state machine ready for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 290) next cycle.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 291)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 292) Details of the allowable transitions follow.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 293)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 294) The next state in each case is notated
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 295)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 296) <cluster state>/<inbound state> (<transitioner>)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 297)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 298) where the <transitioner> is the side on which the transition
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 299) can occur; either the inbound or the outbound side.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 300)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 301)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 302) CLUSTER_DOWN/INBOUND_NOT_COMING_UP:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 303) Next state:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 304) CLUSTER_DOWN/INBOUND_COMING_UP (inbound)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 305) Conditions:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 306) none
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 307)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 308) Trigger events:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 309) a) an explicit hardware power-up operation, resulting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 310) from a policy decision on another CPU;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 311)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 312) b) a hardware event, such as an interrupt.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 313)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 314)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 315) CLUSTER_DOWN/INBOUND_COMING_UP:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 316)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 317) In this state, an inbound CPU sets up the cluster, including
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 318) enabling of hardware coherency at the cluster level and any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 319) other operations (such as cache invalidation) which are required
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 320) in order to achieve this.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 321)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 322) The purpose of this state is to do sufficient cluster-level
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 323) setup to enable other CPUs in the cluster to enter coherency
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 324) safely.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 325)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 326) Next state:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 327) CLUSTER_UP/INBOUND_COMING_UP (inbound)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 328) Conditions:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 329) cluster-level setup and hardware coherency complete
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 330) Trigger events:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 331) (spontaneous)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 332)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 333)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 334) CLUSTER_UP/INBOUND_COMING_UP:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 335)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 336) Cluster-level setup is complete and hardware coherency is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 337) enabled for the cluster. Other CPUs in the cluster can safely
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 338) enter coherency.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 339)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 340) This is a transient state, leading immediately to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 341) CLUSTER_UP/INBOUND_NOT_COMING_UP. All other CPUs on the cluster
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 342) should treat these two states as equivalent.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 343)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 344) Next state:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 345) CLUSTER_UP/INBOUND_NOT_COMING_UP (inbound)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 346) Conditions:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 347) none
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 348) Trigger events:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 349) (spontaneous)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 350)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 351)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 352) CLUSTER_UP/INBOUND_NOT_COMING_UP:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 353)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 354) Cluster-level setup is complete and hardware coherency is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 355) enabled for the cluster. Other CPUs in the cluster can safely
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 356) enter coherency.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 357)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 358) The cluster will remain in this state until a policy decision is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 359) made to power the cluster down.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 360)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 361) Next state:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 362) CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP (outbound)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 363) Conditions:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 364) none
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 365) Trigger events:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 366) policy decision to power down the cluster
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 367)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 368)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 369) CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 370)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 371) An outbound CPU is tearing the cluster down. The selected CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 372) must wait in this state until all CPUs in the cluster are in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 373) CPU_DOWN state.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 374)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 375) When all CPUs are in the CPU_DOWN state, the cluster can be torn
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 376) down, for example by cleaning data caches and exiting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 377) cluster-level coherency.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 378)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 379) To avoid wasteful unnecessary teardown operations, the outbound
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 380) CPU should check the inbound cluster state for asynchronous
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 381) transitions to INBOUND_COMING_UP. Alternatively, individual
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 382) CPUs can be checked for entry into CPU_COMING_UP or CPU_UP.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 383)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 384)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 385) Next states:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 386)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 387) CLUSTER_DOWN/INBOUND_NOT_COMING_UP (outbound)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 388) Conditions:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 389) cluster torn down and ready to power off
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 390) Trigger events:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 391) (spontaneous)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 392)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 393) CLUSTER_GOING_DOWN/INBOUND_COMING_UP (inbound)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 394) Conditions:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 395) none
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 396)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 397) Trigger events:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 398) a) an explicit hardware power-up operation,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 399) resulting from a policy decision on another
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 400) CPU;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 401)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 402) b) a hardware event, such as an interrupt.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 403)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 404)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 405) CLUSTER_GOING_DOWN/INBOUND_COMING_UP:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 406)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 407) The cluster is (or was) being torn down, but another CPU has
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 408) come online in the meantime and is trying to set up the cluster
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 409) again.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 410)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 411) If the outbound CPU observes this state, it has two choices:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 412)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 413) a) back out of teardown, restoring the cluster to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 414) CLUSTER_UP state;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 415)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 416) b) finish tearing the cluster down and put the cluster
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 417) in the CLUSTER_DOWN state; the inbound CPU will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 418) set up the cluster again from there.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 419)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 420) Choice (a) permits the removal of some latency by avoiding
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 421) unnecessary teardown and setup operations in situations where
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 422) the cluster is not really going to be powered down.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 423)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 424)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 425) Next states:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 426)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 427)     CLUSTER_UP/INBOUND_COMING_UP (outbound)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 428)         Conditions:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 429)             cluster-level setup and hardware
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 430)             coherency complete
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 431)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 432)         Trigger events:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 433)             (spontaneous)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 434)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 435)     CLUSTER_DOWN/INBOUND_COMING_UP (outbound)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 436)         Conditions:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 437)             cluster torn down and ready to power off
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 438)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 439)         Trigger events:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 440)             (spontaneous)
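
The two choices open to the outbound CPU in this state can be sketched in
C as follows. This is a minimal userspace model, not the kernel
implementation: struct sketch_sync, sketch_outbound_resolve() and the
prefer_abort parameter are hypothetical names introduced only for
illustration. Only the state names are taken from this document.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two halves of the cluster state. */
enum sketch_cluster { CLUSTER_DOWN, CLUSTER_UP, CLUSTER_GOING_DOWN };
enum sketch_inbound { INBOUND_NOT_COMING_UP, INBOUND_COMING_UP };

struct sketch_sync {
	enum sketch_cluster cluster;	/* written by the outbound side */
	enum sketch_inbound inbound;	/* written by the inbound side */
};

/*
 * Outbound side, called while the cluster is CLUSTER_GOING_DOWN.
 *
 * prefer_abort models a policy decision (an assumption of this sketch):
 * when an inbound CPU has been observed, back out of teardown rather
 * than finishing it. Backing out is only valid while cluster-level
 * setup and hardware coherency are still intact.
 */
enum sketch_cluster sketch_outbound_resolve(struct sketch_sync *s,
					    bool prefer_abort)
{
	assert(s->cluster == CLUSTER_GOING_DOWN);

	if (s->inbound == INBOUND_COMING_UP && prefer_abort)
		s->cluster = CLUSTER_UP;	/* choice (a): back out */
	else
		s->cluster = CLUSTER_DOWN;	/* choice (b): finish teardown */

	/* The inbound half is never touched by the outbound CPU. */
	return s->cluster;
}
```

Either way, only the cluster half of the state changes. The inbound half
stays INBOUND_COMING_UP, matching the two "Next states" listed above.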
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 441)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 442)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 443) Last man and First man selection
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 444) --------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 445)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 446) The CPU which performs cluster tear-down operations on the outbound side
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 447) is commonly referred to as the "last man".
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 448)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 449) The CPU which performs cluster setup on the inbound side is commonly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 450) referred to as the "first man".
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 451)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 452) The race avoidance algorithm documented above does not provide a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 453) mechanism to choose which CPUs should play these roles.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 454)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 455)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 456) Last man:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 457)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 458) When shutting down the cluster, all the CPUs involved are initially
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 459) executing Linux and hence coherent. Therefore, ordinary spinlocks can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 460) be used to select a last man safely, before the CPUs become
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 461) non-coherent.
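
As an illustration of lock-based last man selection, the sketch below
tracks which CPUs are still in use under a simple spinlock. It is a
userspace model only: the lock is built from a C11 atomic_flag rather
than a kernel spinlock, and every name in it (SKETCH_MAX_CPUS,
sketch_cpu_up(), sketch_cpu_leave() and so on) is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_MAX_CPUS 4	/* hypothetical cluster size */

static atomic_flag sketch_lock = ATOMIC_FLAG_INIT;
static bool sketch_cpu_in_use[SKETCH_MAX_CPUS];

static void sketch_spin_lock(void)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&sketch_lock,
						 memory_order_acquire))
		;
}

static void sketch_spin_unlock(void)
{
	atomic_flag_clear_explicit(&sketch_lock, memory_order_release);
}

/* Mark a CPU as running in the cluster. */
void sketch_cpu_up(unsigned int cpu)
{
	assert(cpu < SKETCH_MAX_CPUS);
	sketch_spin_lock();
	sketch_cpu_in_use[cpu] = true;
	sketch_spin_unlock();
}

/*
 * Mark a CPU as leaving the cluster. Returns true if no other CPU is
 * still in use, i.e. the caller has been selected as the last man.
 * The check and the update happen under the same lock, so at most one
 * leaving CPU can see an otherwise empty cluster.
 */
bool sketch_cpu_leave(unsigned int cpu)
{
	bool last_man = true;
	unsigned int i;

	assert(cpu < SKETCH_MAX_CPUS);
	sketch_spin_lock();
	sketch_cpu_in_use[cpu] = false;
	for (i = 0; i < SKETCH_MAX_CPUS; i++)
		if (sketch_cpu_in_use[i])
			last_man = false;
	sketch_spin_unlock();

	return last_man;
}
```

This approach relies on all participating CPUs being coherent while the
lock is held, which is why it is suitable on the outbound side only.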
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 462)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 463)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 464) First man:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 465)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 466) Because CPUs may power up asynchronously in response to external wake-up
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 467) events, a dynamic mechanism is needed to make sure that only one CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 468) attempts to play the first man role and do the cluster-level
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 469) initialisation: any other CPUs must wait for this to complete before
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 470) proceeding.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 471)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 472) Cluster-level initialisation may involve actions such as configuring
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 473) coherency controls in the bus fabric.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 474)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 475) The current implementation in mcpm_head.S uses a separate mutual exclusion
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 476) mechanism to do this arbitration. This mechanism is documented in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 477) detail in vlocks.txt.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 478)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 479)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 480) Features and Limitations
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 481) ------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 482)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 483) Implementation:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 484)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 485) The current ARM-based implementation is split between
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 486) arch/arm/common/mcpm_head.S (low-level inbound CPU operations) and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 487) arch/arm/common/mcpm_entry.c (everything else):
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 488)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 489) __mcpm_cpu_going_down() signals the transition of a CPU to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 490) CPU_GOING_DOWN state.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 491)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 492) __mcpm_cpu_down() signals the transition of a CPU to the CPU_DOWN
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 493) state.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 494)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 495) A CPU transitions to CPU_COMING_UP and then to CPU_UP via the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 496) low-level power-up code in mcpm_head.S. This could
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 497) involve CPU-specific setup code, but in the current
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 498) implementation it does not.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 499)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 500) __mcpm_outbound_enter_critical() and __mcpm_outbound_leave_critical()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 501) handle transitions from CLUSTER_UP to CLUSTER_GOING_DOWN
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 502) and from there to CLUSTER_DOWN or back to CLUSTER_UP (in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 503) the case of an aborted cluster power-down).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 504)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 505) These functions are more complex than the __mcpm_cpu_*()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 506) functions due to the extra inter-CPU coordination which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 507) is needed for safe transitions at the cluster level.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 508)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 509) A cluster transitions from CLUSTER_DOWN back to CLUSTER_UP via
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 510) the low-level power-up code in mcpm_head.S. This typically
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 511) involves platform-specific setup code, supplied by the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 512) power_up_setup function that the platform registers via
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 513) mcpm_sync_init.
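
The order in which an outbound CPU might invoke the outbound helpers
described above can be sketched with stubs that only record a trace. This
is an illustrative userspace mock, not code from mcpm_entry.c: the sk_*
functions are hypothetical stand-ins for the corresponding __mcpm_*()
helpers, they omit any CPU and cluster arguments, and sk_inbound_present
is an assumption used to model an inbound CPU forcing an abort.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for this sketch; none of this is kernel code. */
enum sk_cluster_state { SK_CLUSTER_DOWN, SK_CLUSTER_UP };

static char sk_trace[128];	/* ordered record of the steps taken */
static size_t sk_len;
bool sk_inbound_present;	/* models an inbound CPU racing with us */

static void sk_log(const char *step)
{
	size_t room = sizeof(sk_trace) - sk_len;
	int n = snprintf(sk_trace + sk_len, room, "%s;", step);

	if (n > 0 && (size_t)n < room)
		sk_len += (size_t)n;
}

void sk_reset(void)
{
	sk_trace[0] = '\0';
	sk_len = 0;
}

static void sk_cpu_going_down(void) { sk_log("cpu_going_down"); }
static void sk_cpu_down(void) { sk_log("cpu_down"); }

/* Returns false if cluster teardown must be aborted. */
static bool sk_outbound_enter_critical(void)
{
	if (sk_inbound_present) {
		sk_log("enter_critical=abort");
		return false;
	}
	sk_log("enter_critical=ok");
	return true;
}

static void sk_outbound_leave_critical(enum sk_cluster_state state)
{
	assert(state == SK_CLUSTER_DOWN || state == SK_CLUSTER_UP);
	sk_log(state == SK_CLUSTER_DOWN ? "leave_critical(DOWN)"
					: "leave_critical(UP)");
}

/* One possible power-down path for a single outbound CPU. */
void sk_power_down(bool last_man)
{
	sk_cpu_going_down();			/* CPU_GOING_DOWN */

	if (last_man && sk_outbound_enter_critical()) {
		sk_log("cluster_teardown");	/* cluster-level work */
		sk_outbound_leave_critical(SK_CLUSTER_DOWN);
	} else {
		sk_log("cpu_teardown");		/* CPU-level work only */
	}

	sk_cpu_down();				/* CPU_DOWN */
}
```

In this model a CPU performs cluster-level teardown only if it is the
last man and entering the critical section succeeds. Otherwise it tears
down just itself and leaves the cluster state alone.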
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 514)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 515) Deep topologies:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 516)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 517) As currently described and implemented, the algorithm does not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 518) support CPU topologies involving more than two levels (i.e.,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 519) clusters of clusters are not supported). The algorithm could be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 520) extended by replicating the cluster-level states for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 521) additional topological levels, and modifying the transition
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 522) rules for the intermediate (non-outermost) cluster levels.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 523)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 524)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 525) Colophon
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 526) --------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 527)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 528) Originally created and documented by Dave Martin for Linaro Limited, in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 529) collaboration with Nicolas Pitre and Achin Gupta.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 530)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 531) Copyright (C) 2012-2013 Linaro Limited
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 532) Distributed under the terms of Version 2 of the GNU General Public
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 533) License, as defined in linux/COPYING.