^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2) MD Cluster
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) Cluster MD is a shared-device RAID for a cluster. It supports
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6) two levels: raid1 and raid10 (limited support).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9) 1. On-disk format
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10) =================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12) Separate write-intent-bitmaps are used for each cluster node.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) The bitmaps record all writes that may have been started on that node,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14) and may not yet have finished. The on-disk layout is::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16)     0                     4k                    8k                    12k
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17)     -------------------------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18)     | idle                | md super            | bm super [0] + bits |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19)     | bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20)     | bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21)     | bm bits [3, contd]  |                     |                     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22)
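As a rough illustration of the layout above, the following Python sketch computes where each node's bitmap superblock starts. It assumes, as in the diagram only, that every node's bitmap superblock plus bits occupies exactly two 4k blocks. A real array sizes the per-node area from the actual bitmap length. The names ``BLOCK``, ``BLOCKS_PER_NODE`` and ``bitmap_offset`` are hypothetical and do not come from the md code.

```python
BLOCK = 4096                       # 4k unit used in the diagram
MD_SUPER_OFFSET = 1 * BLOCK        # md superblock follows the idle block
FIRST_BITMAP_OFFSET = 2 * BLOCK    # "bm super [0]" starts at 8k
BLOCKS_PER_NODE = 2                # assumption taken from the diagram only


def bitmap_offset(slot):
    """Byte offset of the bitmap superblock for a 0-based bitmap slot."""
    if slot < 0:
        raise ValueError("slot must be non-negative")
    return FIRST_BITMAP_OFFSET + slot * BLOCKS_PER_NODE * BLOCK
```

Under these assumptions slot 0 starts at 8k, slot 1 at 16k and slot 3 at 32k, which matches the positions drawn in the diagram.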
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) During "normal" functioning we assume the filesystem ensures that only
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24) one node writes to any given block at a time, so a write request will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26) - set the appropriate bit (if not already set)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27) - commit the write to all mirrors
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28) - schedule the bit to be cleared after a timeout.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30) Reads are just handled normally. It is up to the filesystem to ensure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31) one node doesn't read from a location where another node (or the same
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32) node) is writing.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35) 2. DLM Locks for management
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) ===========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) There are three groups of locks for managing the device:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40) 2.1 Bitmap lock resource (bm_lockres)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41) -------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43) The bm_lockres protects individual node bitmaps. They are named in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) the form bitmap000 for node 1, bitmap001 for node 2 and so on. When a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45) node joins the cluster, it acquires the lock in PW mode and holds it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) in that mode for as long as the node is part of the cluster. The lock
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47) resource number is based on the slot number returned by the DLM
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48) subsystem. Since DLM starts node count from one and bitmap slots
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49) start from zero, one is subtracted from the DLM slot number to arrive
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) at the bitmap slot number.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51)
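The slot arithmetic described above can be sketched in a few lines of Python. This is only an illustration: ``bitmap_lock_name`` is a hypothetical helper, and the three-digit format simply follows the ``bitmap000`` / ``bitmap001`` examples given in the text.

```python
def bitmap_lock_name(dlm_slot):
    """Map a 1-based DLM slot number to a bitmap lock resource name."""
    if dlm_slot < 1:
        raise ValueError("DLM slot numbers start from one")
    bitmap_slot = dlm_slot - 1      # bitmap slots start from zero
    return "bitmap%03d" % bitmap_slot
```

For example, DLM slot 1 maps to ``bitmap000`` and DLM slot 2 maps to ``bitmap001``.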
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52) The LVB of the bitmap lock for a particular node records the range
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53) of sectors that are being re-synced by that node. No other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) node may write to those sectors. This is used when a new node
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) joins the cluster.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57) 2.2 Message passing locks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58) -------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60) Each node has to communicate with other nodes when starting or ending
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61) resync, and for metadata superblock updates. This communication is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) managed through three locks: "token", "message", and "ack", together
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63) with the Lock Value Block (LVB) of the "message" lock.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65) 2.3 new-device management
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66) -------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68) A single lock, "no-new-dev", is used to co-ordinate the addition of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69) new devices; this must be synchronized across the array.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70) Normally all nodes hold a concurrent-read lock on this resource.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72) 3. Communication
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73) ================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75) Messages can be broadcast to all nodes, and the sender waits for all
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76) other nodes to acknowledge the message before proceeding. Only one
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77) message can be processed at a time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79) 3.1 Message Types
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80) -----------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82) There are six types of messages which are passed:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84) 3.1.1 METADATA_UPDATED
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85) ^^^^^^^^^^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87) Informs other nodes that the metadata has
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88) been updated, and that each node must re-read the md superblock. This is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89) performed synchronously. It is primarily used to signal device
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90) failure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92) 3.1.2 RESYNCING
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93) ^^^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94) Informs other nodes that a resync is initiated or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95) ended so that each node may suspend or resume the region. Each
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96) RESYNCING message identifies a range of the devices that the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97) sending node is about to resync. This overrides any previous
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98) notification from that node: only one range can be resynced at a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99) time per-node.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) 3.1.3 NEWDISK
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) ^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) Informs other nodes that a device is being added to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) the array. The message contains an identifier for that device. See
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) below for further details.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) 3.1.4 REMOVE
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) ^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) A failed or spare device is being removed from the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) array. The slot-number of the device is included in the message.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) 3.1.5 RE_ADD
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) ^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) A failed device is being re-activated - the assumption
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) is that it has been determined to be working again.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) 3.1.6 BITMAP_NEEDS_SYNC
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) ^^^^^^^^^^^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) If a node is stopped locally but the bitmap
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) isn't clean, then another node is informed to take the ownership of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) resync.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) 3.2 Communication mechanism
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) ---------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) The DLM LVB is used to communicate between nodes of the cluster. There
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) are three resources used for the purpose:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) 3.2.1 token
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) ^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) The resource which protects the entire communication
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) system. The node having the token resource is allowed to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) communicate.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) 3.2.2 message
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) ^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) The lock resource which carries the data to communicate.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) 3.2.3 ack
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) ^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) The resource whose acquisition means that the message has been
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) acknowledged by all nodes in the cluster. The BAST of the resource
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146) is used to inform the receiving node that a node wants to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) communicate.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) The algorithm is:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151) 1. receive status - all nodes have concurrent-reader lock on "ack"::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153)     sender                         receiver                 receiver
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154)     "ack":CR                       "ack":CR                 "ack":CR
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) 2. sender gets EX on "token",
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) sender gets EX on "message"::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159)     sender                         receiver                 receiver
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160)     "token":EX                     "ack":CR                 "ack":CR
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161)     "message":EX
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162)     "ack":CR
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164) Sender checks that it still needs to send a message. Messages
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) received or other events that happened while waiting for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) "token" may have made this message inappropriate or redundant.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) 3. sender writes LVB
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) sender down-converts "message" from EX to CW
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) sender tries to get EX on "ack"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176)     [ wait until all receivers have *processed* the "message" ]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178)                                    [ triggered by bast of "ack" ]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179)                                    receiver gets CR on "message"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180)                                    receiver reads LVB
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181)                                    receiver processes the message
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182)                                    [ wait finish ]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183)                                    receiver releases "ack"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184)                                    receiver tries to get PR on "message"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186)     sender                         receiver                 receiver
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187)     "token":EX                     "message":CR             "message":CR
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188)     "message":CW
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189)     "ack":EX
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191) 4. triggered by grant of EX on "ack" (indicating all receivers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192) have processed message)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194) sender down-converts "ack" from EX to CR
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196) sender releases "message"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198) sender releases "token"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202)                                    receiver upconverts to PR on "message"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203)                                    receiver gets CR on "ack"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204)                                    receiver releases "message"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 206)     sender                         receiver                 receiver
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 207)     "ack":CR                       "ack":CR                 "ack":CR
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 208)
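The waits in the algorithm above follow from the usual DLM lock mode compatibility rules. The Python sketch below encodes the standard six-mode compatibility table and checks a few of the situations that arise in the handshake. It is a model for reasoning only, not the kernel DLM API; ``COMPATIBLE`` and ``can_grant`` are hypothetical names.

```python
# For each requested mode, the set of modes other holders may have
# on the same resource without blocking the request.
COMPATIBLE = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},
    "CR": {"NL", "CR", "CW", "PR", "PW"},
    "CW": {"NL", "CR", "CW"},
    "PR": {"NL", "CR", "PR"},
    "PW": {"NL", "CR"},
    "EX": {"NL"},
}


def can_grant(requested, held_by_others):
    """Return True if `requested` is compatible with every mode in `held_by_others`."""
    allowed = COMPATIBLE[requested]
    return all(mode in allowed for mode in held_by_others)
```

In this model the sender's EX request on "ack" cannot be granted while any receiver still holds CR, which is why the grant indicates that all receivers have processed the message. A receiver can take CR on "message" while the sender holds CW, but its PR request waits until the sender releases "message".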
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 209)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 210) 4. Handling Failures
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 211) ====================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 212)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 213) 4.1 Node Failure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 214) ----------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 215)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 216) When a node fails, the DLM informs the cluster of the failed node's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 217) slot number. A node that is notified starts a cluster recovery thread.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 218) The cluster recovery thread:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 219)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 220) - acquires the bitmap<number> lock of the failed node
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 221) - opens the bitmap
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 222) - reads the bitmap of the failed node
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 223) - copies the set bits to the bitmap of the local node
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 224) - cleans the bitmap of the failed node
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 225) - releases bitmap<number> lock of the failed node
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 226) - initiates resync of the bitmap on the current node
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 227)   md_check_recovery is invoked within recover_bitmaps; it then calls
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 228)   metadata_update_start/finish, which locks the communication
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 229)   channel via lock_comm. This means that while one node is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 230)   resyncing, it blocks all other nodes from writing anywhere
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 231)   on the array.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 232)
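The bitmap hand-over performed by the recovery thread can be pictured with a small Python model. It treats each bitmap as a set of dirty chunk indices and leaves out the DLM locking and the on-disk I/O entirely. ``recover_bitmap`` is a hypothetical name and is not the kernel's recover_bitmaps.

```python
def recover_bitmap(local_bits, failed_bits):
    """Merge a failed node's dirty bits into the local bitmap.

    Both arguments are sets of chunk indices and are modified in place:
    the local set gains every bit of the failed node, and the failed
    node's set is cleaned afterwards.
    """
    local_bits.update(failed_bits)   # copy the set bits to the local node
    failed_bits.clear()              # clean the bitmap of the failed node
```

After the merge the local node's bitmap covers every region the failed node may have been writing, so a regular resync driven by the local bitmap repairs those regions.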
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 233) The resync process is the regular md resync. However, in a clustered
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 234) environment when a resync is performed, it needs to tell other nodes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 235) of the areas which are suspended. Before a resync starts, the node
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 236) sends out RESYNCING with the (lo,hi) range of the area which needs to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 237) be suspended. Each node maintains a suspend_list, which contains the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 238) list of ranges which are currently suspended. On receiving RESYNCING,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 239) the node adds the range to the suspend_list. Similarly, when the node
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 240) performing resync finishes, it sends RESYNCING with an empty range to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 241) other nodes and other nodes remove the corresponding entry from the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 242) suspend_list.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 243)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 244) A helper function, ->area_resyncing() can be used to check if a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 245) particular I/O range should be suspended or not.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 246)
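A minimal Python sketch of this bookkeeping follows. It keeps one range per sending node, as a later RESYNCING message overrides the earlier one from the same node. It assumes that an empty range is encoded as ``hi == 0`` and that ranges are half-open; both are assumptions made for this illustration. ``SuspendList`` and its methods are hypothetical names, and only the range check is modelled, not the special handling of READ requests during node recovery.

```python
class SuspendList:
    """Per-node record of the ranges that remote nodes are resyncing."""

    def __init__(self):
        self._ranges = {}            # sender slot -> (lo, hi)

    def on_resyncing(self, slot, lo, hi):
        """Handle a RESYNCING message from the node in `slot`."""
        if hi == 0:                  # assumed encoding of an empty range
            self._ranges.pop(slot, None)
        else:
            self._ranges[slot] = (lo, hi)   # overrides any earlier range

    def area_resyncing(self, lo, hi):
        """Return True if [lo, hi) overlaps any suspended range."""
        return any(lo < s_hi and hi > s_lo
                   for s_lo, s_hi in self._ranges.values())
```

A caller would consult ``area_resyncing()`` before issuing I/O to a range and hold the I/O back while it returns True.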
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 247) 4.2 Device Failure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 248) ------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 249)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 250) Device failures are handled and communicated with the metadata update
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 251) routine. When a node detects a device failure it does not allow
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 252) any further writes to that device until the failure has been
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 253) acknowledged by all other nodes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 254)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 255) 5. Adding a new Device
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 256) ======================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 257)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 258) For adding a new device, it is necessary that all nodes "see" the new
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 259) device to be added. For this, the following algorithm is used:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 260)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 261) 1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 262) ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 263) 2. Node 1 sends a NEWDISK message with uuid and slot number
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 264) 3. Other nodes issue kobject_uevent_env with uuid and slot number
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 265) (Steps 4,5 could be a udev rule)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 266) 4. In userspace, the node searches for the disk, perhaps
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 267) using blkid -t SUB_UUID=""
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 268) 5. Other nodes issue either of the following depending on whether
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 269) the disk was found:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 270) ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 271) disc.number set to slot number)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 272) ioctl(CLUSTERED_DISK_NACK)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 273) 6. Other nodes drop lock on "no-new-dev" (CR) if device is found
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 274) 7. Node 1 attempts EX lock on "no-new-dev"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 275) 8. If node 1 gets the lock, it sends METADATA_UPDATED after
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 276) unmarking the disk as SpareLocal
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 277) 9. If node 1 does not get the "no-new-dev" lock, it fails the operation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 278) and sends METADATA_UPDATED.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 279) 10. Other nodes get the information whether a disk is added or not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 280) by the following METADATA_UPDATED.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 281)
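The outcome of steps 6 to 9 depends only on whether every other node found the device and dropped its CR lock. The Python sketch below models that decision. ``add_disk_outcome`` is a hypothetical helper; the real implementation works through DLM lock requests rather than a list of booleans.

```python
def add_disk_outcome(found_on_other_nodes):
    """Decide whether node 1 can complete the add.

    `found_on_other_nodes` holds one boolean per other node: True if
    that node found the new device and dropped its CR lock on
    "no-new-dev", False if it kept holding the lock.
    """
    remaining_cr_holders = sum(1 for found in found_on_other_nodes if not found)
    # EX on "no-new-dev" is only available once no other node holds CR.
    return "added" if remaining_cr_holders == 0 else "failed"
```

In both cases node 1 follows up with METADATA_UPDATED, so the other nodes learn the result from the updated superblock.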
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 282) 6. Module interface
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 283) ===================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 284)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 285) There are 17 call-backs which the md core can make to the cluster
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 286) module. Understanding these can give a good overview of the whole
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 287) process.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 288)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 289) 6.1 join(nodes) and leave()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 290) ---------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 291)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 292) These are called when an array is started with a clustered bitmap,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 293) and when the array is stopped. join() ensures the cluster is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 294) available and initializes the various resources.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 295) Only the first 'nodes' nodes in the cluster can use the array.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 296)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 297) 6.2 slot_number()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 298) -----------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 299)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 300) Reports the slot number advised by the cluster infrastructure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 301) Range is from 0 to nodes-1.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 302)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 303) 6.3 resync_info_update()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 304) ------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 305)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 306) This updates the resync range that is stored in the bitmap lock.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 307) The starting point is updated as the resync progresses. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 308) end point is always the end of the array.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 309) It does *not* send a RESYNCING message.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 310)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 311) 6.4 resync_start(), resync_finish()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 312) -----------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 313)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 314) These are called when resync/recovery/reshape starts or stops.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 315) They update the resyncing range in the bitmap lock and also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 316) send a RESYNCING message. resync_start reports the whole
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 317) array as resyncing, resync_finish reports none of it.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 318)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 319) resync_finish() also sends a BITMAP_NEEDS_SYNC message which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 320) allows some other node to take over.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 321)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 322) 6.5 metadata_update_start(), metadata_update_finish(), metadata_update_cancel()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 323) -------------------------------------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 324)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 325) metadata_update_start is used to get exclusive access to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 326) the metadata. If a change is still needed once that access is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 327) gained, metadata_update_finish() will send a METADATA_UPDATED
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 328) message to all other nodes, otherwise metadata_update_cancel()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 329) can be used to release the lock.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 330)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 331) 6.6 area_resyncing()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 332) --------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 333)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 334) This combines two elements of functionality.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 335)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 336) Firstly, it will check if any node is currently resyncing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 337) anything in a given range of sectors. If any resync is found,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 338) then the caller will avoid writing or read-balancing in that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 339) range.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 340)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 341) Secondly, while node recovery is happening it reports that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 342) all areas are resyncing for READ requests. This avoids races
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 343) between the cluster-filesystem and the cluster-RAID handling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 344) a node failure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 345)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 346) 6.7 add_new_disk_start(), add_new_disk_finish(), new_disk_ack()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 347) ---------------------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 348)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 349) These are used to manage the new-disk protocol described above.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 350) When a new device is added, add_new_disk_start() is called before
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 351) it is bound to the array and, if that succeeds, add_new_disk_finish()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 352) is called once the device is fully added.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 353)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 354) When a device is added in acknowledgement to a previous
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 355) request, or when the device is declared "unavailable",
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 356) new_disk_ack() is called.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 357)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 358) 6.8 remove_disk()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 359) -----------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 360)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 361) This is called when a spare or failed device is removed from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 362) the array. It causes a REMOVE message to be sent to other nodes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 363)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 364) 6.9 gather_bitmaps()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 365) --------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 366)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 367) This sends a RE_ADD message to all other nodes and then
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 368) gathers bitmap information from all bitmaps. This combined
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 369) bitmap is then used to recover the re-added device.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 370)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 371) 6.10 lock_all_bitmaps() and unlock_all_bitmaps()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 372) ------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 373)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 374) These are called when the bitmap is changed to none. If a node plans
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 375) to clear the cluster raid's bitmap, it needs to make sure no other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 376) nodes are using the raid. This is achieved by locking all bitmap
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 377) locks within the cluster; the locks are released again
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 378) afterwards.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 379)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 380) 7. Unsupported features
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 381) =======================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 382)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 383) There are some things which are not yet supported by cluster MD.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 384)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 385) - change array_sectors.