^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) =========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2) dm-switch
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) =========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) The device-mapper switch target creates a device that supports an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6) arbitrary mapping of fixed-size regions of I/O across a fixed set of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) paths. The path used for any specific region can be switched
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8) dynamically by sending the target a message.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10) It maps I/O to underlying block devices efficiently when there is a large
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11) number of fixed-size address regions but no simple pattern that would
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12) allow a compact representation of the mapping, such as the striping
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) pattern used by dm-stripe.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15) Background
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16) ----------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18) Dell EqualLogic and some other iSCSI storage arrays use a distributed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) frameless architecture. In this architecture, the storage group
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20) consists of a number of distinct storage arrays ("members") each having
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21) independent controllers, disk storage and network adapters. When a LUN
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22) is created it is spread across multiple members. The details of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) spreading are hidden from initiators connected to this storage system.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24) The storage group exposes a single target discovery portal, no matter
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25) how many members are being used. When iSCSI sessions are created, each
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26) session is connected to an Ethernet port on a single member. Data to a LUN
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27) can be sent on any iSCSI session, and if the blocks being accessed are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28) stored on another member, the I/O will be forwarded as required. This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29) forwarding is invisible to the initiator. The storage layout is also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30) dynamic, and the blocks stored on disk may be moved from member to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31) member as needed to balance the load.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) This architecture simplifies the management and configuration of both
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34) the storage group and initiators. In a multipathing configuration, it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35) is possible to set up multiple iSCSI sessions to use multiple network
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) interfaces on both the host and target to take advantage of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37) increased network bandwidth. An initiator could use a simple round
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) robin algorithm to send I/O across all paths and let the storage array
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39) members forward it as necessary, but there is a performance advantage to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40) sending data directly to the correct member.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42) A device-mapper table already lets you map different regions of a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43) device onto different targets. However, in this architecture the LUN is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) spread with an address region size on the order of tens of megabytes,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45) which means the resulting table could have more than a million entries
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) and consume far too much memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48) Using this device-mapper switch target we can now build a two-layer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49) device hierarchy:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51) Upper Tier - Determine which array member the I/O should be sent to.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52) Lower Tier - Load balance amongst paths to a particular member.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) The lower tier consists of a single dm multipath device for each member.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) Each of these multipath devices contains the set of paths directly to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56) the array member in one priority group, and leverages existing path
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57) selectors to load balance amongst these paths. We also build a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58) non-preferred priority group containing paths to other array members for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59) failover reasons.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61) The upper tier consists of a single dm-switch device. This device uses
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) a bitmap to look up the location of the I/O and choose the appropriate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63) lower tier device to route the I/O. By using a bitmap we are able to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64) use 4 bits for each address range in a 16 member group (which is very
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65) large for us). This is a much denser representation than the dm table
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66) b-tree can achieve.
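
The density argument and the region lookup described above can be sketched
in user space. This is only an illustrative model, not the kernel
implementation: the function names are hypothetical, and the 16 TiB device
with 16 MiB regions in the usage note is an assumed example, not a figure
from this document.

```python
SECTOR_SIZE = 512  # bytes per sector, the unit used throughout this document


def bits_per_region(num_paths):
    """Smallest number of bits that can encode path numbers 0..num_paths-1."""
    return max(1, (num_paths - 1).bit_length())


def table_bytes(device_sectors, region_size, num_paths):
    """Approximate size in bytes of a packed region table (model only)."""
    regions = -(-device_sectors // region_size)  # round up to whole regions
    return (regions * bits_per_region(num_paths) + 7) // 8


def path_for_sector(sector, region_size, region_table):
    """Look up the path number for a sector: one division, one table read."""
    return region_table[sector // region_size]
```

Under these assumptions, a 16 TiB device (2**35 sectors) split into 16 MiB
regions (32768 sectors each) has 2**20 regions, and
``table_bytes(2**35, 32768, 16)`` gives 524288 bytes (512 KiB) for a
16-member group.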
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68) Construction Parameters
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69) =======================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71) <num_paths> <region_size> <num_optional_args> [<optional_args>...] [<dev_path> <offset>]+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72) <num_paths>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73) The number of paths across which to distribute the I/O.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75) <region_size>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76) The number of 512-byte sectors in a region. Each region can be redirected
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77) to any of the available paths.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79) <num_optional_args>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80) The number of optional arguments. Currently, no optional arguments
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81) are supported and so this must be zero.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83) <dev_path>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84) The block device that represents a specific path to the device.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86) <offset>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87) The offset of the start of data on the specific <dev_path> (in units
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88) of 512-byte sectors). This number is added to the sector number when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89) forwarding the request to the specific path. Typically it is zero.
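
As a rough illustration of how these parameters fit together, the sketch
below assembles a complete table line that could be passed to
``dmsetup create --table``. The helper name ``switch_table`` is hypothetical,
and the leading ``0 <length>`` fields are the usual device-mapper start
sector and length rather than part of the switch parameters themselves.

```python
def switch_table(length_sectors, region_size, devices):
    """Build a dm table line for the switch target.

    devices is a list of (dev_path, offset) tuples, one per path.
    """
    if not devices:
        raise ValueError("at least one path is required")
    # <start> <length> switch <num_paths> <region_size> <num_optional_args>
    fields = [0, length_sectors, "switch", len(devices), region_size, 0]
    for dev_path, offset in devices:
        fields += [dev_path, offset]
    return " ".join(str(field) for field in fields)
```

For example, with an assumed length of 2097152 sectors,
``switch_table(2097152, 128, [("/dev/vg1/switch0", 0), ("/dev/vg1/switch1", 0)])``
returns ``0 2097152 switch 2 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0``.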
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91) Messages
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92) ========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94) set_region_mappings <index>:<path_nr> [<index>]:<path_nr> [<index>]:<path_nr>...
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96) Modify the region table by specifying which regions are redirected to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97) which paths.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99) <index>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) The region number (region size was specified in the construction parameters).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) If <index> is omitted, the next region (previous index + 1) is used.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) Expressed in hexadecimal (WITHOUT any prefix like 0x).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) <path_nr>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) The path number in the range 0 ... (<num_paths> - 1).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) Expressed in hexadecimal (WITHOUT any prefix like 0x).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) R<n>,<m>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) This parameter allows repetitive patterns to be loaded quickly. <n> and <m>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) are hexadecimal numbers. The last <n> mappings are repeated in the next <m>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) slots.
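
The argument syntax above can be modelled in user space to check what a
message would do before sending it. The following is a minimal sketch, not
the kernel parser: ``apply_region_mappings`` is a hypothetical helper, it
requires the first mapping to carry an explicit index, and it only knows
about regions set in the same message or passed in through ``table``.

```python
def apply_region_mappings(args, table=None):
    """Return a {region: path_nr} dict after applying the message arguments.

    args is the text that follows "set_region_mappings"; all numbers are hex.
    """
    table = {} if table is None else dict(table)
    index = None
    for token in args.split():
        if token.startswith("R"):
            # R<n>,<m>: repeat the last n mappings in the next m slots.
            n_str, m_str = token[1:].split(",")
            n, m = int(n_str, 16), int(m_str, 16)
            if index is None or n < 1:
                raise ValueError("R<n>,<m> needs preceding mappings and n >= 1")
            for _ in range(m):
                index += 1
                table[index] = table[index - n]  # KeyError if not in the model
            continue
        index_str, path_str = token.split(":")
        if index_str:
            index = int(index_str, 16)
        elif index is None:
            raise ValueError("the first mapping needs an explicit index")
        else:
            index += 1  # omitted index means previous index + 1
        table[index] = int(path_str, 16)
    return table
```

For instance, ``apply_region_mappings("0:0 :1 :2")`` returns
``{0: 0, 1: 1, 2: 2}``.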
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) Status
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) ======
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) No status line is reported.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) Example
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) =======
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) Assume that you have volumes vg1/switch0, vg1/switch1 and vg1/switch2, all
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) of the same size.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) Create a switch device with a 64 KiB region size (128 sectors of 512 bytes)::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) dmsetup create switch --table "0 `blockdev --getsz /dev/vg1/switch0`
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) Set mappings for the first 7 entries to point to devices switch0, switch1,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) switch2, switch0, switch1, switch2, switch1::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) dmsetup message switch 0 set_region_mappings 0:0 :1 :2 :0 :1 :2 :1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) Set a repetitive mapping. This command::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) repeats the last 2 mappings in the next 0x10 (16) slots, so it is equivalent to::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) :1 :2 :1 :2 :1 :2 :1 :2 :1 :2
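
Messages like the ones in this example can also be generated from a list of
path numbers. The sketch below is one possible way to do that, with a
hypothetical helper name; it writes the index only on the first mapping and
formats every number in hexadecimal without a prefix, as the message syntax
requires.

```python
def region_mappings_message(start_region, paths):
    """Build a set_region_mappings message for consecutive regions.

    paths[i] is the path number for region start_region + i.
    """
    if not paths:
        raise ValueError("at least one mapping is required")
    # Only the first mapping carries an explicit (hexadecimal) region index.
    tokens = [f"{start_region:x}:{paths[0]:x}"]
    # Later mappings omit the index, meaning "previous index + 1".
    tokens += [f":{path:x}" for path in paths[1:]]
    return "set_region_mappings " + " ".join(tokens)
```

Calling ``region_mappings_message(0, [0, 1, 2, 0, 1, 2, 1])`` returns
``set_region_mappings 0:0 :1 :2 :0 :1 :2 :1``, the message used in the first
example above.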