^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) =============================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2) BTT - Block Translation Table
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) =============================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6) 1. Introduction
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) ===============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9) Persistent memory based storage is able to perform IO at byte (or more
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10) accurately, cache line) granularity. However, we often want to expose such
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11) storage as traditional block devices. The block drivers for persistent memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12) will do exactly this. However, they do not provide any atomicity guarantees.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) Traditional SSDs typically provide protection against torn sectors in hardware,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14) using stored energy in capacitors to complete in-flight block writes, or perhaps
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15) in firmware. We don't have this luxury with persistent memory - if a write is in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16) progress, and we experience a power failure, the block will contain a mix of old
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17) and new data. Applications may not be prepared to handle such a scenario.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) The Block Translation Table (BTT) provides atomic sector update semantics for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20) persistent memory devices, so that applications that rely on sector writes not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21) being torn can continue to do so. The BTT manifests itself as a stacked block
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22) device, and reserves a portion of the underlying storage for its metadata. At
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) its heart is an indirection table that remaps all the blocks on the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24) volume. It can be thought of as an extremely simple file system that only
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25) provides atomic sector updates.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28) 2. Static Layout
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29) ================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31) The underlying storage on which a BTT can be laid out is not limited in any way.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32) The BTT, however, splits the available space into chunks of up to 512 GiB,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) called "Arenas".
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35) Each arena follows the same layout for its metadata, and all references in an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) arena are internal to it (with the exception of one field that points to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37) next arena). The following depicts the "On-disk" metadata layout::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40)     Backing Store     +------->  Arena
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41)   +---------------+   |   +------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42)   |               |   |   | Arena info block |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43)   |    Arena 0    +---+   |       4K         |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44)   |     512G      |       +------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45)   |               |       |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46)   +---------------+       |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47)   |               |       |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48)   |    Arena 1    |       |   Data Blocks    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49)   |     512G      |       |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50)   |               |       |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51)   +---------------+       |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52)   |       .       |       |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53)   |       .       |       |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54)   |       .       |       |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55)   |               |       |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56)   |               |       |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57)   +---------------+       +------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58)                           |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59)                           |     BTT Map      |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60)                           |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61)                           |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62)                           +------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63)                           |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64)                           |     BTT Flog     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65)                           |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66)                           +------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67)                           | Info block copy  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68)                           |       4K         |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69)                           +------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72) 3. Theory of Operation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73) ======================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76) a. The BTT Map
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77) --------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79) The map is a simple lookup/indirection table that maps an LBA to an internal
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80) block. Each map entry is 32 bits. The two most significant bits are special
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81) flags, and the remaining form the internal block number.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83) ======== =============================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84) Bit      Description
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85) ======== =============================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86) 31 - 30  Error and Zero flags - Used in the following way::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88)            == ==  ====================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89)            31 30  Description
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90)            == ==  ====================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91)            0  0   Initial state. Reads return zeroes; Premap = Postmap
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92)            0  1   Zero state: Reads return zeroes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93)            1  0   Error state: Reads fail; Writes clear 'E' bit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94)            1  1   Normal Block - has valid postmap
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95)            == ==  ====================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97) 29 - 0   Mappings to internal 'postmap' blocks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98) ======== =============================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99)
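As an illustration of the entry layout in the table above, the following is a
minimal Python sketch of decoding one 32-bit map entry. The names FLAG_SHIFT,
LBA_MASK, STATES and decode_map_entry are hypothetical and chosen for this
example only; the flag encoding simply follows the table as written.

```python
# Layout taken from the table above: bits 31-30 are flags, bits 29-0 hold
# the postmap block number. All names here are illustrative assumptions.
FLAG_SHIFT = 30
LBA_MASK = (1 << FLAG_SHIFT) - 1  # low 30 bits

STATES = {
    0b00: "initial",  # reads return zeroes; premap = postmap
    0b01: "zero",     # reads return zeroes
    0b10: "error",    # reads fail
    0b11: "normal",   # entry holds a valid postmap ABA
}


def decode_map_entry(raw, premap_aba):
    """Split a 32-bit map entry into (state, postmap_aba)."""
    state = STATES[(raw >> FLAG_SHIFT) & 0b11]
    # In the initial state the table defines the mapping as the identity.
    postmap = premap_aba if state == "initial" else raw & LBA_MASK
    return state, postmap
```

For instance, decode_map_entry(0xC0000040, 5) yields the normal state with
postmap block 64, while an all-zero entry maps premap block 5 to itself.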
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) Some of the terminology that will be subsequently used:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) ============    ================================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) External LBA    LBA as made visible to upper layers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) ABA             Arena Block Address - Block offset/number within an arena
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) Premap ABA      The block offset into an arena, which was decided upon by range
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107)                 checking the External LBA
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) Postmap ABA     The block number in the "Data Blocks" area obtained after
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109)                 indirection from the map
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) nfree           The number of free blocks that are maintained at any given time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111)                 This is the number of concurrent writes that can happen to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112)                 arena.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) ============    ================================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) For example, after adding a BTT, we surface a disk of 1024G. We get a read for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) the external LBA at 768G. This falls into the second arena, and of the 512G
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) worth of blocks that this arena contributes, this block is at 256G. Thus, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) premap ABA is 256G. We now refer to the map, and find that the mapping for block
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) 'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121)
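The range check in the example above can be sketched as follows. This is an
illustration only: the function name lba_to_arena and the arena_nlba list are
assumptions, a 4096-byte sector is assumed for the usage note, and, like the
prose example, the sketch ignores the space each arena loses to metadata.

```python
def lba_to_arena(external_lba, arena_nlba):
    """Return (arena_index, premap_aba) for an external LBA.

    arena_nlba lists how many external blocks each arena contributes
    (a hypothetical input; a real arena exposes less than its raw size).
    """
    for index, nlba in enumerate(arena_nlba):
        if external_lba < nlba:
            return index, external_lba
        # Not in this arena: skip past it and keep checking.
        external_lba -= nlba
    raise ValueError("LBA beyond end of device")
```

With two arenas of 512G each and 4096-byte sectors, the block at 768G is
lba_to_arena(768 * 2**30 // 4096, [512 * 2**30 // 4096] * 2), which gives
arena index 1 and a premap ABA equal to the block at 256G.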
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) b. The BTT Flog
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) ---------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) The BTT provides sector atomicity by making every write an "allocating write",
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) i.e. every write goes to a "free" block. A running list of free blocks is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) maintained in the form of the BTT flog. 'Flog' is a combination of the words
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) "free list" and "log". The flog contains 'nfree' entries, and an entry contains:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) ========  =====================================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) lba       The premap ABA that is being written to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) old_map   The old postmap ABA - after 'this' write completes, this will be a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134)           free block.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) new_map   The new postmap ABA. The map will be updated to reflect this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136)           lba->postmap_aba mapping, but we log it here in case we have to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137)           recover.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) seq       Sequence number to mark which of the 2 sections of this flog entry is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139)           valid/newest. It cycles between 01->10->11->01 (binary) under normal
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140)           operation, with 00 indicating an uninitialized state.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) lba'      alternate lba entry
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) old_map'  alternate old postmap entry
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) new_map'  alternate new postmap entry
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) seq'      alternate sequence number.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) ========  =====================================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) Each of the above fields is 32-bit, making one entry 32 bytes. Entries are also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) padded to 64 bytes to avoid cache line sharing or aliasing. Flog updates are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) done such that for any entry being written, it:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150) a. overwrites the 'old' section in the entry based on sequence numbers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151) b. writes the 'new' section such that the sequence number is written last.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152)
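The sequence-number cycle and the two-step update rule can be sketched in
Python as below. This is a simplified in-memory model, not the on-media
format: the dict-based entry and the names next_seq, newest_section and
flog_update are assumptions made for illustration.

```python
def next_seq(seq):
    """Advance the 2-bit sequence number: 01 -> 10 -> 11 -> 01."""
    return 1 if seq == 3 else seq + 1


def newest_section(seq_a, seq_b):
    """Return 0 or 1 for the section holding the most recent entry."""
    if seq_a == seq_b:
        raise ValueError("both sections carry the same sequence number")
    if seq_a == 0:   # 00 marks an uninitialized section
        return 1
    if seq_b == 0:
        return 0
    # The newer section is the one whose number follows the other's.
    return 0 if next_seq(seq_b) == seq_a else 1


def flog_update(entry, lba, old_map, new_map):
    """entry is a list of two dicts with keys lba, old_map, new_map, seq."""
    cur = newest_section(entry[0]["seq"], entry[1]["seq"])
    target = entry[1 - cur]  # overwrite the older section
    target["lba"], target["old_map"], target["new_map"] = lba, old_map, new_map
    # The sequence number goes last: until it is stored, the other
    # section is still the valid one.
    target["seq"] = next_seq(entry[cur]["seq"])
```

Writing the sequence number last is what lets recovery ignore a half-written
section: its stale sequence number still marks it as the older of the two.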
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) c. The concept of lanes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) -----------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) While 'nfree' describes the number of IOs an arena can process concurrently,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158) 'nlanes' is the number of concurrent IOs the BTT device as a whole can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159) process::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161)   nlanes = min(nfree, num_cpus)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) A lane number is obtained at the start of any IO, and is used for indexing into
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164) all the on-disk and in-memory data structures for the duration of the IO. If
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) there are more CPUs than the max number of available lanes, then lanes are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) protected by spinlocks.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167)
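A minimal sketch of lane acquisition under these rules follows. Several
details are assumptions for illustration: the Lanes class, the choice of
"cpu modulo nlanes" to pick a lane, and the use of threading.Lock as a
stand-in for a spinlock.

```python
import threading


class Lanes:
    """Hypothetical lane allocator; not the driver's actual interface."""

    def __init__(self, nfree, num_cpus):
        self.nlanes = min(nfree, num_cpus)
        # Locks only matter when several CPUs can land on one lane.
        self.shared = num_cpus > self.nlanes
        self.locks = [threading.Lock() for _ in range(self.nlanes)]

    def acquire(self, cpu):
        lane = cpu % self.nlanes  # assumed CPU-to-lane mapping
        if self.shared:
            self.locks[lane].acquire()
        return lane

    def release(self, lane):
        if self.shared:
            self.locks[lane].release()
```

When there are at least as many lanes as CPUs, each CPU gets a private lane
and the lock is skipped entirely.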
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) d. In-memory data structure: Read Tracking Table (RTT)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) ------------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) Consider a case where we have two threads, one doing reads and the other,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) writes. We can hit a condition where the writer thread grabs a free block to do
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174) a new IO, but the (slow) reader thread is still reading from it. In other words,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) the reader consulted a map entry, and started reading the corresponding block. A
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) writer started writing to the same external LBA, and finished the write updating
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177) the map for that external LBA to point to its new postmap ABA. At this point the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178) internal, postmap block that the reader is (still) reading has been inserted
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179) into the list of free blocks. If another write comes in for the same LBA, it can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180) grab this free block, and start writing to it, causing the reader to read
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181) incorrect data. To prevent this, we introduce the RTT.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183) The RTT is a simple, per arena table with 'nfree' entries. Every reader inserts
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184) into rtt[lane_number], the postmap ABA it is reading, and clears it after the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185) read is complete. Every writer thread, after grabbing a free block, checks the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186) RTT for its presence. If the postmap free block is in the RTT, it waits till the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187) reader clears the RTT entry, and only then starts writing to it.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188)
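The reader and writer sides of the RTT can be sketched as below. This is a
simplified model under stated assumptions: the class and method names are
hypothetical, None stands in for an empty slot, and the busy-wait uses
time.sleep(0) only to yield to other Python threads.

```python
import time

RTT_EMPTY = None  # sentinel for "this lane is not reading anything"


class ReadTrackingTable:
    def __init__(self, nfree):
        self.slots = [RTT_EMPTY] * nfree

    def reader_enter(self, lane, postmap_aba):
        # Publish the block before reading from it.
        self.slots[lane] = postmap_aba

    def reader_exit(self, lane):
        self.slots[lane] = RTT_EMPTY

    def in_use(self, postmap_aba):
        return postmap_aba in self.slots

    def writer_wait(self, free_block):
        # Spin until no reader has the free block registered.
        while self.in_use(free_block):
            time.sleep(0)
```

A writer calls writer_wait() on the free block it picked up and only then
starts writing data into it.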
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190) e. In-memory data structure: map locks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191) --------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193) Consider a case where two writer threads are writing to the same LBA. There can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194) be a race in the following sequence of steps::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196)   free[lane] = map[premap_aba]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197)   map[premap_aba] = postmap_aba
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199) Both threads can update their respective free[lane] with the same old, freed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200) postmap_aba. This has made the layout inconsistent by losing a free entry, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201) at the same time, duplicating another free entry for two lanes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203) To solve this, we could have a single map lock (per arena) that has to be taken
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204) before performing the above sequence, but we feel that could be too contentious.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205) Instead we use an array of (nfree) map_locks that is indexed by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 206) (premap_aba modulo nfree).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 207)
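The lock array can be sketched as follows. The MapLockedArena class, its
identity-mapped starting state, and the use of threading.Lock are all
assumptions made for the example.

```python
import threading


class MapLockedArena:
    """Toy arena that guards the read-then-update of a map entry."""

    def __init__(self, nblocks, nfree):
        self.nfree = nfree
        self.map = list(range(nblocks))  # identity mapping to start
        # One spare block per lane, placed after the mapped blocks.
        self.free = list(range(nblocks, nblocks + nfree))
        self.map_locks = [threading.Lock() for _ in range(nfree)]

    def swap_mapping(self, lane, premap_aba, postmap_aba):
        """Retire the old mapping into free[lane] and install the new one."""
        with self.map_locks[premap_aba % self.nfree]:
            self.free[lane] = self.map[premap_aba]
            self.map[premap_aba] = postmap_aba
```

Two writers to the same premap ABA always hash to the same lock, so the
second one sees the mapping installed by the first and no block is lost or
duplicated. Writers to different ABAs usually take different locks.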
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 208)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 209) f. Reconstruction from the Flog
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 210) -------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 211)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 212) On startup, we analyze the BTT flog to create our list of free blocks. We walk
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 213) through all the entries, and for each lane, of the set of two possible
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 214) 'sections', we always look at the most recent one only (based on the sequence
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 215) number). The reconstruction rules/steps are simple:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 216)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 217) - Read map[log_entry.lba].
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 218) - If log_entry.new matches the map entry, then log_entry.old is free.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 219) - If log_entry.new does not match the map entry, then log_entry.new is free.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 220) (This case can only be caused by power-fails/unsafe shutdowns)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 221)
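The reconstruction rules above can be sketched as below. The function name
and the dict-based flog entries are hypothetical, each entry is assumed to
already be the newest section for its lane, and map entries are compared as
plain block numbers with the flag bits stripped.

```python
def rebuild_free_list(flog, map_table):
    """Return the free block for every lane.

    flog: one dict per lane with keys lba, old_map, new_map.
    map_table: list indexed by premap ABA, holding postmap ABAs.
    """
    free = []
    for entry in flog:
        if map_table[entry["lba"]] == entry["new_map"]:
            # The map update landed, so the previous block was retired.
            free.append(entry["old_map"])
        else:
            # Interrupted before the map update: the new block was
            # never published and is still free.
            free.append(entry["new_map"])
    return free
```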
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 222)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 223) g. Summarizing - Read and Write flows
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 224) -------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 225)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 226) Read:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 227)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 228) 1. Convert external LBA to arena number + pre-map ABA
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 229) 2. Get a lane (and take lane_lock)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 230) 3. Read map to get the entry for this pre-map ABA
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 231) 4. Enter post-map ABA into RTT[lane]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 232) 5. If the TRIM (Zero) flag is set in map, return zeroes, and end IO (go to step 8)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 233) 6. If the ERROR flag is set in map, end IO with EIO (go to step 8)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 234) 7. Read data from this block
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 235) 8. Remove post-map ABA entry from RTT[lane]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 236) 9. Release lane (and lane_lock)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 237)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 238) Write:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 239)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 240) 1. Convert external LBA to Arena number + pre-map ABA
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 241) 2. Get a lane (and take lane_lock)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 242) 3. Use lane to index into in-memory free list and obtain a new block, next flog
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 243) index, next sequence number
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 244) 4. Scan the RTT to check if free block is present, and spin/wait if it is.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 245) 5. Write data to this free block
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 246) 6. Read map to get the existing post-map ABA entry for this pre-map ABA
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 247) 7. Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 248) 8. Write new post-map ABA into map.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 249) 9. Write old post-map entry into the free list
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 250) 10. Calculate next sequence number and write into the free list entry
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 251) 11. Release lane (and lane_lock)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 252)
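The two flows can be tied together in a toy model. This is a sketch under
heavy simplifications, all of them assumptions: a single arena held in Python
lists, no lane locks, no TRIM or ERROR handling, a flog that keeps only the
latest tuple per lane, and None as the marker for an initial-state map entry.
Step numbers in the comments refer to the lists above.

```python
class ToyArena:
    """Single-arena, single-threaded model of the read and write flows."""

    def __init__(self, nblocks, nfree, block_size=8):
        self.block_size = block_size
        self.data = [bytes(block_size)] * (nblocks + nfree)
        self.map = [None] * nblocks          # None models the initial state
        # Each lane starts with one spare block beyond the mapped range.
        self.free = [nblocks + lane for lane in range(nfree)]
        self.seq = [1] * nfree
        self.flog = [None] * nfree
        self.rtt = [None] * nfree

    def read(self, lane, premap_aba):
        postmap = self.map[premap_aba]       # read step 3
        if postmap is None:                  # initial state reads as zeroes
            return bytes(self.block_size)
        self.rtt[lane] = postmap             # read step 4
        buf = self.data[postmap]             # read step 7
        self.rtt[lane] = None                # read step 8
        return buf

    def write(self, lane, premap_aba, buf):
        new = self.free[lane]                # write step 3
        while new in self.rtt:               # write step 4: wait for readers
            pass
        self.data[new] = bytes(buf)          # write step 5
        old = self.map[premap_aba]           # write step 6
        if old is None:
            old = premap_aba                 # initial state: premap = postmap
        self.flog[lane] = (premap_aba, old, new, self.seq[lane])  # step 7
        self.map[premap_aba] = new           # write step 8
        self.free[lane] = old                # write step 9
        self.seq[lane] = 1 if self.seq[lane] == 3 else self.seq[lane] + 1  # step 10
```

Because the data lands in a free block before the flog and map are touched,
an interruption at any point leaves the map pointing at either the complete
old block or the complete new block.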
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 253)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 254) 4. Error Handling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 255) =================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 256)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 257) An arena would be in an error state if any of the metadata is corrupted
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 258) irrecoverably, either due to a bug or a media error. The following conditions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 259) indicate an error:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 260)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 261) - Info block checksum does not match (and recovering from the copy also fails)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 262) - Not all internal available blocks are uniquely and entirely addressed by the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 263)   sum of mapped blocks and free blocks (from the BTT flog).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 264) - Rebuilding free list from the flog reveals missing/duplicate/impossible
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 265) entries
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 266) - A map entry is out of bounds
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 267)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 268) If any of these error conditions is encountered, the arena is put into a read
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 269) only state using a flag in the info block.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 270)
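Two of the conditions above, the out-of-bounds map entry and the requirement
that mapped plus free blocks cover every internal block exactly once, can be
sketched as a consistency check. The function name and its inputs are
assumptions for illustration; map entries are taken as plain block numbers
with the flag bits already stripped.

```python
def check_arena(map_table, free_blocks, internal_nlba):
    """Return a list of problems; an empty list means the layout is consistent."""
    problems = []
    for premap, postmap in enumerate(map_table):
        if not 0 <= postmap < internal_nlba:
            problems.append(f"map[{premap}] = {postmap} is out of bounds")
    # Every internal block must appear exactly once, either mapped or free.
    seen = sorted(list(map_table) + list(free_blocks))
    if seen != list(range(internal_nlba)):
        problems.append(
            "mapped and free blocks do not cover every internal block exactly once"
        )
    return problems
```

A caller would treat any non-empty result as grounds for putting the arena
into the read-only state.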
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 271)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 272) 5. Usage
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 273) ========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 274)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 275) The BTT can be set up on any disk (namespace) exposed by the libnvdimm subsystem
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 276) (pmem, or blk mode). The easiest way to set up such a namespace is using the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 277) 'ndctl' utility [1].
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 278)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 279) For example, the ndctl command line to set up a BTT with a 4k sector size is::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 280)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 281)   ndctl create-namespace -f -e namespace0.0 -m sector -l 4k
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 282)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 283) See ndctl create-namespace --help for more options.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 284)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 285) [1]: https://github.com/pmem/ndctl