Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   1) =============================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   2) BTT - Block Translation Table
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   3) =============================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   4) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   5) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   6) 1. Introduction
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   7) ===============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   8) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   9) Persistent memory based storage is able to perform IO at byte (or more
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  10) accurately, cache line) granularity. However, we often want to expose such
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  11) storage as traditional block devices. The block drivers for persistent memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  12) will do exactly this. However, they do not provide any atomicity guarantees.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  13) Traditional SSDs typically provide protection against torn sectors in hardware,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  14) using stored energy in capacitors to complete in-flight block writes, or perhaps
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  15) in firmware. We don't have this luxury with persistent memory - if a write is in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  16) progress, and we experience a power failure, the block will contain a mix of old
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  17) and new data. Applications may not be prepared to handle such a scenario.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  18) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  19) The Block Translation Table (BTT) provides atomic sector update semantics for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  20) persistent memory devices, so that applications that rely on sector writes not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  21) being torn can continue to do so. The BTT manifests itself as a stacked block
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  22) device, and reserves a portion of the underlying storage for its metadata. At
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  23) the heart of it is an indirection table that re-maps all the blocks on the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  24) volume. It can be thought of as an extremely simple file system that only
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  25) provides atomic sector updates.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  26) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  27) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  28) 2. Static Layout
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  29) ================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  30) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  31) The underlying storage on which a BTT can be laid out is not limited in any way.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  32) The BTT, however, splits the available space into chunks of up to 512 GiB,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  33) called "Arenas".
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  34) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  35) Each arena follows the same layout for its metadata, and all references in an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  36) arena are internal to it (with the exception of one field that points to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  37) next arena). The following depicts the "On-disk" metadata layout::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  38) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  39) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  40)     Backing Store     +------->  Arena
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  41)   +---------------+   |   +------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  42)   |               |   |   | Arena info block |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  43)   |    Arena 0    +---+   |       4K         |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  44)   |     512G      |       +------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  45)   |               |       |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  46)   +---------------+       |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  47)   |               |       |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  48)   |    Arena 1    |       |   Data Blocks    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  49)   |     512G      |       |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  50)   |               |       |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  51)   +---------------+       |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  52)   |       .       |       |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  53)   |       .       |       |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  54)   |       .       |       |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  55)   |               |       |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  56)   |               |       |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  57)   +---------------+       +------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  58)                           |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  59)                           |     BTT Map      |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  60)                           |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  61)                           |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  62)                           +------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  63)                           |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  64)                           |     BTT Flog     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  65)                           |                  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  66)                           +------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  67)                           | Info block copy  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  68)                           |       4K         |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  69)                           +------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  70) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  71) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  72) 3. Theory of Operation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  73) ======================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  74) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  75) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  76) a. The BTT Map
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  77) --------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  78) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  79) The map is a simple lookup/indirection table that maps an LBA to an internal
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  80) block. Each map entry is 32 bits. The two most significant bits are special
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  81) flags, and the remaining form the internal block number.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  82) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  83) ======== =============================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  84) Bit      Description
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  85) ======== =============================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  86) 31 - 30	 Error and Zero flags - Used in the following way::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  87) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  88) 	   == ==  ====================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  89) 	   31 30  Description
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  90) 	   == ==  ====================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  91) 	   0  0	  Initial state. Reads return zeroes; Premap = Postmap
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  92) 	   0  1	  Zero state: Reads return zeroes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  93) 	   1  0	  Error state: Reads fail; Writes clear 'E' bit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  94) 	   1  1	  Normal Block – has valid postmap
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  95) 	   == ==  ====================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  96) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  97) 29 - 0	 Mappings to internal 'postmap' blocks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  98) ======== =============================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  99) 
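The bit layout above can be decoded with a few masks. The following is a
minimal sketch, not the driver's code: the macro, enum and function names are
made up for illustration, and only the bit positions and the four states come
from the table.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative names only; they are not taken from the driver sources. */
#define MAP_ERR_BIT   (UINT32_C(1) << 31)   /* bit 31: Error flag */
#define MAP_ZERO_BIT  (UINT32_C(1) << 30)   /* bit 30: Zero flag */
#define MAP_ABA_MASK  UINT32_C(0x3fffffff)  /* bits 29 - 0: postmap ABA */

enum map_state { MAP_INITIAL, MAP_ZERO, MAP_ERROR, MAP_NORMAL };

/* Classify a raw 32-bit map entry according to the flag table. */
static enum map_state map_entry_state(uint32_t entry)
{
	int e = (entry & MAP_ERR_BIT) != 0;
	int z = (entry & MAP_ZERO_BIT) != 0;

	if (e && z)
		return MAP_NORMAL;
	if (e)
		return MAP_ERROR;
	if (z)
		return MAP_ZERO;
	return MAP_INITIAL;
}

/* In the initial state premap == postmap; otherwise use bits 29 - 0. */
static uint32_t map_entry_postmap(uint32_t entry, uint32_t premap_aba)
{
	assert(premap_aba <= MAP_ABA_MASK);
	if (map_entry_state(entry) == MAP_INITIAL)
		return premap_aba;
	return entry & MAP_ABA_MASK;
}
```

For instance, the raw entry 0xC0000040 has both flags set, so it is a normal
block whose postmap ABA is 64.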
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) Some of the terminology that will be subsequently used:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) ============	================================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) External LBA	LBA as made visible to upper layers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) ABA		Arena Block Address - Block offset/number within an arena
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) Premap ABA	The block offset into an arena, which was decided upon by range
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) 		checking the External LBA
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) Postmap ABA	The block number in the "Data Blocks" area obtained after
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) 		indirection from the map
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) nfree		The number of free blocks that are maintained at any given time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) 		This is the number of concurrent writes that can happen to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) 		arena.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) ============	================================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) For example, after adding a BTT, we surface a disk of 1024G. We get a read for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) the external LBA at 768G. This falls into the second arena, and of the 512G
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) worth of blocks that this arena contributes, this block is at 256G. Thus, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) premap ABA is 256G. We now refer to the map, and find that the mapping for block
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) 'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) 
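The range check in this example can be sketched as a walk over the arenas.
This is only an illustration: 'struct arena_span' and 'lba_to_arena' are
hypothetical names, and, like the example itself, the sketch pretends that
each arena contributes its full 512G to the external address space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: how many external blocks one arena contributes. */
struct arena_span {
	uint64_t external_nlba;
};

/*
 * Walk the arenas in order, subtracting each arena's contribution until
 * the LBA falls inside one. Returns the arena index and stores the
 * premap ABA, or returns -1 if the LBA is beyond the device.
 */
static int lba_to_arena(const struct arena_span *arenas, size_t n,
			uint64_t external_lba, uint64_t *premap_aba)
{
	size_t i;

	assert(arenas != NULL && premap_aba != NULL);
	for (i = 0; i < n; i++) {
		if (external_lba < arenas[i].external_nlba) {
			*premap_aba = external_lba;
			return (int)i;
		}
		external_lba -= arenas[i].external_nlba;
	}
	return -1;
}
```

With two arenas of 512 units each, an external LBA of 768 lands in arena 1
(the second arena) with a premap ABA of 256, matching the numbers above.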
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) b. The BTT Flog
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) ---------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) The BTT provides sector atomicity by making every write an "allocating write",
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) i.e. every write goes to a "free" block. A running list of free blocks is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) maintained in the form of the BTT flog. 'Flog' is a combination of the words
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) "free list" and "log". The flog contains 'nfree' entries, and an entry contains:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) ========  =====================================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) lba       The premap ABA that is being written to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) old_map   The old postmap ABA - after 'this' write completes, this will be a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) 	  free block.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) new_map   The new postmap ABA. The map will be updated to reflect this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) 	  lba->postmap_aba mapping, but we log it here in case we have to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) 	  recover.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) seq	  Sequence number to mark which of the 2 sections of this flog entry is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) 	  valid/newest. It cycles between 01->10->11->01 (binary) under normal
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) 	  operation, with 00 indicating an uninitialized state.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) lba'	  alternate lba entry
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) old_map'  alternate old postmap entry
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) new_map'  alternate new postmap entry
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) seq'	  alternate sequence number.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) ========  =====================================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) Each of the above fields is 32-bit, making one entry 32 bytes. Entries are also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) padded to 64 bytes to avoid cache line sharing or aliasing. Flog updates are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) done such that for any entry being written, it:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150) a. overwrites the 'old' section in the entry based on sequence numbers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151) b. writes the 'new' section such that the sequence number is written last.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152) 
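The two-section update rule can be sketched as follows. The struct layout
follows the field list above, but the type and function names are
hypothetical, the 64-byte padding is left out, and treating two equal
sequence numbers as an inconsistent entry is an assumption of this sketch.
Real persistent-memory code would also need a flush/barrier before the
sequence number is written.

```c
#include <assert.h>
#include <stdint.h>

/* One of the two sections of a flog entry: four 32-bit fields. */
struct flog_section {
	uint32_t lba;
	uint32_t old_map;
	uint32_t new_map;
	uint32_t seq;
};

/* Both sections together: 32 bytes before padding. */
struct flog_entry {
	struct flog_section sec[2];
};

/* Sequence numbers cycle 01 -> 10 -> 11 -> 01; 00 is uninitialized. */
static uint32_t next_seq(uint32_t seq)
{
	static const uint32_t nxt[4] = { 0, 2, 3, 1 };

	return nxt[seq & 3];
}

/* Index (0 or 1) of the newest section, or -1 if it cannot be decided. */
static int flog_newest(uint32_t seq0, uint32_t seq1)
{
	if (seq0 > 3 || seq1 > 3 || seq0 == seq1)
		return -1;
	if (seq0 == 0)
		return 1;
	if (seq1 == 0)
		return 0;
	return next_seq(seq0) == seq1 ? 1 : 0;
}

/* Overwrite the older section; the sequence number is stored last. */
static void flog_write(struct flog_entry *e, uint32_t lba,
		       uint32_t old_map, uint32_t new_map)
{
	int cur = flog_newest(e->sec[0].seq, e->sec[1].seq);
	struct flog_section *dst;

	assert(cur >= 0);
	dst = &e->sec[1 - cur];
	dst->lba = lba;
	dst->old_map = old_map;
	dst->new_map = new_map;
	/* A real implementation would persist the fields above first. */
	dst->seq = next_seq(e->sec[cur].seq);
}
```

Because the sequence number is the last field written, a torn update leaves
the previous section as the newest valid one.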
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) c. The concept of lanes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) -----------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) While 'nfree' describes the number of IOs an arena can process
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158) concurrently, 'nlanes' is the number of IOs the BTT device as a whole can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159) process::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) 	nlanes = min(nfree, num_cpus)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) A lane number is obtained at the start of any IO, and is used for indexing into
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164) all the on-disk and in-memory data structures for the duration of the IO. If
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) there are more CPUs than the max number of available lanes, then lanes are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) protected by spinlocks.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) 
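A small sketch of the lane arithmetic follows. Only the 'min' formula comes
from the text; the function names are hypothetical, and mapping a CPU to a
lane with a modulo is an assumption made here for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* nlanes = min(nfree, num_cpus) */
static uint32_t btt_nlanes(uint32_t nfree, uint32_t num_cpus)
{
	return nfree < num_cpus ? nfree : num_cpus;
}

/* Assumed policy: CPUs share lanes round-robin by CPU number. */
static uint32_t lane_for_cpu(uint32_t cpu, uint32_t nlanes)
{
	assert(nlanes > 0);
	return cpu % nlanes;
}

/* Lanes only need a lock when several CPUs can map to the same lane. */
static int lanes_need_locking(uint32_t num_cpus, uint32_t nlanes)
{
	return num_cpus > nlanes;
}
```

With nfree = 4 on a 16-CPU machine there are 4 lanes, CPUs 1, 5, 9 and 13
share lane 1, and the lane locks are required.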
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) d. In-memory data structure: Read Tracking Table (RTT)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) ------------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) Consider a case where we have two threads, one doing reads and the other,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) writes. We can hit a condition where the writer thread grabs a free block to do
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174) a new IO, but the (slow) reader thread is still reading from it. In other words,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) the reader consulted a map entry, and started reading the corresponding block. A
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) writer started writing to the same external LBA, and finished the write updating
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177) the map for that external LBA to point to its new postmap ABA. At this point the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178) internal, postmap block that the reader is (still) reading has been inserted
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179) into the list of free blocks. If another write comes in for the same LBA, it can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180) grab this free block, and start writing to it, causing the reader to read
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181) incorrect data. To prevent this, we introduce the RTT.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183) The RTT is a simple, per arena table with 'nfree' entries. Every reader inserts
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184) into rtt[lane_number], the postmap ABA it is reading, and clears it after the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185) read is complete. Every writer thread, after grabbing a free block, checks the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186) RTT for its presence. If the postmap free block is in the RTT, it waits till the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187) reader clears the RTT entry, and only then starts writing to it.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188) 
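The RTT bookkeeping can be sketched like this. It is a single-threaded
illustration of the table operations only: the names and the 'RTT_EMPTY'
sentinel are hypothetical, and a real implementation would need atomic
accesses and memory barriers around them.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sentinel meaning "this lane is not reading anything". */
#define RTT_EMPTY UINT32_MAX

/* Reader: publish the postmap ABA being read in this lane's slot. */
static void rtt_enter(uint32_t *rtt, uint32_t lane, uint32_t postmap_aba)
{
	assert(postmap_aba != RTT_EMPTY);
	rtt[lane] = postmap_aba;
}

/* Reader: clear the slot once the read is complete. */
static void rtt_clear(uint32_t *rtt, uint32_t lane)
{
	rtt[lane] = RTT_EMPTY;
}

/*
 * Writer: after grabbing a free block, scan all 'nfree' slots. A writer
 * would wait (spin) for as long as this returns nonzero.
 */
static int rtt_in_use(const uint32_t *rtt, uint32_t nfree, uint32_t blk)
{
	uint32_t i;

	for (i = 0; i < nfree; i++)
		if (rtt[i] == blk)
			return 1;
	return 0;
}
```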
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190) e. In-memory data structure: map locks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191) --------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193) Consider a case where two writer threads are writing to the same LBA. There can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194) be a race in the following sequence of steps::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196) 	free[lane] = map[premap_aba]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197) 	map[premap_aba] = postmap_aba
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199) Both threads can update their respective free[lane] with the same old, freed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200) postmap_aba. This has made the layout inconsistent by losing a free entry, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201) at the same time, duplicating another free entry for two lanes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203) To solve this, we could have a single map lock (per arena) that has to be taken
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204) before performing the above sequence, but we feel that could be too contentious.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205) Instead we use an array of (nfree) map_locks that is indexed by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 206) (premap_aba modulo nfree).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 207) 
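The lock selection is a plain modulo. In this sketch the function name is
hypothetical and the locks themselves are not shown; the point is only that
writers to the same premap ABA always pick the same lock, while writers to
different ABAs usually pick different ones.

```c
#include <assert.h>
#include <stdint.h>

/* Index into the array of 'nfree' map locks for a given premap ABA. */
static uint32_t map_lock_index(uint32_t premap_aba, uint32_t nfree)
{
	assert(nfree > 0);
	return premap_aba % nfree;
}
```

With nfree = 256, premap ABAs 5 and 261 contend for lock 5, whereas ABAs 5
and 6 can be updated in parallel.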
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 208) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 209) f. Reconstruction from the Flog
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 210) -------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 211) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 212) On startup, we analyze the BTT flog to create our list of free blocks. We walk
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 213) through all the entries, and for each lane, of the set of two possible
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 214) 'sections', we always look at the most recent one only (based on the sequence
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 215) number). The reconstruction rules/steps are simple:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 216) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 217) - Read map[log_entry.lba].
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 218) - If log_entry.new matches the map entry, then log_entry.old is free.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 219) - If log_entry.new does not match the map entry, then log_entry.new is free.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 220)   (This case can only be caused by power-fails/unsafe shutdowns)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 221) 
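The rules above can be sketched per lane as follows. The names are
hypothetical, 'newest' is assumed to already hold the most recent section of
each flog entry, and 'map' is assumed to hold postmap ABAs with the flag bits
masked off.

```c
#include <assert.h>
#include <stdint.h>

/* The most recent section of one flog entry (sequence number omitted). */
struct log_rec {
	uint32_t lba;
	uint32_t old_map;
	uint32_t new_map;
};

/*
 * For each lane, compare the logged new mapping with the map:
 *  - if they match, the write completed and the old block is free;
 *  - if not, the map update never landed and the new block is free.
 */
static void rebuild_free_list(const uint32_t *map,
			      const struct log_rec *newest,
			      uint32_t nfree, uint32_t *free_list)
{
	uint32_t lane;

	assert(map && newest && free_list);
	for (lane = 0; lane < nfree; lane++) {
		const struct log_rec *r = &newest[lane];

		if (map[r->lba] == r->new_map)
			free_list[lane] = r->old_map;
		else
			free_list[lane] = r->new_map;
	}
}
```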
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 222) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 223) g. Summarizing - Read and Write flows
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 224) -------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 225) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 226) Read:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 227) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 228) 1.  Convert external LBA to arena number + pre-map ABA
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 229) 2.  Get a lane (and take lane_lock)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 230) 3.  Read map to get the entry for this pre-map ABA
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 231) 4.  Enter post-map ABA into RTT[lane]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 232) 5.  If the map entry is in the Zero (TRIM) state, return zeroes, and end IO (go to step 8)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 233) 6.  If the map entry is in the Error state, end IO with EIO (go to step 8)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 234) 7.  Read data from this block
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 235) 8.  Remove post-map ABA entry from RTT[lane]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 236) 9.  Release lane (and lane_lock)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 237) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 238) Write:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 239) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 240) 1.  Convert external LBA to Arena number + pre-map ABA
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 241) 2.  Get a lane (and take lane_lock)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 242) 3.  Use lane to index into in-memory free list and obtain a new block, next flog
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 243)     index, next sequence number
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 244) 4.  Scan the RTT to check if free block is present, and spin/wait if it is.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 245) 5.  Write data to this free block
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 246) 6.  Read map to get the existing post-map ABA entry for this pre-map ABA
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 247) 7.  Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 248) 8.  Write new post-map ABA into map.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 249) 9.  Write old post-map entry into the free list
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 250) 10. Calculate next sequence number and write into the free list entry
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 251) 11. Release lane (and lane_lock)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 252) 
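To make the write flow concrete, here is a toy, single-threaded model of
steps 3, 5, 6, 8 and 9. Everything in it (the 'toy_' names, the tiny sizes,
the in-memory arrays) is invented for illustration; lanes locks, the RTT scan
and the flog write are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NBLOCKS 4	/* externally visible blocks */
#define NFREE   2	/* spare blocks, one per lane */
#define BLKSZ   8	/* toy block size in bytes */

struct toy_arena {
	uint8_t  data[NBLOCKS + NFREE][BLKSZ];
	uint32_t map[NBLOCKS];		/* premap ABA -> postmap ABA */
	uint32_t free_blk[NFREE];	/* per-lane free block */
};

static void toy_init(struct toy_arena *a)
{
	uint32_t i;

	memset(a, 0, sizeof(*a));
	for (i = 0; i < NBLOCKS; i++)
		a->map[i] = i;			/* identity mapping */
	for (i = 0; i < NFREE; i++)
		a->free_blk[i] = NBLOCKS + i;	/* spares start out free */
}

static void toy_write(struct toy_arena *a, uint32_t lane,
		      uint32_t premap_aba, const uint8_t *buf)
{
	uint32_t new_blk, old_blk;

	assert(lane < NFREE && premap_aba < NBLOCKS);
	new_blk = a->free_blk[lane];		/* step 3 */
	memcpy(a->data[new_blk], buf, BLKSZ);	/* step 5 */
	old_blk = a->map[premap_aba];		/* step 6 */
	/* step 7 (flog write) would go here */
	a->map[premap_aba] = new_blk;		/* step 8 */
	a->free_blk[lane] = old_blk;		/* step 9 */
}

static const uint8_t *toy_read(const struct toy_arena *a, uint32_t premap_aba)
{
	assert(premap_aba < NBLOCKS);
	return a->data[a->map[premap_aba]];
}
```

Note that the block holding the previous contents is never written in place:
it only becomes the lane's free block after the map points at the new data.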
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 253) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 254) 4. Error Handling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 255) =================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 256) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 257) An arena would be in an error state if any of the metadata is corrupted
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 258) irrecoverably, either due to a bug or a media error. The following conditions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 259) indicate an error:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 260) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 261) - Info block checksum does not match (and recovering from the copy also fails)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 262) - Not all internal available blocks are uniquely and entirely addressed by the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 263)   sum of mapped blocks and free blocks (from the BTT flog).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 264) - Rebuilding free list from the flog reveals missing/duplicate/impossible
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 265)   entries
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 266) - A map entry is out of bounds
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 267) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 268) If any of these error conditions is encountered, the arena is put into a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 269) read-only state using a flag in the info block.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 270) 
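The "uniquely and entirely addressed" and "out of bounds" conditions can be
checked in one pass. This is a sketch under two assumptions: the map is given
as postmap ABAs with the flag bits already masked off, and the arena holds
exactly 'nmap + nfree' internal blocks. The function name and the
caller-provided scratch buffer are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Returns 1 if every internal block is referenced exactly once by either
 * the map or the free list, 0 otherwise. 'seen' is scratch space of at
 * least nmap + nfree bytes.
 */
static int arena_blocks_consistent(const uint32_t *map, uint32_t nmap,
				   const uint32_t *free_list, uint32_t nfree,
				   uint8_t *seen)
{
	uint32_t total = nmap + nfree;
	uint32_t i;

	assert(map && free_list && seen);
	memset(seen, 0, total);
	for (i = 0; i < total; i++) {
		uint32_t blk = i < nmap ? map[i] : free_list[i - nmap];

		if (blk >= total)
			return 0;	/* entry out of bounds */
		if (seen[blk])
			return 0;	/* duplicate, so another block is lost */
		seen[blk] = 1;
	}
	return 1;
}
```

Since there are exactly as many references as blocks, the absence of
duplicates and out-of-bounds entries implies that no block is missing.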
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 271) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 272) 5. Usage
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 273) ========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 274) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 275) The BTT can be set up on any disk (namespace) exposed by the libnvdimm subsystem
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 276) (pmem, or blk mode). The easiest way to set up such a namespace is using the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 277) 'ndctl' utility [1].
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 278) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 279) For example, the ndctl command line to set up a BTT with a 4k sector size is::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 280) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 281)     ndctl create-namespace -f -e namespace0.0 -m sector -l 4k
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 282) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 283) See ndctl create-namespace --help for more options.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 284) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 285) [1]: https://github.com/pmem/ndctl