^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) =============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2) dm-log-writes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) =============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) This target takes 2 devices, one to pass all IO to normally, and one to log all
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6) of the write operations to. This is intended for file system developers wishing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) to verify the integrity of metadata or data as the file system is written to.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8) There is a log_write_entry written for every WRITE request and the target is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9) able to take arbitrary data from userspace to insert into the log. The data
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10) that is in the WRITE requests is copied into the log to make the replay happen
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11) exactly as it happened originally.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) Log Ordering
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14) ============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16) We log things in order of completion once we are sure the write is no longer in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17) cache. This means that normal WRITE requests are not actually logged until the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18) next REQ_PREFLUSH request. This is to make it easier for userspace to replay
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) the log in a way that correlates to what is on disk and not what is in cache,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20) to make it easier to detect improper waiting/flushing.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22) This works by attaching all WRITE requests to a list once the write completes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) Once we see a REQ_PREFLUSH request we splice this list onto the request and once
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24) the FLUSH request completes we log all of the WRITEs and then the FLUSH. Only
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25) completed WRITEs, at the time the REQ_PREFLUSH is issued, are added in order to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26) simulate the worst case scenario with regard to power failures. Consider the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27) following example (W means write, C means complete):
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29) W1,W2,W3,C3,C2,Wflush,C1,Cflush
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31) The log would show the following:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) W3,W2,flush,W1....
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35) Again this is to simulate what is actually on disk, this allows us to detect
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) cases where a power failure at a particular point in time would create an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37) inconsistent file system.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39) Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40) they complete as those requests will obviously bypass the device cache.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42) Any REQ_OP_DISCARD requests are treated like WRITE requests. Otherwise we would
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43) have all the DISCARD requests, and then the WRITE requests and then the FLUSH
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) request. Consider the following example:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) WRITE block 1, DISCARD block 1, FLUSH
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48) If we logged DISCARD when it completed, the replay would look like this:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) DISCARD 1, WRITE 1, FLUSH
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52) which isn't quite what happened and wouldn't be caught during the log replay.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) Target interface
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) ================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57) i) Constructor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59) log-writes <dev_path> <log_dev_path>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61) ============= ==============================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) dev_path Device that all of the IO will go to normally.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63) log_dev_path Device where the log entries are written to.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64) ============= ==============================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66) ii) Status
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68) <#logged entries> <highest allocated sector>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70) =========================== ========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71) #logged entries Number of logged entries
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72) highest allocated sector Highest allocated sector
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73) =========================== ========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75) iii) Messages
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77) mark <description>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79) You can use a dmsetup message to set an arbitrary mark in a log.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80) For example say you want to fsck a file system after every
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81) write, but first you need to replay up to the mkfs to make sure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82) we're fsck'ing something reasonable, you would do something like
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83) this::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85) mkfs.btrfs -f /dev/mapper/log
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86) dmsetup message log 0 mark mkfs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87) <run test>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89) This would allow you to replay the log up to the mkfs mark and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90) then replay from that point on doing the fsck check in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91) interval that you want.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93) Every log has a mark at the end labeled "dm-log-writes-end".
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95) Userspace component
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96) ===================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98) There is a userspace tool that will replay the log for you in various ways.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99) It can be found here: https://github.com/josefbacik/log-writes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) Example usage
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) =============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) Say you want to test fsync on your file system. You would do something like
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) this::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) dmsetup create log --table "$TABLE"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) mkfs.btrfs -f /dev/mapper/log
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) dmsetup message log 0 mark mkfs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) mount /dev/mapper/log /mnt/btrfs-test
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) <some test that does fsync at the end>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) dmsetup message log 0 mark fsync
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) md5sum /mnt/btrfs-test/foo
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) umount /mnt/btrfs-test
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) dmsetup remove log
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) mount /dev/sdb /mnt/btrfs-test
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) md5sum /mnt/btrfs-test/foo
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) <verify md5sum's are correct>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) Another option is to do a complicated file system operation and verify the file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) system is consistent during the entire operation. You could do this with:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) dmsetup create log --table "$TABLE"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) mkfs.btrfs -f /dev/mapper/log
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) dmsetup message log 0 mark mkfs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) mount /dev/mapper/log /mnt/btrfs-test
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) <fsstress to dirty the fs>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) btrfs filesystem balance /mnt/btrfs-test
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) umount /mnt/btrfs-test
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) dmsetup remove log
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) btrfsck /dev/sdb
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) --fsck "btrfsck /dev/sdb" --check fua
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) And that will replay the log until it sees a FUA request, run the fsck command
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) and if the fsck passes it will replay to the next FUA, until it is completed or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) the fsck command exists abnormally.