^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) .. SPDX-License-Identifier: GPL-2.0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) ============================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4) XFS Self Describing Metadata
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) ============================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) Introduction
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8) ============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10) The largest scalability problem facing XFS is not one of algorithmic
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11) scalability, but of verification of the filesystem structure. Scalability of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12) structures and indexes on disk and the algorithms for iterating them is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) adequate for supporting PB scale filesystems with billions of inodes; however, it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14) is this very scalability that causes the verification problem.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16) Almost all metadata on XFS is dynamically allocated. The only fixed location
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17) metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18) other metadata structures need to be discovered by walking the filesystem
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) structure in different ways. While this is already done by userspace tools for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20) validating and repairing the structure, there are limits to what they can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21) verify, and this in turn limits the supportable size of an XFS filesystem.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) For example, it is entirely possible to manually use xfs_db and a bit of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24) scripting to analyse the structure of a 100TB filesystem when trying to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25) determine the root cause of a corruption problem, but it is still mainly a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26) manual task of verifying that things like single bit errors or misplaced writes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27) weren't the ultimate cause of a corruption event. It may take a few hours to a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28) few days to perform such forensic analysis, so at this scale root cause
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29) analysis is entirely possible.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31) However, if we scale the filesystem up to 1PB, we now have 10x as much metadata
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32) to analyse and so that analysis blows out towards weeks/months of forensic work.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) Most of the analysis work is slow and tedious, so as the amount of analysis goes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34) up, it becomes more likely that the cause will be lost in the noise. Hence the primary
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35) concern for supporting PB scale filesystems is minimising the time and effort
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) required for basic forensic analysis of the filesystem structure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39) Self Describing Metadata
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40) ========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42) One of the problems with the current metadata format is that apart from the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43) magic number in the metadata block, we have no other way of identifying what it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) is supposed to be. We can't even identify if it is the right place. Put simply,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45) you can't look at a single metadata block in isolation and say "yes, it is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) supposed to be there and the contents are valid".
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48) Hence most of the time spent on forensic analysis is spent doing basic
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49) verification of metadata values, looking for values that are in range (and hence
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) not detected by automated verification checks) but are not correct. Finding and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51) understanding how things like cross linked block lists arise (e.g. how sibling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52) pointers in a btree end up with loops in them) is the key to understanding what
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53) went wrong, but it is impossible to tell what order the blocks were linked into
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) each other or written to disk after the fact.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56) Hence we need to record more information into the metadata to allow us to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57) quickly determine if the metadata is intact and can be ignored for the purpose
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58) of analysis. We can't protect against every possible type of error, but we can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59) ensure that common types of errors are easily detectable. Hence the concept of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60) self describing metadata.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) The first, fundamental requirement of self describing metadata is that the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63) metadata object contains some form of unique identifier in a well known
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64) location. This allows us to identify the expected contents of the block and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65) hence parse and verify the metadata object. If we can't independently identify
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66) the type of metadata in the object, then the metadata doesn't describe itself
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67) very well at all!
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69) Luckily, almost all XFS metadata has magic numbers embedded already - only the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70) AGFL, remote symlinks and remote attribute blocks do not contain identifying
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71) magic numbers. Hence we can change the on-disk format of all these objects to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72) add more identifying information and detect this simply by changing the magic
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73) numbers in the metadata objects. That is, if it has the current magic number,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74) the metadata isn't self identifying. If it contains a new magic number, it is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75) self identifying and we can do much more expansive automated verification of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76) metadata object at runtime, during forensic analysis or repair.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78) As a primary concern, self describing metadata needs some form of overall
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79) integrity checking. We cannot trust the metadata if we cannot verify that it has
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80) not been changed as a result of external influences. Hence we need some form of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81) integrity check, and this is done by adding CRC32c validation to the metadata
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82) block. If we can verify the block contains the metadata it was intended to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83) contain, a large amount of the manual verification work can be skipped.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85) CRC32c was selected as metadata cannot be more than 64k in length in XFS and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86) hence a 32 bit CRC is more than sufficient to detect multi-bit errors in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87) metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88) fast. So while CRC32c is not the strongest of possible integrity checks that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89) could be used, it is more than sufficient for our needs and has relatively
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90) little overhead. Adding support for larger integrity fields and/or algorithms
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91) does not really provide any extra value over CRC32c, but it does add a lot of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92) complexity and so there is no provision for changing the integrity checking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93) mechanism.
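
To make the integrity check concrete, the following is a minimal userspace
sketch of CRC32c over a metadata block in which the bytes of the stored CRC
field are treated as zero, so the checksum can live inside the block it
protects. The names ``crc32c_update``, ``crc32c`` and ``block_cksum`` are
hypothetical and the bit-at-a-time loop is only an illustration; it is not the
hardware accelerated kernel implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32c (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c_update(uint32_t crc, const void *data, size_t len)
{
    const uint8_t *p = data;

    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Conventional CRC32c: all-ones seed, inverted result. */
static uint32_t crc32c(const void *data, size_t len)
{
    return ~crc32c_update(~0u, data, len);
}

/*
 * Checksum a block while treating the 4-byte CRC field at crc_off as zero.
 * The field offset is an assumption supplied by the caller.
 */
static uint32_t block_cksum(const void *blk, size_t len, size_t crc_off)
{
    static const uint8_t zero[4];
    const uint8_t *p = blk;
    uint32_t crc = ~0u;

    assert(crc_off + sizeof(zero) <= len);

    crc = crc32c_update(crc, p, crc_off);
    crc = crc32c_update(crc, zero, sizeof(zero));
    crc = crc32c_update(crc, p + crc_off + sizeof(zero),
                        len - crc_off - sizeof(zero));
    return ~crc;
}
```

Because the CRC field is skipped, storing the result into the block does not
change the value that a later verification pass computes, while a change to any
other byte does.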
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95) Self describing metadata needs to contain enough information so that the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96) metadata block can be verified as being in the correct place without needing to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97) look at any other metadata. This means it needs to contain location information.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98) Just adding a block number to the metadata is not sufficient to protect against
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99) mis-directed writes - a write might be misdirected to the wrong LUN and so be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) written to the "correct block" of the wrong filesystem. Hence location
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) information must contain a filesystem identifier as well as a block number.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) Another key information point in forensic analysis is knowing who the metadata
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) block belongs to. We already know the type, the location, whether it is valid
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) or corrupted, and how long ago it was last modified. Knowing the owner
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) of the block is important as it allows us to find other related metadata to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) determine the scope of the corruption. For example, if we have an extent btree
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) object, we don't know what inode it belongs to and hence have to walk the entire
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) filesystem to find the owner of the block. Worse, the corruption could mean that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) no owner can be found (i.e. it's an orphan block), and so without an owner field
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) in the metadata we have no idea of the scope of the corruption. If we have an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) owner field in the metadata object, we can immediately do top down validation to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) determine the scope of the problem.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) Different types of metadata have different owner identifiers. For example,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) directory, attribute and extent tree blocks are all owned by an inode, while
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) freespace btree blocks are owned by an allocation group. Hence the size and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) contents of the owner field are determined by the type of metadata object we are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) looking at. The owner information can also identify misplaced writes (e.g.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) freespace btree block written to the wrong AG).
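
The location and owner checks described above can be sketched as a small
classifier. This is a hypothetical, host-endian userspace model: the structure
``demo_ident``, the result codes and ``demo_check_ident`` are illustrative
names, not kernel interfaces, and a real owner check depends on the metadata
type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the identifying fields carried by a block. */
struct demo_ident {
    uint8_t  uuid[16];  /* filesystem identifier */
    uint64_t owner;     /* inode number or AG number, depending on type */
    uint64_t blkno;     /* where the block believes it lives */
};

enum demo_ident_result {
    DEMO_IDENT_OK,
    DEMO_IDENT_WRONG_FS,     /* e.g. write misdirected to another LUN */
    DEMO_IDENT_WRONG_BLOCK,  /* right filesystem, wrong location */
    DEMO_IDENT_WRONG_OWNER,  /* e.g. btree block written to the wrong AG */
};

static enum demo_ident_result
demo_check_ident(const struct demo_ident *id, const uint8_t fs_uuid[16],
                 uint64_t read_blkno, uint64_t expected_owner)
{
    assert(id != NULL && fs_uuid != NULL);

    /* The filesystem identifier is checked first: a matching block
     * number means nothing if the block belongs to another filesystem. */
    if (memcmp(id->uuid, fs_uuid, sizeof(id->uuid)) != 0)
        return DEMO_IDENT_WRONG_FS;
    if (id->blkno != read_blkno)
        return DEMO_IDENT_WRONG_BLOCK;
    if (id->owner != expected_owner)
        return DEMO_IDENT_WRONG_OWNER;
    return DEMO_IDENT_OK;
}
```

Returning a distinct code per failure is a choice made for this sketch only, so
that each class of misplaced write named in the text is visible on its own.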
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) Self describing metadata also needs to contain some indication of when it was
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) written to the filesystem. One of the key information points when doing forensic
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) analysis is how recently the block was modified. Correlation of a set of corrupted
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) metadata blocks based on modification times is important as it can indicate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) whether the corruptions are related, whether there have been multiple corruption
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) events that led to the eventual failure, and even whether there are corruptions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) present that the run-time verification is not detecting.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) For example, we can determine whether a metadata object is supposed to be free
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) space or still allocated if it is still referenced by its owner by looking at
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) when the free space btree block that contains the block was last written
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) compared to when the metadata object itself was last written. If the free space
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) block is more recent than the object and the object's owner, then there is a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) very good chance that the block should have been removed from the owner.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) To provide this "written timestamp", each metadata block records the Log Sequence
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) Number (LSN) of the most recent transaction in which it was modified.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) This number will always increase over the life of the filesystem, and the only
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) thing that resets it is running xfs_repair on the filesystem. Further, by use of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) the LSN we can tell if the corrupted metadata all belonged to the same log
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) checkpoint and hence have some idea of how much modification occurred between
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) the first and last instance of corrupt metadata on disk and, further, how much
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) modification occurred between the corruption being written and when it was
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) detected.
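
As a rough illustration of how LSNs can be ordered during analysis, the sketch
below assumes that an LSN packs a log cycle number into its upper 32 bits and a
block offset within the log into its lower 32 bits. That layout is not stated
in this document and should be treated as an assumption, as should all of the
``demo_`` names. ``demo_probably_stale`` encodes the free space example given
earlier: a free space btree block written more recently than both the object
and its owner.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t demo_lsn_t;

/* Assumed layout: cycle in the upper 32 bits, block in the lower 32. */
static demo_lsn_t demo_make_lsn(uint32_t cycle, uint32_t block)
{
    assert(cycle <= INT32_MAX);  /* keep the value within int64_t range */
    return (demo_lsn_t)(((uint64_t)cycle << 32) | block);
}

static uint32_t demo_lsn_cycle(demo_lsn_t lsn)
{
    return (uint32_t)((uint64_t)lsn >> 32);
}

static uint32_t demo_lsn_block(demo_lsn_t lsn)
{
    return (uint32_t)((uint64_t)lsn & 0xffffffffu);
}

/* Returns <0, 0 or >0 when a is older than, equal to or newer than b. */
static int demo_lsn_cmp(demo_lsn_t a, demo_lsn_t b)
{
    if (demo_lsn_cycle(a) != demo_lsn_cycle(b))
        return demo_lsn_cycle(a) < demo_lsn_cycle(b) ? -1 : 1;
    if (demo_lsn_block(a) != demo_lsn_block(b))
        return demo_lsn_block(a) < demo_lsn_block(b) ? -1 : 1;
    return 0;
}

/*
 * Heuristic from the text: if the free space block was written after both
 * the object and its owner, the owner's reference is probably stale.
 */
static bool demo_probably_stale(demo_lsn_t freesp, demo_lsn_t object,
                                demo_lsn_t owner)
{
    return demo_lsn_cmp(freesp, object) > 0 &&
           demo_lsn_cmp(freesp, owner) > 0;
}
```

The result is only a hint for the analyst; it narrows down where to look and
does not prove which block is wrong.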
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) Runtime Validation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) ==================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150) Validation of self-describing metadata takes place at runtime in two places:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152) - immediately after a successful read from disk
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153) - immediately prior to write IO submission
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) The verification is completely stateless - it is done independently of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) modification process, and seeks only to check that the metadata is what it says
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) it is and that the metadata fields are within bounds and internally consistent.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158) As such, we cannot catch all types of corruption that can occur within a block
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159) as there may be certain limitations that operational state enforces on the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) metadata, or there may be corruption of interblock relationships (e.g. corrupted
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) sibling pointer lists). Hence we still need stateful checking in the main code
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162) body, but in general most of the per-field validation is handled by the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) verifiers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) For read verification, the caller needs to specify the expected type of metadata
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) that it should see, and the IO completion process verifies that the metadata
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) object matches what was expected. If the verification process fails, then it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) marks the object being read as EFSCORRUPTED. The caller needs to catch this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) error (same as for IO errors), and if it needs to take special action due to a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) verification error it can do so by catching the EFSCORRUPTED error value. If we
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171) need more discrimination of error type at higher levels, we can define new
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) error numbers for different errors as necessary.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174) The first step in read verification is checking the magic number and determining
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) whether CRC validation is necessary. If it is, the CRC32c is calculated and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) compared against the value stored in the object itself. Once this is validated,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177) further checks are made against the location information, followed by extensive
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178) object specific metadata validation. If any of these checks fail, then the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179) buffer is considered corrupt and the EFSCORRUPTED error is set appropriately.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181) Write verification is the opposite of the read verification - first the object
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182) is extensively verified and if it is OK we then update the LSN from the last
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183) modification made to the object. After this, we calculate the CRC and insert it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184) into the object. Once this is done the write IO is allowed to continue. If any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185) error occurs during this process, the buffer is again marked with an EFSCORRUPTED
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186) error for the higher layers to catch.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188) Structures
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191) A typical on-disk structure needs to contain the following information::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193)     struct xfs_ondisk_hdr {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194)         __be32  magic;      /* magic number */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195)         __be32  crc;        /* CRC, not logged */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196)         uuid_t  uuid;       /* filesystem identifier */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197)         __be64  owner;      /* parent object */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198)         __be64  blkno;      /* location on disk */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199)         __be64  lsn;        /* last modification in log, not logged */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200)     };
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202) Depending on the metadata, this information may be part of a header structure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203) separate to the metadata contents, or may be distributed through an existing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204) structure. The latter occurs with metadata that already contains some of this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205) information, such as the superblock and AG headers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 206)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 207) Other metadata may have different formats for the information, but the same
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 208) level of information is generally provided. For example:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 209)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 210) - short btree blocks have a 32 bit owner (ag number) and a 32 bit block
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 211) number for location. The two of these combined provide the same
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 212) information as @owner and @blkno in the above structure, but using 8
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 213) bytes less space on disk.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 214)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 215) - directory/attribute node blocks have a 16 bit magic number, and the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 216) header that contains the magic number has other information in it as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 217) well. Hence the additional metadata headers change the overall format
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 218) of the metadata.
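
The space saving quoted for short btree blocks is simple arithmetic, shown here
as a compile-time check. Both structures are hypothetical models that contain
only the owner and location fields as described in the text above; they are not
the real on-disk btree block headers, which carry additional fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Long form: 64-bit owner and 64-bit block number. */
struct demo_long_ident {
    uint64_t owner;
    uint64_t blkno;
};

/* Short form: 32-bit owner (AG number) and 32-bit block number. */
struct demo_short_ident {
    uint32_t owner;
    uint32_t blkno;
};

static_assert(sizeof(struct demo_long_ident) == 16,
              "long form uses 16 bytes for owner and location");
static_assert(sizeof(struct demo_short_ident) == 8,
              "short form uses 8 bytes for owner and location");

/* Bytes saved per block by using the short form. */
static size_t demo_ident_bytes_saved(void)
{
    return sizeof(struct demo_long_ident) - sizeof(struct demo_short_ident);
}
```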
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 219)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 220) A typical buffer read verifier is structured as follows::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 221)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 222)     #define XFS_FOO_CRC_OFF offsetof(struct xfs_ondisk_hdr, crc)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 223)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 224)     static void
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 225)     xfs_foo_read_verify(
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 226)         struct xfs_buf *bp)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 227)     {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 228)         struct xfs_mount *mp = bp->b_mount;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 229)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 230)         if ((xfs_sb_version_hascrc(&mp->m_sb) &&
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 231)              !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 232)                                XFS_FOO_CRC_OFF)) ||
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 233)             !xfs_foo_verify(bp)) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 234)             XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 235)             xfs_buf_ioerror(bp, EFSCORRUPTED);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 236)         }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 237)     }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 238)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 239) The code ensures that the CRC is only checked if the filesystem has CRCs enabled
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 240) by checking the superblock for the feature bit, and then if the CRC verifies OK
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 241) (or is not needed) it verifies the actual contents of the block.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 242)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 243) The verifier function will take a couple of different forms, depending on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 244) whether the magic number can be used to determine the format of the block. In
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 245) the case it can't, the code is structured as follows::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 246)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 247)     static bool
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 248)     xfs_foo_verify(
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 249)         struct xfs_buf *bp)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 250)     {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 251)         struct xfs_mount *mp = bp->b_mount;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 252)         struct xfs_ondisk_hdr *hdr = bp->b_addr;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 253)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 254)         if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 255)             return false;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 256)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 257)         if (xfs_sb_version_hascrc(&mp->m_sb)) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 258)             if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 259)                 return false;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 260)             if (bp->b_bn != be64_to_cpu(hdr->blkno))
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 261)                 return false;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 262)             if (hdr->owner == 0)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 263)                 return false;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 264)         }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 265)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 266)         /* object specific verification checks here */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 267)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 268)         return true;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 269)     }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 270)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 271) If there are different magic numbers for the different formats, the verifier
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 272) will look like::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 273)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 274)     static bool
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 275)     xfs_foo_verify(
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 276)         struct xfs_buf *bp)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 277)     {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 278)         struct xfs_mount *mp = bp->b_mount;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 279)         struct xfs_ondisk_hdr *hdr = bp->b_addr;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 280)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 281)         if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 282)             if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 283)                 return false;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 284)             if (bp->b_bn != be64_to_cpu(hdr->blkno))
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 285)                 return false;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 286)             if (hdr->owner == 0)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 287)                 return false;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 288)         } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 289)             return false;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 290)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 291)         /* object specific verification checks here */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 292)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 293)         return true;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 294)     }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 295)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 296) Write verifiers are very similar to the read verifiers; they just do things in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 297) the opposite order to the read verifiers. A typical write verifier::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 298)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 299)     static void
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 300)     xfs_foo_write_verify(
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 301)         struct xfs_buf *bp)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 302)     {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 303)         struct xfs_mount *mp = bp->b_mount;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 304)         struct xfs_buf_log_item *bip = bp->b_fspriv;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 305)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 306)         if (!xfs_foo_verify(bp)) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 307)             XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 308)             xfs_buf_ioerror(bp, EFSCORRUPTED);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 309)             return;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 310)         }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 311)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 312)         if (!xfs_sb_version_hascrc(&mp->m_sb))
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 313)             return;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 314)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 315)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 316)         if (bip) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 317)             struct xfs_ondisk_hdr *hdr = bp->b_addr;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 318)             hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 319)         }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 320)         xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 321)     }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 322)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 323) This will verify the internal structure of the metadata before we go any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 324) further, detecting corruptions that have occurred as the metadata has been
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 325) modified in memory. If the metadata verifies OK, and CRCs are enabled, we then
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 326) update the LSN field (when it was last modified) and calculate the CRC on the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 327) metadata. Once this is done, we can issue the IO.
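
The ordering of the write and read verifiers can be exercised outside the
kernel with a small model. The sketch below is hypothetical throughout: it uses
host-endian fields, omits the UUID, and invents the ``demo_`` names and the
``DEMO_MAGIC`` value purely for illustration. It shows the write side checking
structure, stamping the LSN and computing the CRC last, and the read side
checking the CRC before the structural checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct demo_hdr {
    uint32_t magic;
    uint32_t crc;    /* excluded from its own checksum */
    uint64_t owner;
    uint64_t blkno;
    uint64_t lsn;
};

#define DEMO_MAGIC   0x464f4f33u  /* made-up value for this sketch */
#define DEMO_CRC_OFF offsetof(struct demo_hdr, crc)

/* Bitwise CRC32c step, reflected polynomial 0x82F63B78. */
static uint32_t demo_crc32c(uint32_t crc, const uint8_t *p, size_t len)
{
    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Checksum the whole block with the CRC field treated as zero. */
static uint32_t demo_cksum(const uint8_t *blk, size_t len)
{
    static const uint8_t zero[4];
    uint32_t crc = ~0u;

    assert(len >= sizeof(struct demo_hdr));
    crc = demo_crc32c(crc, blk, DEMO_CRC_OFF);
    crc = demo_crc32c(crc, zero, sizeof(zero));
    crc = demo_crc32c(crc, blk + DEMO_CRC_OFF + 4, len - DEMO_CRC_OFF - 4);
    return ~crc;
}

/* Stateless structural checks shared by both directions. */
static bool demo_verify(const uint8_t *blk, uint64_t blkno)
{
    struct demo_hdr hdr;

    memcpy(&hdr, blk, sizeof(hdr));
    return hdr.magic == DEMO_MAGIC && hdr.blkno == blkno && hdr.owner != 0;
}

/* Write side: verify structure, stamp the LSN, then checksum last. */
static bool demo_write_verify(uint8_t *blk, size_t len, uint64_t blkno,
                              uint64_t lsn)
{
    struct demo_hdr hdr;

    if (!demo_verify(blk, blkno))
        return false;
    memcpy(&hdr, blk, sizeof(hdr));
    hdr.lsn = lsn;
    memcpy(blk, &hdr, sizeof(hdr));
    hdr.crc = demo_cksum(blk, len);
    memcpy(blk, &hdr, sizeof(hdr));
    return true;
}

/* Read side: checksum first, then the structural checks. */
static bool demo_read_verify(const uint8_t *blk, size_t len, uint64_t blkno)
{
    struct demo_hdr hdr;

    memcpy(&hdr, blk, sizeof(hdr));
    if (hdr.crc != demo_cksum(blk, len))
        return false;
    return demo_verify(blk, blkno);
}
```

In this model a block that passes ``demo_write_verify()`` also passes
``demo_read_verify()`` at the same block number, while reading it back from a
different block number or flipping a bit in its contents makes the read side
fail.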
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 328)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 329) Inodes and Dquots
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 330) =================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 331)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 332) Inodes and dquots are special snowflakes. They have per-object CRC and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 333) self-identifiers, but they are packed so that there are multiple objects per
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 334) buffer. Hence we do not use per-buffer verifiers to do the work of per-object
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 335) verification and CRC calculations. The per-buffer verifiers simply perform basic
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 336) identification of the buffer - that they contain inodes or dquots, and that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 337) there are magic numbers in all the expected spots. All further CRC and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 338) verification checks are done when each inode is read from or written back to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 339) buffer.
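
A per-buffer check of this kind can be sketched as a loop over fixed-size
slots. The model below is a userspace illustration with hypothetical names; it
assumes every slot begins with a 16-bit big-endian magic number and uses the
value 0x494e ("IN") as the inode magic. No CRC or per-field validation is done
here, matching the division of work described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_INODE_MAGIC 0x494e  /* "IN", assumed inode magic number */

/*
 * Basic identification only: every slot in the buffer must start with
 * the expected magic number. Per-object CRC and field checks happen
 * later, when an individual object is read out of the buffer.
 */
static bool demo_inode_buf_verify(const uint8_t *buf, size_t buflen,
                                  size_t inode_size)
{
    assert(buf != NULL);

    if (inode_size < 2 || buflen % inode_size != 0)
        return false;

    for (size_t off = 0; off < buflen; off += inode_size) {
        /* Decode the big-endian 16-bit magic at the start of the slot. */
        uint16_t magic = (uint16_t)((buf[off] << 8) | buf[off + 1]);

        if (magic != DEMO_INODE_MAGIC)
            return false;
    }
    return true;
}
```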
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 340)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 341) The structure of the verifiers and the identifier checks is very similar to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 342) buffer code described above. The only difference is where they are called. For
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 343) example, inode read verification is done in xfs_inode_from_disk() when the inode
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 344) is first read out of the buffer and the struct xfs_inode is instantiated. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 345) inode is already extensively verified during writeback in xfs_iflush_int(), so the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 346) only addition here is to add the LSN and CRC to the inode as it is copied back
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 347) into the buffer.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 348)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 349) XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 350) the unlinked list modifications check or update CRCs, neither during unlink nor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 351) log recovery. So, it's gone unnoticed until now. This won't matter immediately -
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 352) repair will probably complain about it - but it needs to be fixed.