^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) .. SPDX-License-Identifier: GPL-2.0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) ==========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4) XFS Delayed Logging Design
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) ==========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) Introduction to Re-logging in XFS
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8) =================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10) XFS logging is a combination of logical and physical logging. Some objects,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11) such as inodes and dquots, are logged in logical format where the details
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12) logged are made up of the changes to in-core structures rather than on-disk
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) structures. Other objects - typically buffers - have their physical changes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14) logged. The reason for these differences is to reduce the amount of log space
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15) required for objects that are frequently logged. Some parts of inodes are more
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16) frequently logged than others, and inodes are typically more frequently logged
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17) than any other object (except maybe the superblock buffer) so keeping the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18) amount of metadata logged low is of prime importance.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20) The reason that this is such a concern is that XFS allows multiple separate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21) modifications to a single object to be carried in the log at any given time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22) This allows the log to avoid needing to flush each change to disk before
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) recording a new change to the object. XFS does this via a method called
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24) "re-logging". Conceptually, this is quite simple - all it requires is that any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25) new change to the object is recorded with a *new copy* of all the existing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26) changes in the new transaction that is written to the log.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28) That is, if we have a sequence of changes A through to F, and the object was
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29) written to disk after change D, we would see in the log the following series
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30) of transactions, their contents and the log sequence number (LSN) of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31) transaction::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) Transaction Contents LSN
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34) A A X
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35) B A+B X+n
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) C A+B+C X+n+m
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37) D A+B+C+D X+n+m+o
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) <object written to disk>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39) E E Y (> X+n+m+o)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40) F E+F Y+p
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42) In other words, each time an object is relogged, the new transaction contains
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43) the aggregation of all the previous changes currently held only in the log.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45) This relogging technique also allows objects to be moved forward in the log so
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) that an object being relogged does not prevent the tail of the log from ever
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47) moving forward. This can be seen in the table above by the changing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48) (increasing) LSN of each subsequent transaction - the LSN is effectively a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49) direct encoding of the location in the log of the transaction.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51) This relogging is also used to implement long-running, multiple-commit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52) transactions. These transaction are known as rolling transactions, and require
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53) a special log reservation known as a permanent transaction reservation. A
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) typical example of a rolling transaction is the removal of extents from an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) inode which can only be done at a rate of two extents per transaction because
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56) of reservation size limitations. Hence a rolling extent removal transaction
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57) keeps relogging the inode and btree buffers as they get modified in each
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58) removal operation. This keeps them moving forward in the log as the operation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59) progresses, ensuring that current operation never gets blocked by itself if the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60) log wraps around.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) Hence it can be seen that the relogging operation is fundamental to the correct
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63) working of the XFS journalling subsystem. From the above description, most
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64) people should be able to see why the XFS metadata operations writes so much to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65) the log - repeated operations to the same objects write the same changes to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66) the log over and over again. Worse is the fact that objects tend to get
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67) dirtier as they get relogged, so each subsequent transaction is writing more
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68) metadata into the log.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70) Another feature of the XFS transaction subsystem is that most transactions are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71) asynchronous. That is, they don't commit to disk until either a log buffer is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72) filled (a log buffer can hold multiple transactions) or a synchronous operation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73) forces the log buffers holding the transactions to disk. This means that XFS is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74) doing aggregation of transactions in memory - batching them, if you like - to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75) minimise the impact of the log IO on transaction throughput.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77) The limitation on asynchronous transaction throughput is the number and size of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78) log buffers made available by the log manager. By default there are 8 log
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79) buffers available and the size of each is 32kB - the size can be increased up
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80) to 256kB by use of a mount option.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82) Effectively, this gives us the maximum bound of outstanding metadata changes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83) that can be made to the filesystem at any point in time - if all the log
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84) buffers are full and under IO, then no more transactions can be committed until
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85) the current batch completes. It is now common for a single current CPU core to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86) be to able to issue enough transactions to keep the log buffers full and under
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87) IO permanently. Hence the XFS journalling subsystem can be considered to be IO
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88) bound.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90) Delayed Logging: Concepts
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91) =========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93) The key thing to note about the asynchronous logging combined with the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94) relogging technique XFS uses is that we can be relogging changed objects
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95) multiple times before they are committed to disk in the log buffers. If we
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96) return to the previous relogging example, it is entirely possible that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97) transactions A through D are committed to disk in the same log buffer.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99) That is, a single log buffer may contain multiple copies of the same object,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) but only one of those copies needs to be there - the last one "D", as it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) contains all the changes from the previous changes. In other words, we have one
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) necessary copy in the log buffer, and three stale copies that are simply
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) wasting space. When we are doing repeated operations on the same set of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) objects, these "stale objects" can be over 90% of the space used in the log
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) buffers. It is clear that reducing the number of stale objects written to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) log would greatly reduce the amount of metadata we write to the log, and this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) is the fundamental goal of delayed logging.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) From a conceptual point of view, XFS is already doing relogging in memory (where
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) memory == log buffer), only it is doing it extremely inefficiently. It is using
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) logical to physical formatting to do the relogging because there is no
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) infrastructure to keep track of logical changes in memory prior to physically
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) formatting the changes in a transaction to the log buffer. Hence we cannot avoid
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) accumulating stale objects in the log buffers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) Delayed logging is the name we've given to keeping and tracking transactional
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) changes to objects in memory outside the log buffer infrastructure. Because of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) the relogging concept fundamental to the XFS journalling subsystem, this is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) actually relatively easy to do - all the changes to logged items are already
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) tracked in the current infrastructure. The big problem is how to accumulate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) them and get them to the log in a consistent, recoverable manner.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) Describing the problems and how they have been solved is the focus of this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) document.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) One of the key changes that delayed logging makes to the operation of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) journalling subsystem is that it disassociates the amount of outstanding
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) metadata changes from the size and number of log buffers available. In other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) words, instead of there only being a maximum of 2MB of transaction changes not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) written to the log at any point in time, there may be a much greater amount
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) being accumulated in memory. Hence the potential for loss of metadata on a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) crash is much greater than for the existing logging mechanism.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) It should be noted that this does not change the guarantee that log recovery
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) will result in a consistent filesystem. What it does mean is that as far as the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) recovered filesystem is concerned, there may be many thousands of transactions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) that simply did not occur as a result of the crash. This makes it even more
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) important that applications that care about their data use fsync() where they
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) need to ensure application level data integrity is maintained.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) It should be noted that delayed logging is not an innovative new concept that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) warrants rigorous proofs to determine whether it is correct or not. The method
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) of accumulating changes in memory for some period before writing them to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) log is used effectively in many filesystems including ext3 and ext4. Hence
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) no time is spent in this document trying to convince the reader that the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) concept is sound. Instead it is simply considered a "solved problem" and as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146) such implementing it in XFS is purely an exercise in software engineering.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) The fundamental requirements for delayed logging in XFS are simple:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150) 1. Reduce the amount of metadata written to the log by at least
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151) an order of magnitude.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152) 2. Supply sufficient statistics to validate Requirement #1.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153) 3. Supply sufficient new tracing infrastructure to be able to debug
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) problems with the new code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) 4. No on-disk format change (metadata or log format).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) 5. Enable and disable with a mount option.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) 6. No performance regressions for synchronous transaction workloads.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159) Delayed Logging: Design
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) =======================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162) Storing Changes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) ---------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) The problem with accumulating changes at a logical level (i.e. just using the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) existing log item dirty region tracking) is that when it comes to writing the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) changes to the log buffers, we need to ensure that the object we are formatting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) is not changing while we do this. This requires locking the object to prevent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) concurrent modification. Hence flushing the logical changes to the log would
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) require us to lock every object, format them, and then unlock them again.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) This introduces lots of scope for deadlocks with transactions that are already
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) running. For example, a transaction has object A locked and modified, but needs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174) the delayed logging tracking lock to commit the transaction. However, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) flushing thread has the delayed logging tracking lock already held, and is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) trying to get the lock on object A to flush it to the log buffer. This appears
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177) to be an unsolvable deadlock condition, and it was solving this problem that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178) was the barrier to implementing delayed logging for so long.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180) The solution is relatively simple - it just took a long time to recognise it.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181) Put simply, the current logging code formats the changes to each item into an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182) vector array that points to the changed regions in the item. The log write code
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183) simply copies the memory these vectors point to into the log buffer during
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184) transaction commit while the item is locked in the transaction. Instead of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185) using the log buffer as the destination of the formatting code, we can use an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186) allocated memory buffer big enough to fit the formatted vector.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188) If we then copy the vector into the memory buffer and rewrite the vector to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189) point to the memory buffer rather than the object itself, we now have a copy of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190) the changes in a format that is compatible with the log buffer writing code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191) that does not require us to lock the item to access. This formatting and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192) rewriting can all be done while the object is locked during transaction commit,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193) resulting in a vector that is transactionally consistent and can be accessed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194) without needing to lock the owning item.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196) Hence we avoid the need to lock items when we need to flush outstanding
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197) asynchronous transactions to the log. The differences between the existing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198) formatting method and the delayed logging formatting can be seen in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199) diagram below.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201) Current format log vector::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203) Object +---------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204) Vector 1 +----+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205) Vector 2 +----+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 206) Vector 3 +----------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 207)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 208) After formatting::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 209)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 210) Log Buffer +-V1-+-V2-+----V3----+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 211)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 212) Delayed logging vector::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 213)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 214) Object +---------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 215) Vector 1 +----+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 216) Vector 2 +----+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 217) Vector 3 +----------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 218)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 219) After formatting::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 220)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 221) Memory Buffer +-V1-+-V2-+----V3----+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 222) Vector 1 +----+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 223) Vector 2 +----+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 224) Vector 3 +----------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 225)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 226) The memory buffer and associated vector need to be passed as a single object,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 227) but still need to be associated with the parent object so if the object is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 228) relogged we can replace the current memory buffer with a new memory buffer that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 229) contains the latest changes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 230)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 231) The reason for keeping the vector around after we've formatted the memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 232) buffer is to support splitting vectors across log buffer boundaries correctly.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 233) If we don't keep the vector around, we do not know where the region boundaries
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 234) are in the item, so we'd need a new encapsulation method for regions in the log
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 235) buffer writing (i.e. double encapsulation). This would be an on-disk format
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 236) change and as such is not desirable. It also means we'd have to write the log
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 237) region headers in the formatting stage, which is problematic as there is per
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 238) region state that needs to be placed into the headers during the log write.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 239)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 240) Hence we need to keep the vector, but by attaching the memory buffer to it and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 241) rewriting the vector addresses to point at the memory buffer we end up with a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 242) self-describing object that can be passed to the log buffer write code to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 243) handled in exactly the same manner as the existing log vectors are handled.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 244) Hence we avoid needing a new on-disk format to handle items that have been
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 245) relogged in memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 246)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 247)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 248) Tracking Changes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 249) ----------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 250)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 251) Now that we can record transactional changes in memory in a form that allows
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 252) them to be used without limitations, we need to be able to track and accumulate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 253) them so that they can be written to the log at some later point in time. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 254) log item is the natural place to store this vector and buffer, and also makes sense
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 255) to be the object that is used to track committed objects as it will always
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 256) exist once the object has been included in a transaction.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 257)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 258) The log item is already used to track the log items that have been written to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 259) the log but not yet written to disk. Such log items are considered "active"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 260) and as such are stored in the Active Item List (AIL) which is a LSN-ordered
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 261) double linked list. Items are inserted into this list during log buffer IO
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 262) completion, after which they are unpinned and can be written to disk. An object
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 263) that is in the AIL can be relogged, which causes the object to be pinned again
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 264) and then moved forward in the AIL when the log buffer IO completes for that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 265) transaction.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 266)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 267) Essentially, this shows that an item that is in the AIL can still be modified
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 268) and relogged, so any tracking must be separate to the AIL infrastructure. As
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 269) such, we cannot reuse the AIL list pointers for tracking committed items, nor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 270) can we store state in any field that is protected by the AIL lock. Hence the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 271) committed item tracking needs it's own locks, lists and state fields in the log
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 272) item.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 273)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 274) Similar to the AIL, tracking of committed items is done through a new list
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 275) called the Committed Item List (CIL). The list tracks log items that have been
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 276) committed and have formatted memory buffers attached to them. It tracks objects
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 277) in transaction commit order, so when an object is relogged it is removed from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 278) it's place in the list and re-inserted at the tail. This is entirely arbitrary
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 279) and done to make it easy for debugging - the last items in the list are the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 280) ones that are most recently modified. Ordering of the CIL is not necessary for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 281) transactional integrity (as discussed in the next section) so the ordering is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 282) done for convenience/sanity of the developers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 283)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 284)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 285) Delayed Logging: Checkpoints
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 286) ----------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 287)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 288) When we have a log synchronisation event, commonly known as a "log force",
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 289) all the items in the CIL must be written into the log via the log buffers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 290) We need to write these items in the order that they exist in the CIL, and they
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 291) need to be written as an atomic transaction. The need for all the objects to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 292) written as an atomic transaction comes from the requirements of relogging and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 293) log replay - all the changes in all the objects in a given transaction must
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 294) either be completely replayed during log recovery, or not replayed at all. If
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 295) a transaction is not replayed because it is not complete in the log, then
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 296) no later transactions should be replayed, either.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 297)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 298) To fulfill this requirement, we need to write the entire CIL in a single log
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 299) transaction. Fortunately, the XFS log code has no fixed limit on the size of a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 300) transaction, nor does the log replay code. The only fundamental limit is that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 301) the transaction cannot be larger than just under half the size of the log. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 302) reason for this limit is that to find the head and tail of the log, there must
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 303) be at least one complete transaction in the log at any given time. If a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 304) transaction is larger than half the log, then there is the possibility that a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 305) crash during the write of a such a transaction could partially overwrite the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 306) only complete previous transaction in the log. This will result in a recovery
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 307) failure and an inconsistent filesystem and hence we must enforce the maximum
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 308) size of a checkpoint to be slightly less than a half the log.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 309)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 310) Apart from this size requirement, a checkpoint transaction looks no different
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 311) to any other transaction - it contains a transaction header, a series of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 312) formatted log items and a commit record at the tail. From a recovery
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 313) perspective, the checkpoint transaction is also no different - just a lot
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 314) bigger with a lot more items in it. The worst case effect of this is that we
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 315) might need to tune the recovery transaction object hash size.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 316)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 317) Because the checkpoint is just another transaction and all the changes to log
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 318) items are stored as log vectors, we can use the existing log buffer writing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 319) code to write the changes into the log. To do this efficiently, we need to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 320) minimise the time we hold the CIL locked while writing the checkpoint
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 321) transaction. The current log write code enables us to do this easily with the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 322) way it separates the writing of the transaction contents (the log vectors) from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 323) the transaction commit record, but tracking this requires us to have a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 324) per-checkpoint context that travels through the log write process through to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 325) checkpoint completion.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 326)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 327) Hence a checkpoint has a context that tracks the state of the current
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 328) checkpoint from initiation to checkpoint completion. A new context is initiated
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 329) at the same time a checkpoint transaction is started. That is, when we remove
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 330) all the current items from the CIL during a checkpoint operation, we move all
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 331) those changes into the current checkpoint context. We then initialise a new
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 332) context and attach that to the CIL for aggregation of new transactions.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 333)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 334) This allows us to unlock the CIL immediately after transfer of all the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 335) committed items and effectively allow new transactions to be issued while we
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 336) are formatting the checkpoint into the log. It also allows concurrent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 337) checkpoints to be written into the log buffers in the case of log force heavy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 338) workloads, just like the existing transaction commit code does. This, however,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 339) requires that we strictly order the commit records in the log so that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 340) checkpoint sequence order is maintained during log replay.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 341)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 342) To ensure that we can be writing an item into a checkpoint transaction at
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 343) the same time another transaction modifies the item and inserts the log item
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 344) into the new CIL, then checkpoint transaction commit code cannot use log items
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 345) to store the list of log vectors that need to be written into the transaction.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 346) Hence log vectors need to be able to be chained together to allow them to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 347) detached from the log items. That is, when the CIL is flushed the memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 348) buffer and log vector attached to each log item needs to be attached to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 349) checkpoint context so that the log item can be released. In diagrammatic form,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 350) the CIL would look like this before the flush::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 351)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 352) CIL Head
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 353) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 354) V
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 355) Log Item <-> log vector 1 -> memory buffer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 356) | -> vector array
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 357) V
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 358) Log Item <-> log vector 2 -> memory buffer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 359) | -> vector array
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 360) V
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 361) ......
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 362) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 363) V
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 364) Log Item <-> log vector N-1 -> memory buffer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 365) | -> vector array
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 366) V
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 367) Log Item <-> log vector N -> memory buffer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 368) -> vector array
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 369)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 370) And after the flush the CIL head is empty, and the checkpoint context log
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 371) vector list would look like::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 372)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 373) Checkpoint Context
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 374) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 375) V
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 376) log vector 1 -> memory buffer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 377) | -> vector array
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 378) | -> Log Item
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 379) V
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 380) log vector 2 -> memory buffer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 381) | -> vector array
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 382) | -> Log Item
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 383) V
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 384) ......
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 385) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 386) V
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 387) log vector N-1 -> memory buffer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 388) | -> vector array
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 389) | -> Log Item
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 390) V
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 391) log vector N -> memory buffer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 392) -> vector array
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 393) -> Log Item
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 394)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 395) Once this transfer is done, the CIL can be unlocked and new transactions can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 396) start, while the checkpoint flush code works over the log vector chain to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 397) commit the checkpoint.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 398)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 399) Once the checkpoint is written into the log buffers, the checkpoint context is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 400) attached to the log buffer that the commit record was written to along with a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 401) completion callback. Log IO completion will call that callback, which can then
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 402) run transaction committed processing for the log items (i.e. insert into AIL
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 403) and unpin) in the log vector chain and then free the log vector chain and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 404) checkpoint context.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 405)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 406) Discussion Point: I am uncertain as to whether the log item is the most
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 407) efficient way to track vectors, even though it seems like the natural way to do
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 408) it. The fact that we walk the log items (in the CIL) just to chain the log
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 409) vectors and break the link between the log item and the log vector means that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 410) we take a cache line hit for the log item list modification, then another for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 411) the log vector chaining. If we track by the log vectors, then we only need to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 412) break the link between the log item and the log vector, which means we should
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 413) dirty only the log item cachelines. Normally I wouldn't be concerned about one
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 414) vs two dirty cachelines except for the fact I've seen upwards of 80,000 log
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 415) vectors in one checkpoint transaction. I'd guess this is a "measure and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 416) compare" situation that can be done after a working and reviewed implementation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 417) is in the dev tree....
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 418)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 419) Delayed Logging: Checkpoint Sequencing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 420) --------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 421)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 422) One of the key aspects of the XFS transaction subsystem is that it tags
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 423) committed transactions with the log sequence number of the transaction commit.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 424) This allows transactions to be issued asynchronously even though there may be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 425) future operations that cannot be completed until that transaction is fully
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 426) committed to the log. In the rare case that a dependent operation occurs (e.g.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 427) re-using a freed metadata extent for a data extent), a special, optimised log
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 428) force can be issued to force the dependent transaction to disk immediately.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 429)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 430) To do this, transactions need to record the LSN of the commit record of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 431) transaction. This LSN comes directly from the log buffer the transaction is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 432) written into. While this works just fine for the existing transaction
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 433) mechanism, it does not work for delayed logging because transactions are not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 434) written directly into the log buffers. Hence some other method of sequencing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 435) transactions is required.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 436)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 437) As discussed in the checkpoint section, delayed logging uses per-checkpoint
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 438) contexts, and as such it is simple to assign a sequence number to each
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 439) checkpoint. Because the switching of checkpoint contexts must be done
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 440) atomically, it is simple to ensure that each new context has a monotonically
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 441) increasing sequence number assigned to it without the need for an external
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 442) atomic counter - we can just take the current context sequence number and add
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 443) one to it for the new context.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 444)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 445) Then, instead of assigning a log buffer LSN to the transaction commit LSN
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 446) during the commit, we can assign the current checkpoint sequence. This allows
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 447) operations that track transactions that have not yet completed know what
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 448) checkpoint sequence needs to be committed before they can continue. As a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 449) result, the code that forces the log to a specific LSN now needs to ensure that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 450) the log forces to a specific checkpoint.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 451)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 452) To ensure that we can do this, we need to track all the checkpoint contexts
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 453) that are currently committing to the log. When we flush a checkpoint, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 454) context gets added to a "committing" list which can be searched. When a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 455) checkpoint commit completes, it is removed from the committing list. Because
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 456) the checkpoint context records the LSN of the commit record for the checkpoint,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 457) we can also wait on the log buffer that contains the commit record, thereby
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 458) using the existing log force mechanisms to execute synchronous forces.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 459)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 460) It should be noted that the synchronous forces may need to be extended with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 461) mitigation algorithms similar to the current log buffer code to allow
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 462) aggregation of multiple synchronous transactions if there are already
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 463) synchronous transactions being flushed. Investigation of the performance of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 464) current design is needed before making any decisions here.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 465)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 466) The main concern with log forces is to ensure that all the previous checkpoints
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 467) are also committed to disk before the one we need to wait for. Therefore we
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 468) need to check that all the prior contexts in the committing list are also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 469) complete before waiting on the one we need to complete. We do this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 470) synchronisation in the log force code so that we don't need to wait anywhere
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 471) else for such serialisation - it only matters when we do a log force.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 472)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 473) The only remaining complexity is that a log force now also has to handle the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 474) case where the forcing sequence number is the same as the current context. That
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 475) is, we need to flush the CIL and potentially wait for it to complete. This is a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 476) simple addition to the existing log forcing code to check the sequence numbers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 477) and push if required. Indeed, placing the current sequence checkpoint flush in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 478) the log force code enables the current mechanism for issuing synchronous
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 479) transactions to remain untouched (i.e. commit an asynchronous transaction, then
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 480) force the log at the LSN of that transaction) and so the higher level code
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 481) behaves the same regardless of whether delayed logging is being used or not.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 482)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 483) Delayed Logging: Checkpoint Log Space Accounting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 484) ------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 485)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 486) The big issue for a checkpoint transaction is the log space reservation for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 487) transaction. We don't know how big a checkpoint transaction is going to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 488) ahead of time, nor how many log buffers it will take to write out, nor the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 489) number of split log vector regions are going to be used. We can track the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 490) amount of log space required as we add items to the commit item list, but we
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 491) still need to reserve the space in the log for the checkpoint.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 492)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 493) A typical transaction reserves enough space in the log for the worst case space
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 494) usage of the transaction. The reservation accounts for log record headers,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 495) transaction and region headers, headers for split regions, buffer tail padding,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 496) etc. as well as the actual space for all the changed metadata in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 497) transaction. While some of this is fixed overhead, much of it is dependent on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 498) the size of the transaction and the number of regions being logged (the number
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 499) of log vectors in the transaction).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 500)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 501) An example of the differences would be logging directory changes versus logging
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 502) inode changes. If you modify lots of inode cores (e.g. ``chmod -R g+w *``), then
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 503) there are lots of transactions that only contain an inode core and an inode log
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 504) format structure. That is, two vectors totaling roughly 150 bytes. If we modify
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 505) 10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 506) vector is 12 bytes, so the total to be logged is approximately 1.75MB. In
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 507) comparison, if we are logging full directory buffers, they are typically 4KB
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 508) each, so we in 1.5MB of directory buffers we'd have roughly 400 buffers and a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 509) buffer format structure for each buffer - roughly 800 vectors or 1.51MB total
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 510) space. From this, it should be obvious that a static log space reservation is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 511) not particularly flexible and is difficult to select the "optimal value" for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 512) all workloads.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 513)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 514) Further, if we are going to use a static reservation, which bit of the entire
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 515) reservation does it cover? We account for space used by the transaction
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 516) reservation by tracking the space currently used by the object in the CIL and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 517) then calculating the increase or decrease in space used as the object is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 518) relogged. This allows for a checkpoint reservation to only have to account for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 519) log buffer metadata used such as log header records.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 520)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 521) However, even using a static reservation for just the log metadata is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 522) problematic. Typically log record headers use at least 16KB of log space per
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 523) 1MB of log space consumed (512 bytes per 32k) and the reservation needs to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 524) large enough to handle arbitrary sized checkpoint transactions. This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 525) reservation needs to be made before the checkpoint is started, and we need to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 526) be able to reserve the space without sleeping. For a 8MB checkpoint, we need a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 527) reservation of around 150KB, which is a non-trivial amount of space.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 528)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 529) A static reservation needs to manipulate the log grant counters - we can take a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 530) permanent reservation on the space, but we still need to make sure we refresh
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 531) the write reservation (the actual space available to the transaction) after
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 532) every checkpoint transaction completion. Unfortunately, if this space is not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 533) available when required, then the regrant code will sleep waiting for it.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 534)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 535) The problem with this is that it can lead to deadlocks as we may need to commit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 536) checkpoints to be able to free up log space (refer back to the description of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 537) rolling transactions for an example of this). Hence we *must* always have
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 538) space available in the log if we are to use static reservations, and that is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 539) very difficult and complex to arrange. It is possible to do, but there is a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 540) simpler way.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 541)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 542) The simpler way of doing this is tracking the entire log space used by the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 543) items in the CIL and using this to dynamically calculate the amount of log
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 544) space required by the log metadata. If this log metadata space changes as a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 545) result of a transaction commit inserting a new memory buffer into the CIL, then
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 546) the difference in space required is removed from the transaction that causes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 547) the change. Transactions at this level will *always* have enough space
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 548) available in their reservation for this as they have already reserved the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 549) maximal amount of log metadata space they require, and such a delta reservation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 550) will always be less than or equal to the maximal amount in the reservation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 551)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 552) Hence we can grow the checkpoint transaction reservation dynamically as items
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 553) are added to the CIL and avoid the need for reserving and regranting log space
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 554) up front. This avoids deadlocks and removes a blocking point from the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 555) checkpoint flush code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 556)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 557) As mentioned early, transactions can't grow to more than half the size of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 558) log. Hence as part of the reservation growing, we need to also check the size
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 559) of the reservation against the maximum allowed transaction size. If we reach
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 560) the maximum threshold, we need to push the CIL to the log. This is effectively
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 561) a "background flush" and is done on demand. This is identical to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 562) a CIL push triggered by a log force, only that there is no waiting for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 563) checkpoint commit to complete. This background push is checked and executed by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 564) transaction commit code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 565)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 566) If the transaction subsystem goes idle while we still have items in the CIL,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 567) they will be flushed by the periodic log force issued by the xfssyncd. This log
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 568) force will push the CIL to disk, and if the transaction subsystem stays idle,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 569) allow the idle log to be covered (effectively marked clean) in exactly the same
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 570) manner that is done for the existing logging method. A discussion point is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 571) whether this log force needs to be done more frequently than the current rate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 572) which is once every 30s.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 573)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 574)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 575) Delayed Logging: Log Item Pinning
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 576) ---------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 577)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 578) Currently log items are pinned during transaction commit while the items are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 579) still locked. This happens just after the items are formatted, though it could
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 580) be done any time before the items are unlocked. The result of this mechanism is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 581) that items get pinned once for every transaction that is committed to the log
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 582) buffers. Hence items that are relogged in the log buffers will have a pin count
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 583) for every outstanding transaction they were dirtied in. When each of these
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 584) transactions is completed, they will unpin the item once. As a result, the item
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 585) only becomes unpinned when all the transactions complete and there are no
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 586) pending transactions. Thus the pinning and unpinning of a log item is symmetric
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 587) as there is a 1:1 relationship with transaction commit and log item completion.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 588)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 589) For delayed logging, however, we have an asymmetric transaction commit to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 590) completion relationship. Every time an object is relogged in the CIL it goes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 591) through the commit process without a corresponding completion being registered.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 592) That is, we now have a many-to-one relationship between transaction commit and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 593) log item completion. The result of this is that pinning and unpinning of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 594) log items becomes unbalanced if we retain the "pin on transaction commit, unpin
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 595) on transaction completion" model.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 596)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 597) To keep pin/unpin symmetry, the algorithm needs to change to a "pin on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 598) insertion into the CIL, unpin on checkpoint completion". In other words, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 599) pinning and unpinning becomes symmetric around a checkpoint context. We have to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 600) pin the object the first time it is inserted into the CIL - if it is already in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 601) the CIL during a transaction commit, then we do not pin it again. Because there
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 602) can be multiple outstanding checkpoint contexts, we can still see elevated pin
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 603) counts, but as each checkpoint completes the pin count will retain the correct
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 604) value according to it's context.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 605)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 606) Just to make matters more slightly more complex, this checkpoint level context
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 607) for the pin count means that the pinning of an item must take place under the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 608) CIL commit/flush lock. If we pin the object outside this lock, we cannot
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 609) guarantee which context the pin count is associated with. This is because of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 610) the fact pinning the item is dependent on whether the item is present in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 611) current CIL or not. If we don't pin the CIL first before we check and pin the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 612) object, we have a race with CIL being flushed between the check and the pin
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 613) (or not pinning, as the case may be). Hence we must hold the CIL flush/commit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 614) lock to guarantee that we pin the items correctly.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 615)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 616) Delayed Logging: Concurrent Scalability
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 617) ---------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 618)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 619) A fundamental requirement for the CIL is that accesses through transaction
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 620) commits must scale to many concurrent commits. The current transaction commit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 621) code does not break down even when there are transactions coming from 2048
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 622) processors at once. The current transaction code does not go any faster than if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 623) there was only one CPU using it, but it does not slow down either.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 624)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 625) As a result, the delayed logging transaction commit code needs to be designed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 626) for concurrency from the ground up. It is obvious that there are serialisation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 627) points in the design - the three important ones are:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 628)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 629) 1. Locking out new transaction commits while flushing the CIL
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 630) 2. Adding items to the CIL and updating item space accounting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 631) 3. Checkpoint commit ordering
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 632)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 633) Looking at the transaction commit and CIL flushing interactions, it is clear
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 634) that we have a many-to-one interaction here. That is, the only restriction on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 635) the number of concurrent transactions that can be trying to commit at once is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 636) the amount of space available in the log for their reservations. The practical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 637) limit here is in the order of several hundred concurrent transactions for a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 638) 128MB log, which means that it is generally one per CPU in a machine.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 639)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 640) The amount of time a transaction commit needs to hold out a flush is a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 641) relatively long period of time - the pinning of log items needs to be done
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 642) while we are holding out a CIL flush, so at the moment that means it is held
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 643) across the formatting of the objects into memory buffers (i.e. while memcpy()s
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 644) are in progress). Ultimately a two pass algorithm where the formatting is done
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 645) separately to the pinning of objects could be used to reduce the hold time of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 646) the transaction commit side.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 647)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 648) Because of the number of potential transaction commit side holders, the lock
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 649) really needs to be a sleeping lock - if the CIL flush takes the lock, we do not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 650) want every other CPU in the machine spinning on the CIL lock. Given that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 651) flushing the CIL could involve walking a list of tens of thousands of log
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 652) items, it will get held for a significant time and so spin contention is a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 653) significant concern. Preventing lots of CPUs spinning doing nothing is the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 654) main reason for choosing a sleeping lock even though nothing in either the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 655) transaction commit or CIL flush side sleeps with the lock held.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 656)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 657) It should also be noted that CIL flushing is also a relatively rare operation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 658) compared to transaction commit for asynchronous transaction workloads - only
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 659) time will tell if using a read-write semaphore for exclusion will limit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 660) transaction commit concurrency due to cache line bouncing of the lock on the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 661) read side.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 662)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 663) The second serialisation point is on the transaction commit side where items
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 664) are inserted into the CIL. Because transactions can enter this code
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 665) concurrently, the CIL needs to be protected separately from the above
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 666) commit/flush exclusion. It also needs to be an exclusive lock but it is only
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 667) held for a very short time and so a spin lock is appropriate here. It is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 668) possible that this lock will become a contention point, but given the short
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 669) hold time once per transaction I think that contention is unlikely.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 670)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 671) The final serialisation point is the checkpoint commit record ordering code
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 672) that is run as part of the checkpoint commit and log force sequencing. The code
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 673) path that triggers a CIL flush (i.e. whatever triggers the log force) will enter
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 674) an ordering loop after writing all the log vectors into the log buffers but
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 675) before writing the commit record. This loop walks the list of committing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 676) checkpoints and needs to block waiting for checkpoints to complete their commit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 677) record write. As a result it needs a lock and a wait variable. Log force
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 678) sequencing also requires the same lock, list walk, and blocking mechanism to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 679) ensure completion of checkpoints.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 680)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 681) These two sequencing operations can use the mechanism even though the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 682) events they are waiting for are different. The checkpoint commit record
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 683) sequencing needs to wait until checkpoint contexts contain a commit LSN
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 684) (obtained through completion of a commit record write) while log force
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 685) sequencing needs to wait until previous checkpoint contexts are removed from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 686) the committing list (i.e. they've completed). A simple wait variable and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 687) broadcast wakeups (thundering herds) has been used to implement these two
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 688) serialisation queues. They use the same lock as the CIL, too. If we see too
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 689) much contention on the CIL lock, or too many context switches as a result of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 690) the broadcast wakeups these operations can be put under a new spinlock and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 691) given separate wait lists to reduce lock contention and the number of processes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 692) woken by the wrong event.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 693)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 694)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 695) Lifecycle Changes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 696) -----------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 697)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 698) The existing log item life cycle is as follows::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 699)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 700) 1. Transaction allocate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 701) 2. Transaction reserve
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 702) 3. Lock item
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 703) 4. Join item to transaction
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 704) If not already attached,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 705) Allocate log item
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 706) Attach log item to owner item
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 707) Attach log item to transaction
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 708) 5. Modify item
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 709) Record modifications in log item
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 710) 6. Transaction commit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 711) Pin item in memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 712) Format item into log buffer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 713) Write commit LSN into transaction
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 714) Unlock item
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 715) Attach transaction to log buffer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 716)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 717) <log buffer IO dispatched>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 718) <log buffer IO completes>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 719)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 720) 7. Transaction completion
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 721) Mark log item committed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 722) Insert log item into AIL
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 723) Write commit LSN into log item
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 724) Unpin log item
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 725) 8. AIL traversal
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 726) Lock item
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 727) Mark log item clean
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 728) Flush item to disk
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 729)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 730) <item IO completion>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 731)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 732) 9. Log item removed from AIL
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 733) Moves log tail
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 734) Item unlocked
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 735)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 736) Essentially, steps 1-6 operate independently from step 7, which is also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 737) independent of steps 8-9. An item can be locked in steps 1-6 or steps 8-9
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 738) at the same time step 7 is occurring, but only steps 1-6 or 8-9 can occur
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 739) at the same time. If the log item is in the AIL or between steps 6 and 7
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 740) and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 741) are entered and completed is the object considered clean.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 742)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 743) With delayed logging, there are new steps inserted into the life cycle::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 744)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 745) 1. Transaction allocate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 746) 2. Transaction reserve
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 747) 3. Lock item
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 748) 4. Join item to transaction
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 749) If not already attached,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 750) Allocate log item
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 751) Attach log item to owner item
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 752) Attach log item to transaction
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 753) 5. Modify item
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 754) Record modifications in log item
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 755) 6. Transaction commit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 756) Pin item in memory if not pinned in CIL
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 757) Format item into log vector + buffer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 758) Attach log vector and buffer to log item
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 759) Insert log item into CIL
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 760) Write CIL context sequence into transaction
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 761) Unlock item
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 762)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 763) <next log force>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 764)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 765) 7. CIL push
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 766) lock CIL flush
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 767) Chain log vectors and buffers together
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 768) Remove items from CIL
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 769) unlock CIL flush
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 770) write log vectors into log
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 771) sequence commit records
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 772) attach checkpoint context to log buffer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 773)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 774) <log buffer IO dispatched>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 775) <log buffer IO completes>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 776)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 777) 8. Checkpoint completion
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 778) Mark log item committed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 779) Insert item into AIL
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 780) Write commit LSN into log item
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 781) Unpin log item
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 782) 9. AIL traversal
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 783) Lock item
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 784) Mark log item clean
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 785) Flush item to disk
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 786) <item IO completion>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 787) 10. Log item removed from AIL
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 788) Moves log tail
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 789) Item unlocked
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 790)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 791) From this, it can be seen that the only life cycle differences between the two
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 792) logging methods are in the middle of the life cycle - they still have the same
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 793) beginning and end and execution constraints. The only differences are in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 794) committing of the log items to the log itself and the completion processing.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 795) Hence delayed logging should not introduce any constraints on log item
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 796) behaviour, allocation or freeing that don't already exist.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 797)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 798) As a result of this zero-impact "insertion" of delayed logging infrastructure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 799) and the design of the internal structures to avoid on disk format changes, we
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 800) can basically switch between delayed logging and the existing mechanism with a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 801) mount option. Fundamentally, there is no reason why the log manager would not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 802) be able to swap methods automatically and transparently depending on load
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 803) characteristics, but this should not be necessary if delayed logging works as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 804) designed.