Orange Pi5 kernel

^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  1) .. SPDX-License-Identifier: GPL-2.0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  2) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  3) Block and Inode Allocation Policy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  4) ---------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  5) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  6) ext4 recognizes (better than ext3, anyway) that data locality is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  7) generally a desirable quality of a filesystem. On a spinning disk,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  8) keeping related blocks near each other reduces the amount of movement
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  9) that the head actuator and disk must perform to access a data block,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10) thus speeding up disk IO. On an SSD there of course are no moving parts,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11) but locality can increase the size of each transfer request while
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12) reducing the total number of requests. This locality may also have the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) effect of concentrating writes on a single erase block, which can speed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14) up file rewrites significantly. Therefore, it is useful to reduce
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15) fragmentation whenever possible.
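
To make the "fewer, larger requests" point concrete, here is a minimal sketch that merges runs of consecutive block numbers into (start, length) requests. The function name ``coalesce_requests`` and the block numbers are purely illustrative and are not part of ext4; real request merging happens in the block layer and is considerably more involved.

```python
def coalesce_requests(block_numbers):
    """Merge runs of consecutive block numbers into (start, length) requests."""
    requests = []
    for block in sorted(set(block_numbers)):
        if requests and block == requests[-1][0] + requests[-1][1]:
            # Block directly follows the previous run: grow that request.
            start, length = requests[-1]
            requests[-1] = (start, length + 1)
        else:
            # Gap on disk: a new request is needed.
            requests.append((block, 1))
    return requests


# Eight blocks laid out contiguously versus the same amount of data
# scattered in four two-block fragments (hypothetical block numbers).
contiguous = coalesce_requests(range(1000, 1008))
fragmented = coalesce_requests([1000, 1001, 5000, 5001, 9000, 9001, 12000, 12001])

print(contiguous)       # [(1000, 8)]
print(len(fragmented))  # 4
```

The same eight blocks cost one request when contiguous and four when fragmented, which is the effect described above for both spinning disks and SSDs.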
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17) The first tool that ext4 uses to combat fragmentation is the multi-block
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18) allocator. When a file is first created, the block allocator
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) speculatively allocates 8KiB of disk space to the file on the assumption
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20) that the space will get written soon. When the file is closed, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21) unused speculative allocations are of course freed, but if the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22) speculation is correct (typically the case for full writes of small
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) files) then the file data gets written out in a single multi-block
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24) extent. A second related trick that ext4 uses is delayed allocation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25) Under this scheme, when a file needs more blocks to absorb file writes,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26) the filesystem defers deciding the exact placement on the disk until all
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27) the dirty buffers are being written out to disk. By not committing to a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28) particular placement until it's absolutely necessary (the commit timeout
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29) is hit, or sync() is called, or the kernel runs out of memory), the hope
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30) is that the filesystem can make better location decisions.
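
The benefit of delayed allocation can be illustrated with a toy model. The sketch below is not how ext4 is implemented; ``ToyDisk`` and the helper functions are hypothetical names, and the "disk" simply hands out the next free block numbers. It compares placing blocks at the moment each write arrives against waiting until flush time, when the total size of each file is known.

```python
class ToyDisk:
    """Hands out block numbers sequentially (a deliberately naive allocator)."""

    def __init__(self):
        self.next_free = 0

    def allocate(self, count):
        start = self.next_free
        self.next_free += count
        return list(range(start, start + count))


def count_extents(blocks):
    """Number of contiguous runs in a file's block list."""
    if not blocks:
        return 0
    return 1 + sum(1 for a, b in zip(blocks, blocks[1:]) if b != a + 1)


def immediate_allocation(writes):
    """Pick on-disk blocks as soon as each write arrives."""
    disk, files = ToyDisk(), {}
    for name, nblocks in writes:
        files.setdefault(name, []).extend(disk.allocate(nblocks))
    return files


def delayed_allocation(writes):
    """Only count dirty blocks per file; place them all at flush time."""
    disk, pending = ToyDisk(), {}
    for name, nblocks in writes:
        pending[name] = pending.get(name, 0) + nblocks
    # Flush: each file is placed with a single allocation.
    return {name: disk.allocate(total) for name, total in pending.items()}


# Two files written in an interleaved fashion, one block at a time.
writes = [("a", 1), ("b", 1)] * 4

print(count_extents(immediate_allocation(writes)["a"]))  # 4
print(count_extents(delayed_allocation(writes)["a"]))    # 1
```

With immediate placement the two interleaved writers fragment each other; deferring the decision lets each file land in a single extent.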
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32) The third trick that ext4 (and ext3) uses is that it tries to keep a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) file's data blocks in the same block group as its inode. This cuts down
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34) on the seek penalty when the filesystem first has to read a file's inode
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35) to learn where the file's data blocks live and then seek over to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) file's data blocks to begin I/O operations.
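
The arithmetic behind "same block group as its inode" can be sketched as follows. Inode numbers start at 1, so the group holding an inode is ``(ino - 1) / inodes_per_group``, and the first block of a group is the group number times the blocks per group, offset by the first data block. The constants below are assumed example values (a 4KiB-block filesystem with 8192 inodes per group), not values read from a real superblock, and the real allocator refines its goal well beyond the start of the group.

```python
INODES_PER_GROUP = 8192    # assumption: example value, set at mkfs time
BLOCKS_PER_GROUP = 32768   # assumption: 4KiB blocks, 128MiB groups
FIRST_DATA_BLOCK = 0       # 0 when the block size is larger than 1KiB


def inode_group(ino):
    """Block group that holds inode number `ino` (inode numbers start at 1)."""
    return (ino - 1) // INODES_PER_GROUP


def group_first_block(group):
    """First block number belonging to block group `group`."""
    return FIRST_DATA_BLOCK + group * BLOCKS_PER_GROUP


def goal_block(ino):
    """Starting point for a search that keeps data near the file's inode."""
    return group_first_block(inode_group(ino))


print(inode_group(12), goal_block(12))        # 0 0
print(inode_group(8193), goal_block(8193))    # 1 32768
print(inode_group(20000), goal_block(20000))  # 2 65536
```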
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) The fourth trick is that all the inodes in a directory are placed in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39) same block group as the directory, when feasible. The working assumption
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40) here is that all the files in a directory might be related, so it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41) is useful to try to keep them all together.
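
A simplified sketch of the "when feasible" part: try the parent directory's block group first, and only look elsewhere if that group has no free inodes. The function name ``pick_group_for_file`` and the plain linear probe used as the fallback are assumptions made for illustration; the kernel's actual search order is not reproduced here.

```python
def pick_group_for_file(parent_group, free_inodes):
    """Return a block group for a new file's inode.

    `free_inodes[g]` is the number of free inodes in group g. The parent
    directory's group is preferred; otherwise the following groups are
    probed in order, wrapping around. Returns None if every group is full.
    """
    ngroups = len(free_inodes)
    for offset in range(ngroups):
        group = (parent_group + offset) % ngroups
        if free_inodes[group] > 0:
            return group
    return None


print(pick_group_for_file(2, [5, 5, 3, 5]))  # 2 (parent's group has room)
print(pick_group_for_file(2, [5, 5, 0, 0]))  # 0 (groups 2 and 3 are full)
```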
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43) The fifth trick is that the disk volume is cut up into 128MiB block
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) groups (with 4KiB blocks); these mini-containers are used as outlined above to try to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45) maintain data locality. However, there is a deliberate quirk -- when a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) directory is created in the root directory, the inode allocator scans
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47) the block groups and puts that directory into the least heavily loaded
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48) block group that it can find. This encourages directories to spread out
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49) over a disk; as the top-level directory/file blobs fill up one block
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) group, the allocators simply move on to the next block group. Allegedly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51) this scheme evens out the loading on the block groups, though the author
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52) suspects that the directories which are so unlucky as to land towards
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53) the end of a spinning drive get a raw deal performance-wise.
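
Two pieces of this paragraph lend themselves to a short sketch. First, the block group size follows from the block bitmap: one bitmap block holds ``8 * block_size`` bits, one per block, so a group spans ``8 * block_size * block_size`` bytes. Second, the spreading of top-level directories can be modelled as choosing the least loaded group. The text does not define "least heavily loaded", so the sketch assumes it means the most free inodes, with free blocks as a tie-breaker; ``least_loaded_group`` is a hypothetical helper, not the kernel's allocator.

```python
def block_group_bytes(block_size):
    """Bytes covered by one block group: one bitmap bit per block."""
    blocks_per_group = 8 * block_size
    return blocks_per_group * block_size


def least_loaded_group(groups):
    """Pick a group for a new top-level directory.

    `groups` is a list of (free_inodes, free_blocks) tuples. Assumption:
    "least loaded" means most free inodes, then most free blocks.
    """
    return max(range(len(groups)), key=lambda g: groups[g])


print(block_group_bytes(4096) // (1024 * 1024))  # 128 (MiB, with 4KiB blocks)
print(block_group_bytes(1024) // (1024 * 1024))  # 8 (MiB, with 1KiB blocks)

# Hypothetical per-group (free_inodes, free_blocks) counts.
groups = [(100, 2000), (8000, 30000), (8000, 31000), (50, 100)]
print(least_loaded_group(groups))  # 2
```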
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) Of course if all of these mechanisms fail, one can always use e4defrag
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56) to defragment files.