Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

3 Commits   0 Branches   0 Tags
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   1) Path walking and name lookup locking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   2) ====================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   3) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   4) Path resolution is the finding a dentry corresponding to a path name string, by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   5) performing a path walk. Typically, for every open(), stat() etc., the path name
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   6) will be resolved. Paths are resolved by walking the namespace tree, starting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   7) with the first component of the pathname (eg. root or cwd) with a known dentry,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   8) then finding the child of that dentry, which is named the next component in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   9) path string. Then repeating the lookup from the child dentry and finding its
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  10) child with the next element, and so on.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  11) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  12) Since it is a frequent operation for workloads like multiuser environments and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  13) web servers, it is important to optimize this code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  14) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  15) Path walking synchronisation history:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  16) Prior to 2.5.10, dcache_lock was acquired in d_lookup (dcache hash lookup) and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  17) thus in every component during path look-up. Since 2.5.10 onwards, fast-walk
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  18) algorithm changed this by holding the dcache_lock at the beginning and walking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  19) as many cached path component dentries as possible. This significantly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  20) decreases the number of acquisition of dcache_lock. However it also increases
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  21) the lock hold time significantly and affects performance in large SMP machines.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  22) Since 2.5.62 kernel, dcache has been using a new locking model that uses RCU to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  23) make dcache look-up lock-free.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  24) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  25) All the above algorithms required taking a lock and reference count on the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  26) dentry that was looked up, so that may be used as the basis for walking the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  27) next path element. This is inefficient and unscalable. It is inefficient
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  28) because of the locks and atomic operations required for every dentry element
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  29) slows things down. It is not scalable because many parallel applications that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  30) are path-walk intensive tend to do path lookups starting from a common dentry
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  31) (usually, the root "/" or current working directory). So contention on these
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  32) common path elements causes lock and cacheline queueing.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  33) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  34) Since 2.6.38, RCU is used to make a significant part of the entire path walk
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  35) (including dcache look-up) completely "store-free" (so, no locks, atomics, or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  36) even stores into cachelines of common dentries). This is known as "rcu-walk"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  37) path walking.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  38) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  39) Path walking overview
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  40) =====================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  41) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  42) A name string specifies a start (root directory, cwd, fd-relative) and a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  43) sequence of elements (directory entry names), which together refer to a path in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  44) the namespace. A path is represented as a (dentry, vfsmount) tuple. The name
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  45) elements are sub-strings, separated by '/'.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  46) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  47) Name lookups will want to find a particular path that a name string refers to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  48) (usually the final element, or parent of final element). This is done by taking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  49) the path given by the name's starting point (which we know in advance -- eg.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  50) current->fs->cwd or current->fs->root) as the first parent of the lookup. Then
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  51) iteratively for each subsequent name element, look up the child of the current
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  52) parent with the given name and if it is not the desired entry, make it the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  53) parent for the next lookup.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  54) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  55) A parent, of course, must be a directory, and we must have appropriate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  56) permissions on the parent inode to be able to walk into it.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  57) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  58) Turning the child into a parent for the next lookup requires more checks and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  59) procedures. Symlinks essentially substitute the symlink name for the target
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  60) name in the name string, and require some recursive path walking.  Mount points
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  61) must be followed into (thus changing the vfsmount that subsequent path elements
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  62) refer to), switching from the mount point path to the root of the particular
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  63) mounted vfsmount. These behaviours are variously modified depending on the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  64) exact path walking flags.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  65) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  66) Path walking then must, broadly, do several particular things:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  67) - find the start point of the walk;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  68) - perform permissions and validity checks on inodes;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  69) - perform dcache hash name lookups on (parent, name element) tuples;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  70) - traverse mount points;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  71) - traverse symlinks;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  72) - lookup and create missing parts of the path on demand.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  73) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  74) Safe store-free look-up of dcache hash table
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  75) ============================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  76) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  77) Dcache name lookup
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  78) ------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  79) In order to lookup a dcache (parent, name) tuple, we take a hash on the tuple
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  80) and use that to select a bucket in the dcache-hash table. The list of entries
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  81) in that bucket is then walked, and we do a full comparison of each entry
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  82) against our (parent, name) tuple.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  83) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  84) The hash lists are RCU protected, so list walking is not serialised with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  85) concurrent updates (insertion, deletion from the hash). This is a standard RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  86) list application with the exception of renames, which will be covered below.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  87) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  88) Parent and name members of a dentry, as well as its membership in the dcache
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  89) hash, and its inode are protected by the per-dentry d_lock spinlock. A
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  90) reference is taken on the dentry (while the fields are verified under d_lock),
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  91) and this stabilises its d_inode pointer and actual inode. This gives a stable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  92) point to perform the next step of our path walk against.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  93) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  94) These members are also protected by d_seq seqlock, although this offers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  95) read-only protection and no durability of results, so care must be taken when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  96) using d_seq for synchronisation (see seqcount based lookups, below).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  97) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  98) Renames
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  99) -------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) Back to the rename case. In usual RCU protected lists, the only operations that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) will happen to an object is insertion, and then eventually removal from the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) list. The object will not be reused until an RCU grace period is complete.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) This ensures the RCU list traversal primitives can run over the object without
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) problems (see RCU documentation for how this works).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) However when a dentry is renamed, its hash value can change, requiring it to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) moved to a new hash list. Allocating and inserting a new alias would be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) expensive and also problematic for directory dentries. Latency would be far to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) high to wait for a grace period after removing the dentry and before inserting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) it in the new hash bucket. So what is done is to insert the dentry into the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) new list immediately.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) However, when the dentry's list pointers are updated to point to objects in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) new list before waiting for a grace period, this can result in a concurrent RCU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) lookup of the old list veering off into the new (incorrect) list and missing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) the remaining dentries on the list.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) There is no fundamental problem with walking down the wrong list, because the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) dentry comparisons will never match. However it is fatal to miss a matching
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) dentry. So a seqlock is used to detect when a rename has occurred, and so the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) lookup can be retried.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123)          1      2      3
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124)         +---+  +---+  +---+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) hlist-->| N-+->| N-+->| N-+->
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) head <--+-P |<-+-P |<-+-P |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127)         +---+  +---+  +---+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) Rename of dentry 2 may require it deleted from the above list, and inserted
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) into a new list. Deleting 2 gives the following list.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132)          1             3
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133)         +---+         +---+     (don't worry, the longer pointers do not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) hlist-->| N-+-------->| N-+->    impose a measurable performance overhead
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) head <--+-P |<--------+-P |      on modern CPUs)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136)         +---+         +---+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137)           ^      2      ^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138)           |    +---+    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139)           |    | N-+----+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140)           +----+-P |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141)                +---+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) This is a standard RCU-list deletion, which leaves the deleted object's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) pointers intact, so a concurrent list walker that is currently looking at
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) object 2 will correctly continue to object 3 when it is time to traverse the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146) next object.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) However, when inserting object 2 onto a new list, we end up with this:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150)          1             3
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151)         +---+         +---+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152) hlist-->| N-+-------->| N-+->
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153) head <--+-P |<--------+-P |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154)         +---+         +---+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155)                  2
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156)                +---+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157)                | N-+---->
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158)           <----+-P |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159)                +---+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) Because we didn't wait for a grace period, there may be a concurrent lookup
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162) still at 2. Now when it follows 2's 'next' pointer, it will walk off into
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) another list without ever having checked object 3.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) A related, but distinctly different, issue is that of rename atomicity versus
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) lookup operations. If a file is renamed from 'A' to 'B', a lookup must only
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) find either 'A' or 'B'. So if a lookup of 'A' returns NULL, a subsequent lookup
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) of 'B' must succeed (note the reverse is not true).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) Between deleting the dentry from the old hash list, and inserting it on the new
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171) hash list, a lookup may find neither 'A' nor 'B' matching the dentry. The same
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) rename seqlock is also used to cover this race in much the same way, by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) retrying a negative lookup result if a rename was in progress.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) Seqcount based lookups
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) ----------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177) In refcount based dcache lookups, d_lock is used to serialise access to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178) the dentry, stabilising it while comparing its name and parent and then
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179) taking a reference count (the reference count then gives a stable place to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180) start the next part of the path walk from).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182) As explained above, we would like to do path walking without taking locks or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183) reference counts on intermediate dentries along the path. To do this, a per
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184) dentry seqlock (d_seq) is used to take a "coherent snapshot" of what the dentry
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185) looks like (its name, parent, and inode). That snapshot is then used to start
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186) the next part of the path walk. When loading the coherent snapshot under d_seq,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187) care must be taken to load the members up-front, and use those pointers rather
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188) than reloading from the dentry later on (otherwise we'd have interesting things
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189) like d_inode going NULL underneath us, if the name was unlinked).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191) Also important is to avoid performing any destructive operations (pretty much:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192) no non-atomic stores to shared data), and to recheck the seqcount when we are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193) "done" with the operation. Retry or abort if the seqcount does not match.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194) Avoiding destructive or changing operations means we can easily unwind from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195) failure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197) What this means is that a caller, provided they are holding RCU lock to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198) protect the dentry object from disappearing, can perform a seqcount based
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199) lookup which does not increment the refcount on the dentry or write to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200) it in any way. This returned dentry can be used for subsequent operations,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201) provided that d_seq is rechecked after that operation is complete.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203) Inodes are also rcu freed, so the seqcount lookup dentry's inode may also be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204) queried for permissions.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 206) With this two parts of the puzzle, we can do path lookups without taking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 207) locks or refcounts on dentry elements.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 208) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 209) RCU-walk path walking design
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 210) ============================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 211) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 212) Path walking code now has two distinct modes, ref-walk and rcu-walk. ref-walk
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 213) is the traditional[*] way of performing dcache lookups using d_lock to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 214) serialise concurrent modifications to the dentry and take a reference count on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 215) it. ref-walk is simple and obvious, and may sleep, take locks, etc while path
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 216) walking is operating on each dentry. rcu-walk uses seqcount based dentry
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 217) lookups, and can perform lookup of intermediate elements without any stores to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 218) shared data in the dentry or inode. rcu-walk can not be applied to all cases,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 219) eg. if the filesystem must sleep or perform non trivial operations, rcu-walk
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 220) must be switched to ref-walk mode.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 221) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 222) [*] RCU is still used for the dentry hash lookup in ref-walk, but not the full
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 223)     path walk.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 224) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 225) Where ref-walk uses a stable, refcounted ``parent'' to walk the remaining
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 226) path string, rcu-walk uses a d_seq protected snapshot. When looking up a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 227) child of this parent snapshot, we open d_seq critical section on the child
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 228) before closing d_seq critical section on the parent. This gives an interlocking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 229) ladder of snapshots to walk down.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 230) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 231) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 232)      proc 101
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 233)       /----------------\
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 234)      / comm:    "vi"    \
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 235)     /  fs.root: dentry0  \
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 236)     \  fs.cwd:  dentry2  /
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 237)      \                  /
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 238)       \----------------/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 239) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 240) So when vi wants to open("/home/npiggin/test.c", O_RDWR), then it will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 241) start from current->fs->root, which is a pinned dentry. Alternatively,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 242) "./test.c" would start from cwd; both names refer to the same path in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 243) the context of proc101.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 244) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 245)      dentry 0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 246)     +---------------------+   rcu-walk begins here, we note d_seq, check the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 247)     | name:    "/"        |   inode's permission, and then look up the next
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 248)     | inode:   10         |   path element which is "home"...
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 249)     | children:"home", ...|
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 250)     +---------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 251)               |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 252)      dentry 1 V
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 253)     +---------------------+   ... which brings us here. We find dentry1 via
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 254)     | name:    "home"     |   hash lookup, then note d_seq and compare name
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 255)     | inode:   678        |   string and parent pointer. When we have a match,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 256)     | children:"npiggin"  |   we now recheck the d_seq of dentry0. Then we
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 257)     +---------------------+   check inode and look up the next element.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 258)               |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 259)      dentry2  V
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 260)     +---------------------+   Note: if dentry0 is now modified, lookup is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 261)     | name:    "npiggin"  |   not necessarily invalid, so we need only keep a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 262)     | inode:   543        |   parent for d_seq verification, and grandparents
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 263)     | children:"a.c", ... |   can be forgotten.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 264)     +---------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 265)               |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 266)      dentry3  V
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 267)     +---------------------+   At this point we have our destination dentry.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 268)     | name:    "a.c"      |   We now take its d_lock, verify d_seq of this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 269)     | inode:   14221      |   dentry. If that checks out, we can increment
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 270)     | children:NULL       |   its refcount because we're holding d_lock.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 271)     +---------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 272) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 273) Taking a refcount on a dentry from rcu-walk mode, by taking its d_lock,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 274) re-checking its d_seq, and then incrementing its refcount is called
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 275) "dropping rcu" or dropping from rcu-walk into ref-walk mode.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 276) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 277) It is, in some sense, a bit of a house of cards. If the seqcount check of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 278) parent snapshot fails, the house comes down, because we had closed the d_seq
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 279) section on the grandparent, so we have nothing left to stand on. In that case,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 280) the path walk must be fully restarted (which we do in ref-walk mode, to avoid
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 281) live locks). It is costly to have a full restart, but fortunately they are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 282) quite rare.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 283) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 284) When we reach a point where sleeping is required, or a filesystem callout
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 285) requires ref-walk, then instead of restarting the walk, we attempt to drop rcu
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 286) at the last known good dentry we have. Avoiding a full restart in ref-walk in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 287) these cases is fundamental for performance and scalability because blocking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 288) operations such as creates and unlinks are not uncommon.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 289) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 290) The detailed design for rcu-walk is like this:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 291) * LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 292) * Take the RCU lock for the entire path walk, starting with the acquiring
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 293)   of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 294)   not required for dentry persistence.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 295) * synchronize_rcu is called when unregistering a filesystem, so we can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 296)   access d_ops and i_ops during rcu-walk.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 297) * Similarly take the vfsmount lock for the entire path walk. So now mnt
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 298)   refcounts are not required for persistence. Also we are free to perform mount
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 299)   lookups, and to assume dentry mount points and mount roots are stable up and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 300)   down the path.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 301) * Have a per-dentry seqlock to protect the dentry name, parent, and inode,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 302)   so we can load this tuple atomically, and also check whether any of its
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 303)   members have changed.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 304) * Dentry lookups (based on parent, candidate string tuple) recheck the parent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 305)   sequence after the child is found in case anything changed in the parent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 306)   during the path walk.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 307) * inode is also RCU protected so we can load d_inode and use the inode for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 308)   limited things.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 309) * i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 310) * i_op can be loaded.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 311) * When the destination dentry is reached, drop rcu there (ie. take d_lock,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 312)   verify d_seq, increment refcount).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 313) * If seqlock verification fails anywhere along the path, do a full restart
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 314)   of the path lookup in ref-walk mode. -ECHILD tends to be used (for want of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 315)   a better errno) to signal an rcu-walk failure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 316) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 317) The cases where rcu-walk cannot continue are:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 318) * NULL dentry (ie. any uncached path element)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 319) * Following links
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 320) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 321) It may be possible eventually to make following links rcu-walk aware.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 322) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 323) Uncached path elements will always require dropping to ref-walk mode, at the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 324) very least because i_mutex needs to be grabbed, and objects allocated.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 325) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 326) Final note:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 327) "store-free" path walking is not strictly store free. We take vfsmount lock
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 328) and refcounts (both of which can be made per-cpu), and we also store to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 329) stack (which is essentially CPU-local), and we also have to take locks and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 330) refcount on final dentry.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 331) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 332) The point is that shared data, where practically possible, is not locked
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 333) or stored into. The result is massive improvements in performance and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 334) scalability of path resolution.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 335) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 336) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 337) Interesting statistics
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 338) ======================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 339) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 340) The following table gives rcu lookup statistics for a few simple workloads
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 341) (2s12c24t Westmere, debian non-graphical system). Ungraceful are attempts to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 342) drop rcu that fail due to d_seq failure and requiring the entire path lookup
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 343) again. Other cases are successful rcu-drops that are required before the final
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 344) element, nodentry for missing dentry, revalidate for filesystem revalidate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 345) routine requiring rcu drop, permission for permission check requiring drop,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 346) and link for symlink traversal requiring drop.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 347) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 348)      rcu-lookups     restart  nodentry          link  revalidate  permission
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 349) bootup     47121           0      4624          1010       10283        7852
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 350) dbench  25386793           0   6778659(26.7%)     55         549        1156
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 351) kbuild   2696672          10     64442(2.3%)  108764(4.0%)     1        1590
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 352) git diff   39605           0        28             2           0         106
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 353) vfstest 24185492        4945    708725(2.9%) 1076136(4.4%)     0        2651
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 354) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 355) What this shows is that failed rcu-walk lookups, ie. ones that are restarted
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 356) entirely with ref-walk, are quite rare. Even the "vfstest" case which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 357) specifically has concurrent renames/mkdir/rmdir/ creat/unlink/etc to exercise
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 358) such races is not showing a huge amount of restarts.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 359) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 360) Dropping from rcu-walk to ref-walk mean that we have encountered a dentry where
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 361) the reference count needs to be taken for some reason. This is either because
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 362) we have reached the target of the path walk, or because we have encountered a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 363) condition that can't be resolved in rcu-walk mode.  Ideally, we drop rcu-walk
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 364) only when we have reached the target dentry, so the other statistics show where
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 365) this does not happen.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 366) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 367) Note that a graceful drop from rcu-walk mode due to something such as the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 368) dentry not existing (which can be common) is not necessarily a failure of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 369) rcu-walk scheme, because some elements of the path may have been walked in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 370) rcu-walk mode. The further we get from common path elements (such as cwd or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 371) root), the less contended the dentry is likely to be. The closer we are to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 372) common path elements, the more likely they will exist in dentry cache.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 373) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 374) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 375) Papers and other documentation on dcache locking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 376) ================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 377) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 378) 1. Scaling dcache with RCU (https://linuxjournal.com/article.php?sid=7124).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 379) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 380) 2. http://lse.sourceforge.net/locking/dcache/dcache.html
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 381) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 382) 3. path-lookup.md in this directory.