Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

3 Commits   0 Branches   0 Tags
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   1) =================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   2) Directory Locking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   3) =================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   4) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   5) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   6) Locking scheme used for directory operations is based on two
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   7) kinds of locks - per-inode (->i_rwsem) and per-filesystem
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   8) (->s_vfs_rename_mutex).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   9) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  10) When taking the i_rwsem on multiple non-directory objects, we
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  11) always acquire the locks in order by increasing address.  We'll call
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  12) that "inode pointer" order in the following.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  13) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  14) For our purposes all operations fall in 5 classes:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  15) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  16) 1) read access.  Locking rules: caller locks directory we are accessing.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  17) The lock is taken shared.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  18) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  19) 2) object creation.  Locking rules: same as above, but the lock is taken
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  20) exclusive.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  21) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  22) 3) object removal.  Locking rules: caller locks parent, finds victim,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  23) locks victim and calls the method.  Locks are exclusive.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  24) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  25) 4) rename() that is _not_ cross-directory.  Locking rules: caller locks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  26) the parent and finds source and target.  In case of exchange (with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  27) RENAME_EXCHANGE in flags argument) lock both.  In any case,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  28) if the target already exists, lock it.  If the source is a non-directory,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  29) lock it.  If we need to lock both, lock them in inode pointer order.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  30) Then call the method.  All locks are exclusive.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  31) NB: we might get away with locking the source (and target in exchange
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  32) case) shared.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  33) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  34) 5) link creation.  Locking rules:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  35) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  36) 	* lock parent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  37) 	* check that source is not a directory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  38) 	* lock source
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  39) 	* call the method.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  40) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  41) All locks are exclusive.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  42) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  43) 6) cross-directory rename.  The trickiest in the whole bunch.  Locking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  44) rules:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  45) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  46) 	* lock the filesystem
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  47) 	* lock parents in "ancestors first" order.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  48) 	* find source and target.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  49) 	* if old parent is equal to or is a descendent of target
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  50) 	  fail with -ENOTEMPTY
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  51) 	* if new parent is equal to or is a descendent of source
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  52) 	  fail with -ELOOP
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  53) 	* If it's an exchange, lock both the source and the target.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  54) 	* If the target exists, lock it.  If the source is a non-directory,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  55) 	  lock it.  If we need to lock both, do so in inode pointer order.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  56) 	* call the method.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  57) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  58) All ->i_rwsem are taken exclusive.  Again, we might get away with locking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  59) the source (and target in exchange case) shared.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  60) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  61) The rules above obviously guarantee that all directories that are going to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  62) read, modified or removed by method will be locked by caller.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  63) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  64) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  65) If no directory is its own ancestor, the scheme above is deadlock-free.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  66) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  67) Proof:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  68) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  69) 	First of all, at any moment we have a partial ordering of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  70) 	objects - A < B iff A is an ancestor of B.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  71) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  72) 	That ordering can change.  However, the following is true:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  73) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  74) (1) if object removal or non-cross-directory rename holds lock on A and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  75)     attempts to acquire lock on B, A will remain the parent of B until we
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  76)     acquire the lock on B.  (Proof: only cross-directory rename can change
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  77)     the parent of object and it would have to lock the parent).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  78) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  79) (2) if cross-directory rename holds the lock on filesystem, order will not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  80)     change until rename acquires all locks.  (Proof: other cross-directory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  81)     renames will be blocked on filesystem lock and we don't start changing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  82)     the order until we had acquired all locks).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  83) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  84) (3) locks on non-directory objects are acquired only after locks on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  85)     directory objects, and are acquired in inode pointer order.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  86)     (Proof: all operations but renames take lock on at most one
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  87)     non-directory object, except renames, which take locks on source and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  88)     target in inode pointer order in the case they are not directories.)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  89) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  90) Now consider the minimal deadlock.  Each process is blocked on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  91) attempt to acquire some lock and already holds at least one lock.  Let's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  92) consider the set of contended locks.  First of all, filesystem lock is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  93) not contended, since any process blocked on it is not holding any locks.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  94) Thus all processes are blocked on ->i_rwsem.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  95) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  96) By (3), any process holding a non-directory lock can only be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  97) waiting on another non-directory lock with a larger address.  Therefore
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  98) the process holding the "largest" such lock can always make progress, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  99) non-directory objects are not included in the set of contended locks.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) Thus link creation can't be a part of deadlock - it can't be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) blocked on source and it means that it doesn't hold any locks.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) Any contended object is either held by cross-directory rename or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) has a child that is also contended.  Indeed, suppose that it is held by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) operation other than cross-directory rename.  Then the lock this operation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) is blocked on belongs to child of that object due to (1).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) It means that one of the operations is cross-directory rename.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) Otherwise the set of contended objects would be infinite - each of them
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) would have a contended child and we had assumed that no object is its
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) own descendent.  Moreover, there is exactly one cross-directory rename
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) (see above).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) Consider the object blocking the cross-directory rename.  One
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) of its descendents is locked by cross-directory rename (otherwise we
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) would again have an infinite set of contended objects).  But that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) means that cross-directory rename is taking locks out of order.  Due
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) to (2) the order hadn't changed since we had acquired filesystem lock.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) But locking rules for cross-directory rename guarantee that we do not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) try to acquire lock on descendent before the lock on ancestor.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) Contradiction.  I.e.  deadlock is impossible.  Q.E.D.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) These operations are guaranteed to avoid loop creation.  Indeed,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) the only operation that could introduce loops is cross-directory rename.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) Since the only new (parent, child) pair added by rename() is (new parent,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) source), such loop would have to contain these objects and the rest of it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) would have to exist before rename().  I.e. at the moment of loop creation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) rename() responsible for that would be holding filesystem lock and new parent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) would have to be equal to or a descendent of source.  But that means that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) new parent had been equal to or a descendent of source since the moment when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) we had acquired filesystem lock and rename() would fail with -ELOOP in that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) case.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) While this locking scheme works for arbitrary DAGs, it relies on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) ability to check that directory is a descendent of another object.  Current
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) implementation assumes that directory graph is a tree.  This assumption is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) also preserved by all operations (cross-directory rename on a tree that would
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) not introduce a cycle will leave it a tree and link() fails for directories).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) Notice that "directory" in the above == "anything that might have
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) children", so if we are going to introduce hybrid objects we will need
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) either to make sure that link(2) doesn't work for them or to make changes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) in is_subdir() that would make it work even in presence of such beasts.