Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

.. SPDX-License-Identifier: GPL-2.0

=====================================================
Mandatory File Locking For The Linux Operating System
=====================================================

		Andy Walker <andy@lysaker.kvaerner.no>

			   15 April 1996

		     (Updated September 2007)

0. Why you should avoid mandatory locking
-----------------------------------------

The Linux implementation is prone to a number of difficult-to-fix race
conditions which in practice make it unreliable:

	- The write system call checks for a mandatory lock only once
	  at its start.  It is therefore possible for a lock request to
	  be granted after this check but before the data is modified.
	  A process may then see file data change even while a mandatory
	  lock is held.
	- Similarly, an exclusive lock may be granted on a file after
	  the kernel has decided to proceed with a read, but before the
	  read has actually completed, and the reading process may see
	  the file data in a state which should not have been visible
	  to it.
	- Comparable races make the claimed mutual exclusion between
	  locks and mmap equally unreliable.

1. What is mandatory locking?
-----------------------------

Mandatory locking is kernel-enforced file locking, as opposed to the more usual
cooperative file locking used to guarantee sequential access to files among
processes. File locks are applied using the flock() and fcntl() system calls
(and the lockf() library routine, which is a wrapper around fcntl()). It is
normally a process's responsibility to check for locks on a file it wishes to
update, before applying its own lock, updating the file and unlocking it again.
The most commonly used example of this (and in the case of sendmail, the most
troublesome) is access to a user's mailbox. The mail user agent and the mail
transfer agent must guard against updating the mailbox at the same time, and
prevent reading the mailbox while it is being updated.
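
The cooperative pattern described above can be sketched as follows. This is
only an illustration in Python, whose fcntl.lockf() wraps the fcntl() locking
interface. The helper name append_locked() is hypothetical, and the lock only
protects against other processes that take the same lock before touching the
file.

```python
import fcntl
import os


def append_locked(path, data):
    """Append data (bytes) to path while holding an exclusive advisory lock."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        # Blocks until no other process holds a conflicting fcntl() lock.
        fcntl.lockf(fd, fcntl.LOCK_EX)
        try:
            return os.write(fd, data)
        finally:
            # Release the lock before closing the descriptor.
            fcntl.lockf(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A mail delivery agent and a mail reader that both went through such a helper
would never interleave their updates. A third program that simply calls
write() is not stopped by anything.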

In a perfect world all processes would use and honour a cooperative, or
"advisory", locking scheme. However, the world isn't perfect, and there's
a lot of poorly written code out there.

In trying to address this problem, the designers of System V UNIX came up
with a "mandatory" locking scheme, whereby the operating system kernel would
block attempts by a process to write to a file on which another process holds
a "read" (or "shared") lock, and block attempts to both read and write to a
file on which a process holds a "write" (or "exclusive") lock.

The System V mandatory locking scheme was intended to have as little impact as
possible on existing user code. The scheme is based on marking individual files
as candidates for mandatory locking, and using the existing fcntl()/lockf()
interface for applying locks just as if they were normal, advisory locks.

.. Note::

   1. In saying "file" in the paragraphs above I am actually not telling
      the whole truth. System V locking is based on fcntl(). The granularity of
      fcntl() is such that it allows the locking of byte ranges in files, in
      addition to entire files, so the mandatory locking rules also have byte
      level granularity.

   2. POSIX.1 does not specify any scheme for mandatory locking, despite
      borrowing the fcntl() locking scheme from System V. The mandatory locking
      scheme is defined by the System V Interface Definition (SVID) Version 3.
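
The byte-range granularity mentioned in note 1 can be illustrated with a short
sketch. It assumes Python's fcntl.lockf(), which takes a length, a start
offset and a whence value. The helper names lock_range() and unlock_range()
are hypothetical.

```python
import errno
import fcntl
import os


def lock_range(fd, start, length, exclusive=True):
    """Try to lock `length` bytes at offset `start`; return True on success."""
    kind = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    try:
        # LOCK_NB makes the call fail instead of blocking on a conflict.
        fcntl.lockf(fd, kind | fcntl.LOCK_NB, length, start, os.SEEK_SET)
    except OSError as exc:
        if exc.errno in (errno.EACCES, errno.EAGAIN):
            return False  # another process holds a conflicting lock
        raise
    return True


def unlock_range(fd, start, length):
    """Release a lock previously taken on the same byte range."""
    fcntl.lockf(fd, fcntl.LOCK_UN, length, start, os.SEEK_SET)
```

A length of 0 means "from start to the end of the file, however far it grows",
which is how a whole-file lock is expressed through this interface.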

2. Marking a file for mandatory locking
---------------------------------------

A file is marked as a candidate for mandatory locking by setting the
set-group-ID (setgid) bit in its file mode while clearing the group-execute
bit. This is an otherwise meaningless combination, and was chosen by the
System V implementors so as not to break existing user programs.
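
As a minimal sketch of that rule, the mode test and the equivalent of
"chmod g+s,g-x" might look like this in Python. Both function names are
hypothetical and used only for illustration.

```python
import os
import stat


def is_mandatory_candidate(mode):
    """True if setgid is set and group-execute is clear in a file mode."""
    return bool(mode & stat.S_ISGID) and not (mode & stat.S_IXGRP)


def mark_mandatory_candidate(path):
    """Set the setgid bit and clear group-execute on path."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, (mode | stat.S_ISGID) & ~stat.S_IXGRP)
```

For example, mode 02644 qualifies, while 02654 (group-execute set) and 0644
(setgid clear) do not. On Linux, chmod() silently drops the setgid bit when an
unprivileged caller does not belong to the file's group, so it is worth
re-checking the mode afterwards.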

Note that the kernel usually clears the setgid bit automatically when a
setgid file is written to. This is a security measure. The kernel has been
modified to recognize the special case of a mandatory lock candidate and to
refrain from clearing this bit. Similarly, the kernel has been modified not
to run mandatory lock candidates with setgid privileges.

3. Available implementations
----------------------------

I have considered the implementations of mandatory locking available with
SunOS 4.1.x, Solaris 2.x and HP-UX 9.x.

Generally I have tried to make the most sense out of the behaviour exhibited
by these three reference systems. There are many anomalies.

All the reference systems reject all calls to open() for a file on which
another process has outstanding mandatory locks. This is in direct
contravention of SVID 3, which states that only calls to open() with the
O_TRUNC flag set should be rejected. The Linux implementation follows the SVID
definition, which is the "Right Thing", since only calls with O_TRUNC can
modify the contents of the file.

HP-UX even disallows open() with O_TRUNC for a file with advisory locks, not
just mandatory locks. That would appear to contravene POSIX.1.

mmap() is another interesting case. All the operating systems mentioned
prevent mandatory locks from being applied to an mmap()'ed file, but HP-UX
also disallows advisory locks for such a file. SVID actually specifies the
paranoid HP-UX behaviour.

In my opinion only MAP_SHARED mappings should be immune from locking, and then
only from mandatory locks - that is what is currently implemented.

SunOS is so hopeless that it doesn't even honour the O_NONBLOCK flag for
mandatory locks, so reads and writes to locked files always block when they
should return EAGAIN.

I'm afraid that this is such an esoteric area that the semantics described
below are just as valid as any others, so long as the main points seem to
agree.

4. Semantics
------------

1. Mandatory locks can only be applied via the fcntl()/lockf() locking
   interface - in other words the System V/POSIX interface. BSD style
   locks using flock() never result in a mandatory lock.

2. If a process has locked a region of a file with a mandatory read lock, then
   other processes are permitted to read from that region. If any of these
   processes attempts to write to the region it will block until the lock is
   released, unless the process has opened the file with the O_NONBLOCK
   flag, in which case the system call will return immediately with the error
   status EAGAIN.

3. If a process has locked a region of a file with a mandatory write lock, all
   attempts to read or write to that region block until the lock is released,
   unless a process has opened the file with the O_NONBLOCK flag, in which case
   the system call will return immediately with the error status EAGAIN.

4. Calls to open() with O_TRUNC, or to creat(), on an existing file that has
   any mandatory locks owned by other processes will be rejected with the
   error status EAGAIN.

5. Attempts to apply a mandatory lock to a file that is memory mapped and
   shared (via mmap() with MAP_SHARED) will be rejected with the error status
   EAGAIN.

6. Attempts to create a shared memory map of a file (via mmap() with MAP_SHARED)
   that has any mandatory locks in effect will be rejected with the error status
   EAGAIN.
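
Rules 2 and 3 suggest how a program that must never hang could deal with a
mandatory lock candidate: open it with O_NONBLOCK and treat EAGAIN as "region
locked". The sketch below assumes a kernel and mount on which mandatory
locking is actually enabled; elsewhere the write simply succeeds. The helper
name try_write() is hypothetical.

```python
import errno
import os


def try_write(path, data, offset=0):
    """Write data (bytes) at offset without blocking on a mandatory lock.

    Returns the number of bytes written, or None if the kernel reported
    EAGAIN, which under the rules above signals a conflicting lock.
    """
    fd = os.open(path, os.O_WRONLY | os.O_NONBLOCK)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        return os.write(fd, data)
    except OSError as exc:
        if exc.errno == errno.EAGAIN:
            return None
        raise
    finally:
        os.close(fd)
```

A caller would typically retry later or report the conflict when None comes
back, rather than fall back to a blocking write.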

5. Which system calls are affected?
-----------------------------------

Those which read or modify a file's contents, not just the inode. That gives
read(), write(), readv(), writev(), open(), creat(), mmap(), truncate() and
ftruncate(). truncate() and ftruncate() are considered to be "write" actions
for the purposes of mandatory locking.

The affected region is usually defined as stretching from the current position
for the total number of bytes read or written. For the truncate calls it is
defined as the bytes of a file removed or added (we must also consider bytes
added, as a lock can specify just "the whole file", rather than a specific
range of bytes).
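
The region arithmetic can be made concrete with a small worked example. This
is an illustration of the overlap test only, not the kernel's code, and the
function name regions_overlap() is hypothetical. A lock length of 0 stands for
"to the end of the file", as with l_len in struct flock.

```python
def regions_overlap(op_start, op_len, lock_start, lock_len):
    """Return True if an I/O region intersects a locked region.

    All values are byte offsets or lengths. lock_len == 0 means the lock
    extends to the end of the file, however large the file grows.
    """
    if op_len <= 0:
        return False
    op_end = op_start + op_len  # exclusive end of the I/O region
    if lock_len == 0:
        return op_end > lock_start
    lock_end = lock_start + lock_len  # exclusive end of the locked region
    return op_start < lock_end and lock_start < op_end
```

A 10-byte write at offset 0 therefore conflicts with a lock on bytes 5 to 14,
while a 5-byte write at offset 0 does not. Bytes appended far beyond the
current end of file still conflict with a whole-file lock, which is why bytes
added by the truncate calls have to be considered too.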

Note 3: I may have overlooked some system calls that need mandatory lock
checking in my eagerness to get this code out the door. Please let me know, or
better still fix the system calls yourself and submit a patch to me or Linus.

6. Warning!
-----------

Not even root can override a mandatory lock, so runaway processes can wreak
havoc if they lock crucial files. The way around it is to change the file
permissions (remove the setgid bit) before trying to read or write to it.
Of course, that might be a bit tricky if the system is hung :-(
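
The workaround amounts to a chmod g-s. A minimal sketch in Python, with the
hypothetical helper name clear_mandatory_candidate():

```python
import os
import stat


def clear_mandatory_candidate(path):
    """Remove the setgid bit from path and return the resulting mode bits."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode & ~stat.S_ISGID)
    return stat.S_IMODE(os.stat(path).st_mode)
```

Only the mode is changed here. Existing locks stay in place as ordinary
fcntl() locks, but the file is no longer a candidate for mandatory
enforcement.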

7. The "mand" mount option
--------------------------

Mandatory locking is disabled on all filesystems by default, and must be
administratively enabled by mounting with "-o mand". That mount option
is only allowed if the mounting task has the CAP_SYS_ADMIN capability.

Since kernel v4.5, it is possible to disable mandatory locking
altogether by setting CONFIG_MANDATORY_FILE_LOCKING to "n". A kernel
with this disabled will reject attempts to mount filesystems with the
"mand" mount option with the error status EPERM.
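
To find out whether a filesystem was mounted with the option, one possibility
is to look at its option list. The sketch below assumes that the kernel lists
"mand" among the options in /proc/self/mounts. Both function names are
hypothetical.

```python
def mount_options(mounts_line):
    """Return the set of options from one /proc/mounts-style line."""
    fields = mounts_line.split()
    # Fields: device, mount point, fs type, options, dump, pass.
    return set(fields[3].split(",")) if len(fields) >= 4 else set()


def mand_enabled(mount_point, mounts_path="/proc/self/mounts"):
    """True if the last entry for mount_point carries the "mand" option.

    Mount points containing spaces appear escaped (as \\040) in the file,
    so mount_point must be given in that escaped form.
    """
    enabled = False
    with open(mounts_path) as mounts:
        for line in mounts:
            fields = line.split()
            if len(fields) >= 4 and fields[1] == mount_point:
                # A later entry shadows an earlier one for the same path.
                enabled = "mand" in mount_options(line)
    return enabled
```

For a line such as "/dev/sda1 /data ext4 rw,mand,relatime 0 0" the option set
contains "mand", so mand_enabled("/data") would return True.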