^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) .. SPDX-License-Identifier: GPL-2.0

===================================
File management in the Linux kernel
===================================

This document describes how locking for files (struct file)
and the file descriptor table (struct files_struct) works.

Up until 2.6.12, the file descriptor table was protected
with a lock (files->file_lock) and a reference count (files->count).
->file_lock protected accesses to all the file-related fields
of the table. ->count was used for sharing the file descriptor
table between tasks cloned with the CLONE_FILES flag. Typically
this would be the case for POSIX threads. As with the common
refcounting model in the kernel, the last task doing
a put_files_struct() frees the file descriptor (fd) table.
The files (struct file) themselves are protected using
a reference count (->f_count).
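
The "last put frees" refcounting pattern mentioned above can be illustrated
with a minimal userspace sketch using C11 atomics. This is not kernel code:
``struct table``, ``table_alloc()``, ``table_get()``, ``table_put()`` and the
``tables_freed`` counter are hypothetical names invented for the illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Demo-only counter so the release of a table can be observed. */
int tables_freed;

/* Hypothetical stand-in for a shared descriptor table. */
struct table {
    atomic_int count;   /* number of tasks sharing the table */
};

/* Allocate a table; the creator holds the first reference. */
struct table *table_alloc(void)
{
    struct table *t = malloc(sizeof(*t));

    if (t)
        atomic_init(&t->count, 1);
    return t;
}

/* A task that starts sharing the table takes an extra reference. */
void table_get(struct table *t)
{
    atomic_fetch_add(&t->count, 1);
}

/* Drop a reference; whoever drops the last one frees the table. */
void table_put(struct table *t)
{
    /* atomic_fetch_sub() returns the value before the decrement. */
    if (atomic_fetch_sub(&t->count, 1) == 1) {
        free(t);
        tables_freed++;
    }
}
```

In this sketch a second sharer calls ``table_get()`` once, and the table is
released only after both sharers have called ``table_put()``.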

In the new lock-free model of file descriptor management,
the reference counting is similar, but the locking is
based on RCU. The file descriptor table contains multiple
elements: the fd sets (open_fds and close_on_exec), the
array of file pointers, the sizes of the sets and the array,
etc. In order for the updates to appear atomic to
a lock-free reader, all the elements of the file descriptor
table are in a separate structure, struct fdtable.
files_struct contains a pointer to struct fdtable through
which the actual fd table is accessed. Initially the
fdtable is embedded in files_struct itself. On a subsequent
expansion of fdtable, a new fdtable structure is allocated
and files->fdt points to the new structure. The old fdtable
structure is freed with RCU, and lock-free readers see either
the old fdtable or the new fdtable, making the update
appear atomic. Here are the locking rules for
the fdtable structure:

1. All references to the fdtable must be done through
   the files_fdtable() macro::

        struct fdtable *fdt;

        rcu_read_lock();

        fdt = files_fdtable(files);
        ....
        if (n <= fdt->max_fds)
                ....
        ...
        rcu_read_unlock();

   files_fdtable() uses the rcu_dereference() macro, which takes care of
   the memory barrier requirements for lock-free dereference.
   The fdtable pointer must be read within the read-side
   critical section.

2. Reading of the fdtable as described above must be protected
   by rcu_read_lock()/rcu_read_unlock().

3. For any update to the fd table, files->file_lock must
   be held.

4. To look up the file structure given an fd, a reader
   must use either the fcheck() or the fcheck_files() API. These
   take care of barrier requirements due to lock-free lookup.

   An example::

        struct file *file;

        rcu_read_lock();
        file = fcheck(fd);
        if (file) {
                ...
        }
        ....
        rcu_read_unlock();

5. Handling of the file structures is special. Since the look-up
   of the fd (fget()/fget_light()) is lock-free, it is possible
   that the look-up may race with the last put() operation on the
   file structure. This is avoided using atomic_long_inc_not_zero()
   on ->f_count::

        rcu_read_lock();
        file = fcheck_files(files, fd);
        if (file) {
                if (atomic_long_inc_not_zero(&file->f_count))
                        *fput_needed = 1;
                else
                        /* Didn't get the reference, someone's freed */
                        file = NULL;
        }
        rcu_read_unlock();
        ....
        return file;

   atomic_long_inc_not_zero() detects if the refcount is already zero or
   goes to zero during increment. If it does, we fail
   fget()/fget_light().

6. Since both fdtable and file structures can be looked up
   lock-free, they must be installed using the rcu_assign_pointer()
   API. If they are looked up lock-free, rcu_dereference()
   must be used. However, it is advisable to use files_fdtable()
   and fcheck()/fcheck_files(), which take care of these issues.

7. While updating, the fdtable pointer must be looked up while
   holding files->file_lock. If ->file_lock is dropped, then
   another thread may expand the files, thereby creating a new
   fdtable and making the earlier fdtable pointer stale.

   For example::

        spin_lock(&files->file_lock);
        fd = locate_fd(files, file, start);
        if (fd >= 0) {
                /* locate_fd() may have expanded fdtable, load the ptr */
                fdt = files_fdtable(files);
                __set_open_fd(fd, fdt);
                __clear_close_on_exec(fd, fdt);
                spin_unlock(&files->file_lock);
        .....

   Since locate_fd() can drop ->file_lock (and reacquire ->file_lock),
   the fdtable pointer (fdt) must be loaded after locate_fd().
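
The "increment unless zero" step in rule 5 can be sketched in userspace with
a C11 compare-and-exchange loop. This only illustrates the semantics and is
not the kernel's implementation of atomic_long_inc_not_zero(); the name
``long_inc_not_zero()`` is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Take a reference only if the count is still non-zero.
 * Returns true if the count was incremented, false if it had
 * already dropped to zero (the object is being torn down).
 */
bool long_inc_not_zero(atomic_long *v)
{
    long old = atomic_load(v);

    while (old != 0) {
        /* On failure, 'old' is refreshed with the current value. */
        if (atomic_compare_exchange_weak(v, &old, old + 1))
            return true;
    }
    return false;
}
```

A reader that gets ``false`` back must treat the lookup as failed, which
mirrors the ``file = NULL`` branch in rule 5.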
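
The replace-and-publish scheme that makes an fdtable expansion appear atomic
can be approximated in userspace with a C11 release store on the writer side
and an acquire load on the reader side. This is only a rough analogue of
rcu_assign_pointer() and rcu_dereference(); ``struct tbl``, ``struct owner``,
``owner_table()`` and ``owner_expand()`` are hypothetical names, and deferred
freeing of the old table (RCU's job in the kernel) is left to the caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

struct tbl {
    size_t max;     /* number of slots */
    int *slots;     /* stand-in for the array of file pointers */
};

struct owner {
    _Atomic(struct tbl *) cur;  /* readers load this pointer once */
};

/* Reader side: a single acquire load yields a self-consistent table. */
struct tbl *owner_table(struct owner *o)
{
    return atomic_load_explicit(&o->cur, memory_order_acquire);
}

/*
 * Writer side; the caller is assumed to hold the update lock.
 * Builds a larger copy and publishes it with one release store.
 * Returns the old table, which must not be freed while a reader
 * may still hold it, or NULL if nothing was changed.
 */
struct tbl *owner_expand(struct owner *o, size_t new_max)
{
    struct tbl *old = atomic_load_explicit(&o->cur, memory_order_relaxed);
    struct tbl *nt;

    if (new_max < old->max)
        return NULL;
    nt = malloc(sizeof(*nt));
    if (!nt)
        return NULL;
    nt->slots = calloc(new_max, sizeof(*nt->slots));
    if (!nt->slots) {
        free(nt);
        return NULL;
    }
    nt->max = new_max;
    memcpy(nt->slots, old->slots, old->max * sizeof(*old->slots));
    /* All fields are filled in before the pointer becomes visible. */
    atomic_store_explicit(&o->cur, nt, memory_order_release);
    return old;
}
```

Because the size and the array live in the same structure, a reader never
pairs an old size with a new array or the reverse, which is the reason the
document gives for grouping these fields in struct fdtable.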