^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) .. SPDX-License-Identifier: GPL-2.0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) ========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4) ORANGEFS
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) ========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) OrangeFS is an LGPL userspace scale-out parallel storage system. It is ideal
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8) for large storage problems faced by HPC, BigData, Streaming Video,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9) Genomics, and Bioinformatics.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11) Orangefs, originally called PVFS, was first developed in 1993 by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12) Walt Ligon and Eric Blumer as a parallel file system for Parallel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) Virtual Machine (PVM) as part of a NASA grant to study the I/O patterns
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14) of parallel programs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16) Orangefs features include:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18) * Distributes file data among multiple file servers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) * Supports simultaneous access by multiple clients
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20) * Stores file data and metadata on servers using local file system
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21) and access methods
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22) * Userspace implementation is easy to install and maintain
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) * Direct MPI support
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24) * Stateless
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27) Mailing List Archives
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28) =====================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30) http://lists.orangefs.org/pipermail/devel_lists.orangefs.org/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) Mailing List Submissions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34) ========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) devel@lists.orangefs.org
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39) Documentation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40) =============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42) http://www.orangefs.org/documentation/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) Running ORANGEFS On a Single Server
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45) ===================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47) OrangeFS is usually run in large installations with multiple servers and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48) clients, but a complete filesystem can be run on a single machine for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49) development and testing.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51) On Fedora, install orangefs and orangefs-server::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53) dnf -y install orangefs orangefs-server
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) There is an example server configuration file in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56) /etc/orangefs/orangefs.conf. Change localhost to your hostname if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57) necessary.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59) To generate a filesystem to run xfstests against, see below.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61) There is an example client configuration file in /etc/pvfs2tab. It is a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) single line. Uncomment it and change the hostname if necessary. This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63) controls clients which use libpvfs2. This does not control the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64) pvfs2-client-core.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66) Create the filesystem::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68) pvfs2-server -f /etc/orangefs/orangefs.conf
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70) Start the server::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72) systemctl start orangefs-server
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74) Test the server::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76) pvfs2-ping -m /pvfsmnt
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78) Start the client. The module must be compiled in or loaded before this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79) point::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81) systemctl start orangefs-client
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83) Mount the filesystem::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85) mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87) Userspace Filesystem Source
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88) ===========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90) http://www.orangefs.org/download
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92) Orangefs versions prior to 2.9.3 are not compatible with the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93) upstream version of the kernel client.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96) Building ORANGEFS on a Single Server
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97) ====================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99) Where OrangeFS cannot be installed from distribution packages, it may be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) built from source.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) You can omit --prefix if you don't care that things are sprinkled around
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) in /usr/local. As of version 2.9.6, OrangeFS uses Berkeley DB by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) default, but we will probably be changing the default to LMDB soon.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) ./configure --prefix=/opt/ofs --with-db-backend=lmdb --disable-usrint
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) make
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) make install
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) Create an orangefs config file by running pvfs2-genconfig and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) specifying a target config file. pvfs2-genconfig will prompt you
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) through the process. Generally it works fine to take the defaults, but you
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) should use your server's hostname, rather than "localhost" when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) it comes to that question::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) /opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) Create an /etc/pvfs2tab file (localhost is fine)::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) /etc/pvfs2tab
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) Create the mount point you specified in the tab file if needed::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) mkdir /pvfsmnt
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) Bootstrap the server::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) /opt/ofs/sbin/pvfs2-server -f /etc/pvfs2.conf
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) Start the server::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) /opt/ofs/sbin/pvfs2-server /etc/pvfs2.conf
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) Now the server should be running. The pvfs2-ls command is a simple
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) test to verify that the server is running::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) /opt/ofs/bin/pvfs2-ls /pvfsmnt
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) If everything seems to be working, load the kernel module and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) turn on the client core::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) /opt/ofs/sbin/pvfs2-client -p /opt/ofs/sbin/pvfs2-client-core
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) Mount your filesystem::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151) mount -t pvfs2 tcp://`hostname`:3334/orangefs /pvfsmnt
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) Running xfstests
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) ================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) It is useful to use a scratch filesystem with xfstests. This can be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158) done with only one server.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) Make a second copy of the FileSystem section in the server configuration
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) file, which is /etc/orangefs/orangefs.conf. Change the Name to scratch.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162) Change the ID to something other than the ID of the first FileSystem
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) section (2 is usually a good choice).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) Then there are two FileSystem sections: orangefs and scratch.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) This change should be made before creating the filesystem.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171) pvfs2-server -f /etc/orangefs/orangefs.conf
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) To run xfstests, create /etc/xfsqa.config::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) TEST_DIR=/orangefs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) TEST_DEV=tcp://localhost:3334/orangefs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177) SCRATCH_MNT=/scratch
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178) SCRATCH_DEV=tcp://localhost:3334/scratch
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180) Then xfstests can be run::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182) ./check -pvfs2
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185) Options
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186) =======
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188) The following mount options are accepted:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190) acl
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191) Allow the use of Access Control Lists on files and directories.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193) intr
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194) Some operations between the kernel client and the user space
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195) filesystem can be interruptible, such as changes in debug levels
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196) and the setting of tunable parameters.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198) local_lock
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199) Enable POSIX locking from the perspective of "this" kernel. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200) default file_operations lock action is to return ENOSYS. POSIX
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201) locking kicks in if the filesystem is mounted with -o local_lock.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202) Distributed locking is being worked on for the future.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205) Debugging
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 206) =========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 207)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 208) If you want the debug (GOSSIP) statements in a particular
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 209) source file (inode.c for example) go to syslog::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 210)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 211) echo inode > /sys/kernel/debug/orangefs/kernel-debug
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 212)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 213) No debugging (the default)::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 214)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 215) echo none > /sys/kernel/debug/orangefs/kernel-debug
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 216)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 217) Debugging from several source files::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 218)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 219) echo inode,dir > /sys/kernel/debug/orangefs/kernel-debug
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 220)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 221) All debugging::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 222)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 223) echo all > /sys/kernel/debug/orangefs/kernel-debug
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 224)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 225) Get a list of all debugging keywords::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 226)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 227) cat /sys/kernel/debug/orangefs/debug-help
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 228)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 229)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 230) Protocol between Kernel Module and Userspace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 231) ============================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 232)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 233) Orangefs is a user space filesystem and an associated kernel module.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 234) We'll just refer to the user space part of Orangefs as "userspace"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 235) from here on out. Orangefs descends from PVFS, and userspace code
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 236) still uses PVFS for function and variable names. Userspace typedefs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 237) many of the important structures. Function and variable names in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 238) the kernel module have been transitioned to "orangefs", and the Linux
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 239) kernel coding style avoids typedefs, so kernel module structures that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 240) correspond to userspace structures are not typedefed.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 241)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 242) The kernel module implements a pseudo device that userspace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 243) can read from and write to. Userspace can also manipulate the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 244) kernel module through the pseudo device with ioctl.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 245)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 246) The Bufmap
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 247) ----------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 248)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 249) At startup userspace allocates two page-size-aligned (posix_memalign)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 250) mlocked memory buffers: one is used for IO and one is used for readdir
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 251) operations. The IO buffer is 41943040 bytes and the readdir buffer is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 252) 4194304 bytes. Each buffer contains logical chunks, or partitions, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 253) a pointer to each buffer is added to its own PVFS_dev_map_desc structure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 254) which also describes its total size, as well as the size and number of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 255) the partitions.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 256)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 257) A pointer to the IO buffer's PVFS_dev_map_desc structure is sent to a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 258) mapping routine in the kernel module with an ioctl. The structure is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 259) copied from user space to kernel space with copy_from_user and is used
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 260) to initialize the kernel module's "bufmap" (struct orangefs_bufmap), which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 261) then contains:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 262)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 263) * refcnt
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 264) - a reference counter
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 265) * desc_size - PVFS2_BUFMAP_DEFAULT_DESC_SIZE (4194304) - the IO buffer's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 266) partition size, which represents the filesystem's block size and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 267) is used for s_blocksize in super blocks.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 268) * desc_count - PVFS2_BUFMAP_DEFAULT_DESC_COUNT (10) - the number of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 269) partitions in the IO buffer.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 270) * desc_shift - log2(desc_size), used for s_blocksize_bits in super blocks.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 271) * total_size - the total size of the IO buffer.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 272) * page_count - the number of 4096 byte pages in the IO buffer.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 273) * page_array - a pointer to ``page_count * (sizeof(struct page*))`` bytes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 274) of kcalloced memory. This memory is used as an array of pointers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 275) to each of the pages in the IO buffer through a call to get_user_pages.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 276) * desc_array - a pointer to ``desc_count * (sizeof(struct orangefs_bufmap_desc))``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 277) bytes of kcalloced memory. This memory is further initialized:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 278)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 279) user_desc is the kernel's copy of the IO buffer's ORANGEFS_dev_map_desc
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 280) structure. user_desc->ptr points to the IO buffer.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 281)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 282) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 283)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 284) pages_per_desc = bufmap->desc_size / PAGE_SIZE
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 285) offset = 0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 286)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 287) bufmap->desc_array[0].page_array = &bufmap->page_array[offset]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 288) bufmap->desc_array[0].array_count = pages_per_desc = 1024
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 289) bufmap->desc_array[0].uaddr = (user_desc->ptr) + (0 * 1024 * 4096)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 290) offset += 1024
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 291) .
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 292) .
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 293) .
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 294) bufmap->desc_array[9].page_array = &bufmap->page_array[offset]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 295) bufmap->desc_array[9].array_count = pages_per_desc = 1024
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 296) bufmap->desc_array[9].uaddr = (user_desc->ptr) +
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 297) (9 * 1024 * 4096)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 298) offset += 1024
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 299)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 300) * buffer_index_array - a desc_count sized array of ints, used to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 301) indicate which of the IO buffer's partitions are available to use.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 302) * buffer_index_lock - a spinlock to protect buffer_index_array during update.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 303) * readdir_index_array - a five (ORANGEFS_READDIR_DEFAULT_DESC_COUNT) element
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 304) int array used to indicate which of the readdir buffer's partitions are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 305) available to use.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 306) * readdir_index_lock - a spinlock to protect readdir_index_array during
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 307) update.
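
The arithmetic behind the bufmap fields can be checked with a small userspace sketch. It is not the kernel module's code: every ``sketch_`` name is hypothetical, the two defaults are the values quoted above, and a 4096-byte page size is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Defaults quoted in the text; every sketch_ name is hypothetical. */
#define SKETCH_DESC_SIZE  4194304u /* PVFS2_BUFMAP_DEFAULT_DESC_SIZE */
#define SKETCH_DESC_COUNT 10u      /* PVFS2_BUFMAP_DEFAULT_DESC_COUNT */
#define SKETCH_PAGE_SIZE  4096u    /* assumption: 4096-byte pages */

struct sketch_geometry {
    uint32_t desc_size;      /* size of one partition in bytes */
    uint32_t desc_count;     /* number of partitions */
    uint32_t desc_shift;     /* log2(desc_size) */
    uint32_t total_size;     /* size of the whole IO buffer in bytes */
    uint32_t page_count;     /* pages in the whole IO buffer */
    uint32_t pages_per_desc; /* pages in one partition */
};

/* log2 of a power of two, computed by counting right shifts. */
uint32_t sketch_log2(uint32_t v)
{
    uint32_t shift = 0;

    assert(v != 0 && (v & (v - 1)) == 0); /* power of two only */
    while (v > 1) {
        v >>= 1;
        shift++;
    }
    return shift;
}

/* Derive every size field from the two defaults. */
struct sketch_geometry sketch_bufmap_geometry(void)
{
    struct sketch_geometry g;

    g.desc_size = SKETCH_DESC_SIZE;
    g.desc_count = SKETCH_DESC_COUNT;
    g.desc_shift = sketch_log2(g.desc_size);
    g.total_size = g.desc_size * g.desc_count;
    g.page_count = g.total_size / SKETCH_PAGE_SIZE;
    g.pages_per_desc = g.desc_size / SKETCH_PAGE_SIZE;
    return g;
}

/* Byte offset of partition i from the start of the IO buffer,
 * i.e. the value added to user_desc->ptr to obtain uaddr. */
size_t sketch_desc_offset(const struct sketch_geometry *g, uint32_t i)
{
    assert(i < g->desc_count);
    return (size_t)i * g->pages_per_desc * SKETCH_PAGE_SIZE;
}
```

With these inputs the sketch reproduces the figures in the text: a 41943040-byte IO buffer, 1024 pages per partition, and a desc_shift of 22.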
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 308)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 309) Operations
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 310) ----------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 311)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 312) The kernel module builds an "op" (struct orangefs_kernel_op_s) when it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 313) needs to communicate with userspace. Part of the op contains the "upcall"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 314) which expresses the request to userspace. Part of the op eventually
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 315) contains the "downcall" which expresses the results of the request.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 316)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 317) The slab allocator is used to keep a cache of op structures handy.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 318)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 319) At init time the kernel module defines and initializes a request list
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 320) and an in_progress hash table to keep track of all the ops that are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 321) in flight at any given time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 322)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 323) Ops are stateful:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 324)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 325) * unknown
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 326) - op was just initialized
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 327) * waiting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 328) - op is on request_list (upward bound)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 329) * inprogr
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 330) - op is in progress (waiting for downcall)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 331) * serviced
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 332) - op has matching downcall; ok
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 333) * purged
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 334) - op has to start a timer since client-core
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 335) exited uncleanly before servicing op
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 336) * given up
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 337) - submitter has given up waiting for it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 338)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 339) When some arbitrary userspace program needs to perform a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 340) filesystem operation on Orangefs (readdir, I/O, create, whatever)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 341) an op structure is initialized and tagged with a distinguishing ID
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 342) number. The upcall part of the op is filled out, and the op is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 343) passed to the "service_operation" function.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 344)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 345) Service_operation changes the op's state to "waiting", puts
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 346) it on the request list, and signals the Orangefs file_operations.poll
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 347) function through a wait queue. Userspace is polling the pseudo-device
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 348) and thus becomes aware of the upcall request that needs to be read.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 349)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 350) When the Orangefs file_operations.read function is triggered, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 351) request list is searched for an op that seems ready-to-process.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 352) The op is removed from the request list. The tag from the op and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 353) the filled-out upcall struct are copy_to_user'ed back to userspace.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 354)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 355) If any of these (and some additional protocol) copy_to_users fail,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 356) the op's state is set to "waiting" and the op is added back to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 357) the request list. Otherwise, the op's state is changed to "in progress",
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 358) and the op is hashed on its tag and put onto the end of a list in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 359) in_progress hash table at the index the tag hashed to.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 360)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 361) When userspace has assembled the response to the upcall, it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 362) writes the response, which includes the distinguishing tag, back to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 363) the pseudo device in a series of io_vecs. This triggers the Orangefs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 364) file_operations.write_iter function to find the op with the associated
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 365) tag and remove it from the in_progress hash table. As long as the op's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 366) state is not "canceled" or "given up", its state is set to "serviced".
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 367) The file_operations.write_iter function returns to the waiting vfs,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 368) and back to service_operation through wait_for_matching_downcall.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 369)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 370) Service_operation returns to its caller with the op's downcall
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 371) part (the response to the upcall) filled out.
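
The state changes described above can be modelled in a few lines of plain C. This is only a sketch under stated assumptions: the ``sketch_`` names are hypothetical, the request list and hash table are left out, and the write path is simplified so that only an in-progress op with a matching tag becomes "serviced".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names mirroring the states listed in the text. */
enum sketch_op_state {
    SKETCH_OP_UNKNOWN,
    SKETCH_OP_WAITING,
    SKETCH_OP_INPROGR,
    SKETCH_OP_SERVICED,
    SKETCH_OP_PURGED,
    SKETCH_OP_GIVEN_UP
};

struct sketch_op {
    unsigned long tag;          /* distinguishing ID number */
    enum sketch_op_state state;
};

/* Models service_operation queueing the op for userspace. */
void sketch_submit(struct sketch_op *op, unsigned long tag)
{
    op->tag = tag;
    op->state = SKETCH_OP_WAITING;
}

/* Models the device read: copy_ok stands in for whether the
 * copy_to_user calls succeeded. On failure the op stays waiting. */
void sketch_device_read(struct sketch_op *op, bool copy_ok)
{
    assert(op->state == SKETCH_OP_WAITING);
    op->state = copy_ok ? SKETCH_OP_INPROGR : SKETCH_OP_WAITING;
}

/* Models the device write: returns true when the response matched
 * the op and the op was marked serviced (simplified rule). */
bool sketch_device_write(struct sketch_op *op, unsigned long tag)
{
    if (op->tag != tag || op->state != SKETCH_OP_INPROGR)
        return false;
    op->state = SKETCH_OP_SERVICED;
    return true;
}
```

A caller would walk one op through submit, read, and write in that order; a write carrying the wrong tag leaves the op untouched.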
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 372)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 373) The "client-core" is the bridge between the kernel module and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 374) userspace. The client-core is a daemon. The client-core has an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 375) associated watchdog daemon. If the client-core is ever signaled
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 376) to die, the watchdog daemon restarts the client-core. Even though
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 377) the client-core is restarted "right away", there is a period of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 378) time during such an event that the client-core is dead. A dead client-core
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 379) can't be triggered by the Orangefs file_operations.poll function.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 380) Ops that pass through service_operation during a "dead spell" can time out
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 381) on the wait queue and one attempt is made to recycle them. Obviously,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 382) if the client-core stays dead too long, the arbitrary userspace processes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 383) trying to use Orangefs will be negatively affected. Waiting ops
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 384) that can't be serviced will be removed from the request list and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 385) have their states set to "given up". In-progress ops that can't
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 386) be serviced will be removed from the in_progress hash table and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 387) have their states set to "given up".
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 388)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 389) Readdir and I/O ops are atypical with respect to their payloads.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 390)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 391) - readdir ops use the smaller of the two pre-allocated pre-partitioned
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 392) memory buffers. The readdir buffer is only available to userspace.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 393) The kernel module obtains an index to a free partition before launching
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 394) a readdir op. Userspace deposits the results into the indexed partition
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 395) and then writes them back to the pvfs device.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 396)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 397) - io (read and write) ops use the larger of the two pre-allocated
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 398) pre-partitioned memory buffers. The IO buffer is accessible from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 399) both userspace and the kernel module. The kernel module obtains an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 400) index to a free partition before launching an io op. The kernel module
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 401) deposits write data into the indexed partition, to be consumed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 402) directly by userspace. Userspace deposits the results of read
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 403) requests into the indexed partition, to be consumed directly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 404) by the kernel module.
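
Obtaining "an index to a free partition" can be pictured as claiming a slot in a small in-use array, in the spirit of buffer_index_array. The sketch below is a single-threaded illustration with hypothetical ``sketch_`` names; it omits the spinlock the module uses, and it returns -1 when no partition is free, which is an assumption of this sketch and not a description of what the kernel module does in that case.

```c
#include <assert.h>

#define SKETCH_DESC_COUNT 10 /* partitions in the IO buffer */

struct sketch_slots {
    int in_use[SKETCH_DESC_COUNT]; /* 0 = free, 1 = taken */
};

/* Claim the lowest-numbered free partition.
 * Returns its index, or -1 when every partition is taken. */
int sketch_slot_get(struct sketch_slots *s)
{
    int i;

    for (i = 0; i < SKETCH_DESC_COUNT; i++) {
        if (!s->in_use[i]) {
            s->in_use[i] = 1;
            return i;
        }
    }
    return -1;
}

/* Release a partition once the op that used it has finished. */
void sketch_slot_put(struct sketch_slots *s, int i)
{
    assert(i >= 0 && i < SKETCH_DESC_COUNT);
    s->in_use[i] = 0;
}
```

The index returned by the get call is what would accompany an IO or readdir op, so that both sides agree on which partition carries the payload.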
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 405)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 406) Responses to kernel requests are all packaged in pvfs2_downcall_t
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 407) structs. Besides a few other members, pvfs2_downcall_t contains a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 408) union of structs, each of which is associated with a particular
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 409) response type.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 410)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 411) The several members outside of the union are:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 412)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 413) ``int32_t type``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 414) - type of operation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 415) ``int32_t status``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 416) - return code for the operation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 417) ``int64_t trailer_size``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 418) - 0 unless readdir operation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 419) ``char *trailer_buf``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 420) - initialized to NULL, used during readdir operations.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 421)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 422) The appropriate member inside the union is filled out for any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 423) particular response.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 424)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 425) PVFS2_VFS_OP_FILE_IO
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 426) fill a pvfs2_io_response_t
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 427)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 428) PVFS2_VFS_OP_LOOKUP
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 429) fill a PVFS_object_kref
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 430)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 431) PVFS2_VFS_OP_CREATE
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 432) fill a PVFS_object_kref
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 433)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 434) PVFS2_VFS_OP_SYMLINK
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 435) fill a PVFS_object_kref
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 436)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 437) PVFS2_VFS_OP_GETATTR
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 438) fill in a PVFS_sys_attr_s (tons of stuff the kernel doesn't need)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 439) fill in a string with the link target when the object is a symlink.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 440)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 441) PVFS2_VFS_OP_MKDIR
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 442) fill a PVFS_object_kref
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 443)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 444) PVFS2_VFS_OP_STATFS
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 445) fill a pvfs2_statfs_response_t with useless info <g>. It is hard for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 446) us to know, in a timely fashion, these statistics about our
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 447) distributed network filesystem.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 448)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 449) PVFS2_VFS_OP_FS_MOUNT
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 450) fill a pvfs2_fs_mount_response_t which is just like a PVFS_object_kref
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 451) except its members are in a different order and "__pad1" is replaced
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 452) with "id".
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 453)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 454) PVFS2_VFS_OP_GETXATTR
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 455) fill a pvfs2_getxattr_response_t
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 456)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 457) PVFS2_VFS_OP_LISTXATTR
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 458) fill a pvfs2_listxattr_response_t
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 459)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 460) PVFS2_VFS_OP_PARAM
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 461) fill a pvfs2_param_response_t
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 462)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 463) PVFS2_VFS_OP_PERF_COUNT
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 464) fill a pvfs2_perf_count_response_t
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 465)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 466) PVFS2_VFS_OP_FSKEY
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 467) fill a pvfs2_fs_key_response_t
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 468)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 469) PVFS2_VFS_OP_READDIR
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 470) pack everything needed to represent a pvfs2_readdir_response_t into
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 471) the readdir buffer descriptor specified in the upcall.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 472)
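As a rough illustration of the layout described above, the following
userspace C sketch models a downcall with the four members that sit outside
the union, followed by a small union. It is not the real pvfs2_downcall_t
definition: every ``sketch_`` name, the operation codes, the member order and
the fields inside the union members are hypothetical placeholders chosen for
this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical operation codes; the real values are not given here. */
enum sketch_op {
    SKETCH_OP_FILE_IO = 1,
    SKETCH_OP_STATFS  = 2
};

/* Hypothetical per-operation payloads with made-up fields. */
struct sketch_io_response {
    int64_t amt_complete;
};

struct sketch_statfs_response {
    int64_t blocks_total;
    int64_t blocks_avail;
};

/* Members outside the union come first, then one union slot per response. */
struct sketch_downcall {
    int32_t type;          /* type of operation */
    int32_t status;        /* return code for the operation */
    int64_t trailer_size;  /* 0 unless readdir */
    char *trailer_buf;     /* NULL unless readdir */
    union {
        struct sketch_io_response io;
        struct sketch_statfs_response statfs;
    } resp;
};

/* Zero the whole downcall, then set the members common to all responses. */
static void sketch_downcall_init(struct sketch_downcall *dc,
                                 int32_t type, int32_t status)
{
    assert(dc != NULL);
    memset(dc, 0, sizeof(*dc));
    dc->type = type;
    dc->status = status;
    dc->trailer_size = 0;
    dc->trailer_buf = NULL;
}
```

A responder would call ``sketch_downcall_init()`` and then fill only the union
member that matches ``type``, for example ``dc.resp.io`` for a file I/O
response.
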
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 473) Userspace uses writev() on /dev/pvfs2-req to pass responses to the requests
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 474) made by the kernel side.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 475)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 476) A buffer_list containing:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 477)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 478) - a pointer to the prepared response to the request from the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 479) kernel (struct pvfs2_downcall_t).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 480) - and also, in the case of a readdir request, a pointer to a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 481) buffer containing descriptors for the objects in the target
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 482) directory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 483)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 484) ... is sent to the function (PINT_dev_write_list) which performs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 485) the writev.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 486)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 487) PINT_dev_write_list has a local iovec array: struct iovec io_array[10];
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 488)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 489) The first four elements of io_array are initialized like this for all
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 490) responses::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 491)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 492) io_array[0].iov_base = address of local variable "proto_ver" (int32_t)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 493) io_array[0].iov_len = sizeof(int32_t)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 494)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 495) io_array[1].iov_base = address of global variable "pdev_magic" (int32_t)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 496) io_array[1].iov_len = sizeof(int32_t)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 497)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 498) io_array[2].iov_base = address of parameter "tag" (PVFS_id_gen_t)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 499) io_array[2].iov_len = sizeof(int64_t)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 500)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 501) io_array[3].iov_base = address of out_downcall member (pvfs2_downcall_t)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 502) of global variable vfs_request (vfs_request_t)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 503) io_array[3].iov_len = sizeof(pvfs2_downcall_t)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 504)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 505) Readdir responses initialize the fifth element of io_array like this::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 506)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 507) io_array[4].iov_base = contents of member trailer_buf (char *)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 508) from out_downcall member of global variable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 509) vfs_request
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 510) io_array[4].iov_len = contents of member trailer_size (PVFS_size)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 511) from out_downcall member of global variable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 512) vfs_request
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 513)
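The io_array setup above can be sketched in plain C as follows. This is a
minimal illustration, not the PINT_dev_write_list source: the ``sketch_``
names and the cut-down downcall struct are hypothetical, and the rule used to
detect a readdir response (a non-NULL trailer_buf with a positive
trailer_size) is an assumption drawn from the member descriptions earlier.
Only the standard writev() call is real API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>

#define SKETCH_MAX_IOV 10

/* Hypothetical, cut-down stand-in for pvfs2_downcall_t (union omitted). */
struct sketch_downcall {
    int32_t type;
    int32_t status;
    int64_t trailer_size;
    char *trailer_buf;
};

/*
 * Fill io_array with the four elements common to every response and,
 * when a trailer is present, a fifth one. Returns the element count.
 * The pointed-to variables must stay alive until writev() has returned.
 */
static int sketch_build_iov(struct iovec *io_array,
                            int32_t *proto_ver, int32_t *magic,
                            int64_t *tag, struct sketch_downcall *dc)
{
    int count = 4;

    assert(io_array && proto_ver && magic && tag && dc);

    io_array[0].iov_base = proto_ver;
    io_array[0].iov_len = sizeof(*proto_ver);
    io_array[1].iov_base = magic;
    io_array[1].iov_len = sizeof(*magic);
    io_array[2].iov_base = tag;
    io_array[2].iov_len = sizeof(*tag);
    io_array[3].iov_base = dc;
    io_array[3].iov_len = sizeof(*dc);

    /* Assumption: only readdir responses carry a trailer. */
    if (dc->trailer_buf != NULL && dc->trailer_size > 0) {
        io_array[4].iov_base = dc->trailer_buf;
        io_array[4].iov_len = (size_t)dc->trailer_size;
        count = 5;
    }
    return count;
}

/* Gather the pieces and hand them to the device in a single writev(). */
static ssize_t sketch_write_response(int fd, int32_t proto_ver, int32_t magic,
                                     int64_t tag, struct sketch_downcall *dc)
{
    struct iovec io_array[SKETCH_MAX_IOV];
    int count = sketch_build_iov(io_array, &proto_ver, &magic, &tag, dc);

    return writev(fd, io_array, count);
}
```

Passing the pieces as separate iovec elements lets the response be written
in one system call without first copying everything into a contiguous buffer.
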
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 514) Orangefs exploits the dcache in order to avoid sending redundant
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 515) requests to userspace. We keep object inode attributes up-to-date with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 516) orangefs_inode_getattr, which uses two arguments to help it decide
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 517) whether or not to update an inode: "new" and "bypass".
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 518) Orangefs keeps private data in an object's inode that includes a short
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 519) timeout value, getattr_time, which allows any invocation of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 520) orangefs_inode_getattr to know how long it has been since the inode was
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 521) updated. When the object is not new (new == 0) and the bypass flag is not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 522) set (bypass == 0), orangefs_inode_getattr returns without updating the inode
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 523) if getattr_time has not timed out. The getattr_time value is updated each
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 524) time the inode is updated.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 525)
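The decision described above can be modelled in userspace C roughly as
follows. This is a sketch under stated assumptions, not the kernel code: the
``sketch_`` functions are hypothetical, a 32-bit tick counter stands in for
jiffies, and getattr_time is assumed to hold the tick value at which the
cached attributes expire.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True when tick value a is earlier than b. The signed view of the
 * difference keeps the answer correct across one counter wrap, provided
 * the two values are less than half the counter range apart. The
 * unsigned-to-signed conversion relies on two's complement behaviour.
 */
static bool sketch_time_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* Assumed rule: refresh unless the object is old, not bypassed, and fresh. */
static bool sketch_getattr_needs_refresh(bool is_new, bool bypass,
                                         uint32_t now, uint32_t getattr_time)
{
    if (!is_new && !bypass && sketch_time_before(now, getattr_time))
        return false;
    return true;
}

/* Compute the next expiry; called each time the inode is updated. */
static uint32_t sketch_next_getattr_time(uint32_t now, uint32_t timeout_ticks)
{
    assert(timeout_ticks < (uint32_t)INT32_MAX);
    return now + timeout_ticks;
}
```

With these helpers, a lookup at tick 100 against an expiry of 150 skips the
trip to userspace, while setting either "new" or "bypass" forces a refresh
regardless of the expiry.
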
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 526) Creation of a new object (file, dir, sym-link) includes the evaluation of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 527) its pathname, resulting in a negative directory entry for the object.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 528) A new inode is allocated and associated with the dentry, turning it from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 529) a negative dentry into a "productive full member of society". Orangefs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 530) obtains the new inode from Linux with new_inode() and associates
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 531) the inode with the dentry by sending the pair back to Linux with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 532) d_instantiate().
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 533)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 534) The evaluation of a pathname for an object resolves to its corresponding
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 535) dentry. If there is no corresponding dentry, one is created for it in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 536) the dcache. Whenever a dentry is modified or verified, Orangefs stores a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 537) short timeout value in the dentry's d_time, and the dentry will be trusted
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 538) for that amount of time. Orangefs is a network filesystem, and objects
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 539) can potentially change out-of-band with any particular Orangefs kernel module
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 540) instance, so trusting a dentry is risky. The alternative to trusting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 541) dentries is to always obtain the needed information from userspace - at
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 542) least a trip to the client-core, maybe to the servers. Obtaining information
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 543) from a dentry is cheap, while obtaining it from userspace is relatively expensive,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 544) hence the motivation to use the dentry when possible.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 545)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 546) The timeout values d_time and getattr_time are jiffy-based, and the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 547) code is designed to avoid the jiffy-wrap problem::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 548)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 549) "In general, if the clock may have wrapped around more than once, there
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 550) is no way to tell how much time has elapsed. However, if the times t1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 551) and t2 are known to be fairly close, we can reliably compute the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 552) difference in a way that takes into account the possibility that the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 553) clock may have wrapped between times."
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 554)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 555) from course notes by instructor Andy Wang
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 556)
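The quoted observation can be demonstrated with unsigned arithmetic. The
sketch below is illustrative only and is not taken from the Orangefs source:
the ``sketch_`` functions are hypothetical and a 32-bit counter stands in for
jiffies. Because unsigned subtraction is performed modulo 2^32, the
difference t2 - t1 is the true elapsed tick count even when the counter
wrapped once between the two samples, as long as the samples are close
together.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Elapsed ticks from t1 to t2, valid across a single counter wrap. */
static uint32_t sketch_ticks_elapsed(uint32_t t1, uint32_t t2)
{
    return t2 - t1;  /* modulo 2^32 */
}

/* True once at least timeout_ticks have passed since start. */
static bool sketch_timeout_expired(uint32_t start, uint32_t now,
                                   uint32_t timeout_ticks)
{
    /* "Fairly close" requirement: keep timeouts under half the range. */
    assert(timeout_ticks <= UINT32_MAX / 2);
    return sketch_ticks_elapsed(start, now) >= timeout_ticks;
}
```

For example, with a start of 0xFFFFFFF0 and a current value of 0x10, the
counter has wrapped, yet the computed difference is 0x20 ticks.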