.. SPDX-License-Identifier: GPL-2.0

============================
Ceph Distributed File System
============================

Ceph is a distributed network file system designed to provide good
performance, reliability, and scalability.

Basic features include:

* POSIX semantics
* Seamless scaling from 1 to many thousands of nodes
* High availability and reliability. No single point of failure.
* N-way replication of data across storage nodes
* Fast recovery from node failures
* Automatic rebalancing of data on node addition/removal
* Easy deployment: most FS components are userspace daemons

Also,

* Flexible snapshots (on any directory)
* Recursive accounting (nested files, directories, bytes)

In contrast to cluster filesystems like GFS, OCFS2, and GPFS that rely
on symmetric access by all clients to shared block devices, Ceph
separates data and metadata management into independent server
clusters, similar to Lustre. Unlike Lustre, however, metadata and
storage nodes run entirely as user space daemons. File data is striped
across storage nodes in large chunks to distribute workload and
facilitate high throughputs. When storage nodes fail, data is
re-replicated in a distributed fashion by the storage nodes themselves
(with some minimal coordination from a cluster monitor), making the
system extremely efficient and scalable.

Metadata servers effectively form a large, consistent, distributed
in-memory cache above the file namespace that is extremely scalable,
dynamically redistributes metadata in response to workload changes,
and can tolerate arbitrary (well, non-Byzantine) node failures. The
metadata server takes a somewhat unconventional approach to metadata
storage to significantly improve performance for common workloads. In
particular, inodes with only a single link are embedded in
directories, allowing entire directories of dentries and inodes to be
loaded into its cache with a single I/O operation. The contents of
extremely large directories can be fragmented and managed by
independent metadata servers, allowing scalable concurrent access.

The system offers automatic data rebalancing/migration when scaling
from a small cluster of just a few nodes to many hundreds, without
requiring an administrator to carve the data set into static volumes or
go through the tedious process of migrating data between servers.
When the file system approaches full, new nodes can be easily added
and things will "just work."

Ceph includes a flexible snapshot mechanism that allows a user to create
a snapshot on any subdirectory (and its nested contents) in the
system. Snapshot creation and deletion are as simple as 'mkdir
.snap/foo' and 'rmdir .snap/foo'.

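For example, assuming a CephFS mount at /mnt/ceph and an existing
directory /mnt/ceph/some/dir (both paths are illustrative), a snapshot
can be created, listed, and removed with ordinary directory operations::

  mkdir /mnt/ceph/some/dir/.snap/my_snapshot
  ls /mnt/ceph/some/dir/.snap
  rmdir /mnt/ceph/some/dir/.snap/my_snapshot
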
Ceph also provides some recursive accounting on directories for nested
files and bytes. That is, a 'getfattr -d foo' on any directory in the
system will reveal the total number of nested regular files and
subdirectories, and a summation of all nested file sizes. This makes
the identification of large disk space consumers relatively quick, as
no 'du' or similar recursive scan of the file system is required.

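For example (the directory path is illustrative; the recursive
statistics are exposed as CephFS virtual extended attributes such as
'ceph.dir.rfiles', 'ceph.dir.rsubdirs', and 'ceph.dir.rbytes')::

  getfattr -n ceph.dir.rfiles /mnt/ceph/some/dir
  getfattr -n ceph.dir.rbytes /mnt/ceph/some/dir
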
Finally, Ceph also allows quotas to be set on any directory in the system.
The quota can restrict the number of bytes or the number of files stored
beneath that point in the directory hierarchy. Quotas can be set using
extended attributes 'ceph.quota.max_files' and 'ceph.quota.max_bytes', e.g.::

  setfattr -n ceph.quota.max_bytes -v 100000000 /some/dir
  getfattr -n ceph.quota.max_bytes /some/dir

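A quota can be removed again by setting the attribute back to 0, which
means "no limit" (the path is illustrative)::

  setfattr -n ceph.quota.max_bytes -v 0 /some/dir
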
A limitation of the current quotas implementation is that it relies on the
cooperation of the client mounting the file system to stop writers when a
limit is reached. A modified or adversarial client cannot be prevented
from writing as much data as it needs.

Mount Syntax
============

The basic mount syntax is::

  # mount -t ceph monip[:port][,monip2[:port]...]:/[subdir] mnt

You only need to specify a single monitor, as the client will get the
full list when it connects. (However, if the monitor you specify
happens to be down, the mount won't succeed.) The port can be left
off if the monitor is using the default. So if the monitor is at
1.2.3.4::

  # mount -t ceph 1.2.3.4:/ /mnt/ceph

is sufficient. If /sbin/mount.ceph is installed, a hostname can be
used instead of an IP address.
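
Multiple monitors and a subdirectory of the file system may also be
given; for example (the addresses and path are illustrative, and 6789
is the default monitor port)::

  # mount -t ceph 1.2.3.4:6789,1.2.3.5:6789:/some/subdir /mnt/ceph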


Mount Options
=============

ip=A.B.C.D[:N]
  Specify the IP and/or port the client should bind to locally.
  There is normally not much reason to do this. If the IP is not
  specified, the client's IP address is determined by looking at the
  address its connection to the monitor originates from.

wsize=X
  Specify the maximum write size in bytes. Default: 64 MB.

rsize=X
  Specify the maximum read size in bytes. Default: 64 MB.

rasize=X
  Specify the maximum readahead size in bytes. Default: 8 MB.

mount_timeout=X
  Specify the timeout value for mount (in seconds), in the case
  of a non-responsive Ceph file system. The default is 60
  seconds.

caps_max=X
  Specify the maximum number of caps to hold. Unused caps are released
  when the number of caps exceeds the limit. The default is 0 (no limit).

rbytes
  When stat() is called on a directory, set st_size to 'rbytes',
  the summation of file sizes over all files nested beneath that
  directory. This is the default.

norbytes
  When stat() is called on a directory, set st_size to the
  number of entries in that directory.

nocrc
  Disable CRC32C calculation for data writes. If set, the storage node
  must rely on TCP's checksum to detect corruption in the data payload.

dcache
  Use the dcache contents to perform negative lookups and
  readdir when the client has the entire directory contents in
  its cache. (This does not change correctness; the client uses
  cached metadata only when a lease or capability ensures it is
  valid.)

nodcache
  Do not use the dcache as above. This avoids a significant amount of
  complex code, sacrificing performance without affecting correctness,
  and is useful for tracking down bugs.

noasyncreaddir
  Do not use the dcache as above for readdir.

noquotadf
  Report overall filesystem usage in statfs instead of using the root
  directory quota.

nocopyfrom
  Don't use the RADOS 'copy-from' operation to perform remote object
  copies. Currently, it's only used in copy_file_range, which will revert
  to the default VFS implementation if this option is used.

recover_session=<no|clean>
  Set the auto reconnect mode for the case where the client is
  blocklisted. The available modes are "no" and "clean". The default
  is "no".

  * no: never attempt to reconnect when the client detects that it has
    been blocklisted. Operations will generally fail after being
    blocklisted.

  * clean: the client reconnects to the Ceph cluster automatically when
    it detects that it has been blocklisted. During the reconnect, the
    client drops dirty data/metadata and invalidates page caches and
    writable file handles. After the reconnect, file locks become stale
    because the MDS loses track of them. If an inode contains any stale
    file locks, read/write on the inode is not allowed until applications
    release all stale file locks.

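As an illustrative example only (the monitor address, mount point, and
option values are hypothetical), several of the options above can be
combined on the command line::

  # mount -t ceph 1.2.3.4:/ /mnt/ceph -o rasize=16777216,norbytes,recover_session=clean
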
More Information
================

For more information on Ceph, see the home page at
  https://ceph.com/

The Linux kernel client source tree is available at
  - https://github.com/ceph/ceph-client.git
  - git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git

and the source for the full system is at
  https://github.com/ceph/ceph.git