^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) .. _cleancache:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4) Cleancache
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) Motivation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10) Cleancache is an optional feature provided by the VFS layer that can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11) substantially increase page cache effectiveness for many workloads in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12) many environments at a negligible cost.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14) Cleancache can be thought of as a page-granularity victim cache for clean
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15) pages that the kernel's pageframe replacement algorithm (PFRA) would like
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16) to keep around, but can't since there isn't enough memory. So when the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17) PFRA "evicts" a page, it first attempts to use cleancache code to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18) put the data contained in that page into "transcendent memory", memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) that is not directly accessible or addressable by the kernel and is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20) of unknown and possibly time-varying size.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22) Later, when a cleancache-enabled filesystem wishes to access a page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) in a file on disk, it first checks cleancache to see if it already
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24) contains it; if it does, the page of data is copied into the kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25) and a disk access is avoided.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27) Transcendent memory "drivers" for cleancache are currently implemented
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28) in Xen (using hypervisor memory) and zcache (using in-kernel compressed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29) memory) and other implementations are in development.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31) :ref:`FAQs <faq>` are included below.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) Implementation Overview
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34) =======================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) A cleancache "backend" that provides transcendent memory registers itself
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37) with the kernel's cleancache "frontend" by calling cleancache_register_ops,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) passing a pointer to a cleancache_ops structure whose function pointers are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39) set appropriately. The functions provided must conform to certain semantics:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41) Most important, cleancache is "ephemeral". Pages which are copied into
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42) cleancache have an indefinite lifetime which is completely unknowable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43) by the kernel and so may or may not still be in cleancache at any later time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) Thus, as its name implies, cleancache is not suitable for dirty pages.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45) Cleancache has complete discretion over what pages to preserve and what
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) pages to discard and when.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48) When a cleancache-enabled filesystem is mounted, it should call "init_fs" to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49) obtain a pool id which, if non-negative, must be saved in the filesystem's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) superblock; a negative return value indicates failure. A "put_page" will copy a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51) (presumably about-to-be-evicted) page into cleancache and associate it with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52) the pool id, a file key, and a page index into the file. (The combination
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53) of a pool id, a file key, and an index is sometimes called a "handle".)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) A "get_page" will copy the page, if found, from cleancache into kernel memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) An "invalidate_page" will ensure the page no longer is present in cleancache;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56) an "invalidate_inode" will invalidate all pages associated with the specified
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57) file; and, when a filesystem is unmounted, an "invalidate_fs" will invalidate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58) all pages in all files specified by the given pool id and also surrender
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59) the pool id.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61) An "init_shared_fs", like init_fs, obtains a pool id but tells cleancache
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) to treat the pool as shared using a 128-bit UUID as a key. On systems
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63) that may run multiple kernels (such as hard partitioned or virtualized
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64) systems) that may share a clustered filesystem, and where cleancache
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65) may be shared among those kernels, calls to init_shared_fs that specify the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66) same UUID will receive the same pool id, thus allowing the pages to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67) be shared. Note that any security requirements must be imposed outside
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68) of the kernel (e.g. by "tools" that control cleancache). Or a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69) cleancache implementation can simply disable init_shared_fs by always
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70) returning a negative value.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72) If a get_page is successful on a non-shared pool, the page is invalidated
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73) (thus making cleancache an "exclusive" cache). On a shared pool, the page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74) is NOT invalidated on a successful get_page so that it remains accessible to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75) other sharers. The kernel is responsible for ensuring coherency between
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76) cleancache (shared or not), the page cache, and the filesystem, using
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77) cleancache invalidate operations as required.
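
The operations described above can be modelled in user space. The following is
only a sketch under stated assumptions, not the kernel interface: every
``toy_`` name, the fixed-size slot array, and the integer file key are
hypothetical and chosen purely for illustration. It shows a handle (pool id,
file key, page index), a put that may silently drop the page, and a get on a
non-shared pool that invalidates the page it returns.

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define TOY_PAGE_SIZE 4096
    #define TOY_SLOTS 64

    /* Hypothetical handle: pool id + file key + page index. */
    struct toy_handle {
        int pool_id;
        uint64_t file_key;
        uint64_t index;
    };

    struct toy_slot {
        bool used;
        struct toy_handle h;
        unsigned char data[TOY_PAGE_SIZE];
    };

    static struct toy_slot toy_store[TOY_SLOTS];

    static bool toy_same(const struct toy_handle *a, const struct toy_handle *b)
    {
        return a->pool_id == b->pool_id && a->file_key == b->file_key &&
               a->index == b->index;
    }

    static struct toy_slot *toy_find(const struct toy_handle *h)
    {
        for (size_t i = 0; i < TOY_SLOTS; i++)
            if (toy_store[i].used && toy_same(&toy_store[i].h, h))
                return &toy_store[i];
        return NULL;
    }

    /* Copy a page in, replacing any older copy. If the store is full the
     * page is silently dropped: the cache is ephemeral, so this is legal. */
    void toy_put_page(struct toy_handle h, const void *page)
    {
        struct toy_slot *s = toy_find(&h);

        for (size_t i = 0; !s && i < TOY_SLOTS; i++)
            if (!toy_store[i].used)
                s = &toy_store[i];
        if (!s)
            return;
        s->used = true;
        s->h = h;
        memcpy(s->data, page, TOY_PAGE_SIZE);
    }

    /* Copy a page out. Returns 0 on a hit and -1 on a miss. A hit frees the
     * slot, which models the "exclusive" behaviour of a non-shared pool. */
    int toy_get_page(struct toy_handle h, void *page)
    {
        struct toy_slot *s = toy_find(&h);

        if (!s)
            return -1;
        memcpy(page, s->data, TOY_PAGE_SIZE);
        s->used = false;
        return 0;
    }

    /* Ensure one page is no longer present. */
    void toy_invalidate_page(struct toy_handle h)
    {
        struct toy_slot *s = toy_find(&h);

        if (s)
            s->used = false;
    }

    /* Drop every page belonging to one file in one pool. */
    void toy_invalidate_inode(int pool_id, uint64_t file_key)
    {
        for (size_t i = 0; i < TOY_SLOTS; i++)
            if (toy_store[i].h.pool_id == pool_id &&
                toy_store[i].h.file_key == file_key)
                toy_store[i].used = false;
    }

    /* Drop every page in a pool, as happens at unmount. */
    void toy_invalidate_fs(int pool_id)
    {
        for (size_t i = 0; i < TOY_SLOTS; i++)
            if (toy_store[i].h.pool_id == pool_id)
                toy_store[i].used = false;
    }

In this model a caller that puts a page and then gets it twice sees one hit
followed by one miss, because the first successful get removes the page.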
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79) Note that cleancache must enforce put-put-get coherency and get-get
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80) coherency. For the former, if two puts are made to the same handle but
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81) with different data, say AAA by the first put and BBB by the second, a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82) subsequent get can never return the stale data (AAA). For get-get coherency,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83) if a get for a given handle fails, subsequent gets for that handle will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84) never succeed unless preceded by a successful put with that handle.
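
One way a backend might satisfy both rules is sketched below for a single
handle. This is an illustrative user-space model, not kernel code: the
``toy_`` names and the ``accept`` parameter (which stands in for the backend's
own decision to keep or decline a page) are assumptions. The key point is that
a put drops any older copy before deciding whether to store the new one, so a
declined second put can never leave AAA behind.

.. code-block:: c

    #include <stdbool.h>
    #include <string.h>

    #define TOY_DATA_LEN 4

    /* Hypothetical storage for exactly one handle. */
    struct toy_entry {
        bool present;
        char data[TOY_DATA_LEN];
    };

    /* accept == false models a backend that declines to keep the page. */
    void toy_put(struct toy_entry *e, const char *data, bool accept)
    {
        /* Drop the older copy first so stale data can never be returned. */
        e->present = false;
        if (!accept)
            return;
        memcpy(e->data, data, TOY_DATA_LEN);
        e->present = true;
    }

    /* Returns 0 on a hit and -1 on a miss; a hit removes the entry. */
    int toy_get(struct toy_entry *e, char *out)
    {
        if (!e->present)
            return -1;
        memcpy(out, e->data, TOY_DATA_LEN);
        e->present = false;
        return 0;
    }

With this model, putting AAA and then BBB yields either BBB or a miss on the
next get, never AAA; and once a get misses, every later get also misses until
another put is accepted.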
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86) Last, cleancache provides no SMP serialization guarantees; if two
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87) different Linux threads are simultaneously putting and invalidating a page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88) with the same handle, the results are indeterminate. Callers must
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89) lock the page to ensure serial behavior.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91) Cleancache Performance Metrics
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92) ==============================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94) If properly configured, monitoring of cleancache is done via debugfs in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95) the ``/sys/kernel/debug/cleancache`` directory. The effectiveness of cleancache
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96) can be measured (across all filesystems) with:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98) ``succ_gets``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99) number of gets that were successful
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) ``failed_gets``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) number of gets that failed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) ``puts``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) number of puts attempted (all "succeed")
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) ``invalidates``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) number of invalidates attempted
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) A backend implementation may provide additional metrics.
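
As a rough illustration of how these counters might be used, the sketch below
computes a hit ratio from ``succ_gets`` and ``failed_gets``. The helper names
are hypothetical, and the code assumes each debugfs file contains a single
unsigned decimal number; that format is an assumption, not something stated
above.

.. code-block:: c

    #include <stdio.h>

    /* Fraction of gets that hit, or -1.0 if no gets have been attempted. */
    double toy_hit_ratio(unsigned long long succ_gets,
                         unsigned long long failed_gets)
    {
        unsigned long long total = succ_gets + failed_gets;

        if (total == 0)
            return -1.0;
        return (double)succ_gets / (double)total;
    }

    /* Read one counter file. Returns 0 on success and -1 on any error. */
    int toy_read_counter(const char *path, unsigned long long *value)
    {
        FILE *f = fopen(path, "r");
        int ok;

        if (!f)
            return -1;
        ok = fscanf(f, "%llu", value) == 1;
        fclose(f);
        return ok ? 0 : -1;
    }

A caller would read ``/sys/kernel/debug/cleancache/succ_gets`` and
``/sys/kernel/debug/cleancache/failed_gets`` with ``toy_read_counter`` and pass
both values to ``toy_hit_ratio``. Reading debugfs normally requires root and a
mounted debugfs.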
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) .. _faq:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) FAQ
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) ===
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) * Where's the value? (Andrew Morton)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) Cleancache provides a significant performance benefit to many workloads
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) in many environments with negligible overhead by improving the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) effectiveness of the pagecache. Clean pagecache pages are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) saved in transcendent memory (RAM that is otherwise not directly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) addressable to the kernel); fetching those pages later avoids "refaults"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) and thus disk reads.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) Cleancache (and its sister code "frontswap") provides interfaces for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) this transcendent memory (aka "tmem"), which conceptually lies between
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) fast kernel-directly-addressable RAM and slower DMA/asynchronous devices.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) Disallowing direct kernel or userland reads/writes to tmem
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) is ideal when data is transformed to a different form and size (such
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) as with compression) or secretly moved (as might be useful for write-
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) balancing for some RAM-like devices). Evicted page-cache pages (and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) swap pages) are a great use for this kind of slower-than-RAM-but-much-
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) faster-than-disk transcendent memory, and the cleancache (and frontswap)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) "page-object-oriented" specification provides a nice way to read and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) write -- and indirectly "name" -- the pages.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) In the virtual case, the whole point of virtualization is to statistically
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) multiplex physical resources across the varying demands of multiple
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) virtual machines. This is really hard to do with RAM and efforts to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) do it well with no kernel change have essentially failed (except in some
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) well-publicized special-case workloads). Cleancache -- and frontswap --
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) with a fairly small impact on the kernel, provide a great deal of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) flexibility for more dynamic RAM multiplexing.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) Specifically, the Xen Transcendent Memory backend allows otherwise
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146) "fallow" hypervisor-owned RAM not only to be "time-shared" between multiple
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) virtual machines, but also to be compressed and deduplicated to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) optimize RAM utilization. And when guest OSes are induced to surrender
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) underutilized RAM (e.g. with "self-ballooning"), page cache pages
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150) are the first to go, and cleancache allows those pages to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151) saved and reclaimed if overall host system memory conditions allow.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153) And the identical interface used for cleancache can be used in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) physical systems as well. The zcache driver acts as a memory-hungry
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) device that stores pages of data in a compressed state. And
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) the proposed "RAMster" driver shares RAM across multiple physical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) systems.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159) * Why does cleancache have its sticky fingers so deep inside the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) filesystems and VFS? (Andrew Morton and Christoph Hellwig)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162) The core hooks for cleancache in VFS are in most cases a single line
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) and the minimum set is placed precisely where needed to maintain
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164) coherency (via cleancache_invalidate operations) between cleancache,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) the page cache, and disk. All hooks compile into nothingness if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) cleancache is config'ed off and turn into a function-pointer-
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) compare-to-NULL if config'ed on but no backend claims the ops
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) functions, or into a compare-struct-element-to-negative if a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) backend claims the ops functions but a filesystem doesn't enable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) cleancache.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) Some filesystems are built entirely on top of VFS, and the hooks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) in VFS are sufficient, so they don't require an "init_fs" hook; the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174) initial implementation of cleancache didn't provide this hook.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) But for some filesystems (such as btrfs), the VFS hooks are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) incomplete and one or more hooks in fs-specific code are required.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177) And for some other filesystems, such as tmpfs, cleancache may
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178) be counterproductive. So it seemed prudent to require a filesystem
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179) to "opt in" to use cleancache, which requires adding a hook in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180) each filesystem. Some filesystems are not yet supported by cleancache
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181) only because they haven't been tested. The existing set should
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182) be sufficient to validate the concept, the opt-in approach means
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183) that untested filesystems are not affected, and the hooks in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184) existing filesystems should make it very easy to add more
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185) filesystems in the future.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187) The total impact of the hooks to existing fs and mm files is only
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188) about 40 lines added (not counting comments and blank lines).
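
The cost argument in this answer can be pictured with a small user-space model
of a hook. This is a sketch only: the ``toy_`` structures and functions are
hypothetical stand-ins, not the kernel's definitions. When no backend is
registered the hook costs one pointer comparison; when a backend exists but the
filesystem has not opted in, it costs one integer comparison on the saved pool
id.

.. code-block:: c

    #include <stddef.h>

    /* Hypothetical, cut-down ops table with a single operation. */
    struct toy_ops {
        int (*get_page)(int pool_id, unsigned long index);
    };

    /* Hypothetical superblock: only the saved pool id matters here. */
    struct toy_sb {
        int pool_id;    /* negative means "filesystem did not opt in" */
    };

    static const struct toy_ops *toy_backend;    /* NULL until registered */

    void toy_register_ops(const struct toy_ops *ops)
    {
        toy_backend = ops;
    }

    /* The hook itself: two cheap checks before any backend work happens. */
    int toy_hook_get_page(const struct toy_sb *sb, unsigned long index)
    {
        if (toy_backend == NULL)    /* no backend claimed the ops */
            return -1;
        if (sb->pool_id < 0)        /* this filesystem is not enabled */
            return -1;
        return toy_backend->get_page(sb->pool_id, index);
    }

    /* Example backend that only counts how often it is reached. */
    int toy_backend_calls;

    int toy_counting_get_page(int pool_id, unsigned long index)
    {
        (void)pool_id;
        (void)index;
        toy_backend_calls++;
        return 0;
    }

    const struct toy_ops toy_counting_ops = {
        .get_page = toy_counting_get_page,
    };

Only a filesystem with a non-negative pool id, on a system where
``toy_register_ops`` has been called, ever reaches the backend function.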
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190) * Why not make cleancache asynchronous and batched so it can more
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191) easily interface with real devices with DMA instead of copying each
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192) individual page? (Minchan Kim)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194) The one-page-at-a-time copy semantics simplifies the implementation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195) on both the frontend and backend and also allows the backend to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196) do fancy things on-the-fly like page compression and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197) page deduplication. And since the data is "gone" (copied into/out
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198) of the pageframe) before the cleancache get/put call returns,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199) a great many race conditions and potential coherency issues
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200) are avoided. While the interface seems odd for a "real device"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201) or for real kernel-addressable RAM, it makes perfect sense for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202) transcendent memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204) * Why is non-shared cleancache "exclusive"? And where is the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205) page "invalidated" after a "get"? (Minchan Kim)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 206)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 207) The main reason is to free up space in transcendent memory and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 208) to avoid unnecessary cleancache_invalidate calls. If you want inclusive,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 209) the page can be "put" immediately following the "get". If
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 210) put-after-get for inclusive becomes common, the interface could
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 211) be easily extended to add a "get_no_invalidate" call.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 212)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 213) The invalidate is done by the cleancache backend implementation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 214)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 215) * What's the performance impact?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 216)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 217) Performance analysis has been presented at OLS'09 and LCA'10.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 218) Briefly, performance gains can be significant on most workloads,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 219) especially when memory pressure is high (e.g. when RAM is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 220) overcommitted in a virtual workload); and because the hooks are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 221) invoked primarily in place of or in addition to a disk read/write,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 222) overhead is negligible even in worst case workloads. Basically
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 223) cleancache replaces I/O with memory-copy-CPU-overhead; on older
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 224) single-core systems with slow memory-copy speeds, cleancache
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 225) has little value, but in newer multicore machines, especially
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 226) consolidated/virtualized machines, it has great value.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 227)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 228) * How do I add cleancache support for filesystem X? (Boaz Harrosh)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 229)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 230) Filesystems that are well-behaved and conform to certain
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 231) restrictions can utilize cleancache simply by making a call to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 232) cleancache_init_fs at mount time. Unusual, misbehaving, or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 233) poorly layered filesystems must add additional hooks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 234) and/or undergo extensive additional testing... or should just
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 235) not enable the optional cleancache.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 236)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 237) Some points for a filesystem to consider:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 238)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 239) - The FS should be block-device-based (e.g. a ram-based FS such
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 240) as tmpfs should not enable cleancache)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 241) - To ensure coherency/correctness, the FS must ensure that all
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 242) file removal or truncation operations either go through VFS or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 243) add hooks to do the equivalent cleancache "invalidate" operations
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 244) - To ensure coherency/correctness, either inode numbers must
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 245) be unique across the lifetime of the on-disk file OR the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 246) FS must provide an "encode_fh" function.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 247) - The FS must call the VFS superblock alloc and deactivate routines
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 248) or add hooks to do the equivalent cleancache calls done there.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 249) - To maximize performance, all pages fetched from the FS should
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 250) go through the do_mpage_readpage routine or the FS should add
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 251) hooks to do the equivalent (cf. btrfs)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 252) - Currently, the FS blocksize must be the same as PAGE_SIZE. This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 253) is not an architectural restriction, but no backends currently
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 254) support anything different.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 255) - A clustered FS should invoke the "init_shared_fs" cleancache
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 256) hook to get best performance for some backends.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 257)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 258) * Why not use the KVA of the inode as the key? (Christoph Hellwig)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 259)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 260) If cleancache used the inode virtual address instead of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 261) inode/filehandle, the pool id could be eliminated. But, this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 262) won't work because cleancache retains pagecache data pages
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 263) persistently even when the inode has been pruned from the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 264) inode unused list, and only invalidates the data page if the file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 265) gets removed/truncated. So if cleancache used the inode kva,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 266) there would be potential coherency issues if/when the inode
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 267) kva is reused for a different file. Alternatively, if cleancache
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 268) invalidated the pages when the inode kva was freed, much of the value
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 269) of cleancache would be lost because the cache of pages in cleancache
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 270) is potentially much larger than the kernel pagecache and is most
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 271) useful if the pages survive inode cache removal.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 272)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 273) * Why is a global variable required?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 274)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 275) The cleancache_enabled flag is checked in all of the frequently-used
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 276) cleancache hooks. The alternative is a function call to check a static
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 277) variable. Since cleancache is enabled dynamically at runtime, systems
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 278) that don't enable cleancache would suffer thousands (possibly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 279) tens-of-thousands) of unnecessary function calls per second. So the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 280) global variable allows cleancache to be enabled by default at compile
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 281) time, but have insignificant performance impact when cleancache remains
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 282) disabled at runtime.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 283)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 284) * Does cleancache work with KVM?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 285)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 286) The memory model of KVM is sufficiently different that a cleancache
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 287) backend may have less value for KVM. This remains to be tested,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 288) especially in an overcommitted system.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 289)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 290) * Does cleancache work in userspace? It sounds useful for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 291) memory hungry caches like web browsers. (Jamie Lokier)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 292)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 293) No plans yet, though we agree it sounds useful, at least for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 294) apps that bypass the page cache (e.g. O_DIRECT).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 295)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 296) Last updated: Dan Magenheimer, April 13 2011