Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   1) .. _cleancache:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   2) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   3) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   4) Cleancache
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   5) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   6) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   7) Motivation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   8) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   9) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  10) Cleancache is a new optional feature provided by the VFS layer that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  11) potentially dramatically increases page cache effectiveness for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  12) many workloads in many environments at a negligible cost.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  13) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  14) Cleancache can be thought of as a page-granularity victim cache for clean
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  15) pages that the kernel's pageframe replacement algorithm (PFRA) would like
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  16) to keep around, but can't since there isn't enough memory.  So when the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  17) PFRA "evicts" a page, it first attempts to use cleancache code to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  18) put the data contained in that page into "transcendent memory", memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  19) that is not directly accessible or addressable by the kernel and is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  20) of unknown and possibly time-varying size.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  21) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  22) Later, when a cleancache-enabled filesystem wishes to access a page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  23) in a file on disk, it first checks cleancache to see if it already
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  24) contains it; if it does, the page of data is copied into the kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  25) and a disk access is avoided.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  26) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  27) Transcendent memory "drivers" for cleancache are currently implemented
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  28) in Xen (using hypervisor memory) and zcache (using in-kernel compressed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  29) memory) and other implementations are in development.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  30) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  31) :ref:`FAQs <faq>` are included below.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  32) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  33) Implementation Overview
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  34) =======================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  35) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  36) A cleancache "backend" that provides transcendent memory registers itself
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  37) with the kernel's cleancache "frontend" by calling cleancache_register_ops,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  38) passing a pointer to a cleancache_ops structure with its function pointers set.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  39) The functions provided must conform to certain semantics as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  40) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  41) Most important, cleancache is "ephemeral".  Pages which are copied into
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  42) cleancache have an indefinite lifetime which is completely unknowable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  43) by the kernel and so may or may not still be in cleancache at any later time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  44) Thus, as its name implies, cleancache is not suitable for dirty pages.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  45) Cleancache has complete discretion over what pages to preserve and what
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  46) pages to discard and when.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  47) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  48) Mounting a cleancache-enabled filesystem should call "init_fs" to obtain a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  49) pool id which, if positive, must be saved in the filesystem's superblock;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  50) a negative return value indicates failure.  A "put_page" will copy a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  51) (presumably about-to-be-evicted) page into cleancache and associate it with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  52) the pool id, a file key, and a page index into the file.  (The combination
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  53) of a pool id, a file key, and an index is sometimes called a "handle".)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  54) A "get_page" will copy the page, if found, from cleancache into kernel memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  55) An "invalidate_page" will ensure the page no longer is present in cleancache;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  56) an "invalidate_inode" will invalidate all pages associated with the specified
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  57) file; and, when a filesystem is unmounted, an "invalidate_fs" will invalidate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  58) all pages in all files specified by the given pool id and also surrender
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  59) the pool id.
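
The handle-based operations above can be modeled in userspace with a small
fixed-size table.  This is only a sketch under stated assumptions: every
``toy_`` name, the slot count, and the page size are hypothetical and are not
part of the kernel interface, and the exclusivity of a get is not modeled here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096	/* assumed page size for the model */
#define TOY_SLOTS 4		/* assumed capacity of the toy backend */

/* Hypothetical handle: pool id + file key + page index. */
struct toy_handle {
	int pool_id;
	uint64_t file_key;
	uint64_t index;
};

struct toy_slot {
	int used;
	struct toy_handle h;
	unsigned char data[TOY_PAGE_SIZE];
};

static struct toy_slot toy_slots[TOY_SLOTS];

static struct toy_slot *toy_find(struct toy_handle h)
{
	for (int i = 0; i < TOY_SLOTS; i++) {
		struct toy_slot *s = &toy_slots[i];

		if (s->used && s->h.pool_id == h.pool_id &&
		    s->h.file_key == h.file_key && s->h.index == h.index)
			return s;
	}
	return NULL;
}

/* Copy a page in.  With no free slot the page is silently dropped. */
void toy_put_page(struct toy_handle h, const unsigned char *page)
{
	struct toy_slot *s = toy_find(h);

	for (int i = 0; i < TOY_SLOTS && !s; i++)
		if (!toy_slots[i].used)
			s = &toy_slots[i];
	if (!s)
		return;
	s->used = 1;
	s->h = h;
	memcpy(s->data, page, TOY_PAGE_SIZE);
}

/* Copy a page out if present: 0 on a hit, -1 on a miss. */
int toy_get_page(struct toy_handle h, unsigned char *page)
{
	struct toy_slot *s = toy_find(h);

	if (!s)
		return -1;
	memcpy(page, s->data, TOY_PAGE_SIZE);
	return 0;
}

/* Ensure one page is no longer present. */
void toy_invalidate_page(struct toy_handle h)
{
	struct toy_slot *s = toy_find(h);

	if (s)
		s->used = 0;
}

/* Drop every page that belongs to one file in one pool. */
void toy_invalidate_inode(int pool_id, uint64_t file_key)
{
	for (int i = 0; i < TOY_SLOTS; i++)
		if (toy_slots[i].used && toy_slots[i].h.pool_id == pool_id &&
		    toy_slots[i].h.file_key == file_key)
			toy_slots[i].used = 0;
}

/* Drop every page in the pool, as at unmount time. */
void toy_invalidate_fs(int pool_id)
{
	for (int i = 0; i < TOY_SLOTS; i++)
		if (toy_slots[i].used && toy_slots[i].h.pool_id == pool_id)
			toy_slots[i].used = 0;
}
```

Because a put may be dropped without notice, a caller of this model can never
assume that a later get will hit; that is the "ephemeral" property in miniature.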
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  60) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  61) An "init_shared_fs", like init_fs, obtains a pool id but tells cleancache
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  62) to treat the pool as shared using a 128-bit UUID as a key.  On systems
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  63) that may run multiple kernels (such as hard partitioned or virtualized
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  64) systems) that may share a clustered filesystem, and where cleancache
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  65) may be shared among those kernels, calls to init_shared_fs that specify the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  66) same UUID will receive the same pool id, thus allowing the pages to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  67) be shared.  Note that any security requirements must be imposed outside
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  68) of the kernel (e.g. by "tools" that control cleancache).  Or a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  69) cleancache implementation can simply disable init_shared_fs by always
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  70) returning a negative value.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  71) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  72) If a get_page is successful on a non-shared pool, the page is invalidated
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  73) (thus making cleancache an "exclusive" cache).  On a shared pool, the page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  74) is NOT invalidated on a successful get_page so that it remains accessible to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  75) other sharers.  The kernel is responsible for ensuring coherency between
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  76) cleancache (shared or not), the page cache, and the filesystem, using
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  77) cleancache invalidate operations as required.
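
The difference between private and shared pools can be sketched as follows.
This is a userspace model, not kernel code: the ``toy_`` names are hypothetical,
each pool holds a single one-byte "page" for brevity, and pool ids simply index
a small array.

```c
#include <stddef.h>
#include <string.h>

#define TOY_POOLS 4	/* assumed maximum number of pools */

struct toy_pool {
	int used;
	int shared;
	unsigned char uuid[16];	/* 128-bit key, meaningful only if shared */
	int has_page;
	unsigned char page;	/* one-byte stand-in for a page of data */
};

static struct toy_pool toy_pools[TOY_POOLS];

static int toy_new_pool(int shared, const unsigned char *uuid)
{
	for (int id = 0; id < TOY_POOLS; id++) {
		if (toy_pools[id].used)
			continue;
		memset(&toy_pools[id], 0, sizeof(toy_pools[id]));
		toy_pools[id].used = 1;
		toy_pools[id].shared = shared;
		if (uuid)
			memcpy(toy_pools[id].uuid, uuid, 16);
		return id;
	}
	return -1;	/* negative value: failure */
}

/* Private pool: every call yields a fresh pool id. */
int toy_init_fs(void)
{
	return toy_new_pool(0, NULL);
}

/* Shared pool: callers that pass the same UUID get the same pool id. */
int toy_init_shared_fs(const unsigned char uuid[16])
{
	for (int id = 0; id < TOY_POOLS; id++)
		if (toy_pools[id].used && toy_pools[id].shared &&
		    memcmp(toy_pools[id].uuid, uuid, 16) == 0)
			return id;
	return toy_new_pool(1, uuid);
}

void toy_put(int id, unsigned char page)
{
	toy_pools[id].page = page;
	toy_pools[id].has_page = 1;
}

/* 0 on a hit, -1 on a miss.  A hit on a private pool invalidates the page. */
int toy_get(int id, unsigned char *page)
{
	struct toy_pool *p = &toy_pools[id];

	if (!p->has_page)
		return -1;
	*page = p->page;
	if (!p->shared)
		p->has_page = 0;	/* exclusive cache behavior */
	return 0;
}
```

In this model a second get on a private pool misses, while any number of
sharers can keep reading the same page from a shared pool.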
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  78) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  79) Note that cleancache must enforce put-put-get coherency and get-get
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  80) coherency.  For the former, if two puts are made to the same handle but
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  81) with different data, say AAA by the first put and BBB by the second, a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  82) subsequent get can never return the stale data (AAA).  For get-get coherency,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  83) if a get for a given handle fails, subsequent gets for that handle will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  84) never succeed unless preceded by a successful put with that handle.
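
One way a backend might satisfy both rules is to drop the old copy before it
decides whether to keep the new one.  The sketch below models a single handle
in userspace; the ``toy_`` names and the ``accept`` parameter (which simulates
a backend that declines to store a page) are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdio.h>

struct toy_entry {
	int present;
	char data[8];
};

static struct toy_entry toy_cached;	/* state for one handle */

/*
 * Put new data for the handle.  The old copy is discarded first, so even
 * when the backend declines the new data (accept == 0), a later get can
 * only miss; it can never return the stale copy.
 */
void toy_put(const char *data, int accept)
{
	toy_cached.present = 0;
	if (!accept)
		return;
	snprintf(toy_cached.data, sizeof(toy_cached.data), "%s", data);
	toy_cached.present = 1;
}

/*
 * Return the data on a hit, NULL on a miss.  Nothing but toy_put() can
 * make the entry present, so a miss stays a miss until the next
 * successful put.
 */
const char *toy_get(void)
{
	return toy_cached.present ? toy_cached.data : NULL;
}
```

Putting AAA, then a declined put of BBB, leaves the handle empty rather than
holding AAA, and repeated gets keep missing until a put is accepted.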
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  85) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  86) Last, cleancache provides no SMP serialization guarantees; if two
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  87) different Linux threads are simultaneously putting and invalidating a page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  88) with the same handle, the results are indeterminate.  Callers must
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  89) lock the page to ensure serial behavior.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  90) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  91) Cleancache Performance Metrics
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  92) ==============================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  93) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  94) If debugfs is properly configured, cleancache can be monitored via the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  95) ``/sys/kernel/debug/cleancache`` directory.  The effectiveness of cleancache
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  96) can be measured (across all filesystems) with:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  97) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  98) ``succ_gets``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  99) 	number of gets that were successful
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) ``failed_gets``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) 	number of gets that failed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) ``puts``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) 	number of puts attempted (all "succeed")
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) ``invalidates``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) 	number of invalidates attempted
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) A backend implementation may provide additional metrics.
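
A hit ratio can be derived from ``succ_gets`` and ``failed_gets``.  The helper
below is a hedged userspace sketch: the ``toy_`` names are hypothetical, and it
assumes each counter file holds a single decimal number, which should be checked
on the running system.

```c
#include <stdio.h>

/* Read one decimal counter from a file: 0 on success, -1 on any error. */
int toy_read_counter(const char *path, unsigned long long *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%llu", value) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/* Fraction of gets that hit; 0.0 when no gets were made at all. */
double toy_hit_ratio(unsigned long long succ_gets,
		     unsigned long long failed_gets)
{
	unsigned long long total = succ_gets + failed_gets;

	return total ? (double)succ_gets / (double)total : 0.0;
}
```

A caller would read ``/sys/kernel/debug/cleancache/succ_gets`` and
``/sys/kernel/debug/cleancache/failed_gets`` with ``toy_read_counter`` and pass
both values to ``toy_hit_ratio``; reading debugfs usually requires root.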
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) .. _faq:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) FAQ
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) ===
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) * Where's the value? (Andrew Morton)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) Cleancache provides a significant performance benefit to many workloads
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) in many environments with negligible overhead by improving the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) effectiveness of the pagecache.  Clean pagecache pages are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) saved in transcendent memory (RAM that is otherwise not directly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) addressable to the kernel); fetching those pages later avoids "refaults"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) and thus disk reads.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) Cleancache (and its sister code "frontswap") provide interfaces for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) this transcendent memory (aka "tmem"), which conceptually lies between
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) fast kernel-directly-addressable RAM and slower DMA/asynchronous devices.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) Disallowing direct kernel or userland reads/writes to tmem
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) is ideal when data is transformed to a different form and size (such
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) as with compression) or secretly moved (as might be useful for write-
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) balancing for some RAM-like devices).  Evicted page-cache pages (and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) swap pages) are a great use for this kind of slower-than-RAM-but-much-
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) faster-than-disk transcendent memory, and the cleancache (and frontswap)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) "page-object-oriented" specification provides a nice way to read and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) write -- and indirectly "name" -- the pages.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) In the virtual case, the whole point of virtualization is to statistically
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) multiplex physical resources across the varying demands of multiple
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) virtual machines.  This is really hard to do with RAM and efforts to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) do it well with no kernel change have essentially failed (except in some
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) well-publicized special-case workloads).  Cleancache -- and frontswap --
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) with a fairly small impact on the kernel, provide considerable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) flexibility for more dynamic RAM multiplexing.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) Specifically, the Xen Transcendent Memory backend allows otherwise
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146) "fallow" hypervisor-owned RAM not only to be "time-shared" between multiple
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) virtual machines, but also to be compressed and deduplicated to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) optimize RAM utilization.  And when guest OSes are induced to surrender
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) underutilized RAM (e.g. with "self-ballooning"), page cache pages
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150) are the first to go, and cleancache allows those pages to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151) saved and reclaimed if overall host system memory conditions allow.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153) And the identical interface used for cleancache can be used in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) physical systems as well.  The zcache driver acts as a memory-hungry
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) device that stores pages of data in a compressed state.  And
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) the proposed "RAMster" driver shares RAM across multiple physical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) systems.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159) * Why does cleancache have its sticky fingers so deep inside the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160)   filesystems and VFS? (Andrew Morton and Christoph Hellwig)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162) The core hooks for cleancache in VFS are in most cases a single line
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) and the minimum set are placed precisely where needed to maintain
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164) coherency (via cleancache_invalidate operations) between cleancache,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) the page cache, and disk.  All hooks compile into nothingness if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) cleancache is config'ed off and turn into a function-pointer-
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) compare-to-NULL if config'ed on but no backend claims the ops
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) functions, or to a compare-struct-element-to-negative if a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) backend claims the ops functions but a filesystem doesn't enable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) cleancache.
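
The three cases described above can be illustrated with a userspace sketch.
It is not the kernel's actual hook code: ``TOY_CONFIG_CLEANCACHE`` and all
``toy_`` names are hypothetical stand-ins for the config option, the ops
pointer, and the pool id kept in the superblock.

```c
#include <stddef.h>

#define TOY_CONFIG_CLEANCACHE 1	/* set to 0 to compile the hook away */

struct toy_ops {
	void (*put_page)(int pool_id, long index);
};

struct toy_super_block {
	int cleancache_poolid;	/* negative: filesystem did not opt in */
};

static const struct toy_ops *toy_registered_ops;	/* NULL: no backend */
int toy_backend_puts;	/* counts calls that reached the backend */

static void toy_backend_put_page(int pool_id, long index)
{
	(void)pool_id;
	(void)index;
	toy_backend_puts++;
}

static const struct toy_ops toy_backend_ops = {
	.put_page = toy_backend_put_page,
};

/* A backend "claims the ops functions" by registering its table. */
void toy_register_backend(void)
{
	toy_registered_ops = &toy_backend_ops;
}

#if TOY_CONFIG_CLEANCACHE
static inline void toy_hook_put_page(const struct toy_super_block *sb,
				     long index)
{
	if (toy_registered_ops == NULL)	/* no backend: pointer compare only */
		return;
	if (sb->cleancache_poolid < 0)	/* fs not opted in: integer compare */
		return;
	toy_registered_ops->put_page(sb->cleancache_poolid, index);
}
#else
/* Configured off: the hook is an empty inline the compiler can discard. */
static inline void toy_hook_put_page(const struct toy_super_block *sb,
				     long index)
{
	(void)sb;
	(void)index;
}
#endif
```

Only a hook call made after registration, on a superblock with a non-negative
pool id, reaches the backend; every other path costs at most two comparisons.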
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) Some filesystems are built entirely on top of VFS and the hooks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) in VFS are sufficient, so don't require an "init_fs" hook; the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174) initial implementation of cleancache didn't provide this hook.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) But for some filesystems (such as btrfs), the VFS hooks are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) incomplete and one or more hooks in fs-specific code are required.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177) And for some other filesystems, such as tmpfs, cleancache may
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178) be counterproductive.  So it seemed prudent to require a filesystem
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179) to "opt in" to use cleancache, which requires adding a hook in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180) each filesystem.  Some filesystems are not yet supported by cleancache
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181) simply because they haven't been tested.  The existing set should
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182) be sufficient to validate the concept; the opt-in approach means
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183) that untested filesystems are not affected; and the hooks in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184) existing filesystems should make it very easy to add more
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185) filesystems in the future.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187) The total impact of the hooks to existing fs and mm files is only
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188) about 40 lines added (not counting comments and blank lines).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190) * Why not make cleancache asynchronous and batched so it can more
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191)   easily interface with real devices with DMA instead of copying each
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192)   individual page? (Minchan Kim)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194) The one-page-at-a-time copy semantics simplify the implementation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195) on both the frontend and backend and also allows the backend to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196) do fancy things on-the-fly like page compression and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197) page deduplication.  And since the data is "gone" (copied into/out
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198) of the pageframe) before the cleancache get/put call returns,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199) a great many race conditions and potential coherency issues
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200) are avoided.  While the interface seems odd for a "real device"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201) or for real kernel-addressable RAM, it makes perfect sense for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202) transcendent memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204) * Why is non-shared cleancache "exclusive"?  And where is the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205)   page "invalidated" after a "get"? (Minchan Kim)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 206) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 207) The main reason is to free up space in transcendent memory and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 208) to avoid unnecessary cleancache_invalidate calls.  If you want inclusive,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 209) the page can be "put" immediately following the "get".  If
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 210) put-after-get for inclusive becomes common, the interface could
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 211) be easily extended to add a "get_no_invalidate" call.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 212) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 213) The invalidate is done by the cleancache backend implementation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 214) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 215) * What's the performance impact?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 216) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 217) Performance analysis has been presented at OLS'09 and LCA'10.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 218) Briefly, performance gains can be significant on most workloads,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 219) especially when memory pressure is high (e.g. when RAM is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 220) overcommitted in a virtual workload); and because the hooks are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 221) invoked primarily in place of or in addition to a disk read/write,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 222) overhead is negligible even in worst case workloads.  Basically
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 223) cleancache replaces I/O with memory-copy-CPU-overhead; on older
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 224) single-core systems with slow memory-copy speeds, cleancache
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 225) has little value, but in newer multicore machines, especially
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 226) consolidated/virtualized machines, it has great value.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 227) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 228) * How do I add cleancache support for filesystem X? (Boaz Harrosh)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 229) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 230) Filesystems that are well-behaved and conform to certain
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 231) restrictions can utilize cleancache simply by making a call to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 232) cleancache_init_fs at mount time.  Unusual, misbehaving, or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 233) poorly layered filesystems must either add additional hooks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 234) and/or undergo extensive additional testing... or should just
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 235) not enable the optional cleancache.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 236) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 237) Some points for a filesystem to consider:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 238) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 239)   - The FS should be block-device-based (e.g. a ram-based FS such
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 240)     as tmpfs should not enable cleancache)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 241)   - To ensure coherency/correctness, the FS must ensure that all
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 242)     file removal or truncation operations either go through VFS or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 243)     add hooks to do the equivalent cleancache "invalidate" operations
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 244)   - To ensure coherency/correctness, either inode numbers must
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 245)     be unique across the lifetime of the on-disk file OR the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 246)     FS must provide an "encode_fh" function.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 247)   - The FS must call the VFS superblock alloc and deactivate routines
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 248)     or add hooks to do the equivalent cleancache calls done there.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 249)   - To maximize performance, all pages fetched from the FS should
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 250)     go through the do_mpage_readpage routine or the FS should add
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 251)     hooks to do the equivalent (cf. btrfs)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 252)   - Currently, the FS blocksize must be the same as PAGESIZE.  This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 253)     is not an architectural restriction, but no backends currently
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 254)     support anything different.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 255)   - A clustered FS should invoke the "init_shared_fs" cleancache
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 256)     hook to get best performance for some backends.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 257) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 258) * Why not use the KVA of the inode as the key? (Christoph Hellwig)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 259) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 260) If cleancache used the inode virtual address instead of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 261) inode/filehandle, the pool id could be eliminated.  But, this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 262) won't work because cleancache retains pagecache data pages
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 263) persistently even when the inode has been pruned from the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 264) inode unused list, and only invalidates the data page if the file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 265) gets removed/truncated.  So if cleancache used the inode kva,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 266) there would be potential coherency issues if/when the inode
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 267) kva is reused for a different file.  Alternatively, if cleancache
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 268) invalidated the pages when the inode kva was freed, much of the value
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 269) of cleancache would be lost because the cache of pages in cleancache
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 270) is potentially much larger than the kernel pagecache and is most
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 271) useful if the pages survive inode cache removal.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 272) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 273) * Why is a global variable required?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 274) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 275) The cleancache_enabled flag is checked in all of the frequently-used
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 276) cleancache hooks.  The alternative is a function call to check a static
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 277) variable. Since cleancache is enabled dynamically at runtime, systems
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 278) that don't enable cleancache would suffer thousands (possibly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 279) tens-of-thousands) of unnecessary function calls per second.  So the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 280) global variable allows cleancache to be enabled by default at compile
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 281) time, but have insignificant performance impact when cleancache remains
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 282) disabled at runtime.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 283) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 284) * Does cleancache work with KVM?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 285) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 286) The memory model of KVM is sufficiently different that a cleancache
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 287) backend may have less value for KVM.  This remains to be tested,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 288) especially in an overcommitted system.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 289) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 290) * Does cleancache work in userspace?  It sounds useful for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 291)   memory hungry caches like web browsers.  (Jamie Lokier)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 292) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 293) No plans yet, though we agree it sounds useful, at least for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 294) apps that bypass the page cache (e.g. O_DIRECT).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 295) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 296) Last updated: Dan Magenheimer, April 13 2011