.. _frontswap:

=========
Frontswap
=========

Frontswap provides a "transcendent memory" interface for swap pages.
In some environments, dramatic performance savings may be obtained because
swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk.

(Note, frontswap -- and :ref:`cleancache` (merged at 3.0) -- are the "frontends"
and the only necessary changes to the core kernel for transcendent memory;
all other supporting code -- the "backends" -- is implemented as drivers.
See the LWN.net article `Transcendent memory in a nutshell`_
for a detailed overview of frontswap and related kernel parts)

.. _Transcendent memory in a nutshell: https://lwn.net/Articles/454795/

Frontswap is so named because it can be thought of as the opposite of
a "backing" store for a swap device. The storage is assumed to be
a synchronous concurrency-safe page-oriented "pseudo-RAM device" conforming
to the requirements of transcendent memory (such as Xen's "tmem", or
in-kernel compressed memory, aka "zcache", or future RAM-like devices);
this pseudo-RAM device is not directly accessible or addressable by the
kernel and is of unknown and possibly time-varying size. The driver
links itself to frontswap by calling frontswap_register_ops to set the
frontswap_ops functions appropriately, and the functions it provides must
conform to certain policies as follows:

An "init" prepares the device to receive frontswap pages associated
with the specified swap device number (aka "type"). A "store" will
copy the page to transcendent memory and associate it with the type and
offset associated with the page. A "load" will copy the page, if found,
from transcendent memory into kernel memory, but will NOT remove the page
from transcendent memory. An "invalidate_page" will remove the page
from transcendent memory and an "invalidate_area" will remove ALL pages
associated with the swap type (e.g., like swapoff) and notify the "device"
to refuse further stores with that swap type.

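A do-nothing backend skeleton might look roughly as follows. This is
only a sketch: the exact prototypes in struct frontswap_ops have varied
across kernel versions, and all of the "toy" names are invented here
for illustration::

    #include <linux/frontswap.h>
    #include <linux/module.h>

    static void toy_init(unsigned type)
    {
            /* set up per-swap-device ("type") data structures */
    }

    static int toy_store(unsigned type, pgoff_t offset, struct page *page)
    {
            /* copy the page into the pseudo-RAM device; 0 on success */
            return -1;      /* this skeleton rejects every page */
    }

    static int toy_load(unsigned type, pgoff_t offset, struct page *page)
    {
            /* fill the page from the pseudo-RAM device, if present */
            return -1;      /* never found, since nothing is stored */
    }

    static void toy_invalidate_page(unsigned type, pgoff_t offset)
    {
            /* forget the page at (type, offset) */
    }

    static void toy_invalidate_area(unsigned type)
    {
            /* forget all pages of this type, e.g. at swapoff time */
    }

    static struct frontswap_ops toy_ops = {
            .init = toy_init,
            .store = toy_store,
            .load = toy_load,
            .invalidate_page = toy_invalidate_page,
            .invalidate_area = toy_invalidate_area,
    };

    static int __init toy_backend_init(void)
    {
            frontswap_register_ops(&toy_ops);
            return 0;
    }
    module_init(toy_backend_init);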

Once a page is successfully stored, a matching load on the page will normally
succeed. So when the kernel finds itself in a situation where it needs
to swap out a page, it first attempts to use frontswap. If the store returns
success, the data has been saved to transcendent memory and both a disk
write and, if the data is later read back, a disk read are avoided.
If a store returns failure, transcendent memory has rejected the data, and the
page can be written to swap as usual.

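In kernel terms, the swap-out hook looks roughly like this (a
simplified sketch of swap_writepage(); the real function also handles
freeing the swap slot and the error paths)::

    int swap_writepage(struct page *page, struct writeback_control *wbc)
    {
            if (frontswap_store(page) == 0) {
                    /* saved in transcendent memory: no disk I/O needed */
                    set_page_writeback(page);
                    unlock_page(page);
                    end_page_writeback(page);
                    return 0;
            }
            /* rejected: write to the real swap device as usual */
            return __swap_writepage(page, wbc, end_swap_bio_write);
    }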

If a backend chooses, frontswap can be configured as a "writethrough
cache" by calling frontswap_writethrough(). In this mode, the reduction
in swap device writes is lost (along with a non-trivial performance
advantage) in order to allow the backend to arbitrarily "reclaim" space
used to store frontswap pages and thus more completely manage its
memory usage.

Note that if a page is stored and the page already exists in transcendent memory
(a "duplicate" store), either the store succeeds and the data is overwritten,
or the store fails AND the page is invalidated. This ensures stale data may
never be obtained from frontswap.

If properly configured, monitoring of frontswap is done via debugfs in
the ``/sys/kernel/debug/frontswap`` directory. The effectiveness of
frontswap can be measured (across all swap devices) with:

``failed_stores``
        how many store attempts have failed

``loads``
        how many loads were attempted (all should succeed)

``succ_stores``
        how many store attempts have succeeded

``invalidates``
        how many invalidates were attempted

A backend implementation may provide additional metrics.

FAQ
===

* Where's the value?

When a workload starts swapping, performance falls through the floor.
Frontswap significantly increases performance in many such workloads by
providing a clean, dynamic interface to read and write swap pages to
"transcendent memory" that is otherwise not directly addressable by the kernel.
This interface is ideal when data is transformed to a different form
and size (such as with compression) or secretly moved (as might be
useful for write-balancing for some RAM-like devices). Swap pages (and
evicted page-cache pages) are a great use for this kind of slower-than-RAM-
but-much-faster-than-disk "pseudo-RAM device" and the frontswap (and
cleancache) interface to transcendent memory provides a nice way to read
and write -- and indirectly "name" -- the pages.

Frontswap -- and cleancache -- provide, with a fairly small impact on
the kernel, a great deal of flexibility for more dynamic RAM utilization
in various system configurations:

In the single kernel case, aka "zcache", pages are compressed and
stored in local memory, thus increasing the total anonymous pages
that can be safely kept in RAM. Zcache essentially trades off CPU
cycles used in compression/decompression for better memory utilization.
Benchmarks have shown little or no impact when memory pressure is
low while providing a significant performance improvement (25%+)
on some workloads under high memory pressure.

"RAMster" builds on zcache by adding "peer-to-peer" transcendent memory
support for clustered systems. Frontswap pages are locally compressed
as in zcache, but then "remotified" to another system's RAM. This
allows RAM to be dynamically load-balanced back-and-forth as needed,
i.e. when system A is overcommitted, it can swap to system B, and
vice versa. RAMster can also be configured as a memory server so
many servers in a cluster can swap, dynamically as needed, to a single
server configured with a large amount of RAM... without pre-configuring
how much of the RAM is available for each of the clients!

In the virtual case, the whole point of virtualization is to statistically
multiplex physical resources across the varying demands of multiple
virtual machines. This is really hard to do with RAM and efforts to do
it well with no kernel changes have essentially failed (except in some
well-publicized special-case workloads).
Specifically, the Xen Transcendent Memory backend allows otherwise
"fallow" hypervisor-owned RAM to not only be "time-shared" between multiple
virtual machines, but the pages can be compressed and deduplicated to
optimize RAM utilization. And when guest OSes are induced to surrender
underutilized RAM (e.g. with "selfballooning"), sudden unexpected
memory pressure may result in swapping; frontswap allows those pages
to be swapped to and from hypervisor RAM (if overall host system memory
conditions allow), thus mitigating the potentially awful performance impact
of unplanned swapping.

A KVM implementation is underway and has been RFC'ed to lkml. And,
using frontswap, investigation is also underway on the use of NVM as
a memory extension technology.

* Sure there may be performance advantages in some situations, but
  what's the space/time overhead of frontswap?

If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into
nothingness and the only overhead is a few extra bytes per swapon'ed
swap device. If CONFIG_FRONTSWAP is enabled but no frontswap "backend"
registers, the only cost is one comparison of a global variable against
zero for every swap page read or written. If CONFIG_FRONTSWAP is enabled
AND a frontswap backend registers AND the backend fails every "store"
request (i.e. provides no memory despite claiming it might),
CPU overhead is still negligible -- and since every frontswap fail
precedes a swap page write-to-disk, the system is highly likely
to be I/O bound and using a small fraction of a percent of a CPU
will be irrelevant anyway.

As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend
registers, one bit is allocated for every swap page for every swap
device that is swapon'd. This is added to the EIGHT bits (which
was sixteen until about 2.6.34) that the kernel already allocates
for every swap page for every swap device that is swapon'd. (Hugh
Dickins has observed that frontswap could probably steal one of
the existing eight bits, but let's worry about that minor optimization
later.) For very large swap disks (which are rare) on a standard
4K pagesize, this is 1MB per 32GB of swap: 32GB of swap is 8M pages,
and 8M bits is 1MB.

When swap pages are stored in transcendent memory instead of written
out to disk, there is a side effect that this may create more memory
pressure that can potentially outweigh the other advantages. A
backend, such as zcache, must implement policies to carefully (but
dynamically) manage memory limits to ensure this doesn't happen.

* OK, how about a quick overview of what this frontswap patch does
  in terms that a kernel hacker can grok?

Let's assume that a frontswap "backend" has registered during
kernel initialization; this registration indicates that this
frontswap backend has access to some "memory" that is not directly
accessible by the kernel. Exactly how much memory it provides is
entirely dynamic and, from the kernel's point of view, unpredictable.

Whenever a swap device is swapon'd, frontswap_init() is called,
passing the swap device number (aka "type") as a parameter.
This notifies frontswap to expect attempts to "store" swap pages
associated with that number.

Whenever the swap subsystem is readying a page to write to a swap
device (cf. swap_writepage()), frontswap_store() is called. Frontswap
consults with the frontswap backend and if the backend says it does NOT
have room, frontswap_store() returns -1 and the kernel swaps the page
to the swap device as normal. Note that the response from the frontswap
backend is unpredictable to the kernel; it may choose to never accept a
page, it could accept every ninth page, or it might accept every
page. But if the backend does accept a page, the data from the page
has already been copied and associated with the type and offset,
and the backend guarantees the persistence of the data. In this case,
frontswap sets a bit in the "frontswap_map" for the swap device
corresponding to the page offset on the swap device to which it would
otherwise have written the data.

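Condensed into code, the store path just described looks roughly like
this (a simplified sketch: statistics counters, locking, and several
error checks are omitted, and the way frontswap_ops is referenced has
changed across kernel versions)::

    int __frontswap_store(struct page *page)
    {
            swp_entry_t entry = { .val = page_private(page) };
            int type = swp_type(entry);
            pgoff_t offset = swp_offset(entry);
            struct swap_info_struct *sis = swap_info[type];
            bool dup = test_bit(offset, sis->frontswap_map);
            int ret;

            ret = frontswap_ops->store(type, offset, page);
            if (ret == 0) {
                    /* remember that this offset now lives in frontswap */
                    set_bit(offset, sis->frontswap_map);
            } else if (dup) {
                    /* a rejected overwrite must not leave stale data */
                    clear_bit(offset, sis->frontswap_map);
                    frontswap_ops->invalidate_page(type, offset);
            }
            return ret;
    }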

When the swap subsystem needs to swap-in a page (swap_readpage()),
it first calls frontswap_load() which checks the frontswap_map to
see if the page was earlier accepted by the frontswap backend. If
it was, the page of data is filled from the frontswap backend and
the swap-in is complete. If not, the normal swap-in code is
executed to obtain the page of data from the real swap device.

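The corresponding load path is even simpler (the same sketch caveats
apply)::

    int __frontswap_load(struct page *page)
    {
            swp_entry_t entry = { .val = page_private(page) };
            int type = swp_type(entry);
            pgoff_t offset = swp_offset(entry);
            struct swap_info_struct *sis = swap_info[type];

            if (!test_bit(offset, sis->frontswap_map))
                    return -1;      /* not in frontswap: read the swap device */

            /* fill the page from the backend; the copy stays in frontswap */
            return frontswap_ops->load(type, offset, page);
    }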

So every time the frontswap backend accepts a page, a swap device read
and (potentially) a swap device write are replaced by a "frontswap backend
store" and (possibly) a "frontswap backend load", which are presumably much
faster.

* Can't frontswap be configured as a "special" swap device that is
  just higher priority than any real swap device (e.g. like zswap,
  or maybe swap-over-nbd/NFS)?

No. First, the existing swap subsystem doesn't allow for any kind of
swap hierarchy. Perhaps it could be rewritten to accommodate a hierarchy,
but this would require fairly drastic changes. Even if it were
rewritten, the existing swap subsystem uses the block I/O layer, which
assumes a swap device is fixed size and that any page in it is linearly
addressable. Frontswap barely touches the existing swap subsystem,
and works around the constraints of the block I/O subsystem to provide
a great deal of flexibility and dynamicity.

For example, the acceptance of any swap page by the frontswap backend is
entirely unpredictable. This is critical to the definition of frontswap
backends because it grants completely dynamic discretion to the
backend. In zcache, one cannot know a priori how compressible a page is.
"Poorly" compressible pages can be rejected, and "poorly" can itself be
defined dynamically depending on current memory constraints.

Further, frontswap is entirely synchronous whereas a real swap
device is, by definition, asynchronous and uses block I/O. The
block I/O layer is not only unnecessary, but may perform "optimizations"
that are inappropriate for a RAM-oriented device including delaying
the write of some pages for a significant amount of time. Synchrony is
required to ensure the dynamicity of the backend and to avoid thorny race
conditions that would unnecessarily and greatly complicate frontswap
and/or the block I/O subsystem. That said, only the initial "store"
and "load" operations need be synchronous. A separate asynchronous thread
is free to manipulate the pages stored by frontswap. For example,
the "remotification" thread in RAMster uses standard asynchronous
kernel sockets to move compressed frontswap pages to a remote machine.
Similarly, a KVM guest-side implementation could do in-guest compression
and use "batched" hypercalls.

In a virtualized environment, the dynamicity allows the hypervisor
(or host OS) to do "intelligent overcommit". For example, it can
choose to accept pages only until host-swapping might be imminent,
then force guests to do their own swapping.

There is a downside to the transcendent memory specifications for
frontswap: Since any "store" might fail, there must always be a real
slot on a real swap device to swap the page. Thus frontswap must be
implemented as a "shadow" to every swapon'd device with the potential
capability of holding every page that the swap device might have held
and the possibility that it might hold no pages at all. This means
that frontswap cannot contain more pages than the total of swapon'd
swap devices. For example, if NO swap device is configured on some
installation, frontswap is useless. Swapless portable devices
can still use frontswap but a backend for such devices must configure
some kind of "ghost" swap device and ensure that it is never used.

* Why this weird definition about "duplicate stores"? If a page
  has been previously successfully stored, can't it always be
  successfully overwritten?

Nearly always it can, but no, sometimes it cannot. Consider an example
where data is compressed and the original 4K page has been compressed
to 1K. Now an attempt is made to overwrite the page with data that
is non-compressible and so would take the entire 4K. But the backend
has no more space. In this case, the store must be rejected. Whenever
frontswap rejects a store that would overwrite, it also must invalidate
the old data and ensure that it is no longer accessible. Since the
swap subsystem then writes the new data to the real swap device,
this is the correct course of action to ensure coherency.

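Seen from the backend's side, the obligation looks like this (a
hypothetical compressing backend; toy_compress(), toy_space_left(),
toy_replace() and toy_invalidate_page() are invented for
illustration)::

    static int toy_store(unsigned type, pgoff_t offset, struct page *page)
    {
            size_t zlen;
            void *zdata = toy_compress(page, &zlen);

            if (!zdata || zlen > toy_space_left()) {
                    /*
                     * Rejecting what may be an overwrite: the old copy
                     * (if any) must be dropped too, or a later load
                     * could return stale data.
                     */
                    toy_invalidate_page(type, offset);
                    return -1;
            }
            toy_replace(type, offset, zdata, zlen);
            return 0;
    }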

* What is frontswap_shrink for?

When the (non-frontswap) swap subsystem swaps out a page to a real
swap device, that page is only taking up low-value pre-allocated disk
space. But if frontswap has placed a page in transcendent memory, that
page may be taking up valuable real estate. The frontswap_shrink
routine allows code outside of the swap subsystem to force pages out
of the memory managed by frontswap and back into kernel-addressable memory.
For example, in RAMster, a "suction driver" thread will attempt
to "repatriate" pages sent to a remote machine back to the local machine;
this is driven using the frontswap_shrink mechanism when memory pressure
subsides.

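A caller might use it roughly like this (a hedged sketch:
frontswap_curr_pages() and frontswap_shrink() are the kernel helpers
described above, but pressure_has_subsided() is an invented placeholder
for whatever policy the driver actually uses)::

    static void toy_repatriate(void)
    {
            unsigned long cur = frontswap_curr_pages();

            /*
             * Ask frontswap to push half of its pages back toward the
             * kernel; they return via the normal swap paths.
             */
            if (pressure_has_subsided() && cur)
                    frontswap_shrink(cur / 2);
    }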

* Why does the frontswap patch create the new include file swapfile.h?

The frontswap code depends on some swap-subsystem-internal data
structures that have, over the years, moved back and forth between
static and global. This seemed a reasonable compromise: Define
them as global but declare them in a new include file that isn't
included by the large number of source files that include swap.h.

Dan Magenheimer, last updated April 9, 2012