Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

.. _frontswap:

=========
Frontswap
=========

Frontswap provides a "transcendent memory" interface for swap pages.
In some environments, dramatic performance savings may be obtained because
swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk.

(Note, frontswap -- and :ref:`cleancache` (merged at 3.0) -- are the "frontends"
and the only necessary changes to the core kernel for transcendent memory;
all other supporting code -- the "backends" -- is implemented as drivers.
See the LWN.net article `Transcendent memory in a nutshell`_
for a detailed overview of frontswap and related kernel parts.)

.. _Transcendent memory in a nutshell: https://lwn.net/Articles/454795/

Frontswap is so named because it can be thought of as the opposite of
a "backing" store for a swap device.  The storage is assumed to be
a synchronous concurrency-safe page-oriented "pseudo-RAM device" conforming
to the requirements of transcendent memory (such as Xen's "tmem", or
in-kernel compressed memory, aka "zcache", or future RAM-like devices);
this pseudo-RAM device is not directly accessible or addressable by the
kernel and is of unknown and possibly time-varying size.  The driver
links itself to frontswap by calling frontswap_register_ops to set the
frontswap_ops functions appropriately and the functions it provides must
conform to certain policies as follows:

An "init" prepares the device to receive frontswap pages associated
with the specified swap device number (aka "type").  A "store" will
copy the page to transcendent memory and associate it with the type and
offset associated with the page.  A "load" will copy the page, if found,
from transcendent memory into kernel memory, but will NOT remove the page
from transcendent memory.  An "invalidate_page" will remove the page
from transcendent memory and an "invalidate_area" will remove ALL pages
associated with the swap type (e.g., like swapoff) and notify the "device"
to refuse further stores with that swap type.
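The four operations above can be illustrated with an ordinary user-space toy.  The sketch below is NOT kernel code -- the names, the fixed-size table, and the linear search are all illustrative assumptions -- but it follows the stated policies: a "store" copies the page in, a "load" copies it out without removing it, and an "invalidate_area" drops every page of a given swap type:

```c
#include <assert.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096
#define TOY_MAX_SLOTS 64   /* toy capacity; a real backend's size is dynamic */

/* One stored frontswap page: (type, offset) -> a copy of the page data. */
struct toy_slot {
	int used;
	unsigned type;            /* swap device number (aka "type") */
	unsigned long offset;     /* page offset within that swap device */
	char data[TOY_PAGE_SIZE];
};

static struct toy_slot slots[TOY_MAX_SLOTS];

static struct toy_slot *toy_find(unsigned type, unsigned long offset)
{
	for (int i = 0; i < TOY_MAX_SLOTS; i++)
		if (slots[i].used && slots[i].type == type &&
		    slots[i].offset == offset)
			return &slots[i];
	return NULL;
}

/* "store": copy the page in; 0 on success, -1 when the backend refuses. */
int toy_store(unsigned type, unsigned long offset, const char *page)
{
	struct toy_slot *s = toy_find(type, offset);

	for (int i = 0; !s && i < TOY_MAX_SLOTS; i++)
		if (!slots[i].used)
			s = &slots[i];
	if (!s)
		return -1;          /* a backend may refuse at any time */
	s->used = 1;
	s->type = type;
	s->offset = offset;
	memcpy(s->data, page, TOY_PAGE_SIZE);
	return 0;
}

/* "load": copy the page out; the stored copy is NOT removed. */
int toy_load(unsigned type, unsigned long offset, char *page)
{
	struct toy_slot *s = toy_find(type, offset);

	if (!s)
		return -1;
	memcpy(page, s->data, TOY_PAGE_SIZE);
	return 0;
}

/* "invalidate_page": drop a single page. */
void toy_invalidate_page(unsigned type, unsigned long offset)
{
	struct toy_slot *s = toy_find(type, offset);

	if (s)
		s->used = 0;
}

/* "invalidate_area": drop ALL pages of one swap type (cf. swapoff). */
void toy_invalidate_area(unsigned type)
{
	for (int i = 0; i < TOY_MAX_SLOTS; i++)
		if (slots[i].type == type)
			slots[i].used = 0;
}
```

The real frontswap_ops functions in this kernel take a `struct page *` rather than a raw buffer, but the contract -- copy on store, non-destructive load, per-page and per-type invalidation -- is the same.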

Once a page is successfully stored, a matching load on the page will normally
succeed.  So when the kernel finds itself in a situation where it needs
to swap out a page, it first attempts to use frontswap.  If the store returns
success, the data has been successfully saved to transcendent memory and
a disk write and, if the data is later read back, a disk read are avoided.
If a store returns failure, transcendent memory has rejected the data, and the
page can be written to swap as usual.

If a backend chooses, frontswap can be configured as a "writethrough
cache" by calling frontswap_writethrough().  In this mode, the reduction
in swap device writes is lost (and also a non-trivial performance advantage)
in order to allow the backend to arbitrarily "reclaim" space used to
store frontswap pages to more completely manage its memory usage.

Note that if a page is stored and the page already exists in transcendent memory
(a "duplicate" store), either the store succeeds and the data is overwritten,
or the store fails AND the page is invalidated.  This ensures stale data may
never be obtained from frontswap.

If properly configured, monitoring of frontswap is done via debugfs in
the `/sys/kernel/debug/frontswap` directory.  The effectiveness of
frontswap can be measured (across all swap devices) with:

``failed_stores``
	how many store attempts have failed

``loads``
	how many loads were attempted (all should succeed)

``succ_stores``
	how many store attempts have succeeded

``invalidates``
	how many invalidates were attempted

A backend implementation may provide additional metrics.

FAQ
===

* Where's the value?

When a workload starts swapping, performance falls through the floor.
Frontswap significantly increases performance in many such workloads by
providing a clean, dynamic interface to read and write swap pages to
"transcendent memory" that is otherwise not directly addressable by the kernel.
This interface is ideal when data is transformed to a different form
and size (such as with compression) or secretly moved (as might be
useful for write-balancing for some RAM-like devices).  Swap pages (and
evicted page-cache pages) are a great use for this kind of slower-than-RAM-
but-much-faster-than-disk "pseudo-RAM device" and the frontswap (and
cleancache) interface to transcendent memory provides a nice way to read
and write -- and indirectly "name" -- the pages.

Frontswap and cleancache, with a fairly small impact on the kernel,
provide a huge amount of flexibility for more dynamic, flexible RAM
utilization in various system configurations:

In the single kernel case, aka "zcache", pages are compressed and
stored in local memory, thus increasing the total anonymous pages
that can be safely kept in RAM.  Zcache essentially trades off CPU
cycles used in compression/decompression for better memory utilization.
Benchmarks have shown little or no impact when memory pressure is
low while providing a significant performance improvement (25%+)
on some workloads under high memory pressure.

"RAMster" builds on zcache by adding "peer-to-peer" transcendent memory
support for clustered systems.  Frontswap pages are locally compressed
as in zcache, but then "remotified" to another system's RAM.  This
allows RAM to be dynamically load-balanced back-and-forth as needed,
i.e. when system A is overcommitted, it can swap to system B, and
vice versa.  RAMster can also be configured as a memory server so
many servers in a cluster can swap, dynamically as needed, to a single
server configured with a large amount of RAM... without pre-configuring
how much of the RAM is available for each of the clients!

In the virtual case, the whole point of virtualization is to statistically
multiplex physical resources across the varying demands of multiple
virtual machines.  This is really hard to do with RAM and efforts to do
it well with no kernel changes have essentially failed (except in some
well-publicized special-case workloads).
Specifically, the Xen Transcendent Memory backend allows otherwise
"fallow" hypervisor-owned RAM to not only be "time-shared" between multiple
virtual machines, but the pages can be compressed and deduplicated to
optimize RAM utilization.  And when guest OSes are induced to surrender
underutilized RAM (e.g. with "selfballooning"), sudden unexpected
memory pressure may result in swapping; frontswap allows those pages
to be swapped to and from hypervisor RAM (if overall host system memory
conditions allow), thus mitigating the potentially awful performance impact
of unplanned swapping.

A KVM implementation is underway and has been RFC'ed to lkml.  And,
using frontswap, investigation is also underway on the use of NVM as
a memory extension technology.

* Sure there may be performance advantages in some situations, but
  what's the space/time overhead of frontswap?

If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into
nothingness and the only overhead is a few extra bytes per swapon'ed
swap device.  If CONFIG_FRONTSWAP is enabled but no frontswap "backend"
registers, there is one extra comparison of a global variable against
zero for every swap page read or written.  If CONFIG_FRONTSWAP is enabled
AND a frontswap backend registers AND the backend fails every "store"
request (i.e. provides no memory despite claiming it might),
CPU overhead is still negligible -- and since every frontswap fail
precedes a swap page write-to-disk, the system is highly likely
to be I/O bound and using a small fraction of a percent of a CPU
will be irrelevant anyway.

As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend
registers, one bit is allocated for every swap page for every swap
device that is swapon'd.  This is added to the EIGHT bits (which
was sixteen until about 2.6.34) that the kernel already allocates
for every swap page for every swap device that is swapon'd.  (Hugh
Dickins has observed that frontswap could probably steal one of
the existing eight bits, but let's worry about that minor optimization
later.)  For very large swap disks (which are rare) on a standard
4K pagesize, this is 1MB per 32GB swap.
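The arithmetic behind that last figure is easy to verify: one bit per 4K page means (swap bytes / page size) / 8 bytes of map, so 32GB of swap yields 8M pages and a 1MiB map.  A minimal sketch (the function name is illustrative, not a kernel symbol):

```c
#include <assert.h>

/* Bytes needed for a one-bit-per-page frontswap map. */
static unsigned long long frontswap_map_bytes(unsigned long long swap_bytes,
					      unsigned long long page_size)
{
	unsigned long long pages = swap_bytes / page_size;

	return pages / 8;	/* one bit per page, eight bits per byte */
}
```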

When swap pages are stored in transcendent memory instead of written
out to disk, there is a side effect that this may create more memory
pressure that can potentially outweigh the other advantages.  A
backend, such as zcache, must implement policies to carefully (but
dynamically) manage memory limits to ensure this doesn't happen.

* OK, how about a quick overview of what this frontswap patch does
  in terms that a kernel hacker can grok?

Let's assume that a frontswap "backend" has registered during
kernel initialization; this registration indicates that this
frontswap backend has access to some "memory" that is not directly
accessible by the kernel.  Exactly how much memory it provides is
entirely dynamic and random.

Whenever a swap device is swapon'd, frontswap_init() is called,
passing the swap device number (aka "type") as a parameter.
This notifies frontswap to expect attempts to "store" swap pages
associated with that number.

Whenever the swap subsystem is readying a page to write to a swap
device (cf. swap_writepage()), frontswap_store() is called.  Frontswap
consults with the frontswap backend and if the backend says it does NOT
have room, frontswap_store() returns -1 and the kernel swaps the page
to the swap device as normal.  Note that the response from the frontswap
backend is unpredictable to the kernel; it may choose to never accept a
page, it could accept every ninth page, or it might accept every
page.  But if the backend does accept a page, the data from the page
has already been copied and associated with the type and offset,
and the backend guarantees the persistence of the data.  In this case,
frontswap sets a bit in the "frontswap_map" for the swap device
corresponding to the page offset on the swap device to which it would
otherwise have written the data.

When the swap subsystem needs to swap-in a page (swap_readpage()),
it first calls frontswap_load() which checks the frontswap_map to
see if the page was earlier accepted by the frontswap backend.  If
it was, the page of data is filled from the frontswap backend and
the swap-in is complete.  If not, the normal swap-in code is
executed to obtain the page of data from the real swap device.
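The bookkeeping in that walkthrough can be sketched as a user-space toy (the names and the fixed map size are illustrative, not the kernel's): a per-device bitmap records which offsets the backend accepted, so the swap-in path knows whether to ask the backend or read the real device.  The stand-in backend here rejects every third offset, purely to mimic its unpredictability:

```c
#include <assert.h>

#define MAP_PAGES 1024		/* toy swap-device size, in pages */

/* Per-swap-device bitmap: bit set => the backend holds that offset. */
static unsigned char frontswap_map[MAP_PAGES / 8];

static void map_set(unsigned long off)
{
	frontswap_map[off / 8] |= 1u << (off % 8);
}

static int map_test(unsigned long off)
{
	return (frontswap_map[off / 8] >> (off % 8)) & 1;
}

/* Stand-in for the registered backend's "store"; arbitrarily
 * rejects every third offset (0 = accepted, -1 = refused). */
static int backend_store(unsigned long off)
{
	return (off % 3 == 0) ? -1 : 0;
}

/* Swap-out path: try the backend first, fall back to the disk write.
 * Returns 1 if the page went to the backend, 0 if it went to disk. */
int swap_out(unsigned long off)
{
	if (backend_store(off) == 0) {
		map_set(off);	/* remember the backend holds this page */
		return 1;
	}
	return 0;		/* kernel writes the page to disk as usual */
}

/* Swap-in path: consult the map before touching the disk.
 * Returns 1 if served from the backend, 0 if read from disk. */
int swap_in(unsigned long off)
{
	return map_test(off) ? 1 : 0;
}
```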

So every time the frontswap backend accepts a page, a swap device read
and (potentially) a swap device write are replaced by a "frontswap backend
store" and (possibly) a "frontswap backend load", which are presumably much
faster.

* Can't frontswap be configured as a "special" swap device that is
  just higher priority than any real swap device (e.g. like zswap,
  or maybe swap-over-nbd/NFS)?

No.  First, the existing swap subsystem doesn't allow for any kind of
swap hierarchy.  Perhaps it could be rewritten to accommodate a hierarchy,
but this would require fairly drastic changes.  Even if it were
rewritten, the existing swap subsystem uses the block I/O layer which
assumes a swap device is fixed size and any page in it is linearly
addressable.  Frontswap barely touches the existing swap subsystem,
and works around the constraints of the block I/O subsystem to provide
a great deal of flexibility and dynamicity.

For example, the acceptance of any swap page by the frontswap backend is
entirely unpredictable.  This is critical to the definition of frontswap
backends because it grants completely dynamic discretion to the
backend.  In zcache, one cannot know a priori how compressible a page is.
"Poorly" compressible pages can be rejected, and "poorly" can itself be
defined dynamically depending on current memory constraints.

Further, frontswap is entirely synchronous whereas a real swap
device is, by definition, asynchronous and uses block I/O.  The
block I/O layer is not only unnecessary, but may perform "optimizations"
that are inappropriate for a RAM-oriented device including delaying
the write of some pages for a significant amount of time.  Synchrony is
required to ensure the dynamicity of the backend and to avoid thorny race
conditions that would unnecessarily and greatly complicate frontswap
and/or the block I/O subsystem.  That said, only the initial "store"
and "load" operations need be synchronous.  A separate asynchronous thread
is free to manipulate the pages stored by frontswap.  For example,
the "remotification" thread in RAMster uses standard asynchronous
kernel sockets to move compressed frontswap pages to a remote machine.
Similarly, a KVM guest-side implementation could do in-guest compression
and use "batched" hypercalls.

In a virtualized environment, the dynamicity allows the hypervisor
(or host OS) to do "intelligent overcommit".  For example, it can
choose to accept pages only until host-swapping might be imminent,
then force guests to do their own swapping.

There is a downside to the transcendent memory specifications for
frontswap:  Since any "store" might fail, there must always be a real
slot on a real swap device to swap the page.  Thus frontswap must be
implemented as a "shadow" to every swapon'd device with the potential
capability of holding every page that the swap device might have held
and the possibility that it might hold no pages at all.  This means
that frontswap cannot contain more pages than the total of swapon'd
swap devices.  For example, if NO swap device is configured on some
installation, frontswap is useless.  Swapless portable devices
can still use frontswap but a backend for such devices must configure
some kind of "ghost" swap device and ensure that it is never used.

* Why this weird definition about "duplicate stores"?  If a page
  has been previously successfully stored, can't it always be
  successfully overwritten?

Nearly always it can, but no, sometimes it cannot.  Consider an example
where data is compressed and the original 4K page has been compressed
to 1K.  Now an attempt is made to overwrite the page with data that
is non-compressible and so would take the entire 4K.  But the backend
has no more space.  In this case, the store must be rejected.  Whenever
frontswap rejects a store that would overwrite, it also must invalidate
the old data and ensure that it is no longer accessible.  Since the
swap subsystem then writes the new data to the real swap device,
this is the correct course of action to ensure coherency.
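The fail-plus-invalidate rule can be demonstrated with a tiny user-space model (the names and the single-slot byte budget are illustrative assumptions): each stored page costs its compressed size, and a duplicate store that no longer fits is rejected only after the stale copy has been dropped, so a later load can never return old data:

```c
#include <assert.h>

#define BUDGET 2048		/* toy backend capacity, in bytes */

static unsigned long stored_off;	/* single-slot toy backend */
static unsigned long stored_size;	/* 0 means the slot is empty */

/* Store `size` compressed bytes for offset `off`.  On a duplicate
 * store that no longer fits, the old copy MUST be invalidated
 * before the failure is reported. */
int dup_store(unsigned long off, unsigned long size)
{
	if (stored_size && stored_off == off)
		stored_size = 0;	/* invalidate the old copy up front */
	if (size > BUDGET)
		return -1;		/* rejected; stale data already gone */
	stored_off = off;
	stored_size = size;
	return 0;
}

/* Load succeeds only if a current copy exists for `off`. */
int dup_load(unsigned long off)
{
	return (stored_size && stored_off == off) ? 0 : -1;
}
```

Had the rejected store left the old 1K copy in place, a subsequent load would have returned stale data even though the swap subsystem had meanwhile written newer data to the real device.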

* What is frontswap_shrink for?

When the (non-frontswap) swap subsystem swaps out a page to a real
swap device, that page is only taking up low-value pre-allocated disk
space.  But if frontswap has placed a page in transcendent memory, that
page may be taking up valuable real estate.  The frontswap_shrink
routine allows code outside of the swap subsystem to force pages out
of the memory managed by frontswap and back into kernel-addressable memory.
For example, in RAMster, a "suction driver" thread will attempt
to "repatriate" pages sent to a remote machine back to the local machine;
this is driven using the frontswap_shrink mechanism when memory pressure
subsides.

* Why does the frontswap patch create the new include file swapfile.h?

The frontswap code depends on some swap-subsystem-internal data
structures that have, over the years, moved back and forth between
static and global.  This seemed a reasonable compromise:  Define
them as global but declare them in a new include file that isn't
included by the large number of source files that include swap.h.

Dan Magenheimer, last updated April 9, 2012