=============================
Guidance for writing policies
=============================

Keep transactionality out of the policy wherever possible. The core is
careful never to ask the policy about anything that is migrating. This
is a burden on the core, but makes the policies easier to write.

Mappings are loaded into the policy at construction time.

Every bio that is mapped by the target is referred to the policy.
The policy can return a simple HIT or MISS, or issue a migration.

Currently there is no way for the policy to issue background work,
e.g. to start writing back dirty blocks that are going to be evicted
soon.

Because we map bios rather than requests, it is easy for the policy
to be fooled by many small bios. For this reason the core target
issues periodic ticks to the policy. It is suggested that the policy
does not update state (e.g. hit counts) for a block more than once
per tick. The core generates ticks by watching bios complete, and so
trying to see when the I/O scheduler has let the I/Os run.
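
The tick guidance above can be illustrated with a small userspace model.
This is only a sketch under stated assumptions, not kernel code: the class
``TickLimitedHitCounter`` and its methods ``tick()`` and ``record_bio()``
are hypothetical names invented for this illustration and do not
correspond to the real policy interface.

```python
from collections import defaultdict


class TickLimitedHitCounter:
    """Counts at most one hit per block per tick (illustrative model only)."""

    def __init__(self):
        self.current_tick = 0
        self.hit_counts = defaultdict(int)
        # Tick during which each block's count was last updated.
        self.last_counted_tick = {}

    def tick(self):
        """Called when the (modelled) core decides a tick has elapsed."""
        self.current_tick += 1

    def record_bio(self, block):
        """Record a bio for `block`; return True if the count was updated."""
        if self.last_counted_tick.get(block) == self.current_tick:
            return False  # already counted during this tick
        self.last_counted_tick[block] = self.current_tick
        self.hit_counts[block] += 1
        return True


counter = TickLimitedHitCounter()
for _ in range(8):             # eight small bios hitting the same block
    counter.record_bio(42)
counter.tick()                 # the core signals that a tick has passed
counter.record_bio(42)
print(counter.hit_counts[42])  # prints 2
```

Eight small bios within one tick count as a single hit, so a burst of
small bios does not inflate the block's apparent popularity.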


Overview of supplied cache replacement policies
===============================================

multiqueue (mq)
---------------

This policy is now an alias for smq (see below).

The following tunables are accepted, but have no effect::

    'sequential_threshold <#nr_sequential_ios>'
    'random_threshold <#nr_random_ios>'
    'read_promote_adjustment <value>'
    'write_promote_adjustment <value>'
    'discard_promote_adjustment <value>'

Stochastic multiqueue (smq)
---------------------------

This policy is the default.

The stochastic multiqueue (smq) policy addresses some of the problems
with the multiqueue (mq) policy.

Compared with mq, the smq policy offers the promise of lower memory
usage, improved performance and increased adaptability in the face of
changing workloads. smq also has no cumbersome tuning knobs.

Users may switch from "mq" to "smq" simply by reloading a DM table that
uses the cache target with the new policy name. Doing so will cause all
of the mq policy's hints to be dropped. Also, performance of the cache
may degrade slightly until smq recalculates which of the origin device's
hotspots should be cached.

Memory usage
^^^^^^^^^^^^

The mq policy used a lot of memory: 88 bytes per cache block on a
64-bit machine.

smq uses 28-bit indexes rather than pointers to implement its data
structures. It avoids storing an explicit hit count for each block. It
has a 'hotspot' queue, rather than a pre-cache, which uses a quarter of
the entries (each hotspot block covers a larger area than a single
cache block).

All this means smq uses ~25 bytes per cache block. That is still a lot
of memory, but a substantial improvement nonetheless.
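
To put the per-block figures above in perspective, the following sketch
estimates total policy memory for a cache. The 1 TiB cache size and the
512-sector (256 KiB) block size are assumptions chosen for illustration,
the helper ``policy_memory_bytes`` is a hypothetical name, and the smq
figure is only the approximate value quoted above.

```python
MQ_BYTES_PER_BLOCK = 88    # figure quoted above for a 64-bit machine
SMQ_BYTES_PER_BLOCK = 25   # approximate figure quoted above
SECTOR_BYTES = 512         # device-mapper sizes are in 512-byte sectors


def policy_memory_bytes(cache_bytes, block_sectors, bytes_per_block):
    """Estimate policy memory for a cache with the given block size."""
    nr_cache_blocks = cache_bytes // (block_sectors * SECTOR_BYTES)
    return nr_cache_blocks * bytes_per_block


cache_bytes = 1 << 40      # hypothetical 1 TiB cache device
block_sectors = 512        # 256 KiB cache blocks
mq = policy_memory_bytes(cache_bytes, block_sectors, MQ_BYTES_PER_BLOCK)
smq = policy_memory_bytes(cache_bytes, block_sectors, SMQ_BYTES_PER_BLOCK)
print(mq // (1 << 20), smq // (1 << 20))  # prints 352 100
```

Under these assumptions mq would need roughly 352 MiB and smq roughly
100 MiB for the same cache.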

Level balancing
^^^^^^^^^^^^^^^

mq placed entries in different levels of the multiqueue structures
based on their hit count (~ln(hit count)). This meant the bottom
levels generally had the most entries, and the top ones had very
few. Having unbalanced levels like this reduced the efficacy of the
multiqueue.

smq does not maintain a hit count. Instead, it swaps each hit entry with
the least recently used entry from the level above. The overall
ordering is a side effect of this stochastic process. With this
scheme we can decide how many entries occupy each multiqueue level,
resulting in better promotion/demotion decisions.
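
The swap described above can be modelled with a toy structure. This is a
sketch of the idea only and not the kernel's implementation: the class
``LevelledQueue`` is a hypothetical name, and placing the demoted entry
at the most recently used end of the lower level is an assumption made
for this illustration.

```python
class LevelledQueue:
    """Toy multi-level queue; each level is a list ordered from LRU to MRU."""

    def __init__(self, levels):
        self.levels = [list(level) for level in levels]

    def level_of(self, entry):
        for index, level in enumerate(self.levels):
            if entry in level:
                return index
        raise KeyError(entry)

    def hit(self, entry):
        """Swap `entry` with the LRU entry of the level above, if any."""
        index = self.level_of(entry)
        current = self.levels[index]
        current.remove(entry)
        is_top = index == len(self.levels) - 1
        if is_top or not self.levels[index + 1]:
            current.append(entry)   # nothing to swap with: just becomes MRU
            return
        above = self.levels[index + 1]
        demoted = above.pop(0)      # least recently used entry in level above
        above.append(entry)         # promoted entry becomes MRU in that level
        current.append(demoted)     # assumption: demoted entry lands at MRU end


q = LevelledQueue([["a", "b"], ["c", "d"], ["e", "f"]])
q.hit("a")
print(q.levels)                            # prints [['b', 'c'], ['d', 'a'], ['e', 'f']]
print([len(level) for level in q.levels])  # prints [2, 2, 2]
```

Because every promotion is paired with a demotion, the number of entries
in each level stays fixed, which is what allows the level sizes to be
chosen up front.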

Adaptability
^^^^^^^^^^^^

The mq policy maintained a hit count for each cache block. For a
different block to get promoted to the cache, its hit count had to
exceed the lowest currently in the cache. This meant it could take a
long time for the cache to adapt between varying IO patterns.

smq doesn't maintain hit counts, so much of this problem simply goes
away. In addition, it tracks the performance of the hotspot queue, which
is used to decide which blocks to promote. If the hotspot queue is
performing badly, smq starts moving entries more quickly between
levels. This lets it adapt to new IO patterns very quickly.

Performance
^^^^^^^^^^^

Testing smq shows substantially better performance than mq.

cleaner
-------

The cleaner writes back all dirty blocks in a cache to decommission it.

Examples
========

The syntax for a table is::

    cache <metadata dev> <cache dev> <origin dev> <block size>
    <#feature_args> [<feature arg>]*
    <policy> <#policy_args> [<policy arg>]*

The syntax to send a message using the dmsetup command is::

    dmsetup message <mapped device> 0 sequential_threshold 1024
    dmsetup message <mapped device> 0 random_threshold 8

Using dmsetup::

    dmsetup create blah --table "0 268435456 cache /dev/sdb /dev/sdc \
        /dev/sdd 512 0 mq 4 sequential_threshold 1024 random_threshold 8"

This creates a 128 GiB mapped device named 'blah' with the
sequential_threshold set to 1024 and the random_threshold set to 8.
Because mq is now an alias for smq, these two tunables are accepted but
have no effect.
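
As a cross-check of the table syntax and the device size above, the
following sketch assembles the same table line and converts its length
from sectors to GiB. The helper ``cache_table`` is a hypothetical name
written for this illustration; it only formats a string and does not
call dmsetup.

```python
SECTOR_BYTES = 512  # device-mapper table offsets and lengths are in sectors


def cache_table(start, length, metadata_dev, cache_dev, origin_dev,
                block_size, policy, feature_args=(), policy_args=()):
    """Build a cache target table line following the syntax shown above."""
    fields = [start, length, "cache", metadata_dev, cache_dev, origin_dev,
              block_size, len(feature_args), *feature_args,
              policy, len(policy_args), *policy_args]
    return " ".join(str(field) for field in fields)


table = cache_table(0, 268435456, "/dev/sdb", "/dev/sdc", "/dev/sdd", 512,
                    "mq", policy_args=("sequential_threshold", 1024,
                                       "random_threshold", 8))
size_gib = 268435456 * SECTOR_BYTES // 2**30
print(table)
print(size_gib)  # prints 128
```

Note that ``<#policy_args>`` counts individual words, so two key/value
pairs give a count of 4, as in the dmsetup example.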