^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) .. _swap_numa:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) ===========================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4) Automatically bind swap device to NUMA node
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) ===========================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) If the system has more than one swap device and each swap device has node
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8) information, we can use this information in get_swap_pages() to decide
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9) which swap device to use and get better performance.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12) How to use this feature
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) =======================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15) Each swap device has a priority, which decides the order in which devices are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16) used. To make use of automatic binding, there is no need to manipulate the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17) priority settings of the swap devices. For example, on a 2 node machine,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18) assume 2 swap devices, swapA attached to node 0 and swapB attached to node 1,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) are going to be swapped on. Simply swap them on by doing::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21) # swapon /dev/swapA
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22) # swapon /dev/swapB
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24) Then node 0 will use the two swap devices in the order of swapA then swapB and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25) node 1 will use the two swap devices in the order of swapB then swapA. Note
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26) that the order of them being swapped on doesn't matter.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28) A more complex example on a 4 node machine. Assume 6 swap devices are going to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29) be swapped on: swapA and swapB are attached to node 0, swapC is attached to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30) node 1, swapD and swapE are attached to node 2 and swapF is attached to node 3.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31) The way to swap them on is the same as above::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) # swapon /dev/swapA
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34) # swapon /dev/swapB
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35) # swapon /dev/swapC
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) # swapon /dev/swapD
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37) # swapon /dev/swapE
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) # swapon /dev/swapF
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40) Then node 0 will use them in the order of::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42) swapA/swapB -> swapC -> swapD -> swapE -> swapF
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) swapA and swapB will be used in a round robin mode before any other swap device.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) node 1 will use them in the order of::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48) swapC -> swapA -> swapB -> swapD -> swapE -> swapF
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) node 2 will use them in the order of::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52) swapD/swapE -> swapA -> swapB -> swapC -> swapF
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) Similarly, swapD and swapE will be used in a round robin mode before any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) other swap devices.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57) node 3 will use them in the order of::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59) swapF -> swapA -> swapB -> swapC -> swapD -> swapE
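
The per-node orders above can be reproduced with a small model. This is only an illustrative sketch, not kernel code: the function names are hypothetical, and it assumes that no priority is passed to swapon, that system-picked priorities start from -2 in swapon order, and that a device counts as priority -1 on its own node.

```python
from itertools import groupby


def assign_auto_priorities(devices):
    """devices: list of (name, node) tuples in swapon order.

    Assumption of this sketch: priorities are system-picked, starting
    from -2 and decreasing by one for each later device.
    """
    return {name: -2 - i for i, (name, _node) in enumerate(devices)}


def swap_order_for_node(devices, node):
    """Return the groups of devices that `node` would try, in order.

    Devices inside one group share a priority and are used round robin.
    """
    prio = assign_auto_priorities(devices)
    # Devices attached to `node` are treated as promoted to priority -1.
    effective = [
        (-1 if dev_node == node else prio[name], name)
        for name, dev_node in devices
    ]
    # Highest priority first; the sort is stable, so ties keep swapon order.
    effective.sort(key=lambda item: -item[0])
    return [
        [name for _prio, name in group]
        for _prio, group in groupby(effective, key=lambda item: item[0])
    ]


if __name__ == "__main__":
    six = [("swapA", 0), ("swapB", 0), ("swapC", 1),
           ("swapD", 2), ("swapE", 2), ("swapF", 3)]
    for n in range(4):
        groups = swap_order_for_node(six, n)
        print(n, " -> ".join("/".join(g) for g in groups))
```

In this model the order among devices on other nodes follows their swapon order, while the local devices always come first, which is why swapping the two devices on in either order gives the same result in the 2 node example.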
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) Implementation details
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63) ======================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65) The current code uses a priority based list, swap_avail_list, to decide
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66) which swap device to use; if multiple swap devices share the same
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67) priority, they are used round robin. This change replaces the single
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68) global swap_avail_list with a per-NUMA-node list, i.e. each NUMA node
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69) sees its own priority based list of available swap devices. A swap
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70) device's priority can be promoted on its matching node's swap_avail_list.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72) A swap device's priority is currently set as follows: the user can set a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73) value >= 0, or the system picks one, starting from -1 and going downwards.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74) The priority value in the swap_avail_list is the negated value of the swap
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75) device's priority, because a plist is sorted from low to high. The new policy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76) doesn't change the semantics for priorities >= 0. System-picked priorities
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77) now start from -2 and go downwards, and -1 is reserved as the promoted
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78) value. So if multiple swap devices are attached to the same node, they will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79) all be promoted to priority -1 on that node's plist and will be used round
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80) robin before any other swap devices.
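
The negation and promotion rules can be illustrated with a short sketch of how a per-node list key might be derived. It is a simplified model, not the kernel implementation: plist_prio() and avail_list() are hypothetical names, and the sketch assumes that only system-picked (negative) priorities are promoted, in line with the statement that priorities >= 0 keep their semantics.

```python
def plist_prio(dev_prio, dev_node, list_node):
    """Key of a device on list_node's swap_avail_list.

    A plist is sorted from low to high, so the key is the negated
    priority: a smaller key means the device is used earlier.
    """
    if dev_prio < 0 and dev_node == list_node:
        # Promoted on the matching node: reserved priority -1, negated.
        return 1
    return -dev_prio


def avail_list(devices, list_node):
    """devices: list of (name, priority, node) tuples.

    Returns device names in the order list_node would try them.
    sorted() is stable, so equal keys keep their input order.
    """
    ordered = sorted(devices, key=lambda d: plist_prio(d[1], d[2], list_node))
    return [name for name, _prio, _node in ordered]


if __name__ == "__main__":
    # swapU has a user-set priority of 5; swapA and swapB are system-picked.
    devs = [("swapU", 5, 1), ("swapA", -2, 0), ("swapB", -3, 1)]
    print(avail_list(devs, 0))
    print(avail_list(devs, 1))
```

Under these assumptions a device with a user-set priority keeps its place ahead of every promoted device, and promotion only reorders the system-picked devices relative to each other.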