.. _unevictable_lru:

==============================
Unevictable LRU Infrastructure
==============================

.. contents:: :local:


Introduction
============

This document describes the Linux memory manager's "Unevictable LRU"
infrastructure and the use of this to manage several types of "unevictable"
pages.

The document attempts to provide the overall rationale behind this mechanism
and the rationale for some of the design decisions that drove the
implementation. The latter design rationale is discussed in the context of an
implementation description. Admittedly, one can obtain the implementation
details - the "what does it do?" - by reading the code. One hopes that the
descriptions below add value by providing the answer to "why does it do that?".


The Unevictable LRU
===================

The Unevictable LRU facility adds an additional LRU list to track unevictable
pages and to hide these pages from vmscan. This mechanism is based on a patch
by Larry Woodman of Red Hat to address several scalability problems with page
reclaim in Linux. The problems have been observed at customer sites on large
memory x86_64 systems.

To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of
main memory will have over 32 million 4k pages in a single zone. When a large
fraction of these pages are not evictable for any reason [see below], vmscan
will spend a lot of time scanning the LRU lists looking for the small fraction
of pages that are evictable. This can result in a situation where all CPUs are
spending 100% of their time in vmscan for hours or days on end, with the system
completely unresponsive.

The unevictable list addresses the following classes of unevictable pages:

* Those owned by ramfs.

* Those mapped into SHM_LOCK'd shared memory regions.

* Those mapped into VM_LOCKED [mlock()ed] VMAs.

The infrastructure may also be able to handle other conditions that make pages
unevictable, either by definition or by circumstance, in the future.

The Unevictable Page List
-------------------------

The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list
called the "unevictable" list and an associated page flag, PG_unevictable, to
indicate that the page is being managed on the unevictable list.

The PG_unevictable flag is analogous to, and mutually exclusive with, the
PG_active flag in that it indicates on which LRU list a page resides when
PG_lru is set.

The Unevictable LRU infrastructure maintains unevictable pages on an additional
LRU list for a few reasons:

(1) We get to "treat unevictable pages just like we treat other pages in the
    system - which means we get to use the same code to manipulate them, the
    same code to isolate them (for migrate, etc.), the same code to keep track
    of the statistics, etc..." [Rik van Riel]

(2) We want to be able to migrate unevictable pages between nodes for memory
    defragmentation, workload management and memory hotplug. The Linux kernel
    can only migrate pages that it can successfully isolate from the LRU
    lists. If we were to maintain pages elsewhere than on an LRU-like list,
    where they can be found by isolate_lru_page(), we would prevent their
    migration, unless we reworked migration code to find the unevictable pages
    itself.


The unevictable list does not differentiate between file-backed and anonymous,
swap-backed pages. This differentiation is only important while the pages are,
in fact, evictable.

The unevictable list benefits from the "arrayification" of the per-zone LRU
lists and statistics originally proposed and posted by Christoph Lameter.

The unevictable list does not use the LRU pagevec mechanism. Rather,
unevictable pages are placed directly on the page's zone's unevictable list
under the zone lru_lock. This allows us to prevent the stranding of pages on
the unevictable list when one task has the page isolated from the LRU and other
tasks are changing the "evictability" state of the page.

Memory Control Group Interaction
--------------------------------

The unevictable LRU facility interacts with the memory control group [aka
memory controller; see Documentation/admin-guide/cgroup-v1/memory.rst] by
extending the lru_list enum.

The memory controller data structure automatically gets a per-zone unevictable
list as a result of the "arrayification" of the per-zone LRU lists (one per
lru_list enum element). The memory controller tracks the movement of pages to
and from the unevictable list.

When a memory control group comes under memory pressure, the controller will
not attempt to reclaim pages on the unevictable list. This has a couple of
effects:

(1) Because the pages are "hidden" from reclaim on the unevictable list, the
    reclaim process can be more efficient, dealing only with pages that have a
    chance of being reclaimed.

(2) On the other hand, if too many of the pages charged to the control group
    are unevictable, the evictable portion of the working set of the tasks in
    the control group may not fit into the available memory. This can cause
    the control group to thrash or to OOM-kill tasks.

.. _mark_addr_space_unevict:

Marking Address Spaces Unevictable
----------------------------------

For facilities such as ramfs, none of the pages attached to the address space
may be evicted. To prevent eviction of any such pages, the AS_UNEVICTABLE
address space flag is provided, and this can be manipulated by a filesystem
using a number of wrapper functions:

* ``void mapping_set_unevictable(struct address_space *mapping);``

  Mark the address space as being completely unevictable.

* ``void mapping_clear_unevictable(struct address_space *mapping);``

  Mark the address space as being evictable.

* ``int mapping_unevictable(struct address_space *mapping);``

  Query the address space, and return true if it is completely
  unevictable.

These are currently used in three places in the kernel:

(1) By ramfs to mark the address spaces of its inodes when they are created,
    and this mark remains for the life of the inode.

(2) By SYSV SHM to mark SHM_LOCK'd address spaces until SHM_UNLOCK is called.

    Note that SHM_LOCK is not required to page in the locked pages if they're
    swapped out; the application must touch the pages manually if it wants to
    ensure they're in memory.

(3) By the i915 driver to mark pinned address spaces until they're unpinned.
    The amount of unevictable memory marked by the i915 driver is roughly the
    bounded object size in debugfs/dri/0/i915_gem_objects.

Detecting Unevictable Pages
---------------------------

The function page_evictable() in vmscan.c determines whether a page is
evictable or not using the query function outlined above [see section
:ref:`Marking address spaces unevictable <mark_addr_space_unevict>`]
to check the AS_UNEVICTABLE flag.

For address spaces that are so marked after being populated (as SHM regions
might be), the lock action (eg: SHM_LOCK) can be lazy, and need not populate
the page tables for the region as does, for example, mlock(), nor need it make
any special effort to push any pages in the SHM_LOCK'd area to the unevictable
list. Instead, vmscan will do this if and when it encounters the pages during
a reclamation scan.

On an unlock action (such as SHM_UNLOCK), the unlocker (eg: shmctl()) must scan
the pages in the region and "rescue" them from the unevictable list if no other
condition is keeping them unevictable. If an unevictable region is destroyed,
the pages are also "rescued" from the unevictable list in the process of
freeing them.

page_evictable() also checks for mlocked pages by testing an additional page
flag, PG_mlocked (as wrapped by PageMlocked()), which is set when a page is
faulted into a VM_LOCKED VMA, or found in a VMA that is being VM_LOCKED.

Vmscan's Handling of Unevictable Pages
--------------------------------------

If unevictable pages are culled in the fault path, or moved to the unevictable
list at mlock() or mmap() time, vmscan will not encounter the pages until they
have become evictable again (via munlock() for example) and have been "rescued"
from the unevictable list. However, there may be situations where we decide,
for the sake of expediency, to leave an unevictable page on one of the regular
active/inactive LRU lists for vmscan to deal with. vmscan checks for such
pages in all of the shrink_{active|inactive|page}_list() functions and will
"cull" such pages that it encounters: that is, it diverts those pages to the
unevictable list for the zone being scanned.

There may be situations where a page is mapped into a VM_LOCKED VMA, but the
page is not marked as PG_mlocked. Such pages will make it all the way to
shrink_page_list() where they will be detected when vmscan walks the reverse
map in try_to_unmap(). If try_to_unmap() returns SWAP_MLOCK,
shrink_page_list() will cull the page at that point.

To "cull" an unevictable page, vmscan simply puts the page back on the LRU list
using putback_lru_page() - the inverse operation to isolate_lru_page() - after
dropping the page lock. Because the condition which makes the page unevictable
may change once the page is unlocked, putback_lru_page() will recheck the
unevictable state of a page that it places on the unevictable list. If the
page has become evictable, putback_lru_page() removes it from the list and
retries, including the page_evictable() test. Because such a race is a rare
event and movement of pages onto the unevictable list should be rare, these
extra evictability checks should not occur in the majority of calls to
putback_lru_page().

MLOCKED Pages
=============

The unevictable page list is also useful for mlock(), in addition to ramfs and
SYSV SHM. Note that mlock() is only available in CONFIG_MMU=y situations; in
NOMMU situations, all mappings are effectively mlocked.

History
-------

The "Unevictable mlocked Pages" infrastructure is based on work originally
posted by Nick Piggin in an RFC patch entitled "mm: mlocked pages off LRU".
Nick posted his patch as an alternative to a patch posted by Christoph Lameter
to achieve the same objective: hiding mlocked pages from vmscan.

In Nick's patch, he used one of the struct page LRU list link fields as a count
of VM_LOCKED VMAs that map the page. This use of the link field for a count
prevented the management of the pages on an LRU list, and thus mlocked pages
were not migratable as isolate_lru_page() could not find them, and the LRU list
link field was not available to the migration subsystem.

Nick resolved this by putting mlocked pages back on the LRU list before
attempting to isolate them, thus abandoning the count of VM_LOCKED VMAs. When
Nick's patch was integrated with the Unevictable LRU work, the count was
replaced by walking the reverse map to determine whether any VM_LOCKED VMAs
mapped the page. More on this below.

Basic Management
----------------

mlocked pages - pages mapped into a VM_LOCKED VMA - are a class of unevictable
pages. When such a page has been "noticed" by the memory management subsystem,
the page is marked with the PG_mlocked flag. This can be manipulated using the
PageMlocked() functions.

A PG_mlocked page will be placed on the unevictable list when it is added to
the LRU. Such pages can be "noticed" by memory management in several places:

(1) in the mlock()/mlockall() system call handlers;

(2) in the mmap() system call handler when mmapping a region with the
    MAP_LOCKED flag;

(3) mmapping a region in a task that has called mlockall() with the MCL_FUTURE
    flag;

(4) in the fault path, if mlocked pages are "culled" there, and when a
    VM_LOCKED stack segment is expanded; or

(5) as mentioned above, in vmscan:shrink_page_list() when attempting to
    reclaim a page in a VM_LOCKED VMA via try_to_unmap();

all of which result in the VM_LOCKED flag being set for the VMA if it doesn't
already have it set.

mlocked pages become unlocked and rescued from the unevictable list when:

(1) mapped in a range unlocked via the munlock()/munlockall() system calls;

(2) munmap()'d out of the last VM_LOCKED VMA that maps the page, including
    unmapping at task exit;

(3) when the page is truncated from the last VM_LOCKED VMA of an mmapped file;
    or

(4) before a page is COW'd in a VM_LOCKED VMA.

mlock()/mlockall() System Call Handling
---------------------------------------

Both [do\_]mlock() and [do\_]mlockall() system call handlers call mlock_fixup()
for each VMA in the range specified by the call. In the case of mlockall(),
this is the entire active address space of the task. Note that mlock_fixup()
is used for both mlocking and munlocking a range of memory. A call to mlock()
an already VM_LOCKED VMA, or to munlock() a VMA that is not VM_LOCKED, is
treated as a no-op, and mlock_fixup() simply returns.

If the VMA passes some filtering as described in "Filtering Special VMAs"
below, mlock_fixup() will attempt to merge the VMA with its neighbors or split
off a subset of the VMA if the range does not cover the entire VMA. Once the
VMA has been merged or split or neither, mlock_fixup() will call
populate_vma_page_range() to fault in the pages via get_user_pages() and to
mark the pages as mlocked via mlock_vma_page().

Note that the VMA being mlocked might be mapped with PROT_NONE. In this case,
get_user_pages() will be unable to fault in the pages. That's okay. If pages
do end up getting faulted into this VM_LOCKED VMA, we'll handle them in the
fault path or in vmscan.

Also note that a page returned by get_user_pages() could be truncated or
migrated out from under us, while we're trying to mlock it. To detect this,
populate_vma_page_range() checks page_mapping() after acquiring the page lock.
If the page is still associated with its mapping, we'll go ahead and call
mlock_vma_page(). If the mapping is gone, we just unlock the page and move on.
In the worst case, this will result in a page mapped in a VM_LOCKED VMA
remaining on a normal LRU list without being PageMlocked(). Again, vmscan will
detect and cull such pages.

mlock_vma_page() will call TestSetPageMlocked() for each page returned by
get_user_pages(). We use TestSetPageMlocked() because the page might already
be mlocked by another task/VMA and we don't want to do extra work. We
especially do not want to count an mlocked page more than once in the
statistics. If the page was already mlocked, mlock_vma_page() need do nothing
more.

If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
page from the LRU, as it is likely on the appropriate active or inactive list
at that time. If isolate_lru_page() succeeds, mlock_vma_page() will put back
the page - by calling putback_lru_page() - which will notice that the page is
now mlocked and divert the page to the zone's unevictable list. If
mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
it later if and when it attempts to reclaim the page.

Filtering Special VMAs
----------------------

mlock_fixup() filters several classes of "special" VMAs:

1) VMAs with VM_IO or VM_PFNMAP set are skipped entirely. The pages behind
   these mappings are inherently pinned, so we don't need to mark them as
   mlocked. In any case, most of these pages have no struct page on which to
   set the flag. Because of this, get_user_pages() will fail for these VMAs,
   so there is no sense in attempting to visit them.

2) VMAs mapping hugetlbfs pages are already effectively pinned into memory. We
   neither need nor want to mlock() these pages. However, to preserve the
   prior behavior of mlock() - before the unevictable/mlock changes -
   mlock_fixup() will call make_pages_present() in the hugetlbfs VMA range to
   allocate the huge pages and populate the ptes.

3) VMAs with VM_DONTEXPAND are generally userspace mappings of kernel pages,
   such as the VDSO page, relay channel pages, etc. These pages
   are inherently unevictable and are not managed on the LRU lists.
   mlock_fixup() treats these VMAs the same as hugetlbfs VMAs. It calls
   make_pages_present() to populate the ptes.

Note that for all of these special VMAs, mlock_fixup() does not set the
VM_LOCKED flag. Therefore, we won't have to deal with them later during
munlock(), munmap() or task exit. Neither does mlock_fixup() account these
VMAs against the task's "locked_vm".

.. _munlock_munlockall_handling:

munlock()/munlockall() System Call Handling
-------------------------------------------

The munlock() and munlockall() system calls are handled by the same functions -
do_mlock[all]() - as the mlock() and mlockall() system calls with the unlock vs
lock operation indicated by an argument. So, these system calls are also
handled by mlock_fixup(). Again, if called for an already munlocked VMA,
mlock_fixup() simply returns. Because of the VMA filtering discussed above,
VM_LOCKED will not be set in any "special" VMAs. So, these VMAs will be
ignored for munlock.

If the VMA is VM_LOCKED, mlock_fixup() again attempts to merge or split off the
specified range. The range is then munlocked via the function
populate_vma_page_range() - the same function used to mlock a VMA range -
passing a flag to indicate that munlock() is being performed.

Because the VMA access protections could have been changed to PROT_NONE after
faulting in and mlocking pages, get_user_pages() was unreliable for visiting
these pages for munlocking. Because we don't want to leave pages mlocked,
get_user_pages() was enhanced to accept a flag to ignore the permissions when
fetching the pages - all of which should be resident as a result of previous
mlocking.

For munlock(), populate_vma_page_range() unlocks individual pages by calling
munlock_vma_page(). munlock_vma_page() unconditionally clears the PG_mlocked
flag using TestClearPageMlocked(). As with mlock_vma_page(),
munlock_vma_page() uses the Test*PageMlocked() function to handle the case
where the page might have already been unlocked by another task. If the page
was mlocked, munlock_vma_page() updates the zone statistics for the number of
mlocked pages. Note, however, that at this point we haven't checked whether
the page is mapped by other VM_LOCKED VMAs.

We can't call try_to_munlock(), the function that walks the reverse map to
check for other VM_LOCKED VMAs, without first isolating the page from the LRU.
try_to_munlock() is a variant of try_to_unmap() and thus requires that the page
not be on an LRU list [more on these below]. However, the call to
isolate_lru_page() could fail, in which case we couldn't try_to_munlock(). So,
we go ahead and clear PG_mlocked up front, as this might be the only chance we
have. If we can successfully isolate the page, we go ahead and
try_to_munlock(), which will restore the PG_mlocked flag and update the zone
page statistics if it finds another VMA holding the page mlocked. If we fail
to isolate the page, we'll have left a potentially mlocked page on the LRU.
This is fine, because we'll catch it later if and when vmscan tries to reclaim
the page. This should be relatively rare.


Migrating MLOCKED Pages
-----------------------

A page that is being migrated has been isolated from the LRU lists and is held
locked across unmapping of the page, updating the page's address space entry
and copying the contents and state, until the page table entry has been
replaced with an entry that refers to the new page. Linux supports migration
of mlocked pages and other unevictable pages. This involves simply moving the
PG_mlocked and PG_unevictable states from the old page to the new page.

Note that page migration can race with mlocking or munlocking of the same page.
This has been discussed from the mlock/munlock perspective in the respective
sections above. Both processes (migration and m[un]locking) hold the page
locked. This provides the first level of synchronization. Page migration
zeros out the page_mapping of the old page before unlocking it, so m[un]lock
can skip these pages by testing the page mapping under page lock.

To complete page migration, we place the new and old pages back onto the LRU
after dropping the page lock. The "unneeded" page - the old page on success,
the new page on failure - will be freed when the reference count held by the
migration process is released. To ensure that we don't strand pages on the
unevictable list because of a race between munlock and migration, page
migration uses the putback_lru_page() function to add migrated pages back to
the LRU.


Compacting MLOCKED Pages
------------------------

The unevictable LRU can be scanned for compactable regions, and the default
behavior is to do so. /proc/sys/vm/compact_unevictable_allowed controls
this behavior (see Documentation/admin-guide/sysctl/vm.rst). Once scanning of
the unevictable LRU is enabled, the work of compaction is mostly handled by
the page migration code, and the same work flow as described in Migrating
MLOCKED Pages above will apply.
MLOCKING Transparent Huge Pages
-------------------------------

A transparent huge page is represented by a single entry on an LRU list.
Therefore, we can only make unevictable an entire compound page, not
individual subpages.

If a user tries to mlock() part of a huge page, we want the rest of the
page to be reclaimable.

We cannot just split the page on partial mlock() as split_huge_page() can
fail, and a new intermittent failure mode for the syscall is undesirable.

We handle this by keeping PTE-mapped huge pages on normal LRU lists: the
PMD on the border of a VM_LOCKED VMA will be split into a PTE table.

This way the huge page is accessible for vmscan. Under memory pressure the
page will be split, subpages which belong to VM_LOCKED VMAs will be moved
to the unevictable LRU and the rest can be reclaimed.

See also the comment in follow_trans_huge_pmd().

mmap(MAP_LOCKED) System Call Handling
-------------------------------------

In addition to the mlock()/mlockall() system calls, an application can request
that a region of memory be mlocked by supplying the MAP_LOCKED flag to the
mmap() call. There is one important and subtle difference here, though.
mmap() + mlock() will fail if the range cannot be faulted in (e.g. because
mm_populate fails) and returns with ENOMEM, while mmap(MAP_LOCKED) will not
fail. The mmapped area will still have properties of the locked area - i.e.
pages will not get swapped out - but major page faults to fault memory in
might still happen.

Furthermore, any mmap() call or brk() call that expands the heap by a
task that has previously called mlockall() with the MCL_FUTURE flag will result
in the newly mapped memory being mlocked. Before the unevictable/mlock
changes, the kernel simply called make_pages_present() to allocate pages and
populate the page table.

To mlock a range of memory under the unevictable/mlock infrastructure, the
mmap() handler and task address space expansion functions call
populate_vma_page_range() specifying the vma and the address range to mlock.

The callers of populate_vma_page_range() will have already added the memory
range to be mlocked to the task's "locked_vm". To account for filtered VMAs,
populate_vma_page_range() returns the number of pages NOT mlocked. All of the
callers then subtract a non-negative return value from the task's locked_vm.
A negative return value represents an error - for example, from
get_user_pages() attempting to fault in a VMA with PROT_NONE access. In this
case, we leave the memory range accounted as locked_vm, as the protections
could be changed later and pages allocated into that region.


munmap()/exit()/exec() System Call Handling
-------------------------------------------

When unmapping an mlocked region of memory, whether by an explicit call to
munmap() or via an internal unmap from exit() or exec() processing, we must
munlock the pages if we're removing the last VM_LOCKED VMA that maps the pages.
Before the unevictable/mlock changes, mlocking did not mark the pages in any
way, so unmapping them required no processing.

To munlock a range of memory under the unevictable/mlock infrastructure, the
munmap() handler and the task address space teardown function call
munlock_vma_pages_all(). The name reflects the observation that one always
specifies the entire VMA range when munlock()ing during unmap of a region.
Because of the VMA filtering when mlocking() regions, only "normal" VMAs that
actually contain mlocked pages will be passed to munlock_vma_pages_all().

munlock_vma_pages_all() clears the VM_LOCKED VMA flag and, like mlock_fixup()
for the munlock case, calls __munlock_vma_pages_range() to walk the page table
for the VMA's memory range and munlock_vma_page() each resident page mapped by
the VMA. This effectively munlocks the page, but only if this is the last
VM_LOCKED VMA that maps the page.


try_to_unmap()
--------------

Pages can, of course, be mapped into multiple VMAs. Some of these VMAs may
have the VM_LOCKED flag set. It is possible for a page mapped into one or more
VM_LOCKED VMAs not to have the PG_mlocked flag set and therefore to reside on
one of the active or inactive LRU lists. This could happen if, for example, a
task in the process of munlocking the page could not isolate the page from the
LRU. As a result, vmscan/shrink_page_list() might encounter such a page as
described in section "vmscan's handling of unevictable pages". To handle this
situation, try_to_unmap() checks for VM_LOCKED VMAs while it is walking a
page's reverse map.

try_to_unmap() is always called, whether by vmscan for reclaim or for page
migration, with the page locked and isolated from the LRU. Separate
functions handle anonymous and mapped file and KSM pages, as these types of
pages have different reverse map lookup mechanisms, with different locking.
In each case, whether rmap_walk_anon() or rmap_walk_file() or rmap_walk_ksm(),
it will call try_to_unmap_one() for every VMA which might contain the page.

When trying to reclaim, if try_to_unmap_one() finds the page in a VM_LOCKED
VMA, it will then mlock the page via mlock_vma_page() instead of unmapping it,
and return SWAP_MLOCK to indicate that the page is unevictable; the scan
stops there.

mlock_vma_page() is called while holding the page table's lock (in addition
to the page lock, and the rmap lock): to serialize against concurrent mlock or
munlock or munmap system calls, mm teardown (munlock_vma_pages_all), reclaim,
holepunching, and truncation of file pages and their anonymous COWed pages.


try_to_munlock() Reverse Map Scan
---------------------------------

.. warning::
   [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the
   page_referenced() reverse map walker.

When munlock_vma_page() [see section :ref:`munlock()/munlockall() System Call
Handling <munlock_munlockall_handling>` above] tries to munlock a
page, it needs to determine whether or not the page is mapped by any
VM_LOCKED VMA without actually attempting to unmap all PTEs from the
page. For this purpose, the unevictable/mlock infrastructure
introduced a variant of try_to_unmap() called try_to_munlock().

try_to_munlock() calls the same functions as try_to_unmap() for anonymous and
mapped file and KSM pages, with a flag argument specifying unlock versus unmap
processing. Again, these functions walk the respective reverse maps looking
for VM_LOCKED VMAs. When such a VMA is found, as in the try_to_unmap() case,
the functions mlock the page via mlock_vma_page() and return SWAP_MLOCK. This
undoes the pre-clearing of the page's PG_mlocked done by munlock_vma_page().

Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's
reverse map to determine that the page is NOT mapped into any VM_LOCKED VMA.
However, the scan can terminate as soon as it encounters a VM_LOCKED VMA.
Although try_to_munlock() might be called a great many times when munlocking a
large region or tearing down a large address space that has been mlocked via
mlockall(), overall this is a fairly rare event.


Page Reclaim in shrink_*_list()
-------------------------------

shrink_active_list() culls any obviously unevictable pages - i.e.
!page_evictable(page) - diverting these to the unevictable list.
However, shrink_active_list() only sees unevictable pages that made it onto the
active/inactive LRU lists. Note that these pages do not have PageUnevictable
set - otherwise they would be on the unevictable list and shrink_active_list()
would never see them.

Some examples of these unevictable pages on the LRU lists are:

(1) ramfs pages that have been placed on the LRU lists when first allocated.

(2) SHM_LOCK'd shared memory pages. shmctl(SHM_LOCK) does not attempt to
    allocate or fault in the pages in the shared memory region. The pages are
    allocated - and land on the normal LRU lists - when an application first
    accesses them after SHM_LOCK'ing the segment.

(3) mlocked pages that could not be isolated from the LRU and moved to the
    unevictable list in mlock_vma_page().

shrink_inactive_list() also diverts any unevictable pages that it finds on the
inactive lists to the appropriate zone's unevictable list.

shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
after shrink_active_list() had moved them to the inactive list, or pages mapped
into VM_LOCKED VMAs that munlock_vma_page() couldn't isolate from the LRU to
recheck via try_to_munlock(). shrink_inactive_list() won't notice the latter,
but will pass them on to shrink_page_list().

shrink_page_list() again culls obviously unevictable pages that it could
encounter for a similar reason to shrink_inactive_list(). Pages mapped into
VM_LOCKED VMAs but without PG_mlocked set will make it all the way to
try_to_unmap(). shrink_page_list() will divert them to the unevictable list
when try_to_unmap() returns SWAP_MLOCK, as discussed above.