Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   1) .. _unevictable_lru:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   2) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   3) ==============================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   4) Unevictable LRU Infrastructure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   5) ==============================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   6) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   7) .. contents:: :local:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   8) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   9) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  10) Introduction
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  11) ============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  12) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  13) This document describes the Linux memory manager's "Unevictable LRU"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  14) infrastructure and the use of this to manage several types of "unevictable"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  15) pages.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  16) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  17) The document attempts to provide the overall rationale behind this mechanism
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  18) and the rationale for some of the design decisions that drove the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  19) implementation.  The latter design rationale is discussed in the context of an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  20) implementation description.  Admittedly, one can obtain the implementation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  21) details - the "what does it do?" - by reading the code.  One hopes that the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  22) descriptions below add value by providing the answer to "why does it do that?".
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  23) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  24) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  25) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  26) The Unevictable LRU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  27) ===================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  28) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  29) The Unevictable LRU facility adds an additional LRU list to track unevictable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  30) pages and to hide these pages from vmscan.  This mechanism is based on a patch
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  31) by Larry Woodman of Red Hat to address several scalability problems with page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  32) reclaim in Linux.  The problems have been observed at customer sites on large
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  33) memory x86_64 systems.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  34) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  35) To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  36) main memory will have over 32 million 4k pages in a single zone.  When a large
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  37) fraction of these pages are not evictable for any reason [see below], vmscan
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  38) will spend a lot of time scanning the LRU lists looking for the small fraction
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  39) of pages that are evictable.  This can result in a situation where all CPUs are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  40) spending 100% of their time in vmscan for hours or days on end, with the system
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  41) completely unresponsive.
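
The page count quoted above can be checked with a short calculation.  The
sketch below assumes that "128GB" means 128 GiB and that "4k" means 4096-byte
pages; the helper name pages_in() is hypothetical and is not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of whole pages that fit in mem_bytes.
 * page_bytes must be a non-zero power of two (4096 in the example above).
 */
static uint64_t pages_in(uint64_t mem_bytes, uint64_t page_bytes)
{
    assert(page_bytes != 0 && (page_bytes & (page_bytes - 1)) == 0);
    return mem_bytes / page_bytes;
}
```

Under those assumptions, pages_in(128ULL << 30, 4096) yields 2^25 = 33,554,432
pages, which matches the "over 32 million" figure.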
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  42) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  43) The unevictable list addresses the following classes of unevictable pages:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  44) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  45)  * Those owned by ramfs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  46) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  47)  * Those mapped into SHM_LOCK'd shared memory regions.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  48) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  49)  * Those mapped into VM_LOCKED [mlock()ed] VMAs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  50) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  51) The infrastructure may also be able to handle other conditions that make pages
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  52) unevictable, either by definition or by circumstance, in the future.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  53) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  54) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  55) The Unevictable Page List
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  56) -------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  57) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  58) The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  59) called the "unevictable" list and an associated page flag, PG_unevictable, to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  60) indicate that the page is being managed on the unevictable list.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  61) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  62) The PG_unevictable flag is analogous to, and mutually exclusive with, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  63) PG_active flag in that it indicates on which LRU list a page resides when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  64) PG_lru is set.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  65) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  66) The Unevictable LRU infrastructure maintains unevictable pages on an additional
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  67) LRU list for a few reasons:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  68) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  69)  (1) We get to "treat unevictable pages just like we treat other pages in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  70)      system - which means we get to use the same code to manipulate them, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  71)      same code to isolate them (for migrate, etc.), the same code to keep track
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  72)      of the statistics, etc..." [Rik van Riel]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  73) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  74)  (2) We want to be able to migrate unevictable pages between nodes for memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  75)      defragmentation, workload management and memory hotplug.  The Linux kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  76)      can only migrate pages that it can successfully isolate from the LRU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  77)      lists.  If we were to maintain pages elsewhere than on an LRU-like list,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  78)      where they can be found by isolate_lru_page(), we would prevent their
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  79)      migration, unless we reworked migration code to find the unevictable pages
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  80)      itself.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  81) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  82) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  83) The unevictable list does not differentiate between file-backed and anonymous,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  84) swap-backed pages.  This differentiation is only important while the pages are,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  85) in fact, evictable.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  86) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  87) The unevictable list benefits from the "arrayification" of the per-zone LRU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  88) lists and statistics originally proposed and posted by Christoph Lameter.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  89) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  90) The unevictable list does not use the LRU pagevec mechanism. Rather,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  91) unevictable pages are placed directly on the page's zone's unevictable list
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  92) under the zone lru_lock.  This allows us to prevent the stranding of pages on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  93) the unevictable list when one task has the page isolated from the LRU and other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  94) tasks are changing the "evictability" state of the page.
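
How the page flags select a list can be illustrated with a small userspace
model.  This is only a sketch: the MODEL_ prefixed names and struct page_model
are hypothetical stand-ins, loosely following the lru_list enum mentioned in
this document, and they are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* One slot per list, so lists and statistics can live in plain arrays. */
enum lru_list_model {
    MODEL_LRU_INACTIVE_ANON,
    MODEL_LRU_ACTIVE_ANON,
    MODEL_LRU_INACTIVE_FILE,
    MODEL_LRU_ACTIVE_FILE,
    MODEL_LRU_UNEVICTABLE,
    MODEL_NR_LRU_LISTS
};

struct page_model {
    bool lru;          /* models PG_lru: page is on some LRU list */
    bool active;       /* models PG_active */
    bool unevictable;  /* models PG_unevictable */
    bool file_backed;  /* file-backed vs anonymous, swap-backed */
};

/* Pick the list a page belongs on, based on its flags. */
static enum lru_list_model model_page_lru(const struct page_model *p)
{
    /* The two flags are mutually exclusive, as described above. */
    assert(!(p->active && p->unevictable));

    /* A single unevictable list: file vs anon is ignored here. */
    if (p->unevictable)
        return MODEL_LRU_UNEVICTABLE;
    if (p->file_backed)
        return p->active ? MODEL_LRU_ACTIVE_FILE : MODEL_LRU_INACTIVE_FILE;
    return p->active ? MODEL_LRU_ACTIVE_ANON : MODEL_LRU_INACTIVE_ANON;
}
```

Because the unevictable list is just one more enum element, any array sized by
MODEL_NR_LRU_LISTS gains a slot for it without further changes.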
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  95) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  96) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  97) Memory Control Group Interaction
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  98) --------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  99) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) The unevictable LRU facility interacts with the memory control group [aka
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) memory controller; see Documentation/admin-guide/cgroup-v1/memory.rst] by extending the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) lru_list enum.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) The memory controller data structure automatically gets a per-zone unevictable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) list as a result of the "arrayification" of the per-zone LRU lists (one per
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) lru_list enum element).  The memory controller tracks the movement of pages to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) and from the unevictable list.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) When a memory control group comes under memory pressure, the controller will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) not attempt to reclaim pages on the unevictable list.  This has a couple of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) effects:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113)  (1) Because the pages are "hidden" from reclaim on the unevictable list, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114)      reclaim process can be more efficient, dealing only with pages that have a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115)      chance of being reclaimed.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117)  (2) On the other hand, if too many of the pages charged to the control group
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118)      are unevictable, the evictable portion of the working set of the tasks in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119)      the control group may not fit into the available memory.  This can cause
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120)      the control group to thrash or to OOM-kill tasks.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) .. _mark_addr_space_unevict:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) Marking Address Spaces Unevictable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) ----------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) For facilities such as ramfs, none of the pages attached to the address space
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) may be evicted.  To prevent eviction of any such pages, the AS_UNEVICTABLE
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) address space flag is provided, and this can be manipulated by a filesystem
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) using a number of wrapper functions:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133)  * ``void mapping_set_unevictable(struct address_space *mapping);``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) 	Mark the address space as being completely unevictable.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137)  * ``void mapping_clear_unevictable(struct address_space *mapping);``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) 	Mark the address space as being evictable.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141)  * ``int mapping_unevictable(struct address_space *mapping);``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) 	Query the address space, and return true if it is completely
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) 	unevictable.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146) These are currently used in three places in the kernel:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148)  (1) By ramfs to mark the address spaces of its inodes when they are created,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149)      and this mark remains for the life of the inode.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151)  (2) By SYSV SHM to mark SHM_LOCK'd address spaces until SHM_UNLOCK is called.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153)      Note that SHM_LOCK is not required to page in the locked pages if they're
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154)      swapped out; the application must touch the pages manually if it wants to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155)      ensure they're in memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157)  (3) By the i915 driver to mark pinned address space until it's unpinned. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158)      amount of unevictable memory marked by the i915 driver is roughly the bounded
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159)      object size in debugfs/dri/0/i915_gem_objects.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162) Detecting Unevictable Pages
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) ---------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) The function page_evictable() in vmscan.c determines whether a page is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) evictable or not using the query function outlined above [see section
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) :ref:`Marking address spaces unevictable <mark_addr_space_unevict>`]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) to check the AS_UNEVICTABLE flag.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) For address spaces that are so marked after being populated (as SHM regions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171) might be), the lock action (eg: SHM_LOCK) can be lazy, and need not populate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) the page tables for the region as does, for example, mlock(), nor need it make
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) any special effort to push any pages in the SHM_LOCK'd area to the unevictable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174) list.  Instead, vmscan will do this if and when it encounters the pages during
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) a reclamation scan.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177) On an unlock action (such as SHM_UNLOCK), the unlocker (eg: shmctl()) must scan
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178) the pages in the region and "rescue" them from the unevictable list if no other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179) condition is keeping them unevictable.  If an unevictable region is destroyed,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180) the pages are also "rescued" from the unevictable list in the process of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181) freeing them.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183) page_evictable() also checks for mlocked pages by testing an additional page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184) flag, PG_mlocked (as wrapped by PageMlocked()), which is set when a page is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185) faulted into a VM_LOCKED vma, or found in a vma being VM_LOCKED.
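
The two checks described in this section can be sketched as a userspace model.
Everything here is illustrative: struct mapping_model, struct evict_page and
the model_ functions are hypothetical names, and the bit chosen for
MODEL_AS_UNEVICTABLE is an assumption rather than the kernel's value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_AS_UNEVICTABLE (1UL << 0)   /* bit position is an assumption */

struct mapping_model {
    unsigned long flags;                  /* models the address space flags */
};

struct evict_page {
    struct mapping_model *mapping;        /* NULL if the page has no mapping */
    bool mlocked;                         /* models PG_mlocked */
};

static void model_mapping_set_unevictable(struct mapping_model *m)
{
    assert(m != NULL);
    m->flags |= MODEL_AS_UNEVICTABLE;
}

static void model_mapping_clear_unevictable(struct mapping_model *m)
{
    assert(m != NULL);
    m->flags &= ~MODEL_AS_UNEVICTABLE;
}

static bool model_mapping_unevictable(const struct mapping_model *m)
{
    return m != NULL && (m->flags & MODEL_AS_UNEVICTABLE) != 0;
}

/* A page is evictable only if neither condition pins it. */
static bool model_page_evictable(const struct evict_page *p)
{
    return !model_mapping_unevictable(p->mapping) && !p->mlocked;
}
```

In this model, marking the mapping is lazy in the same sense as SHM_LOCK: the
flag changes the answer for every page of the mapping at once, without
touching the pages themselves.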
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188) Vmscan's Handling of Unevictable Pages
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189) --------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191) If unevictable pages are culled in the fault path, or moved to the unevictable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192) list at mlock() or mmap() time, vmscan will not encounter the pages until they
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193) have become evictable again (via munlock() for example) and have been "rescued"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194) from the unevictable list.  However, there may be situations where we decide,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195) for the sake of expediency, to leave an unevictable page on one of the regular
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196) active/inactive LRU lists for vmscan to deal with.  vmscan checks for such
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197) pages in all of the shrink_{active|inactive|page}_list() functions and will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198) "cull" such pages that it encounters: that is, it diverts those pages to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199) unevictable list for the zone being scanned.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201) There may be situations where a page is mapped into a VM_LOCKED VMA, but the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202) page is not marked as PG_mlocked.  Such pages will make it all the way to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203) shrink_page_list() where they will be detected when vmscan walks the reverse
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204) map in try_to_unmap().  If try_to_unmap() returns SWAP_MLOCK,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205) shrink_page_list() will cull the page at that point.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 206) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 207) To "cull" an unevictable page, vmscan simply puts the page back on the LRU list
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 208) using putback_lru_page() - the inverse operation to isolate_lru_page() - after
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 209) dropping the page lock.  Because the condition which makes the page unevictable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 210) may change once the page is unlocked, putback_lru_page() will recheck the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 211) unevictable state of a page that it places on the unevictable list.  If the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 212) page has become evictable, putback_lru_page() removes it from the list and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 213) retries, including the page_evictable() test.  Because such a race is a rare
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 214) event and movement of pages onto the unevictable list should be rare, these
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 215) extra evictability checks should not occur in the majority of calls to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 216) putback_lru_page().
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 217) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 218) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 219) MLOCKED Pages
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 220) =============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 221) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 222) The unevictable page list is also useful for mlock(), in addition to ramfs and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 223) SYSV SHM.  Note that mlock() is only available in CONFIG_MMU=y situations; in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 224) NOMMU situations, all mappings are effectively mlocked.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 225) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 226) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 227) History
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 228) -------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 229) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 230) The "Unevictable mlocked Pages" infrastructure is based on work originally
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 231) posted by Nick Piggin in an RFC patch entitled "mm: mlocked pages off LRU".
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 232) Nick posted his patch as an alternative to a patch posted by Christoph Lameter
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 233) to achieve the same objective: hiding mlocked pages from vmscan.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 234) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 235) In Nick's patch, he used one of the struct page LRU list link fields as a count
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 236) of VM_LOCKED VMAs that map the page.  This use of the link field for a count
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 237) prevented the management of the pages on an LRU list, and thus mlocked pages
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 238) were not migratable as isolate_lru_page() could not find them, and the LRU list
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 239) link field was not available to the migration subsystem.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 240) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 241) Nick resolved this by putting mlocked pages back on the lru list before
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 242) attempting to isolate them, thus abandoning the count of VM_LOCKED VMAs.  When
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 243) Nick's patch was integrated with the Unevictable LRU work, the count was
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 244) replaced by walking the reverse map to determine whether any VM_LOCKED VMAs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 245) mapped the page.  More on this below.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 246) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 247) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 248) Basic Management
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 249) ----------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 250) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 251) mlocked pages - pages mapped into a VM_LOCKED VMA - are a class of unevictable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 252) pages.  When such a page has been "noticed" by the memory management subsystem,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 253) the page is marked with the PG_mlocked flag.  This can be manipulated using the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 254) PageMlocked() functions.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 255) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 256) A PG_mlocked page will be placed on the unevictable list when it is added to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 257) the LRU.  Such pages can be "noticed" by memory management in several places:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 258) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 259)  (1) in the mlock()/mlockall() system call handlers;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 260) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 261)  (2) in the mmap() system call handler when mmapping a region with the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 262)      MAP_LOCKED flag;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 263) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 264)  (3) when mmapping a region in a task that has called mlockall() with the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 265)      MCL_FUTURE flag;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 266) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 267)  (4) in the fault path, if mlocked pages are "culled" in the fault path,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 268)      and when a VM_LOCKED stack segment is expanded; or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 269) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 270)  (5) as mentioned above, in vmscan:shrink_page_list() when attempting to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 271)      reclaim a page in a VM_LOCKED VMA via try_to_unmap()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 272) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 273) all of which result in the VM_LOCKED flag being set for the VMA if it doesn't
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 274) already have it set.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 275) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 276) mlocked pages become unlocked and are rescued from the unevictable list when:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 277) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 278)  (1) mapped in a range unlocked via the munlock()/munlockall() system calls;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 279) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 280)  (2) munmap()'d out of the last VM_LOCKED VMA that maps the page, including
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 281)      unmapping at task exit;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 282) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 283)  (3) when the page is truncated from the last VM_LOCKED VMA of an mmapped file;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 284)      or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 285) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 286)  (4) before a page is COW'd in a VM_LOCKED VMA.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 287) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 288) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 289) mlock()/mlockall() System Call Handling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 290) ---------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 291) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 292) Both [do\_]mlock() and [do\_]mlockall() system call handlers call mlock_fixup()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 293) for each VMA in the range specified by the call.  In the case of mlockall(),
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 294) this is the entire active address space of the task.  Note that mlock_fixup()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 295) is used for both mlocking and munlocking a range of memory.  A call to mlock()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 296) an already VM_LOCKED VMA, or to munlock() a VMA that is not VM_LOCKED, is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 297) treated as a no-op, and mlock_fixup() simply returns.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 298) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 299) If the VMA passes some filtering as described in "Filtering Special VMAs"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 300) below, mlock_fixup() will attempt to merge the VMA with its neighbors or split
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 301) off a subset of the VMA if the range does not cover the entire VMA.  Once the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 302) VMA has been merged or split or neither, mlock_fixup() will call
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 303) populate_vma_page_range() to fault in the pages via get_user_pages() and to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 304) mark the pages as mlocked via mlock_vma_page().
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 305) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 306) Note that the VMA being mlocked might be mapped with PROT_NONE.  In this case,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 307) get_user_pages() will be unable to fault in the pages.  That's okay.  If pages
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 308) do end up getting faulted into this VM_LOCKED VMA, we'll handle them in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 309) fault path or in vmscan.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 310) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 311) Also note that a page returned by get_user_pages() could be truncated or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 312) migrated out from under us, while we're trying to mlock it.  To detect this,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 313) populate_vma_page_range() checks page_mapping() after acquiring the page lock.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 314) If the page is still associated with its mapping, we'll go ahead and call
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 315) mlock_vma_page().  If the mapping is gone, we just unlock the page and move on.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 316) In the worst case, this will result in a page mapped in a VM_LOCKED VMA
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 317) remaining on a normal LRU list without being PageMlocked().  Again, vmscan will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 318) detect and cull such pages.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 319) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 320) mlock_vma_page() will call TestSetPageMlocked() for each page returned by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 321) get_user_pages().  We use TestSetPageMlocked() because the page might already
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 322) be mlocked by another task/VMA and we don't want to do extra work.  We
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 323) especially do not want to count an mlocked page more than once in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 324) statistics.  If the page was already mlocked, mlock_vma_page() need do nothing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 325) more.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 326) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 327) If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 328) page from the LRU, as it is likely on the appropriate active or inactive list
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 329) at that time.  If the isolate_lru_page() succeeds, mlock_vma_page() will put
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 330) back the page - by calling putback_lru_page() - which will notice that the page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 331) is now mlocked and divert the page to the zone's unevictable list.  If
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 332) mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 333) it later if and when it attempts to reclaim the page.
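
The reason for using a test-and-set operation can be seen in a small userspace
model.  This is a sketch under stated assumptions: struct mlock_page,
model_nr_mlock and the model_ functions are hypothetical, the flag update is
not atomic here, and the LRU isolation step is reduced to a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mlock_page {
    bool mlocked;                       /* models PG_mlocked */
};

static unsigned long model_nr_mlock;    /* models the mlocked-page statistic */

/* Set the flag and report whether it was already set. */
static bool model_test_set_mlocked(struct mlock_page *p)
{
    bool was_set = p->mlocked;

    p->mlocked = true;
    return was_set;
}

static void model_mlock_vma_page(struct mlock_page *p)
{
    assert(p != NULL);

    /* Already mlocked through another task or VMA: no extra work. */
    if (model_test_set_mlocked(p))
        return;

    /* First locker: count the page exactly once. */
    model_nr_mlock++;

    /*
     * The kernel would now try to isolate the page from its LRU list and
     * put it back, which diverts it to the unevictable list.
     */
}
```

Calling model_mlock_vma_page() twice on the same page leaves model_nr_mlock at
one, which is the double-counting protection described above.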
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 334) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 335) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 336) Filtering Special VMAs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 337) ----------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 338) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 339) mlock_fixup() filters several classes of "special" VMAs:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 340) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 341) 1) VMAs with VM_IO or VM_PFNMAP set are skipped entirely.  The pages behind
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 342)    these mappings are inherently pinned, so we don't need to mark them as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 343)    mlocked.  In any case, most of the pages have no struct page in which to so
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 344)    mark the page.  Because of this, get_user_pages() will fail for these VMAs,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 345)    so there is no sense in attempting to visit them.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 346) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 347) 2) VMAs mapping hugetlbfs pages are already effectively pinned into memory.  We
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 348)    neither need nor want to mlock() these pages.  However, to preserve the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 349)    prior behavior of mlock() - before the unevictable/mlock changes -
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 350)    mlock_fixup() will call make_pages_present() in the hugetlbfs VMA range to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 351)    allocate the huge pages and populate the ptes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 352) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 353) 3) VMAs with VM_DONTEXPAND are generally userspace mappings of kernel pages,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 354)    such as the VDSO page, relay channel pages, etc. These pages
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 355)    are inherently unevictable and are not managed on the LRU lists.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 356)    mlock_fixup() treats these VMAs the same as hugetlbfs VMAs.  It calls
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 357)    make_pages_present() to populate the ptes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 358) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 359) Note that for all of these special VMAs, mlock_fixup() does not set the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 360) VM_LOCKED flag.  Therefore, we won't have to deal with them later during
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 361) munlock(), munmap() or task exit.  Neither does mlock_fixup() account these
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 362) VMAs against the task's "locked_vm".
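
The three filters above can be sketched as a small userspace model.  This is
an illustration under stated assumptions, not the kernel code: the flag bit
values, struct vma_model, mlock_filter() and mlock_apply() are hypothetical
names invented here.  Only the flag names and the three classes come from the
text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit values, for illustration only; the real ones are
 * defined in the kernel headers. */
#define VM_LOCKED      0x1u
#define VM_IO          0x2u
#define VM_PFNMAP      0x4u
#define VM_DONTEXPAND  0x8u

enum mlock_action {
    MLOCK_SKIP,          /* class 1: the VMA is not visited at all */
    MLOCK_POPULATE_ONLY, /* classes 2 and 3: populate ptes, do not mlock */
    MLOCK_LOCK           /* normal VMA: set VM_LOCKED and mlock the pages */
};

/* Hypothetical stand-in for the parts of a VMA that the filter looks at. */
struct vma_model {
    unsigned int flags;
    bool is_hugetlb;
    unsigned long nr_pages;
};

/* Decide how a VMA is treated, checking the classes in the order listed. */
enum mlock_action mlock_filter(const struct vma_model *vma)
{
    assert(vma != NULL);
    if (vma->flags & (VM_IO | VM_PFNMAP))
        return MLOCK_SKIP;
    if (vma->is_hugetlb || (vma->flags & VM_DONTEXPAND))
        return MLOCK_POPULATE_ONLY;
    return MLOCK_LOCK;
}

/* Only normal VMAs get VM_LOCKED and are charged to the task's locked_vm. */
void mlock_apply(struct vma_model *vma, unsigned long *locked_vm)
{
    assert(locked_vm != NULL);
    if (mlock_filter(vma) == MLOCK_LOCK) {
        vma->flags |= VM_LOCKED;
        *locked_vm += vma->nr_pages;
    }
}
```

Because special VMAs never get VM_LOCKED in this model, a later munlock pass
that only looks at VM_LOCKED VMAs skips them without any extra check.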
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 363) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 364) .. _munlock_munlockall_handling:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 365) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 366) munlock()/munlockall() System Call Handling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 367) -------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 368) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 369) The munlock() and munlockall() system calls are handled by the same functions -
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 370) do_mlock[all]() - as the mlock() and mlockall() system calls with the unlock vs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 371) lock operation indicated by an argument.  So, these system calls are also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 372) handled by mlock_fixup().  Again, if called for an already munlocked VMA,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 373) mlock_fixup() simply returns.  Because of the VMA filtering discussed above,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 374) VM_LOCKED will not be set in any "special" VMAs.  So, these VMAs will be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 375) ignored for munlock.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 376) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 377) If the VMA is VM_LOCKED, mlock_fixup() again attempts to merge or split off the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 378) specified range.  The range is then munlocked via the function
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 379) populate_vma_page_range() - the same function used to mlock a VMA range -
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 380) passing a flag to indicate that munlock() is being performed.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 381) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 382) Because the VMA access protections could have been changed to PROT_NONE after
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 383) faulting in and mlocking pages, get_user_pages() was unreliable for visiting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 384) these pages for munlocking.  Because we don't want to leave pages mlocked,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 385) get_user_pages() was enhanced to accept a flag to ignore the permissions when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 386) fetching the pages - all of which should be resident as a result of previous
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 387) mlocking.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 388) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 389) For munlock(), populate_vma_page_range() unlocks individual pages by calling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 390) munlock_vma_page().  munlock_vma_page() unconditionally clears the PG_mlocked
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 391) flag using TestClearPageMlocked().  As with mlock_vma_page(),
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 392) munlock_vma_page() uses the Test*PageMlocked() function to handle the case where
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 393) the page might have already been unlocked by another task.  If the page was
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 394) mlocked, munlock_vma_page() updates the zone statistics for the number of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 395) mlocked pages.  Note, however, that at this point we haven't checked whether
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 396) the page is mapped by other VM_LOCKED VMAs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 397) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 398) We can't call try_to_munlock(), the function that walks the reverse map to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 399) check for other VM_LOCKED VMAs, without first isolating the page from the LRU.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 400) try_to_munlock() is a variant of try_to_unmap() and thus requires that the page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 401) not be on an LRU list [more on these below].  However, the call to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 402) isolate_lru_page() could fail, in which case we couldn't try_to_munlock().  So,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 403) we go ahead and clear PG_mlocked up front, as this might be the only chance we
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 404) have.  If we can successfully isolate the page, we go ahead and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 405) try_to_munlock(), which will restore the PG_mlocked flag and update the zone
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 406) page statistics if it finds another VMA holding the page mlocked.  If we fail
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 407) to isolate the page, we'll have left a potentially mlocked page on the LRU.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 408) This is fine, because we'll catch it later if and when vmscan tries to reclaim
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 409) the page.  This should be relatively rare.
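
The "clear first, recheck if possible" ordering described above can be modelled
in userspace as follows.  This is a sketch, not the kernel implementation:
struct page_model, struct zone_stats and every function name below are
hypothetical, and the reverse map walk is reduced to a simple counter of other
VM_LOCKED VMAs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the page state that matters for munlock. */
struct page_model {
    bool mlocked;          /* models PG_mlocked */
    bool on_lru;           /* page currently sits on an LRU list */
    bool isolatable;       /* whether isolating it from the LRU succeeds */
    int other_locked_vmas; /* other VM_LOCKED VMAs still mapping the page */
};

struct zone_stats {
    long nr_mlock;         /* models the zone's count of mlocked pages */
};

/* Models TestClearPageMlocked(): clear the flag, return its old value. */
static bool test_clear_mlocked(struct page_model *page)
{
    bool was_set = page->mlocked;
    page->mlocked = false;
    return was_set;
}

/* Models isolate_lru_page(), which can fail. */
static bool isolate_from_lru(struct page_model *page)
{
    if (!page->on_lru || !page->isolatable)
        return false;
    page->on_lru = false;
    return true;
}

/* Models try_to_munlock(): restore the flag if another VMA holds the page. */
static void recheck_other_vmas(struct page_model *page, struct zone_stats *zs)
{
    if (page->other_locked_vmas > 0) {
        page->mlocked = true;
        zs->nr_mlock++;
    }
}

void munlock_page_model(struct page_model *page, struct zone_stats *zs)
{
    assert(page != NULL && zs != NULL);
    if (!test_clear_mlocked(page))
        return;            /* already munlocked by someone else */
    zs->nr_mlock--;        /* the flag was set, so update the statistics */
    if (isolate_from_lru(page)) {
        recheck_other_vmas(page, zs);
        page->on_lru = true; /* put the page back on an LRU list */
    }
    /* On isolation failure the page stays on the LRU with the flag clear;
     * in the text above, vmscan is what catches it later. */
}
```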
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 410) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 411) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 412) Migrating MLOCKED Pages
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 413) -----------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 414) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 415) A page that is being migrated has been isolated from the LRU lists and is held
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 416) locked across unmapping of the page, updating the page's address space entry
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 417) and copying the contents and state, until the page table entry has been
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 418) replaced with an entry that refers to the new page.  Linux supports migration
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 419) of mlocked pages and other unevictable pages.  This involves simply moving the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 420) PG_mlocked and PG_unevictable states from the old page to the new page.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 421) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 422) Note that page migration can race with mlocking or munlocking of the same page.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 423) This has been discussed from the mlock/munlock perspective in the respective
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 424) sections above.  Both processes (migration and m[un]locking) hold the page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 425) locked.  This provides the first level of synchronization.  Page migration
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 426) zeros out the page_mapping of the old page before unlocking it, so m[un]lock
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 427) can skip these pages by testing the page mapping under page lock.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 428) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 429) To complete page migration, we place the new and old pages back onto the LRU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 430) after dropping the page lock.  The "unneeded" page - old page on success, new
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 431) page on failure - will be freed when the reference count held by the migration
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 432) process is released.  To ensure that we don't strand pages on the unevictable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 433) list because of a race between munlock and migration, page migration uses the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 434) putback_lru_page() function to add migrated pages back to the LRU.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 435) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 436) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 437) Compacting MLOCKED Pages
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 438) ------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 439) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 440) The unevictable LRU can be scanned for compactable regions and the default
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 441) behavior is to do so.  /proc/sys/vm/compact_unevictable_allowed controls
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 442) this behavior (see Documentation/admin-guide/sysctl/vm.rst).  Once scanning of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 443) unevictable LRU is enabled, the work of compaction is mostly handled by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 444) the page migration code, and the same workflow as described in the "Migrating
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 445) MLOCKED Pages" section above will apply.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 446) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 447) MLOCKING Transparent Huge Pages
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 448) -------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 449) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 450) A transparent huge page is represented by a single entry on an LRU list.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 451) Therefore, we can only make unevictable an entire compound page, not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 452) individual subpages.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 453) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 454) If a user tries to mlock() part of a huge page, we want the rest of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 455) page to be reclaimable.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 456) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 457) We cannot just split the page on partial mlock() as split_huge_page() can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 458) fail, and a new intermittent failure mode for the syscall is undesirable.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 459) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 460) We handle this by keeping PTE-mapped huge pages on normal LRU lists: the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 461) PMD on the border of a VM_LOCKED VMA will be split into a PTE table.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 462) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 463) This way the huge page is accessible for vmscan. Under memory pressure the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 464) page will be split: subpages that belong to VM_LOCKED VMAs will be moved
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 465) to the unevictable LRU and the rest can be reclaimed.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 466) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 467) See also the comment in follow_trans_huge_pmd().
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 468) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 469) mmap(MAP_LOCKED) System Call Handling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 470) -------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 471) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 472) In addition to the mlock()/mlockall() system calls, an application can request
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 473) that a region of memory be mlocked by supplying the MAP_LOCKED flag to the mmap()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 474) call. There is one important and subtle difference here, though. mmap() + mlock()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 475) will fail if the range cannot be faulted in (e.g. because mm_populate fails)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 476) and return ENOMEM, while mmap(MAP_LOCKED) will not fail. The mmaped
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 477) area will still have properties of the locked area - i.e. pages will not get
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 478) swapped out - but major page faults to fault memory in might still happen.
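
An application that needs to know whether the range was actually faulted in
can therefore prefer mmap() followed by an explicit mlock().  The following is
a minimal userspace sketch of that pattern; map_locked_strict() is a
hypothetical helper name, and the error handling is one reasonable choice
rather than a prescribed one.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map len bytes of anonymous memory and lock them with an explicit mlock(),
 * so that a failure to lock or populate the range is reported to the caller.
 * Returns NULL with errno set on failure.
 */
void *map_locked_strict(size_t len)
{
    assert(len > 0);

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    if (mlock(p, len) != 0) {
        int saved = errno;   /* munmap() may overwrite errno */
        munmap(p, len);
        errno = saved;
        return NULL;
    }
    return p;
}
```

Whether mlock() succeeds also depends on the caller's RLIMIT_MEMLOCK limit and
privileges, so callers should be prepared for a NULL return even for small
ranges.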
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 479) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 480) Furthermore, any mmap() call or brk() call that expands the heap by a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 481) task that has previously called mlockall() with the MCL_FUTURE flag will result
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 482) in the newly mapped memory being mlocked.  Before the unevictable/mlock
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 483) changes, the kernel simply called make_pages_present() to allocate pages and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 484) populate the page table.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 485) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 486) To mlock a range of memory under the unevictable/mlock infrastructure, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 487) mmap() handler and task address space expansion functions call
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 488) populate_vma_page_range() specifying the vma and the address range to mlock.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 489) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 490) The callers of populate_vma_page_range() will have already added the memory range
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 491) to be mlocked to the task's "locked_vm".  To account for filtered VMAs,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 492) populate_vma_page_range() returns the number of pages NOT mlocked.  All of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 493) callers then subtract a non-negative return value from the task's locked_vm.  A
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 494) negative return value represents an error - for example, from get_user_pages()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 495) attempting to fault in a VMA with PROT_NONE access.  In this case, we leave the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 496) memory range accounted as locked_vm, as the protections could be changed later
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 497) and pages allocated into that region.
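
The accounting rule above amounts to a few lines of arithmetic.  The helper
below is a hypothetical userspace sketch of it, not a kernel function:
account_locked_vm() and its parameter names are invented here, and
populate_ret stands for the value the text says populate_vma_page_range()
returns.

```c
#include <assert.h>

/*
 * Return the task's locked_vm after attempting to mlock nr_requested pages.
 * populate_ret >= 0 is the number of pages NOT mlocked (filtered VMAs);
 * populate_ret < 0 is an error code.
 */
long account_locked_vm(long locked_vm, long nr_requested, long populate_ret)
{
    assert(nr_requested >= 0);

    /* The caller charges the whole range before populating it. */
    locked_vm += nr_requested;

    /* A non-negative result refunds the pages that were not mlocked. */
    if (populate_ret >= 0)
        locked_vm -= populate_ret;

    /* On error the range stays charged: the protections could change
     * later and pages could then be allocated into the region. */
    return locked_vm;
}
```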
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 498) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 499) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 500) munmap()/exit()/exec() System Call Handling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 501) -------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 502) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 503) When unmapping an mlocked region of memory, whether by an explicit call to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 504) munmap() or via an internal unmap from exit() or exec() processing, we must
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 505) munlock the pages if we're removing the last VM_LOCKED VMA that maps the pages.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 506) Before the unevictable/mlock changes, mlocking did not mark the pages in any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 507) way, so unmapping them required no processing.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 508) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 509) To munlock a range of memory under the unevictable/mlock infrastructure, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 510) munmap() handler and task address space teardown functions call
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 511) munlock_vma_pages_all().  The name reflects the observation that one always
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 512) specifies the entire VMA range when munlock()ing during unmap of a region.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 513) Because of the VMA filtering when mlocking() regions, only "normal" VMAs that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 514) actually contain mlocked pages will be passed to munlock_vma_pages_all().
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 515) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 516) munlock_vma_pages_all() clears the VM_LOCKED VMA flag and, like mlock_fixup()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 517) for the munlock case, calls __munlock_vma_pages_range() to walk the page table
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 518) for the VMA's memory range and munlock_vma_page() each resident page mapped by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 519) the VMA.  This effectively munlocks the page only if this is the last
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 520) VM_LOCKED VMA that maps the page.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 521) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 522) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 523) try_to_unmap()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 524) --------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 525) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 526) Pages can, of course, be mapped into multiple VMAs.  Some of these VMAs may
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 527) have the VM_LOCKED flag set.  It is possible for a page mapped into one or more
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 528) VM_LOCKED VMAs not to have the PG_mlocked flag set and therefore reside on one
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 529) of the active or inactive LRU lists.  This could happen if, for example, a task
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 530) in the process of munlocking the page could not isolate the page from the LRU.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 531) As a result, vmscan/shrink_page_list() might encounter such a page as described
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 532) in section "vmscan's handling of unevictable pages".  To handle this situation,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 533) try_to_unmap() checks for VM_LOCKED VMAs while it is walking a page's reverse
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 534) map.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 535) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 536) try_to_unmap() is always called, either by vmscan for reclaim or by page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 537) migration, with the argument page locked and isolated from the LRU.  Separate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 538) functions handle anonymous, mapped file, and KSM pages, as these types of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 539) pages have different reverse map lookup mechanisms, with different locking.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 540) In each case, whether rmap_walk_anon() or rmap_walk_file() or rmap_walk_ksm(),
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 541) it will call try_to_unmap_one() for every VMA which might contain the page.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 542) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 543) When trying to reclaim, if try_to_unmap_one() finds the page in a VM_LOCKED
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 544) VMA, it will then mlock the page via mlock_vma_page() instead of unmapping it,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 545) and return SWAP_MLOCK to indicate that the page is unevictable; the scan
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 546) stops there.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 547) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 548) mlock_vma_page() is called while holding the page table's lock (in addition
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 549) to the page lock and the rmap lock) to serialize against concurrent mlock or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 550) munlock or munmap system calls, mm teardown (munlock_vma_pages_all), reclaim,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 551) holepunching, and truncation of file pages and their anonymous COWed pages.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 552) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 553) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 554) try_to_munlock() Reverse Map Scan
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 555) ---------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 556) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 557) .. warning::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 558)    [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 559)    page_referenced() reverse map walker.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 560) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 561) When munlock_vma_page() [see section :ref:`munlock()/munlockall() System Call
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 562) Handling <munlock_munlockall_handling>` above] tries to munlock a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 563) page, it needs to determine whether or not the page is mapped by any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 564) VM_LOCKED VMA without actually attempting to unmap all PTEs from the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 565) page.  For this purpose, the unevictable/mlock infrastructure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 566) introduced a variant of try_to_unmap() called try_to_munlock().
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 567) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 568) try_to_munlock() calls the same functions as try_to_unmap() for anonymous,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 569) mapped file, and KSM pages, with a flag argument specifying unlock versus unmap
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 570) processing.  Again, these functions walk the respective reverse maps looking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 571) for VM_LOCKED VMAs.  When such a VMA is found, as in the try_to_unmap() case,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 572) the functions mlock the page via mlock_vma_page() and return SWAP_MLOCK.  This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 573) undoes the pre-clearing of the page's PG_mlocked done by munlock_vma_page().
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 574) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 575) Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 576) reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 577) However, the scan can terminate when it encounters a VM_LOCKED VMA.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 578) Although try_to_munlock() might be called a great many times when munlocking a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 579) large region or tearing down a large address space that has been mlocked via
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 580) mlockall(), overall this is a fairly rare event.
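
The cost asymmetry described above, a full walk to prove that no VM_LOCKED VMA
maps the page versus an early exit as soon as one is found, can be shown with
a simplified model.  The reverse map is represented here as a plain array of
VMA flag words; struct walk_result, munlock_walk_model() and the VM_LOCKED bit
value are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define VM_LOCKED 0x1u /* hypothetical bit value, for illustration only */

struct walk_result {
    bool found_locked;   /* some VMA mapping the page is VM_LOCKED */
    size_t vmas_visited; /* how many VMAs the walk had to look at */
};

/* Walk the modelled reverse map, stopping at the first VM_LOCKED VMA. */
struct walk_result munlock_walk_model(const unsigned int *vma_flags, size_t n)
{
    struct walk_result r = { false, 0 };

    assert(vma_flags != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        r.vmas_visited++;
        if (vma_flags[i] & VM_LOCKED) {
            r.found_locked = true; /* page must stay mlocked */
            break;                 /* no need to look any further */
        }
    }
    return r;
}
```

In this model, a page mapped only by unlocked VMAs always costs a visit to
every entry, which is the case the paragraph above identifies as the expensive
one.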
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 581) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 582) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 583) Page Reclaim in shrink_*_list()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 584) -------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 585) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 586) shrink_active_list() culls any obviously unevictable pages - i.e.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 587) !page_evictable(page) - diverting these to the unevictable list.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 588) However, shrink_active_list() only sees unevictable pages that made it onto the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 589) active/inactive LRU lists.  Note that these pages do not have PageUnevictable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 590) set - otherwise they would be on the unevictable list and shrink_active_list()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 591) would never see them.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 592) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 593) Some examples of these unevictable pages on the LRU lists are:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 594) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 595)  (1) ramfs pages that have been placed on the LRU lists when first allocated.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 596) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 597)  (2) SHM_LOCK'd shared memory pages.  shmctl(SHM_LOCK) does not attempt to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 598)      allocate or fault in the pages in the shared memory region.  This happens
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 599)      when an application accesses the page the first time after SHM_LOCK'ing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 600)      the segment.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 601) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 602)  (3) mlocked pages that could not be isolated from the LRU and moved to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 603)      unevictable list in mlock_vma_page().
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 604) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 605) shrink_inactive_list() also diverts any unevictable pages that it finds on the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 606) inactive lists to the appropriate zone's unevictable list.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 607) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 608) shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 609) after shrink_active_list() had moved them to the inactive list, or pages mapped
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 610) into VM_LOCKED VMAs that munlock_vma_page() couldn't isolate from the LRU to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 611) recheck via try_to_munlock().  shrink_inactive_list() won't notice the latter,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 612) but will pass them on to shrink_page_list().
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 613) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 614) shrink_page_list() again culls obviously unevictable pages that it could
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 615) encounter for reasons similar to shrink_inactive_list().  Pages mapped into
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 616) VM_LOCKED VMAs but without PG_mlocked set will make it all the way to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 617) try_to_unmap().  shrink_page_list() will divert them to the unevictable list
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 618) when try_to_unmap() returns SWAP_MLOCK, as discussed above.