===============================
Documentation for /proc/sys/vm/
===============================

kernel version 2.6.29

Copyright (c) 1998, 1999, Rik van Riel <riel@nl.linux.org>

Copyright (c) 2008 Peter W. Morreale <pmorreale@novell.com>

For general info and legal blurb, please look in index.rst.

------------------------------------------------------------------------------

This file contains the documentation for the sysctl files in
/proc/sys/vm and is valid for Linux kernel version 2.6.29.

The files in this directory can be used to tune the operation
of the virtual memory (VM) subsystem of the Linux kernel and
the writeout of dirty data to disk.

Default values and initialization routines for most of these
files can be found in mm/swap.c.

Currently, these files are in /proc/sys/vm:

- admin_reserve_kbytes
- block_dump
- compact_memory
- compaction_proactiveness
- compact_unevictable_allowed
- dirty_background_bytes
- dirty_background_ratio
- dirty_bytes
- dirty_expire_centisecs
- dirty_ratio
- dirtytime_expire_seconds
- dirty_writeback_centisecs
- drop_caches
- extfrag_threshold
- extra_free_kbytes
- highmem_is_dirtyable
- hugetlb_shm_group
- laptop_mode
- legacy_va_layout
- lowmem_reserve_ratio
- max_map_count
- memory_failure_early_kill
- memory_failure_recovery
- min_free_kbytes
- min_slab_ratio
- min_unmapped_ratio
- mmap_min_addr
- mmap_rnd_bits
- mmap_rnd_compat_bits
- nr_hugepages
- nr_hugepages_mempolicy
- nr_overcommit_hugepages
- nr_trim_pages (only if CONFIG_MMU=n)
- numa_zonelist_order
- oom_dump_tasks
- oom_kill_allocating_task
- overcommit_kbytes
- overcommit_memory
- overcommit_ratio
- page-cluster
- panic_on_oom
- percpu_pagelist_fraction
- stat_interval
- stat_refresh
- numa_stat
- swappiness
- unprivileged_userfaultfd
- user_reserve_kbytes
- vfs_cache_pressure
- watermark_boost_factor
- watermark_scale_factor
- zone_reclaim_mode


admin_reserve_kbytes
====================

The amount of free memory in the system that should be reserved for users
with the capability cap_sys_admin.

admin_reserve_kbytes defaults to min(3% of free pages, 8MB)

That should provide enough for the admin to log in and kill a process,
if necessary, under the default overcommit 'guess' mode.

Systems running under overcommit 'never' should increase this to account
for the full Virtual Memory Size of programs used to recover. Otherwise,
root may not be able to log in to recover the system.

How do you calculate a minimum useful reserve?

sshd or login + bash (or some other shell) + top (or ps, kill, etc.)

For overcommit 'guess', we can sum resident set sizes (RSS).
On x86_64 this is about 8MB.

For overcommit 'never', we can take the max of their virtual sizes (VSZ)
and add the sum of their RSS.
On x86_64 this is about 128MB.

Changing this takes effect whenever an application requests memory.
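
For overcommit 'guess' mode, one way to estimate a value is to sum the
resident set sizes of the recovery programs named above and write the
result (in kilobytes) to the sysctl. The sketch below is illustrative
only; the process names and the final figure depend on your system::

    # Sum the RSS (in KiB) of the tools an admin needs to log in and
    # kill a runaway process.
    ps -o rss= -C sshd,bash,top | awk '{sum += $1} END {print sum}'

    # Write the chosen reserve, e.g. 8MB expressed in kilobytes.
    echo 8192 > /proc/sys/vm/admin_reserve_kbytes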


block_dump
==========

block_dump enables block I/O debugging when set to a nonzero value. More
information on block I/O debugging is in Documentation/admin-guide/laptops/laptop-mode.rst.


compact_memory
==============

Available only when CONFIG_COMPACTION is set. When 1 is written to the file,
all zones are compacted such that free memory is available in contiguous
blocks where possible. This can be important, for example, in the allocation
of huge pages, although processes will also directly compact memory as
required.
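
For example, a one-off compaction of all zones can be triggered and the
result inspected roughly via /proc/buddyinfo, which lists free blocks per
order for each zone::

    echo 1 > /proc/sys/vm/compact_memory
    cat /proc/buddyinfo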

compaction_proactiveness
========================

This tunable takes a value in the range [0, 100] with a default value of
20. This tunable determines how aggressively compaction is done in the
background. Writing a non-zero value to this tunable will immediately
trigger proactive compaction. Setting it to 0 disables proactive compaction.

Note that compaction has a non-trivial system-wide impact as pages
belonging to different processes are moved around, which could also lead
to latency spikes in unsuspecting applications. The kernel employs
various heuristics to avoid wasting CPU cycles if it detects that
proactive compaction is not being effective.

Be careful when setting it to extreme values like 100, as that may
cause excessive background compaction activity.
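
For example (the value 30 below is purely illustrative)::

    echo 30 > /proc/sys/vm/compaction_proactiveness   # more aggressive than the default 20
    echo 0 > /proc/sys/vm/compaction_proactiveness    # disable proactive compaction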

compact_unevictable_allowed
===========================

Available only when CONFIG_COMPACTION is set. When set to 1, compaction is
allowed to examine the unevictable lru (mlocked pages) for pages to compact.
This should be used on systems where stalls for minor page faults are an
acceptable trade for large contiguous free memory. Set to 0 to prevent
compaction from moving pages that are unevictable. Default value is 1.
On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due
to compaction, which would block the task from becoming active until the fault
is resolved.


dirty_background_bytes
======================

Contains the amount of dirty memory at which the background kernel
flusher threads will start writeback.

Note:
  dirty_background_bytes is the counterpart of dirty_background_ratio. Only
  one of them may be specified at a time. When one sysctl is written it is
  immediately taken into account to evaluate the dirty memory limits and the
  other appears as 0 when read.
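
The mutual exclusion can be observed directly, for example::

    echo 104857600 > /proc/sys/vm/dirty_background_bytes   # 100MB
    cat /proc/sys/vm/dirty_background_ratio                # now reads 0
    echo 10 > /proc/sys/vm/dirty_background_ratio
    cat /proc/sys/vm/dirty_background_bytes                # now reads 0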


dirty_background_ratio
======================

Contains, as a percentage of total available memory that contains free pages
and reclaimable pages, the number of pages at which the background kernel
flusher threads will start writing out dirty data.

The total available memory is not equal to total system memory.


dirty_bytes
===========

Contains the amount of dirty memory at which a process generating disk writes
will itself start writeback.

Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be
specified at a time. When one sysctl is written it is immediately taken into
account to evaluate the dirty memory limits and the other appears as 0 when
read.

Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any
value lower than this limit will be ignored and the old configuration will be
retained.
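
For example, the lower bound can be derived from the page size reported by
the system (4096 bytes on most architectures, so two pages is 8192 bytes);
the write below merely illustrates the limit and is not a recommended
setting::

    getconf PAGESIZE                        # e.g. 4096
    echo 8192 > /proc/sys/vm/dirty_bytes    # two 4KB pages; smaller values are ignored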


dirty_expire_centisecs
======================

This tunable is used to define when dirty data is old enough to be eligible
for writeout by the kernel flusher threads. It is expressed in 100'ths
of a second. Data which has been dirty in-memory for longer than this
interval will be written out next time a flusher thread wakes up.


dirty_ratio
===========

Contains, as a percentage of total available memory that contains free pages
and reclaimable pages, the number of pages at which a process which is
generating disk writes will itself start writing out dirty data.

The total available memory is not equal to total system memory.


dirtytime_expire_seconds
========================

When a lazytime inode is constantly having its pages dirtied, the inode with
an updated timestamp will never get a chance to be written out. And, if the
only thing that has happened on the file system is a dirtytime inode caused
by an atime update, a worker will be scheduled to make sure that inode
eventually gets pushed out to disk. This tunable is used to define when a
dirty inode is old enough to be eligible for writeback by the kernel flusher
threads. It is also used as the interval at which the dirtytime_writeback
thread wakes up.


dirty_writeback_centisecs
=========================

The kernel flusher threads will periodically wake up and write `old` data
out to disk. This tunable expresses the interval between those wakeups, in
100'ths of a second.

Setting this to zero disables periodic writeback altogether.


drop_caches
===========

Writing to this will cause the kernel to drop clean caches, as well as
reclaimable slab objects like dentries and inodes. Once dropped, their
memory becomes free.

To free pagecache::

    echo 1 > /proc/sys/vm/drop_caches

To free reclaimable slab objects (includes dentries and inodes)::

    echo 2 > /proc/sys/vm/drop_caches

To free slab objects and pagecache::

    echo 3 > /proc/sys/vm/drop_caches

This is a non-destructive operation and will not free any dirty objects.
To increase the number of objects freed by this operation, the user may run
`sync` prior to writing to /proc/sys/vm/drop_caches. This will minimize the
number of dirty objects on the system and create more candidates to be
dropped.
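
A typical invocation for testing or benchmarking therefore combines the two
steps::

    sync
    echo 3 > /proc/sys/vm/drop_caches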

This file is not a means to control the growth of the various kernel caches
(inodes, dentries, pagecache, etc...) These objects are automatically
reclaimed by the kernel when memory is needed elsewhere on the system.

Use of this file can cause performance problems. Since it discards cached
objects, it may cost a significant amount of I/O and CPU to recreate the
dropped objects, especially if they were under heavy use. Because of this,
use outside of a testing or debugging environment is not recommended.

You may see informational messages in your kernel log when this file is
used::

    cat (1234): drop_caches: 3

These are informational only. They do not mean that anything is wrong
with your system. To disable them, echo 4 (bit 2) into drop_caches.


extfrag_threshold
=================

This parameter affects whether the kernel will compact memory or direct
reclaim to satisfy a high-order allocation. The extfrag/extfrag_index file in
debugfs shows what the fragmentation index for each order is in each zone in
the system. Values tending towards 0 imply allocations would fail due to lack
of memory, values towards 1000 imply failures are due to fragmentation and -1
implies that the allocation will succeed as long as watermarks are met.

The kernel will not compact memory in a zone if the
fragmentation index is <= extfrag_threshold. The default value is 500.


highmem_is_dirtyable
====================

Available only for systems with CONFIG_HIGHMEM enabled (32-bit systems).

This parameter controls whether the high memory is considered for dirty
writers throttling. This is not the case by default, which means that
only the amount of memory directly visible/usable by the kernel can
be dirtied. As a result, on systems with a large amount of memory and
lowmem basically depleted, writers might be throttled too early and
streaming writes can get very slow.

Changing the value to non-zero would allow more memory to be dirtied
and thus allow writers to write more data which can be flushed to the
storage more effectively. Note this also comes with a risk of premature
OOM-killer invocation because some writers (e.g. direct block device writes)
can only use the low memory and they can fill it up with dirty data without
any throttling.


extra_free_kbytes
=================

This parameter tells the VM to keep extra free memory between the threshold
where background reclaim (kswapd) kicks in, and the threshold where direct
reclaim (by allocating processes) kicks in.

This is useful for workloads that require low latency memory allocations
and have a bounded burstiness in memory allocations. For example, a
realtime application that receives and transmits network traffic
(causing in-kernel memory allocations) with a maximum total message burst
size of 200MB may need 200MB of extra free memory to avoid direct reclaim
related latencies.


hugetlb_shm_group
=================

hugetlb_shm_group contains the group id that is allowed to create SysV
shared memory segments using hugetlb pages.


laptop_mode
===========

laptop_mode is a knob that controls "laptop mode". All the things that are
controlled by this knob are discussed in Documentation/admin-guide/laptops/laptop-mode.rst.


legacy_va_layout
================

If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel
will use the legacy (2.4) layout for all processes.


lowmem_reserve_ratio
====================

For some specialised workloads on highmem machines it is dangerous for
the kernel to allow process memory to be allocated from the "lowmem"
zone. This is because that memory could then be pinned via the mlock()
system call, or by unavailability of swapspace.

And on large highmem machines this lack of reclaimable lowmem memory
can be fatal.

So the Linux page allocator has a mechanism which prevents allocations
which *could* use highmem from using too much lowmem. This means that
a certain amount of lowmem is defended from the possibility of being
captured into pinned user memory.

(The same argument applies to the old 16 megabyte ISA DMA region. This
mechanism will also defend that region from allocations which could use
highmem or lowmem).

The `lowmem_reserve_ratio` tunable determines how aggressive the kernel is
in defending these lower zones.

If you have a machine which uses highmem or ISA DMA and your
applications are using mlock(), or if you are running with no swap, then
you probably should change the lowmem_reserve_ratio setting.

lowmem_reserve_ratio is an array. You can see its values by reading this
file::

    % cat /proc/sys/vm/lowmem_reserve_ratio
    256     256     32

These values are not used directly. The kernel calculates the number of
protection pages for each zone from them, and these are shown as the array
of protection pages in /proc/zoneinfo (the output below is an example from
an x86-64 box). Each zone has an array of protection pages like this::

    Node 0, zone      DMA
      pages free     1355
            min      3
            low      3
            high     4
        :
        :
        numa_other   0
            protection: (0, 2004, 2004, 2004)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      pagesets
        cpu: 0 pcp: 0
        :

These protections are added to the watermark when judging whether this zone
should be used for page allocation or whether it should be reclaimed.

In this example, if normal pages (index=2) are requested from this DMA zone
and watermark[WMARK_HIGH] is used as the watermark, the kernel judges that
this zone should not be used because pages_free (1355) is smaller than
watermark + protection[2] (4 + 2004 = 2008). If this protection value is 0,
this zone would be used for a normal page request. If the request is for the
DMA zone (index=0), protection[0] (=0) is used.

zone[i]'s protection[j] is calculated by the following expression::

    (i < j):
      zone[i]->protection[j]
      = (total sums of managed_pages from zone[i+1] to zone[j] on the node)
        / lowmem_reserve_ratio[i];
    (i = j):
      0;  (a zone does not need to be protected from itself)
    (i > j):
      0;  (not necessary, but reported as 0)

The default values of lowmem_reserve_ratio[i] are

    === ====================================
    256 (if zone[i] means DMA or DMA32 zone)
    32  (others)
    === ====================================

As the expression above shows, these values are the reciprocals of the
ratios. A value of 256 means 1/256, so the number of protection pages becomes
about 0.39% of the total managed pages of the higher zones on the node.

If you would like to protect more pages, smaller values are effective.
The minimum value is 1 (1/1 -> 100%). A value less than 1 completely
disables protection of the pages.
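
For example, to protect roughly 1% of the higher zones' pages instead of the
default ~0.39%, the DMA/DMA32 entries can be lowered to 100; the values below
are illustrative and must match the number of entries your kernel reports::

    cat /proc/sys/vm/lowmem_reserve_ratio            # e.g. 256     256     32
    echo "100 100 32" > /proc/sys/vm/lowmem_reserve_ratio
    grep protection /proc/zoneinfo                   # recalculated protection arrays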


max_map_count
=============

This file contains the maximum number of memory map areas a process
may have. Memory map areas are used as a side-effect of calling
malloc, directly by mmap, mprotect, and madvise, and also when loading
shared libraries.

While most applications need less than a thousand maps, certain
programs, particularly malloc debuggers, may consume lots of them,
e.g., up to one or two maps per allocation.

The default value is 65536.
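
For example, the number of map areas a process currently uses can be counted
from its /proc maps file (one line per mapping), and the limit raised if an
application such as a malloc debugger needs more; 262144 is only an
illustrative value::

    wc -l /proc/$$/maps                      # mappings used by the current shell
    cat /proc/sys/vm/max_map_count           # current limit
    echo 262144 > /proc/sys/vm/max_map_count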


memory_failure_early_kill
=========================

Control how to kill processes when an uncorrected memory error (typically
a 2-bit error in a memory module) that cannot be handled by the kernel is
detected in the background by hardware. In some cases (like the page
still having a valid copy on disk) the kernel will handle the failure
transparently without affecting any applications. But if there is
no other up-to-date copy of the data, it will kill processes to prevent any
data corruption from propagating.

1: Kill all processes that have the corrupted and not reloadable page mapped
   as soon as the corruption is detected. Note this is not supported
   for a few types of pages, like kernel internally allocated data or
   the swap cache, but works for the majority of user pages.

0: Only unmap the corrupted page from all processes and only kill a process
   that tries to access it.

The kill is done using a catchable SIGBUS with BUS_MCEERR_AO, so processes can
handle this if they want to.

This is only active on architectures/platforms with advanced machine
check handling and depends on the hardware capabilities.

Applications can override this setting individually with the PR_MCE_KILL prctl.


memory_failure_recovery
=======================

Enable memory failure recovery (when supported by the platform).

1: Attempt recovery.

0: Always panic on a memory failure.


min_free_kbytes
===============

This is used to force the Linux VM to keep a minimum number
of kilobytes free. The VM uses this number to compute a
watermark[WMARK_MIN] value for each lowmem zone in the system.
Each lowmem zone gets a number of reserved free pages based
proportionally on its size.

Some minimal amount of memory is needed to satisfy PF_MEMALLOC
allocations; if you set this to lower than 1024KB, your system will
become subtly broken, and prone to deadlock under high loads.

Setting this too high will OOM your machine instantly.
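
For example, to check the current value before considering a change (the
value written below is purely illustrative; keep it well above 1024KB and far
below the size of RAM, per the warnings above)::

    cat /proc/sys/vm/min_free_kbytes
    echo 65536 > /proc/sys/vm/min_free_kbytes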


min_slab_ratio
==============

This is available only on NUMA kernels.

A percentage of the total pages in each zone. On zone reclaim
(when fallback from the local zone occurs), slabs will be reclaimed if more
than this percentage of pages in a zone are reclaimable slab pages.
This ensures that slab growth stays under control even in NUMA
systems that rarely perform global reclaim.

The default is 5 percent.

Note that slab reclaim is triggered in a per zone / node fashion.
The process of reclaiming slab memory is currently not node specific
and may not be fast.


min_unmapped_ratio
==================

This is available only on NUMA kernels.

This is a percentage of the total pages in each zone. Zone reclaim will
only occur if more than this percentage of pages are in a state that
zone_reclaim_mode allows to be reclaimed.

If zone_reclaim_mode has the value 4 OR'd, then the percentage is compared
against all file-backed unmapped pages including swapcache pages and tmpfs
files. Otherwise, only unmapped pages backed by normal files but not tmpfs
files and similar are considered.

The default is 1 percent.


mmap_min_addr
=============

This file indicates the amount of address space which a user process will
be restricted from mmapping. Since kernel null dereference bugs could
accidentally operate based on the information in the first couple of pages
of memory, userspace processes should not be allowed to write to them. By
default this value is set to 0 and no protections will be enforced by the
security module. Setting this value to something like 64k will allow the
vast majority of applications to work correctly and provide defense in depth
against future potential kernel bugs.
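
For example, to enforce the commonly used 64k floor mentioned above::

    echo 65536 > /proc/sys/vm/mmap_min_addr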


mmap_rnd_bits
=============

This value can be used to select the number of bits to use to
determine the random offset to the base address of vma regions
resulting from mmap allocations on architectures which support
tuning address space randomization. This value will be bounded
by the architecture's minimum and maximum supported values.

This value can be changed after boot using the
/proc/sys/vm/mmap_rnd_bits tunable.


mmap_rnd_compat_bits
====================

This value can be used to select the number of bits to use to
determine the random offset to the base address of vma regions
resulting from mmap allocations for applications run in
compatibility mode on architectures which support tuning address
space randomization. This value will be bounded by the
architecture's minimum and maximum supported values.

This value can be changed after boot using the
/proc/sys/vm/mmap_rnd_compat_bits tunable.


nr_hugepages
============

Change the minimum size of the hugepage pool.

See Documentation/admin-guide/mm/hugetlbpage.rst
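
For example, to request 128 huge pages and confirm how many the kernel was
actually able to allocate::

    echo 128 > /proc/sys/vm/nr_hugepages
    grep HugePages_Total /proc/meminfo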


nr_hugepages_mempolicy
======================

Change the size of the hugepage pool at run-time on a specific
set of NUMA nodes.

See Documentation/admin-guide/mm/hugetlbpage.rst


nr_overcommit_hugepages
=======================

Change the maximum size of the hugepage pool. The maximum is
nr_hugepages + nr_overcommit_hugepages.

See Documentation/admin-guide/mm/hugetlbpage.rst


nr_trim_pages
=============

This is available only on NOMMU kernels.

This value adjusts the excess page trimming behaviour of power-of-2 aligned
NOMMU mmap allocations.

A value of 0 disables trimming of allocations entirely, while a value of 1
trims excess pages aggressively. Any value >= 1 acts as the watermark where
trimming of allocations is initiated.

The default value is 1.

See Documentation/admin-guide/mm/nommu-mmap.rst for more information.


numa_zonelist_order
===================

This sysctl is only for NUMA and it is deprecated. Anything but
Node order will fail!

'where the memory is allocated from' is controlled by zonelists.

(This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simplicity;
you may read ZONE_DMA as ZONE_DMA32 where applicable.)

In the non-NUMA case, a zonelist for GFP_KERNEL is ordered as
ZONE_NORMAL -> ZONE_DMA.
This means that a memory allocation request for GFP_KERNEL will
get memory from ZONE_DMA only when ZONE_NORMAL is not available.

In the NUMA case, you can think of the following two types of order.
Assume a 2-node NUMA system; below is the zonelist of Node(0)'s GFP_KERNEL::

  (A) Node(0) ZONE_NORMAL -> Node(0) ZONE_DMA -> Node(1) ZONE_NORMAL
  (B) Node(0) ZONE_NORMAL -> Node(1) ZONE_NORMAL -> Node(0) ZONE_DMA.

Type(A) offers the best locality for processes on Node(0), but ZONE_DMA
will be used before ZONE_NORMAL is exhausted. This increases the possibility
of out-of-memory (OOM) in ZONE_DMA because ZONE_DMA tends to be small.

Type(B) cannot offer the best locality but is more robust against OOM of
the DMA zone.

Type(A) is called "Node" order. Type(B) is "Zone" order.

"Node order" orders the zonelists by node, then by zone within each node.
Specify "[Nn]ode" for node order.

"Zone order" orders the zonelists by zone type, then by node within each
zone. Specify "[Zz]one" for zone order.

Specify "[Dd]efault" to request automatic configuration.

On 32-bit, the Normal zone needs to be preserved for allocations accessible
by the kernel, so "zone" order will be selected.

On 64-bit, devices that require DMA32/DMA are relatively rare, so "node"
order will be selected.

Default order is recommended unless this is causing problems for your
system/application.


oom_dump_tasks
==============

Enables a system-wide task dump (excluding kernel threads) to be produced
when the kernel performs an OOM-killing and includes such information as
pid, uid, tgid, vm size, rss, pgtables_bytes, swapents, oom_score_adj
score, and name. This is helpful to determine why the OOM killer was
invoked, to identify the rogue task that caused it, and to determine why
the OOM killer chose the task it did to kill.

If this is set to zero, this information is suppressed. On very
large systems with thousands of tasks it may not be feasible to dump
the memory state information for each one. Such systems should not
be forced to incur a performance penalty in OOM conditions when the
information may not be desired.

If this is set to non-zero, this information is shown whenever the
OOM killer actually kills a memory-hogging task.

The default value is 1 (enabled).


oom_kill_allocating_task
========================

This enables or disables killing the OOM-triggering task in
out-of-memory situations.

If this is set to zero, the OOM killer will scan through the entire
tasklist and select a task based on heuristics to kill. This normally
selects a rogue memory-hogging task that frees up a large amount of
memory when killed.

If this is set to non-zero, the OOM killer simply kills the task that
triggered the out-of-memory condition. This avoids the expensive
tasklist scan.

If panic_on_oom is selected, it takes precedence over whatever value
is used in oom_kill_allocating_task.

The default value is 0.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 713)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 714)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 715) overcommit_kbytes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 716) =================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 717)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 718) When overcommit_memory is set to 2, the committed address space is not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 719) permitted to exceed swap plus this amount of physical RAM. See below.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 720)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 721) Note: overcommit_kbytes is the counterpart of overcommit_ratio. Only one
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 722) of them may be specified at a time. Setting one disables the other (which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 723) then appears as 0 when read).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 724)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 725)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 726) overcommit_memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 727) =================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 728)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 729) This value contains a flag that enables memory overcommitment.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 730)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 731) When this flag is 0, the kernel attempts to estimate the amount
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 732) of free memory left when userspace requests more memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 733)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 734) When this flag is 1, the kernel pretends there is always enough
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 735) memory until it actually runs out.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 736)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 737) When this flag is 2, the kernel uses a "never overcommit"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 738) policy that attempts to prevent any overcommit of memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 739) Note that user_reserve_kbytes affects this policy.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 740)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 741) This feature can be very useful because there are a lot of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 742) programs that malloc() huge amounts of memory "just-in-case"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 743) and don't use much of it.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 744)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 745) The default value is 0.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 746)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 747) See Documentation/vm/overcommit-accounting.rst and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 748) mm/util.c::__vm_enough_memory() for more information.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 749)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 750)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 751) overcommit_ratio
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 752) ================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 753)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 754) When overcommit_memory is set to 2, the committed address
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 755) space is not permitted to exceed swap plus this percentage
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 756) of physical RAM. See above.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 757)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 758)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 759) page-cluster
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 760) ============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 761)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 762) page-cluster controls the number of pages up to which consecutive pages
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 763) are read in from swap in a single attempt. This is the swap counterpart
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 764) to page cache readahead.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 765) The mentioned consecutivity is not in terms of virtual/physical addresses,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 766) but consecutive on swap space - that means they were swapped out together.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 767)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 768) It is a logarithmic value - setting it to zero means "1 page", setting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 769) it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 770) Zero disables swap readahead completely.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 771)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 772) The default value is three (eight pages at a time). There may be some
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 773) small benefits in tuning this to a different value if your workload is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 774) swap-intensive.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 775)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 776) Lower values mean lower latencies for initial faults, but at the same time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 777) extra faults and I/O delays for following faults if they would have been part of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 778) that consecutive pages readahead would have brought in.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 779)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 780)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 781) panic_on_oom
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 782) ============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 783)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 784) This enables or disables panic on out-of-memory feature.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 785)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 786) If this is set to 0, the kernel will kill some rogue process,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 787) called oom_killer. Usually, oom_killer can kill rogue processes and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 788) system will survive.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 789)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 790) If this is set to 1, the kernel panics when out-of-memory happens.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 791) However, if a process limits using nodes by mempolicy/cpusets,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 792) and those nodes become memory exhaustion status, one process
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 793) may be killed by oom-killer. No panic occurs in this case.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 794) Because other nodes' memory may be free. This means system total status
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 795) may be not fatal yet.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 796)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 797) If this is set to 2, the kernel panics compulsorily even on the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 798) above-mentioned. Even oom happens under memory cgroup, the whole
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 799) system panics.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 800)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 801) The default value is 0.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 802)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 803) 1 and 2 are for failover of clustering. Please select either
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 804) according to your policy of failover.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 805)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 806) panic_on_oom=2+kdump gives you very strong tool to investigate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 807) why oom happens. You can get snapshot.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 808)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 809)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 810) percpu_pagelist_fraction
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 811) ========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 812)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 813) This is the fraction of pages at most (high mark pcp->high) in each zone that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 814) are allocated for each per cpu page list. The min value for this is 8. It
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 815) means that we don't allow more than 1/8th of pages in each zone to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 816) allocated in any single per_cpu_pagelist. This entry only changes the value
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 817) of hot per cpu pagelists. User can specify a number like 100 to allocate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 818) 1/100th of each zone to each per cpu page list.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 819)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 820) The batch value of each per cpu pagelist is also updated as a result. It is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 821) set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 822)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 823) The initial value is zero. Kernel does not use this value at boot time to set
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 824) the high water marks for each per cpu page list. If the user writes '0' to this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 825) sysctl, it will revert to this default behavior.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 826)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 827)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 828) stat_interval
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 829) =============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 830)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 831) The time interval between which vm statistics are updated. The default
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 832) is 1 second.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 833)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 834)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 835) stat_refresh
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 836) ============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 837)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 838) Any read or write (by root only) flushes all the per-cpu vm statistics
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 839) into their global totals, for more accurate reports when testing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 840) e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 841)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 842) As a side-effect, it also checks for negative totals (elsewhere reported
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 843) as 0) and "fails" with EINVAL if any are found, with a warning in dmesg.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 844) (At time of writing, a few stats are known sometimes to be found negative,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 845) with no ill effects: errors and warnings on these stats are suppressed.)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 846)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 847)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 848) numa_stat
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 849) =========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 850)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 851) This interface allows runtime configuration of numa statistics.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 852)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 853) When page allocation performance becomes a bottleneck and you can tolerate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 854) some possible tool breakage and decreased numa counter precision, you can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 855) do::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 856)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 857) echo 0 > /proc/sys/vm/numa_stat
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 858)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 859) When page allocation performance is not a bottleneck and you want all
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 860) tooling to work, you can do::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 861)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 862) echo 1 > /proc/sys/vm/numa_stat
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 863)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 864)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 865) swappiness
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 866) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 867)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 868) This control is used to define the rough relative IO cost of swapping
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 869) and filesystem paging, as a value between 0 and 200. At 100, the VM
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 870) assumes equal IO cost and will thus apply memory pressure to the page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 871) cache and swap-backed pages equally; lower values signify more
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 872) expensive swap IO, higher values indicates cheaper.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 873)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 874) Keep in mind that filesystem IO patterns under memory pressure tend to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 875) be more efficient than swap's random IO. An optimal value will require
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 876) experimentation and will also be workload-dependent.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 877)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 878) The default value is 60.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 879)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 880) For in-memory swap, like zram or zswap, as well as hybrid setups that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 881) have swap on faster devices than the filesystem, values beyond 100 can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 882) be considered. For example, if the random IO against the swap device
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 883) is on average 2x faster than IO from the filesystem, swappiness should
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 884) be 133 (x + 2x = 200, 2x = 133.33).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 885)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 886) At 0, the kernel will not initiate swap until the amount of free and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 887) file-backed pages is less than the high watermark in a zone.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 888)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 889)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 890) unprivileged_userfaultfd
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 891) ========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 892)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 893) This flag controls the mode in which unprivileged users can use the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 894) userfaultfd system calls. Set this to 0 to restrict unprivileged users
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 895) to handle page faults in user mode only. In this case, users without
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 896) SYS_CAP_PTRACE must pass UFFD_USER_MODE_ONLY in order for userfaultfd to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 897) succeed. Prohibiting use of userfaultfd for handling faults from kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 898) mode may make certain vulnerabilities more difficult to exploit.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 899)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 900) Set this to 1 to allow unprivileged users to use the userfaultfd system
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 901) calls without any restrictions.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 902)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 903) The default value is 0.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 904)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 905)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 906) user_reserve_kbytes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 907) ===================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 908)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 909) When overcommit_memory is set to 2, "never overcommit" mode, reserve
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 910) min(3% of current process size, user_reserve_kbytes) of free memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 911) This is intended to prevent a user from starting a single memory hogging
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 912) process, such that they cannot recover (kill the hog).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 913)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 914) user_reserve_kbytes defaults to min(3% of the current process size, 128MB).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 915)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 916) If this is reduced to zero, then the user will be allowed to allocate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 917) all free memory with a single process, minus admin_reserve_kbytes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 918) Any subsequent attempts to execute a command will result in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 919) "fork: Cannot allocate memory".
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 920)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 921) Changing this takes effect whenever an application requests memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 922)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 923)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 924) vfs_cache_pressure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 925) ==================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 926)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 927) This percentage value controls the tendency of the kernel to reclaim
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 928) the memory which is used for caching of directory and inode objects.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 929)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 930) At the default value of vfs_cache_pressure=100 the kernel will attempt to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 931) reclaim dentries and inodes at a "fair" rate with respect to pagecache and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 932) swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 933) to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 934) never reclaim dentries and inodes due to memory pressure and this can easily
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 935) lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 936) causes the kernel to prefer to reclaim dentries and inodes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 937)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 938) Increasing vfs_cache_pressure significantly beyond 100 may have negative
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 939) performance impact. Reclaim code needs to take various locks to find freeable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 940) directory and inode objects. With vfs_cache_pressure=1000, it will look for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 941) ten times more freeable objects than there are.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 942)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 943)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 944) watermark_boost_factor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 945) ======================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 946)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 947) This factor controls the level of reclaim when memory is being fragmented.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 948) It defines the percentage of the high watermark of a zone that will be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 949) reclaimed if pages of different mobility are being mixed within pageblocks.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 950) The intent is that compaction has less work to do in the future and to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 951) increase the success rate of future high-order allocations such as SLUB
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 952) allocations, THP and hugetlbfs pages.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 953)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 954) To make it sensible with respect to the watermark_scale_factor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 955) parameter, the unit is in fractions of 10,000. The default value of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 956) 15,000 on !DISCONTIGMEM configurations means that up to 150% of the high
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 957) watermark will be reclaimed in the event of a pageblock being mixed due
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 958) to fragmentation. The level of reclaim is determined by the number of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 959) fragmentation events that occurred in the recent past. If this value is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 960) smaller than a pageblock then a pageblocks worth of pages will be reclaimed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 961) (e.g. 2MB on 64-bit x86). A boost factor of 0 will disable the feature.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 962)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 963)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 964) watermark_scale_factor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 965) ======================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 966)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 967) This factor controls the aggressiveness of kswapd. It defines the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 968) amount of memory left in a node/system before kswapd is woken up and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 969) how much memory needs to be free before kswapd goes back to sleep.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 970)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 971) The unit is in fractions of 10,000. The default value of 10 means the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 972) distances between watermarks are 0.1% of the available memory in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 973) node/system. The maximum value is 1000, or 10% of memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 974)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 975) A high rate of threads entering direct reclaim (allocstall) or kswapd
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 976) going to sleep prematurely (kswapd_low_wmark_hit_quickly) can indicate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 977) that the number of free pages kswapd maintains for latency reasons is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 978) too small for the allocation bursts occurring in the system. This knob
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 979) can then be used to tune kswapd aggressiveness accordingly.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 980)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 981)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 982) zone_reclaim_mode
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 983) =================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 984)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 985) Zone_reclaim_mode allows someone to set more or less aggressive approaches to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 986) reclaim memory when a zone runs out of memory. If it is set to zero then no
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 987) zone reclaim occurs. Allocations will be satisfied from other zones / nodes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 988) in the system.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 989)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 990) This is value OR'ed together of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 991)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 992) = ===================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 993) 1 Zone reclaim on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 994) 2 Zone reclaim writes dirty pages out
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 995) 4 Zone reclaim swaps pages
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 996) = ===================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 997)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 998) zone_reclaim_mode is disabled by default. For file servers or workloads
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 999) that benefit from having their data cached, zone_reclaim_mode should be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1000) left disabled as the caching effect is likely to be more important than
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1001) data locality.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1002)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1003) Consider enabling one or more zone_reclaim mode bits if it's known that the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1004) workload is partitioned such that each partition fits within a NUMA node
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1005) and that accessing remote memory would cause a measurable performance
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1006) reduction. The page allocator will take additional actions before
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1007) allocating off node pages.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1008)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1009) Allowing zone reclaim to write out pages stops processes that are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1010) writing large amounts of data from dirtying pages on other nodes. Zone
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1011) reclaim will write out dirty pages if a zone fills up and so effectively
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1012) throttle the process. This may decrease the performance of a single process
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1013) since it cannot use all of system memory to buffer the outgoing writes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1014) anymore but it preserve the memory on other nodes so that the performance
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1015) of other processes running on other nodes will not be affected.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1016)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1017) Allowing regular swap effectively restricts allocations to the local
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1018) node unless explicitly overridden by memory policies or cpuset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1019) configurations.