Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

===============================
Documentation for /proc/sys/vm/
===============================

kernel version 2.6.29

Copyright (c) 1998, 1999,  Rik van Riel <riel@nl.linux.org>

Copyright (c) 2008         Peter W. Morreale <pmorreale@novell.com>

For general info and legal blurb, please look in index.rst.

------------------------------------------------------------------------------

This file contains the documentation for the sysctl files in
/proc/sys/vm and is valid for Linux kernel version 2.6.29.

The files in this directory can be used to tune the operation
of the virtual memory (VM) subsystem of the Linux kernel and
the writeout of dirty data to disk.

Default values and initialization routines for most of these
files can be found in mm/swap.c.

Currently, these files are in /proc/sys/vm:

- admin_reserve_kbytes
- block_dump
- compact_memory
- compaction_proactiveness
- compact_unevictable_allowed
- dirty_background_bytes
- dirty_background_ratio
- dirty_bytes
- dirty_expire_centisecs
- dirty_ratio
- dirtytime_expire_seconds
- dirty_writeback_centisecs
- drop_caches
- extfrag_threshold
- extra_free_kbytes
- highmem_is_dirtyable
- hugetlb_shm_group
- laptop_mode
- legacy_va_layout
- lowmem_reserve_ratio
- max_map_count
- memory_failure_early_kill
- memory_failure_recovery
- min_free_kbytes
- min_slab_ratio
- min_unmapped_ratio
- mmap_min_addr
- mmap_rnd_bits
- mmap_rnd_compat_bits
- nr_hugepages
- nr_hugepages_mempolicy
- nr_overcommit_hugepages
- nr_trim_pages         (only if CONFIG_MMU=n)
- numa_zonelist_order
- oom_dump_tasks
- oom_kill_allocating_task
- overcommit_kbytes
- overcommit_memory
- overcommit_ratio
- page-cluster
- panic_on_oom
- percpu_pagelist_fraction
- stat_interval
- stat_refresh
- numa_stat
- swappiness
- unprivileged_userfaultfd
- user_reserve_kbytes
- vfs_cache_pressure
- watermark_boost_factor
- watermark_scale_factor
- zone_reclaim_mode


admin_reserve_kbytes
====================

The amount of free memory in the system that should be reserved for users
with the capability cap_sys_admin.

admin_reserve_kbytes defaults to min(3% of free pages, 8MB)

That should provide enough for the admin to log in and kill a process,
if necessary, under the default overcommit 'guess' mode.

Systems running under overcommit 'never' should increase this to account
for the full Virtual Memory Size of programs used to recover. Otherwise,
root may not be able to log in to recover the system.

How do you calculate a minimum useful reserve?

sshd or login + bash (or some other shell) + top (or ps, kill, etc.)

For overcommit 'guess', we can sum resident set sizes (RSS).
On x86_64 this is about 8MB.

For overcommit 'never', we can take the max of their virtual sizes (VSZ)
and add the sum of their RSS.
On x86_64 this is about 128MB.

Changing this takes effect whenever an application requests memory.

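As a rough, illustrative sketch of the 'never' sizing rule above (the process
names and the final 128MB figure are examples only)::

    # take the largest VSZ among the recovery tools and add the sum of their RSS
    ps -o vsz=,rss= -C sshd,bash,top | awk '
        { if ($1 > vsz) vsz = $1; rss += $2 }
        END { printf "suggested admin_reserve_kbytes: %d\n", vsz + rss }'

    # apply an illustrative 128MB reserve (the unit is kilobytes)
    echo 131072 > /proc/sys/vm/admin_reserve_kbytes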

block_dump
==========

block_dump enables block I/O debugging when set to a nonzero value. More
information on block I/O debugging is in Documentation/admin-guide/laptops/laptop-mode.rst.

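For example (this is a debugging aid only; the per-task messages are emitted
at debug level in the kernel log)::

    echo 1 > /proc/sys/vm/block_dump
    dmesg | tail            # read/write/dirtied-inode messages per task
    echo 0 > /proc/sys/vm/block_dump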

compact_memory
==============

Available only when CONFIG_COMPACTION is set. When 1 is written to the file,
all zones are compacted such that free memory is available in contiguous
blocks where possible. This can be important for example in the allocation of
huge pages although processes will also directly compact memory as required.

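To trigger a one-off, system-wide compaction pass and observe the result
(the compact_* counters in /proc/vmstat are present with CONFIG_COMPACTION)::

    echo 1 > /proc/sys/vm/compact_memory
    grep compact /proc/vmstat
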
compaction_proactiveness
========================

This tunable takes a value in the range [0, 100], with a default of 20, and
determines how aggressively compaction is done in the background. Writing a
non-zero value to this tunable immediately triggers proactive compaction.
Setting it to 0 disables proactive compaction.

Note that compaction has a non-trivial system-wide impact as pages
belonging to different processes are moved around, which could also lead
to latency spikes in unsuspecting applications. The kernel employs
various heuristics to avoid wasting CPU cycles if it detects that
proactive compaction is not being effective.

Be careful when setting it to extreme values like 100, as that may
cause excessive background compaction activity.

compact_unevictable_allowed
===========================

Available only when CONFIG_COMPACTION is set. When set to 1, compaction is
allowed to examine the unevictable lru (mlocked pages) for pages to compact.
This should be used on systems where stalls for minor page faults are an
acceptable trade for large contiguous free memory.  Set to 0 to prevent
compaction from moving pages that are unevictable.  Default value is 1.
On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due
to compaction, which would block the task from becoming active until the fault
is resolved.


dirty_background_bytes
======================

Contains the amount of dirty memory at which the background kernel
flusher threads will start writeback.

Note:
  dirty_background_bytes is the counterpart of dirty_background_ratio. Only
  one of them may be specified at a time. When one sysctl is written it is
  immediately taken into account to evaluate the dirty memory limits and the
  other appears as 0 when read.

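For example, the two sysctls behave as described (the 64MB value is
illustrative)::

    sysctl -w vm.dirty_background_bytes=67108864   # 64MB
    sysctl vm.dirty_background_ratio               # now reads 0
    sysctl -w vm.dirty_background_ratio=10
    sysctl vm.dirty_background_bytes               # now reads 0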

dirty_background_ratio
======================

Contains, as a percentage of total available memory that contains free pages
and reclaimable pages, the number of pages at which the background kernel
flusher threads will start writing out dirty data.

The total available memory is not equal to total system memory.


dirty_bytes
===========

Contains the amount of dirty memory at which a process generating disk writes
will itself start writeback.

Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be
specified at a time. When one sysctl is written it is immediately taken into
account to evaluate the dirty memory limits and the other appears as 0 when
read.

Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any
value lower than this limit will be ignored and the old configuration will be
retained.


dirty_expire_centisecs
======================

This tunable is used to define when dirty data is old enough to be eligible
for writeout by the kernel flusher threads.  It is expressed in 100'ths
of a second.  Data which has been dirty in-memory for longer than this
interval will be written out next time a flusher thread wakes up.


dirty_ratio
===========

Contains, as a percentage of total available memory that contains free pages
and reclaimable pages, the number of pages at which a process which is
generating disk writes will itself start writing out dirty data.

The total available memory is not equal to total system memory.


dirtytime_expire_seconds
========================

When a lazytime inode is constantly having its pages dirtied, the inode with
an updated timestamp will never get a chance to be written out.  And, if the
only thing that has happened on the file system is a dirtytime inode caused
by an atime update, a worker will be scheduled to make sure that inode
eventually gets pushed out to disk.  This tunable is used to define when a
dirty inode is old enough to be eligible for writeback by the kernel flusher
threads.  It is also used as the interval at which the dirtytime_writeback
thread wakes up.


dirty_writeback_centisecs
=========================

The kernel flusher threads will periodically wake up and write `old` data
out to disk.  This tunable expresses the interval between those wakeups, in
100'ths of a second.

Setting this to zero disables periodic writeback altogether.

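For example (values are illustrative; both tunables are in hundredths of a
second)::

    sysctl -w vm.dirty_writeback_centisecs=1500    # wake the flushers every 15s
    sysctl -w vm.dirty_expire_centisecs=6000       # write back data dirty for >60s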

drop_caches
===========

Writing to this will cause the kernel to drop clean caches, as well as
reclaimable slab objects like dentries and inodes.  Once dropped, their
memory becomes free.

To free pagecache::

    echo 1 > /proc/sys/vm/drop_caches

To free reclaimable slab objects (includes dentries and inodes)::

    echo 2 > /proc/sys/vm/drop_caches

To free slab objects and pagecache::

    echo 3 > /proc/sys/vm/drop_caches

This is a non-destructive operation and will not free any dirty objects.
To increase the number of objects freed by this operation, the user may run
`sync` prior to writing to /proc/sys/vm/drop_caches.  This will minimize the
number of dirty objects on the system and create more candidates to be
dropped.

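For example, the two steps are typically combined::

    sync
    echo 3 > /proc/sys/vm/drop_caches
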
This file is not a means to control the growth of the various kernel caches
(inodes, dentries, pagecache, etc...)  These objects are automatically
reclaimed by the kernel when memory is needed elsewhere on the system.

Use of this file can cause performance problems.  Since it discards cached
objects, it may cost a significant amount of I/O and CPU to recreate the
dropped objects, especially if they were under heavy use.  Because of this,
use outside of a testing or debugging environment is not recommended.

You may see informational messages in your kernel log when this file is
used::

    cat (1234): drop_caches: 3

These are informational only.  They do not mean that anything is wrong
with your system.  To disable them, echo 4 (bit 2) into drop_caches.


extfrag_threshold
=================

This parameter affects whether the kernel will compact memory or direct
reclaim to satisfy a high-order allocation. The extfrag/extfrag_index file in
debugfs shows what the fragmentation index for each order is in each zone in
the system. Values tending towards 0 imply allocations would fail due to lack
of memory, values towards 1000 imply failures are due to fragmentation and -1
implies that the allocation will succeed as long as watermarks are met.

The kernel will not compact memory in a zone if the
fragmentation index is <= extfrag_threshold. The default value is 500.

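A minimal sketch, assuming debugfs is (or can be) mounted at /sys/kernel/debug::

    mount -t debugfs none /sys/kernel/debug 2>/dev/null

    # fragmentation index per zone and order, as described above
    cat /sys/kernel/debug/extfrag/extfrag_index

    # lower the threshold so compaction is preferred over reclaim more readily
    sysctl -w vm.extfrag_threshold=400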

highmem_is_dirtyable
====================

Available only for systems with CONFIG_HIGHMEM enabled (32-bit systems).

This parameter controls whether the high memory is considered for dirty
writers throttling.  This is not the case by default which means that
only the amount of memory directly visible/usable by the kernel can
be dirtied. As a result, on systems with a large amount of memory and
lowmem basically depleted writers might be throttled too early and
streaming writes can get very slow.

Changing the value to non-zero would allow more memory to be dirtied
and thus allow writers to write more data which can be flushed to the
storage more effectively. Note this also comes with a risk of premature
OOM killer invocation because some writers (e.g. direct block device
writes) can only use the low memory and they can fill it up with dirty
data without any throttling.


extra_free_kbytes
=================

This parameter tells the VM to keep extra free memory between the threshold
where background reclaim (kswapd) kicks in, and the threshold where direct
reclaim (by allocating processes) kicks in.

This is useful for workloads that require low-latency memory allocations
and have a bounded burstiness in memory allocations; for example, a
realtime application that receives and transmits network traffic
(causing in-kernel memory allocations) with a maximum total message burst
size of 200MB may need 200MB of extra free memory to avoid direct reclaim
related latencies.

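For example (the 200MB figure mirrors the burst-size example above and is only
illustrative; this sysctl exists only on kernels carrying the extra_free_kbytes
patch, such as this tree)::

    sysctl -w vm.extra_free_kbytes=204800

    # the change is reflected in the per-zone watermarks
    grep -B1 -A3 'pages free' /proc/zoneinfo
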
hugetlb_shm_group
=================

hugetlb_shm_group contains the group ID that is allowed to create SysV
shared memory segments using hugetlb pages.


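For example, assuming a dedicated group (here named hugetlb) has been created
for this purpose::

    sysctl -w vm.hugetlb_shm_group=$(getent group hugetlb | cut -d: -f3)
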
laptop_mode
===========

laptop_mode is a knob that controls "laptop mode". All the things that are
controlled by this knob are discussed in Documentation/admin-guide/laptops/laptop-mode.rst.


legacy_va_layout
================

If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel
will use the legacy (2.4) layout for all processes.


lowmem_reserve_ratio
====================

For some specialised workloads on highmem machines it is dangerous for
the kernel to allow process memory to be allocated from the "lowmem"
zone.  This is because that memory could then be pinned via the mlock()
system call, or by unavailability of swapspace.

And on large highmem machines this lack of reclaimable lowmem memory
can be fatal.

So the Linux page allocator has a mechanism which prevents allocations
which *could* use highmem from using too much lowmem.  This means that
a certain amount of lowmem is defended from the possibility of being
captured into pinned user memory.

(The same argument applies to the old 16 megabyte ISA DMA region.  This
mechanism will also defend that region from allocations which could use
highmem or lowmem).

The `lowmem_reserve_ratio` tunable determines how aggressive the kernel is
in defending these lower zones.

If you have a machine which uses highmem or ISA DMA and your
applications are using mlock(), or if you are running with no swap then
you probably should change the lowmem_reserve_ratio setting.

The lowmem_reserve_ratio is an array. You can see it by reading this file::

    % cat /proc/sys/vm/lowmem_reserve_ratio
    256     256     32

But these values are not used directly. The kernel calculates the number of
protection pages for each zone from them. These are shown as an array of
protection pages in /proc/zoneinfo, as in the following example from an
x86-64 box. Each zone has an array of protection pages like this::

  Node 0, zone      DMA
    pages free     1355
          min      3
          low      3
          high     4
        :
        :
      numa_other   0
          protection: (0, 2004, 2004, 2004)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    pagesets
      cpu: 0 pcp: 0
          :

These protections are added to the watermark when judging whether this zone
should be used for page allocation or should be reclaimed.

In this example, if normal pages (index=2) are requested from this DMA zone
and watermark[WMARK_HIGH] is used as the watermark, the kernel judges that
this zone should not be used because pages_free (1355) is smaller than
watermark + protection[2] (4 + 2004 = 2008). If the protection value were 0,
this zone would be used for a normal page request. If the request is for the
DMA zone itself (index=0), protection[0] (=0) is used.

zone[i]'s protection[j] is calculated by the following expression::

  (i < j):
    zone[i]->protection[j]
    = (total sums of managed_pages from zone[i+1] to zone[j] on the node)
      / lowmem_reserve_ratio[i];
  (i = j):
     (should not be protected. = 0)
  (i > j):
     (not necessary, but looks 0)

The default values of lowmem_reserve_ratio[i] are

    === ====================================
    256 (if zone[i] means DMA or DMA32 zone)
    32  (others)
    === ====================================

As the expression above shows, these values are the reciprocals of the ratios.
256 means 1/256, so the number of protection pages becomes about 0.39% of the
total managed pages of the higher zones on the node.

If you would like to protect more pages, smaller values are effective.
The minimum value is 1 (1/1 -> 100%). A value less than 1 completely
disables protection of the pages.

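For example, to defend roughly twice as many lowmem pages (values are
illustrative)::

    echo "128 128 32" > /proc/sys/vm/lowmem_reserve_ratio

    # the recalculated protection arrays are visible per zone
    grep -E 'Node|protection' /proc/zoneinfo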

max_map_count
=============

This file contains the maximum number of memory map areas a process
may have. Memory map areas are used as a side-effect of calling
malloc, directly by mmap, mprotect, and madvise, and also when loading
shared libraries.

While most applications need less than a thousand maps, certain
programs, particularly malloc debuggers, may consume lots of them,
e.g., up to one or two maps per allocation.

The default value is 65530.

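For example (the raised limit is illustrative)::

    sysctl vm.max_map_count            # current limit
    wc -l /proc/$$/maps                # mappings currently used by this shell
    sysctl -w vm.max_map_count=262144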

memory_failure_early_kill
=========================

Control how to kill processes when an uncorrected memory error (typically
a 2-bit error in a memory module) is detected in the background by hardware
and cannot be handled by the kernel. In some cases (like the page
still having a valid copy on disk) the kernel will handle the failure
transparently without affecting any applications. But if there is
no other up-to-date copy of the data it will kill processes to prevent any
data corruption from propagating.

1: Kill all processes that have the corrupted and not reloadable page mapped
as soon as the corruption is detected.  Note this is not supported
for a few types of pages, like kernel internally allocated data or
the swap cache, but works for the majority of user pages.

0: Only unmap the corrupted page from all processes and only kill a process
that tries to access it.

The kill is done using a catchable SIGBUS with BUS_MCEERR_AO, so processes can
handle this if they want to.

This is only active on architectures/platforms with advanced machine
check handling and depends on the hardware capabilities.

Applications can override this setting individually with the PR_MCE_KILL prctl.


memory_failure_recovery
=======================

Enable memory failure recovery (when supported by the platform)

1: Attempt recovery.

0: Always panic on a memory failure.


min_free_kbytes
===============

This is used to force the Linux VM to keep a minimum number
of kilobytes free.  The VM uses this number to compute a
watermark[WMARK_MIN] value for each lowmem zone in the system.
Each lowmem zone gets a number of reserved free pages based
proportionally on its size.

Some minimal amount of memory is needed to satisfy PF_MEMALLOC
allocations; if you set this to lower than 1024KB, your system will
become subtly broken, and prone to deadlock under high loads.

Setting this too high will OOM your machine instantly.

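For example (the kernel picks its own default based on memory size; the value
below is illustrative)::

    cat /proc/sys/vm/min_free_kbytes
    sysctl -w vm.min_free_kbytes=67584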

min_slab_ratio
==============

This is available only on NUMA kernels.

A percentage of the total pages in each zone.  When zone reclaim occurs
(i.e. a fallback from the local zone), slabs will be reclaimed if more
than this percentage of pages in a zone are reclaimable slab pages.
This ensures that slab growth stays under control even in NUMA
systems that rarely perform global reclaim.

The default is 5 percent.

Note that slab reclaim is triggered in a per zone / node fashion.
The process of reclaiming slab memory is currently not node specific
and may not be fast.


min_unmapped_ratio
==================

This is available only on NUMA kernels.

This is a percentage of the total pages in each zone. Zone reclaim will
only occur if more than this percentage of pages are in a state that
zone_reclaim_mode allows to be reclaimed.

If zone_reclaim_mode has the value 4 OR'd, then the percentage is compared
against all file-backed unmapped pages including swapcache pages and tmpfs
files. Otherwise, only unmapped pages backed by normal files but not tmpfs
files and similar are considered.

The default is 1 percent.


mmap_min_addr
=============

This file indicates the amount of address space which a user process will
be restricted from mmapping.  Since kernel null dereference bugs could
accidentally operate based on the information in the first couple of pages
of memory, userspace processes should not be allowed to write to them.  By
default this value is set to 0 and no protections will be enforced by the
security module.  Setting this value to something like 64k will allow the
vast majority of applications to work correctly and provide defense in depth
against future potential kernel bugs.

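For example, to disallow mappings in the lowest 64k of the address space::

    sysctl -w vm.mmap_min_addr=65536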

mmap_rnd_bits
=============

This value can be used to select the number of bits to use to
determine the random offset to the base address of vma regions
resulting from mmap allocations on architectures which support
tuning address space randomization.  This value will be bounded
by the architecture's minimum and maximum supported values.

This value can be changed after boot using the
/proc/sys/vm/mmap_rnd_bits tunable.


mmap_rnd_compat_bits
====================

This value can be used to select the number of bits to use to
determine the random offset to the base address of vma regions
resulting from mmap allocations for applications run in
compatibility mode on architectures which support tuning address
space randomization.  This value will be bounded by the
architecture's minimum and maximum supported values.

This value can be changed after boot using the
/proc/sys/vm/mmap_rnd_compat_bits tunable.

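A minimal way to inspect the setting and see the randomization at work (the
permitted range of bits depends on the architecture and page size)::

    cat /proc/sys/vm/mmap_rnd_bits

    # each run maps libc at a different address while ASLR is enabled
    grep -m1 libc /proc/self/maps
    grep -m1 libc /proc/self/maps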

nr_hugepages
============

Change the minimum size of the hugepage pool.

See Documentation/admin-guide/mm/hugetlbpage.rst


nr_hugepages_mempolicy
======================

Change the size of the hugepage pool at run-time on a specific
set of NUMA nodes.

See Documentation/admin-guide/mm/hugetlbpage.rst


nr_overcommit_hugepages
=======================

Change the maximum size of the hugepage pool. The maximum is
nr_hugepages + nr_overcommit_hugepages.

See Documentation/admin-guide/mm/hugetlbpage.rst

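For example (pool sizes are illustrative; nr_hugepages_mempolicy is written
the same way but under a NUMA memory policy)::

    echo 128 > /proc/sys/vm/nr_hugepages              # persistent pool
    echo 64  > /proc/sys/vm/nr_overcommit_hugepages   # extra surplus pages on demand
    grep -i hugepages /proc/meminfo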

nr_trim_pages
=============

This is available only on NOMMU kernels.

This value adjusts the excess page trimming behaviour of power-of-2 aligned
NOMMU mmap allocations.

A value of 0 disables trimming of allocations entirely, while a value of 1
trims excess pages aggressively. Any value >= 1 acts as the watermark where
trimming of allocations is initiated.

The default value is 1.

See Documentation/admin-guide/mm/nommu-mmap.rst for more information.


numa_zonelist_order
===================

This sysctl is only for NUMA and it is deprecated. Anything but
Node order will fail!

Where the memory is allocated from is controlled by zonelists.

(This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simplicity;
you may read ZONE_DMA below as ZONE_DMA32...)

In the non-NUMA case, a zonelist for GFP_KERNEL is ordered as
ZONE_NORMAL -> ZONE_DMA.
This means that a memory allocation request for GFP_KERNEL will
get memory from ZONE_DMA only when ZONE_NORMAL is not available.

In the NUMA case, you can think of the following two types of order.
Assume a 2-node NUMA system; below is the zonelist of Node(0)'s GFP_KERNEL::

  (A) Node(0) ZONE_NORMAL -> Node(0) ZONE_DMA -> Node(1) ZONE_NORMAL
  (B) Node(0) ZONE_NORMAL -> Node(1) ZONE_NORMAL -> Node(0) ZONE_DMA.

Type (A) offers the best locality for processes on Node(0), but ZONE_DMA
will be used before all of ZONE_NORMAL is exhausted. This increases the
possibility of out-of-memory (OOM) in ZONE_DMA because ZONE_DMA tends to
be small.

Type (B) cannot offer the best locality but is more robust against OOM of
the DMA zone.

Type (A) is called "Node" order. Type (B) is "Zone" order.

"Node order" orders the zonelists by node, then by zone within each node.
Specify "[Nn]ode" for node order.

"Zone Order" orders the zonelists by zone type, then by node within each
zone.  Specify "[Zz]one" for zone order.

Specify "[Dd]efault" to request automatic configuration.

On 32-bit, the Normal zone needs to be preserved for allocations accessible
by the kernel, so "zone" order will be selected.

On 64-bit, devices that require DMA32/DMA are relatively rare, so "node"
order will be selected.

Default order is recommended unless this is causing problems for your
system/application.

^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  671) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  672) oom_dump_tasks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  673) ==============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  674) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  675) Enables a system-wide task dump (excluding kernel threads) to be produced
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  676) when the kernel performs an OOM-killing and includes such information as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  677) pid, uid, tgid, vm size, rss, pgtables_bytes, swapents, oom_score_adj
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  678) score, and name.  This is helpful to determine why the OOM killer was
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  679) invoked, to identify the rogue task that caused it, and to determine why
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  680) the OOM killer chose the task it did to kill.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  681) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  682) If this is set to zero, this information is suppressed.  On very
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  683) large systems with thousands of tasks it may not be feasible to dump
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  684) the memory state information for each one.  Such systems should not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  685) be forced to incur a performance penalty in OOM conditions when the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  686) information may not be desired.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  687) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  688) If this is set to non-zero, this information is shown whenever the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  689) OOM killer actually kills a memory-hogging task.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  690) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  691) The default value is 1 (enabled).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  692) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  693) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  694) oom_kill_allocating_task
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  695) ========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  696) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  697) This enables or disables killing the OOM-triggering task in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  698) out-of-memory situations.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  699) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  700) If this is set to zero, the OOM killer will scan through the entire
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  701) tasklist and select a task based on heuristics to kill.  This normally
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  702) selects a rogue memory-hogging task that frees up a large amount of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  703) memory when killed.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  704) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  705) If this is set to non-zero, the OOM killer simply kills the task that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  706) triggered the out-of-memory condition.  This avoids the expensive
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  707) tasklist scan.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  708) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  709) If panic_on_oom is selected, it takes precedence over whatever value
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  710) is used in oom_kill_allocating_task.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  711) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  712) The default value is 0.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  713) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  714) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  715) overcommit_kbytes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  716) =================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  717) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  718) When overcommit_memory is set to 2, the committed address space is not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  719) permitted to exceed swap plus this amount of physical RAM. See below.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  720) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  721) Note: overcommit_kbytes is the counterpart of overcommit_ratio. Only one
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  722) of them may be specified at a time. Setting one disables the other (which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  723) then appears as 0 when read).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  724) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  725) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  726) overcommit_memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  727) =================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  728) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  729) This value contains a flag that enables memory overcommitment.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  730) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  731) When this flag is 0, the kernel attempts to estimate the amount
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  732) of free memory left when userspace requests more memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  733) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  734) When this flag is 1, the kernel pretends there is always enough
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  735) memory until it actually runs out.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  736) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  737) When this flag is 2, the kernel uses a "never overcommit"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  738) policy that attempts to prevent any overcommit of memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  739) Note that user_reserve_kbytes affects this policy.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  740) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  741) This feature can be very useful because there are a lot of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  742) programs that malloc() huge amounts of memory "just-in-case"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  743) and don't use much of it.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  744) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  745) The default value is 0.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  746) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  747) See Documentation/vm/overcommit-accounting.rst and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  748) mm/util.c::__vm_enough_memory() for more information.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  749) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  750) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  751) overcommit_ratio
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  752) ================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  753) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  754) When overcommit_memory is set to 2, the committed address
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  755) space is not permitted to exceed swap plus this percentage
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  756) of physical RAM.  See above.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  757) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  758) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  759) page-cluster
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  760) ============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  761) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  762) page-cluster controls the number of pages up to which consecutive pages
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  763) are read in from swap in a single attempt. This is the swap counterpart
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  764) to page cache readahead.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  765) The mentioned consecutivity is not in terms of virtual/physical addresses,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  766) but consecutive on swap space - that means they were swapped out together.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  767) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  768) It is a logarithmic value - setting it to zero means "1 page", setting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  769) it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  770) Zero disables swap readahead completely.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  771) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  772) The default value is three (eight pages at a time).  There may be some
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  773) small benefits in tuning this to a different value if your workload is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  774) swap-intensive.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  775) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  776) Lower values mean lower latencies for initial faults, but at the same time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  777) extra faults and I/O delays for following faults if they would have been part of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  778) that consecutive pages readahead would have brought in.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  779) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  780) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  781) panic_on_oom
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  782) ============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  783) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  784) This enables or disables panic on out-of-memory feature.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  785) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  786) If this is set to 0, the kernel will kill some rogue process,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  787) called oom_killer.  Usually, oom_killer can kill rogue processes and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  788) system will survive.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  789) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  790) If this is set to 1, the kernel panics when out-of-memory happens.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  791) However, if a process limits using nodes by mempolicy/cpusets,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  792) and those nodes become memory exhaustion status, one process
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  793) may be killed by oom-killer. No panic occurs in this case.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  794) Because other nodes' memory may be free. This means system total status
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  795) may be not fatal yet.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  796) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  797) If this is set to 2, the kernel panics compulsorily even on the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  798) above-mentioned. Even oom happens under memory cgroup, the whole
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  799) system panics.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  800) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  801) The default value is 0.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  802) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  803) 1 and 2 are for failover of clustering. Please select either
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  804) according to your policy of failover.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  805) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  806) panic_on_oom=2+kdump gives you very strong tool to investigate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  807) why oom happens. You can get snapshot.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  808) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  809) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  810) percpu_pagelist_fraction
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  811) ========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  812) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  813) This is the fraction of pages at most (high mark pcp->high) in each zone that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  814) are allocated for each per cpu page list.  The min value for this is 8.  It
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  815) means that we don't allow more than 1/8th of pages in each zone to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  816) allocated in any single per_cpu_pagelist.  This entry only changes the value
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  817) of hot per cpu pagelists.  User can specify a number like 100 to allocate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  818) 1/100th of each zone to each per cpu page list.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  819) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  820) The batch value of each per cpu pagelist is also updated as a result.  It is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  821) set to pcp->high/4.  The upper limit of batch is (PAGE_SHIFT * 8)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  822) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  823) The initial value is zero.  Kernel does not use this value at boot time to set
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  824) the high water marks for each per cpu page list.  If the user writes '0' to this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  825) sysctl, it will revert to this default behavior.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  826) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  827) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  828) stat_interval
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  829) =============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  830) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  831) The time interval between which vm statistics are updated.  The default
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  832) is 1 second.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  833) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  834) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  835) stat_refresh
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  836) ============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  837) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  838) Any read or write (by root only) flushes all the per-cpu vm statistics
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  839) into their global totals, for more accurate reports when testing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  840) e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  841) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  842) As a side-effect, it also checks for negative totals (elsewhere reported
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  843) as 0) and "fails" with EINVAL if any are found, with a warning in dmesg.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  844) (At time of writing, a few stats are known sometimes to be found negative,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  845) with no ill effects: errors and warnings on these stats are suppressed.)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  846) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  847) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  848) numa_stat
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  849) =========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  850) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  851) This interface allows runtime configuration of numa statistics.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  852) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  853) When page allocation performance becomes a bottleneck and you can tolerate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  854) some possible tool breakage and decreased numa counter precision, you can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  855) do::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  856) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  857) 	echo 0 > /proc/sys/vm/numa_stat
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  858) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  859) When page allocation performance is not a bottleneck and you want all
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  860) tooling to work, you can do::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  861) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  862) 	echo 1 > /proc/sys/vm/numa_stat
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  863) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  864) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  865) swappiness
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  866) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  867) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  868) This control is used to define the rough relative IO cost of swapping
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  869) and filesystem paging, as a value between 0 and 200. At 100, the VM
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  870) assumes equal IO cost and will thus apply memory pressure to the page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  871) cache and swap-backed pages equally; lower values signify more
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  872) expensive swap IO, higher values indicates cheaper.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  873) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  874) Keep in mind that filesystem IO patterns under memory pressure tend to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  875) be more efficient than swap's random IO. An optimal value will require
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  876) experimentation and will also be workload-dependent.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  877) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  878) The default value is 60.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  879) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  880) For in-memory swap, like zram or zswap, as well as hybrid setups that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  881) have swap on faster devices than the filesystem, values beyond 100 can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  882) be considered. For example, if the random IO against the swap device
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  883) is on average 2x faster than IO from the filesystem, swappiness should
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  884) be 133 (x + 2x = 200, 2x = 133.33).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  885) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  886) At 0, the kernel will not initiate swap until the amount of free and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  887) file-backed pages is less than the high watermark in a zone.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  888) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  889) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  890) unprivileged_userfaultfd
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  891) ========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  892) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  893) This flag controls the mode in which unprivileged users can use the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  894) userfaultfd system calls. Set this to 0 to restrict unprivileged users
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  895) to handle page faults in user mode only. In this case, users without
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  896) SYS_CAP_PTRACE must pass UFFD_USER_MODE_ONLY in order for userfaultfd to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  897) succeed. Prohibiting use of userfaultfd for handling faults from kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  898) mode may make certain vulnerabilities more difficult to exploit.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  899) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  900) Set this to 1 to allow unprivileged users to use the userfaultfd system
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  901) calls without any restrictions.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  902) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  903) The default value is 0.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  904) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  905) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  906) user_reserve_kbytes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  907) ===================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  908) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  909) When overcommit_memory is set to 2, "never overcommit" mode, reserve
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  910) min(3% of current process size, user_reserve_kbytes) of free memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  911) This is intended to prevent a user from starting a single memory hogging
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  912) process, such that they cannot recover (kill the hog).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  913) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  914) user_reserve_kbytes defaults to min(3% of the current process size, 128MB).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  915) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  916) If this is reduced to zero, then the user will be allowed to allocate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  917) all free memory with a single process, minus admin_reserve_kbytes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  918) Any subsequent attempts to execute a command will result in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  919) "fork: Cannot allocate memory".
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  920) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  921) Changing this takes effect whenever an application requests memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  922) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  923) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  924) vfs_cache_pressure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  925) ==================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  926) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  927) This percentage value controls the tendency of the kernel to reclaim
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  928) the memory which is used for caching of directory and inode objects.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  929) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  930) At the default value of vfs_cache_pressure=100 the kernel will attempt to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  931) reclaim dentries and inodes at a "fair" rate with respect to pagecache and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  932) swapcache reclaim.  Decreasing vfs_cache_pressure causes the kernel to prefer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  933) to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  934) never reclaim dentries and inodes due to memory pressure and this can easily
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  935) lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  936) causes the kernel to prefer to reclaim dentries and inodes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  937) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  938) Increasing vfs_cache_pressure significantly beyond 100 may have negative
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  939) performance impact. Reclaim code needs to take various locks to find freeable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  940) directory and inode objects. With vfs_cache_pressure=1000, it will look for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  941) ten times more freeable objects than there are.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  942) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  943) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  944) watermark_boost_factor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  945) ======================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  946) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  947) This factor controls the level of reclaim when memory is being fragmented.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  948) It defines the percentage of the high watermark of a zone that will be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  949) reclaimed if pages of different mobility are being mixed within pageblocks.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  950) The intent is that compaction has less work to do in the future and to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  951) increase the success rate of future high-order allocations such as SLUB
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  952) allocations, THP and hugetlbfs pages.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  953) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  954) To make it sensible with respect to the watermark_scale_factor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  955) parameter, the unit is in fractions of 10,000. The default value of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  956) 15,000 on !DISCONTIGMEM configurations means that up to 150% of the high
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  957) watermark will be reclaimed in the event of a pageblock being mixed due
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  958) to fragmentation. The level of reclaim is determined by the number of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  959) fragmentation events that occurred in the recent past. If this value is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  960) smaller than a pageblock then a pageblocks worth of pages will be reclaimed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  961) (e.g.  2MB on 64-bit x86). A boost factor of 0 will disable the feature.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  962) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  963) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  964) watermark_scale_factor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  965) ======================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  966) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  967) This factor controls the aggressiveness of kswapd. It defines the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  968) amount of memory left in a node/system before kswapd is woken up and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  969) how much memory needs to be free before kswapd goes back to sleep.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  970) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  971) The unit is in fractions of 10,000. The default value of 10 means the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  972) distances between watermarks are 0.1% of the available memory in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  973) node/system. The maximum value is 1000, or 10% of memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  974) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  975) A high rate of threads entering direct reclaim (allocstall) or kswapd
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  976) going to sleep prematurely (kswapd_low_wmark_hit_quickly) can indicate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  977) that the number of free pages kswapd maintains for latency reasons is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  978) too small for the allocation bursts occurring in the system. This knob
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  979) can then be used to tune kswapd aggressiveness accordingly.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  980) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  981) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  982) zone_reclaim_mode
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  983) =================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  984) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  985) Zone_reclaim_mode allows someone to set more or less aggressive approaches to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  986) reclaim memory when a zone runs out of memory. If it is set to zero then no
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  987) zone reclaim occurs. Allocations will be satisfied from other zones / nodes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  988) in the system.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  989) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  990) This is value OR'ed together of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  991) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  992) =	===================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  993) 1	Zone reclaim on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  994) 2	Zone reclaim writes dirty pages out
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  995) 4	Zone reclaim swaps pages
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  996) =	===================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  997) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  998) zone_reclaim_mode is disabled by default.  For file servers or workloads
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  999) that benefit from having their data cached, zone_reclaim_mode should be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1000) left disabled as the caching effect is likely to be more important than
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1001) data locality.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1002) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1003) Consider enabling one or more zone_reclaim mode bits if it's known that the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1004) workload is partitioned such that each partition fits within a NUMA node
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1005) and that accessing remote memory would cause a measurable performance
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1006) reduction.  The page allocator will take additional actions before
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1007) allocating off node pages.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1008) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1009) Allowing zone reclaim to write out pages stops processes that are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1010) writing large amounts of data from dirtying pages on other nodes. Zone
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1011) reclaim will write out dirty pages if a zone fills up and so effectively
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1012) throttle the process. This may decrease the performance of a single process
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1013) since it cannot use all of system memory to buffer the outgoing writes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1014) anymore but it preserve the memory on other nodes so that the performance
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1015) of other processes running on other nodes will not be affected.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1016) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1017) Allowing regular swap effectively restricts allocations to the local
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1018) node unless explicitly overridden by memory policies or cpuset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1019) configurations.