Orange Pi5 kernel

^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   1) .. _balance:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   2) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   3) ================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   4) Memory Balancing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   5) ================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   6) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   7) Started Jan 2000 by Kanoj Sarcar <kanoj@sgi.com>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   8) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   9) Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  10) well as for non __GFP_IO allocations.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  11) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  12) The first reason why a caller may avoid reclaim is that the caller cannot
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  13) sleep due to holding a spinlock or is in interrupt context. The second may
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  14) be that the caller is willing to fail the allocation without incurring the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  15) overhead of page reclaim. This may happen for opportunistic high-order
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  16) allocation requests that have order-0 fallback options. In such cases,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  17) the caller may also wish to avoid waking kswapd.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  18) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  19) Allocation requests without __GFP_IO are made to prevent file system deadlocks.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  20) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  21) In the absence of non-sleepable allocation requests, it seems detrimental
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  22) to be doing balancing. Page reclamation can be kicked off lazily, that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  23) is, only when needed (i.e. when zone free memory is 0), instead of making it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  24) a proactive process.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  25) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  26) That being said, the kernel should try to fulfill requests for direct
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  27) mapped pages from the direct mapped pool, instead of falling back on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  28) the dma pool, so as to keep the dma pool filled for dma requests (atomic
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  29) or not). A similar argument applies to highmem and direct mapped pages.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  30) OTOH, if there are a lot of free dma pages, it is preferable to satisfy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  31) regular memory requests by allocating one from the dma pool, instead
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  32) of incurring the overhead of regular zone balancing.
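
The preference order described above can be sketched as follows. This is only
an illustration, not the kernel's allocator: ``struct pool_sketch``, its
``plenty`` threshold (a stand-in for "a lot of free pages") and
``pick_pool()`` are all hypothetical names introduced here.

```c
/* Hypothetical, simplified pool description; not a kernel structure. */
struct pool_sketch {
	unsigned long free_pages;	/* pages currently free in the pool */
	unsigned long plenty;		/* assumed threshold for "a lot of free pages" */
};

/*
 * pools[] is ordered from the lowest class (dma) upwards, and preferred
 * is the index of the pool the request naturally belongs to.
 *
 * The preferred pool is tried first so that lower pools stay filled.
 * A lower pool is used only if it has plenty of free pages.  A return
 * value of -1 means no pool qualifies and balancing would be needed.
 */
int pick_pool(const struct pool_sketch *pools, int preferred)
{
	for (int i = preferred; i >= 0; i--)
		if (pools[i].free_pages > pools[i].plenty)
			return i;
	return -1;
}
```

With a nearly empty regular pool and a well-stocked dma pool, a regular
request is served from dma; once the regular pool has recovered, the dma pool
is left alone.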
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  33) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  34) In 2.2, memory balancing/page reclamation would kick off only when the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  35) _total_ number of free pages fell below 1/64th of total memory. With the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  36) right ratio of dma and regular memory, it is quite possible that balancing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  37) would not be done even when the dma zone was completely empty. 2.2 has
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  38) been running production machines of varying memory sizes, and seems to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  39) doing fine even with the presence of this problem. In 2.3, due to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  40) HIGHMEM, this problem is aggravated.
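
To make the 2.2 problem concrete, here is a minimal sketch of a global
1/64th check. The structure and function names are hypothetical and the code
is not taken from the 2.2 sources; it only mirrors the rule stated above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified zone description; not the kernel's struct. */
struct zone_sketch {
	unsigned long free_pages;	/* pages currently free in this zone */
	unsigned long total_pages;	/* pages managed by this zone */
};

/*
 * 2.2-style rule as described in the text: balancing starts only when
 * the free pages summed over all zones drop below 1/64th of all memory.
 */
bool global_needs_balance(const struct zone_sketch *zones, size_t nr_zones)
{
	unsigned long free = 0, total = 0;

	for (size_t i = 0; i < nr_zones; i++) {
		free += zones[i].free_pages;
		total += zones[i].total_pages;
	}
	return free < total / 64;
}
```

For example, with a 4096-page dma zone that is completely empty and a
61440-page regular zone holding 2000 free pages, the threshold is
65536 / 64 = 1024 pages, so no balancing is triggered even though the dma
zone cannot satisfy a single request.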
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  41) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  42) In 2.3, zone balancing can be done in one of two ways. First, depending on the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  43) zone size (and possibly on the sizes of lower class zones), we can decide
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  44) at init time how many free pages we should aim for while balancing any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  45) zone. The good part is that, while balancing, we do not need to look at sizes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  46) of lower class zones; the bad part is that we might balance too frequently
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  47) because we ignore possibly lower usage in the lower class zones. Also,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  48) with a slight change in the allocation routine, it is possible to reduce
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  49) the memclass() macro to be a simple equality.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  50) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  51) Another possible solution is that we balance only when the free memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  52) of a zone _and_ all its lower class zones falls below 1/64th of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  53) total memory in the zone and its lower class zones. This fixes the 2.2
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  54) balancing problem, and stays as close to 2.2 behavior as possible. Also,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  55) the balancing algorithm works the same way on the various architectures,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  56) which have different numbers and types of zones. If we wanted to get
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  57) fancy, we could assign different weights to free pages in different
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  58) zones in the future.
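
A rough sketch of this cumulative rule is shown below. It assumes the zones
are held in an array ordered from the lowest class upwards (for example dma,
regular, highmem); ``struct zone_class`` and ``class_needs_balance()`` are
hypothetical names, not kernel interfaces.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified zone description; not the kernel's struct. */
struct zone_class {
	unsigned long free_pages;	/* pages currently free in this zone */
	unsigned long total_pages;	/* pages managed by this zone */
};

/*
 * zones[] is ordered from the lowest class to the highest.  Zone idx
 * needs balancing when it and all lower class zones together have fewer
 * free pages than 1/64th of their combined size.
 */
bool class_needs_balance(const struct zone_class *zones, size_t idx)
{
	unsigned long free = 0, total = 0;

	for (size_t i = 0; i <= idx; i++) {
		free += zones[i].free_pages;
		total += zones[i].total_pages;
	}
	return free < total / 64;
}
```

Because the sum for the dma zone covers only the dma zone itself, an empty
dma zone is now noticed regardless of how much memory is free in the zones
above it.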
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  59) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  60) Note that if the size of the regular zone is huge compared to dma zone,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  61) it becomes less significant to consider the free dma pages while
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  62) deciding whether to balance the regular zone. The first solution
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  63) becomes more attractive then.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  64) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  65) The appended patch implements the second solution. It also "fixes" two
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  66) problems: first, kswapd is woken up as in 2.2 on low memory conditions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  67) for non-sleepable allocations. Second, the HIGHMEM zone is also balanced,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  68) so as to give a fighting chance for replace_with_highmem() to get a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  69) HIGHMEM page, as well as to ensure that HIGHMEM allocations do not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  70) fall back into regular zone. This also makes sure that HIGHMEM pages
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  71) are not leaked (for example, in situations where a HIGHMEM page is in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  72) the swapcache but is not being used by anyone).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  73) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  74) kswapd also needs to know about the zones it should balance. kswapd is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  75) primarily needed in a situation where balancing cannot be done,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  76) probably because all allocation requests are coming from interrupt context
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  77) and all process contexts are sleeping. For 2.3, kswapd does not really
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  78) need to balance the highmem zone, since interrupt context does not request
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  79) highmem pages. kswapd looks at the zone_wake_kswapd field in the zone
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  80) structure to decide whether a zone needs balancing.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  81) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  82) Page stealing from process memory and shm is done if stealing the page would
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  83) alleviate memory pressure on any zone in the page's node that has fallen below
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  84) its watermark.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  85) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  86) watermark[WMARK_MIN/WMARK_LOW/WMARK_HIGH]/low_on_memory/zone_wake_kswapd: These
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  87) are per-zone fields, used to determine when a zone needs to be balanced. When
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  88) the number of free pages falls below watermark[WMARK_MIN], the hysteretic field
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  89) low_on_memory gets set. This stays set till the number of free pages reaches
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  90) watermark[WMARK_HIGH]. When low_on_memory is set, page allocation requests will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  91) try to free some pages in the zone (provided GFP_WAIT is set in the request).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  92) Orthogonal to this is the decision to poke kswapd to free some zone pages.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  93) That decision is not hysteresis based, and is made when the number of free
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  94) pages is below watermark[WMARK_LOW], in which case zone_wake_kswapd is also set.
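
The difference between the two flags can be sketched as a small state
update. The field names follow the text, but ``struct zone_state`` and
``zone_update_flags()`` are hypothetical. Clearing zone_wake_kswapd as soon
as the free page count is back at or above watermark[WMARK_LOW] is an
assumption of this sketch; the text only says when the flag is set.

```c
#include <stdbool.h>

enum { WMARK_MIN, WMARK_LOW, WMARK_HIGH, NR_WMARK };

/* Hypothetical, simplified per-zone state; not the kernel's struct zone. */
struct zone_state {
	unsigned long free_pages;
	unsigned long watermark[NR_WMARK];
	bool low_on_memory;		/* hysteretic flag */
	bool zone_wake_kswapd;		/* follows the current free page count */
};

void zone_update_flags(struct zone_state *z)
{
	/* Hysteresis: set below MIN, cleared only once HIGH is reached. */
	if (z->free_pages < z->watermark[WMARK_MIN])
		z->low_on_memory = true;
	else if (z->free_pages >= z->watermark[WMARK_HIGH])
		z->low_on_memory = false;

	/* No hysteresis (clearing here is an assumption of this sketch). */
	z->zone_wake_kswapd = z->free_pages < z->watermark[WMARK_LOW];
}
```

Between WMARK_MIN and WMARK_HIGH the value of low_on_memory therefore
depends on which side the free page count came from, while zone_wake_kswapd
depends only on the current count.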
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  95) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  96) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  97) (Good) Ideas that I have heard:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  98) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  99) 1. Dynamic experience should influence balancing: number of failed requests
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100)    for a zone can be tracked and fed into the balancing scheme (jalvo@mbay.net)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) 2. Implement a replace_with_highmem()-like replace_with_regular() to preserve
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102)    dma pages. (lkd@tantalophile.demon.co.uk)