Orange Pi5 kernel

^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  1) .. SPDX-License-Identifier: GPL-2.0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  2) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  3) =======
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  4) The TLB
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  5) =======
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  6) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  7) When the kernel unmaps or modifies the attributes of a range of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  8) memory, it has two choices:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  9) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10)  1. Flush the entire TLB with a two-instruction sequence.  This is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11)     a quick operation, but it causes collateral damage: TLB entries
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12)     from areas other than the one we are trying to flush will be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13)     destroyed and must be refilled later, at some cost.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14)  2. Use the invlpg instruction to invalidate a single page at a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15)     time.  This could potentially cost many more instructions, but
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16)     it is a much more precise operation, causing no collateral
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17)     damage to other TLB entries.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) Which method to use depends on a few things:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21)  1. The size of the flush being performed.  A flush of the entire
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22)     address space is obviously better performed by flushing the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23)     entire TLB than doing 2^48/PAGE_SIZE individual flushes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24)  2. The contents of the TLB.  If the TLB is empty, then there will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25)     be no collateral damage caused by doing the global flush, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26)     all of the individual flushes would have ended up being wasted
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27)     work.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28)  3. The size of the TLB.  The larger the TLB, the more collateral
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29)     damage we do with a full flush.  So, the larger the TLB, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30)     more attractive an individual flush looks.  Data and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31)     instructions have separate TLBs, as do different page sizes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32)  4. The microarchitecture.  The TLB has become a multi-level
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33)     cache on modern CPUs, and the global flushes have become more
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34)     expensive relative to single-page flushes.
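
To give a sense of scale for point 1 in the list above, the arithmetic can be worked through directly. This is only a sketch: it assumes 4 KB base pages (the text does not state a PAGE_SIZE value), and the constant and function names are illustrative, not kernel symbols.

```python
# Illustrative arithmetic only; the names below are not kernel symbols.
ADDRESS_SPACE_BITS = 48   # from the 2^48 figure in the text
PAGE_SIZE = 4096          # assumption: 4 KB base pages


def individual_flushes_for_full_address_space(bits=ADDRESS_SPACE_BITS,
                                              page_size=PAGE_SIZE):
    """Number of single-page invalidations needed to cover the whole space."""
    return (1 << bits) // page_size


if __name__ == "__main__":
    n = individual_flushes_for_full_address_space()
    # n is a power of two, so bit_length() - 1 gives its exponent.
    print(f"{n} single-page flushes (2^{n.bit_length() - 1})")
```

Under these assumptions the count is 2^36 single-page invalidations, which is why a full flush is the only sensible choice for a whole-address-space flush.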
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) There is obviously no way the kernel can know all these things,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37) especially the contents of the TLB during a given flush.  The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) sizes of the flush will vary greatly depending on the workload as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39) well.  There is essentially no "right" point to choose.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41) You may be doing too many individual invalidations if you see the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42) invlpg instruction (or instructions _near_ it) show up high in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43) profiles.  If you believe that individual invalidations are being
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) called too often, you can lower the tunable::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) 	/sys/kernel/debug/x86/tlb_single_page_flush_ceiling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48) This will cause us to do the global flush for more cases.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49) Lowering it to 0 will disable the use of the individual flushes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) Setting it to 1 is a very conservative setting and it should
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51) never need to be 0 under normal circumstances.
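
The effect of the ceiling described above can be illustrated with a small userspace model. This is a simplified sketch, not the kernel's implementation: it assumes a range is flushed page by page only when its page count does not exceed the ceiling, which is consistent with a value of 0 disabling individual flushes. The function name and the example ceiling values are hypothetical.

```python
# Simplified userspace model; not the kernel's implementation.
FLUSH_ALL = "full"
FLUSH_PAGES = "individual"


def choose_flush(num_pages, ceiling):
    """Return the flush method this model picks for a range of num_pages pages.

    Assumption: ranges larger than `ceiling` pages get a full flush;
    everything else is invalidated one page at a time.
    """
    if num_pages > ceiling:
        return FLUSH_ALL
    return FLUSH_PAGES


if __name__ == "__main__":
    for ceiling in (0, 1, 8):   # example values, not recommendations
        picks = {n: choose_flush(n, ceiling) for n in (1, 2, 8, 9)}
        print(ceiling, picks)
```

In this model a ceiling of 0 sends every flush down the full-flush path, while a ceiling of 1 still allows single-page ranges to be invalidated individually.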
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53) Despite the fact that a single individual flush on x86 is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) guaranteed to flush a full 2MB [1]_, hugetlbfs always uses the full
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) flushes.  THP is treated exactly the same as normal memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57) You might see invlpg inside of flush_tlb_mm_range() show up in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58) profiles, or you can use the trace_tlb_flush() tracepoints to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59) determine how long the flush operations are taking.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61) Essentially, you are balancing the cycles you spend doing invlpg
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) with the cycles that you spend refilling the TLB later.
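
The balance described above can be written down as a toy cost model. Every number and name here is an assumption made for illustration; real invlpg, full-flush and refill costs vary by microarchitecture and must be measured.

```python
# Toy cost model; all cycle counts passed in are hypothetical inputs.

def individual_cost(num_pages, invlpg_cycles):
    """Cycles spent invalidating num_pages pages one at a time."""
    return num_pages * invlpg_cycles


def full_flush_cost(flush_cycles, live_entries_lost, refill_cycles):
    """Cycles for the flush itself plus refilling entries lost as collateral."""
    return flush_cycles + live_entries_lost * refill_cycles


def cheaper_method(num_pages, invlpg_cycles, flush_cycles,
                   live_entries_lost, refill_cycles):
    """Pick the method with the lower modelled cost (ties go to individual)."""
    single = individual_cost(num_pages, invlpg_cycles)
    full = full_flush_cost(flush_cycles, live_entries_lost, refill_cycles)
    return "individual" if single <= full else "full"


if __name__ == "__main__":
    # Made-up figures: 100 cycles per invlpg, 200 per full flush,
    # 30 cycles per refill, 64 unrelated live TLB entries.
    print(cheaper_method(4, 100, 200, 64, 30))
    print(cheaper_method(512, 100, 200, 64, 30))
    print(cheaper_method(4, 100, 200, 0, 30))   # empty TLB: no collateral
```

With these invented figures a small range favours individual invalidation, a large range favours the full flush, and an empty TLB favours the full flush even for a small range, because there is nothing to refill.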
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64) You can measure how expensive TLB refills are by using
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65) performance counters and 'perf stat', like this::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67)   perf stat \
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68)     -e cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/ \
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69)     -e cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/ \
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70)     -e cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/ \
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71)     -e cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/ \
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72)     -e cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/ \
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73)     -e cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75) That works on an IvyBridge-era CPU (i5-3320M).  Different CPUs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76) may have differently-named counters, but they should at least
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77) be there in some form.  You can use pmu-tools 'ocperf list'
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78) (https://github.com/andikleen/pmu-tools) to find the right
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79) counters for a given CPU.
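
Once the counters have been collected, one way to turn them into a refill cost is to divide each walk-duration count by the matching walk-completed count. The sketch below assumes, as the counter names suggest, that the duration counters report cycles spent in page walks and the completed counters report finished walks. The sample numbers are invented and the helper names are hypothetical.

```python
# Post-processing sketch for perf stat output; sample counts are invented.

def average_walk_cycles(walk_duration, walks_completed):
    """Average cycles per completed page walk (0.0 if no walks completed)."""
    if walks_completed == 0:
        return 0.0
    return walk_duration / walks_completed


def summarize(counts):
    """Map each TLB kind to its average walk cost.

    `counts` maps a kind such as "dtlb_load" to a
    (walk_duration, walks_completed) pair copied from perf stat.
    """
    return {kind: average_walk_cycles(duration, completed)
            for kind, (duration, completed) in counts.items()}


if __name__ == "__main__":
    sample = {                      # hypothetical values, not measurements
        "dtlb_load": (3000, 100),
        "dtlb_store": (1200, 50),
        "itlb": (0, 0),
    }
    for kind, avg in summarize(sample).items():
        print(f"{kind}: {avg:.1f} cycles per walk")
```

The resulting per-walk figure is the kind of refill cost that can be weighed against the cycles spent in invlpg when deciding whether to adjust the tunable.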
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81) .. [1] A footnote in Intel's SDM "4.10.4.2 Recommended Invalidation"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82)    says: "One execution of INVLPG is sufficient even for a page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83)    with size greater than 4 KBytes."