.. _numa:

Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com>

=============
What is NUMA?
=============

This question can be answered from a couple of perspectives: the
hardware view and the Linux software view.

From the hardware perspective, a NUMA system is a computer platform that
comprises multiple components or assemblies, each of which may contain 0
or more CPUs, local memory, and/or IO buses. For brevity and to
disambiguate the hardware view of these physical components/assemblies
from the software abstraction thereof, we'll call the components/assemblies
'cells' in this document.

Each of the 'cells' may be viewed as an SMP [symmetric multi-processor] subset
of the system--although some components necessary for a stand-alone SMP system
may not be populated on any given cell. The cells of the NUMA system are
connected together with some sort of system interconnect--e.g., crossbars and
point-to-point links are common types of NUMA system interconnect. Both of
these types of interconnect can be aggregated to create NUMA platforms with
cells at multiple distances from other cells.

For Linux, the NUMA platforms of interest are primarily what is known as Cache
Coherent NUMA or ccNUMA systems. With ccNUMA systems, all memory is visible
to and accessible from any CPU attached to any cell, and cache coherency
is handled in hardware by the processor caches and/or the system interconnect.

Memory access time and effective memory bandwidth vary depending on how far
away the cell containing the CPU or IO bus making the memory access is from the
cell containing the target memory. For example, access to memory by CPUs
attached to the same cell will experience faster access times and higher
bandwidths than accesses to memory on other, remote cells. NUMA platforms
can have cells at multiple remote distances from any given cell.

Platform vendors don't build NUMA systems just to make software developers'
lives interesting. Rather, this architecture is a means to provide scalable
memory bandwidth. However, to achieve scalable memory bandwidth, system and
application software must arrange for a large majority of the memory references
[cache misses] to be to "local" memory--memory on the same cell, if any--or
to the closest cell with memory.

This leads to the Linux software view of a NUMA system:

Linux divides the system's hardware resources into multiple software
abstractions called "nodes". Linux maps the nodes onto the physical cells
of the hardware platform, abstracting away some of the details for some
architectures. As with physical cells, software nodes may contain 0 or more
CPUs, memory and/or IO buses. And, again, accesses to memory on "closer"
nodes--nodes that map to closer cells--will generally experience faster
access times and higher effective bandwidth than accesses to more remote
cells.

For some architectures, such as x86, Linux will "hide" any node representing a
physical cell that has no memory attached, and reassign any CPUs attached to
that cell to a node representing a cell that does have memory. Thus, on
these architectures, one cannot assume that all CPUs that Linux associates with
a given node will see the same local memory access times and bandwidth.

In addition, for some architectures, again x86 is an example, Linux supports
the emulation of additional nodes. For NUMA emulation, Linux will carve up
the existing nodes--or the system memory for non-NUMA platforms--into multiple
nodes. Each emulated node will manage a fraction of the underlying cells'
physical memory. NUMA emulation is useful for testing NUMA kernel and
application features on non-NUMA platforms, and as a sort of memory resource
management mechanism when used together with cpusets.
[see Documentation/admin-guide/cgroup-v1/cpusets.rst]
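
As an illustration (the exact syntax is architecture specific and may vary
between kernel versions), x86 kernels built with NUMA emulation support
typically accept a boot parameter of the form::

    numa=fake=8

which splits the available memory into eight equally sized emulated nodes;
see the architecture's boot-parameter documentation for the variants it
accepts.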

For each node with memory, Linux constructs an independent memory management
subsystem, complete with its own free page lists, in-use page lists, usage
statistics and locks to mediate access. In addition, Linux constructs for
each memory zone [one or more of DMA, DMA32, NORMAL, HIGH_MEMORY, MOVABLE],
an ordered "zonelist". A zonelist specifies the zones/nodes to visit when a
selected zone/node cannot satisfy the allocation request. This situation,
when a zone has no available memory to satisfy a request, is called
"overflow" or "fallback".
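
As a rough sketch of what this per-node structure looks like from kernel
code (illustrative only, not an interface defined by this document), the
per-node ``pg_data_t`` returned by ``NODE_DATA()`` carries that node's zones,
which can be walked like this::

    /*
     * Sketch: print the populated zones of one node by walking its
     * pg_data_t. Assumes a NUMA-enabled kernel where NODE_DATA(),
     * populated_zone() and zone->name are available.
     */
    #include <linux/mmzone.h>
    #include <linux/printk.h>

    static void show_node_zones(int nid)
    {
        pg_data_t *pgdat = NODE_DATA(nid);  /* per-node memory management data */
        int i;

        for (i = 0; i < MAX_NR_ZONES; i++) {
            struct zone *zone = &pgdat->node_zones[i];

            if (!populated_zone(zone))      /* zone has no pages on this node */
                continue;
            pr_info("node %d: zone %-8s spans %lu pages\n",
                    nid, zone->name, zone->spanned_pages);
        }
    }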

Because some nodes contain multiple zones containing different types of
memory, Linux must decide whether to order the zonelists such that allocations
fall back to the same zone type on a different node, or to a different zone
type on the same node. This is an important consideration because some zones,
such as DMA or DMA32, represent relatively scarce resources. Linux chooses
a default node-ordered zonelist. This means it tries to fall back to other
zones from the same node before using remote nodes, which are ordered by NUMA
distance.

By default, Linux will attempt to satisfy memory allocation requests from the
node to which the CPU that executes the request is assigned. Specifically,
Linux will attempt to allocate from the first node in the appropriate zonelist
for the node where the request originates. This is called "local allocation."
If the "local" node cannot satisfy the request, the kernel will examine other
nodes' zones in the selected zonelist looking for the first zone in the list
that can satisfy the request.
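
In kernel code, this default is what an ordinary page allocation already
gets; an explicit node can also be named as the starting point. A minimal
sketch, assuming the standard page allocator helpers::

    /*
     * Illustrative only: both helpers walk the zonelists described above.
     * The first implicitly starts at the current CPU's node ("local
     * allocation"); the second names a node explicitly but may still fall
     * back to other nodes unless __GFP_THISNODE is also passed.
     */
    #include <linux/gfp.h>
    #include <linux/topology.h>

    static struct page *grab_local_page(void)
    {
        /* Default local allocation: start at the current CPU's node. */
        return alloc_pages(GFP_KERNEL, 0);
    }

    static struct page *grab_page_on_node(int nid)
    {
        /* Start the zonelist walk at an explicitly chosen node. */
        return alloc_pages_node(nid, GFP_KERNEL, 0);
    }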

Local allocation will tend to keep subsequent access to the allocated memory
"local" to the underlying physical resources and off the system interconnect--
as long as the task on whose behalf the kernel allocated some memory does not
later migrate away from that memory. The Linux scheduler is aware of the
NUMA topology of the platform--embodied in the "scheduling domains" data
structures [see Documentation/scheduler/sched-domains.rst]--and the scheduler
attempts to minimize task migration to distant scheduling domains. However,
the scheduler does not take a task's NUMA footprint into account directly.
Thus, under sufficient imbalance, tasks can migrate between nodes and end up
remote from both their initial node and the kernel data structures allocated
on their behalf.

System administrators and application designers can restrict a task's migration
to improve NUMA locality using various CPU affinity command line interfaces,
such as taskset(1) and numactl(1), and program interfaces such as
sched_setaffinity(2). Further, one can modify the kernel's default local
allocation behavior using Linux NUMA memory policy. [see
:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`].
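
For instance, a user-space program can pin itself to a chosen set of CPUs
before allocating and touching its working set. The following is only a
sketch of the sched_setaffinity(2) interface; the CPU numbers are
hypothetical and would normally be derived from the node topology (e.g.,
via numactl(1) or libnuma)::

    /*
     * Sketch: pin the calling thread to CPUs 0-3 (assumed, for this
     * example, to live on one node) so that subsequent local allocations
     * and accesses tend to stay on that node.
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;
        int cpu;

        CPU_ZERO(&set);
        for (cpu = 0; cpu < 4; cpu++)
            CPU_SET(cpu, &set);

        if (sched_setaffinity(0, sizeof(set), &set)) {  /* 0 == calling thread */
            perror("sched_setaffinity");
            return 1;
        }
        /* ... allocate and fault in memory here; it will tend to be local ... */
        return 0;
    }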

System administrators can restrict the CPUs and nodes' memories that a non-
privileged user can specify in the scheduling or NUMA commands and functions
using control groups and CPUsets. [see Documentation/admin-guide/cgroup-v1/cpusets.rst]

On architectures that do not hide memoryless nodes, Linux will include only
zones [nodes] with memory in the zonelists. This means that for a memoryless
node the "local memory node"--the node of the first zone in the CPU's node's
zonelist--will not be the node itself. Rather, it will be the node that the
kernel selected as the nearest node with memory when it built the zonelists.
So, default, "local" allocations will succeed, with the kernel supplying the
closest available memory. This is a consequence of the same mechanism that
allows such allocations to fall back to other nearby nodes when a node that
does contain memory overflows.

Some kernel allocations do not want or cannot tolerate this allocation fallback
behavior. Rather, they want to be sure they get memory from the specified node
or get notified that the node has no free memory. This is usually the case
when a subsystem allocates per-CPU memory resources, for example.

A typical model for making such an allocation is to obtain the node id of the
node to which the "current CPU" is attached using one of the kernel's
numa_node_id() or cpu_to_node() functions and then request memory from only
the node id returned. When such an allocation fails, the requesting subsystem
may revert to its own fallback path. The slab kernel memory allocator is an
example of this. Or, the subsystem may choose to disable or not to enable
itself on allocation failure. The kernel profiling subsystem is an example of
this.
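
A minimal sketch of this pattern (illustrative only; the helper name and
buffer are hypothetical)::

    /*
     * Sketch of the "allocate on my node or tell me it failed" pattern.
     * __GFP_THISNODE suppresses the zonelist fallback described above, so
     * a NULL return really means the requested node could not satisfy us.
     */
    #include <linux/gfp.h>
    #include <linux/slab.h>
    #include <linux/topology.h>

    static void *alloc_strictly_local(size_t size)
    {
        int nid = numa_node_id();   /* node of the current CPU */
        void *buf;

        buf = kmalloc_node(size, GFP_KERNEL | __GFP_THISNODE, nid);
        if (!buf) {
            /* Subsystem-specific policy: fall back, or disable the feature. */
            return NULL;
        }
        return buf;
    }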

If the architecture supports--does not hide--memoryless nodes, then CPUs
attached to memoryless nodes would always incur the fallback path overhead
or some subsystems would fail to initialize if they attempted to allocate
memory exclusively from a node without memory. To support such
architectures transparently, kernel subsystems can use the numa_mem_id()
or cpu_to_mem() function to locate the "local memory node" for the calling or
specified CPU. Again, this is the same node from which default, local page
allocations will be attempted.
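
As a sketch, a per-CPU setup path could use cpu_to_mem() rather than
cpu_to_node() so that CPUs on memoryless nodes are directed at their nearest
node with memory (the per-CPU buffer array here is hypothetical)::

    /*
     * Sketch: allocate one buffer per online CPU from that CPU's "local
     * memory node". cpu_to_mem() already accounts for memoryless nodes,
     * so no special casing is needed here.
     */
    #include <linux/cpumask.h>
    #include <linux/errno.h>
    #include <linux/slab.h>
    #include <linux/topology.h>

    static int setup_per_cpu_buffers(void **bufs, size_t size)
    {
        int cpu;

        for_each_online_cpu(cpu) {
            int nid = cpu_to_mem(cpu);  /* nearest node with memory */

            bufs[cpu] = kzalloc_node(size, GFP_KERNEL, nid);
            if (!bufs[cpu])
                return -ENOMEM;
        }
        return 0;
    }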