.. _numa:

Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com>

=============
What is NUMA?
=============

This question can be answered from a couple of perspectives: the
hardware view and the Linux software view.

From the hardware perspective, a NUMA system is a computer platform that
comprises multiple components or assemblies, each of which may contain 0
or more CPUs, local memory, and/or IO buses. For brevity and to
disambiguate the hardware view of these physical components/assemblies
from the software abstraction thereof, we'll call the components/assemblies
'cells' in this document.

Each of the 'cells' may be viewed as an SMP [symmetric multi-processor] subset
of the system--although some components necessary for a stand-alone SMP system
may not be populated on any given cell. The cells of the NUMA system are
connected together with some sort of system interconnect--e.g., crossbars and
point-to-point links are common types of NUMA system interconnect. Both of
these types of interconnect can be aggregated to create NUMA platforms with
cells at multiple distances from other cells.

For Linux, the NUMA platforms of interest are primarily what is known as Cache
Coherent NUMA or ccNUMA systems. With ccNUMA systems, all memory is visible
to and accessible from any CPU attached to any cell, and cache coherency
is handled in hardware by the processor caches and/or the system interconnect.

Memory access time and effective memory bandwidth vary depending on how far
away the cell containing the CPU or IO bus making the memory access is from the
cell containing the target memory. For example, access to memory by CPUs
attached to the same cell will experience faster access times and higher
bandwidths than accesses to memory on other, remote cells. NUMA platforms
can have cells at multiple remote distances from any given cell.

Platform vendors don't build NUMA systems just to make software developers'
lives interesting. Rather, this architecture is a means to provide scalable
memory bandwidth. However, to achieve scalable memory bandwidth, system and
application software must arrange for a large majority of the memory references
[cache misses] to be to "local" memory--memory on the same cell, if any--or
to the closest cell with memory.

This leads to the Linux software view of a NUMA system:

Linux divides the system's hardware resources into multiple software
abstractions called "nodes". Linux maps the nodes onto the physical cells
of the hardware platform, abstracting away some of the details for some
architectures. As with physical cells, software nodes may contain 0 or more
CPUs, memory and/or IO buses. And, again, accesses to memory on "closer"
nodes--nodes that map to closer cells--will generally experience faster
access times and higher effective bandwidth than accesses to more remote
cells.

For some architectures, such as x86, Linux will "hide" any node representing a
physical cell that has no memory attached, and reassign any CPUs attached to
that cell to a node representing a cell that does have memory. Thus, on
these architectures, one cannot assume that all CPUs that Linux associates with
a given node will see the same local memory access times and bandwidth.

In addition, for some architectures, again x86 is an example, Linux supports
the emulation of additional nodes. For NUMA emulation, Linux will carve up
the existing nodes--or the system memory for non-NUMA platforms--into multiple
nodes. Each emulated node will manage a fraction of the underlying cells'
physical memory. NUMA emulation is useful for testing NUMA kernel and
application features on non-NUMA platforms, and as a sort of memory resource
management mechanism when used together with cpusets.
[see Documentation/admin-guide/cgroup-v1/cpusets.rst]
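
As an illustration (the exact syntax is architecture specific and may vary
between kernel versions), x86 kernels built with NUMA emulation support
typically accept a boot parameter of the form::

    numa=fake=8

which splits the available memory into eight equally sized emulated nodes;
see the architecture's boot-parameter documentation for the variants it
accepts.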

For each node with memory, Linux constructs an independent memory management
subsystem, complete with its own free page lists, in-use page lists, usage
statistics and locks to mediate access. In addition, Linux constructs for
each memory zone [one or more of DMA, DMA32, NORMAL, HIGH_MEMORY, MOVABLE],
an ordered "zonelist". A zonelist specifies the zones/nodes to visit when a
selected zone/node cannot satisfy the allocation request. This situation,
when a zone has no available memory to satisfy a request, is called
"overflow" or "fallback".
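
As a rough sketch of what this per-node structure looks like from kernel
code (illustrative only, not an interface defined by this document), the
per-node ``pg_data_t`` returned by ``NODE_DATA()`` carries that node's zones,
which can be walked like this::

    /*
     * Sketch: print the populated zones of one node by walking its
     * pg_data_t. Assumes a NUMA-enabled kernel where NODE_DATA(),
     * populated_zone() and zone->name are available.
     */
    #include <linux/mmzone.h>
    #include <linux/printk.h>

    static void show_node_zones(int nid)
    {
        pg_data_t *pgdat = NODE_DATA(nid);  /* per-node memory management data */
        int i;

        for (i = 0; i < MAX_NR_ZONES; i++) {
            struct zone *zone = &pgdat->node_zones[i];

            if (!populated_zone(zone))      /* zone has no pages on this node */
                continue;
            pr_info("node %d: zone %-8s spans %lu pages\n",
                    nid, zone->name, zone->spanned_pages);
        }
    }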

Because some nodes contain multiple zones containing different types of
memory, Linux must decide whether to order the zonelists such that allocations
fall back to the same zone type on a different node, or to a different zone
type on the same node. This is an important consideration because some zones,
such as DMA or DMA32, represent relatively scarce resources. Linux chooses
a default node-ordered zonelist. This means it tries to fall back to other
zones from the same node before using remote nodes, which are ordered by NUMA
distance.

By default, Linux will attempt to satisfy memory allocation requests from the
node to which the CPU that executes the request is assigned. Specifically,
Linux will attempt to allocate from the first node in the appropriate zonelist
for the node where the request originates. This is called "local allocation."
If the "local" node cannot satisfy the request, the kernel will examine other
nodes' zones in the selected zonelist looking for the first zone in the list
that can satisfy the request.
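
In kernel code, this default is what an ordinary page allocation already
gets; an explicit node can also be named as the starting point. A minimal
sketch, assuming the standard page allocator helpers::

    /*
     * Illustrative only: both helpers walk the zonelists described above.
     * The first implicitly starts at the current CPU's node ("local
     * allocation"); the second names a node explicitly but may still fall
     * back to other nodes unless __GFP_THISNODE is also passed.
     */
    #include <linux/gfp.h>
    #include <linux/topology.h>

    static struct page *grab_local_page(void)
    {
        /* Default local allocation: start at the current CPU's node. */
        return alloc_pages(GFP_KERNEL, 0);
    }

    static struct page *grab_page_on_node(int nid)
    {
        /* Start the zonelist walk at an explicitly chosen node. */
        return alloc_pages_node(nid, GFP_KERNEL, 0);
    }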

Local allocation will tend to keep subsequent access to the allocated memory
"local" to the underlying physical resources and off the system interconnect--
as long as the task on whose behalf the kernel allocated some memory does not
later migrate away from that memory. The Linux scheduler is aware of the
NUMA topology of the platform--embodied in the "scheduling domains" data
structures [see Documentation/scheduler/sched-domains.rst]--and the scheduler
attempts to minimize task migration to distant scheduling domains. However,
the scheduler does not take a task's NUMA footprint into account directly.
Thus, under sufficient imbalance, tasks can migrate between nodes and end up
remote from both their initial node and the kernel data structures allocated
on their behalf.

System administrators and application designers can restrict a task's migration
to improve NUMA locality using various CPU affinity command line interfaces,
such as taskset(1) and numactl(1), and program interfaces such as
sched_setaffinity(2). Further, one can modify the kernel's default local
allocation behavior using Linux NUMA memory policy. [see
:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`].
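
For instance, a user-space program can pin itself to a chosen set of CPUs
before allocating and touching its working set. The following is only a
sketch of the sched_setaffinity(2) interface; the CPU numbers are
hypothetical and would normally be derived from the node topology (e.g.,
via numactl(1) or libnuma)::

    /*
     * Sketch: pin the calling thread to CPUs 0-3 (assumed, for this
     * example, to live on one node) so that subsequent local allocations
     * and accesses tend to stay on that node.
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;
        int cpu;

        CPU_ZERO(&set);
        for (cpu = 0; cpu < 4; cpu++)
            CPU_SET(cpu, &set);

        if (sched_setaffinity(0, sizeof(set), &set)) {  /* 0 == calling thread */
            perror("sched_setaffinity");
            return 1;
        }
        /* ... allocate and fault in memory here; it will tend to be local ... */
        return 0;
    }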

System administrators can restrict the CPUs and nodes' memories that a non-
privileged user can specify in the scheduling or NUMA commands and functions
using control groups and CPUsets. [see Documentation/admin-guide/cgroup-v1/cpusets.rst]

On architectures that do not hide memoryless nodes, Linux will include only
zones [nodes] with memory in the zonelists. This means that for a memoryless
node the "local memory node"--the node of the first zone in the CPU's node's
zonelist--will not be the node itself. Rather, it will be the node that the
kernel selected as the nearest node with memory when it built the zonelists.
So, default, "local" allocations will succeed, with the kernel supplying the
closest available memory. This is a consequence of the same mechanism that
allows such allocations to fall back to other nearby nodes when a node that
does contain memory overflows.

Some kernel allocations do not want or cannot tolerate this allocation fallback
behavior. Rather, they want to be sure they get memory from the specified node
or get notified that the node has no free memory. This is usually the case
when a subsystem allocates per-CPU memory resources, for example.

A typical model for making such an allocation is to obtain the node id of the
node to which the "current CPU" is attached using one of the kernel's
numa_node_id() or cpu_to_node() functions and then request memory from only
the node id returned. When such an allocation fails, the requesting subsystem
may revert to its own fallback path. The slab kernel memory allocator is an
example of this. Or, the subsystem may choose to disable or not to enable
itself on allocation failure. The kernel profiling subsystem is an example of
this.
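
A minimal sketch of this pattern (illustrative only; the helper name and
buffer are hypothetical)::

    /*
     * Sketch of the "allocate on my node or tell me it failed" pattern.
     * __GFP_THISNODE suppresses the zonelist fallback described above, so
     * a NULL return really means the requested node could not satisfy us.
     */
    #include <linux/gfp.h>
    #include <linux/slab.h>
    #include <linux/topology.h>

    static void *alloc_strictly_local(size_t size)
    {
        int nid = numa_node_id();   /* node of the current CPU */
        void *buf;

        buf = kmalloc_node(size, GFP_KERNEL | __GFP_THISNODE, nid);
        if (!buf) {
            /* Subsystem-specific policy: fall back, or disable the feature. */
            return NULL;
        }
        return buf;
    }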

If the architecture supports--does not hide--memoryless nodes, then CPUs
attached to memoryless nodes would always incur the fallback path overhead
or some subsystems would fail to initialize if they attempted to allocate
memory exclusively from a node without memory. To support such
architectures transparently, kernel subsystems can use the numa_mem_id()
or cpu_to_mem() function to locate the "local memory node" for the calling or
specified CPU. Again, this is the same node from which default, local page
allocations will be attempted.
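
As a sketch, a per-CPU setup path could use cpu_to_mem() rather than
cpu_to_node() so that CPUs on memoryless nodes are directed at their nearest
node with memory (the per-CPU buffer array here is hypothetical)::

    /*
     * Sketch: allocate one buffer per online CPU from that CPU's "local
     * memory node". cpu_to_mem() already accounts for memoryless nodes,
     * so no special casing is needed here.
     */
    #include <linux/cpumask.h>
    #include <linux/errno.h>
    #include <linux/slab.h>
    #include <linux/topology.h>

    static int setup_per_cpu_buffers(void **bufs, size_t size)
    {
        int cpu;

        for_each_online_cpu(cpu) {
            int nid = cpu_to_mem(cpu);  /* nearest node with memory */

            bufs[cpu] = kzalloc_node(size, GFP_KERNEL, nid);
            if (!bufs[cpu])
                return -ENOMEM;
        }
        return 0;
    }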