.. SPDX-License-Identifier: GPL-2.0

=====================================
Scaling in the Linux Networking Stack
=====================================


Introduction
============

This document describes a set of complementary techniques in the Linux
networking stack to increase parallelism and improve performance for
multi-processor systems.

The following technologies are described:

- RSS: Receive Side Scaling
- RPS: Receive Packet Steering
- RFS: Receive Flow Steering
- Accelerated Receive Flow Steering
- XPS: Transmit Packet Steering


RSS: Receive Side Scaling
=========================

Contemporary NICs support multiple receive and transmit descriptor queues
(multi-queue). On reception, a NIC can send different packets to different
queues to distribute processing among CPUs. The NIC distributes packets by
applying a filter to each packet that assigns it to one of a small number
of logical flows. Packets for each flow are steered to a separate receive
queue, which in turn can be processed by separate CPUs. This mechanism is
generally known as “Receive-side Scaling” (RSS). The goal of RSS and
the other scaling techniques is to increase performance uniformly.
Multi-queue distribution can also be used for traffic prioritization, but
that is not the focus of these techniques.

The filter used in RSS is typically a hash function over the network
and/or transport layer headers -- for example, a 4-tuple hash over
IP addresses and TCP ports of a packet. The most common hardware
implementation of RSS uses a 128-entry indirection table where each entry
stores a queue number. The receive queue for a packet is determined
by masking out the low order seven bits of the computed hash for the
packet (usually a Toeplitz hash), taking this number as a key into the
indirection table and reading the corresponding value.

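As a purely illustrative sketch (the hash value below is made up, not taken
from any driver), the indirection lookup amounts to keeping the low seven
bits of the hash and using them as a table index::

  # hypothetical 32-bit Toeplitz hash reported for a packet
  rxhash=0x5a3c17b7
  # the low order seven bits select one of the 128 indirection table entries
  echo $(( rxhash & 0x7f ))   # prints 55; entry 55 holds the RX queue number
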
Some advanced NICs allow steering packets to queues based on
programmable filters. For example, packets bound for a webserver on TCP
port 80 can be directed to their own receive queue. Such “n-tuple” filters
can be configured from ethtool (--config-ntuple).

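For instance, assuming a hypothetical interface eth0 whose driver supports
n-tuple filtering, a rule steering webserver traffic to receive queue 2
might look like this (the interface name and queue number are illustrative
only)::

  # direct IPv4 TCP packets with destination port 80 to RX queue 2
  ethtool --config-ntuple eth0 flow-type tcp4 dst-port 80 action 2
  # list the currently installed n-tuple filters
  ethtool --show-ntuple eth0
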

RSS Configuration
-----------------

The driver for a multi-queue capable NIC typically provides a kernel
module parameter for specifying the number of hardware queues to
configure. In the bnx2x driver, for instance, this parameter is called
num_queues. A typical RSS configuration would be to have one receive queue
for each CPU if the device supports enough queues, or otherwise at least
one for each memory domain, where a memory domain is a set of CPUs that
share a particular memory level (L1, L2, NUMA node, etc.).

The indirection table of an RSS device, which resolves a queue by masked
hash, is usually programmed by the driver at initialization. The
default mapping is to distribute the queues evenly in the table, but the
indirection table can be retrieved and modified at runtime using ethtool
commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the
indirection table could be done to give different queues different
relative weights.

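As a hedged example (the interface name and weights are hypothetical), the
indirection table of a two-queue device could be inspected and reweighted as
follows::

  # show the current hash key and indirection table
  ethtool --show-rxfh-indir eth0
  # spread the table over queues 0 and 1 with a 3:1 weighting
  ethtool --set-rxfh-indir eth0 weight 3 1
  # restore the default even distribution across all queues
  ethtool --set-rxfh-indir eth0 default
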

RSS IRQ Configuration
~~~~~~~~~~~~~~~~~~~~~

Each receive queue has a separate IRQ associated with it. The NIC triggers
this to notify a CPU when new packets arrive on the given queue. The
signaling path for PCIe devices uses message signaled interrupts (MSI-X),
which can route each interrupt to a particular CPU. The active mapping
of queues to IRQs can be determined from /proc/interrupts. By default,
an IRQ may be handled on any CPU. Because a non-negligible part of packet
processing takes place in receive interrupt handling, it is advantageous
to spread receive interrupts between CPUs. To manually adjust the IRQ
affinity of each interrupt, see Documentation/core-api/irq/irq-affinity.rst. Some systems
will be running irqbalance, a daemon that dynamically optimizes IRQ
assignments and as a result may override any manual settings.

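A hedged illustration (the interface name and IRQ number are made up; look
up the real IRQ numbers for your device in /proc/interrupts first)::

  # find the IRQs used by the NIC's receive queues
  grep eth0 /proc/interrupts
  # pin the (hypothetical) IRQ 41 of queue rx-0 to CPU 2 (bitmap 0x4)
  echo 4 > /proc/irq/41/smp_affinity
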

Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

RSS should be enabled when latency is a concern or whenever receive
interrupt processing forms a bottleneck. Spreading load between CPUs
decreases queue length. For low latency networking, the optimal setting
is to allocate as many queues as there are CPUs in the system (or the
NIC maximum, if lower). The most efficient high-rate configuration
is likely the one with the smallest number of receive queues where no
receive queue overflows due to a saturated CPU, because in default
mode with interrupt coalescing enabled, the aggregate number of
interrupts (and thus work) grows with each additional queue.

Per-cpu load can be observed using the mpstat utility, but note that on
processors with hyperthreading (HT), each hyperthread is represented as
a separate CPU. For interrupt handling, HT has shown no benefit in
initial tests, so limit the number of queues to the number of CPU cores
in the system.

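For example (hedged: the interface name is hypothetical and whether the
"combined" channel type applies depends on the driver), per-CPU load and the
queue count can be checked and adjusted like this::

  # observe per-CPU utilization once per second
  mpstat -P ALL 1
  # query how many queues the NIC currently exposes
  ethtool --show-channels eth0
  # limit the device to 4 combined rx/tx queues, e.g. one per physical core
  ethtool --set-channels eth0 combined 4
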

RPS: Receive Packet Steering
============================

Receive Packet Steering (RPS) is logically a software implementation of
RSS. Being in software, it is necessarily called later in the datapath.
Whereas RSS selects the queue and hence CPU that will run the hardware
interrupt handler, RPS selects the CPU to perform protocol processing
above the interrupt handler. This is accomplished by placing the packet
on the desired CPU’s backlog queue and waking up the CPU for processing.
RPS has some advantages over RSS:

1) it can be used with any NIC
2) software filters can easily be added to hash over new protocols
3) it does not increase hardware device interrupt rate (although it does
   introduce inter-processor interrupts (IPIs))

RPS is called during the bottom half of the receive interrupt handler, when
a driver sends a packet up the network stack with netif_rx() or
netif_receive_skb(). These call the get_rps_cpu() function, which
selects the CPU (and hence backlog queue) that should process the packet.

The first step in determining the target CPU for RPS is to calculate a
flow hash over the packet’s addresses or ports (2-tuple or 4-tuple hash
depending on the protocol). This serves as a consistent hash of the
associated flow of the packet. The hash is either provided by hardware
or will be computed in the stack. Capable hardware can pass the hash in
the receive descriptor for the packet; this would usually be the same
hash used for RSS (e.g. computed Toeplitz hash). The hash is saved in
skb->hash and can be used elsewhere in the stack as a hash of the
packet’s flow.

Each receive hardware queue has an associated list of CPUs to which
RPS may enqueue packets for processing. For each received packet,
an index into the list is computed from the flow hash modulo the size
of the list. The indexed CPU is the target for processing the packet,
and the packet is queued to the tail of that CPU’s backlog queue. At
the end of the bottom half routine, IPIs are sent to any CPUs for which
packets have been queued to their backlog queue. The IPI wakes backlog
processing on the remote CPU, and any queued packets are then processed
up the networking stack.


RPS Configuration
-----------------

RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on
by default for SMP). Even when compiled in, RPS remains disabled until
explicitly configured. The list of CPUs to which RPS may forward traffic
can be configured for each receive queue using a sysfs file entry::

  /sys/class/net/<dev>/queues/rx-<n>/rps_cpus

This file implements a bitmap of CPUs. RPS is disabled when it is zero
(the default), in which case packets are processed on the interrupting
CPU. Documentation/core-api/irq/irq-affinity.rst explains how CPUs are assigned to
the bitmap.

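A minimal sketch of enabling RPS on one queue (the interface name and CPU
mask are illustrative; the mask is a hex bitmap of CPUs)::

  # allow RPS to steer packets from rx-0 of eth0 to CPUs 0-3 (mask 0xf)
  echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus
  # writing zero back disables RPS for that queue
  echo 0 > /sys/class/net/eth0/queues/rx-0/rps_cpus
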

Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

For a single queue device, a typical RPS configuration would be to set
rps_cpus to the CPUs in the same memory domain as the interrupting
CPU. If NUMA locality is not an issue, this could also be all CPUs in
the system. At high interrupt rate, it might be wise to exclude the
interrupting CPU from the map since it already performs much work.

For a multi-queue system, if RSS is configured so that a hardware
receive queue is mapped to each CPU, then RPS is probably redundant
and unnecessary. If there are fewer hardware queues than CPUs, then
RPS might be beneficial if the rps_cpus for each queue are the ones that
share the same memory domain as the interrupting CPU for that queue.

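To make the advice above concrete (a hypothetical four-CPU system with a
single queue device, where CPU 0 takes the receive interrupt), the
interrupting CPU could be excluded from the steering mask::

  # CPUs 1-3 only: binary 1110 -> hex mask "e"
  echo e > /sys/class/net/eth0/queues/rx-0/rps_cpus
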

RPS Flow Limit
--------------

RPS scales kernel receive processing across CPUs without introducing
reordering. The trade-off to sending all packets from the same flow
to the same CPU is CPU load imbalance if flows vary in packet rate.
In the extreme case a single flow dominates traffic. Especially on
common server workloads with many concurrent connections, such
behavior indicates a problem such as a misconfiguration or spoofed
source Denial of Service attack.

Flow Limit is an optional RPS feature that prioritizes small flows
during CPU contention by dropping packets from large flows slightly
ahead of those from small flows. It is active only when an RPS or RFS
destination CPU approaches saturation. Once a CPU's input packet
queue exceeds half the maximum queue length (as set by sysctl
net.core.netdev_max_backlog), the kernel starts a per-flow packet
count over the last 256 packets. If a flow exceeds a set ratio (by
default, half) of these packets when a new packet arrives, then the
new packet is dropped. Packets from other flows are still only
dropped once the input packet queue reaches netdev_max_backlog.
No packets are dropped when the input packet queue length is below
the threshold, so flow limit does not sever connections outright:
even large flows maintain connectivity.


Interface
~~~~~~~~~

Flow limit is compiled in by default (CONFIG_NET_FLOW_LIMIT), but not
turned on. It is implemented for each CPU independently (to avoid lock
and cache contention) and toggled per CPU by setting the relevant bit
in sysctl net.core.flow_limit_cpu_bitmap. It exposes the same CPU
bitmap interface as rps_cpus (see above) when called from procfs::

  /proc/sys/net/core/flow_limit_cpu_bitmap

Per-flow rate is calculated by hashing each packet into a hashtable
bucket and incrementing a per-bucket counter. The hash function is
the same one that selects a CPU in RPS, but as the number of buckets can
be much larger than the number of CPUs, flow limit has finer-grained
identification of large flows and fewer false positives. The default
table has 4096 buckets. This value can be modified through sysctl::

  net.core.flow_limit_table_len

The value is only consulted when a new table is allocated. Modifying
it does not update active tables.

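A sketch of turning the feature on (the CPU mask and table size below are
illustrative values, not recommendations from this document)::

  # enable flow limit on CPUs 0-3, which handle rx interrupts in this example
  echo f > /proc/sys/net/core/flow_limit_cpu_bitmap
  # enlarge the per-flow counting table; only newly allocated tables are affected
  sysctl -w net.core.flow_limit_table_len=8192
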

Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

Flow limit is useful on systems with many concurrent connections,
where a single connection taking up 50% of a CPU indicates a problem.
In such environments, enable the feature on all CPUs that handle
network rx interrupts (as set in /proc/irq/N/smp_affinity).

The feature depends on the input packet queue length exceeding
the flow limit threshold (50%) plus the flow history length (256).
Setting net.core.netdev_max_backlog to either 1000 or 10000
performed well in experiments.

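For instance (a hedged example, not a tuning recommendation), the backlog
can be raised via sysctl so that the flow limit threshold can actually be
reached::

  # allow up to 10000 packets in a CPU's input queue
  sysctl -w net.core.netdev_max_backlog=10000
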

RFS: Receive Flow Steering
==========================

While RPS steers packets solely based on hash, and thus generally
provides good load distribution, it does not take application locality
into account. That is the role of Receive Flow Steering (RFS). The goal
of RFS is to increase the data cache hit rate by steering
kernel processing of packets to the CPU where the application thread
consuming the packet is running. RFS relies on the same RPS mechanisms
to enqueue packets onto the backlog of another CPU and to wake up that
CPU.

In RFS, packets are not forwarded directly by the value of their hash,
but the hash is used as an index into a flow lookup table. This table maps
flows to the CPUs where those flows are being processed. The flow hash
(see RPS section above) is used to calculate the index into this table.
The CPU recorded in each entry is the one which last processed the flow.
If an entry does not hold a valid CPU, then packets mapped to that entry
are steered using plain RPS. Multiple table entries may point to the
same CPU. Indeed, with many flows and few CPUs, it is very likely that
a single application thread handles flows with many different flow hashes.

rps_sock_flow_table is a global flow table that contains the *desired* CPU
for flows: the CPU that is currently processing the flow in userspace.
Each table value is a CPU index that is updated during calls to recvmsg
and sendmsg (specifically, inet_recvmsg(), inet_sendmsg(), inet_sendpage()
and tcp_splice_read()).

When the scheduler moves a thread to a new CPU while it has outstanding
receive packets on the old CPU, packets may arrive out of order. To
avoid this, RFS uses a second flow table to track outstanding packets
for each flow: rps_dev_flow_table is a table specific to each hardware
receive queue of each device. Each table value stores a CPU index and a
counter. The CPU index represents the *current* CPU onto which packets
for this flow are enqueued for further kernel processing. Ideally, kernel
and userspace processing occur on the same CPU, and hence the CPU index
in both tables is identical. This is likely false if the scheduler has
recently migrated a userspace thread while the kernel still has packets
enqueued for kernel processing on the old CPU.

The counter in rps_dev_flow_table values records the length of the current
CPU's backlog when a packet in this flow was last enqueued. Each backlog
queue has a head counter that is incremented on dequeue. A tail counter
is computed as head counter + queue length. In other words, the counter
in rps_dev_flow[i] records the last element in flow i that has
been enqueued onto the currently designated CPU for flow i (of course,
entry i is actually selected by hash and multiple flows may hash to the
same entry i).

And now the trick for avoiding out of order packets: when selecting the
CPU for packet processing (from get_rps_cpu()) the rps_sock_flow table
and the rps_dev_flow table of the queue that the packet was received on
are compared. If the desired CPU for the flow (found in the
rps_sock_flow table) matches the current CPU (found in the rps_dev_flow
table), the packet is enqueued onto that CPU’s backlog. If they differ,
the current CPU is updated to match the desired CPU if one of the
following is true:

  - The current CPU's queue head counter >= the recorded tail counter
    value in rps_dev_flow[i]
  - The current CPU is unset (>= nr_cpu_ids)
  - The current CPU is offline

After this check, the packet is sent to the (possibly updated) current
CPU. These rules aim to ensure that a flow only moves to a new CPU when
there are no packets outstanding on the old CPU, as the outstanding
packets could arrive later than those about to be processed on the new
CPU.


RFS Configuration
-----------------

RFS is only available if the kconfig symbol CONFIG_RPS is enabled (on
by default for SMP). The functionality remains disabled until explicitly
configured. The number of entries in the global flow table is set through::

  /proc/sys/net/core/rps_sock_flow_entries

The number of entries in the per-queue flow table is set through::

  /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

Both of these need to be set before RFS is enabled for a receive queue.
Values for both are rounded up to the nearest power of two. The
suggested flow count depends on the expected number of active connections
at any given time, which may be significantly less than the number of open
connections. We have found that a value of 32768 for rps_sock_flow_entries
works fairly well on a moderately loaded server.

For a single queue device, the rps_flow_cnt value for the single queue
would normally be configured to the same value as rps_sock_flow_entries.
For a multi-queue device, the rps_flow_cnt for each queue might be
configured as rps_sock_flow_entries / N, where N is the number of
queues. So for instance, if rps_sock_flow_entries is set to 32768 and there
are 16 configured receive queues, rps_flow_cnt for each queue might be
configured as 2048.

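Putting the arithmetic above into commands (hedged: eth0 and the 16-queue
assumption are illustrative)::

  # global flow table: 32768 entries
  echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
  # per-queue tables: 32768 / 16 = 2048 entries each
  for rxq in /sys/class/net/eth0/queues/rx-*; do
      echo 2048 > "$rxq/rps_flow_cnt"
  done
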

Accelerated RFS
===============

Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load
balancing mechanism that uses soft state to steer flows based on where
the application thread consuming the packets of each flow is running.
Accelerated RFS should perform better than RFS since packets are sent
directly to a CPU local to the thread consuming the data. The target CPU
will either be the same CPU where the application runs, or at least a CPU
which is local to the application thread’s CPU in the cache hierarchy.

To enable accelerated RFS, the networking stack calls the
ndo_rx_flow_steer driver function to communicate the desired hardware
queue for packets matching a particular flow. The network stack
automatically calls this function every time a flow entry in
rps_dev_flow_table is updated. The driver in turn uses a device-specific
method to program the NIC to steer the packets.

The hardware queue for a flow is derived from the CPU recorded in
rps_dev_flow_table. The stack consults a CPU-to-hardware-queue map which
is maintained by the NIC driver. This is an auto-generated reverse map of
the IRQ affinity table shown by /proc/interrupts. Drivers can use
functions in the cpu_rmap (“CPU affinity reverse map”) kernel library
to populate the map. For each CPU, the corresponding queue in the map is
set to be one whose processing CPU is closest in cache locality.


Accelerated RFS Configuration
-----------------------------

Accelerated RFS is only available if the kernel is compiled with
CONFIG_RFS_ACCEL and support is provided by the NIC device and driver.
It also requires that ntuple filtering is enabled via ethtool. The map
of CPU to queues is automatically deduced from the IRQ affinities
configured for each receive queue by the driver, so no additional
configuration should be necessary.

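A hedged checklist for verifying the prerequisites (the interface name is
illustrative and the kernel config file location varies by distribution)::

  # confirm the kernel was built with accelerated RFS support
  grep CONFIG_RFS_ACCEL /boot/config-$(uname -r)
  # turn on n-tuple filtering, which accelerated RFS requires
  ethtool -K eth0 ntuple on
  # verify the feature is now reported as "on"
  ethtool -k eth0 | grep ntuple
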

Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

This technique should be enabled whenever one wants to use RFS and the
NIC supports hardware acceleration.


XPS: Transmit Packet Steering
=============================

Transmit Packet Steering is a mechanism for intelligently selecting
which transmit queue to use when transmitting a packet on a multi-queue
device. This can be accomplished by recording two kinds of maps, either
a mapping of CPU to hardware queue(s) or a mapping of receive queue(s)
to hardware transmit queue(s).

1. XPS using CPUs map

The goal of this mapping is usually to assign queues
exclusively to a subset of CPUs, where the transmit completions for
these queues are processed on a CPU within this set. This choice
provides two benefits. First, contention on the device queue lock is
significantly reduced since fewer CPUs contend for the same queue
(contention can be eliminated completely if each CPU has its own
transmit queue). Secondly, cache miss rate on transmit completion is
reduced, in particular for data cache lines that hold the sk_buff
structures.

2. XPS using receive queues map

This mapping is used to pick the transmit queue based on the receive
queue(s) map configuration set by the administrator. A set of receive
queues can be mapped to a set of transmit queues (many:many), although
the common use case is a 1:1 mapping. This enables sending packets
on the same queue associations for transmit and receive. This is useful for
busy polling multi-threaded workloads where there are challenges in
associating a given CPU to a given application thread. The application
threads are not pinned to CPUs and each thread handles packets
received on a single queue. The receive queue number is cached in the
socket for the connection. In this model, sending the packets on the same
transmit queue corresponding to the associated receive queue has benefits
in keeping the CPU overhead low. Transmit completion work is locked into
the same queue-association that a given application is polling on. This
avoids the overhead of triggering an interrupt on another CPU. When the
application cleans up the packets during the busy poll, transmit completion
may be processed along with it in the same thread context and so result in
reduced latency.

XPS is configured per transmit queue by setting a bitmap of
CPUs/receive-queues that may use that queue to transmit. The reverse
mapping, from CPUs to transmit queues or from receive-queues to transmit
queues, is computed and maintained for each network device. When
transmitting the first packet in a flow, the function get_xps_queue() is
called to select a queue. This function uses the ID of the receive queue
for the socket connection for a match in the receive queue-to-transmit queue
lookup table. Alternatively, this function can also use the ID of the
running CPU as a key into the CPU-to-queue lookup table. If the
ID matches a single queue, that is used for transmission. If multiple
queues match, one is selected by using the flow hash to compute an index
into the set. When selecting the transmit queue based on receive queue(s)
map, the transmit device is not validated against the receive device as that
would require an expensive lookup operation in the datapath.

The queue chosen for transmitting a particular flow is saved in the
corresponding socket structure for the flow (e.g. a TCP connection).
This transmit queue is used for subsequent packets sent on the flow to
prevent out of order (ooo) packets. The choice also amortizes the cost
of calling get_xps_queue() over all packets in the flow. To avoid
ooo packets, the queue for a flow can subsequently only be changed if
skb->ooo_okay is set for a packet in the flow. This flag indicates that
there are no outstanding packets in the flow, so the transmit queue can
change without the risk of generating out of order packets. The
transport layer is responsible for setting ooo_okay appropriately. TCP,
for instance, sets the flag when all data for a connection has been
acknowledged.

XPS Configuration
-----------------

XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
default for SMP). If compiled in, it is driver dependent whether, and
how, XPS is configured at device init. The mapping of CPUs/receive-queues
to transmit queue can be inspected and configured using sysfs:

For selection based on CPUs map::

  /sys/class/net/<dev>/queues/tx-<n>/xps_cpus

For selection based on receive-queues map::

  /sys/class/net/<dev>/queues/tx-<n>/xps_rxqs

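A brief sketch of both modes (the interface, queue, and mask values are
illustrative)::

  # CPUs map: let only CPUs 0-1 (mask 0x3) transmit on tx-0
  echo 3 > /sys/class/net/eth0/queues/tx-0/xps_cpus
  # receive-queues map: pair tx-0 with rx-0 (bit 0 of the rx-queue bitmap)
  echo 1 > /sys/class/net/eth0/queues/tx-0/xps_rxqs
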

Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

For a network device with a single transmission queue, XPS configuration
has no effect, since there is no choice in this case. In a multi-queue
system, XPS is preferably configured so that each CPU maps onto one queue.
If there are as many queues as there are CPUs in the system, then each
queue can also map onto one CPU, resulting in exclusive pairings that
experience no contention. If there are fewer queues than CPUs, then the
best CPUs to share a given queue are probably those that share the cache
with the CPU that processes transmit completions for that queue
(transmit interrupts).

For transmit queue selection based on receive queue(s), XPS has to be
explicitly configured by mapping receive-queue(s) to transmit queue(s). If the
user configuration for receive-queue map does not apply, then the transmit
queue is selected based on the CPUs map.

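As a sketch of the "one CPU per queue" advice (assuming a hypothetical
4-CPU, 4-queue eth0)::

  # give CPU i exclusive use of transmit queue tx-i
  for i in 0 1 2 3; do
      printf '%x' $((1 << i)) > /sys/class/net/eth0/queues/tx-$i/xps_cpus
  done
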

Per TX Queue rate limitation
============================

This is a rate-limitation mechanism implemented by the hardware; currently
only a max-rate attribute is supported, configured by writing a Mbps value to::

  /sys/class/net/<dev>/queues/tx-<n>/tx_maxrate

A value of zero means disabled, and this is the default.

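For example (hypothetical interface and rate; the value is interpreted in
Mbps)::

  # cap tx-0 of eth0 at 100 Mbps
  echo 100 > /sys/class/net/eth0/queues/tx-0/tx_maxrate
  # remove the cap again
  echo 0 > /sys/class/net/eth0/queues/tx-0/tx_maxrate
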

Further Information
===================
RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into
2.6.38. Original patches were submitted by Tom Herbert
(therbert@google.com).

Accelerated RFS was introduced in 2.6.35. Original patches were
submitted by Ben Hutchings (bwh@kernel.org).

Authors:

- Tom Herbert (therbert@google.com)
- Willem de Bruijn (willemb@google.com)