^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) .. SPDX-License-Identifier: GPL-2.0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) =================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4) Checksum Offloads
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) =================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8) Introduction
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9) ============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11) This document describes a set of techniques in the Linux networking stack to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12) take advantage of checksum offload capabilities of various NICs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14) The following technologies are described:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16) * TX Checksum Offload
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17) * LCO: Local Checksum Offload
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18) * RCO: Remote Checksum Offload
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20) Things that should be documented here but aren't yet:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22) * RX Checksum Offload
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) * CHECKSUM_UNNECESSARY conversion
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26) TX Checksum Offload
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27) ===================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29) The interface for offloading a transmit checksum to a device is explained in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30) detail in comments near the top of include/linux/skbuff.h.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32) In brief, it lets the stack request that the device fill in a single
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) ones-complement checksum defined by the sk_buff fields skb->csum_start and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34) skb->csum_offset. The device should compute the 16-bit ones-complement checksum
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35) (i.e. the 'IP-style' checksum) from csum_start to the end of the packet, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) fill in the result at (csum_start + csum_offset).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) Because csum_offset cannot be negative, the checksum field always lies within
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39) the summed region, so its previous value is included in the checksum
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40) computation. That value can therefore be used to supply any needed corrections
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41) to the checksum (such as the sum of the pseudo-header for UDP or TCP).
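
The device-side behaviour described above can be modelled in userspace. The
following is only a sketch, not kernel or driver code: the names csum_fold16(),
ones_sum() and emulate_tx_csum() are hypothetical, offsets are taken from the
start of a flat buffer rather than from skb->head, and the checksum field is
assumed to sit at an even offset from csum_start.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit ones-complement sum. */
static uint16_t csum_fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Ones-complement sum of len bytes, read as big-endian 16-bit words.
 * An odd trailing byte is padded with a zero byte. The 32-bit
 * accumulator is sufficient for buffers up to 64 KiB.
 */
static uint16_t ones_sum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)data[i] << 8) | data[i + 1];
	if (len & 1)
		sum += (uint32_t)data[len - 1] << 8;
	return csum_fold16(sum);
}

/* Hypothetical model of the device: sum from csum_start to the end of
 * the packet (including whatever seed is already in the checksum
 * field), complement the result and store it in the field.
 */
static void emulate_tx_csum(uint8_t *pkt, size_t len,
			    size_t csum_start, size_t csum_offset)
{
	size_t field = csum_start + csum_offset;
	uint16_t csum;

	assert(csum_offset % 2 == 0 && field + 2 <= len);
	csum = (uint16_t)~ones_sum(pkt + csum_start, len - csum_start);
	pkt[field] = (uint8_t)(csum >> 8);
	pkt[field + 1] = (uint8_t)(csum & 0xff);
}
```

For a UDP-like header the caller would pass csum_offset = 6 and pre-load the
field with the pseudo-header sum; bytes before csum_start are never touched.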
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43) This interface only allows a single checksum to be offloaded. Where
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) encapsulation is used, the packet may have multiple checksum fields in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45) different header layers, and the rest will have to be handled by another
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) mechanism such as LCO or RCO.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48) CRC32c can also be offloaded using this interface, by means of filling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49) skb->csum_start and skb->csum_offset as described above, and setting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) skb->csum_not_inet: see skbuff.h comment (section 'D') for more details.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52) No offloading of the IP header checksum is performed; it is always done in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53) software. This is OK because when we build the IP header, we obviously have it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) in cache, so summing it isn't expensive. It's also rather short.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56) The requirements for GSO are more complicated, because when segmenting an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57) encapsulated packet both the inner and outer checksums may need to be edited or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58) recomputed for each resulting segment. See the skbuff.h comment (section 'E')
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59) for more details.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61) A driver declares its offload capabilities in netdev->hw_features; see
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) Documentation/networking/netdev-features.rst for more. Note that a device
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63) which only advertises NETIF_F_IP[V6]_CSUM must still obey the csum_start and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64) csum_offset given in the SKB; if it tries to deduce these itself in hardware
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65) (as some NICs do) the driver should check that the values in the SKB match
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66) those which the hardware will deduce, and if not, fall back to checksumming in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67) software instead (with skb_csum_hwoffload_help() or one of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68) skb_checksum_help() / skb_crc32c_csum_help() functions, as mentioned in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69) include/linux/skbuff.h).
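
As an illustration of the check described above, the sketch below models a NIC
that parses plain IPv4 + TCP/UDP frames and deduces the checksum location
itself. It is a userspace approximation under stated assumptions, not driver
code: hw_deduce_ipv4() and can_use_hw_csum() are hypothetical names, offsets
are relative to the start of a flat frame buffer (whereas skb->csum_start is
relative to skb->head), and only the IPv4 case is covered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PROTO_TCP 6	/* IPv4 protocol number for TCP */
#define PROTO_UDP 17	/* IPv4 protocol number for UDP */

/* What a parsing NIC might deduce for an IPv4 header at l3_off.
 * Returns false if the frame is not something it understands.
 */
static bool hw_deduce_ipv4(const uint8_t *pkt, size_t len, size_t l3_off,
			   size_t *start, size_t *offset)
{
	size_t ihl;

	assert(pkt && start && offset);
	if (len < l3_off + 20 || (pkt[l3_off] >> 4) != 4)
		return false;
	ihl = (size_t)(pkt[l3_off] & 0x0f) * 4;
	if (ihl < 20 || len < l3_off + ihl)
		return false;

	switch (pkt[l3_off + 9]) {	/* IPv4 protocol field */
	case PROTO_TCP:
		*offset = 16;		/* TCP checksum field offset */
		break;
	case PROTO_UDP:
		*offset = 6;		/* UDP checksum field offset */
		break;
	default:
		return false;
	}
	*start = l3_off + ihl;
	return true;
}

/* True only if the values requested by the stack match what the
 * hardware would deduce; otherwise the caller should checksum in
 * software instead.
 */
static bool can_use_hw_csum(const uint8_t *pkt, size_t len, size_t l3_off,
			    size_t csum_start, size_t csum_offset)
{
	size_t start, offset;

	return hw_deduce_ipv4(pkt, len, l3_off, &start, &offset) &&
	       start == csum_start && offset == csum_offset;
}
```

A mismatch typically arises with encapsulation, where the stack asks for the
inner checksum but the hardware would locate the outer transport header.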
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71) The stack should, for the most part, assume that checksum offload is supported
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72) by the underlying device. The only place that should check is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73) validate_xmit_skb(), and the functions it calls directly or indirectly. That
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74) function compares the offload features requested by the SKB (which may include
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75) other offloads besides TX Checksum Offload) and, if they are not supported or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76) enabled on the device (determined by netdev->features), performs the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77) corresponding offload in software. In the case of TX Checksum Offload, that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78) means calling skb_csum_hwoffload_help(skb, features).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81) LCO: Local Checksum Offload
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82) ===========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84) LCO is a technique for efficiently computing the outer checksum of an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85) encapsulated datagram when the inner checksum is due to be offloaded.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87) The ones-complement sum of a correctly checksummed TCP or UDP packet is equal
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88) to the complement of the sum of the pseudo header, because everything else gets
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89) 'cancelled out' by the checksum field. This is because the sum was
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90) complemented before being written to the checksum field.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92) More generally, this holds in any case where the 'IP-style' ones-complement
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93) checksum is used, and thus any checksum that TX Checksum Offload supports.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95) That is, if we have set up TX Checksum Offload with a start/offset pair, we
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96) know that after the device has filled in that checksum, the ones-complement sum
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97) from csum_start to the end of the packet will be equal to the complement of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98) whatever value we put in the checksum field beforehand. This allows us to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99) compute the outer checksum without looking at the payload: we simply stop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) summing when we get to csum_start, then add the complement of the 16-bit word
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) at (csum_start + csum_offset).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) Then, when the true inner checksum is filled in (either by hardware or by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) skb_checksum_help()), the outer checksum will become correct by virtue of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) arithmetic.
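
The arithmetic can be checked with a small userspace model. This is a sketch
only and is not the kernel's lco_csum(): the helpers ones_sum(),
lco_outer_csum() and fill_inner_csum() are hypothetical, the packet is a flat
buffer with even header lengths, and the outer checksum field is assumed to
already hold any seed it needs (zero in the usage below).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint16_t get16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

static void put16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t)(v >> 8);
	p[1] = (uint8_t)(v & 0xff);
}

/* Folded ones-complement sum of seed plus len bytes of data, read as
 * big-endian 16-bit words (an odd trailing byte is zero-padded).
 */
static uint16_t ones_sum(const uint8_t *data, size_t len, uint16_t seed)
{
	uint32_t sum = seed;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)data[i] << 8) | data[i + 1];
	if (len & 1)
		sum += (uint32_t)data[len - 1] << 8;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Outer checksum computed without reading anything past csum_start
 * except the 16-bit word in the inner checksum field.
 */
static uint16_t lco_outer_csum(const uint8_t *pkt, size_t outer_start,
			       size_t csum_start, size_t csum_offset)
{
	uint16_t seed = (uint16_t)~get16(pkt + csum_start + csum_offset);

	assert(outer_start <= csum_start &&
	       (csum_start - outer_start) % 2 == 0);
	return (uint16_t)~ones_sum(pkt + outer_start,
				   csum_start - outer_start, seed);
}

/* Stand-in for the device (or a software fallback) filling in the
 * inner checksum later on.
 */
static void fill_inner_csum(uint8_t *pkt, size_t len,
			    size_t csum_start, size_t csum_offset)
{
	uint16_t csum = (uint16_t)~ones_sum(pkt + csum_start,
					    len - csum_start, 0);

	put16(pkt + csum_start + csum_offset, csum);
}
```

In use, the outer checksum is written first with
put16(pkt + outer_field, lco_outer_csum(...)), and fill_inner_csum() runs
afterwards; summing the whole packet from outer_start should then fold to
0xffff, which is what a receiver expects of a valid checksum.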
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) LCO is performed by the stack when constructing an outer UDP header for an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) encapsulation such as VXLAN or GENEVE, in udp_set_csum(). Similarly for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) IPv6 equivalents, in udp6_set_csum().
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) It is also performed when constructing an IPv4 GRE header, in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) net/ipv4/ip_gre.c:build_header(). It is *not* currently performed when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) constructing an IPv6 GRE header; the GRE checksum is computed over the whole
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) packet in net/ipv6/ip6_gre.c:ip6gre_xmit2(), but it should be possible to use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) LCO here as IPv6 GRE still uses an IP-style checksum.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) All of the LCO implementations use a helper function lco_csum(), in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) include/linux/skbuff.h.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) LCO can safely be used for nested encapsulations; in this case, the outer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) encapsulation layer will sum over both its own header and the 'middle' header.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) This does mean that the 'middle' header will get summed multiple times, but
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) there doesn't seem to be a way to avoid that without incurring bigger costs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) (e.g. in SKB bloat).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) RCO: Remote Checksum Offload
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) ============================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) RCO is a technique for eliding the inner checksum of an encapsulated datagram,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) allowing the outer checksum to be offloaded. It does, however, involve a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) change to the encapsulation protocols, which the receiver must also support.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) For this reason, it is disabled by default.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) RCO is detailed in the following Internet-Drafts:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) * https://tools.ietf.org/html/draft-herbert-remotecsumoffload-00
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) * https://tools.ietf.org/html/draft-herbert-vxlan-rco-00
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) In Linux, RCO is implemented individually in each encapsulation protocol, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) most tunnel types have flags controlling its use. For instance, VXLAN has the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) flag VXLAN_F_REMCSUM_TX (per struct vxlan_rdst) to indicate that RCO should be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) used when transmitting to a given remote destination.