^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) .. SPDX-License-Identifier: GPL-2.0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2) .. include:: <isonum.txt>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4) ===============================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) Ethernet switch device driver model (switchdev)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6) ===============================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8) Copyright |copy| 2014 Jiri Pirko <jiri@resnulli.us>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10) Copyright |copy| 2014-2015 Scott Feldman <sfeldma@gmail.com>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) The Ethernet switch device driver model (switchdev) is an in-kernel driver
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14) model for switch devices which offload the forwarding (data) plane from the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15) kernel.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17) Figure 1 is a block diagram showing the components of the switchdev model for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18) an example setup using a data-center-class switch ASIC chip. Other setups
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) with SR-IOV or soft switches, such as OVS, are possible.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24)                              User-space tools
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26)        user space                   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27)       +-------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28)        kernel                       | Netlink
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29)                                     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30)                      +--------------+-------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31)                      |         Network stack                        |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32)                      |           (Linux)                            |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33)                      |                                              |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34)                      +----------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36)                            sw1p2     sw1p4     sw1p6
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37)                       sw1p1  +  sw1p3  +  sw1p5  +          eth1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38)                         +    |    +    |    +    |            +
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39)                         |    |    |    |    |    |            |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40)                      +--+----+----+----+----+----+---+  +-----+-----+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41)                      |         Switch driver         |  |    mgmt   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42)                      |        (this document)        |  |   driver  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43)                      |                               |  |           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44)                      +--------------+----------------+  +-----------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45)                                     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46)        kernel                       | HW bus (eg PCI)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47)       +-------------------------------------------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48)        hardware                     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49)                      +--------------+----------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50)                      |         Switch device (sw1)   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51)                      |  +----+                       +--------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52)                      |  |    v offloaded data path   | mgmt port
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53)                      |  |    |                       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54)                      +--|----|----+----+----+----+---+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55)                         |    |    |    |    |    |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56)                         +    +    +    +    +    +
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57)                        p1   p2   p3   p4   p5   p6
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59)                              front-panel ports
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62)                                     Fig 1.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65) Include Files
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66) -------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70)     #include <linux/netdevice.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71)     #include <net/switchdev.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74) Configuration
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75) -------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77) Use "depends on NET_SWITCHDEV" in the driver's Kconfig to ensure that switchdev
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78) model support is built for the driver.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81) Switch Ports
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82) ------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84) On switchdev driver initialization, the driver will allocate and register a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85) struct net_device (using register_netdev()) for each enumerated physical switch
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86) port, called the port netdev. A port netdev is the software representation of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87) the physical port and provides a conduit for control traffic to/from the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88) controller (the kernel) and the network, as well as an anchor point for higher
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89) level constructs such as bridges, bonds, VLANs, tunnels, and L3 routers. Using
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90) standard netdev tools (iproute2, ethtool, etc), the port netdev can also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91) provide to the user access to the physical properties of the switch port such
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92) as PHY link state and I/O statistics.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94) There is (currently) no higher-level kernel object for the switch beyond the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95) port netdevs. All of the switchdev driver ops are netdev ops or switchdev ops.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97) A switch management port is outside the scope of the switchdev driver model.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98) Typically, the management port does not participate in the offloaded data plane
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99) and is handled by a different driver, such as a NIC driver, bound to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) management port device.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) Switch ID
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) ^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) The switchdev driver must implement the net_device operation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) ndo_get_port_parent_id for each port netdev, returning the same physical ID for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) each port of a switch. The ID must be unique between switches on the same
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) system. The ID does not need to be unique between switches on different
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) systems.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) The switch ID is used to locate ports on a switch and to know if aggregated
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) ports belong to the same switch.
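
The "same switch" test described above can be illustrated with a small userspace sketch. This is not the kernel's own code: ``struct phys_id``, ``PHYS_ID_MAX_LEN`` and ``same_switch()`` are hypothetical names, and the 32-byte bound is an assumption made for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PHYS_ID_MAX_LEN 32 /* assumed upper bound for this sketch */

/* Hypothetical stand-in for the ID a port reports as its parent ID. */
struct phys_id {
    unsigned char id[PHYS_ID_MAX_LEN];
    size_t len;
};

/* Two ports sit on the same switch when their parent IDs match exactly. */
bool same_switch(const struct phys_id *a, const struct phys_id *b)
{
    return a->len == b->len && memcmp(a->id, b->id, a->len) == 0;
}
```

A bonding or bridging layer could use such a comparison to decide whether all members of an aggregate can be offloaded to one device.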
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) Port Netdev Naming
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) ^^^^^^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) Udev rules should be used for port netdev naming, using some unique attribute
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) of the port as a key, for example the port MAC address or the port PHYS name.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) Hard-coding of kernel netdev names within the driver is discouraged; let the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) kernel pick the default netdev name, and let udev set the final name based on a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) port attribute.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) Using port PHYS name (ndo_get_phys_port_name) for the key is particularly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) useful for dynamically-named ports where the device names its ports based on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) external configuration. For example, if a physical 40G port is split logically
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) into 4 10G ports, resulting in 4 port netdevs, the device can give a unique
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) name for each port using port PHYS name. The udev rule would be::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129)     SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="<phys_switch_id>", \
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130)             ATTR{phys_port_name}!="", NAME="swX$attr{phys_port_name}"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) Suggested naming convention is "swXpYsZ", where X is the switch name or ID, Y
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) is the port name or ID, and Z is the sub-port name or ID. For example, sw1p1s0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) would be sub-port 0 on port 1 on switch 1.
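
To make the convention concrete, here is a minimal sketch that builds such a name from numeric IDs. The helper ``format_port_name()`` is hypothetical and only illustrates the string layout; in practice udev composes the name from the attributes reported by the driver.

```c
#include <stdio.h>

/*
 * Build a "swXpYsZ" style name. Pass a negative subport for ports that
 * are not split, which yields "swXpY".
 * Returns the snprintf() result, i.e. the length the full name needs.
 */
int format_port_name(char *buf, size_t len, int sw, int port, int subport)
{
    if (subport < 0)
        return snprintf(buf, len, "sw%dp%d", sw, port);
    return snprintf(buf, len, "sw%dp%ds%d", sw, port, subport);
}
```

Calling it with switch 1, port 1 and sub-port 0 produces the ``sw1p1s0`` name used in the example above.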
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) Port Features
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) ^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) NETIF_F_NETNS_LOCAL
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) If the switchdev driver (and device) only supports offloading of the default
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) network namespace (netns), the driver should set this feature flag to prevent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) the port netdev from being moved out of the default netns. A netns-aware
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) driver/device would not set this flag and would be responsible for partitioning
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) hardware to preserve netns containment. This means hardware cannot forward
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146) traffic from a port in one namespace to another port in another namespace.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) Port Topology
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) ^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151) The port netdevs representing the physical switch ports can be organized into
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152) higher-level switching constructs. The default construct is a standalone
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153) router port, used to offload L3 forwarding. Two or more ports can be bonded
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) together to form a LAG. Two or more ports (or LAGs) can be bridged to bridge
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) L2 networks. VLANs can be applied to sub-divide L2 networks. L2-over-L3
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) tunnels can be built on ports. These constructs are built using standard Linux
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) tools such as the bridge driver, the bonding/team drivers, and netlink-based
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158) tools such as iproute2.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) The switchdev driver can learn a particular port's position in the topology by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) monitoring NETDEV_CHANGEUPPER notifications. For example, a port moved into a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162) bond will see its upper master change. If that bond is moved into a bridge,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) the bond's upper master will change, and so on. The driver tracks such
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164) movements to determine a port's position in the overall topology by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) registering for netdevice events and acting on NETDEV_CHANGEUPPER.
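
The bookkeeping this implies can be sketched in plain userspace C. Everything below is a simplified model, not kernel code: the types ``struct lag`` and ``struct port`` and the three functions are hypothetical, and each ``*_changeupper()`` call stands for the driver reacting to one NETDEV_CHANGEUPPER event.

```c
#include <stdbool.h>
#include <stddef.h>

enum upper_kind { UPPER_NONE, UPPER_BOND, UPPER_BRIDGE };

/* Hypothetical record for a bond (LAG) that may itself join a bridge. */
struct lag {
    bool in_bridge;
};

/* Hypothetical per-port record kept by a driver. */
struct port {
    enum upper_kind master; /* kind of the port's direct upper device */
    struct lag *lag;        /* set only while master == UPPER_BOND */
};

/* The port itself was linked to, or unlinked from, an upper device. */
void port_changeupper(struct port *p, enum upper_kind kind,
                      struct lag *lag, bool linking)
{
    p->master = linking ? kind : UPPER_NONE;
    p->lag = (linking && kind == UPPER_BOND) ? lag : NULL;
}

/* The bond was linked to, or unlinked from, a bridge. */
void lag_changeupper(struct lag *lag, bool linking)
{
    lag->in_bridge = linking;
}

/* A port is bridged directly, or indirectly through its bond. */
bool port_is_bridged(const struct port *p)
{
    if (p->master == UPPER_BRIDGE)
        return true;
    return p->master == UPPER_BOND && p->lag && p->lag->in_bridge;
}
```

Keeping the bridge membership on the bond record means a port that joins an already bridged bond is classified correctly without replaying earlier events.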
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) L2 Forwarding Offload
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) ---------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) The idea is to offload the L2 data forwarding (switching) path from the kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171) to the switchdev device by mirroring bridge FDB entries down to the device. An
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) FDB entry is the {port, MAC, VLAN} tuple forwarding destination.
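
As an illustration of the {port, MAC, VLAN} tuple, the sketch below keeps a tiny table keyed by MAC and VLAN that yields the destination port. It is a userspace model under stated assumptions: the names ``struct fdb``, ``fdb_add()`` and ``fdb_lookup()`` are hypothetical, the table size is arbitrary, and real devices use hashed or TCAM-backed tables rather than a linear scan.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FDB_MAX 16 /* table size picked for the sketch */

struct fdb_entry {
    uint8_t  mac[6];
    uint16_t vid;
    int      port;
};

struct fdb {
    struct fdb_entry e[FDB_MAX];
    size_t n;
};

static struct fdb_entry *fdb_find(struct fdb *f, const uint8_t mac[6],
                                  uint16_t vid)
{
    for (size_t i = 0; i < f->n; i++)
        if (f->e[i].vid == vid && memcmp(f->e[i].mac, mac, 6) == 0)
            return &f->e[i];
    return NULL;
}

/* Add an entry, or move an existing MAC/VLAN to a new port. -1 if full. */
int fdb_add(struct fdb *f, const uint8_t mac[6], uint16_t vid, int port)
{
    struct fdb_entry *e = fdb_find(f, mac, vid);

    if (!e) {
        if (f->n == FDB_MAX)
            return -1;
        e = &f->e[f->n++];
        memcpy(e->mac, mac, 6);
        e->vid = vid;
    }
    e->port = port;
    return 0;
}

/* Destination port for MAC/VLAN, or -1 when unknown (flood candidate). */
int fdb_lookup(struct fdb *f, const uint8_t mac[6], uint16_t vid)
{
    struct fdb_entry *e = fdb_find(f, mac, vid);

    return e ? e->port : -1;
}
```

Note that the same MAC in a different VLAN is a distinct entry, which is why the VLAN is part of the key.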
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174) To offload L2 bridging, the switchdev driver/device should support:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) - Static FDB entries installed on a bridge port
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177) - Notification of learned/forgotten source MAC/VLANs from the device
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178) - STP state changes on the port
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179) - VLAN flooding of multicast/broadcast and unknown unicast packets
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181) Static FDB Entries
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182) ^^^^^^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184) The switchdev driver should implement ndo_fdb_add, ndo_fdb_del and ndo_fdb_dump
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185) to support static FDB entries installed to the device. Static bridge FDB
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186) entries are installed, for example, using iproute2 bridge cmd::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188)     bridge fdb add ADDR dev DEV [vlan VID] [self]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190) The driver should use the helper switchdev_port_fdb_xxx ops for ndo_fdb_xxx
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191) ops, and handle add/delete/dump of SWITCHDEV_OBJ_ID_PORT_FDB object using
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192) switchdev_port_obj_xxx ops.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194) XXX: what should be done if offloading this rule to hardware fails (for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195) example, due to full capacity in hardware tables) ?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197) Note: by default, the bridge does not filter on VLAN and only bridges untagged
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198) traffic. To enable VLAN support, turn on VLAN filtering::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200)     echo 1 >/sys/class/net/<bridge>/bridge/vlan_filtering
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202) Notification of Learned/Forgotten Source MAC/VLANs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205) The switch device will learn/forget source MAC address/VLAN on ingress packets
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 206) and notify the switch driver of the MAC/VLAN/port tuples. The switch driver,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 207) in turn, will notify the bridge driver using the switchdev notifier call::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 208)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 209)     err = call_switchdev_notifiers(val, dev, info, extack);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 210)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 211) Where val is SWITCHDEV_FDB_ADD when learning and SWITCHDEV_FDB_DEL when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 212) forgetting, and info points to a struct switchdev_notifier_fdb_info. On
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 213) SWITCHDEV_FDB_ADD, the bridge driver will install the FDB entry into the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 214) bridge's FDB and mark the entry as NTF_EXT_LEARNED. The iproute2 bridge
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 215) command will label these entries "offload"::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 216)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 217)     $ bridge fdb
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 218)     52:54:00:12:35:01 dev sw1p1 master br0 permanent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 219)     00:02:00:00:02:00 dev sw1p1 master br0 offload
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 220)     00:02:00:00:02:00 dev sw1p1 self
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 221)     52:54:00:12:35:02 dev sw1p2 master br0 permanent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 222)     00:02:00:00:03:00 dev sw1p2 master br0 offload
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 223)     00:02:00:00:03:00 dev sw1p2 self
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 224)     33:33:00:00:00:01 dev eth0 self permanent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 225)     01:00:5e:00:00:01 dev eth0 self permanent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 226)     33:33:ff:00:00:00 dev eth0 self permanent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 227)     01:80:c2:00:00:0e dev eth0 self permanent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 228)     33:33:00:00:00:01 dev br0 self permanent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 229)     01:00:5e:00:00:01 dev br0 self permanent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 230)     33:33:ff:12:35:01 dev br0 self permanent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 231)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 232) Learning on the port should be disabled on the bridge using the bridge command::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 233)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 234)     bridge link set dev DEV learning off
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 235)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 236) Learning on the device port should be enabled, as well as learning_sync::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 237)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 238)     bridge link set dev DEV learning on self
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 239)     bridge link set dev DEV learning_sync on self
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 240)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 241) The learning_sync attribute enables syncing of learned/forgotten FDB entries to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 242) the bridge's FDB. It's possible, but not optimal, to enable learning on the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 243) device port and on the bridge port, and disable learning_sync.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 244)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 245) To support learning, the driver implements the switchdev op
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 246) switchdev_port_attr_set for SWITCHDEV_ATTR_ID_PORT_{PRE_}BRIDGE_FLAGS.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 247)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 248) FDB Ageing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 249) ^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 250)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 251) The bridge will skip ageing FDB entries marked with NTF_EXT_LEARNED and it is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 252) the responsibility of the port driver/device to age out these entries. If the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 253) port device supports ageing, when the FDB entry expires, it will notify the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 254) driver which in turn will notify the bridge with SWITCHDEV_FDB_DEL. If the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 255) device does not support ageing, the driver can simulate ageing using a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 256) garbage collection timer to monitor FDB entries. Expired entries will be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 257) notified to the bridge using SWITCHDEV_FDB_DEL. See the rocker driver for an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 258) example of a driver running an ageing timer.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 259)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 260) To keep an NTF_EXT_LEARNED entry "alive", the driver should refresh the FDB
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 261) entry by calling call_switchdev_notifiers(SWITCHDEV_FDB_ADD, ...). The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 262) notification will reset the FDB entry's last-used time to now. The driver
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 263) should rate limit refresh notifications, for example, no more than once a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 264) second. (The last-used time is visible using the bridge -s fdb option).
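
The two mechanisms above, a garbage-collection pass and rate-limited refreshes, can be modelled in a few lines of userspace C. This is a sketch, not the rocker driver's code: ``struct fdb_shadow`` and both functions are hypothetical, and timestamps are assumed to be plain millisecond counters.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical driver-side shadow of one learned FDB entry. */
struct fdb_shadow {
    bool     in_use;
    uint64_t last_seen;    /* last time hardware saw this source MAC */
    uint64_t last_refresh; /* last time the bridge was refreshed */
};

/*
 * Call when hardware reports traffic from the entry's source MAC.
 * Returns true when a refresh notification should be sent, which
 * happens at most once per min_interval.
 */
bool fdb_refresh_due(struct fdb_shadow *e, uint64_t now,
                     uint64_t min_interval)
{
    e->last_seen = now;
    if (now - e->last_refresh < min_interval)
        return false;
    e->last_refresh = now;
    return true;
}

/*
 * One pass of a garbage-collection timer. Entries idle for at least
 * ageing_time are released; the return value is how many expired, i.e.
 * how many delete notifications a driver would send to the bridge.
 */
size_t fdb_gc(struct fdb_shadow *tbl, size_t n, uint64_t now,
              uint64_t ageing_time)
{
    size_t expired = 0;

    for (size_t i = 0; i < n; i++) {
        if (tbl[i].in_use && now - tbl[i].last_seen >= ageing_time) {
            tbl[i].in_use = false;
            expired++;
        }
    }
    return expired;
}
```

With ``min_interval`` set to 1000, an entry that sees continuous traffic triggers at most one refresh per second, matching the rate limit suggested above.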
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 265)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 266) STP State Change on Port
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 267) ^^^^^^^^^^^^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 268)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 269) Internally or with a third-party STP protocol implementation (e.g. mstpd), the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 270) bridge driver maintains the STP state for ports, and will notify the switch
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 271) driver of STP state change on a port using the switchdev op
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 272) switchdev_port_attr_set for SWITCHDEV_ATTR_ID_PORT_STP_STATE.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 273)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 274) State is one of BR_STATE_*. The switch driver can use STP state updates to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 275) update the ingress packet filter list for the port. For example, if the port is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 276) DISABLED, no packets should pass, but if the port moves to BLOCKED, then STP BPDUs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 277) and other IEEE 01:80:c2:xx:xx:xx link-local multicast packets can pass.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 278)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 279) Note that STP BPDUs are untagged and STP state applies to all VLANs on the port,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 280) so packet filters should be applied consistently across untagged and tagged
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 281) VLANs on the port.
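
A minimal sketch of such an ingress filter decision follows. The enum is a local stand-in for the BR_STATE_* values with names invented for this example, and treating LISTENING and LEARNING the same as BLOCKING is an assumption; the text above only specifies the DISABLED and BLOCKED cases.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local mirror of the bridge port states; names chosen for this sketch. */
enum stp_state {
    STP_DISABLED,
    STP_LISTENING,
    STP_LEARNING,
    STP_FORWARDING,
    STP_BLOCKING,
};

/* Matches the 01:80:c2:xx:xx:xx range named in the text. */
bool is_ieee_link_local(const uint8_t mac[6])
{
    return mac[0] == 0x01 && mac[1] == 0x80 && mac[2] == 0xc2;
}

/* Decide whether an ingress frame with destination MAC dst may pass. */
bool stp_ingress_allowed(enum stp_state state, const uint8_t dst[6])
{
    switch (state) {
    case STP_DISABLED:
        return false;
    case STP_FORWARDING:
        return true;
    default:
        /*
         * BLOCKING per the text; LISTENING and LEARNING are treated
         * alike here as an assumption.
         */
        return is_ieee_link_local(dst);
    }
}
```

Because the decision does not look at the VLAN tag, it applies equally to tagged and untagged frames, which is the consistency the note above asks for.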
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 282)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 283) Flooding L2 domain
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 284) ^^^^^^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 285)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 286) For a given L2 VLAN domain, the switch device should flood multicast/broadcast
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 287) and unknown unicast packets to all ports in the domain, if allowed by each port's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 288) current STP state. The switch driver, knowing which ports are within which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 289) VLAN L2 domain, can program the switch device for flooding. The packet may
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 290) be sent to the port netdev for processing by the bridge driver. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 291) bridge should not reflood the packet to the same ports the device flooded,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 292) otherwise there will be duplicate packets on the wire.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 293)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 294) To avoid duplicate packets, the switch driver should mark a packet as already
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 295) forwarded by setting the skb->offload_fwd_mark bit. The bridge driver will mark
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 296) the skb using the ingress bridge port's mark and prevent it from being forwarded
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 297) through any bridge port with the same mark.
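
The effect of the mark can be shown with a small userspace model. The structures below are hypothetical and only loosely modelled on the skb and bridge port fields; ``hw_mark`` stands for the mark shared by all bridge ports of one hardware forwarding domain.

```c
#include <stdbool.h>

/* Hypothetical bridge port: ports of one hardware domain share a mark. */
struct br_port {
    int hw_mark; /* 0 means the port is not backed by offloading hardware */
};

/* Hypothetical packet metadata, loosely modelled on the skb fields. */
struct pkt {
    bool offload_fwd_mark; /* set by the driver: hardware already forwarded */
    int  ingress_mark;     /* filled in by the bridge from the ingress port */
};

/* Bridge ingress: remember the ingress port's mark for marked packets. */
void br_ingress(struct pkt *p, const struct br_port *in)
{
    p->ingress_mark = p->offload_fwd_mark ? in->hw_mark : 0;
}

/* Bridge egress check: skip ports the hardware has already covered. */
bool br_may_egress(const struct pkt *p, const struct br_port *out)
{
    return !(p->offload_fwd_mark && p->ingress_mark == out->hw_mark);
}
```

A flooded packet received on sw1p1 with the mark set is therefore not sent again to sw1p2, but it can still leave through a port that belongs to another device.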
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 298)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 299) It is possible for the switch device to not handle flooding and push the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 300) packets up to the bridge driver for flooding. This does not scale well as the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 301) number of ports in the L2 domain grows, because the device is much more
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 302) efficient at flooding packets than software.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 303)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 304) If supported by the device, flood control can be offloaded to it, preventing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 305) certain netdevs from flooding unicast traffic for which there is no FDB entry.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 306)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 307) IGMP Snooping
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 308) ^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 309)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 310) In order to support IGMP snooping, the port netdevs should trap all IGMP join
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 311) and leave messages to the bridge driver.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 312) The bridge multicast module will notify port netdevs of every multicast group
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 313) change, whether the group is statically configured or dynamically joined or left.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 314) The hardware implementation should forward traffic for each registered
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 315) multicast group only to the configured ports.
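
One way to picture the state a driver keeps for this is a group table mapping each multicast group to a bitmap of member ports. The sketch below is a userspace model with hypothetical names (``struct mdb``, ``mdb_join()``, ``mdb_leave()``, ``mdb_members()``); it assumes IPv4 groups in host byte order and port numbers below 32, and it leaves the handling of unregistered groups out of scope.

```c
#include <stddef.h>
#include <stdint.h>

#define MDB_MAX 8 /* table size picked for the sketch */

struct mdb_entry {
    uint32_t group;     /* IPv4 group address, host byte order */
    uint32_t port_mask; /* bit N set: port N is a member */
};

struct mdb {
    struct mdb_entry e[MDB_MAX];
    size_t n;
};

static struct mdb_entry *mdb_find(struct mdb *m, uint32_t group)
{
    for (size_t i = 0; i < m->n; i++)
        if (m->e[i].group == group)
            return &m->e[i];
    return NULL;
}

/* Group joined on a port (port must be below 32). -1 if the table is full. */
int mdb_join(struct mdb *m, uint32_t group, unsigned int port)
{
    struct mdb_entry *e = mdb_find(m, group);

    if (!e) {
        if (m->n == MDB_MAX)
            return -1;
        e = &m->e[m->n++];
        e->group = group;
        e->port_mask = 0;
    }
    e->port_mask |= UINT32_C(1) << port;
    return 0;
}

/* Group left on a port; the entry is dropped once no member remains. */
void mdb_leave(struct mdb *m, uint32_t group, unsigned int port)
{
    struct mdb_entry *e = mdb_find(m, group);

    if (!e)
        return;
    e->port_mask &= ~(UINT32_C(1) << port);
    if (e->port_mask == 0)
        *e = m->e[--m->n];
}

/* Ports that should receive traffic for a registered group, else 0. */
uint32_t mdb_members(const struct mdb *m, uint32_t group)
{
    for (size_t i = 0; i < m->n; i++)
        if (m->e[i].group == group)
            return m->e[i].port_mask;
    return 0;
}
```

Each join or leave notification from the bridge would translate into one ``mdb_join()`` or ``mdb_leave()`` call followed by reprogramming the device with the new member mask.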
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 316)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 317) L3 Routing Offload
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 318) ------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 319)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 320) Offloading L3 routing requires that the device be programmed with FIB entries from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 321) the kernel, with the device doing the FIB lookup and forwarding. The device
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 322) does a longest prefix match (LPM) on FIB entries matching route prefix and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 323) forwards the packet to the matching FIB entry's nexthop(s) egress ports.
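
The longest prefix match itself can be illustrated with a linear scan over a small route table. This is only a sketch of the lookup semantics: ``struct route``, ``prefix_mask()`` and ``lpm_lookup()`` are hypothetical names, addresses are IPv4 in host byte order, and real devices use tries or TCAMs instead of a scan.

```c
#include <stddef.h>
#include <stdint.h>

struct route {
    uint32_t dst;         /* prefix, host byte order */
    int      dst_len;     /* prefix length, 0..32 */
    int      egress_port; /* port towards the nexthop */
};

/* Network mask for a prefix length of 0..32. */
uint32_t prefix_mask(int len)
{
    return len == 0 ? 0 : UINT32_MAX << (32 - len);
}

/* Return the matching route with the longest prefix, or NULL. */
const struct route *lpm_lookup(const struct route *tbl, size_t n,
                               uint32_t addr)
{
    const struct route *best = NULL;

    for (size_t i = 0; i < n; i++) {
        uint32_t mask = prefix_mask(tbl[i].dst_len);

        if ((addr & mask) != (tbl[i].dst & mask))
            continue;
        if (!best || tbl[i].dst_len > best->dst_len)
            best = &tbl[i];
    }
    return best;
}
```

With a default route and two /30 routes in the table, an address inside one of the /30 prefixes selects that route, and any other address falls back to the default.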
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 324)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 325) To program the device, the driver has to register a FIB notifier handler
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 326) using register_fib_notifier. The following events are available:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 327)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 328) ===================  ===================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 329) FIB_EVENT_ENTRY_ADD  used for both adding a new FIB entry to the device,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 330)                      or modifying an existing entry on the device.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 331) FIB_EVENT_ENTRY_DEL  used for removing a FIB entry
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 332) FIB_EVENT_RULE_ADD,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 333) FIB_EVENT_RULE_DEL   used to propagate FIB rule changes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 334) ===================  ===================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 335)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 336) FIB_EVENT_ENTRY_ADD and FIB_EVENT_ENTRY_DEL events pass::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 337)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 338)     struct fib_entry_notifier_info {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 339)             struct fib_notifier_info info; /* must be first */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 340)             u32 dst;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 341)             int dst_len;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 342)             struct fib_info *fi;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 343)             u8 tos;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 344)             u8 type;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 345)             u32 tb_id;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 346)             u32 nlflags;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 347)     };
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 348)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 349) to add/modify/delete an IPv4 dst/dst_len prefix on table tb_id. The ``*fi``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 350) structure holds details on the route and route's nexthops. ``*dev`` is one
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 351) of the port netdevs mentioned in the route's next hop list.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 352)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 353) Routes offloaded to the device are labeled with "offload" in the ip route
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 354) listing::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 355)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 356)     $ ip route show
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 357)     default via 192.168.0.2 dev eth0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 358)     11.0.0.0/30 dev sw1p1 proto kernel scope link src 11.0.0.2 offload
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 359)     11.0.0.4/30 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 360)     11.0.0.8/30 dev sw1p2 proto kernel scope link src 11.0.0.10 offload
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 361)     11.0.0.12/30 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 362)     12.0.0.2 proto zebra metric 30 offload
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 363)             nexthop via 11.0.0.1 dev sw1p1 weight 1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 364)             nexthop via 11.0.0.9 dev sw1p2 weight 1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 365)     12.0.0.3 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 366)     12.0.0.4 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 367)     192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.15
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 368)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 369) The "offload" flag is set if at least one device offloads the FIB entry.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 370)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 371) XXX: add/mod/del IPv6 FIB API
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 372)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 373) Nexthop Resolution
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 374) ^^^^^^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 375)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 376) The FIB entry's nexthop list contains the nexthop tuple (gateway, dev), but for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 377) the switch device to forward the packet with the correct destination MAC address,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 378) the nexthop gateways must be resolved to the neighbor's MAC address. Neighbor MAC
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 379) address discovery comes via the ARP (or ND) process and is available via the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 380) arp_tbl neighbor table. To resolve the route's nexthop gateways, the driver
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 381) should trigger the kernel's neighbor resolution process. See the rocker
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 382) driver's rocker_port_ipv4_resolve() for an example.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 383)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 384) The driver can monitor for updates to arp_tbl using the netevent notifier
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 385) NETEVENT_NEIGH_UPDATE. The device can be programmed with resolved nexthops
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 386) for the routes as arp_tbl is updated. The driver implements ndo_neigh_destroy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 387) to know when arp_tbl neighbor entries are purged from the port.
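
The driver-side bookkeeping for resolved and unresolved nexthops might look like the following userspace sketch. All names are hypothetical (``struct nexthop``, ``nexthop_neigh_update()``); one call stands for the driver handling a single neighbor update or neighbor removal for a gateway on a port.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical driver record for one nexthop of an offloaded route. */
struct nexthop {
    uint32_t gateway;  /* IPv4 gateway, host byte order */
    int      port;     /* egress port netdev, by index */
    bool     resolved; /* true once a neighbor MAC is known */
    uint8_t  mac[6];
};

/*
 * Apply a neighbor change for (ip, port). A non-NULL mac resolves the
 * matching nexthops; a NULL mac marks them unresolved again, as when
 * the neighbor entry is destroyed. Returns how many nexthops matched,
 * i.e. how many the device would need to be reprogrammed for.
 */
size_t nexthop_neigh_update(struct nexthop *nh, size_t n, uint32_t ip,
                            int port, const uint8_t *mac)
{
    size_t touched = 0;

    for (size_t i = 0; i < n; i++) {
        if (nh[i].gateway != ip || nh[i].port != port)
            continue;
        nh[i].resolved = mac != NULL;
        if (mac)
            memcpy(nh[i].mac, mac, 6);
        touched++;
    }
    return touched;
}
```

Until a nexthop is resolved, a driver would typically keep the route out of the fast path or trap matching packets to the kernel so that neighbor resolution can run.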