.. SPDX-License-Identifier: GPL-2.0

====================================
Netfilter's flowtable infrastructure
====================================

This documentation describes the software flowtable infrastructure available in
Netfilter since Linux kernel 4.16.

Overview
--------

Initial packets follow the classic forwarding path. Once the flow enters the
established state according to conntrack semantics (i.e. traffic has been seen
in both directions), you can decide to offload the flow to the flowtable from
the forward chain via the 'flow offload' action available in nftables.

Packets that find an entry in the flowtable (i.e. a flowtable hit) are sent to
the output netdevice via neigh_xmit(), so they bypass the classic forwarding
path. The visible effect is that these packets are not seen by any of the
Netfilter hooks that come after ingress. On a flowtable miss, the packet
follows the classic forwarding path.

The flowtable uses a resizable hashtable. Lookups are based on the following
7-tuple selectors: source and destination addresses, layer 3 and layer 4
protocols, source and destination ports, and the input interface (useful in
case there are several conntrack zones in place).

Flowtables are populated via the 'flow offload' nftables action, so the user
can selectively specify which flows are placed into the flowtable. Hence,
packets follow the classic forwarding path unless the user explicitly instructs
them to use this alternative forwarding path via nftables policy.

This is represented in Fig.1, which describes the classic forwarding path
including the Netfilter hooks and the flowtable fastpath bypass.

::

                                         userspace process
                                          ^              |
                                          |              |
                                     _____|____     ____\/___
                                    /          \   /         \
                                    |   input   |  |  output  |
                                    \__________/   \_________/
                                         ^               |
                                         |               |
      _________      __________      ---------     _____\/_____
     /         \    /          \     |Routing |   /            \
  -->  ingress  ---> prerouting ---> |decision|   | postrouting |--> neigh_xmit
     \_________/    \__________/     ----------   \____________/          ^
       |      ^                          |               ^                |
   flowtable  |                     ____\/___            |                |
       |      |                    /         \           |                |
    __\/___   |                    | forward |------------                |
    |-----|   |                    \_________/                            |
    |-----|   |                 'flow offload' rule                       |
    |-----|   |                   adds entry to                           |
    |_____|   |                     flowtable                             |
       |      |                                                           |
      / \     |                                                           |
     /hit\_no_|                                                           |
     \ ? /                                                                |
      \ /                                                                 |
       |__yes_________________fastpath bypass ____________________________|

               Fig.1 Netfilter hooks and flowtable interactions

The flowtable entry also stores the NAT configuration, so all packets are
mangled according to the NAT policy that matched the initial packets that went
through the classic forwarding path. The TTL is decremented before calling
neigh_xmit(). Fragmented traffic is passed up to follow the classic forwarding
path, since the transport selectors are missing and a flowtable lookup is
therefore not possible.

Example configuration
---------------------

Enabling the flowtable bypass is relatively easy: you only need to create a
flowtable and add one rule to your forward chain::

    table inet x {
        flowtable f {
            hook ingress priority 0; devices = { eth0, eth1 };
        }
        chain y {
            type filter hook forward priority 0; policy accept;
            ip protocol tcp flow offload @f
            counter packets 0 bytes 0
        }
    }

This example adds the flowtable 'f' to the ingress hook of the eth0 and eth1
netdevices. You can create as many flowtables as you want in case you need to
perform resource partitioning. The flowtable priority defines the order in
which hooks are run in the pipeline. This is convenient in case you already
have an nftables ingress chain: make sure the flowtable priority is smaller
than that of the nftables ingress chain, so that the flowtable runs first in
the pipeline.

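As a minimal sketch of the priority ordering described above, assume an
ingress chain already exists in a netdev table. The table name
'ingress_filter', the chain name 'in_eth0' and the priority value 10 are
hypothetical and only serve as illustration. Giving the flowtable the smaller
priority lets it run first::

    table netdev ingress_filter {
        chain in_eth0 {
            # existing ingress chain, runs after the flowtable (10 > 0)
            type filter hook ingress device eth0 priority 10; policy accept;
        }
    }
    table inet x {
        flowtable f {
            # smaller priority than the ingress chain above
            hook ingress priority 0; devices = { eth0, eth1 };
        }
    }

Only the relative order of the two priorities matters here; the absolute
values are arbitrary.
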
The 'flow offload' action from the forward chain 'y' adds an entry to the
flowtable for the TCP syn-ack packet coming in the reply direction. Once the
flow is offloaded, you will observe that the counter rule in the example above
is no longer updated for packets that are forwarded through the forwarding
bypass.

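The Overview states that the user selects which flows are offloaded and that
the flowtable entry stores the NAT configuration. The following variant is a
sketch under assumptions, not a recommended setup: it assumes that eth1 is the
uplink interface, restricts offloading to TCP and UDP flows via 'meta l4proto',
and uses a hypothetical 'postrouting' chain with masquerading::

    table ip x {
        flowtable f {
            hook ingress priority 0; devices = { eth0, eth1 };
        }
        chain y {
            type filter hook forward priority 0; policy accept;
            # only TCP and UDP flows are offloaded
            meta l4proto { tcp, udp } flow offload @f
        }
        chain postrouting {
            type nat hook postrouting priority 100; policy accept;
            # assumed uplink; initial packets set up the NAT mapping here
            oifname "eth1" masquerade
        }
    }

Flows that do not match the 'flow offload' rule keep following the classic
forwarding path.
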
More reading
------------

This documentation is based on the LWN.net articles [1]_\ [2]_. Rafal Milecki
also wrote a comprehensive summary called "A state of network acceleration"
that describes how things were before this infrastructure was mainlined [3]_,
and a rough summary of this work [4]_.

.. [1] https://lwn.net/Articles/738214/
.. [2] https://lwn.net/Articles/742164/
.. [3] http://lists.infradead.org/pipermail/lede-dev/2018-January/010830.html
.. [4] http://lists.infradead.org/pipermail/lede-dev/2018-January/010829.html