^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) .. SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) =====================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4) BPF sk_lookup program
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) =====================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) The BPF sk_lookup program type (``BPF_PROG_TYPE_SK_LOOKUP``) introduces
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8) programmability into the socket lookup performed by the transport layer when a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9) packet is to be delivered locally.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11) When invoked, a BPF sk_lookup program can select a socket that will receive the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12) incoming packet by calling the ``bpf_sk_assign()`` BPF helper function.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14) Hooks for a common attach point (``BPF_SK_LOOKUP``) exist for both TCP and UDP.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16) Motivation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) The BPF sk_lookup program type was introduced to address setup scenarios where
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20) binding sockets to an address with the ``bind()`` socket call is impractical,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21) such as:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) 1. receiving connections on a range of IP addresses, e.g. 192.0.2.0/24, when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24)    binding to a wildcard address ``INADDR_ANY`` is not possible due to a port
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25)    conflict,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26) 2. receiving connections on all or a wide range of ports, i.e. an L7 proxy use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27)    case.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29) Such setups would require creating and ``bind()``'ing one socket for each
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30) IP address/port in the range, leading to resource consumption and potential
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31) latency spikes during socket lookup.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) Attachment
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) A BPF sk_lookup program can be attached to a network namespace with the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37) ``bpf(BPF_LINK_CREATE, ...)`` syscall using the ``BPF_SK_LOOKUP`` attach type and a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) netns FD as the attachment ``target_fd``.
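
The attach call described above can be sketched in user space roughly as
follows. This is a minimal illustration, not the reference implementation: the
helper name ``attach_sk_lookup()`` is hypothetical, and it assumes that
``prog_fd`` refers to an already loaded ``BPF_PROG_TYPE_SK_LOOKUP`` program and
that ``netns_fd`` was obtained by opening a network namespace file such as
``/proc/self/ns/net``.

.. code-block:: c

    #include <linux/bpf.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /*
     * Hypothetical helper (not part of any library): create a BPF link that
     * attaches the sk_lookup program prog_fd to the network namespace
     * referred to by netns_fd.
     *
     * Returns the link FD on success, or -1 with errno set on failure.
     */
    static int attach_sk_lookup(int prog_fd, int netns_fd)
    {
        union bpf_attr attr;

        /* Unused attribute bytes must be zero. */
        memset(&attr, 0, sizeof(attr));
        attr.link_create.prog_fd = prog_fd;
        attr.link_create.target_fd = netns_fd;
        attr.link_create.attach_type = BPF_SK_LOOKUP;

        return (int)syscall(SYS_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));
    }

The returned link FD represents the attachment, so a caller would normally keep
it open (or pin the link) for as long as the program should stay attached.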
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40) Multiple programs can be attached to one network namespace. Programs will be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41) invoked in the same order as they were attached.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43) Hooks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) =====
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) The attached BPF sk_lookup programs run whenever the transport layer needs to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47) find a listening (TCP) or an unconnected (UDP) socket for an incoming packet.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49) Incoming traffic to established (TCP) and connected (UDP) sockets is delivered
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) as usual without triggering the BPF sk_lookup hook.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52) The attached BPF programs must return either an ``SK_PASS`` or an ``SK_DROP``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53) verdict code. As for other BPF program types that are network filters,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) ``SK_PASS`` signifies that the socket lookup should continue on to regular
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) hashtable-based lookup, while ``SK_DROP`` causes the transport layer to drop the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56) packet.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58) A BPF sk_lookup program can also select a socket to receive the packet by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59) calling the ``bpf_sk_assign()`` BPF helper. Typically, the program looks up a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60) socket in a map holding sockets, such as ``SOCKMAP`` or ``SOCKHASH``, and passes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61) a ``struct bpf_sock *`` to the ``bpf_sk_assign()`` helper to record the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) selection. The selection only takes effect if the program terminates with the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63) ``SK_PASS`` code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65) When multiple programs are attached, the end result is determined from the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66) return codes of all the programs according to the following rules:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68) 1. If any program returned ``SK_PASS`` and selected a valid socket, the socket
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69)    is used as the result of the socket lookup.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70) 2. If more than one program returned ``SK_PASS`` and selected a socket, the last
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71)    selection takes effect.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72) 3. If any program returned ``SK_DROP``, and no program returned ``SK_PASS`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73)    selected a socket, socket lookup fails.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74) 4. If all programs returned ``SK_PASS`` and none of them selected a socket,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75)    socket lookup continues on.
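
To make the four rules concrete, the following user-space model combines the
per-program outcomes in attachment order. It is only an illustration of the
rules as stated above, not the kernel code: all type and function names are
hypothetical, and a socket is represented by a plain integer ID where 0 means
"no socket selected".

.. code-block:: c

    #include <stddef.h>

    /* Stand-ins for SK_DROP and SK_PASS in this model. */
    enum verdict { VERDICT_DROP = 0, VERDICT_PASS = 1 };

    /* Outcome of one attached program. */
    struct prog_result {
        enum verdict verdict;
        int selected_sk;    /* hypothetical socket ID, 0 if none */
    };

    /* Combined outcome of all attached programs. */
    struct lookup_result {
        enum verdict verdict;
        int selected_sk;
    };

    static struct lookup_result
    combine_results(const struct prog_result *res, size_t n)
    {
        struct lookup_result out = { VERDICT_PASS, 0 };
        int all_pass = 1;
        size_t i;

        for (i = 0; i < n; i++) {
            if (res[i].verdict == VERDICT_PASS && res[i].selected_sk)
                /* Rules 1 and 2: a later selection overrides. */
                out.selected_sk = res[i].selected_sk;
            else if (res[i].verdict == VERDICT_DROP)
                all_pass = 0;
        }

        /* Rule 3: a drop wins only if nothing was selected.
         * Rule 4: all pass, no selection, lookup continues. */
        out.verdict = (out.selected_sk || all_pass) ? VERDICT_PASS
                                                    : VERDICT_DROP;
        return out;
    }

In this model a ``VERDICT_PASS`` result with a non-zero ``selected_sk`` means
the selected socket is used, ``VERDICT_PASS`` with ``selected_sk == 0`` means
the regular lookup continues, and ``VERDICT_DROP`` means the lookup fails.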
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77) API
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78) ===
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80) In its context, an instance of ``struct bpf_sk_lookup``, a BPF sk_lookup program
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81) receives information about the packet that triggered the socket lookup. Namely:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83) * IP version (``AF_INET`` or ``AF_INET6``),
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84) * L4 protocol identifier (``IPPROTO_TCP`` or ``IPPROTO_UDP``),
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85) * source and destination IP address,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86) * source and destination L4 port,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87) * the socket that has been selected with ``bpf_sk_assign()``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89) Refer to the ``struct bpf_sk_lookup`` declaration in the ``linux/bpf.h`` user API
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90) header, and the `bpf-helpers(7)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91) <https://man7.org/linux/man-pages/man7/bpf-helpers.7.html>`_ man-page section
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92) for ``bpf_sk_assign()`` for details.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94) Example
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95) =======
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97) See ``tools/testing/selftests/bpf/prog_tests/sk_lookup.c`` for the reference
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98) implementation.