^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) .. SPDX-License-Identifier: GPL-2.0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) ===========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4) Packet MMAP
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) ===========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) Abstract
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8) ========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10) This file documents the mmap() facility available with the PACKET
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11) socket interface on 2.4/2.6/3.x kernels. This type of socket is used to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) i) capture network traffic with utilities like tcpdump,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14) ii) transmit network traffic, or for any other task that needs raw
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15)     access to a network interface.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17) A howto can be found at:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) https://sites.google.com/site/packetmmap/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21) Please send your comments to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22) - Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) - Johann Baudy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25) Why use PACKET_MMAP
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26) ===================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28) In Linux 2.4/2.6/3.x, if PACKET_MMAP is not enabled, the capture process is very
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29) inefficient. It uses very limited buffers and requires one system call to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30) capture each packet, or two if you want to get the packet's timestamp
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31) (as libpcap always does).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) On the other hand, PACKET_MMAP is very efficient. PACKET_MMAP provides a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34) size-configurable circular buffer mapped in user space that can be used to either
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35) send or receive packets. This way, reading packets just requires waiting for them;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) most of the time there is no need to issue a single system call. Concerning
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37) transmission, multiple packets can be sent through one system call to get the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) highest bandwidth. Using a shared buffer between the kernel and the user
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39) also has the benefit of minimizing packet copies.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41) Using PACKET_MMAP improves the performance of the capture and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42) transmission process, but it isn't everything. At least, if you are capturing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43) at high speeds (this is relative to the CPU speed), you should check whether the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) device driver of your network interface card supports some sort of interrupt
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45) load mitigation or (even better) NAPI, and make sure it is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) enabled. For transmission, check the MTU (Maximum Transmission Unit) used and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47) supported by the devices on your network. CPU IRQ pinning of your network interface
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48) card can also be an advantage.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) How to use mmap() to improve capture process
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51) ============================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53) From the user standpoint, you should use the higher level libpcap library, which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) is a de facto standard, portable across nearly all operating systems
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) including Win32.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57) Packet MMAP support was integrated into libpcap around the time of version 1.3.0;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58) TPACKET_V3 support was added in version 1.5.0.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60) How to use mmap() directly to improve capture process
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61) =====================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63) From the system call standpoint, the use of PACKET_MMAP involves
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64) the following process::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67)     [setup]     socket() -------> creation of the capture socket
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68)                 setsockopt() ---> allocation of the circular buffer (ring)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69)                                   option: PACKET_RX_RING
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70)                 mmap() ---------> mapping of the allocated buffer to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71)                                   user process
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73)     [capture]   poll() ---------> to wait for incoming packets
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75)     [shutdown]  close() --------> destruction of the capture socket and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76)                                   deallocation of all associated
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77)                                   resources.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80) Socket creation and destruction are straightforward, and are done
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81) the same way with or without PACKET_MMAP::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83)     int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85) where mode is SOCK_RAW for the raw interface, where link-level
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86) information can be captured, or SOCK_DGRAM for the cooked
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87) interface, where link-level information capture is not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88) supported and a link-level pseudo-header is provided
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89) by the kernel.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91) The destruction of the socket and all associated resources
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92) is done by a simple call to close(fd).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94) As without PACKET_MMAP, it is possible to use one socket
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95) for capture and transmission. This can be done by mapping the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96) allocated RX and TX buffer rings with a single mmap() call.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97) See "Mapping and use of the circular buffer (ring)".
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99) The following sections describe the PACKET_MMAP settings and their constraints,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) as well as the mapping of the circular buffer in the user process and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) the use of this buffer.
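
The setup and shutdown steps described above can be sketched in C as follows.
This is a minimal sketch rather than a reference implementation: the rx_ring
structure and the rx_ring_open()/rx_ring_close() helpers are hypothetical
names, the ring geometry (4 blocks of one page each, 2048-byte frames) is an
arbitrary assumption, and opening a packet socket normally requires
CAP_NET_RAW, so the call is expected to fail for unprivileged users.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper type: the capture socket and its mapped ring. */
struct rx_ring {
    int fd;
    void *map;
    size_t map_len;
};

/* [setup] socket() + setsockopt(PACKET_RX_RING) + mmap().
 * Returns 0 on success, -1 on failure with errno set. */
int rx_ring_open(struct rx_ring *r)
{
    struct tpacket_req req;
    int saved;

    assert(r != NULL);

    memset(&req, 0, sizeof(req));
    req.tp_block_size = (unsigned int)sysconf(_SC_PAGESIZE);
    req.tp_block_nr = 4;        /* assumed value */
    req.tp_frame_size = 2048;   /* assumed value */
    req.tp_frame_nr = req.tp_block_size / req.tp_frame_size * req.tp_block_nr;

    r->map = MAP_FAILED;
    r->map_len = 0;
    r->fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (r->fd < 0)
        return -1;

    if (setsockopt(r->fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req)) < 0)
        goto fail;

    r->map_len = (size_t)req.tp_block_size * req.tp_block_nr;
    r->map = mmap(NULL, r->map_len, PROT_READ | PROT_WRITE, MAP_SHARED,
                  r->fd, 0);
    if (r->map == MAP_FAILED)
        goto fail;
    return 0;

fail:
    saved = errno;              /* keep the errno of the failing call */
    close(r->fd);
    r->fd = -1;
    errno = saved;
    return -1;
}

/* [shutdown] unmap the ring; close() then destroys the socket. */
void rx_ring_close(struct rx_ring *r)
{
    if (r->map != MAP_FAILED)
        munmap(r->map, r->map_len);
    if (r->fd >= 0)
        close(r->fd);
    r->fd = -1;
    r->map = MAP_FAILED;
}
```

Between rx_ring_open() and rx_ring_close(), a capture loop would call poll()
on the socket descriptor and read frames from the mapped region.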
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) How to use mmap() directly to improve transmission process
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) ==========================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) The transmission process is similar to capture, as shown below::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107)     [setup]         socket() -------> creation of the transmission socket
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108)                     setsockopt() ---> allocation of the circular buffer (ring)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109)                                       option: PACKET_TX_RING
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110)                     bind() ---------> bind transmission socket with a network interface
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111)                     mmap() ---------> mapping of the allocated buffer to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112)                                       user process
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114)     [transmission]  poll() ---------> wait for free packets (optional)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115)                     send() ---------> send all packets that are set as ready in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116)                                       the ring
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117)                                       The flag MSG_DONTWAIT can be used to return
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118)                                       before end of transfer.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120)     [shutdown]      close() --------> destruction of the transmission socket and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121)                                       deallocation of all associated resources.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) Socket creation and destruction are also straightforward, and are done
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) the same way as for capturing, described in the previous section::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126)     int fd = socket(PF_PACKET, mode, 0);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) The protocol can optionally be 0 if we only want to transmit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) via this socket, which avoids an expensive call to packet_rcv().
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) In this case, you also need to bind(2) the TX_RING with sll_protocol = 0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) set. Otherwise, use htons(ETH_P_ALL) or any other protocol, for example.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) Binding the socket to your network interface is mandatory (with zero copy) to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) know the header size of frames used in the circular buffer.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) As with capture, each frame contains two parts::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138)      --------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139)     | struct tpacket_hdr | Header. It contains the status
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140)     |                    | of this frame
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141)     |--------------------|
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142)     | data buffer        |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143)     .                    .  Data that will be sent over the network interface.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144)     .                    .
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145)      --------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) bind() associates the socket with your network interface through the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) sll_ifindex field of struct sockaddr_ll.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150) Initialization example::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152)     struct sockaddr_ll my_addr;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153)     struct ifreq s_ifr;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154)     ...
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156)     strncpy(s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158)     /* get interface index of eth0 (fd is the packet socket) */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159)     ioctl(fd, SIOCGIFINDEX, &s_ifr);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161)     /* fill sockaddr_ll struct to prepare binding */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162)     my_addr.sll_family = AF_PACKET;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163)     my_addr.sll_protocol = htons(ETH_P_ALL);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164)     my_addr.sll_ifindex = s_ifr.ifr_ifindex;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166)     /* bind socket to eth0 */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167)     bind(fd, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) A complete tutorial is available at: https://sites.google.com/site/packetmmap/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171) By default, the user should put data at::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173)     frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) So, whatever you choose for the socket mode (SOCK_DGRAM or SOCK_RAW),
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) the beginning of the user data will be at::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178)     frame base + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180) If you wish to put user data at a custom offset from the beginning of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181) the frame (for payload alignment with SOCK_RAW mode for instance) you
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182) can set tp_net (with SOCK_DGRAM) or tp_mac (with SOCK_RAW). In order
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183) to make this work it must be enabled previously with setsockopt()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184) and the PACKET_TX_HAS_OFF option.
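
The default data offset can be illustrated with a small helper that copies one
packet into a TX frame. This is only a sketch under stated assumptions:
tx_data_offset() and tx_frame_fill() are hypothetical names, the frame is
assumed to use the default offset (no PACKET_TX_HAS_OFF), and marking the
frame as ready for send() is deliberately left out.

```c
#include <assert.h>
#include <linux/if_packet.h>
#include <stddef.h>
#include <string.h>

/* Default offset of user data inside a TX frame, as given in the text. */
size_t tx_data_offset(void)
{
    return TPACKET_HDRLEN - sizeof(struct sockaddr_ll);
}

/* Copy one packet into a TX frame (hypothetical helper).
 * Returns 0 on success, -1 if the packet does not fit in the frame. */
int tx_frame_fill(void *frame, size_t frame_size, const void *pkt, size_t len)
{
    struct tpacket_hdr *hdr = frame;
    size_t off = tx_data_offset();

    assert(frame != NULL && pkt != NULL);

    if (frame_size < off || len > frame_size - off)
        return -1;

    memcpy((unsigned char *)frame + off, pkt, len);
    hdr->tp_len = (unsigned int)len;    /* length of the data to transmit */
    return 0;
}
```

Because TPACKET_HDRLEN is defined as the aligned header size plus
sizeof(struct sockaddr_ll), the offset returned by tx_data_offset() equals
TPACKET_ALIGN(sizeof(struct tpacket_hdr)), matching the second formula above.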
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186) PACKET_MMAP settings
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187) ====================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189) PACKET_MMAP is set up from user-level code with a call like:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191)  - Capture process::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193)      setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req))
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195)  - Transmission process::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197)      setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req))
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199) The most significant argument in the previous call is the req parameter,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200) which must have the following structure::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202)     struct tpacket_req
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203)     {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204)         unsigned int    tp_block_size;  /* Minimal size of contiguous block */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205)         unsigned int    tp_block_nr;    /* Number of blocks */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 206)         unsigned int    tp_frame_size;  /* Size of frame */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 207)         unsigned int    tp_frame_nr;    /* Total number of frames */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 208)     };
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 209)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 210) This structure is defined in /usr/include/linux/if_packet.h and establishes a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 211) circular buffer (ring) of unswappable memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 212) Because the ring is mapped into the capture process, captured frames and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 213) related meta-information like timestamps can be read without a system call.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 214)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 215) Frames are grouped in blocks. Each block is a physically contiguous
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 216) region of memory and holds tp_block_size/tp_frame_size frames. The total number
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 217) of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 218)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 219)     frames_per_block = tp_block_size/tp_frame_size
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 220)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 221) Indeed, packet_set_ring checks that the following condition is true::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 222)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 223)     frames_per_block * tp_block_nr == tp_frame_nr
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 224)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 225) Let's see an example, with the following values::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 226)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 227)     tp_block_size= 4096
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 228)     tp_frame_size= 2048
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 229)     tp_block_nr  = 4
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 230)     tp_frame_nr  = 8
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 231)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 232) we will get the following buffer structure::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 233)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 234)           block #1                 block #2
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 235)     +---------+---------+    +---------+---------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 236)     | frame 1 | frame 2 |    | frame 3 | frame 4 |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 237)     +---------+---------+    +---------+---------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 238)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 239)           block #3                 block #4
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 240)     +---------+---------+    +---------+---------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 241)     | frame 5 | frame 6 |    | frame 7 | frame 8 |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 242)     +---------+---------+    +---------+---------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 243)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 244) A frame can be of any size, with the only condition that it fits in a block. A block
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 245) can only hold an integer number of frames; in other words, a frame cannot
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 246) span two blocks, so there are some details you have to take into
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 247) account when choosing the frame_size. See "Mapping and use of the circular
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 248) buffer (ring)".
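
Since tp_frame_nr is redundant, one way to avoid inconsistent requests is to
derive it from the other three fields. The helper below is a sketch of that
idea; ring_req_init() is a hypothetical name and is not part of any kernel or
library API.

```c
#include <assert.h>
#include <linux/if_packet.h>
#include <string.h>

/* Fill a struct tpacket_req so that tp_frame_nr always matches
 * frames_per_block * tp_block_nr (hypothetical helper).
 * Returns 0 on success, -1 if a frame would not fit in a block. */
int ring_req_init(struct tpacket_req *req, unsigned int block_size,
                  unsigned int block_nr, unsigned int frame_size)
{
    unsigned int frames_per_block;

    assert(req != NULL);

    if (frame_size == 0 || frame_size > block_size)
        return -1;

    /* Integer division: a frame never spans two blocks. */
    frames_per_block = block_size / frame_size;

    memset(req, 0, sizeof(*req));
    req->tp_block_size = block_size;
    req->tp_block_nr = block_nr;
    req->tp_frame_size = frame_size;
    req->tp_frame_nr = frames_per_block * block_nr;
    return 0;
}
```

With the values from the example above (block size 4096, 4 blocks, frame size
2048), the helper sets tp_frame_nr to 8.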
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 249)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 250) PACKET_MMAP setting constraints
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 251) ===============================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 252)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 253) In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 branch),
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 254) the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit architecture or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 255) 16384 in a 64 bit architecture. For information on these kernel versions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 256) see http://pusa.uv.es/~ulisses/packet_mmap/packet_mmap.pre-2.4.26_2.6.5.txt
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 257)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 258) Block size limit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 259) ----------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 260)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 261) As stated earlier, each block is a contiguous physical region of memory. These
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 262) memory regions are allocated with calls to the __get_free_pages() function. As
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 263) the name indicates, this function allocates pages of memory, and the second
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 264) argument is the "order", i.e. the base-2 logarithm of the number of pages. That is,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 265) (for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 266) order=2 ==> 16384 bytes, etc. The maximum size of a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 267) region allocated by __get_free_pages is determined by the MAX_ORDER macro. More
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 268) precisely, the limit can be calculated as::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 269)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 270)     PAGE_SIZE << MAX_ORDER
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 271)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 272)     In an i386 architecture PAGE_SIZE is 4096 bytes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 273)     In a 2.4/i386 kernel MAX_ORDER is 10
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 274)     In a 2.6/i386 kernel MAX_ORDER is 11
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 275)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 276) So __get_free_pages can allocate as much as 4 MiB or 8 MiB in a 2.4/2.6 kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 277) respectively, with an i386 architecture.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 278)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 279) User space programs can include /usr/include/sys/user.h and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 280) /usr/include/linux/mmzone.h to get the PAGE_SIZE and MAX_ORDER declarations.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 281)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 282) The page size can also be determined dynamically with the getpagesize(2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 283) system call.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 284)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 285) Block number limit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 286) ------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 287)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 288) To understand the constraints of PACKET_MMAP, we have to see the structure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 289) used to hold the pointers to each block.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 290)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 291) Currently, this structure is a vector called pg_vec, dynamically allocated
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 292) with kmalloc; its size limits the number of blocks that can be allocated::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 293)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 294)     +---+---+---+---+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 295)     | x | x | x | x |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 296)     +---+---+---+---+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 297)       |   |   |   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 298)       |   |   |   v
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 299)       |   |   v  block #4
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 300)       |   v  block #3
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 301)       v  block #2
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 302)      block #1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 303)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 304) kmalloc allocates any number of bytes of physically contiguous memory from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 305) a pool of predetermined sizes. This pool of memory is maintained by the slab
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 306) allocator, which is ultimately responsible for doing the allocation and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 307) hence imposes the maximum amount of memory that kmalloc can allocate.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 308)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 309) In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 bytes. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 310) predetermined sizes that kmalloc uses can be checked in the "size-<bytes>"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 311) entries of /proc/slabinfo.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 312)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 313) In a 32 bit architecture, pointers are 4 bytes long, so the total number of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 314) pointers to blocks is::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 315)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 316)     131072/4 = 32768 blocks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 317)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 318) PACKET_MMAP buffer size calculator
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 319) ==================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 320)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 321) Definitions:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 322)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 323) ============== ================================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 324) <size-max>     is the maximum size allocable with kmalloc
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 325)                (see /proc/slabinfo)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 326) <pointer size> depends on the architecture -- ``sizeof(void *)``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 327) <page size>    depends on the architecture -- PAGE_SIZE or getpagesize (2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 328) <max-order>    is the value defined with MAX_ORDER
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 329) <frame size>   is an upper bound on a frame's capture size (more on this later)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 330) ============== ================================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 331)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 332) From these definitions we will derive::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 333)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 334)     <block number> = <size-max>/<pointer size>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 335)     <block size> = <page size> << <max-order>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 336)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 337) So, the max buffer size is::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 338)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 339)     <block number> * <block size>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 340)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 341) and the number of frames is::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 342)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 343)     <block number> * <block size> / <frame size>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 344)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 345) Suppose the following parameters, which apply to a 2.6 kernel and an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 346) i386 architecture::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 347)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 348)     <size-max> = 131072 bytes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 349)     <pointer size> = 4 bytes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 350)     <page size> = 4096 bytes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 351)     <max-order> = 11
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 352)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 353) and a value for <frame size> of 2048 bytes. These parameters will yield::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 354)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 355)     <block number> = 131072/4 = 32768 blocks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 356)     <block size> = 4096 << 11 = 8 MiB.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 357)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 358) and hence the buffer will have a 262144 MiB size. So it can hold
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 359) 262144 MiB / 2048 bytes = 134217728 frames.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 360)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 361) Actually, this buffer size is not possible with an i386 architecture.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 362) Remember that the memory is allocated in kernel space; in the case of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 363) an i386 kernel, the memory size is limited to 1GiB.
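
The calculation above can be reproduced with a few lines of C. This is a
sketch for checking the arithmetic only: struct ring_limits and
ring_limits_calc() are hypothetical names, and the result is a theoretical
upper bound, not a size the kernel is guaranteed to grant.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the derived limits. */
struct ring_limits {
    uint64_t block_number;  /* <size-max> / <pointer size> */
    uint64_t block_size;    /* <page size> << <max-order> */
    uint64_t buffer_size;   /* <block number> * <block size> */
    uint64_t frames;        /* buffer size / <frame size> */
};

struct ring_limits ring_limits_calc(uint64_t size_max, uint64_t pointer_size,
                                    uint64_t page_size, unsigned int max_order,
                                    uint64_t frame_size)
{
    struct ring_limits l;

    assert(pointer_size > 0 && frame_size > 0 && max_order < 32);

    l.block_number = size_max / pointer_size;
    l.block_size = page_size << max_order;
    l.buffer_size = l.block_number * l.block_size;
    l.frames = l.buffer_size / frame_size;
    return l;
}
```

Calling ring_limits_calc(131072, 4, 4096, 11, 2048) gives 32768 blocks of
8 MiB each, a 262144 MiB buffer and 134217728 frames, as in the text.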
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 364)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 365) No memory allocation is freed until the socket is closed. The memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 366) allocations are done with GFP_KERNEL priority, which basically means that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 367) the allocation can wait and swap other processes' memory in order to allocate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 368) the necessary memory, so normally the limits can be reached.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 369)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 370) Other constraints
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 371) -----------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 372)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 373) If you check the source code you will see that what is drawn here as a frame
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 374) is not only the link-level frame. At the beginning of each frame there is a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 375) header called struct tpacket_hdr, used in PACKET_MMAP to hold the link-level frame's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 376) meta-information, like the timestamp. So what is drawn here as a frame is really
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 377) the following (from include/linux/if_packet.h)::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 378)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 379)     /*
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 380)       Frame structure:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 381)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 382)       - Start. Frame must be aligned to TPACKET_ALIGNMENT=16
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 383)       - struct tpacket_hdr
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 384)       - pad to TPACKET_ALIGNMENT=16
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 385)       - struct sockaddr_ll
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 386)       - Gap, chosen so that packet data (Start+tp_net) aligns to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 387)         TPACKET_ALIGNMENT=16
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 388)       - Start+tp_mac: [ Optional MAC header ]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 389)       - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 390)       - Pad to align to TPACKET_ALIGNMENT=16
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 391)     */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 392)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 393) The following conditions are checked in packet_set_ring:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 394)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 395) - tp_block_size must be a multiple of PAGE_SIZE (1)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 396) - tp_frame_size must be greater than TPACKET_HDRLEN (obvious)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 397) - tp_frame_size must be a multiple of TPACKET_ALIGNMENT
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 398) - tp_frame_nr must be exactly frames_per_block*tp_block_nr
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 399)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 400) Note that tp_block_size should be chosen to be a power of two or there will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 401) be a waste of memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 402)
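As a rough illustration, the checks listed above can be mirrored in user space
before calling setsockopt(). This is only a sketch under stated assumptions:
struct ex_ring_req, ex_ring_req_ok() and ex_is_power_of_two() are hypothetical
names, the header length is passed in by the caller instead of being taken from
TPACKET_HDRLEN, and the authoritative checks remain those in packet_set_ring.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors TPACKET_ALIGNMENT=16 from the frame layout comment above. */
#define EX_TPACKET_ALIGNMENT 16u

/* Hypothetical stand-in for the four struct tpacket_req fields. */
struct ex_ring_req {
	unsigned int block_size;	/* tp_block_size */
	unsigned int block_nr;		/* tp_block_nr */
	unsigned int frame_size;	/* tp_frame_size */
	unsigned int frame_nr;		/* tp_frame_nr */
};

/* Returns true if the request passes the four documented conditions. */
static bool ex_ring_req_ok(const struct ex_ring_req *r,
			   unsigned int page_size, unsigned int hdrlen)
{
	unsigned int frames_per_block;

	assert(page_size != 0);

	/* tp_block_size must be a (non-zero) multiple of PAGE_SIZE */
	if (r->block_size == 0 || r->block_size % page_size != 0)
		return false;
	/* tp_frame_size must be greater than the header length */
	if (r->frame_size <= hdrlen)
		return false;
	/* tp_frame_size must be a multiple of TPACKET_ALIGNMENT */
	if (r->frame_size % EX_TPACKET_ALIGNMENT != 0)
		return false;

	/* defensive extra check: at least one frame must fit in a block */
	frames_per_block = r->block_size / r->frame_size;
	if (frames_per_block == 0)
		return false;

	/* tp_frame_nr must be exactly frames_per_block * tp_block_nr */
	return (unsigned long long)frames_per_block * r->block_nr == r->frame_nr;
}

/* Helper for the note above: power-of-two block sizes avoid wasted memory. */
static bool ex_is_power_of_two(unsigned int v)
{
	return v != 0 && (v & (v - 1)) == 0;
}
```

For example, with 4096-byte pages, a request for 4 blocks of 4096 bytes holding
2048-byte frames has to declare exactly 8 frames to pass.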
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 403) Mapping and use of the circular buffer (ring)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 404) ---------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 405)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 406) The mapping of the buffer in the user process is done with the conventional
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 407) mmap function. Even though the circular buffer is composed of several physically
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 408) discontiguous blocks of memory, they are contiguous in user space, hence
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 409) just one call to mmap is needed::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 410)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 411) mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 412)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 413) If tp_frame_size is a divisor of tp_block_size, frames will be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 414) contiguously spaced by tp_frame_size bytes. If not, there will be a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 415) gap after every tp_block_size/tp_frame_size frames, at the end of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 416) each block. This is because a frame cannot span two
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 417) blocks.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 418)
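The spacing rule above can be sketched as a small offset calculation. The helper
name ex_frame_offset() is hypothetical and not part of any kernel header; it
assumes the ring was mapped with a single mmap call and simply applies the rule
that frames never cross a block boundary.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Byte offset of frame number idx from the start of the mapped ring.
 * Frames are packed inside a block; any remainder at the end of a block
 * is left unused, so the next frame starts at the next block.
 */
static size_t ex_frame_offset(size_t block_size, size_t frame_size, size_t idx)
{
	size_t frames_per_block;

	assert(frame_size != 0);
	frames_per_block = block_size / frame_size;
	assert(frames_per_block != 0);

	return (idx / frames_per_block) * block_size +
	       (idx % frames_per_block) * frame_size;
}
```

With 4096-byte blocks and 1536-byte frames, only two frames fit per block, so
frame 2 starts at offset 4096 rather than 3072, leaving a 1024-byte gap at the
end of the first block.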
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 419) To use one socket for capture and transmission, the mapping of both the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 420) RX and TX buffer ring has to be done with one call to mmap::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 421)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 422) ...
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 423) setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &foo, sizeof(foo));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 424) setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &bar, sizeof(bar));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 425) ...
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 426) rx_ring = mmap(0, size * 2, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 427) tx_ring = rx_ring + size;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 428)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 429) RX must be the first as the kernel maps the TX ring memory right
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 430) after the RX one.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 431)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 432) At the beginning of each frame there is a status field (see
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 433) struct tpacket_hdr). If this field is 0, the frame is ready
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 434) to be used by the kernel. If not, there is a frame the user can read
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 435) and the following flags apply:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 436)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 437) Capture process
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 438) ^^^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 439)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 440) From include/linux/if_packet.h::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 441)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 442) #define TP_STATUS_COPY (1 << 1)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 443) #define TP_STATUS_LOSING (1 << 2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 444) #define TP_STATUS_CSUMNOTREADY (1 << 3)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 445) #define TP_STATUS_CSUM_VALID (1 << 7)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 446)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 447) ====================== =======================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 448) TP_STATUS_COPY This flag indicates that the frame (and associated
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 449) meta information) has been truncated because it's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 450) larger than tp_frame_size. This packet can be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 451) read entirely with recvfrom().
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 452)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 453) In order to make this work it must be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 454) enabled beforehand with setsockopt() and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 455) the PACKET_COPY_THRESH option.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 456)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 457) The number of frames that can be buffered to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 458) be read with recvfrom is limited like a normal socket.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 459) See the SO_RCVBUF option in the socket (7) man page.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 460)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 461) TP_STATUS_LOSING indicates there were packet drops since the last time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 462) statistics were checked with getsockopt() and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 463) the PACKET_STATISTICS option.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 464)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 465) TP_STATUS_CSUMNOTREADY currently used for outgoing IP packets whose
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 466) checksum will be computed in hardware. So while
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 467) reading the packet we should not try to check the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 468) checksum.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 469)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 470) TP_STATUS_CSUM_VALID This flag indicates that at least the transport
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 471) header checksum of the packet has been already
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 472) validated on the kernel side. If the flag is not set
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 473) then we are free to check the checksum by ourselves
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 474) provided that TP_STATUS_CSUMNOTREADY is also not set.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 475) ====================== =======================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 476)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 477) For convenience there are also the following defines::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 478)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 479) #define TP_STATUS_KERNEL 0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 480) #define TP_STATUS_USER 1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 481)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 482) The kernel initializes all frames to TP_STATUS_KERNEL. When the kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 483) receives a packet it puts it in the buffer and updates the status with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 484) at least the TP_STATUS_USER flag. Then the user can read the packet.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 485) Once the packet is read the user must zero the status field, so the kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 486) can use that frame buffer again.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 487)
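The hand-over described above might look roughly like the following in user
space. This is a sketch, not the reference implementation: the EX_ constants
only copy the values quoted in this section (real code would include
linux/if_packet.h), the helper names are hypothetical, and the use of the
GCC/Clang builtin __sync_synchronize() as a memory barrier is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Values copied from the defines quoted in this section. */
#define EX_TP_STATUS_KERNEL		0ul
#define EX_TP_STATUS_USER		(1ul << 0)
#define EX_TP_STATUS_COPY		(1ul << 1)
#define EX_TP_STATUS_LOSING		(1ul << 2)
#define EX_TP_STATUS_CSUMNOTREADY	(1ul << 3)
#define EX_TP_STATUS_CSUM_VALID		(1ul << 7)

/*
 * The user may verify the checksum only if the kernel has not already
 * validated it and it is not still pending in hardware.
 */
static bool ex_should_verify_csum(unsigned long status)
{
	return !(status & (EX_TP_STATUS_CSUM_VALID | EX_TP_STATUS_CSUMNOTREADY));
}

/*
 * If the frame belongs to the user, report its status flags through *seen
 * and hand the frame back to the kernel. Returns false if the kernel
 * still owns the frame.
 */
static bool ex_rx_consume(volatile unsigned long *tp_status,
			  unsigned long *seen)
{
	unsigned long status;

	assert(tp_status != NULL && seen != NULL);
	status = *tp_status;
	if (!(status & EX_TP_STATUS_USER))
		return false;

	*seen = status;		/* a real reader would process the packet here */
	__sync_synchronize();	/* finish all reads before releasing the frame */
	*tp_status = EX_TP_STATUS_KERNEL;
	return true;
}
```

A reader would typically call ex_rx_consume() on the current frame and, when it
returns false, wait for the kernel to fill the frame.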
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 488) The user can use poll (any other variant should apply too) to check if new
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 489) packets are in the ring::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 490)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 491) struct pollfd pfd;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 492)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 493) pfd.fd = fd;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 494) pfd.revents = 0;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 495) pfd.events = POLLIN|POLLRDNORM|POLLERR;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 496)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 497) if (status == TP_STATUS_KERNEL)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 498) retval = poll(&pfd, 1, timeout);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 499)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 500) It does not incur a race condition to first check the status value and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 501) then poll for frames.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 502)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 503) Transmission process
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 504) ^^^^^^^^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 505)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 506) Those defines are also used for transmission::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 507)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 508) #define TP_STATUS_AVAILABLE 0 // Frame is available
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 509) #define TP_STATUS_SEND_REQUEST 1 // Frame will be sent on next send()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 510) #define TP_STATUS_SENDING 2 // Frame is currently in transmission
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 511) #define TP_STATUS_WRONG_FORMAT 4 // Frame format is not correct
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 512)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 513) First, the kernel initializes all frames to TP_STATUS_AVAILABLE. To send a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 514) packet, the user fills a data buffer of an available frame, sets tp_len to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 515) current data buffer size and sets its status field to TP_STATUS_SEND_REQUEST.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 516) This can be done on multiple frames. Once the user is ready to transmit, it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 517) calls send(). Then all buffers with status equal to TP_STATUS_SEND_REQUEST are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 518) forwarded to the network device. The kernel updates each status of sent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 519) frames with TP_STATUS_SENDING until the end of transfer.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 520)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 521) At the end of each transfer, buffer status returns to TP_STATUS_AVAILABLE.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 522)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 523) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 524)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 525) header->tp_len = in_i_size;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 526) header->tp_status = TP_STATUS_SEND_REQUEST;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 527) retval = send(this->socket, NULL, 0, 0);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 528)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 529) The user can also use poll() to check if a buffer is available:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 530)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 531) (status == TP_STATUS_SENDING)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 532)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 533) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 534)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 535) struct pollfd pfd;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 536) pfd.fd = fd;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 537) pfd.revents = 0;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 538) pfd.events = POLLOUT;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 539) retval = poll(&pfd, 1, timeout);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 540)
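Putting the transmission steps together, queueing one packet into a TPACKET_V1
TX frame could be sketched as below. The helper ex_tx_queue() and the macro
EX_TX_DATA_OFFSET are hypothetical names. The sketch assumes the default layout
in which the payload starts at TPACKET_HDRLEN - sizeof(struct sockaddr_ll) from
the beginning of the frame, and it uses the GCC/Clang builtin
__sync_synchronize() as a barrier. The caller still has to call send()
afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <linux/if_packet.h>

/* Assumed payload offset inside a TPACKET_V1 TX frame (default layout). */
#define EX_TX_DATA_OFFSET (TPACKET_HDRLEN - sizeof(struct sockaddr_ll))

/*
 * Copy payload into an available TX frame and mark it for sending.
 * Returns false if the frame is not available or the payload does not fit.
 */
static bool ex_tx_queue(void *frame, size_t frame_size,
			const void *payload, size_t len)
{
	struct tpacket_hdr *hdr = frame;

	assert(frame != NULL && payload != NULL);

	if (hdr->tp_status != TP_STATUS_AVAILABLE)
		return false;	/* not yet returned by the kernel */
	if (frame_size < EX_TX_DATA_OFFSET ||
	    len > frame_size - EX_TX_DATA_OFFSET)
		return false;

	memcpy((char *)frame + EX_TX_DATA_OFFSET, payload, len);
	hdr->tp_len = (unsigned int)len;
	__sync_synchronize();	/* publish data and length before the status */
	hdr->tp_status = TP_STATUS_SEND_REQUEST;
	return true;
}
```

Several frames can be queued this way before a single send(fd, NULL, 0, 0) call
hands all of them to the kernel.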
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 541) What TPACKET versions are available and when to use them?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 542) =========================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 543)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 544) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 545)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 546) int val = tpacket_version;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 547) setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 548) getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, &(socklen_t){ sizeof(val) });
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 549)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 550) where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 551)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 552) TPACKET_V1:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 553) - Default if not otherwise specified by setsockopt(2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 554) - RX_RING, TX_RING available
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 555)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 556) TPACKET_V1 --> TPACKET_V2:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 557) - Made 64 bit clean due to unsigned long usage in TPACKET_V1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 558) structures, thus this also works on 64 bit kernel with 32 bit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 559) userspace and the like
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 560) - Timestamp resolution in nanoseconds instead of microseconds
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 561) - RX_RING, TX_RING available
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 562) - VLAN metadata information available for packets
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 563) (TP_STATUS_VLAN_VALID, TP_STATUS_VLAN_TPID_VALID),
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 564) in the tpacket2_hdr structure:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 565)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 566) - TP_STATUS_VLAN_VALID bit being set into the tp_status field indicates
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 567) that the tp_vlan_tci field has valid VLAN TCI value
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 568) - TP_STATUS_VLAN_TPID_VALID bit being set into the tp_status field
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 569) indicates that the tp_vlan_tpid field has valid VLAN TPID value
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 570)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 571) - How to switch to TPACKET_V2:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 572)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 573) 1. Replace struct tpacket_hdr by struct tpacket2_hdr
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 574) 2. Query header len and save
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 575) 3. Set protocol version to 2, set up ring as usual
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 576) 4. For getting the sockaddr_ll,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 577) use ``(void *)hdr + TPACKET_ALIGN(hdrlen)`` instead of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 578) ``(void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 579)
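Steps 2 and 4 of the list above might be sketched as follows. The function
names ex_query_hdrlen() and ex_frame_sll() are hypothetical. The sketch assumes
the PACKET_HDRLEN socket option, where the caller passes the TPACKET version in
the option value and the kernel replaces it with the header length.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

/*
 * Step 2: ask the kernel for the frame header length of the given
 * TPACKET version. Returns -1 if getsockopt() fails.
 */
static int ex_query_hdrlen(int fd, int version)
{
	int val = version;	/* in: version, out: header length */
	socklen_t len = sizeof(val);

	if (getsockopt(fd, SOL_PACKET, PACKET_HDRLEN, &val, &len) < 0)
		return -1;
	return val;
}

/* Step 4: locate the sockaddr_ll that follows the aligned frame header. */
static struct sockaddr_ll *ex_frame_sll(void *hdr, int hdrlen)
{
	assert(hdr != NULL && hdrlen >= 0);
	return (struct sockaddr_ll *)((char *)hdr + TPACKET_ALIGN(hdrlen));
}
```

The value returned by ex_query_hdrlen(fd, TPACKET_V2) would be saved once and
then passed to ex_frame_sll() for every frame.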
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 580) TPACKET_V2 --> TPACKET_V3:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 581) - Flexible buffer implementation for RX_RING:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 582) 1. Blocks can be configured with non-static frame-size
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 583) 2. Read/poll is at a block-level (as opposed to packet-level)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 584) 3. Added poll timeout to avoid indefinite user-space wait
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 585) on idle links
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 586) 4. Added user-configurable knobs:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 587)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 588) 4.1 block::timeout
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 589) 4.2 tpkt_hdr::sk_rxhash
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 590)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 591) - RX Hash data available in user space
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 592) - TX_RING semantics are conceptually similar to TPACKET_V2;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 593) use tpacket3_hdr instead of tpacket2_hdr, and TPACKET3_HDRLEN
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 594) instead of TPACKET2_HDRLEN. In the current implementation,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 595) the tp_next_offset field in the tpacket3_hdr MUST be set to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 596) zero, indicating that the ring does not hold variable sized frames.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 597) Packets with non-zero values of tp_next_offset will be dropped.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 598)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 599) AF_PACKET fanout mode
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 600) =====================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 601)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 602) In the AF_PACKET fanout mode, packet reception can be load balanced among
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 603) processes. This also works in combination with mmap(2) on packet sockets.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 604)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 605) Currently implemented fanout policies are:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 606)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 607) - PACKET_FANOUT_HASH: schedule to socket by skb's packet hash
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 608) - PACKET_FANOUT_LB: schedule to socket by round-robin
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 609) - PACKET_FANOUT_CPU: schedule to socket by CPU packet arrives on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 610) - PACKET_FANOUT_RND: schedule to socket by random selection
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 611) - PACKET_FANOUT_ROLLOVER: if one socket is full, rollover to another
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 612) - PACKET_FANOUT_QM: schedule to socket by skb's recorded queue_mapping
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 613)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 614) Minimal example code by David S. Miller (try things like "./test eth0 hash",
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 615) "./test eth0 lb", etc.)::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 616)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 617) #include <stddef.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 618) #include <stdlib.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 619) #include <stdio.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 620) #include <string.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 621)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 622) #include <sys/types.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 623) #include <sys/wait.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 624) #include <sys/socket.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 625) #include <sys/ioctl.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 626)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 627) #include <unistd.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 628)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 629) #include <linux/if_ether.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 630) #include <linux/if_packet.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 631)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 632) #include <net/if.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 633)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 634) static const char *device_name;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 635) static int fanout_type;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 636) static int fanout_id;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 637)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 638) #ifndef PACKET_FANOUT
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 639) # define PACKET_FANOUT 18
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 640) # define PACKET_FANOUT_HASH 0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 641) # define PACKET_FANOUT_LB 1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 642) #endif
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 643)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 644) static int setup_socket(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 645) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 646) int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 647) struct sockaddr_ll ll;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 648) struct ifreq ifr;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 649) int fanout_arg;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 650)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 651) if (fd < 0) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 652) perror("socket");
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 653) return -1; /* the caller tests for fd < 0 */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 654) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 655)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 656) memset(&ifr, 0, sizeof(ifr));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 657) strncpy(ifr.ifr_name, device_name, IFNAMSIZ - 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 658) err = ioctl(fd, SIOCGIFINDEX, &ifr);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 659) if (err < 0) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 660) perror("SIOCGIFINDEX");
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 661) return -1;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 662) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 663)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 664) memset(&ll, 0, sizeof(ll));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 665) ll.sll_family = AF_PACKET;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 666) ll.sll_ifindex = ifr.ifr_ifindex;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 667) err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 668) if (err < 0) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 669) perror("bind");
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 670) return -1;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 671) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 672)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 673) fanout_arg = (fanout_id | (fanout_type << 16));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 674) err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 675) &fanout_arg, sizeof(fanout_arg));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 676) if (err) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 677) perror("setsockopt");
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 678) return -1;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 679) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 680)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 681) return fd;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 682) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 683)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 684) static void fanout_thread(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 685) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 686) int fd = setup_socket();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 687) int limit = 10000;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 688)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 689) if (fd < 0)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 690) exit(EXIT_FAILURE);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 691)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 692) while (limit-- > 0) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 693) char buf[1600];
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 694) int err;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 695)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 696) err = read(fd, buf, sizeof(buf));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 697) if (err < 0) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 698) perror("read");
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 699) exit(EXIT_FAILURE);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 700) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 701) if ((limit % 10) == 0)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 702) fprintf(stdout, "(%d) \n", getpid());
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 703) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 704)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 705) fprintf(stdout, "%d: Received 10000 packets\n", getpid());
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 706)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 707) close(fd);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 708) exit(0);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 709) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 710)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 711) int main(int argc, char **argp)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 712) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 713) int fd, err;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 714) int i;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 715)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 716) if (argc != 3) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 717) fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 718) return EXIT_FAILURE;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 719) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 720)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 721) if (!strcmp(argp[2], "hash"))
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 722) fanout_type = PACKET_FANOUT_HASH;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 723) else if (!strcmp(argp[2], "lb"))
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 724) fanout_type = PACKET_FANOUT_LB;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 725) else {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 726) fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 727) exit(EXIT_FAILURE);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 728) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 729)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 730) device_name = argp[1];
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 731) fanout_id = getpid() & 0xffff;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 732)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 733) for (i = 0; i < 4; i++) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 734) pid_t pid = fork();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 735)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 736) switch (pid) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 737) case 0:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 738) fanout_thread();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 739)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 740) case -1:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 741) perror("fork");
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 742) exit(EXIT_FAILURE);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 743) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 744) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 745)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 746) for (i = 0; i < 4; i++) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 747) int status;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 748)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 749) wait(&status);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 750) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 751)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 752) return 0;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 753) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 754)
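The example above only accepts "hash" and "lb". As a sketch of how the
PACKET_FANOUT argument is packed for the other listed policies, the helpers
below may be useful. The names ex_fanout_arg() and ex_parse_policy(), as well
as the command-line spellings "cpu", "rollover", "rnd" and "qm", are
hypothetical. The enum values are assumed to match the PACKET_FANOUT_*
constants in linux/if_packet.h, which real code should use directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed to mirror the PACKET_FANOUT_* values in linux/if_packet.h. */
enum ex_fanout_policy {
	EX_FANOUT_HASH		= 0,
	EX_FANOUT_LB		= 1,
	EX_FANOUT_CPU		= 2,
	EX_FANOUT_ROLLOVER	= 3,
	EX_FANOUT_RND		= 4,
	EX_FANOUT_QM		= 5,
};

/* Group id goes in the low 16 bits, the policy in the high 16 bits. */
static uint32_t ex_fanout_arg(uint16_t group_id, uint16_t policy)
{
	return (uint32_t)group_id | ((uint32_t)policy << 16);
}

/* Map a command-line word to a policy, or return -1 if it is unknown. */
static int ex_parse_policy(const char *name)
{
	static const struct {
		const char *name;
		int policy;
	} table[] = {
		{ "hash", EX_FANOUT_HASH },
		{ "lb", EX_FANOUT_LB },
		{ "cpu", EX_FANOUT_CPU },
		{ "rollover", EX_FANOUT_ROLLOVER },
		{ "rnd", EX_FANOUT_RND },
		{ "qm", EX_FANOUT_QM },
	};
	size_t i;

	assert(name != NULL);
	for (i = 0; i < sizeof(table) / sizeof(table[0]); i++)
		if (!strcmp(name, table[i].name))
			return table[i].policy;
	return -1;
}
```

All sockets that should share the load pass the same group id, so the value
produced by ex_fanout_arg() is identical in every member of the group.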
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 755) AF_PACKET TPACKET_V3 example
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 756) ============================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 757)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 758) AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 759) sizes by doing its own memory management. It is based on blocks where polling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 760) works on a per block basis instead of per ring as in TPACKET_V2 and its predecessor.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 761)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 762) It is said that TPACKET_V3 brings the following benefits:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 763)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 764) * ~15% - 20% reduction in CPU-usage
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 765) * ~20% increase in packet capture rate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 766) * ~2x increase in packet density
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 767) * Port aggregation analysis
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 768) * Non static frame size to capture entire packet payload
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 769)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 770) So it seems to be a good candidate to be used with packet fanout.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 771)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 772) Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 773) it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.)::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 774)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 775) /* Written from scratch, but kernel-to-user space API usage
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 776) * dissected from lolpcap:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 777) * Copyright 2011, Chetan Loke <loke.chetan@gmail.com>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 778) * License: GPL, version 2.0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 779) */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 780)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 781) #include <stdio.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 782) #include <stdlib.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 783) #include <stdint.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 784) #include <string.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 785) #include <assert.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 786) #include <net/if.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 787) #include <arpa/inet.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 788) #include <netdb.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 789) #include <poll.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 790) #include <unistd.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 791) #include <signal.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 792) #include <inttypes.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 793) #include <sys/socket.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 794) #include <sys/mman.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 795) #include <linux/if_packet.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 796) #include <linux/if_ether.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 797) #include <linux/ip.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 798)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 799) #ifndef likely
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 800) # define likely(x) __builtin_expect(!!(x), 1)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 801) #endif
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 802) #ifndef unlikely
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 803) # define unlikely(x) __builtin_expect(!!(x), 0)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 804) #endif
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 805)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 806) struct block_desc {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 807) uint32_t version;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 808) uint32_t offset_to_priv;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 809) struct tpacket_hdr_v1 h1;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 810) };
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 811)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 812) struct ring {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 813) struct iovec *rd;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 814) uint8_t *map;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 815) struct tpacket_req3 req;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 816) };
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 817)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 818) static unsigned long packets_total = 0, bytes_total = 0;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 819) static sig_atomic_t sigint = 0;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 820)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 821) static void sighandler(int num)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 822) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 823) sigint = 1;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 824) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 825)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 826) static int setup_socket(struct ring *ring, char *netdev)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 827) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 828) int err, i, fd, v = TPACKET_V3;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 829) struct sockaddr_ll ll;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 830) unsigned int blocksiz = 1 << 22, framesiz = 1 << 11;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 831) unsigned int blocknum = 64;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 832)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 833) fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 834) if (fd < 0) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 835) perror("socket");
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 836) exit(1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 837) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 838)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 839) err = setsockopt(fd, SOL_PACKET, PACKET_VERSION, &v, sizeof(v));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 840) if (err < 0) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 841) perror("setsockopt");
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 842) exit(1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 843) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 844)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 845) memset(&ring->req, 0, sizeof(ring->req));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 846) ring->req.tp_block_size = blocksiz;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 847) ring->req.tp_frame_size = framesiz;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 848) ring->req.tp_block_nr = blocknum;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 849) ring->req.tp_frame_nr = (blocksiz * blocknum) / framesiz;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 850) ring->req.tp_retire_blk_tov = 60;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 851) ring->req.tp_feature_req_word = TP_FT_REQ_FILL_RXHASH;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 852)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 853) err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 854) sizeof(ring->req));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 855) if (err < 0) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 856) perror("setsockopt");
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 857) exit(1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 858) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 859)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 860) ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 861) PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, fd, 0);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 862) if (ring->map == MAP_FAILED) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 863) perror("mmap");
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 864) exit(1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 865) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 866)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 867) ring->rd = malloc(ring->req.tp_block_nr * sizeof(*ring->rd));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 868) assert(ring->rd);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 869) for (i = 0; i < ring->req.tp_block_nr; ++i) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 870) ring->rd[i].iov_base = ring->map + (i * ring->req.tp_block_size);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 871) ring->rd[i].iov_len = ring->req.tp_block_size;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 872) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 873)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 874) memset(&ll, 0, sizeof(ll));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 875) ll.sll_family = PF_PACKET;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 876) ll.sll_protocol = htons(ETH_P_ALL);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 877) ll.sll_ifindex = if_nametoindex(netdev);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 878) ll.sll_hatype = 0;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 879) ll.sll_pkttype = 0;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 880) ll.sll_halen = 0;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 881)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 882) err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 883) if (err < 0) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 884) perror("bind");
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 885) exit(1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 886) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 887)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 888) return fd;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 889) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 890)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 891) static void display(struct tpacket3_hdr *ppd)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 892) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 893) struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 894) struct iphdr *ip = (struct iphdr *) ((uint8_t *) eth + ETH_HLEN);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 895)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 896) if (eth->h_proto == htons(ETH_P_IP)) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 897) struct sockaddr_in ss, sd;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 898) char sbuff[NI_MAXHOST], dbuff[NI_MAXHOST];
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 899)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 900) memset(&ss, 0, sizeof(ss));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 901) ss.sin_family = PF_INET;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 902) ss.sin_addr.s_addr = ip->saddr;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 903) getnameinfo((struct sockaddr *) &ss, sizeof(ss),
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 904) sbuff, sizeof(sbuff), NULL, 0, NI_NUMERICHOST);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 905)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 906) memset(&sd, 0, sizeof(sd));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 907) sd.sin_family = PF_INET;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 908) sd.sin_addr.s_addr = ip->daddr;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 909) getnameinfo((struct sockaddr *) &sd, sizeof(sd),
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 910) dbuff, sizeof(dbuff), NULL, 0, NI_NUMERICHOST);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 911)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 912) printf("%s -> %s, ", sbuff, dbuff);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 913) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 914)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 915) printf("rxhash: 0x%x\n", ppd->hv1.tp_rxhash);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 916) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 917)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 918) static void walk_block(struct block_desc *pbd, const int block_num)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 919) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 920) int num_pkts = pbd->h1.num_pkts, i;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 921) unsigned long bytes = 0;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 922) struct tpacket3_hdr *ppd;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 923)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 924) ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd +
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 925) pbd->h1.offset_to_first_pkt);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 926) for (i = 0; i < num_pkts; ++i) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 927) bytes += ppd->tp_snaplen;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 928) display(ppd);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 929)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 930) ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd +
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 931) ppd->tp_next_offset);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 932) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 933)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 934) packets_total += num_pkts;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 935) bytes_total += bytes;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 936) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 937)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 938) static void flush_block(struct block_desc *pbd)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 939) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 940) pbd->h1.block_status = TP_STATUS_KERNEL;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 941) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 942)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 943) static void teardown_socket(struct ring *ring, int fd)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 944) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 945) munmap(ring->map, ring->req.tp_block_size * ring->req.tp_block_nr);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 946) free(ring->rd);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 947) close(fd);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 948) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 949)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 950) int main(int argc, char **argp)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 951) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 952) int fd, err;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 953) socklen_t len;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 954) struct ring ring;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 955) struct pollfd pfd;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 956) unsigned int block_num = 0, blocks = 64;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 957) struct block_desc *pbd;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 958) struct tpacket_stats_v3 stats;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 959)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 960) if (argc != 2) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 961) fprintf(stderr, "Usage: %s INTERFACE\n", argp[0]);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 962) return EXIT_FAILURE;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 963) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 964)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 965) signal(SIGINT, sighandler);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 966)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 967) memset(&ring, 0, sizeof(ring));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 968) fd = setup_socket(&ring, argp[argc - 1]);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 969) assert(fd > 0);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 970)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 971) memset(&pfd, 0, sizeof(pfd));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 972) pfd.fd = fd;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 973) pfd.events = POLLIN | POLLERR;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 974) pfd.revents = 0;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 975)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 976) while (likely(!sigint)) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 977) pbd = (struct block_desc *) ring.rd[block_num].iov_base;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 978)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 979) if ((pbd->h1.block_status & TP_STATUS_USER) == 0) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 980) poll(&pfd, 1, -1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 981) continue;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 982) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 983)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 984) walk_block(pbd, block_num);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 985) flush_block(pbd);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 986) block_num = (block_num + 1) % blocks;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 987) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 988)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 989) len = sizeof(stats);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 990) err = getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &stats, &len);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 991) if (err < 0) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 992) perror("getsockopt");
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 993) exit(1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 994) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 995)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 996) fflush(stdout);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 997) printf("\nReceived %u packets, %lu bytes, %u dropped, freeze_q_cnt: %u\n",
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 998) stats.tp_packets, bytes_total, stats.tp_drops,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 999) stats.tp_freeze_q_cnt);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1000)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1001) teardown_socket(&ring, fd);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1002) return 0;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1003) }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1004)
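The ring geometry requested in setup_socket() above can be double-checked with
a little arithmetic: 4 MiB blocks (1 << 22) divided into 2 KiB frames (1 << 11)
give 2048 frames per block, so 64 blocks hold 131072 frames in a 256 MiB
mapping. A minimal sketch of that calculation follows; ring_fill_geometry() is
a hypothetical helper, not part of the example program or of any kernel API:

```c
#include <assert.h>
#include <stddef.h>
#include <linux/if_packet.h>

/*
 * Hypothetical helper: fill in the geometry fields of a tpacket_req3 and
 * return the length of the area to pass to mmap() and munmap().
 * Assumes that block_size is a multiple of frame_size.
 */
static size_t ring_fill_geometry(struct tpacket_req3 *req,
				 unsigned int block_size,
				 unsigned int block_nr,
				 unsigned int frame_size)
{
	assert(req != NULL);
	assert(frame_size != 0 && block_size % frame_size == 0);

	req->tp_block_size = block_size;
	req->tp_block_nr = block_nr;
	req->tp_frame_size = frame_size;
	/* frames per block, times the number of blocks */
	req->tp_frame_nr = (block_size / frame_size) * block_nr;

	/* widen before multiplying so the product is computed in size_t */
	return (size_t) block_size * block_nr;
}
```

Dividing before multiplying keeps tp_frame_nr within unsigned int range, and
on platforms where size_t is wider than unsigned int the returned length does
not wrap for rings larger than the one used above.
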
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1005) PACKET_QDISC_BYPASS
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1006) ===================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1007)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1008) If there is a requirement to load the network with many packets in a similar
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1009) fashion as pktgen does, you might set the following option after socket
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1010) creation::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1011)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1012) int one = 1;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1013) setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1014)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1015) This has the side effect that packets sent through PF_PACKET bypass the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1016) kernel's qdisc layer and are pushed to the driver directly. As a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1017) consequence, packets are not buffered, tc disciplines are ignored, increased
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1018) loss can occur, and such packets are no longer visible to other PF_PACKET
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1019) sockets. You have been warned; generally, this can be useful for stress
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1020) testing various components of a system.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1021)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1022) By default, PACKET_QDISC_BYPASS is disabled and needs to be explicitly enabled
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1023) on PF_PACKET sockets.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1024)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1025) PACKET_TIMESTAMP
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1026) ================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1027)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1028) The PACKET_TIMESTAMP setting determines the source of the timestamp in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1029) the packet meta information for mmap(2)ed RX_RINGs and TX_RINGs. If your
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1030) NIC is capable of timestamping packets in hardware, you can request those
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1031) hardware timestamps to be used. Note: you may need to enable the generation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1032) of hardware timestamps with SIOCSHWTSTAMP (see related information from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1033) Documentation/networking/timestamping.rst).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1034)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1035) PACKET_TIMESTAMP accepts the same integer bit field as SO_TIMESTAMPING::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1036)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1037) int req = SOF_TIMESTAMPING_RAW_HARDWARE;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1038) setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1039)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1040) For the mmap(2)ed ring buffers, such timestamps are stored in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1041) ``tpacket{,2,3}_hdr`` structure's tp_sec and ``tp_{n,u}sec`` members.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1042) To determine what kind of timestamp has been reported, the tp_status field
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1043) is binary or'ed with the following possible bits ...
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1044)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1045) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1046)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1047) TP_STATUS_TS_RAW_HARDWARE
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1048) TP_STATUS_TS_SOFTWARE
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1049)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1050) ... that are equivalent to their ``SOF_TIMESTAMPING_*`` counterparts. For the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1051) RX_RING, if neither is set (i.e. PACKET_TIMESTAMP is not set), then a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1052) software fallback was invoked *within* PF_PACKET's processing code (less
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1053) precise).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1054)
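The tp_status check described above could be sketched as follows. This is a
minimal illustration, not code from the kernel tree: the enum and the helper
name rx_ts_kind() are hypothetical, and a TPACKET_V3 frame header is assumed,
as in the example program earlier in this document:

```c
#include <assert.h>
#include <stddef.h>
#include <linux/if_packet.h>

/* Hypothetical names, not part of the kernel API. */
enum ts_kind {
	TS_KIND_FALLBACK,	/* neither bit set: PF_PACKET software fallback */
	TS_KIND_SOFTWARE,	/* TP_STATUS_TS_SOFTWARE is set */
	TS_KIND_RAW_HARDWARE,	/* TP_STATUS_TS_RAW_HARDWARE is set */
};

/* Classify the timestamp stored in tp_sec/tp_nsec of a received frame. */
static enum ts_kind rx_ts_kind(const struct tpacket3_hdr *ppd)
{
	assert(ppd != NULL);

	if (ppd->tp_status & TP_STATUS_TS_RAW_HARDWARE)
		return TS_KIND_RAW_HARDWARE;
	if (ppd->tp_status & TP_STATUS_TS_SOFTWARE)
		return TS_KIND_SOFTWARE;
	return TS_KIND_FALLBACK;
}
```

In the example program, such a helper could be called on each ppd inside
walk_block() before the frame is displayed.
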
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1055) Getting timestamps for the TX_RING works as follows: i) fill the ring frames,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1056) ii) call sendto(), e.g. in blocking mode, iii) wait until the status of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1057) relevant frames is updated, i.e. they are handed back to the application, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1058) iv) walk through the frames to pick up the individual hw/sw timestamps.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1059)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1060) These bits are combined with TP_STATUS_AVAILABLE using a binary OR only (!) if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1061) transmit timestamping is enabled, so your application must account for them:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1062) first check !(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING)) to see
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1063) whether the frame belongs to the application, and then extract the type of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1064) timestamp from tp_status in a second step.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1065)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1066) If you do not need TX timestamps and leave them disabled, checking for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1067) TP_STATUS_AVAILABLE or TP_STATUS_WRONG_FORMAT is sufficient. If in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1068) TX_RING part only TP_STATUS_AVAILABLE is set, then the tp_sec and tp_{n,u}sec
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1069) members do not contain a valid value. For TX_RINGs, by default no timestamp
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1070) is generated!
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1071)
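The two-step check on TX_RING frames described above might look like the
sketch below. It is illustrative only: the enum and tx_frame_ts() are
hypothetical names, and a TPACKET_V2 frame header is assumed for the TX_RING:

```c
#include <assert.h>
#include <stddef.h>
#include <linux/if_packet.h>

/* Hypothetical result codes, not part of the kernel API. */
enum tx_ts_state {
	TX_TS_BUSY,		/* frame is still owned by the kernel */
	TX_TS_WRONG_FORMAT,	/* kernel rejected the frame */
	TX_TS_NONE,		/* frame is ours, tp_sec/tp_nsec not valid */
	TX_TS_SOFTWARE,		/* frame is ours, software timestamp */
	TX_TS_RAW_HARDWARE,	/* frame is ours, raw hardware timestamp */
};

static enum tx_ts_state tx_frame_ts(const struct tpacket2_hdr *hdr)
{
	assert(hdr != NULL);

	/* Step 1: does the frame belong to the application again? */
	if (hdr->tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING))
		return TX_TS_BUSY;
	if (hdr->tp_status & TP_STATUS_WRONG_FORMAT)
		return TX_TS_WRONG_FORMAT;

	/* Step 2: which kind of timestamp, if any, was reported? */
	if (hdr->tp_status & TP_STATUS_TS_RAW_HARDWARE)
		return TX_TS_RAW_HARDWARE;
	if (hdr->tp_status & TP_STATUS_TS_SOFTWARE)
		return TX_TS_SOFTWARE;
	return TX_TS_NONE;
}
```

Comparing tp_status with TP_STATUS_AVAILABLE for equality would miss frames
that carry a timestamp bit, which is why the ownership test masks out the
send bits instead.
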
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1072) See include/linux/net_tstamp.h and Documentation/networking/timestamping.rst
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1073) for more information on hardware timestamps.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1074)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1075) Miscellaneous bits
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1076) ==================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1077)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1078) - Packet sockets work well together with Linux socket filters, so you might
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1079) also want to have a look at Documentation/networking/filter.rst.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1080)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1081) THANKS
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1082) ======
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1083)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1084) Jesse Brandeburg, for fixing my grammatical/spelling errors