^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) ==================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2) IP over InfiniBand
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) ==================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) The ib_ipoib driver is an implementation of the IP over InfiniBand
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6) protocol as specified by RFCs 4391 and 4392, issued by the IETF ipoib
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) working group. It is a "native" implementation in the sense of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8) setting the interface type to ARPHRD_INFINIBAND and the hardware
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9) address length to 20 (earlier proprietary implementations
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10) masqueraded to the kernel as Ethernet interfaces).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12) Partitions and P_Keys
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) =====================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15) When the IPoIB driver is loaded, it creates one interface for each
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16) port using the P_Key at index 0. To create an interface with a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17) different P_Key, write the desired P_Key into the main interface's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18) /sys/class/net/<intf name>/create_child file. For example::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20)   echo 0x8001 > /sys/class/net/ib0/create_child
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22) This will create an interface named ib0.8001 with P_Key 0x8001. To
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) remove a subinterface, use the "delete_child" file::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25)   echo 0x8001 > /sys/class/net/ib0/delete_child
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27) The P_Key for any interface is given by its "pkey" file, and the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28) main interface for a subinterface is given by its "parent" file.
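
The two files can be read together with a small helper. This is only a sketch: the function name ipoib_show and the SYSFS_NET override are hypothetical conveniences, not part of the driver, and the helper simply skips "parent" when the file is absent.

```shell
# Hypothetical helper: print the P_Key (and parent, if any) of an interface.
# SYSFS_NET defaults to the real sysfs location; override it for testing.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

ipoib_show() {
    intf="$1"
    dir="$SYSFS_NET/$intf"
    if [ ! -r "$dir/pkey" ]; then
        echo "$intf: no readable pkey file" >&2
        return 1
    fi
    printf '%s pkey=%s' "$intf" "$(cat "$dir/pkey")"
    # The "parent" file is described for subinterfaces; skip it if absent.
    if [ -r "$dir/parent" ]; then
        printf ' parent=%s' "$(cat "$dir/parent")"
    fi
    printf '\n'
}
```

For the subinterface created above, this would be called as "ipoib_show ib0.8001".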
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30) Child interfaces can also be created and deleted through IPoIB's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31) rtnl_link_ops (netlink); children created either way behave the same.
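
From user space, the rtnl_link_ops path is typically reached through the iproute2 "ip link" command. The sketch below assumes an iproute2 build with "type ipoib" support; the function names and the IP override variable are hypothetical, and the child name is chosen to mirror what the sysfs method produces.

```shell
# Assumption: iproute2 with "type ipoib" support is installed.
# IP can be overridden (for example with "echo ip") to preview the commands.
IP="${IP:-ip}"

# Create a child of $1 using P_Key $2, named <parent>.<pkey hex digits>.
ipoib_add_child() {
    parent="$1"
    pkey="$2"
    $IP link add link "$parent" name "$parent.${pkey#0x}" type ipoib pkey "$pkey"
}

# Delete a child interface by name.
ipoib_del_child() {
    $IP link delete "$1"
}
```

With these definitions, "ipoib_add_child ib0 0x8001" followed by "ipoib_del_child ib0.8001" corresponds to the create_child/delete_child example above.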
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) Datagram vs Connected modes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34) ===========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) The IPoIB driver supports two modes of operation: datagram and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37) connected. The mode is set and read through an interface's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) /sys/class/net/<intf name>/mode file.
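
Reading and writing the mode file can be wrapped as follows. This is a minimal sketch: the function names and the SYSFS_NET override are hypothetical, and the accepted values are the two mode names given above.

```shell
# SYSFS_NET defaults to the real sysfs location; override it for testing.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Print the current mode of interface $1.
ipoib_get_mode() {
    cat "$SYSFS_NET/$1/mode"
}

# Set interface $1 to mode $2, rejecting anything but the two known modes.
ipoib_set_mode() {
    case "$2" in
        datagram|connected)
            echo "$2" > "$SYSFS_NET/$1/mode"
            ;;
        *)
            echo "mode must be 'datagram' or 'connected'" >&2
            return 1
            ;;
    esac
}
```

For example, "ipoib_set_mode ib0 connected" writes the new mode and "ipoib_get_mode ib0" reads it back.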
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40) In datagram mode, the IB UD (Unreliable Datagram) transport is used
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41) and so the interface MTU is equal to the IB L2 MTU minus the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42) IPoIB encapsulation header (4 bytes). For example, in a typical IB
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43) fabric with a 2K MTU, the IPoIB MTU will be 2048 - 4 = 2044 bytes.
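
The same subtraction applies to the other IB MTU sizes. The sketch below tabulates it; the function name ipoib_ud_mtu is hypothetical, and the list of IB MTU values (256 through 4096 bytes) is supplied here for illustration.

```shell
# Length of the IPoIB encapsulation header, in bytes.
IPOIB_ENCAP_LEN=4

# Datagram-mode interface MTU for a given IB L2 MTU.
ipoib_ud_mtu() {
    echo $(( $1 - IPOIB_ENCAP_LEN ))
}

for ib_mtu in 256 512 1024 2048 4096; do
    echo "IB MTU $ib_mtu -> IPoIB MTU $(ipoib_ud_mtu "$ib_mtu")"
done
```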
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45) In connected mode, the IB RC (Reliable Connected) transport is used.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) Connected mode takes advantage of the connected nature of the IB
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47) transport and allows an MTU up to the maximal IP packet size of 64K,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48) which reduces the number of IP packets needed for handling large UDP
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49) datagrams, TCP segments, etc., and increases the performance for large
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) messages.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52) In connected mode, the interface's UD QP is still used for multicast
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53) and communication with peers that don't support connected mode. In
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) this case, RX emulation of ICMP PMTU packets is used to cause the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) networking stack to use the smaller UD MTU for these neighbours.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57) Stateless offloads
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58) ==================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60) If the IB HW supports IPoIB stateless offloads, IPoIB advertises
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61) TCP/IP checksum and/or Large Send (LSO) offloading capability to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) network stack.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64) Large Receive (LRO) offloading is also implemented and may be turned
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65) on/off using ethtool calls. Currently, LRO is supported only for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66) checksum-offload-capable devices.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68) Stateless offloads are supported only in datagram mode.
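
Offload settings can be inspected and LRO toggled with the ethtool utility's -k and -K options. The wrappers below are a sketch; their names and the ETHTOOL override variable are hypothetical, and whether a given setting can be changed depends on the device and on the interface being in datagram mode.

```shell
# ETHTOOL can be overridden (for example with "echo ethtool") to preview
# the commands without touching a device.
ETHTOOL="${ETHTOOL:-ethtool}"

# Show the offload settings of interface $1.
ipoib_show_offloads() {
    $ETHTOOL -k "$1"
}

# Turn LRO on or off for interface $1.
ipoib_set_lro() {
    case "$2" in
        on|off)
            $ETHTOOL -K "$1" lro "$2"
            ;;
        *)
            echo "usage: ipoib_set_lro <intf> on|off" >&2
            return 1
            ;;
    esac
}
```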
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70) Interrupt moderation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71) ====================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73) If the underlying IB device supports CQ event moderation, one can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74) use ethtool to set interrupt mitigation parameters and thus reduce
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75) the overhead incurred by handling interrupts. The main code path of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76) IPoIB doesn't use events for TX completion signaling, so only RX
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77) moderation is supported.
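
RX moderation is usually expressed through ethtool's coalescing options (-c to show, -C to set). The following is a sketch only: the function names and the ETHTOOL override are hypothetical, and suitable values depend on the device and workload.

```shell
# ETHTOOL can be overridden (for example with "echo ethtool") to preview
# the commands without touching a device.
ETHTOOL="${ETHTOOL:-ethtool}"

# Show the current coalescing parameters of interface $1.
ipoib_show_moderation() {
    $ETHTOOL -c "$1"
}

# Set RX moderation on interface $1: wait up to $2 microseconds or
# $3 received frames before raising an event.
ipoib_set_rx_moderation() {
    $ETHTOOL -C "$1" rx-usecs "$2" rx-frames "$3"
}
```

A call such as "ipoib_set_rx_moderation ib0 16 8" uses illustrative values, not recommended ones.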
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79) Debugging Information
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80) =====================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82) By compiling the IPoIB driver with CONFIG_INFINIBAND_IPOIB_DEBUG set
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83) to 'y', tracing messages are compiled into the driver. They are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84) turned on by setting the module parameters debug_level and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85) mcast_debug_level to 1. These parameters can be controlled at
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86) runtime through files in /sys/module/ib_ipoib/parameters/.
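
Toggling both parameters at runtime might look like the sketch below. It assumes the parameter files live in the module's "parameters" subdirectory, as is usual for module parameters; the function name and the IPOIB_PARAMS override are hypothetical.

```shell
# IPOIB_PARAMS defaults to the module's parameter directory; override it
# for testing.
IPOIB_PARAMS="${IPOIB_PARAMS:-/sys/module/ib_ipoib/parameters}"

# Set debug_level and mcast_debug_level to $1 (1 enables, 0 disables).
ipoib_set_debug() {
    level="$1"
    for param in debug_level mcast_debug_level; do
        echo "$level" > "$IPOIB_PARAMS/$param" || return 1
    done
}
```

Calling "ipoib_set_debug 1" enables the tracing messages and "ipoib_set_debug 0" turns them off again.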
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88) CONFIG_INFINIBAND_IPOIB_DEBUG also enables files in the debugfs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89) virtual filesystem. By mounting this filesystem, for example with::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91)   mount -t debugfs none /sys/kernel/debug
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93) it is possible to get statistics about multicast groups from the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94) files /sys/kernel/debug/ipoib/ib0_mcg and so on.
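
To look at every interface's multicast statistics at once, the per-interface files can be dumped in a loop. This is a sketch; the function name and the IPOIB_DEBUGFS override are hypothetical, and the file contents are printed as-is without interpretation.

```shell
# IPOIB_DEBUGFS defaults to the directory named above; override it for
# testing.
IPOIB_DEBUGFS="${IPOIB_DEBUGFS:-/sys/kernel/debug/ipoib}"

# Print each <intf>_mcg file, preceded by a header line with its name.
ipoib_dump_mcg() {
    for f in "$IPOIB_DEBUGFS"/*_mcg; do
        # Skip the unexpanded pattern when no file matches.
        [ -e "$f" ] || continue
        echo "== ${f##*/}"
        cat "$f"
    done
}
```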
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96) The performance impact of this option is negligible, so it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97) is safe to enable this option with debug_level set to 0 for normal
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98) operation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) CONFIG_INFINIBAND_IPOIB_DEBUG_DATA enables even more debug output in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) the data path when data_debug_level is set to 1. However, even with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) the output disabled, enabling this configuration option will affect
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) performance, because it adds tests to the fast path.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) References
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) Transmission of IP over InfiniBand (IPoIB) (RFC 4391)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) http://ietf.org/rfc/rfc4391.txt
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) IP over InfiniBand (IPoIB) Architecture (RFC 4392)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) http://ietf.org/rfc/rfc4392.txt
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) IP over InfiniBand: Connected Mode (RFC 4755)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) http://ietf.org/rfc/rfc4755.txt