^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) .. SPDX-License-Identifier: GPL-2.0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) ===
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4) RDS
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) ===
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) Overview
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8) ========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10) This readme tries to provide some background on the hows and whys of RDS,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11) and will hopefully help you find your way around the code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) In addition, please see this email about RDS origins:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14) http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16) RDS Architecture
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17) ================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) RDS provides reliable, ordered datagram delivery by using a single
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20) reliable connection between any two nodes in the cluster. This allows
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21) applications to use a single socket to talk to any other process in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22) cluster - so in a cluster with N processes you need N sockets, in contrast
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) to N*N if you use a connection-oriented socket transport like TCP.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25) RDS is not Infiniband-specific; it was designed to support different
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26) transports. The current implementation supports RDS over TCP as well
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27) as IB.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29) The high-level semantics of RDS from the application's point of view are:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31) * Addressing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) RDS uses IPv4 addresses and 16-bit port numbers to identify
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34) the end point of a connection. All socket operations that involve
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35) passing addresses between kernel and user space generally
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) use a struct sockaddr_in.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) The fact that IPv4 addresses are used does not mean the underlying
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39) transport has to be IP-based. In fact, RDS over IB uses a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40) reliable IB connection; the IP address is used exclusively to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41) locate the remote node's GID (by ARPing for the given IP).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43) The port space is entirely independent of UDP, TCP or any other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) protocol.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) * Socket interface
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48) RDS sockets work *mostly* as you would expect from a BSD
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49) socket. The next section will cover the details. At any rate,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) all I/O is performed through the standard BSD socket API.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51) Some additions like zerocopy support are implemented through
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52) control messages, while other extensions use the getsockopt/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53) setsockopt calls.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) Sockets must be bound before you can send or receive data.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56) This is needed because binding also selects a transport and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57) attaches it to the socket. Once bound, the transport assignment
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58) does not change. RDS will tolerate IPs moving around (e.g. in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59) an active-active HA scenario), but only as long as the address
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60) doesn't move to a different transport.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) * sysctls
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64) RDS supports a number of sysctls in /proc/sys/net/rds.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67) Socket Interface
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68) ================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70) AF_RDS, PF_RDS, SOL_RDS
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71) AF_RDS and PF_RDS are the domain type to be used with socket(2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72) to create RDS sockets. SOL_RDS is the socket-level to be used
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73) with setsockopt(2) and getsockopt(2) for RDS specific socket
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74) options.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76) fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77) This creates a new, unbound RDS socket.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79) setsockopt(SOL_SOCKET): send and receive buffer size
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80) RDS honors the send and receive buffer size socket options.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81) You are not allowed to queue more than SO_SNDBUF bytes to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82) a socket. A message is queued when sendmsg is called, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83) it leaves the queue when the remote system acknowledges
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84) its arrival.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86) The SO_RCVBUF option controls the maximum receive queue length.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87) This is a soft limit rather than a hard limit - RDS will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88) continue to accept and queue incoming messages, even if that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89) takes the queue length over the limit. However, it will also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90) mark the port as "congested" and send a congestion update to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91) the source node. The source node is supposed to throttle any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92) processes sending to this congested port.
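
A minimal sketch of requesting both limits, assuming the standard SO_SNDBUF and SO_RCVBUF options at the SOL_SOCKET level. The helper name set_queue_limits is made up for illustration. The kernel may adjust the requested values, so read them back with getsockopt(2) rather than assuming they were applied verbatim.

```c
#include <sys/socket.h>

/*
 * Hypothetical helper: request send and receive queue limits on a socket.
 * Returns 0 on success, -1 with errno set by setsockopt(2) on failure.
 */
int set_queue_limits(int fd, int snd_bytes, int rcv_bytes)
{
	if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &snd_bytes, sizeof(snd_bytes)) < 0)
		return -1;
	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcv_bytes, sizeof(rcv_bytes)) < 0)
		return -1;
	return 0;
}
```

Because the receive limit is soft, a generous receive buffer mainly delays the point at which the port is reported as congested.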
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94) bind(fd, &sockaddr_in, ...)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95) This binds the socket to a local IP address and port, and a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96) transport, if one has not already been selected via the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97) SO_RDS_TRANSPORT socket option.
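
The create-and-bind sequence might look like the sketch below. The helper names rds_fill_addr and rds_open_bound are hypothetical, and the fallback value for AF_RDS is an assumption used only when the system headers do not define it. On a machine without the rds module, socket(2) is expected to fail and the helper returns -1.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef AF_RDS
#define AF_RDS 21	/* assumed value; normally supplied by <sys/socket.h> */
#endif

/* Fill a sockaddr_in for an RDS endpoint. Returns 0, or -1 on a bad address. */
int rds_fill_addr(struct sockaddr_in *sin, const char *ip, uint16_t port)
{
	memset(sin, 0, sizeof(*sin));
	sin->sin_family = AF_INET;	/* endpoints are named by IPv4 address */
	sin->sin_port = htons(port);	/* RDS port space, separate from TCP/UDP */
	return inet_pton(AF_INET, ip, &sin->sin_addr) == 1 ? 0 : -1;
}

/* Create an RDS socket and bind it. Returns the fd, or -1 with errno set. */
int rds_open_bound(const char *ip, uint16_t port)
{
	struct sockaddr_in sin;
	int fd, saved;

	if (rds_fill_addr(&sin, ip, port) < 0) {
		errno = EINVAL;
		return -1;
	}
	fd = socket(AF_RDS, SOCK_SEQPACKET, 0);
	if (fd < 0)
		return -1;		/* e.g. RDS is not available on this system */
	if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
		saved = errno;
		close(fd);
		errno = saved;		/* report the bind error, not close's */
		return -1;
	}
	return fd;
}
```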
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99) sendmsg(fd, ...)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) Sends a message to the indicated recipient. The kernel will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) transparently establish the underlying reliable connection
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) if it isn't up yet.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) An attempt to send a message that exceeds SO_SNDBUF will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) return EMSGSIZE.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) An attempt to send a message that would take the total number
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) of queued bytes over the SO_SNDBUF threshold will return
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) EAGAIN.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) An attempt to send a message to a destination that is marked
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) as "congested" will return ENOBUFS.
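
The three failure cases can be summarised as a toy model. This is not kernel code: struct sendq_model and sendq_check are invented for illustration, and the order in which the checks are made is an assumption.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one socket's send queue; not a kernel structure. */
struct sendq_model {
	size_t sndbuf;	/* send buffer limit set by the application */
	size_t queued;	/* bytes queued and not yet acknowledged */
};

/*
 * Returns 0 if a message of len bytes would be accepted, otherwise the
 * errno value named in the text. The order of the checks is assumed.
 */
int sendq_check(const struct sendq_model *q, size_t len, bool dest_congested)
{
	if (len > q->sndbuf)
		return EMSGSIZE;	/* can never fit, even in an empty queue */
	if (dest_congested)
		return ENOBUFS;		/* receiver asked senders to back off */
	if (q->queued + len > q->sndbuf)
		return EAGAIN;		/* fits later, once acks drain the queue */
	return 0;
}
```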
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) recvmsg(fd, ...)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) Receives a message that was queued to this socket. The socket's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) recv queue accounting is adjusted, and if the queue length
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) drops below SO_RCVBUF, the port is marked uncongested, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) a congestion update is sent to all peers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) Applications can ask the RDS kernel module to receive
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) notifications via control messages (for instance, there is a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) notification when a congestion update arrives, or when an RDMA
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) operation completes). These notifications are received through
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) the msg.msg_control buffer of struct msghdr. The format of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) messages is described in the manpages.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) poll(fd)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) RDS supports the poll interface to allow the application
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) to implement async I/O.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) POLLIN handling is pretty straightforward. When there's an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) incoming message queued to the socket, or a pending notification,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) we signal POLLIN.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) POLLOUT is a little harder. Since you can essentially send
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) to any destination, RDS will always signal POLLOUT as long as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) there's room on the send queue (i.e. the number of bytes queued
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) is less than the sendbuf size).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) However, the kernel will refuse to accept messages to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) a destination marked congested - in this case you will loop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) forever if you rely on poll to tell you what to do.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) This isn't a trivial problem, but applications can deal with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) this - by using congestion notifications, and by checking for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) ENOBUFS errors returned by sendmsg.
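
One way an application might structure its reaction to sendmsg results is sketched below. The enum and function names are hypothetical. The point is only that ENOBUFS needs a different wait condition than EAGAIN, since POLLOUT keeps firing while the destination is congested.

```c
#include <errno.h>

enum send_next_step {
	SEND_DONE,		/* message was queued */
	SEND_WAIT_POLLOUT,	/* send queue full: poll for POLLOUT, then retry */
	SEND_WAIT_CONG_UPDATE,	/* destination congested: wait for a congestion
				 * notification instead of POLLOUT */
	SEND_GIVE_UP		/* message too large, or an unexpected error */
};

/* Map the result of sendmsg(2) (return value and errno) to the next step. */
enum send_next_step next_step_after_send(long ret, int err)
{
	if (ret >= 0)
		return SEND_DONE;
	switch (err) {
	case EAGAIN:
		return SEND_WAIT_POLLOUT;
	case ENOBUFS:
		return SEND_WAIT_CONG_UPDATE;
	default:
		return SEND_GIVE_UP;
	}
}
```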
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) This allows the application to discard all messages queued to a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) specific destination on this particular socket.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151) This allows the application to cancel outstanding messages if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152) it detects a timeout. For instance, if it tried to send a message,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153) and the remote host is unreachable, RDS will keep trying forever.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) The application may decide it's not worth it, and cancel the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) operation. In this case, it would use RDS_CANCEL_SENT_TO to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) nuke any pending messages.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158) ``setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..), getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159) Set or read an integer defining the underlying
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) encapsulating transport to be used for RDS packets on the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) socket. When setting the option, the integer argument may be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162) one of RDS_TRANS_TCP or RDS_TRANS_IB. When retrieving the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) value, RDS_TRANS_NONE will be returned on an unbound socket.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164) This socket option may only be set exactly once on the socket,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) prior to binding it via the bind(2) system call. Attempts to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) set SO_RDS_TRANSPORT on a socket for which the transport has
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) been previously attached explicitly (by SO_RDS_TRANSPORT) or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) implicitly (via bind(2)) will return an error of EOPNOTSUPP.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) An attempt to set SO_RDS_TRANSPORT to RDS_TRANS_NONE will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) always return EINVAL.
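
A hedged sketch of selecting the transport before bind(2). The constants normally come from <linux/rds.h>; the fallback values below are assumptions included only so the sketch compiles standalone, and should be checked against that header. The wrapper name rds_select_transport is hypothetical.

```c
#include <errno.h>
#include <sys/socket.h>

/* Fallbacks for builds without <linux/rds.h>; values are assumptions. */
#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#ifndef SO_RDS_TRANSPORT
#define SO_RDS_TRANSPORT 8
#endif
#ifndef RDS_TRANS_IB
#define RDS_TRANS_IB 0
#endif
#ifndef RDS_TRANS_TCP
#define RDS_TRANS_TCP 2
#endif
#ifndef RDS_TRANS_NONE
#define RDS_TRANS_NONE (~0)
#endif

/*
 * Ask for a specific transport on an unbound RDS socket.
 * Returns 0 on success, -1 with errno set on failure.
 */
int rds_select_transport(int fd, int transport)
{
	if (transport == RDS_TRANS_NONE) {
		errno = EINVAL;		/* mirrors the documented kernel behaviour */
		return -1;
	}
	return setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT,
			  &transport, sizeof(transport));
}
```

Since the option can be set only once, a caller would typically invoke this right after socket(2) and treat EOPNOTSUPP as "a transport is already attached".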
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) RDMA for RDS
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) ============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) see rds-rdma(7) manpage (available in rds-tools)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178) Congestion Notifications
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179) ========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181) see rds(7) manpage
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184) RDS Protocol
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185) ============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187) Message header
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189) The message header is a 'struct rds_header' (see rds.h):
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191) Fields:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193) h_sequence:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194) per-packet sequence number
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195) h_ack:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196) piggybacked acknowledgment of last packet received
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197) h_len:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198) length of data, not including header
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199) h_sport:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200) source port
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201) h_dport:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202) destination port
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203) h_flags:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204) Can be:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 206) ============= ==================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 207) CONG_BITMAP this is a congestion update bitmap
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 208) ACK_REQUIRED receiver must ack this packet
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 209) RETRANSMITTED packet has previously been sent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 210) ============= ==================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 211)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 212) h_credit:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 213) indicate to other end of connection that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 214) it has more credits available (i.e. there is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 215) more send room)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 216) h_padding[4]:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 217) unused, for future use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 218) h_csum:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 219) header checksum
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 220) h_exthdr:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 221) optional data can be passed here. This is currently used for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 222) passing RDMA-related information.
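
For orientation, the field list can be pictured as the user-space struct below. This is a sketch, not a copy of rds.h: the field widths, the size of h_exthdr, the flag bit values and the big-endian wire order are all assumptions to be verified against the kernel header.

```c
#include <stddef.h>
#include <stdint.h>

#define RDS_HEADER_EXT_SPACE 16		/* assumed size of h_exthdr */

/* Flag names come from the table above; the bit values are assumed. */
#define RDS_FLAG_CONG_BITMAP	0x01
#define RDS_FLAG_ACK_REQUIRED	0x02
#define RDS_FLAG_RETRANSMITTED	0x04

/* Multi-byte fields are assumed to be big-endian on the wire. */
struct rds_header_sketch {
	uint64_t h_sequence;	/* per-packet sequence number */
	uint64_t h_ack;		/* piggybacked ack of last packet received */
	uint32_t h_len;		/* payload length, header excluded */
	uint16_t h_sport;	/* source port */
	uint16_t h_dport;	/* destination port */
	uint8_t  h_flags;	/* RDS_FLAG_* bits */
	uint8_t  h_credit;	/* new send credits granted to the peer */
	uint8_t  h_padding[4];	/* unused */
	uint16_t h_csum;	/* header checksum */
	uint8_t  h_exthdr[RDS_HEADER_EXT_SPACE];	/* optional extension data */
};

/* Nonzero if the receiver has to acknowledge this packet. */
int rds_hdr_needs_ack(const struct rds_header_sketch *h)
{
	return (h->h_flags & RDS_FLAG_ACK_REQUIRED) != 0;
}
```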
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 223)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 224) ACK and retransmit handling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 225)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 226) One might think that with reliable IB connections you wouldn't need
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 227) to ack messages that have been received. The problem is that IB
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 228) hardware generates an ack message before it has DMAed the message
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 229) into memory. This creates a potential message loss if the HCA is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 230) disabled for any reason between the time it sends the ack and the time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 231) the message is DMAed and processed. This is only a potential issue
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 232) if another HCA is available for fail-over.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 233)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 234) Sending an ack immediately would allow the sender to free the sent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 235) message from their send queue quickly, but could cause excessive
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 236) traffic to be used for acks. RDS piggybacks acks on sent data
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 237) packets. Ack-only packets are reduced by only allowing one to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 238) in flight at a time, and by the sender only asking for acks when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 239) its send buffers start to fill up. All retransmissions are also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 240) acked.
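
The sender-side effect of a piggybacked ack can be illustrated with a small model. It assumes h_ack is cumulative, i.e. it covers every sequence number up to and including its value; the function name and the array-based queue are inventions of this sketch, not the kernel's data structures.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Drop every entry of pending[] (sequence numbers in ascending order) that
 * is covered by ack. Survivors are moved to the front of the array and the
 * new number of pending entries is returned.
 */
size_t sendq_apply_ack(uint64_t *pending, size_t n, uint64_t ack)
{
	size_t first_unacked = 0;
	size_t i;

	while (first_unacked < n && pending[first_unacked] <= ack)
		first_unacked++;
	for (i = first_unacked; i < n; i++)
		pending[i - first_unacked] = pending[i];
	return n - first_unacked;
}
```

Whatever remains in the queue after this step is what would have to be retransmitted if the connection were dropped and re-established.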
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 241)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 242) Flow Control
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 243)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 244) RDS's IB transport uses a credit-based mechanism to verify that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 245) there is space in the peer's receive buffers for more data. This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 246) eliminates the need for hardware retries on the connection.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 247)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 248) Congestion
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 249)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 250) Messages waiting in the receive queue on the receiving socket
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 251) are accounted against the socket's SO_RCVBUF option value. Only
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 252) the payload bytes in the message are accounted for. If the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 253) number of bytes queued equals or exceeds rcvbuf then the socket
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 254) is congested. All sends attempted to this socket's address
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 255) should block or return -EWOULDBLOCK.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 256)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 257) Applications are expected to be reasonably tuned such that this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 258) situation very rarely occurs. An application encountering this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 259) "back-pressure" is considered a bug.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 260)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 261) This is implemented by having each node maintain bitmaps which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 262) indicate which ports on bound addresses are congested. As the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 263) bitmap changes it is sent through all the connections which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 264) terminate in the local address of the bitmap which changed.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 265)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 266) The bitmaps are allocated as connections are brought up. This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 267) avoids allocation in the interrupt handling path which queues
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 268) messages on sockets. The dense bitmaps let transports send the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 269) entire bitmap on any bitmap change reasonably efficiently. This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 270) is much easier to implement than some finer-grained
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 271) communication of per-port congestion. The sender does a very
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 272) inexpensive bit test to test if the port it's about to send to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 273) is congested or not.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 274)
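The per-port bit test described above can be sketched with a flat bitmap, one bit per 16-bit port. This is a simplified stand-in, not the kernel's struct rds_cong_map: the type and function names are hypothetical and the bit ordering is an arbitrary choice.

```c
#include <stdint.h>

#define CONG_PORTS 65536u	/* one bit per possible 16-bit port */

/* Toy congestion map for one bound address. */
struct cong_map_sketch {
	uint8_t bits[CONG_PORTS / 8];
};

/* Receiver side: mark a port congested. */
void cong_set(struct cong_map_sketch *m, uint16_t port)
{
	m->bits[port / 8] |= (uint8_t)(1u << (port % 8));
}

/* Receiver side: mark a port uncongested again. */
void cong_clear(struct cong_map_sketch *m, uint16_t port)
{
	m->bits[port / 8] &= (uint8_t)~(1u << (port % 8));
}

/* Sender side: the cheap test made before queueing to a port. */
int cong_test(const struct cong_map_sketch *m, uint16_t port)
{
	return (m->bits[port / 8] >> (port % 8)) & 1;
}
```

In this sketch the whole map is 8 KiB, which gives a feel for why resending it in full on each change is tolerable.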
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 275)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 276) RDS Transport Layer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 277) ===================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 278)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 279) As mentioned above, RDS is not IB-specific. Its code is divided
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 280) into a general RDS layer and a transport layer.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 281)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 282) The general layer handles the socket API, congestion handling,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 283) loopback, stats, usermem pinning, and the connection state machine.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 284)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 285) The transport layer handles the details of the transport. The IB
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 286) transport, for example, handles all the queue pairs, work requests,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 287) CM event handlers, and other Infiniband details.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 288)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 289)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 290) RDS Kernel Structures
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 291) =====================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 292)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 293) struct rds_message
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 294) aka possibly "rds_outgoing", the generic RDS layer copies data to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 295) be sent and sets header fields as needed, based on the socket API.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 296) This is then queued for the individual connection and sent by the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 297) connection's transport.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 298)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 299) struct rds_incoming
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 300) a generic struct referring to incoming data that can be handed from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 301) the transport to the general code and queued by the general code
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 302) while the socket is awoken. It is then passed back to the transport
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 303) code to handle the actual copy-to-user.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 304)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 305) struct rds_socket
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 306) per-socket information
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 307)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 308) struct rds_connection
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 309) per-connection information
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 310)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 311) struct rds_transport
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 312) pointers to transport-specific functions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 313)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 314) struct rds_statistics
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 315) non-transport-specific statistics
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 316)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 317) struct rds_cong_map
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 318) wraps the raw congestion bitmap, contains rbnode, waitq, etc.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 319)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 320) Connection management
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 321) =====================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 322)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 323) Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 324) ERROR states.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 325)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 326) The first time an attempt is made by an RDS socket to send data to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 327) a node, a connection is allocated and connected. That connection is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 328) then maintained forever -- if there are transport errors, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 329) connection will be dropped and re-established.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 330)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 331) Dropping a connection while packets are queued will cause queued or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 332) partially-sent datagrams to be retransmitted when the connection is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 333) re-established.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 334)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 335)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 336) The send path
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 337) =============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 338)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 339) rds_sendmsg()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 340) - struct rds_message built from incoming data
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 341) - CMSGs parsed (e.g. RDMA ops)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 342) - transport connection alloced and connected if not already
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 343) - rds_message placed on send queue
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 344) - send worker awoken
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 345)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 346) rds_send_worker()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 347) - calls rds_send_xmit() until queue is empty
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 348)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 349) rds_send_xmit()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 350) - transmits congestion map if one is pending
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 351) - may set ACK_REQUIRED
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 352) - calls transport to send either non-RDMA or RDMA message
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 353) (RDMA ops never retransmitted)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 354)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 355) rds_ib_xmit()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 356) - allocs work requests from send ring
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 357) - adds any new send credits available to peer (h_credit)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 358) - maps the rds_message's sg list
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 359) - piggybacks ack
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 360) - populates work requests
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 361) - post send to connection's queue pair
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 362)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 363) The recv path
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 364) =============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 365)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 366) rds_ib_recv_cq_comp_handler()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 367) - looks at write completions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 368) - unmaps recv buffer from device
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 369) - no errors, call rds_ib_process_recv()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 370) - refill recv ring
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 371)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 372) rds_ib_process_recv()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 373) - validate header checksum
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 374) - copy header to rds_ib_incoming struct if start of a new datagram
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 375) - add to ibinc's fraglist
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 376) - if completed datagram:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 377) - update cong map if datagram was cong update
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 378) - call rds_recv_incoming() otherwise
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 379) - note if ack is required
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 380)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 381) rds_recv_incoming()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 382) - drop duplicate packets
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 383) - respond to pings
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 384) - find the sock associated with this datagram
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 385) - add to sock queue
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 386) - wake up sock
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 387) - do some congestion calculations
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 388) rds_recvmsg()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 389) - copy data into user iovec
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 390) - handle CMSGs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 391) - return to application
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 392)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 393) Multipath RDS (mprds)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 394) =====================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 395) Mprds is multipathed-RDS, primarily intended for RDS-over-TCP
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 396) (though the concept can be extended to other transports). The classical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 397) implementation of RDS-over-TCP works by multiplexing multiple
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 398) PF_RDS sockets between any 2 endpoints (where endpoint == [IP address,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 399) port]) over a single TCP socket between the 2 IP addresses involved. This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 400) has the limitation that it ends up funneling multiple RDS flows over a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 401) single TCP flow, thus it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 402) (a) is upper-bounded to the single-flow bandwidth, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 403) (b) suffers from head-of-line blocking for all the RDS sockets.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 404)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 405) Better throughput (for a fixed small packet size, MTU) can be achieved
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 406) by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 407) RDS (mprds). Each such TCP/IP flow constitutes a path for the rds/tcp
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 408) connection. RDS sockets will be attached to a path based on some hash
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 409) (e.g., of local address and RDS port number) and packets for that RDS
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 410) socket will be sent over the attached path using TCP to segment/reassemble
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 411) RDS datagrams on that path.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 412)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 413) Multipathed RDS is implemented by splitting the struct rds_connection into
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 414) a common (to all paths) part, and a per-path struct rds_conn_path. All
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 415) I/O workqs and reconnect threads are driven from the rds_conn_path.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 416) Transports such as TCP that are multipath capable may then set up a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 417) TCP socket per rds_conn_path, and this is managed by the transport via
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 418) the transport private cp_transport_data pointer.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 419)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 420) Transports announce themselves as multipath capable by setting the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 421) t_mp_capable bit during registration with the rds core module. When the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 422) transport is multipath-capable, rds_sendmsg() hashes outgoing traffic
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 423) across multiple paths. The outgoing hash is computed based on the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 424) local address and port that the PF_RDS socket is bound to.
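
The path selection described above can be sketched in userspace C. This is
only an illustration: pick_path() and mix_bytes() are hypothetical names,
and FNV-1a is used as a stand-in hash; the hash actually used by
rds_sendmsg() is not specified here and may differ.

```c
#include <stdint.h>

/* Hypothetical helper, not a kernel function: fold the low nbytes
 * bytes of v into the running hash h. FNV-1a is only a stand-in. */
static uint32_t mix_bytes(uint32_t h, uint32_t v, int nbytes)
{
    for (int i = 0; i < nbytes; i++) {
        h ^= (v >> (8 * i)) & 0xffu;
        h *= 16777619u;            /* FNV-1a 32-bit prime */
    }
    return h;
}

/* Hypothetical helper: map the (address, port) pair a PF_RDS socket
 * is bound to onto a path index in [0, npaths). */
static unsigned int pick_path(uint32_t bound_addr, uint16_t bound_port,
                              unsigned int npaths)
{
    uint32_t h = 2166136261u;      /* FNV-1a 32-bit offset basis */

    if (npaths <= 1)
        return 0;                  /* single path: nothing to choose */
    h = mix_bytes(h, bound_addr, 4);
    h = mix_bytes(h, bound_port, 2);
    return h % npaths;
}
```

Because the inputs are fixed once the socket is bound, every datagram from
that socket lands on the same path, which preserves per-socket ordering.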
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 425)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 426) Additionally, even if the transport is MP capable, we may be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 427) peering with some node that does not support mprds, or supports
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 428) a different number of paths. As a result, the peering nodes need
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 429) to agree on the number of paths to be used for the connection.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 430) This is done by sending out a control packet exchange before the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 431) first data packet. The control packet exchange must have completed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 432) before the outgoing hash is computed in rds_sendmsg() when the transport
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 433) is multipath capable.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 434)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 435) The control packet is an RDS ping packet (i.e., a packet to rds dest
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 436) port 0) that carries an rds extension header option of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 437) type RDS_EXTHDR_NPATHS, length 2 bytes, whose value is the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 438) number of paths supported by the sender. The "probe" ping packet is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 439) sent from a reserved port, RDS_FLAG_PROBE_PORT (in <linux/rds.h>).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 440) The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 441) be able to compute the min(sender_paths, rcvr_paths). The pong
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 442) sent in response to a probe-ping should contain the rcvr's npaths
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 443) when the rcvr is mprds-capable.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 444)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 445) If the rcvr is not mprds-capable, the exthdr in the ping will be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 446) ignored. In this case the pong will not have any exthdrs, so the sender
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 447) of the probe-ping can default to single-path mprds.
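
The negotiation rule above (take the minimum of both sides, fall back to a
single path when the peer sends no exthdr) can be sketched as follows.
negotiate_npaths() is a hypothetical name, not a kernel function; the
sketch assumes the 2-byte value is carried in network byte order and
clamps the result to at least 1 as a defensive choice.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not kernel code. exthdr_val points at the 2-byte
 * RDS_EXTHDR_NPATHS value from the peer's ping or pong, or is NULL when
 * no such extension header was present. Network byte order is assumed. */
static unsigned int negotiate_npaths(unsigned int local_npaths,
                                     const uint8_t *exthdr_val)
{
    unsigned int peer_npaths;

    if (exthdr_val == NULL)
        return 1;                  /* peer is not mprds-capable */

    peer_npaths = ((unsigned int)exthdr_val[0] << 8) | exthdr_val[1];

    /* Agree on min(sender_paths, rcvr_paths), but never fewer than 1. */
    if (peer_npaths > local_npaths)
        peer_npaths = local_npaths;
    return peer_npaths ? peer_npaths : 1;
}
```

Both sides apply the same rule to the value they receive, so they arrive
at the same path count without a further round trip.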
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 448)