Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   1) .. SPDX-License-Identifier: GPL-2.0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   2) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   3) ===
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   4) RDS
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   5) ===
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   6) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   7) Overview
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   8) ========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   9) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  10) This readme tries to provide some background on the hows and whys of RDS,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  11) and will hopefully help you find your way around the code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  12) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  13) In addition, please see this email about RDS origins:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  14) http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  15) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  16) RDS Architecture
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  17) ================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  18) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  19) RDS provides reliable, ordered datagram delivery by using a single
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  20) reliable connection between any two nodes in the cluster. This allows
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  21) applications to use a single socket to talk to any other process in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  22) cluster - so in a cluster with N processes you need N sockets, in contrast
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  23) to N*N if you use a connection-oriented socket transport like TCP.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  24) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  25) RDS is not Infiniband-specific; it was designed to support different
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  26) transports.  The current implementation supports RDS over TCP as well
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  27) as IB.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  28) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  29) The high-level semantics of RDS from the application's point of view are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  30) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  31)  *	Addressing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  32) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  33) 	RDS uses IPv4 addresses and 16-bit port numbers to identify
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  34) 	the end point of a connection. All socket operations that involve
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  35) 	passing addresses between kernel and user space generally
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  36) 	use a struct sockaddr_in.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  37) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  38) 	The fact that IPv4 addresses are used does not mean the underlying
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  39) 	transport has to be IP-based. In fact, RDS over IB uses a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  40) 	reliable IB connection; the IP address is used exclusively to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  41) 	locate the remote node's GID (by ARPing for the given IP).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  42) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  43) 	The port space is entirely independent of UDP, TCP or any other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  44) 	protocol.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  45) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  46)  *	Socket interface
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  47) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  48) 	RDS sockets work *mostly* as you would expect from a BSD
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  49) 	socket. The next section will cover the details. At any rate,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  50) 	all I/O is performed through the standard BSD socket API.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  51) 	Some additions like zerocopy support are implemented through
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  52) 	control messages, while other extensions use the getsockopt/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  53) 	setsockopt calls.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  54) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  55) 	Sockets must be bound before you can send or receive data.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  56) 	This is needed because binding also selects a transport and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  57) 	attaches it to the socket. Once bound, the transport assignment
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  58) 	does not change. RDS will tolerate IPs moving around (e.g. in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  59) 	an active-active HA scenario), but only as long as the address
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  60) 	doesn't move to a different transport.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  61) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  62)  *	sysctls
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  63) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  64) 	RDS supports a number of sysctls in /proc/sys/net/rds.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  65) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  66) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  67) Socket Interface
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  68) ================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  69) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  70)   AF_RDS, PF_RDS, SOL_RDS
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  71) 	AF_RDS and PF_RDS are the domain type to be used with socket(2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  72) 	to create RDS sockets. SOL_RDS is the socket-level to be used
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  73) 	with setsockopt(2) and getsockopt(2) for RDS specific socket
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  74) 	options.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  75) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  76)   fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  77) 	This creates a new, unbound RDS socket.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  78) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  79)   setsockopt(SOL_SOCKET): send and receive buffer size
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  80) 	RDS honors the send and receive buffer size socket options.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  81) 	You are not allowed to queue more than SO_SNDBUF bytes to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  82) 	a socket. A message is queued when sendmsg is called, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  83) 	it leaves the queue when the remote system acknowledges
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  84) 	its arrival.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  85) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  86) 	The SO_RCVBUF option controls the maximum receive queue length.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  87) 	This is a soft limit rather than a hard limit - RDS will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  88) 	continue to accept and queue incoming messages, even if that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  89) 	takes the queue length over the limit. However, it will also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  90) 	mark the port as "congested" and send a congestion update to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  91) 	the source node. The source node is supposed to throttle any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  92) 	processes sending to this congested port.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  93) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  94)   bind(fd, &sockaddr_in, ...)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  95) 	This binds the socket to a local IP address and port, and a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  96) 	transport, if one has not already been selected via the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  97) 	SO_RDS_TRANSPORT socket option.
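
As a minimal sketch of the socket(2) and bind(2) steps described above, assuming a Linux system: the helper names rds_fill_addr() and rds_open_bound() are made up for illustration, and the fallback value 21 for AF_RDS is an assumption taken from common Linux headers. The address is passed as a struct sockaddr_in, as the Addressing section describes.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef AF_RDS
#define AF_RDS 21 /* assumption: value used by Linux headers */
#endif

/* Fill a sockaddr_in from a dotted-quad string and a host-order port.
 * Returns 0 on success, -1 with errno = EINVAL if the address does
 * not parse. */
static int rds_fill_addr(struct sockaddr_in *sin, const char *ip,
			 uint16_t port)
{
	memset(sin, 0, sizeof(*sin));
	sin->sin_family = AF_INET;
	sin->sin_port = htons(port);
	if (inet_pton(AF_INET, ip, &sin->sin_addr) != 1) {
		errno = EINVAL;
		return -1;
	}
	return 0;
}

/* Create an RDS socket and bind it to ip:port.
 * Returns the fd, or -1 with errno set (for example when RDS support
 * is not available or the address is not local). */
static int rds_open_bound(const char *ip, uint16_t port)
{
	struct sockaddr_in sin;
	int fd, saved;

	if (rds_fill_addr(&sin, ip, port) < 0)
		return -1;

	fd = socket(AF_RDS, SOCK_SEQPACKET, 0);
	if (fd < 0)
		return -1;

	if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
		saved = errno;	/* close() may overwrite errno */
		close(fd);
		errno = saved;
		return -1;
	}
	return fd;
}
```

A caller would check the return value and report errno with perror(3); on a machine without RDS support the socket(2) call is expected to fail.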
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  98) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  99)   sendmsg(fd, ...)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) 	Sends a message to the indicated recipient. The kernel will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) 	transparently establish the underlying reliable connection
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) 	if it isn't up yet.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) 	An attempt to send a message that exceeds SO_SNDBUF will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) 	return with -EMSGSIZE.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) 	An attempt to send a message that would take the total number
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) 	of queued bytes over the SO_SNDBUF threshold will return
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) 	EAGAIN.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) 	An attempt to send a message to a destination that is marked
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) 	as "congested" will return ENOBUFS.
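
The three error cases above call for different reactions from the application. A small sketch of how a caller might classify them follows; the enum and the function name rds_classify_send() are hypothetical and not part of any RDS API.

```c
#include <errno.h>

/* What the caller might do after a sendmsg(2) on an RDS socket. */
enum rds_send_action {
	RDS_SEND_OK,		/* message was queued */
	RDS_SEND_TOO_BIG,	/* EMSGSIZE: message exceeds the send buffer */
	RDS_SEND_QUEUE_FULL,	/* EAGAIN: send queue is full, retry later */
	RDS_SEND_CONGESTED,	/* ENOBUFS: destination port is congested */
	RDS_SEND_FATAL		/* any other error */
};

/* Map a sendmsg(2) return value and the errno it left behind to an
 * action. 'ret' is the return value, 'err' is errno read right after
 * the call. */
static enum rds_send_action rds_classify_send(long ret, int err)
{
	if (ret >= 0)
		return RDS_SEND_OK;

	switch (err) {
	case EMSGSIZE:
		return RDS_SEND_TOO_BIG;
	case EAGAIN:
		return RDS_SEND_QUEUE_FULL;
	case ENOBUFS:
		return RDS_SEND_CONGESTED;
	default:
		return RDS_SEND_FATAL;
	}
}
```

Store the sendmsg(2) result in a variable first and then pass it together with errno, since C does not define the order in which function arguments are evaluated.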
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114)   recvmsg(fd, ...)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) 	Receives a message that was queued to this socket. The socket's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) 	recv queue accounting is adjusted, and if the queue length
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) 	drops below SO_RCVBUF, the port is marked uncongested, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) 	a congestion update is sent to all peers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) 	Applications can ask the RDS kernel module to receive
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) 	notifications via control messages (for instance, there is a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) 	notification when a congestion update arrives, or when an RDMA
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) 	operation completes). These notifications are received through
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) 	the msg.msg_control buffer of struct msghdr. The format of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) 	messages is described in manpages.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127)   poll(fd)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) 	RDS supports the poll interface to allow the application
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) 	to implement async I/O.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) 	POLLIN handling is pretty straightforward. When there's an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) 	incoming message queued to the socket, or a pending notification,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) 	we signal POLLIN.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) 	POLLOUT is a little harder. Since you can essentially send
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) 	to any destination, RDS will always signal POLLOUT as long as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) 	there's room on the send queue (i.e. the number of bytes queued
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) 	is less than the sendbuf size).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) 	However, the kernel will refuse to accept messages to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) 	a destination marked congested - in this case you will loop
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) 	forever if you rely on poll to tell you what to do.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) 	This isn't a trivial problem, but applications can deal with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) 	this - by using congestion notifications, and by checking for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) 	ENOBUFS errors returned by sendmsg.
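
One way to avoid spinning on POLLOUT while a destination is congested is to treat ENOBUFS differently from a full send queue. The sketch below is only an illustration of that idea: wait_writable() and backoff_after_send_error() are hypothetical helpers, and the fixed back-off is a stand-in for waiting on a congestion notification.

```c
#include <errno.h>
#include <poll.h>

/* Wait up to timeout_ms for fd to become writable.
 * Returns 1 if POLLOUT was signalled, 0 on timeout, -1 on error. */
static int wait_writable(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLOUT };
	int rc;

	do {
		rc = poll(&pfd, 1, timeout_ms);
	} while (rc < 0 && errno == EINTR);

	if (rc <= 0)
		return rc;
	return (pfd.revents & POLLOUT) ? 1 : -1;
}

/* Decide what to do after a send failed with errno 'err'.
 * Returns a delay in milliseconds before retrying, 0 to retry as
 * soon as POLLOUT is signalled again, or -1 to give up. */
static int backoff_after_send_error(int err, int congested_backoff_ms)
{
	if (err == ENOBUFS)
		return congested_backoff_ms; /* POLLOUT will not help */
	if (err == EAGAIN)
		return 0;
	return -1;
}
```

In a send loop, a positive result from backoff_after_send_error() means the application should stop polling for POLLOUT on that destination for a while, or better, wait for the congestion notification mentioned under recvmsg.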
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147)   setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) 	This allows the application to discard all messages queued to a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) 	specific destination on this particular socket.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151) 	This allows the application to cancel outstanding messages if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152) 	it detects a timeout. For instance, if it tried to send a message,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153) 	and the remote host is unreachable, RDS will keep trying forever.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) 	The application may decide it's not worth it, and cancel the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) 	operation. In this case, it would use RDS_CANCEL_SENT_TO to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) 	nuke any pending messages.
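
A minimal sketch of the call, assuming the destination is given as a struct sockaddr_in as in the entry heading. The wrapper name rds_cancel_sent_to() is hypothetical, and the numeric fallbacks for SOL_RDS and RDS_CANCEL_SENT_TO are assumptions that should be checked against <linux/rds.h> on the target system.

```c
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef SOL_RDS
#define SOL_RDS 276		/* assumption: check <linux/rds.h> */
#endif
#ifndef RDS_CANCEL_SENT_TO
#define RDS_CANCEL_SENT_TO 1	/* assumption: check <linux/rds.h> */
#endif

/* Discard all messages queued on 'fd' for destination 'dest'.
 * Returns 0 on success, -1 with errno set on failure. */
static int rds_cancel_sent_to(int fd, const struct sockaddr_in *dest)
{
	return setsockopt(fd, SOL_RDS, RDS_CANCEL_SENT_TO,
			  dest, sizeof(*dest));
}
```

An application would typically call this from its own timeout handling, after deciding that the peer is not worth waiting for.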
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158)   ``setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..), getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159) 	Set or read an integer defining the underlying
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) 	encapsulating transport to be used for RDS packets on the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) 	socket. When setting the option, the integer argument may be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162) 	one of RDS_TRANS_TCP or RDS_TRANS_IB. When retrieving the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) 	value, RDS_TRANS_NONE will be returned on an unbound socket.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164) 	This socket option may only be set exactly once on the socket,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) 	prior to binding it via the bind(2) system call. Attempts to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) 	set SO_RDS_TRANSPORT on a socket for which the transport has
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) 	been previously attached explicitly (by SO_RDS_TRANSPORT) or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) 	implicitly (via bind(2)) will return an error of EOPNOTSUPP.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) 	An attempt to set SO_RDS_TRANSPORT to RDS_TRANS_NONE will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) 	always return EINVAL.
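
The set and get calls can be wrapped as in the sketch below. The wrapper names are hypothetical, and the numeric fallbacks for the constants are assumptions to be verified against <linux/rds.h>. The local check for RDS_TRANS_NONE only mirrors the EINVAL behaviour described above so that the mistake is caught before the system call.

```c
#include <errno.h>
#include <sys/socket.h>

#ifndef SOL_RDS
#define SOL_RDS 276		/* assumption: check <linux/rds.h> */
#endif
#ifndef SO_RDS_TRANSPORT
#define SO_RDS_TRANSPORT 8	/* assumption: check <linux/rds.h> */
#endif
#ifndef RDS_TRANS_IB
#define RDS_TRANS_IB 0		/* assumption */
#endif
#ifndef RDS_TRANS_TCP
#define RDS_TRANS_TCP 2		/* assumption */
#endif
#ifndef RDS_TRANS_NONE
#define RDS_TRANS_NONE (~0)	/* assumption */
#endif

/* Select the transport for an unbound RDS socket. Must be called
 * before bind(2). Returns 0 on success, -1 with errno set. */
static int rds_set_transport(int fd, int transport)
{
	if (transport == RDS_TRANS_NONE) {
		errno = EINVAL;	/* never a valid value to set */
		return -1;
	}
	return setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT,
			  &transport, sizeof(transport));
}

/* Read the transport attached to the socket into *transport.
 * Returns 0 on success, -1 with errno set. */
static int rds_get_transport(int fd, int *transport)
{
	socklen_t len = sizeof(*transport);

	return getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, transport, &len);
}
```

A typical sequence would be socket(2), then rds_set_transport(fd, RDS_TRANS_TCP), then bind(2); a second attempt to set the option is expected to fail with EOPNOTSUPP as described above.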
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) RDMA for RDS
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) ============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175)   see rds-rdma(7) manpage (available in rds-tools)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178) Congestion Notifications
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179) ========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181)   see rds(7) manpage
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184) RDS Protocol
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185) ============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187)   Message header
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189)     The message header is a 'struct rds_header' (see rds.h):
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191)     Fields:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193)       h_sequence:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194) 	  per-packet sequence number
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195)       h_ack:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196) 	  piggybacked acknowledgment of last packet received
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197)       h_len:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198) 	  length of data, not including header
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199)       h_sport:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200) 	  source port
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201)       h_dport:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202) 	  destination port
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203)       h_flags:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204) 	  Can be:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 206) 	  =============  ==================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 207) 	  CONG_BITMAP    this is a congestion update bitmap
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 208) 	  ACK_REQUIRED   receiver must ack this packet
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 209) 	  RETRANSMITTED  packet has previously been sent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 210) 	  =============  ==================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 211) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 212)       h_credit:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 213) 	  indicate to other end of connection that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 214) 	  it has more credits available (i.e. there is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 215) 	  more send room)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 216)       h_padding[4]:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 217) 	  unused, for future use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 218)       h_csum:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 219) 	  header checksum
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 220)       h_exthdr:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 221) 	  optional data can be passed here. This is currently used for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 222) 	  passing RDMA-related information.
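
For orientation, the field list above can be pictured as the following user-space mirror. This is a sketch, not the kernel definition: the field widths, the 16-byte extension area, the flag bit values and the big-endian wire order are assumptions that should be checked against rds.h, and the name rds_header_sketch is invented to avoid any confusion with the real struct.

```c
#include <stddef.h>
#include <stdint.h>

#define RDS_HEADER_EXT_SPACE	16	/* assumption: size of h_exthdr */

/* Assumed bit values for h_flags. */
#define RDS_FLAG_CONG_BITMAP	0x01
#define RDS_FLAG_ACK_REQUIRED	0x02
#define RDS_FLAG_RETRANSMITTED	0x04

/* Multi-byte fields are assumed to be big-endian on the wire. */
struct rds_header_sketch {
	uint64_t h_sequence;	/* per-packet sequence number */
	uint64_t h_ack;		/* piggybacked ack of last packet received */
	uint32_t h_len;		/* payload length, header excluded */
	uint16_t h_sport;	/* source port */
	uint16_t h_dport;	/* destination port */
	uint8_t  h_flags;	/* RDS_FLAG_* bits */
	uint8_t  h_credit;	/* new send credits granted to the peer */
	uint8_t  h_padding[4];	/* unused */
	uint16_t h_csum;	/* header checksum */
	uint8_t  h_exthdr[RDS_HEADER_EXT_SPACE]; /* optional data */
};

/* With the assumed widths the header occupies 48 bytes, no padding. */
_Static_assert(sizeof(struct rds_header_sketch) == 48,
	       "unexpected header size");

/* Returns 1 if the receiver must ack this packet, 0 otherwise. */
static int rds_hdr_ack_required(const struct rds_header_sketch *h)
{
	return (h->h_flags & RDS_FLAG_ACK_REQUIRED) != 0;
}
```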
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 223) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 224)   ACK and retransmit handling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 225) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 226)       One might think that with reliable IB connections you wouldn't need
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 227)       to ack messages that have been received.  The problem is that IB
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 228)       hardware generates an ack message before it has DMAed the message
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 229)       into memory.  This creates a potential message loss if the HCA is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 230)       disabled for any reason between when it sends the ack and before
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 231)       the message is DMAed and processed.  This is only a potential issue
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 232)       if another HCA is available for fail-over.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 233) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 234)       Sending an ack immediately would allow the sender to free the sent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 235)       message from their send queue quickly, but could cause excessive
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 236)       traffic to be used for acks. RDS piggybacks acks on sent data
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 237)       packets.  Ack-only packets are reduced by only allowing one to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 238)       in flight at a time, and by the sender only asking for acks when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 239)       its send buffers start to fill up. All retransmissions are also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 240)       acked.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 241) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 242)   Flow Control
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 243) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 244)       RDS's IB transport uses a credit-based mechanism to verify that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 245)       there is space in the peer's receive buffers for more data. This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 246)       eliminates the need for hardware retries on the connection.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 247) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 248)   Congestion
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 249) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 250)       Messages waiting in the receive queue on the receiving socket
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 251)       are accounted against the socket's SO_RCVBUF option value.  Only
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 252)       the payload bytes in the message are accounted for.  If the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 253)       number of bytes queued equals or exceeds rcvbuf then the socket
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 254)       is congested.  All sends attempted to this socket's address
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 255)       should block or return -EWOULDBLOCK.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 256) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 257)       Applications are expected to be reasonably tuned such that this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 258)       situation very rarely occurs.  An application encountering this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 259)       "back-pressure" is considered a bug.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 260) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 261)       This is implemented by having each node maintain bitmaps which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 262)       indicate which ports on bound addresses are congested.  As the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 263)       bitmap changes it is sent through all the connections which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 264)       terminate in the local address of the bitmap which changed.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 265) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 266)       The bitmaps are allocated as connections are brought up.  This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 267)       avoids allocation in the interrupt handling path which queues
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 268)       messages on sockets.  The dense bitmaps let transports send the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 269)       entire bitmap on any bitmap change reasonably efficiently.  This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 270)       is much easier to implement than some finer-grained
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 271)       communication of per-port congestion.  The sender does a very
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 272)       inexpensive bit test to test if the port it's about to send to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 273)       is congested or not.
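
The "inexpensive bit test" can be illustrated with a dense bitmap holding one bit per 16-bit port. The sketch below uses a single flat array and host-order port numbers for simplicity; the kernel's actual memory layout and bit ordering are not shown here, and all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define CONG_PORTS	65536			/* one bit per 16-bit port */
#define CONG_BYTES	(CONG_PORTS / 8)	/* 8192 bytes in total */

struct cong_map_sketch {
	uint8_t bits[CONG_BYTES];
};

/* Mark a port as congested. */
static void cong_set(struct cong_map_sketch *m, uint16_t port)
{
	m->bits[port >> 3] |= (uint8_t)(1u << (port & 7));
}

/* Mark a port as no longer congested. */
static void cong_clear(struct cong_map_sketch *m, uint16_t port)
{
	m->bits[port >> 3] &= (uint8_t)~(1u << (port & 7));
}

/* The sender-side check: one index, one mask, one comparison. */
static bool cong_test(const struct cong_map_sketch *m, uint16_t port)
{
	return (m->bits[port >> 3] & (1u << (port & 7))) != 0;
}
```

Because the whole map is only 8 KiB in this layout, resending it in full on every change stays cheap compared with tracking per-port updates.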
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 274) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 275) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 276) RDS Transport Layer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 277) ===================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 278) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 279)   As mentioned above, RDS is not IB-specific. Its code is divided
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 280)   into a general RDS layer and a transport layer.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 281) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 282)   The general layer handles the socket API, congestion handling,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 283)   loopback, stats, usermem pinning, and the connection state machine.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 284) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 285)   The transport layer handles the details of the transport. The IB
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 286)   transport, for example, handles all the queue pairs, work requests,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 287)   CM event handlers, and other Infiniband details.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 288) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 289) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 290) RDS Kernel Structures
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 291) =====================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 292) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 293)   struct rds_message
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 294)     aka possibly "rds_outgoing", the generic RDS layer copies data to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 295)     be sent and sets header fields as needed, based on the socket API.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 296)     This is then queued for the individual connection and sent by the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 297)     connection's transport.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 298) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 299)   struct rds_incoming
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 300)     a generic struct referring to incoming data that can be handed from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 301)     the transport to the general code and queued by the general code
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 302)     while the socket is awoken. It is then passed back to the transport
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 303)     code to handle the actual copy-to-user.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 304) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 305)   struct rds_socket
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 306)     per-socket information
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 307) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 308)   struct rds_connection
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 309)     per-connection information
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 310) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 311)   struct rds_transport
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 312)     pointers to transport-specific functions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 313) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 314)   struct rds_statistics
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 315)     non-transport-specific statistics
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 316) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 317)   struct rds_cong_map
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 318)     wraps the raw congestion bitmap, contains rbnode, waitq, etc.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 319) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 320) Connection management
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 321) =====================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 322) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 323)   Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 324)   ERROR states.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 325) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 326)   The first time an attempt is made by an RDS socket to send data to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 327)   a node, a connection is allocated and connected. That connection is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 328)   then maintained forever -- if there are transport errors, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 329)   connection will be dropped and re-established.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 330) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 331)   Dropping a connection while packets are queued will cause queued or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 332)   partially-sent datagrams to be retransmitted when the connection is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 333)   re-established.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 334) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 335) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 336) The send path
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 337) =============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 338) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 339)   rds_sendmsg()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 340)     - struct rds_message built from incoming data
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 341)     - CMSGs parsed (e.g. RDMA ops)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 342)     - transport connection alloced and connected if not already
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 343)     - rds_message placed on send queue
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 344)     - send worker awoken
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 345) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 346)   rds_send_worker()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 347)     - calls rds_send_xmit() until queue is empty
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 348) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 349)   rds_send_xmit()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 350)     - transmits congestion map if one is pending
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 351)     - may set ACK_REQUIRED
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 352)     - calls transport to send either non-RDMA or RDMA message
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 353)       (RDMA ops never retransmitted)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 354) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 355)   rds_ib_xmit()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 356)     - allocs work requests from send ring
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 357)     - adds any new send credits available to peer (h_credit)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 358)     - maps the rds_message's sg list
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 359)     - piggybacks ack
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 360)     - populates work requests
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 361)     - post send to connection's queue pair
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 362) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 363) The recv path
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 364) =============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 365) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 366)   rds_ib_recv_cq_comp_handler()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 367)     - looks at write completions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 368)     - unmaps recv buffer from device
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 369)     - no errors, call rds_ib_process_recv()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 370)     - refill recv ring
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 371) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 372)   rds_ib_process_recv()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 373)     - validate header checksum
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 374)     - copy header to rds_ib_incoming struct if start of a new datagram
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 375)     - add to ibinc's fraglist
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 376)     - if completed datagram:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 377) 	 - update cong map if datagram was cong update
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 378) 	 - call rds_recv_incoming() otherwise
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 379) 	 - note if ack is required
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 380) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 381)   rds_recv_incoming()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 382)     - drop duplicate packets
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 383)     - respond to pings
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 384)     - find the sock associated with this datagram
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 385)     - add to sock queue
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 386)     - wake up sock
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 387)     - do some congestion calculations
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 388)   rds_recvmsg()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 389)     - copy data into user iovec
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 390)     - handle CMSGs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 391)     - return to application
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 392) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 393) Multipath RDS (mprds)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 394) =====================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 395)   Mprds is multipathed-RDS, primarily intended for RDS-over-TCP
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 396)   (though the concept can be extended to other transports). The classical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 397)   implementation of RDS-over-TCP multiplexes multiple
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 398)   PF_RDS sockets between any 2 endpoints (where endpoint == [IP address,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 399)   port]) over a single TCP socket between the 2 IP addresses involved. This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 400)   has the limitation that it ends up funneling multiple RDS flows over a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 401)   single TCP flow, so the connection
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 402)   (a) is upper-bounded by the single-flow bandwidth,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 403)   (b) suffers from head-of-line blocking for all the RDS sockets.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 404) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 405)   Better throughput (for a fixed small packet size, MTU) can be achieved
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 406)   by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 407)   RDS (mprds).  Each such TCP/IP flow constitutes a path for the rds/tcp
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 408)   connection. RDS sockets will be attached to a path based on some hash
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 409)   (e.g., of local address and RDS port number) and packets for that RDS
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 410)   socket will be sent over the attached path using TCP to segment/reassemble
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 411)   RDS datagrams on that path.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 412) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 413)   Multipathed RDS is implemented by splitting the struct rds_connection into
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 414)   a common (to all paths) part, and a per-path struct rds_conn_path. All
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 415)   I/O workqs and reconnect threads are driven from the rds_conn_path.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 416)   Transports such as TCP that are multipath capable may then set up a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 417)   TCP socket per rds_conn_path, and this is managed by the transport via
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 418)   the transport private cp_transport_data pointer.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 419) 
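The split into a common part and a per-path part can be pictured with the
following simplified user-space model. It is only a sketch: the text names
struct rds_connection, struct rds_conn_path and cp_transport_data, while the
other fields (c_npaths, c_path, cp_conn, cp_index) and the helpers
conn_alloc() and conn_free() are assumptions made for illustration, and the
real kernel structures hold far more state.

```c
#include <assert.h>
#include <stdlib.h>

struct rds_connection;

/* Per-path state: one of these exists for every path of a connection. */
struct rds_conn_path {
	struct rds_connection *cp_conn;	/* back pointer to the shared part */
	int cp_index;			/* position of this path in c_path[] */
	void *cp_transport_data;	/* transport private, e.g. a TCP socket */
};

/* State common to all paths of one connection. */
struct rds_connection {
	int c_npaths;			/* number of paths in use */
	struct rds_conn_path *c_path;	/* array of c_npaths paths */
};

/* Allocate a connection with npaths paths; returns NULL on failure. */
static struct rds_connection *conn_alloc(int npaths)
{
	struct rds_connection *conn;
	int i;

	if (npaths < 1)
		return NULL;
	conn = calloc(1, sizeof(*conn));
	if (!conn)
		return NULL;
	conn->c_path = calloc((size_t)npaths, sizeof(*conn->c_path));
	if (!conn->c_path) {
		free(conn);
		return NULL;
	}
	conn->c_npaths = npaths;
	for (i = 0; i < npaths; i++) {
		/* Every path can reach the shared state through cp_conn. */
		conn->c_path[i].cp_conn = conn;
		conn->c_path[i].cp_index = i;
	}
	return conn;
}

/* Release a connection created by conn_alloc(); NULL is accepted. */
static void conn_free(struct rds_connection *conn)
{
	if (!conn)
		return;
	free(conn->c_path);
	free(conn);
}
```

In this model a multipath-capable transport would store its per-path socket
in c_path[i].cp_transport_data, while a single-path transport would simply
use a connection with one path.
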
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 420)   Transports announce themselves as multipath capable by setting the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 421)   t_mp_capable bit during registration with the rds core module. When the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 422)   transport is multipath-capable, rds_sendmsg() hashes outgoing traffic
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 423)   across multiple paths. The outgoing hash is computed based on the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 424)   local address and port that the PF_RDS socket is bound to.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 425) 
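The outgoing path selection described above can be sketched as a hash of
the bound address and port reduced to a path index. This is a minimal
illustration only: select_path() and fnv1a() are hypothetical helpers, and
FNV-1a is used here as a stand-in, not as the hash function the kernel
actually uses.

```c
#include <assert.h>
#include <stdint.h>

/* FNV-1a over a byte buffer; a stand-in, not the hash the kernel uses. */
static uint32_t fnv1a(const void *buf, unsigned int len, uint32_t h)
{
	const unsigned char *p = buf;
	unsigned int i;

	for (i = 0; i < len; i++) {
		h ^= p[i];
		h *= 16777619u;		/* FNV prime */
	}
	return h;
}

/*
 * Map the bound (address, port) of a socket to a path index in
 * [0, npaths). Every send from the same socket picks the same path.
 */
static unsigned int select_path(uint32_t bound_addr, uint16_t bound_port,
				unsigned int npaths)
{
	uint32_t h = 2166136261u;	/* FNV offset basis */

	assert(npaths > 0);
	h = fnv1a(&bound_addr, sizeof(bound_addr), h);
	h = fnv1a(&bound_port, sizeof(bound_port), h);
	return h % npaths;
}
```

Because the index depends only on the socket's bound address and port, all
datagrams from one PF_RDS socket stay on one path, which preserves their
ordering, while different sockets spread across the available paths.
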
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 426)   Additionally, even if the transport is MP capable, we may be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 427)   peering with some node that does not support mprds, or supports
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 428)   a different number of paths. As a result, the peering nodes need
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 429)   to agree on the number of paths to be used for the connection.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 430)   This is done by sending out a control packet exchange before the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 431)   first data packet. The control packet exchange must have completed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 432)   before the outgoing hash is computed in rds_sendmsg() when the transport
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 433)   is multipath capable.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 434) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 435)   The control packet is an RDS ping packet (i.e., a packet to rds dest
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 436)   port 0) that carries an rds extension header option of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 437)   type RDS_EXTHDR_NPATHS, length 2 bytes, whose value is the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 438)   number of paths supported by the sender. The "probe" ping packet will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 439)   get sent from some reserved port, RDS_FLAG_PROBE_PORT (in <linux/rds.h>).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 440)   The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 441)   be able to compute the min(sender_paths, rcvr_paths). The pong
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 442)   sent in response to a probe-ping should contain the rcvr's npaths
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 443)   when the rcvr is mprds-capable.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 444) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 445)   If the rcvr is not mprds-capable, the exthdr in the ping will be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 446)   ignored.  In this case the pong will not have any exthdrs, so the sender
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 447)   of the probe-ping can default to single-path mprds.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 448)
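The outcome of the probe exchange described above can be summarized in a
few lines. This is a sketch under stated assumptions: negotiate_npaths() is
a hypothetical helper, and treating a peer value of 0 as a single path is a
choice made for this example rather than behaviour documented here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Decide how many paths to use after the probe exchange.
 * local_npaths:  paths this node supports (at least 1)
 * peer_has_ext:  nonzero if the peer's ping/pong carried an npaths exthdr
 * peer_npaths:   value from that exthdr, ignored when peer_has_ext is 0
 */
static uint16_t negotiate_npaths(uint16_t local_npaths, int peer_has_ext,
				 uint16_t peer_npaths)
{
	assert(local_npaths >= 1);

	/* No exthdr: the peer does not speak mprds, use a single path. */
	if (!peer_has_ext)
		return 1;
	/* Treat a nonsensical 0 as "one path" (an assumption of this sketch). */
	if (peer_npaths == 0)
		return 1;
	/* Both sides end up with min(sender_paths, rcvr_paths). */
	return local_npaths < peer_npaths ? local_npaths : peer_npaths;
}
```

Both peers run the same computation on the values they exchanged, so they
arrive at the same path count without a further round trip.
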