^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) .. SPDX-License-Identifier: GPL-2.0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) =============================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4) Kernel Connection Multiplexor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) =============================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8) interface over TCP for generic application protocols. With KCM an application
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9) can efficiently send and receive application protocol messages over TCP using
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10) datagram sockets.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12) KCM implements an NxM multiplexor in the kernel as diagrammed below::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14)     +------------+   +------------+   +------------+   +------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15)     | KCM socket |   | KCM socket |   | KCM socket |   | KCM socket |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16)     +------------+   +------------+   +------------+   +------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17)           |                |                |                |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18)           +-----------+    |                |     +----------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19)                       |    |                |     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20)                   +----------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21)                   |           Multiplexor            |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22)                   +----------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23)                    |   |             |          |   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24)          +---------+   |             |          |   ------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25)          |             |             |          |               |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26)     +----------+  +----------+  +----------+  +----------+ +----------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27)     |  Psock   |  |  Psock   |  |  Psock   |  |  Psock   | |  Psock   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28)     +----------+  +----------+  +----------+  +----------+ +----------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29)          |             |             |             |            |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30)     +----------+  +----------+  +----------+  +----------+ +----------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31)     | TCP sock |  | TCP sock |  | TCP sock |  | TCP sock | | TCP sock |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32)     +----------+  +----------+  +----------+  +----------+ +----------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34) KCM sockets
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35) ===========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37) The KCM sockets provide the user interface to the multiplexor. All the KCM sockets
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) bound to a multiplexor are considered to have equivalent function, and I/O
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39) operations in different sockets may be done in parallel without the need for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40) synchronization between threads in userspace.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42) Multiplexor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43) ===========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45) The multiplexor provides the message steering. In the transmit path, messages
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) written on a KCM socket are sent atomically on an appropriate TCP socket.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47) Similarly, in the receive path, messages are constructed on each TCP socket
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48) (Psock) and complete messages are steered to a KCM socket.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) TCP sockets & Psocks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51) ====================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53) TCP sockets may be bound to a KCM multiplexor. A Psock structure is allocated
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) for each bound TCP socket; this structure holds the state for constructing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) messages on receive as well as other connection specific information for KCM.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57) Connected mode semantics
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58) ========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60) Each multiplexor assumes that all attached TCP connections are to the same
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61) destination and can use the different connections for load balancing when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) transmitting. The normal send and recv calls (including sendmmsg and recvmmsg)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63) can be used to send and receive messages on the KCM socket.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65) Socket types
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66) ============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68) KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70) Message delineation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71) -------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73) Messages are sent over a TCP stream with some application protocol message
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74) format that typically includes a header which frames the messages. The length
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75) of a received message can be deduced from the application protocol header
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76) (often just a simple length field).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78) A TCP stream must be parsed to determine message boundaries. Berkeley Packet
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79) Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80) BPF program must be specified. The program is called at the start of receiving
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81) a new message and is given an skbuff that contains the bytes received so far.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82) It parses the message header and returns the length of the message. Given this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83) information, KCM will construct the message of the stated length and deliver it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84) to a KCM socket.
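
The length rule that such a parser applies can be sketched in plain userspace C. This is only an illustration under stated assumptions: the framing is a 4-byte big-endian length prefix followed by the payload, and the helper name ``frame_len`` and its "return 0 when the header is incomplete" convention are hypothetical, not part of the KCM interface.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: compute the total length of a message framed as
 * a 4-byte big-endian payload length followed by the payload.
 * Returns 0 when fewer than 4 bytes are available, since the length
 * cannot be determined yet.
 */
uint32_t frame_len(const uint8_t *buf, size_t avail)
{
	uint32_t payload;

	if (avail < 4)
		return 0;

	/* Assemble the length field in network byte order. */
	payload = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
		  ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];

	/* Total message length = header (4 bytes) + payload. */
	return payload + 4;
}
```

For a buffer that starts with the bytes 00 00 00 05, the helper reports a 9-byte message: five payload bytes plus the 4-byte header.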
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86) TCP socket management
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87) ---------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89) When a TCP socket is attached to a KCM multiplexor, data ready (POLLIN) and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90) write space available (POLLOUT) events are handled by the multiplexor. If there
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91) is a state change (disconnection) or other error on a TCP socket, an error is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92) posted on the TCP socket so that a POLLERR event happens and KCM discontinues
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93) using the socket. When the application gets the error notification for a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94) TCP socket, it should unattach the socket from KCM and then handle the error
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95) condition (the typical response is to close the socket and create a new
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96) connection if necessary).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98) KCM limits the maximum receive message size to be the size of the receive
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99) socket buffer on the attached TCP socket (the socket buffer size can be set by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) SO_RCVBUF). If the length of a new message reported by the BPF program is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) greater than this limit a corresponding error (EMSGSIZE) is posted on the TCP
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) socket. The BPF program may also enforce a maximum message size and report an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) error when it is exceeded.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) A timeout may be set for assembling messages on a receive socket. The timeout
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) value is taken from the receive timeout of the attached TCP socket (this is set
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) by SO_RCVTIMEO). If the timer expires before assembly is complete an error
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) (ETIMEDOUT) is posted on the socket.
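
Both limits are ordinary socket options on the TCP socket, so they can be set before the socket is attached. The following is a minimal sketch; the helper name ``set_rx_limits`` and the example values are assumptions chosen for illustration.

```c
#include <sys/socket.h>
#include <sys/time.h>

/*
 * Hypothetical helper: configure the receive buffer size (which bounds
 * the maximum receive message size) and the receive timeout (which
 * bounds message assembly time) on a TCP socket that will be attached
 * to a KCM multiplexor. Returns 0 on success, -1 on error.
 */
int set_rx_limits(int tcpfd, int rcvbuf_bytes, long timeout_sec)
{
	struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };

	/* Note: Linux doubles the requested SO_RCVBUF value internally. */
	if (setsockopt(tcpfd, SOL_SOCKET, SO_RCVBUF,
		       &rcvbuf_bytes, sizeof(rcvbuf_bytes)) < 0)
		return -1;

	if (setsockopt(tcpfd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
		return -1;

	return 0;
}
```

A call such as ``set_rx_limits(tcpfd, 65536, 5)`` would request a 64 KiB receive buffer and a five second assembly timeout.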
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) User interface
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) ==============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) Creating a multiplexor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) ----------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) A new multiplexor and initial KCM socket are created by a socket call::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118)   socket(AF_KCM, type, protocol)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) - type is either SOCK_DGRAM or SOCK_SEQPACKET
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) - protocol is KCMPROTO_CONNECTED
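
A slightly fuller sketch of the call with error reporting is shown here. The wrapper name ``kcm_open`` is hypothetical. The fallback definitions are assumptions for building against older headers: ``AF_KCM`` normally comes from ``sys/socket.h`` and ``KCMPROTO_CONNECTED`` from ``linux/kcm.h``.

```c
#include <errno.h>
#include <sys/socket.h>

/* Fallbacks for older headers (assumed values: AF_KCM 41, protocol 0). */
#ifndef AF_KCM
#define AF_KCM 41
#endif
#ifndef KCMPROTO_CONNECTED
#define KCMPROTO_CONNECTED 0
#endif

/*
 * Hypothetical wrapper: create a new multiplexor and its first KCM
 * socket. type should be SOCK_DGRAM or SOCK_SEQPACKET.
 * Returns the file descriptor, or a negative errno value on failure
 * (for example when the kernel was built without KCM support).
 */
int kcm_open(int type)
{
	int fd = socket(AF_KCM, type, KCMPROTO_CONNECTED);

	return fd >= 0 ? fd : -errno;
}
```

Returning ``-errno`` lets the caller distinguish a kernel without KCM support from an unsupported socket type.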
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) Cloning KCM sockets
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) -------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) After the first KCM socket is created using the socket call as described
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) above, additional sockets for the multiplexor can be created by cloning
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) a KCM socket. This is accomplished by an ioctl on a KCM socket::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130)   /* From linux/kcm.h */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131)   struct kcm_clone {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132)       int fd;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133)   };
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135)   struct kcm_clone info;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137)   memset(&info, 0, sizeof(info));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139)   err = ioctl(kcmfd, SIOCKCMCLONE, &info);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141)   if (!err)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142)       newkcmfd = info.fd;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) Attach transport sockets
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) ------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) Attaching of transport sockets to a multiplexor is performed by calling an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) ioctl on a KCM socket for the multiplexor, e.g.::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150)   /* From linux/kcm.h */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151)   struct kcm_attach {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152)       int fd;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153)       int bpf_fd;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154)   };
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156)   struct kcm_attach info;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158)   memset(&info, 0, sizeof(info));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160)   info.fd = tcpfd;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161)   info.bpf_fd = bpf_prog_fd;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163)   ioctl(kcmfd, SIOCKCMATTACH, &info);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) The kcm_attach structure contains:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) - fd: file descriptor for TCP socket being attached
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) - bpf_fd: file descriptor for the compiled BPF program that has been loaded
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) Unattach transport sockets
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171) --------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) Unattaching a transport socket from a multiplexor is straightforward. An
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174) "unattach" ioctl is done with the kcm_unattach structure as the argument::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176)   /* From linux/kcm.h */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177)   struct kcm_unattach {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178)       int fd;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179)   };
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181)   struct kcm_unattach info;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183)   memset(&info, 0, sizeof(info));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185)   info.fd = tcpfd;    /* the attached TCP socket */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187)   ioctl(kcmfd, SIOCKCMUNATTACH, &info);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189) Disabling receive on KCM socket
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190) -------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192) A setsockopt is used to disable or enable receiving on a KCM socket.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193) When receive is disabled, any pending messages in the socket's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194) receive buffer are moved to other sockets. This feature is useful
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195) if an application thread knows that it will be doing a lot of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196) work on a request and won't be able to service new messages for a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197) while. Example use::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199)   int val = 1;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201)   setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE, &val, sizeof(val));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203) BPF programs for message delineation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204) ------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 206) BPF programs can be compiled using the BPF LLVM backend. For example,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 207) the BPF program for parsing Thrift is::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 208)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 209)   #include "bpf.h" /* for __sk_buff */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 210)   #include "bpf_helpers.h" /* for load_word intrinsic */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 211)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 212)   SEC("socket_kcm")
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 213)   int bpf_prog1(struct __sk_buff *skb)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 214)   {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 215)       return load_word(skb, 0) + 4;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 216)   }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 217)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 218)   char _license[] SEC("license") = "GPL";
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 219)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 220) Use in applications
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 221) ===================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 222)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 223) KCM accelerates application layer protocols. Specifically, it allows
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 224) applications to use a message based interface for sending and receiving
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 225) messages. The kernel provides necessary assurances that messages are sent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 226) and received atomically. This relieves much of the burden applications have
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 227) in mapping a message based protocol onto the TCP stream. KCM also makes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 228) application layer messages a unit of work in the kernel for the purposes of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 229) steering and scheduling, which in turn allows a simpler networking model in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 230) multithreaded applications.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 231)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 232) Configurations
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 233) --------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 234)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 235) In an Nx1 configuration, KCM logically provides multiple socket handles
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 236) to the same TCP connection. This allows parallelism between I/O
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 237) operations on the TCP socket (for instance copyin and copyout of data is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 238) parallelized). In an application, a KCM socket can be opened for each
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 239) processing thread and inserted into the epoll (similar to how SO_REUSEPORT
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 240) is used to allow multiple listener sockets on the same port).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 241)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 242) In an MxN configuration, multiple connections are established to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 243) same destination. These are used for simple load balancing.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 244)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 245) Message batching
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 246) ----------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 247)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 248) The primary purpose of KCM is load balancing between KCM sockets and hence
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 249) threads in a nominal use case. Perfect load balancing, that is steering
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 250) each received message to a different KCM socket or steering each sent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 251) message to a different TCP socket, can negatively impact performance
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 252) since this doesn't allow for affinities to be established. Balancing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 253) based on groups, or batches of messages, can be beneficial for performance.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 254)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 255) On transmit, there are three ways an application can batch (pipeline)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 256) messages on a KCM socket.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 257)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 258) 1) Send multiple messages in a single sendmmsg.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 259) 2) Send a group of messages each with a sendmsg call, where all messages
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 260)    except the last have MSG_BATCH in the flags of the sendmsg call.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 261) 3) Create a "super message" composed of multiple messages and send this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 262)    with a single sendmsg.
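
The second approach can be sketched as follows. The helper names ``batch_flags`` and ``send_batch`` are hypothetical, each ``struct iovec`` is assumed to hold exactly one complete application message, and the ``MSG_BATCH`` fallback value is an assumption for older headers.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Fallback for older headers (assumed value from linux/socket.h). */
#ifndef MSG_BATCH
#define MSG_BATCH 0x40000
#endif

/* Flags for message i of n: every message except the last is batched. */
int batch_flags(size_t i, size_t n)
{
	return (i + 1 < n) ? MSG_BATCH : 0;
}

/*
 * Hypothetical helper: send n messages on a KCM socket as one batch,
 * one sendmsg call per message. Returns 0 on success, -1 on error.
 */
int send_batch(int kcmfd, struct iovec *msgs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		struct msghdr mh = { .msg_iov = &msgs[i], .msg_iovlen = 1 };

		if (sendmsg(kcmfd, &mh, batch_flags(i, n)) < 0)
			return -1;
	}

	return 0;
}
```

Sending the final message without ``MSG_BATCH`` is what marks the end of the batch.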
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 263)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 264) On receive, the KCM module attempts to queue messages received on the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 265) same KCM socket during each TCP ready callback. The targeted KCM socket
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 266) changes at each receive ready callback on the KCM socket. The application
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 267) does not need to configure this.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 268)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 269) Error handling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 270) --------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 271)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 272) An application should include a thread to monitor errors raised on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 273) the TCP connection. Normally, this will be done by placing each
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 274) TCP socket attached to a KCM multiplexor in an epoll set for the POLLERR
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 275) event. If an error occurs on an attached TCP socket, KCM sets an EPIPE
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 276) on the socket, thus waking up the application thread. When the application
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 277) sees the error (which may just be a disconnect) it should unattach the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 278) socket from KCM and then close it. It is assumed that once an error is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 279) posted on the TCP socket the data stream is unrecoverable (i.e. an error
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 280) may have occurred in the middle of receiving a message).
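
A minimal sketch of the monitoring side is given below. The helper names ``watch_tcp_errors`` and ``pending_error`` are hypothetical, and reading the error with ``SO_ERROR`` is an assumption about how an application might inspect it. The unattach and close steps are left to the caller.

```c
#include <sys/epoll.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: register an attached TCP socket with an epoll
 * instance so the monitoring thread is woken when an error is posted.
 * Returns 0 on success, -1 on error.
 */
int watch_tcp_errors(int epfd, int tcpfd)
{
	struct epoll_event ev = { .events = EPOLLERR, .data.fd = tcpfd };

	return epoll_ctl(epfd, EPOLL_CTL_ADD, tcpfd, &ev);
}

/*
 * Hypothetical helper: fetch the pending error on a socket.
 * Returns the errno-style error code (0 if none), or -1 if the query
 * itself failed. Reading SO_ERROR also clears the pending error.
 */
int pending_error(int tcpfd)
{
	int err = 0;
	socklen_t len = sizeof(err);

	if (getsockopt(tcpfd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
		return -1;

	return err;
}
```

When ``epoll_wait`` reports ``EPOLLERR`` for a descriptor, the thread can log the value from ``pending_error``, then unattach the socket from KCM and close it.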
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 281)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 282) TCP connection monitoring
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 283) -------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 284)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 285) In KCM there is no means to correlate a message to the TCP socket that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 286) was used to send or receive the message (except in the case there is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 287) only one attached TCP socket). However, the application does retain
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 288) an open file descriptor to the socket so it will be able to get statistics
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 289) from the socket which can be used in detecting issues (such as high
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 290) retransmissions on the socket).