Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

============
MSG_ZEROCOPY
============

Intro
=====

The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
The feature is currently implemented for TCP and UDP sockets.


Opportunity and Caveats
-----------------------

Copying large buffers between user process and kernel can be
expensive. Linux supports various interfaces that eschew copying,
such as sendpage and splice. The MSG_ZEROCOPY flag extends the
underlying copy avoidance mechanism to common socket send calls.

Copy avoidance is not a free lunch. As implemented, with page pinning,
it replaces per-byte copy cost with page accounting and completion
notification overhead. As a result, MSG_ZEROCOPY is generally only
effective for writes larger than roughly 10 KB.

Page pinning also changes system call semantics. It temporarily shares
the buffer between process and network stack. Unlike with copying, the
process cannot immediately overwrite the buffer after system call
return without possibly modifying the data in flight. Kernel integrity
is not affected, but a buggy program can possibly corrupt its own data
stream.

The kernel returns a notification when it is safe to modify data.
Converting an existing application to MSG_ZEROCOPY is therefore not
always as trivial as just passing the flag.


More Info
---------

Much of this document was derived from a longer paper presented at
netdev 2.1. For more in-depth information see that paper and talk,
the excellent reporting over at LWN.net or read the original code.

  paper, slides, video
    https://netdevconf.org/2.1/session.html?debruijn

  LWN article
    https://lwn.net/Articles/726917/

  patchset
    [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
    https://lkml.kernel.org/netdev/20170803202945.70750-1-willemdebruijn.kernel@gmail.com


Interface
=========

Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy
avoidance, but not the only one.

Socket Setup
------------

The kernel is permissive when applications pass undefined flags to the
send system call. By default it simply ignores these. To avoid enabling
copy avoidance mode for legacy processes that accidentally already pass
this flag, a process must first signal intent by setting a socket option:

::

	int one = 1;	/* option value: 1 enables SO_ZEROCOPY */

	if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
		error(1, errno, "setsockopt zerocopy");

Transmission
------------

The change to send (or sendto, sendmsg, sendmmsg) itself is trivial.
Pass the new flag.

::

	ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY);

A zerocopy failure will return -1 with errno ENOBUFS. This happens if
the socket option was not set, the socket exceeds its optmem limit or
the user exceeds its ulimit on locked pages.
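
One possible way to handle the ENOBUFS case is to retry the same
buffer as an ordinary copying send. The sketch below is illustrative
only: the helper name send_zerocopy_or_copy() and its zerocopy output
parameter are hypothetical, and it assumes SO_ZEROCOPY was already set
on the socket.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000	/* fallback for older libc headers */
#endif

/*
 * Hypothetical helper: try a zerocopy send and fall back to a regular,
 * copying send if the kernel refuses with ENOBUFS.
 * *zerocopy is set to 1 only if the MSG_ZEROCOPY call sent data, in
 * which case the caller must leave buf alone until the completion
 * notification arrives.
 */
static ssize_t send_zerocopy_or_copy(int fd, const void *buf, size_t len,
				     int *zerocopy)
{
	ssize_t ret = send(fd, buf, len, MSG_ZEROCOPY);

	*zerocopy = (ret > 0);
	if (ret == -1 && errno == ENOBUFS)
		ret = send(fd, buf, len, 0);	/* plain copy, no notification */
	return ret;
}
```

Any other error is passed through unchanged, so the caller's existing
error handling still applies.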


Mixing copy avoidance and copying
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many workloads have a mixture of large and small buffers. Because copy
avoidance is more expensive than copying for small packets, the
feature is implemented as a flag. It is safe to mix calls that pass
the flag with calls that do not.
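
A per-call size check is one simple way to mix both modes. The sketch
below is a minimal example: the name send_flags_for_len() and the
ZC_THRESHOLD_BYTES constant are hypothetical, and the 10 KB cut-over
is only the rough figure quoted earlier in this document, not a
measured value.

```c
#include <stddef.h>
#include <sys/socket.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000	/* fallback for older libc headers */
#endif

/* Assumed cut-over point (~10 KB); tune it by measurement. */
#define ZC_THRESHOLD_BYTES (10 * 1024)

/* Pick send flags per call: zerocopy for large buffers, copy for small. */
static int send_flags_for_len(size_t len)
{
	return len >= ZC_THRESHOLD_BYTES ? MSG_ZEROCOPY : 0;
}
```

A caller would then write send(fd, buf, len, send_flags_for_len(len))
and expect a completion notification only for the calls that passed
the flag.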


Notifications
-------------

The kernel has to notify the process when it is safe to reuse a
previously passed buffer. It queues completion notifications on the
socket error queue, akin to the transmit timestamping interface.

The notification itself is a simple scalar value. Each socket
maintains an internal unsigned 32-bit counter. Each send call with
MSG_ZEROCOPY that successfully sends data increments the counter. The
counter is not incremented on failure or if called with length zero.
The counter counts system call invocations, not bytes. It wraps after
UINT_MAX calls.
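
To match notifications to buffers, a process can keep its own copy of
this counter. The sketch below is illustrative: struct zc_counter and
zc_counter_note_send() are hypothetical names, and it assumes the
kernel counter starts at zero on a fresh socket.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical user-space mirror of the per-socket zerocopy counter. */
struct zc_counter {
	uint32_t next_id;	/* ID the next counted zerocopy send gets */
};

/*
 * Call after every send(..., MSG_ZEROCOPY); ret is its return value.
 * Returns true and stores the call's ID in *id if the call is counted.
 * Failed and zero-length sends are not counted, mirroring the rules
 * described above.
 */
static bool zc_counter_note_send(struct zc_counter *c, ssize_t ret,
				 uint32_t *id)
{
	if (ret <= 0)
		return false;
	*id = c->next_id++;	/* unsigned arithmetic wraps to 0 */
	return true;
}
```

The returned ID can be stored next to the buffer so that a later
notification range identifies which buffers are free again.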


Notification Reception
~~~~~~~~~~~~~~~~~~~~~~

The below snippet demonstrates the API. In the simplest case, each
send syscall is followed by a poll and recvmsg on the error queue.

Reading from the error queue is always a non-blocking operation. The
poll call is there to block until an error is outstanding. It will set
POLLERR in its output flags. That flag does not have to be set in the
events field. Errors are signaled unconditionally.

::

	pfd.fd = fd;
	pfd.events = 0;
	/* Parenthesize the mask test: == binds tighter than & in C. */
	if (poll(&pfd, 1, -1) != 1 || (pfd.revents & POLLERR) == 0)
		error(1, errno, "poll");

	ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
	if (ret == -1)
		error(1, errno, "recvmsg");

	/* msg.msg_control must point to a buffer for the control data. */
	read_notification(&msg);

The example is for demonstration purposes only. In practice, it is more
efficient to not wait for notifications, but read without blocking
every couple of send calls.
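
The non-blocking pattern can be sketched as a drain loop that runs
after every few send calls. This is a minimal sketch, not a reference
implementation: drain_error_queue() is a hypothetical name, the 128
byte control buffer is an arbitrary choice, and the loop only counts
messages where a real program would parse each one.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: read every queued error-queue message without
 * blocking. Returns the number of messages read, or -1 on a real error.
 */
static int drain_error_queue(int fd)
{
	/* union keeps the buffer suitably aligned for struct cmsghdr */
	union {
		char buf[128];
		struct cmsghdr align;
	} control;
	int count = 0;

	for (;;) {
		struct msghdr msg;

		memset(&msg, 0, sizeof(msg));
		msg.msg_control = control.buf;
		msg.msg_controllen = sizeof(control.buf);

		if (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1) {
			if (errno == EAGAIN || errno == EWOULDBLOCK)
				return count;	/* queue is empty */
			return -1;
		}
		count++;	/* a real program would parse msg here */
	}
}
```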

Notifications can be processed out of order with other operations on
the socket. A socket that has an error queued would normally block
other operations until the error is read. Zerocopy notifications have
a zero error code, however, so that they do not block send and recv
calls.


Notification Batching
~~~~~~~~~~~~~~~~~~~~~

Multiple outstanding notification packets can be read at once using
the recvmmsg call. This is often not needed. In each message the kernel
returns not a single value, but a range. It coalesces consecutive
notifications while one is outstanding for reception on the error queue.

When a new notification is about to be queued, the kernel checks
whether the new value extends the range of the notification at the tail
of the queue. If so, it drops the new notification packet and instead
increases the range upper value of the outstanding notification.
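
The coalescing rule can be modelled in a few lines. The code below is
an illustrative user-space model, not the kernel's implementation:
struct zc_range and zc_range_try_extend() are hypothetical names, and
any additional limits the kernel may apply are ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/* Inclusive range of completed send-call IDs, as in one notification. */
struct zc_range {
	uint32_t lo;
	uint32_t hi;
};

/*
 * Model only: try to merge a new range into the one at the tail of
 * the queue. Returns true if merged, false if a separate notification
 * would have to be queued.
 */
static bool zc_range_try_extend(struct zc_range *tail, struct zc_range next)
{
	/* mergeable only if next starts right after tail ends (mod 2^32) */
	if (next.lo != (uint32_t)(tail->hi + 1))
		return false;
	tail->hi = next.hi;
	return true;
}
```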

For protocols that acknowledge data in-order, like TCP, each
notification can be squashed into the previous one, so that no more
than one notification is outstanding at any one point.

Ordered delivery is the common case, but not guaranteed. Notifications
may arrive out of order on retransmission and socket teardown.


Notification Parsing
~~~~~~~~~~~~~~~~~~~~

The below snippet demonstrates how to parse the control message: the
read_notification() call in the previous snippet. A notification
is encoded in the standard error format, sock_extended_err.

The level and type fields in the control data are protocol family
specific: SOL_IP with IP_RECVERR for IPv4, or SOL_IPV6 with
IPV6_RECVERR for IPv6. The snippet checks the IPv4 pair only.

Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
as explained before, to avoid blocking read and write system calls on
the socket.

The 32-bit notification range is encoded as [ee_info, ee_data]. This
range is inclusive. Other fields in the struct must be treated as
undefined, except for ee_code, as discussed below.

::

	struct sock_extended_err *serr;
	struct cmsghdr *cm;

	/* msg is a pointer to the struct msghdr filled in by recvmsg() */
	cm = CMSG_FIRSTHDR(msg);
	if (!cm)
		error(1, 0, "cmsg: no control data");
	/* reject the message if either the level or the type mismatches */
	if (cm->cmsg_level != SOL_IP ||
	    cm->cmsg_type != IP_RECVERR)
		error(1, 0, "cmsg");

	serr = (void *) CMSG_DATA(cm);
	if (serr->ee_errno != 0 ||
	    serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
		error(1, 0, "serr");

	printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);


Deferred copies
~~~~~~~~~~~~~~~

Passing flag MSG_ZEROCOPY is a hint to the kernel to apply copy
avoidance, and a contract that the kernel will queue a completion
notification. It is not a guarantee that the copy is elided.

Copy avoidance is not always feasible. Devices that do not support
scatter-gather I/O cannot send packets made up of kernel generated
protocol headers plus zerocopy user data. A packet may need to be
converted to a private copy of data deep in the stack, say to compute
a checksum.

In all these cases, the kernel returns a completion notification when
it releases its hold on the shared pages. That notification may arrive
before the (copied) data is fully transmitted. A zerocopy completion
notification is therefore not a transmit completion notification.

Deferred copies can be more expensive than a copy immediately in the
system call, if the data is no longer warm in the cache. The process
also incurs notification processing cost for no benefit. For this
reason, the kernel signals if data was completed with a copy, by
setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return.
A process may use this signal to stop passing flag MSG_ZEROCOPY on
subsequent requests on the same socket.


Implementation
==============

Loopback
--------

Data sent to local sockets can be queued indefinitely if the receive
process does not read its socket. Unbounded notification latency is not
acceptable. For this reason all packets generated with MSG_ZEROCOPY
that are looped to a local socket will incur a deferred copy. This
includes looping onto packet sockets (e.g., tcpdump) and tun devices.


Testing
=======

More realistic example code can be found in the kernel source under
tools/testing/selftests/net/msg_zerocopy.c.

Be cognizant of the loopback constraint. The test can be run between
a pair of hosts. But if run between a local pair of processes, for
instance when run with msg_zerocopy.sh between a veth pair across
namespaces, the test will not show any improvement. For testing, the
loopback restriction can be temporarily relaxed by making
skb_orphan_frags_rx identical to skb_orphan_frags.