Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

===============
BPF ring buffer
===============

This document describes BPF ring buffer design, API, and implementation details.

.. contents::
    :local:
    :depth: 2

Motivation
----------

There are two distinct motivations for this work, neither of which is satisfied
by the existing perf buffer, and which prompted the creation of a new ring
buffer implementation:

- more efficient memory utilization by sharing a ring buffer across CPUs;
- preserving the ordering of events that happen sequentially in time, even
  across multiple CPUs (e.g., fork/exec/exit events for a task).

These two problems are independent, but the perf buffer satisfies neither. Both
are a result of the choice to have a per-CPU perf ring buffer, and both can be
solved by a multi-producer, single-consumer (MPSC) ring buffer implementation.
The ordering problem could technically be solved for the perf buffer with some
in-kernel counting, but given that the first problem requires an MPSC buffer
anyway, the same solution covers the second problem automatically.

Semantics and APIs
------------------

A single ring buffer is presented to BPF programs as an instance of a BPF map
of type ``BPF_MAP_TYPE_RINGBUF``. Two other alternatives were considered, but
ultimately rejected.

One way would have been to make ``BPF_MAP_TYPE_RINGBUF``, similarly to
``BPF_MAP_TYPE_PERF_EVENT_ARRAY``, represent an array of ring buffers, but
without enforcing the "same CPU only" rule. This would be a more familiar
interface, compatible with existing perf buffer use in BPF, but it would fail
if an application needed more advanced logic to look up a ring buffer by an
arbitrary key. ``BPF_MAP_TYPE_HASH_OF_MAPS`` addresses this need with the
current approach. Additionally, given the performance of BPF ringbuf, many use
cases would just opt into a simple single ring buffer shared among all CPUs,
for which the current approach would be overkill.

Another approach could have introduced a new concept, alongside the BPF map, to
represent a generic "container" object, which doesn't necessarily have a
key/value interface with lookup/update/delete operations. This approach would
add a lot of extra infrastructure that would have to be built for observability
and verifier support. It would also add another concept that BPF developers
would have to familiarize themselves with, new syntax in libbpf, etc., yet
would provide no real additional benefits over the approach of using a map.
``BPF_MAP_TYPE_RINGBUF`` doesn't support lookup/update/delete operations, but
neither do a few other map types (e.g., queue and stack; array doesn't support
delete, etc.).

The chosen approach has the advantages of re-using existing BPF map
infrastructure (introspection APIs in the kernel, libbpf support, etc.), being
a familiar concept (no need to teach users a new type of object in a BPF
program), and utilizing existing tooling (bpftool). For the common scenario of
using a single ring buffer for all CPUs, it's as simple and straightforward as
it would be with a dedicated "container" object. On the other hand, by being
a map, it can be combined with ``ARRAY_OF_MAPS`` and ``HASH_OF_MAPS``
map-in-maps to implement a wide variety of topologies, from one ring buffer for
each CPU (e.g., as a replacement for perf buffer use cases), to complicated
application-level hashing/sharding of ring buffers (e.g., a small pool of ring
buffers with a hashed task's tgid as the lookup key, to preserve ordering while
reducing contention; see the sketch below).

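A sketch of such a sharded topology, modeled on the kernel's ringbuf selftests,
might look as follows. The map names, the four-way split, the sizes, and the
tracepoint are illustrative assumptions, not part of the API:

.. code-block:: c

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* Inner ring buffer maps; sizes are arbitrary powers of 2. */
    struct ringbuf_map {
            __uint(type, BPF_MAP_TYPE_RINGBUF);
            __uint(max_entries, 256 * 1024);
    } rb0 SEC(".maps"), rb1 SEC(".maps"), rb2 SEC(".maps"), rb3 SEC(".maps");

    /* Pool of ring buffers, indexed by a hash of the task's tgid. */
    struct {
            __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
            __uint(max_entries, 4);
            __type(key, int);
            __array(values, struct ringbuf_map);
    } ringbufs SEC(".maps") = {
            .values = { [0] = &rb0, [1] = &rb1, [2] = &rb2, [3] = &rb3 },
    };

    SEC("tp/sched/sched_process_exec")
    int shard_example(void *ctx)
    {
            int key = (bpf_get_current_pid_tgid() >> 32) & 3;
            void *rb = bpf_map_lookup_elem(&ringbufs, &key);

            if (!rb)
                    return 0;
            /* rb can now be passed to the ring buffer helpers described
             * below, e.g. bpf_ringbuf_output() or bpf_ringbuf_reserve().
             */
            return 0;
    }

    char LICENSE[] SEC("license") = "GPL";
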
Key and value sizes are enforced to be zero. ``max_entries`` is used to specify
the size of the ring buffer and has to be a power of 2.

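For example, a single ring buffer shared by all CPUs can be declared as follows
(a hypothetical definition using libbpf conventions and the includes from the
previous example; the name ``rb`` and the 256 KiB size are arbitrary):

.. code-block:: c

    /* No key/value types are declared (their sizes are enforced to be zero);
     * max_entries is the data area size in bytes and must be a power of 2.
     */
    struct {
            __uint(type, BPF_MAP_TYPE_RINGBUF);
            __uint(max_entries, 256 * 1024);
    } rb SEC(".maps");
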
There are a number of similarities between the perf buffer
(``BPF_MAP_TYPE_PERF_EVENT_ARRAY``) and the new BPF ring buffer semantics:

- variable-length records;
- if there is no more space left in the ring buffer, reservation fails; there
  is no blocking;
- memory-mappable data area for user-space applications, for ease of
  consumption and high performance;
- epoll notifications for new incoming data;
- but still the ability to busy-poll for new data to achieve the lowest
  latency, if necessary.

BPF ringbuf provides two sets of APIs to BPF programs:

- ``bpf_ringbuf_output()`` allows *copying* data from one place into the ring
  buffer, similarly to ``bpf_perf_event_output()``;
- ``bpf_ringbuf_reserve()``/``bpf_ringbuf_commit()``/``bpf_ringbuf_discard()``
  split the whole process into two steps. First, a fixed amount of space is
  reserved. If successful, a pointer to data inside the ring buffer data area
  is returned, which BPF programs can use similarly to data inside array/hash
  maps. Once ready, this piece of memory is either committed or discarded.
  Discard is similar to commit, but makes the consumer ignore the record (see
  the sketch after this list).

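A minimal sketch of the reserve/commit flow, reusing the ``rb`` map and the
includes from the earlier examples. The tracepoint and the ``struct event``
layout are arbitrary illustrations; ``bpf_ringbuf_submit()`` is the helper that
performs the commit step described above:

.. code-block:: c

    struct event {
            int pid;
            char comm[16];
    };

    SEC("tp/sched/sched_process_exec")
    int handle_exec(void *ctx)
    {
            struct event *e;

            /* Reserve a fixed-size record; never blocks, may fail if full. */
            e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
            if (!e)
                    return 0;

            e->pid = bpf_get_current_pid_tgid() >> 32;
            bpf_get_current_comm(e->comm, sizeof(e->comm));

            /* Make the record visible to the consumer ... */
            bpf_ringbuf_submit(e, 0);
            /* ... or drop it instead with bpf_ringbuf_discard(e, 0). */
            return 0;
    }
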
``bpf_ringbuf_output()`` has the disadvantage of incurring an extra memory
copy, because the record has to be prepared in some other place first. But it
allows submitting records of a length that isn't known to the verifier
beforehand. It also closely matches ``bpf_perf_event_output()``, which
simplifies migration significantly.

``bpf_ringbuf_reserve()`` avoids the extra copy by providing a pointer directly
into the ring buffer memory. In a lot of cases records are larger than the BPF
stack space allows, so many programs have had to use an extra per-CPU array map
as a temporary heap for preparing a sample. ``bpf_ringbuf_reserve()`` avoids
this need completely. But in exchange, it only allows reserving a memory region
of a known, constant size, so that the verifier can verify that the BPF program
can't access memory outside its reserved record space.
``bpf_ringbuf_output()``, while slightly slower due to the extra memory copy,
covers some use cases that are not suitable for ``bpf_ringbuf_reserve()``.

The difference between commit and discard is very small. Discard just marks
a record as discarded, and such records are supposed to be ignored by the
consumer code. Discard is useful for some advanced use cases, such as ensuring
all-or-nothing multi-record submission, or emulating a temporary
``malloc()``/``free()`` within a single BPF program invocation.

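For example, an all-or-nothing submission of two related records could be
sketched as follows (same assumptions as the previous examples; the tracepoint
is arbitrary):

.. code-block:: c

    SEC("tp/sched/sched_process_fork")
    int submit_pair(void *ctx)
    {
            struct event *first, *second;

            first = bpf_ringbuf_reserve(&rb, sizeof(*first), 0);
            if (!first)
                    return 0;

            second = bpf_ringbuf_reserve(&rb, sizeof(*second), 0);
            if (!second) {
                    /* Can't get both records: mark the first one as
                     * discarded so the consumer skips it, and give up.
                     */
                    bpf_ringbuf_discard(first, 0);
                    return 0;
            }

            /* ... fill in both records ... */

            bpf_ringbuf_submit(first, 0);
            bpf_ringbuf_submit(second, 0);
            return 0;
    }
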
Each reserved record is tracked by the verifier through its existing
reference-tracking logic, similar to socket ref-tracking. It is thus impossible
to reserve a record but forget to submit (or discard) it.

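For illustration, a program like the following (intentionally buggy) sketch
compiles, but the verifier rejects it at load time because the reserved record
is neither submitted nor discarded:

.. code-block:: c

    SEC("tp/sched/sched_process_exit")
    int leaky(void *ctx)
    {
            void *rec = bpf_ringbuf_reserve(&rb, 64, 0);

            if (!rec)
                    return 0;
            /* BUG (intentional): neither bpf_ringbuf_submit() nor
             * bpf_ringbuf_discard() is called on rec, so the verifier
             * reports an unreleased reference and refuses the program.
             */
            return 0;
    }
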
The ``bpf_ringbuf_query()`` helper allows querying various properties of the
ring buffer. Currently, four are supported (one possible use is sketched after
the list):

- ``BPF_RB_AVAIL_DATA`` returns the amount of unconsumed data in the ring
  buffer;
- ``BPF_RB_RING_SIZE`` returns the size of the ring buffer;
- ``BPF_RB_CONS_POS``/``BPF_RB_PROD_POS`` return the current logical position
  of the consumer/producer, respectively.

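For example, one purely illustrative heuristic is to shed low-priority samples
once the buffer is mostly full, reusing the ``rb`` map defined earlier (the 75%
threshold is an arbitrary choice):

.. code-block:: c

    static int low_priority_event_allowed(void)
    {
            __u64 avail = bpf_ringbuf_query(&rb, BPF_RB_AVAIL_DATA);
            __u64 size = bpf_ringbuf_query(&rb, BPF_RB_RING_SIZE);

            /* Skip low-priority samples once the buffer is ~75% full. */
            return avail < size - size / 4;
    }
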
The returned values are momentary snapshots of the ring buffer state and could
be off by the time the helper returns, so this should be used only for
debugging/reporting purposes or for implementing various heuristics that take
into account the highly changeable nature of some of those characteristics.

One such heuristic might involve more fine-grained control over poll/epoll
notifications about new data availability in the ring buffer. Together with the
``BPF_RB_NO_WAKEUP``/``BPF_RB_FORCE_WAKEUP`` flags for the output/commit/discard
helpers, it gives a BPF program a high degree of control and enables, e.g.,
more efficient batched notifications. The default self-balancing strategy,
though, should be adequate for most applications and will already work reliably
and efficiently.

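A sketch of such a batched-notification scheme, reusing the ``rb`` map and
``struct event`` from the earlier examples (the tracepoint and the 32 KiB
threshold are arbitrary assumptions):

.. code-block:: c

    SEC("tp/sched/sched_switch")
    int emit_sample(void *ctx)
    {
            struct event *e;
            __u64 flags;

            e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
            if (!e)
                    return 0;
            /* ... fill in the sample ... */

            /* Suppress wakeups until enough unconsumed data piles up,
             * then force one to flush the batch to the consumer.
             */
            if (bpf_ringbuf_query(&rb, BPF_RB_AVAIL_DATA) > 32 * 1024)
                    flags = BPF_RB_FORCE_WAKEUP;
            else
                    flags = BPF_RB_NO_WAKEUP;

            bpf_ringbuf_submit(e, flags);
            return 0;
    }
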
Design and Implementation
-------------------------

The reserve/commit schema allows a natural way for multiple producers, either
on different CPUs or even on the same CPU/in the same BPF program, to reserve
independent records and work with them without blocking other producers. This
means that if a BPF program was interrupted by another BPF program sharing the
same ring buffer, both will get a record reserved (provided there is enough
space left) and can work with it and submit it independently. This applies to
NMI context as well, except that, due to a spinlock being used during
reservation, ``bpf_ringbuf_reserve()`` in NMI context might fail to get the
lock, in which case the reservation fails even if the ring buffer is not full.

The ring buffer itself is internally implemented as a power-of-2 sized circular
buffer, with two logical and ever-increasing counters (which might wrap around
on 32-bit architectures; that's not a problem):

- the consumer counter shows up to which logical position the consumer has
  consumed the data;
- the producer counter denotes the amount of data reserved by all producers.

Each time a record is reserved, the producer that "owns" the record will
successfully advance the producer counter. At that point, the data is still not
yet ready to be consumed, though. Each record has an 8-byte header, which
contains the length of the reserved record, as well as two extra bits: a busy
bit to denote that the record is still being worked on, and a discard bit,
which might be set at commit time if the record is discarded. In the latter
case, the consumer is supposed to skip the record and move on to the next one.
The record header also encodes the record's relative offset from the beginning
of the ring buffer data area (in pages). This allows
``bpf_ringbuf_commit()``/``bpf_ringbuf_discard()`` to accept only the pointer
to the record itself, without also requiring a pointer to the ring buffer: the
ring buffer memory location is restored from the record metadata header. This
significantly simplifies the verifier, as well as improving API usability.

Producer counter increments are serialized under a spinlock, so there is
a strict ordering between reservations. Commits, on the other hand, are
completely lockless and independent. All records become available to the
consumer in the order of reservations, but only after all previous records have
already been committed. It is thus possible for slow producers to temporarily
hold off submitted records that were reserved later.

One interesting implementation detail, which significantly simplifies (and thus
also speeds up) both producers and consumers, is that the data area is mapped
twice back-to-back in virtual memory. This makes it possible to avoid any
special handling for samples that wrap around at the end of the circular buffer
data area, because the page after the last data page is the first data page
again, and so the sample still appears completely contiguous in virtual memory.
See the comment and a simple ASCII diagram showing this visually in
``bpf_ringbuf_area_alloc()``.

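To make the consumer side concrete, here is a simplified sketch of the
user-space processing loop, modeled on what libbpf does internally. Real
applications should normally use libbpf's ``ring_buffer__poll()`` instead;
``handle_sample()`` and the counter/data pointers are assumptions of this
sketch. Note how the double mapping lets it use a simple masked offset with no
wrap-around handling:

.. code-block:: c

    /* For BPF_RINGBUF_BUSY_BIT, BPF_RINGBUF_DISCARD_BIT, BPF_RINGBUF_HDR_SZ */
    #include <linux/bpf.h>

    /* User-provided per-record callback (an assumption of this sketch). */
    static void handle_sample(const void *data, __u32 len);

    /* cons_pos/prod_pos point into the mmap'ed consumer/producer counter
     * pages, data points at the start of the (double-mapped) data area,
     * and mask == ring size - 1.
     */
    static void consume(unsigned long *cons_pos, const unsigned long *prod_pos,
                        char *data, unsigned long mask)
    {
            unsigned long cons = *cons_pos;

            while (cons < __atomic_load_n(prod_pos, __ATOMIC_ACQUIRE)) {
                    __u32 *hdr = (__u32 *)(data + (cons & mask));
                    __u32 len = __atomic_load_n(hdr, __ATOMIC_ACQUIRE);
                    __u32 sample_len = len & ~(BPF_RINGBUF_BUSY_BIT |
                                               BPF_RINGBUF_DISCARD_BIT);

                    if (len & BPF_RINGBUF_BUSY_BIT)
                            break;  /* not committed yet; retry later */

                    if (!(len & BPF_RINGBUF_DISCARD_BIT))
                            handle_sample((char *)hdr + BPF_RINGBUF_HDR_SZ,
                                          sample_len);

                    /* Header plus data, rounded up to 8-byte alignment. */
                    cons += (BPF_RINGBUF_HDR_SZ + sample_len + 7) / 8 * 8;
                    __atomic_store_n(cons_pos, cons, __ATOMIC_RELEASE);
            }
    }
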
Another feature that distinguishes BPF ringbuf from the perf ring buffer is
self-pacing notification of new data availability. The
``bpf_ringbuf_commit()`` implementation sends a notification of a new record
being available after commit only if the consumer has already caught up to the
record being committed. If not, the consumer still has to catch up and will
thus see the new data anyway, without needing an extra poll notification.
Benchmarks (see tools/testing/selftests/bpf/benchs/bench_ringbufs.c) show that
this allows achieving very high throughput without having to resort to tricks
like "notify only every Nth sample", which are necessary with the perf buffer.
For extreme cases, when a BPF program wants more manual control of
notifications, the commit/discard/output helpers accept the
``BPF_RB_NO_WAKEUP`` and ``BPF_RB_FORCE_WAKEUP`` flags, which give full control
over notifications of data availability, but require extra caution and
diligence when using this API.
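
On the user-space side, the typical way to consume these notifications is
libbpf's ring buffer API, roughly as sketched below (error handling trimmed;
``map_fd`` is assumed to be the file descriptor of the ring buffer map):

.. code-block:: c

    #include <bpf/libbpf.h>

    /* Invoked by libbpf for every submitted (non-discarded) record. */
    static int handle_event(void *ctx, void *data, size_t size)
    {
            /* ... process one record ... */
            return 0;
    }

    int consume_forever(int map_fd)
    {
            struct ring_buffer *rb;

            rb = ring_buffer__new(map_fd, handle_event, NULL, NULL);
            if (!rb)
                    return -1;

            /* Blocks in epoll until the kernel signals new data, then
             * drains everything that is available.
             */
            while (ring_buffer__poll(rb, -1 /* timeout, ms */) >= 0)
                    ;

            ring_buffer__free(rb);
            return 0;
    }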