.. _list_rcu_doc:

Using RCU to Protect Read-Mostly Linked Lists
=============================================

One of the best applications of RCU is to protect read-mostly linked lists
(``struct list_head`` in list.h). One big advantage of this approach
is that all of the required memory barriers are included for you in
the list macros. This document describes several applications of RCU,
with the best fits first.


Example 1: Read-mostly list: Deferred Destruction
-------------------------------------------------

A widely used use case for RCU lists in the kernel is lockless iteration over
all processes in the system. ``task_struct::tasks`` represents the list node that
links all the processes. The list can be traversed in parallel to any list
additions or removals.

The traversal of the list is done using ``for_each_process()``, which is defined
by the following two macros::

        #define next_task(p) \
                list_entry_rcu((p)->tasks.next, struct task_struct, tasks)

        #define for_each_process(p) \
                for (p = &init_task ; (p = next_task(p)) != &init_task ; )

The code traversing the list of all processes typically looks like::

        rcu_read_lock();
        for_each_process(p) {
                /* Do something with p */
        }
        rcu_read_unlock();

The simplified code for removing a process from a task list is::

        void release_task(struct task_struct *p)
        {
                write_lock(&tasklist_lock);
                list_del_rcu(&p->tasks);
                write_unlock(&tasklist_lock);
                call_rcu(&p->rcu, delayed_put_task_struct);
        }

When a process exits, ``release_task()`` calls ``list_del_rcu(&p->tasks)`` under
``tasklist_lock`` writer lock protection, to remove the task from the list of
all tasks. The ``tasklist_lock`` prevents concurrent list additions/removals
from corrupting the list. Readers using ``for_each_process()`` are not protected
by the ``tasklist_lock``. To prevent readers from noticing changes in the list
pointers, the ``task_struct`` object is freed only after one or more grace
periods elapse (with the help of call_rcu()). This deferring of destruction
ensures that any readers traversing the list will see valid ``p->tasks.next``
pointers and deletion/freeing can happen in parallel with traversal of the list.
This pattern is also called an **existence lock**, since RCU pins the object in
memory until all existing readers finish.
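
The RCU callback itself is simple. A minimal sketch of
``delayed_put_task_struct()`` might look as follows; the in-kernel version also
performs additional bookkeeping before dropping the final reference::

        void delayed_put_task_struct(struct rcu_head *rhp)
        {
                struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);

                /* Runs only after a grace period, once all pre-existing
                 * readers have finished with this task_struct. */
                put_task_struct(tsk);
        }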


Example 2: Read-Side Action Taken Outside of Lock: No In-Place Updates
----------------------------------------------------------------------

The best applications are cases where, if reader-writer locking were
used, the read-side lock would be dropped before taking any action
based on the results of the search. The most celebrated example is
the routing table. Because the routing table is tracking the state of
equipment outside of the computer, it will at times contain stale data.
Therefore, once the route has been computed, there is no need to hold
the routing table static during transmission of the packet. After all,
you can hold the routing table static all you want, but that won't keep
the external Internet from changing, and it is the state of the external
Internet that really matters. In addition, routing entries are typically
added or deleted, rather than being modified in place.

A straightforward example of this use of RCU may be found in the
system-call auditing support. For example, a reader-writer locked
implementation of ``audit_filter_task()`` might be as follows::

        static enum audit_state audit_filter_task(struct task_struct *tsk)
        {
                struct audit_entry *e;
                enum audit_state state;

                read_lock(&auditsc_lock);
                /* Note: audit_filter_mutex held by caller. */
                list_for_each_entry(e, &audit_tsklist, list) {
                        if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
                                read_unlock(&auditsc_lock);
                                return state;
                        }
                }
                read_unlock(&auditsc_lock);
                return AUDIT_BUILD_CONTEXT;
        }

Here the list is searched under the lock, but the lock is dropped before
the corresponding value is returned. By the time that this value is acted
on, the list may well have been modified. This makes sense, since if
you are turning auditing off, it is OK to audit a few extra system calls.

This means that RCU can be easily applied to the read side, as follows::

        static enum audit_state audit_filter_task(struct task_struct *tsk)
        {
                struct audit_entry *e;
                enum audit_state state;

                rcu_read_lock();
                /* Note: audit_filter_mutex held by caller. */
                list_for_each_entry_rcu(e, &audit_tsklist, list) {
                        if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
                                rcu_read_unlock();
                                return state;
                        }
                }
                rcu_read_unlock();
                return AUDIT_BUILD_CONTEXT;
        }

The ``read_lock()`` and ``read_unlock()`` calls have become rcu_read_lock()
and rcu_read_unlock(), respectively, and the list_for_each_entry() has
become list_for_each_entry_rcu(). The **_rcu()** list-traversal primitives
insert the read-side memory barriers that are required on DEC Alpha CPUs.
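
These examples reference several fields of ``struct audit_entry``. For
orientation, they assume a layout along the following lines; this is only a
sketch, and the in-kernel structure carries additional state::

        struct audit_entry {
                struct list_head        list;   /* Links the entry into audit_tsklist. */
                struct rcu_head         rcu;    /* Passed to call_rcu() for deferred freeing. */
                struct audit_rule       rule;
        };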

The changes to the update side are also straightforward. A reader-writer lock
might be used as follows for deletion and insertion::

        static inline int audit_del_rule(struct audit_rule *rule,
                                         struct list_head *list)
        {
                struct audit_entry *e;

                write_lock(&auditsc_lock);
                list_for_each_entry(e, list, list) {
                        if (!audit_compare_rule(rule, &e->rule)) {
                                list_del(&e->list);
                                write_unlock(&auditsc_lock);
                                return 0;
                        }
                }
                write_unlock(&auditsc_lock);
                return -EFAULT; /* No matching rule */
        }

        static inline int audit_add_rule(struct audit_entry *entry,
                                         struct list_head *list)
        {
                write_lock(&auditsc_lock);
                if (entry->rule.flags & AUDIT_PREPEND) {
                        entry->rule.flags &= ~AUDIT_PREPEND;
                        list_add(&entry->list, list);
                } else {
                        list_add_tail(&entry->list, list);
                }
                write_unlock(&auditsc_lock);
                return 0;
        }

Following are the RCU equivalents for these two functions::

        static inline int audit_del_rule(struct audit_rule *rule,
                                         struct list_head *list)
        {
                struct audit_entry *e;

                /* No need to use the _rcu iterator here, since this is the only
                 * deletion routine. */
                list_for_each_entry(e, list, list) {
                        if (!audit_compare_rule(rule, &e->rule)) {
                                list_del_rcu(&e->list);
                                call_rcu(&e->rcu, audit_free_rule);
                                return 0;
                        }
                }
                return -EFAULT; /* No matching rule */
        }

        static inline int audit_add_rule(struct audit_entry *entry,
                                         struct list_head *list)
        {
                if (entry->rule.flags & AUDIT_PREPEND) {
                        entry->rule.flags &= ~AUDIT_PREPEND;
                        list_add_rcu(&entry->list, list);
                } else {
                        list_add_tail_rcu(&entry->list, list);
                }
                return 0;
        }

Normally, the ``write_lock()`` and ``write_unlock()`` would be replaced by a
spin_lock() and a spin_unlock(). But in this case, all callers hold
``audit_filter_mutex``, so no additional locking is required. The
``auditsc_lock`` can therefore be eliminated, since use of RCU eliminates the
need for writers to exclude readers.

The list_del(), list_add(), and list_add_tail() primitives have been
replaced by list_del_rcu(), list_add_rcu(), and list_add_tail_rcu().
The **_rcu()** list-manipulation primitives add memory barriers that are needed on
weakly ordered CPUs (most of them!). The list_del_rcu() primitive omits the
pointer poisoning debug-assist code that would otherwise cause concurrent
readers to fail spectacularly.
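
For completeness, the ``audit_free_rule()`` callback passed to call_rcu() above
might look roughly as follows. This is a minimal sketch; the in-kernel version
also releases memory owned by the rule itself::

        static void audit_free_rule(struct rcu_head *head)
        {
                struct audit_entry *e = container_of(head, struct audit_entry, rcu);

                /* Invoked only after a grace period, so no reader still holds e. */
                kfree(e);
        }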

So, when readers can tolerate stale data and entries are only ever added or
deleted, never modified in place, it is very easy to use RCU!


Example 3: Handling In-Place Updates
------------------------------------

The system-call auditing code does not update auditing rules in place. However,
if it did, the reader-writer-locked code to do so might look as follows
(assuming that only ``field_count`` is updated; otherwise, the added fields
would need to be filled in)::

        static inline int audit_upd_rule(struct audit_rule *rule,
                                         struct list_head *list,
                                         __u32 newaction,
                                         __u32 newfield_count)
        {
                struct audit_entry *e;

                write_lock(&auditsc_lock);
                /* Note: audit_filter_mutex held by caller. */
                list_for_each_entry(e, list, list) {
                        if (!audit_compare_rule(rule, &e->rule)) {
                                e->rule.action = newaction;
                                e->rule.field_count = newfield_count;
                                write_unlock(&auditsc_lock);
                                return 0;
                        }
                }
                write_unlock(&auditsc_lock);
                return -EFAULT; /* No matching rule */
        }

The RCU version creates a copy, updates the copy, then replaces the old
entry with the newly updated entry. This sequence of actions, allowing
concurrent reads while making a copy to perform an update, is what gives
RCU (*read-copy update*) its name. The RCU code is as follows::

        static inline int audit_upd_rule(struct audit_rule *rule,
                                         struct list_head *list,
                                         __u32 newaction,
                                         __u32 newfield_count)
        {
                struct audit_entry *e;
                struct audit_entry *ne;

                list_for_each_entry(e, list, list) {
                        if (!audit_compare_rule(rule, &e->rule)) {
                                ne = kmalloc(sizeof(*ne), GFP_ATOMIC);
                                if (ne == NULL)
                                        return -ENOMEM;
                                audit_copy_rule(&ne->rule, &e->rule);
                                ne->rule.action = newaction;
                                ne->rule.field_count = newfield_count;
                                list_replace_rcu(&e->list, &ne->list);
                                call_rcu(&e->rcu, audit_free_rule);
                                return 0;
                        }
                }
                return -EFAULT; /* No matching rule */
        }

Again, this assumes that the caller holds ``audit_filter_mutex``. Normally, the
writer lock would become a spinlock in this sort of code.

Another use of this pattern can be found in the Open vSwitch driver's *connection
tracking table* code in ``ct_limit_set()``. The table holds connection tracking
entries and has a limit on the maximum number of entries. There is one such
table per zone and hence one *limit* per zone. The zones are mapped to their
limits through a hashtable using an RCU-managed hlist for the hash chains. When
a new limit is set, a new limit object is allocated and ``ct_limit_set()`` is
called to replace the old limit object with the new one using list_replace_rcu().
The old limit object is then freed after a grace period using kfree_rcu().
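
Reduced to a sketch with illustrative names (and using the hlist variants of
the list primitives, since the zone-to-limit mapping lives in a hashtable),
that pattern looks like this; it is not the actual Open vSwitch code::

        struct zone_limit {
                struct hlist_node hlist_node;
                struct rcu_head rcu;
                u16 zone;
                u32 limit;
        };

        /* Caller holds the update-side lock that serializes limit changes. */
        static void zone_limit_update(struct zone_limit *old, u32 new_limit)
        {
                struct zone_limit *new = kmalloc(sizeof(*new), GFP_KERNEL);

                if (!new)
                        return;
                new->zone = old->zone;
                new->limit = new_limit;
                hlist_replace_rcu(&old->hlist_node, &new->hlist_node);
                kfree_rcu(old, rcu);    /* Freed only after a grace period. */
        }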


Example 4: Eliminating Stale Data
---------------------------------

The auditing example above tolerates stale data, as do most algorithms
that are tracking external state. Because there is a delay between the
time the external state changes and the time Linux becomes aware of the
change, additional RCU-induced staleness is generally not a problem.

However, there are many examples where stale data cannot be tolerated.
One example in the Linux kernel is System V IPC (see the shm_lock()
function in ipc/shm.c). This code checks a *deleted* flag under a
per-entry spinlock, and, if the *deleted* flag is set, pretends that the
entry does not exist. For this to be helpful, the search function must
return holding the per-entry spinlock, as shm_lock() does in fact do.

.. _quick_quiz:

Quick Quiz:
        For the deleted-flag technique to be helpful, why is it necessary
        to hold the per-entry lock while returning from the search function?

:ref:`Answer to Quick Quiz <quick_quiz_answer>`

If the system-call audit module were to ever need to reject stale data, one way
to accomplish this would be to add a ``deleted`` flag and a ``lock`` spinlock to the
``audit_entry`` structure, and modify ``audit_filter_task()`` as follows::

        static enum audit_state audit_filter_task(struct task_struct *tsk)
        {
                struct audit_entry *e;
                enum audit_state state;

                rcu_read_lock();
                list_for_each_entry_rcu(e, &audit_tsklist, list) {
                        if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
                                spin_lock(&e->lock);
                                if (e->deleted) {
                                        spin_unlock(&e->lock);
                                        rcu_read_unlock();
                                        return AUDIT_BUILD_CONTEXT;
                                }
                                rcu_read_unlock();
                                /* Note: returns with e->lock held; see the Quick Quiz. */
                                return state;
                        }
                }
                rcu_read_unlock();
                return AUDIT_BUILD_CONTEXT;
        }

Note that this example assumes that entries are only added and deleted.
Additional mechanisms are required to deal correctly with the update-in-place
performed by ``audit_upd_rule()``. For one thing, ``audit_upd_rule()`` would
need to set the old entry's ``deleted`` flag under its per-entry lock when
replacing it, just as ``audit_del_rule()`` does below.

The ``audit_del_rule()`` function would need to set the ``deleted`` flag under the
spinlock as follows::

        static inline int audit_del_rule(struct audit_rule *rule,
                                         struct list_head *list)
        {
                struct audit_entry *e;

                /* No need to use the _rcu iterator here, since this
                 * is the only deletion routine. */
                list_for_each_entry(e, list, list) {
                        if (!audit_compare_rule(rule, &e->rule)) {
                                spin_lock(&e->lock);
                                list_del_rcu(&e->list);
                                e->deleted = 1;
                                spin_unlock(&e->lock);
                                call_rcu(&e->rcu, audit_free_rule);
                                return 0;
                        }
                }
                return -EFAULT; /* No matching rule */
        }

This too assumes that the caller holds ``audit_filter_mutex``.


Example 5: Skipping Stale Objects
---------------------------------

For some use cases, reader performance can be improved by skipping stale
objects during read-side list traversal, if the object of concern is pending
destruction after one or more grace periods. One such example can be found in
the timerfd subsystem. When a ``CLOCK_REALTIME`` clock is reprogrammed, for
example due to setting of the system time, all programmed timerfds that depend
on this clock get triggered and processes waiting on them to expire are woken
up in advance of their scheduled expiry. To facilitate this, all such timers
are added to an RCU-managed ``cancel_list`` when they are set up in
``timerfd_setup_cancel()``::

        static void timerfd_setup_cancel(struct timerfd_ctx *ctx, int flags)
        {
                spin_lock(&ctx->cancel_lock);
                if (ctx->clockid == CLOCK_REALTIME &&
                    (flags & TFD_TIMER_ABSTIME) && (flags & TFD_TIMER_CANCEL_ON_SET)) {
                        if (!ctx->might_cancel) {
                                ctx->might_cancel = true;
                                spin_lock(&cancel_lock);
                                list_add_rcu(&ctx->clist, &cancel_list);
                                spin_unlock(&cancel_lock);
                        }
                }
                spin_unlock(&ctx->cancel_lock);
        }

When a timerfd is freed (fd is closed), then the ``might_cancel`` flag of the
timerfd object is cleared, the object is removed from the ``cancel_list``, and
it is destroyed::

        int timerfd_release(struct inode *inode, struct file *file)
        {
                struct timerfd_ctx *ctx = file->private_data;

                spin_lock(&ctx->cancel_lock);
                if (ctx->might_cancel) {
                        ctx->might_cancel = false;
                        spin_lock(&cancel_lock);
                        list_del_rcu(&ctx->clist);
                        spin_unlock(&cancel_lock);
                }
                spin_unlock(&ctx->cancel_lock);

                hrtimer_cancel(&ctx->t.tmr);
                kfree_rcu(ctx, rcu);
                return 0;
        }

If the ``CLOCK_REALTIME`` clock is set, for example by a time server, the
hrtimer framework calls ``timerfd_clock_was_set()`` which walks the
``cancel_list`` and wakes up processes waiting on the timerfd. While iterating
the ``cancel_list``, the ``might_cancel`` flag is consulted to skip stale
objects::

        void timerfd_clock_was_set(void)
        {
                struct timerfd_ctx *ctx;
                unsigned long flags;

                rcu_read_lock();
                list_for_each_entry_rcu(ctx, &cancel_list, clist) {
                        if (!ctx->might_cancel)
                                continue;
                        spin_lock_irqsave(&ctx->wqh.lock, flags);
                        if (ctx->moffs != ktime_mono_to_real(0)) {
                                ctx->moffs = KTIME_MAX;
                                ctx->ticks++;
                                wake_up_locked_poll(&ctx->wqh, EPOLLIN);
                        }
                        spin_unlock_irqrestore(&ctx->wqh.lock, flags);
                }
                rcu_read_unlock();
        }

The key point here is that, because RCU traversal of the ``cancel_list`` happens
concurrently with additions to and removals from the list, the traversal can
sometimes step on an object that has already been removed from the list. In
this example, such stale objects are simply skipped by checking the
``might_cancel`` flag.


Summary
-------

Read-mostly list-based data structures that can tolerate stale data are
the most amenable to use of RCU. The simplest case is where entries are
either added or deleted from the data structure (or atomically modified
in place), but non-atomic in-place modifications can be handled by making
a copy, updating the copy, then replacing the original with the copy.
If stale data cannot be tolerated, then a *deleted* flag may be used
in conjunction with a per-entry spinlock in order to allow the search
function to reject newly deleted data.
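
Condensed into a generic sketch with illustrative names (not taken from any
particular subsystem), the overall add/delete pattern looks like this::

        struct item {
                struct list_head node;
                struct rcu_head rcu;
                int key;
        };

        static LIST_HEAD(items);
        static DEFINE_SPINLOCK(items_lock);     /* Serializes updaters only. */

        /* Reader: lockless traversal; the result is valid only while the
         * caller remains within the RCU read-side critical section. */
        static struct item *item_lookup(int key)
        {
                struct item *p;

                list_for_each_entry_rcu(p, &items, node)
                        if (p->key == key)
                                return p;
                return NULL;
        }

        /* Updater: removal followed by deferred reclamation. */
        static void item_del(struct item *p)
        {
                spin_lock(&items_lock);
                list_del_rcu(&p->node);
                spin_unlock(&items_lock);
                kfree_rcu(p, rcu);
        }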

.. _quick_quiz_answer:

Answer to Quick Quiz:
        For the deleted-flag technique to be helpful, why is it necessary
        to hold the per-entry lock while returning from the search function?

        If the search function drops the per-entry lock before returning,
        then the caller will be processing stale data in any case. If it
        is really OK to be processing stale data, then you don't need a
        *deleted* flag. If processing stale data really is a problem,
        then you need to hold the per-entry lock across all of the code
        that uses the value that was returned.

:ref:`Back to Quick Quiz <quick_quiz>`