.. _list_rcu_doc:

Using RCU to Protect Read-Mostly Linked Lists
=============================================

One of the best applications of RCU is to protect read-mostly linked lists
(``struct list_head`` in list.h).  One big advantage of this approach
is that all of the required memory barriers are included for you in
the list macros.  This document describes several applications of RCU,
with the best fits first.


Example 1: Read-mostly list: Deferred Destruction
-------------------------------------------------

A widely used usecase for RCU lists in the kernel is lockless iteration over
all processes in the system. ``task_struct::tasks`` represents the list node that
links all the processes. The list can be traversed in parallel to any list
additions or removals.

The traversal of the list is done using ``for_each_process()`` which is defined
by the 2 macros::

	#define next_task(p) \
		list_entry_rcu((p)->tasks.next, struct task_struct, tasks)

	#define for_each_process(p) \
		for (p = &init_task ; (p = next_task(p)) != &init_task ; )

The code traversing the list of all processes typically looks like::

	rcu_read_lock();
	for_each_process(p) {
		/* Do something with p */
	}
	rcu_read_unlock();

The simplified code for removing a process from a task list is::

	void release_task(struct task_struct *p)
	{
		write_lock(&tasklist_lock);
		list_del_rcu(&p->tasks);
		write_unlock(&tasklist_lock);
		call_rcu(&p->rcu, delayed_put_task_struct);
	}

When a process exits, ``release_task()`` calls ``list_del_rcu(&p->tasks)`` under
``tasklist_lock`` writer lock protection, to remove the task from the list of
all tasks. The ``tasklist_lock`` prevents concurrent list additions/removals
from corrupting the list. Readers using ``for_each_process()`` are not protected
with the ``tasklist_lock``. To prevent readers from noticing changes in the list
pointers, the ``task_struct`` object is freed only after one or more grace
periods elapse (with the help of call_rcu()). This deferring of destruction
ensures that any readers traversing the list will see valid ``p->tasks.next``
pointers and deletion/freeing can happen in parallel with traversal of the list.
This pattern is also called an **existence lock**, since RCU pins the object in
memory until all existing readers finish.
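
For reference, the RCU callback passed to call_rcu() above is what frees the
``task_struct`` once a grace period has elapsed. A simplified sketch (the real
``delayed_put_task_struct()`` in kernel/exit.c does additional bookkeeping)
might look like::

	/* Simplified sketch: runs only after all pre-existing RCU readers
	 * have finished their read-side critical sections. */
	static void delayed_put_task_struct(struct rcu_head *rhp)
	{
		struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);

		/* Drop the reference; the object goes away with the last one. */
		put_task_struct(tsk);
	}

The callback receives the ``rcu_head`` that was passed to call_rcu(), so
container_of() is used to recover the enclosing ``task_struct``.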


Example 2: Read-Side Action Taken Outside of Lock: No In-Place Updates
----------------------------------------------------------------------

The best applications are cases where, if reader-writer locking were
used, the read-side lock would be dropped before taking any action
based on the results of the search.  The most celebrated example is
the routing table.  Because the routing table is tracking the state of
equipment outside of the computer, it will at times contain stale data.
Therefore, once the route has been computed, there is no need to hold
the routing table static during transmission of the packet.  After all,
you can hold the routing table static all you want, but that won't keep
the external Internet from changing, and it is the state of the external
Internet that really matters.  In addition, routing entries are typically
added or deleted, rather than being modified in place.

A straightforward example of this use of RCU may be found in the
system-call auditing support.  For example, a reader-writer locked
implementation of ``audit_filter_task()`` might be as follows::

	static enum audit_state audit_filter_task(struct task_struct *tsk)
	{
		struct audit_entry *e;
		enum audit_state   state;

		read_lock(&auditsc_lock);
		/* Note: audit_filter_mutex held by caller. */
		list_for_each_entry(e, &audit_tsklist, list) {
			if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
				read_unlock(&auditsc_lock);
				return state;
			}
		}
		read_unlock(&auditsc_lock);
		return AUDIT_BUILD_CONTEXT;
	}

Here the list is searched under the lock, but the lock is dropped before
the corresponding value is returned.  By the time that this value is acted
on, the list may well have been modified.  This makes sense, since if
you are turning auditing off, it is OK to audit a few extra system calls.

This means that RCU can be easily applied to the read side, as follows::

	static enum audit_state audit_filter_task(struct task_struct *tsk)
	{
		struct audit_entry *e;
		enum audit_state   state;

		rcu_read_lock();
		/* Note: audit_filter_mutex held by caller. */
		list_for_each_entry_rcu(e, &audit_tsklist, list) {
			if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
				rcu_read_unlock();
				return state;
			}
		}
		rcu_read_unlock();
		return AUDIT_BUILD_CONTEXT;
	}

The ``read_lock()`` and ``read_unlock()`` calls have become rcu_read_lock()
and rcu_read_unlock(), respectively, and the list_for_each_entry() has
become list_for_each_entry_rcu().  The **_rcu()** list-traversal primitives
insert the read-side memory barriers that are required on DEC Alpha CPUs.
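
Conceptually, the ``_rcu()`` traversal variants differ from their plain
counterparts in that they fetch the ``next`` pointer via rcu_dereference(),
which provides the dependency ordering that DEC Alpha needs and documents the
lockless access. A rough sketch only, not the exact definitions from
include/linux/rculist.h::

	/* Rough sketch of the idea behind list_entry_rcu() and
	 * list_for_each_entry_rcu(); the in-kernel macros differ in detail. */
	#define list_entry_rcu_sketch(ptr, type, member) \
		container_of(rcu_dereference(ptr), type, member)

	#define list_for_each_entry_rcu_sketch(pos, head, member) \
		for (pos = list_entry_rcu_sketch((head)->next, typeof(*pos), member); \
		     &pos->member != (head); \
		     pos = list_entry_rcu_sketch(pos->member.next, typeof(*pos), member))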

The changes to the update side are also straightforward. A reader-writer lock
might be used as follows for deletion and insertion::

	static inline int audit_del_rule(struct audit_rule *rule,
					 struct list_head *list)
	{
		struct audit_entry *e;

		write_lock(&auditsc_lock);
		list_for_each_entry(e, list, list) {
			if (!audit_compare_rule(rule, &e->rule)) {
				list_del(&e->list);
				write_unlock(&auditsc_lock);
				return 0;
			}
		}
		write_unlock(&auditsc_lock);
		return -EFAULT;		/* No matching rule */
	}

	static inline int audit_add_rule(struct audit_entry *entry,
					 struct list_head *list)
	{
		write_lock(&auditsc_lock);
		if (entry->rule.flags & AUDIT_PREPEND) {
			entry->rule.flags &= ~AUDIT_PREPEND;
			list_add(&entry->list, list);
		} else {
			list_add_tail(&entry->list, list);
		}
		write_unlock(&auditsc_lock);
		return 0;
	}

Following are the RCU equivalents for these two functions::

	static inline int audit_del_rule(struct audit_rule *rule,
					 struct list_head *list)
	{
		struct audit_entry *e;

		/* No need to use the _rcu iterator here, since this is the only
		 * deletion routine. */
		list_for_each_entry(e, list, list) {
			if (!audit_compare_rule(rule, &e->rule)) {
				list_del_rcu(&e->list);
				call_rcu(&e->rcu, audit_free_rule);
				return 0;
			}
		}
		return -EFAULT;		/* No matching rule */
	}

	static inline int audit_add_rule(struct audit_entry *entry,
					 struct list_head *list)
	{
		if (entry->rule.flags & AUDIT_PREPEND) {
			entry->rule.flags &= ~AUDIT_PREPEND;
			list_add_rcu(&entry->list, list);
		} else {
			list_add_tail_rcu(&entry->list, list);
		}
		return 0;
	}

Normally, the ``write_lock()`` and ``write_unlock()`` would be replaced by a
spin_lock() and a spin_unlock(). But in this case, all callers hold
``audit_filter_mutex``, so no additional locking is required. The
``auditsc_lock`` can therefore be eliminated, since use of RCU eliminates the
need for writers to exclude readers.

The list_del(), list_add(), and list_add_tail() primitives have been
replaced by list_del_rcu(), list_add_rcu(), and list_add_tail_rcu().
The **_rcu()** list-manipulation primitives add memory barriers that are needed on
weakly ordered CPUs (most of them!).  The list_del_rcu() primitive omits the
pointer poisoning debug-assist code that would otherwise cause concurrent
readers to fail spectacularly.
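
The ``audit_free_rule()`` callback passed to call_rcu() above is not shown in
this document; a minimal sketch, assuming ``struct audit_entry`` embeds its
``struct rcu_head`` in a field named ``rcu``, would be::

	/* Minimal sketch: runs only after a grace period, when no reader
	 * can still hold a pointer to this entry obtained from the list. */
	static void audit_free_rule(struct rcu_head *head)
	{
		struct audit_entry *e = container_of(head, struct audit_entry, rcu);

		kfree(e);
	}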

So, when readers can tolerate stale data and when entries are either added or
deleted, without in-place modification, it is very easy to use RCU!


Example 3: Handling In-Place Updates
------------------------------------

The system-call auditing code does not update auditing rules in place.  However,
if it did, the reader-writer-locked code to do so might look as follows
(assuming only ``field_count`` is updated; otherwise, the added fields would
need to be filled in)::

	static inline int audit_upd_rule(struct audit_rule *rule,
					 struct list_head *list,
					 __u32 newaction,
					 __u32 newfield_count)
	{
		struct audit_entry *e;
		struct audit_entry *ne;

		write_lock(&auditsc_lock);
		/* Note: audit_filter_mutex held by caller. */
		list_for_each_entry(e, list, list) {
			if (!audit_compare_rule(rule, &e->rule)) {
				e->rule.action = newaction;
				e->rule.field_count = newfield_count;
				write_unlock(&auditsc_lock);
				return 0;
			}
		}
		write_unlock(&auditsc_lock);
		return -EFAULT;		/* No matching rule */
	}

The RCU version creates a copy, updates the copy, then replaces the old
entry with the newly updated entry.  This sequence of actions, allowing
concurrent reads while making a copy to perform an update, is what gives
RCU (*read-copy update*) its name.  The RCU code is as follows::

	static inline int audit_upd_rule(struct audit_rule *rule,
					 struct list_head *list,
					 __u32 newaction,
					 __u32 newfield_count)
	{
		struct audit_entry *e;
		struct audit_entry *ne;

		list_for_each_entry(e, list, list) {
			if (!audit_compare_rule(rule, &e->rule)) {
				ne = kmalloc(sizeof(*ne), GFP_ATOMIC);
				if (ne == NULL)
					return -ENOMEM;
				audit_copy_rule(&ne->rule, &e->rule);
				ne->rule.action = newaction;
				ne->rule.field_count = newfield_count;
				list_replace_rcu(&e->list, &ne->list);
				call_rcu(&e->rcu, audit_free_rule);
				return 0;
			}
		}
		return -EFAULT;		/* No matching rule */
	}

Again, this assumes that the caller holds ``audit_filter_mutex``.  Normally, the
writer lock would become a spinlock in this sort of code.

Another use of this pattern can be found in the openvswitch driver's *connection
tracking table* code in ``ct_limit_set()``.  The table holds connection tracking
entries and has a limit on the maximum entries.  There is one such table
per zone and hence one *limit* per zone.  The zones are mapped to their limits
through a hashtable using an RCU-managed hlist for the hash chains. When a new
limit is set, a new limit object is allocated and ``ct_limit_set()`` is called
to replace the old limit object with the new one using list_replace_rcu().
The old limit object is then freed after a grace period using kfree_rcu().
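
The essential shape of that replace-and-free pattern, shown here as a
hypothetical sketch using an hlist hash chain rather than the actual
openvswitch data structures, is: find the old object, publish the new one in
its place, and let kfree_rcu() reclaim the old one after a grace period::

	/* Hypothetical sketch, not the openvswitch code.  The caller holds
	 * the update-side lock that serializes writers. */
	struct zone_limit {
		struct hlist_node node;
		struct rcu_head rcu;
		u16 zone;
		u32 limit;
	};

	static void zone_limit_set(struct hlist_head *head, struct zone_limit *new)
	{
		struct zone_limit *old;

		hlist_for_each_entry(old, head, node) {
			if (old->zone == new->zone) {
				/* Publish the new object in place of the old one. */
				hlist_replace_rcu(&old->node, &new->node);
				/* Free the old object after a grace period. */
				kfree_rcu(old, rcu);
				return;
			}
		}
		hlist_add_head_rcu(&new->node, head);
	}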


Example 4: Eliminating Stale Data
---------------------------------

The auditing example above tolerates stale data, as do most algorithms
that track external state.  Because there is a delay between the time the
external state changes and the time Linux becomes aware of the change,
additional RCU-induced staleness is generally not a problem.

However, there are many examples where stale data cannot be tolerated.
One example in the Linux kernel is the System V IPC (see the shm_lock()
function in ipc/shm.c).  This code checks a *deleted* flag under a
per-entry spinlock, and, if the *deleted* flag is set, pretends that the
entry does not exist.  For this to be helpful, the search function must
return holding the per-entry spinlock, as shm_lock() does in fact do.

.. _quick_quiz:

Quick Quiz:
	For the deleted-flag technique to be helpful, why is it necessary
	to hold the per-entry lock while returning from the search function?

:ref:`Answer to Quick Quiz <quick_quiz_answer>`

If the system-call audit module were to ever need to reject stale data, one way
to accomplish this would be to add a ``deleted`` flag and a ``lock`` spinlock to the
audit_entry structure, and modify ``audit_filter_task()`` as follows::

	static enum audit_state audit_filter_task(struct task_struct *tsk)
	{
		struct audit_entry *e;
		enum audit_state   state;

		rcu_read_lock();
		list_for_each_entry_rcu(e, &audit_tsklist, list) {
			if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
				spin_lock(&e->lock);
				if (e->deleted) {
					spin_unlock(&e->lock);
					rcu_read_unlock();
					return AUDIT_BUILD_CONTEXT;
				}
				rcu_read_unlock();
				return state;
			}
		}
		rcu_read_unlock();
		return AUDIT_BUILD_CONTEXT;
	}

Note that this example assumes that entries are only added and deleted.
Additional mechanism is required to deal correctly with the update-in-place
performed by ``audit_upd_rule()``.  For one thing, ``audit_upd_rule()`` would
need additional memory barriers to ensure that the list_add_rcu() was really
executed before the list_del_rcu().
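
One way to provide that additional mechanism, shown here as a sketch rather
than actual kernel code, is to combine Example 3's copy-based update with the
``deleted`` flag: list_replace_rcu() publishes the new entry with a single
pointer update, so readers see either the old entry or the new one (avoiding
the add/delete ordering problem), and the old entry is marked deleted under
its per-entry lock before being passed to call_rcu()::

	/* Sketch only, not kernel code: copy-based update combined with
	 * the ->deleted flag, so that readers that found the old entry
	 * before the replacement was published will reject it. */
	static inline int audit_upd_rule(struct audit_rule *rule,
					 struct list_head *list,
					 __u32 newaction,
					 __u32 newfield_count)
	{
		struct audit_entry *e;
		struct audit_entry *ne;

		/* Note: audit_filter_mutex held by caller. */
		list_for_each_entry(e, list, list) {
			if (!audit_compare_rule(rule, &e->rule)) {
				ne = kmalloc(sizeof(*ne), GFP_ATOMIC);
				if (ne == NULL)
					return -ENOMEM;
				audit_copy_rule(&ne->rule, &e->rule);
				ne->rule.action = newaction;
				ne->rule.field_count = newfield_count;
				ne->deleted = 0;
				spin_lock_init(&ne->lock);
				list_replace_rcu(&e->list, &ne->list);
				spin_lock(&e->lock);
				e->deleted = 1;
				spin_unlock(&e->lock);
				call_rcu(&e->rcu, audit_free_rule);
				return 0;
			}
		}
		return -EFAULT;		/* No matching rule */
	}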

The ``audit_del_rule()`` function would need to set the ``deleted`` flag under the
spinlock as follows::

	static inline int audit_del_rule(struct audit_rule *rule,
					 struct list_head *list)
	{
		struct audit_entry *e;

		/* No need to use the _rcu iterator here, since this
		 * is the only deletion routine. */
		list_for_each_entry(e, list, list) {
			if (!audit_compare_rule(rule, &e->rule)) {
				spin_lock(&e->lock);
				list_del_rcu(&e->list);
				e->deleted = 1;
				spin_unlock(&e->lock);
				call_rcu(&e->rcu, audit_free_rule);
				return 0;
			}
		}
		return -EFAULT;		/* No matching rule */
	}

This too assumes that the caller holds ``audit_filter_mutex``.


Example 5: Skipping Stale Objects
---------------------------------

For some use cases, reader performance can be improved by skipping stale
objects during read-side list traversal, where an object is stale if it is
pending destruction after one or more grace periods. One such example can be
found in the timerfd subsystem. When a ``CLOCK_REALTIME`` clock is
reprogrammed, for example by setting the system time, all programmed timerfds
that depend on this clock get triggered and the processes waiting on them are
woken up in advance of their scheduled expiry. To facilitate this, all such
timers are added to an RCU-managed ``cancel_list`` when they are set up in
``timerfd_setup_cancel()``::

	static void timerfd_setup_cancel(struct timerfd_ctx *ctx, int flags)
	{
		spin_lock(&ctx->cancel_lock);
		if (ctx->clockid == CLOCK_REALTIME &&
		    (flags & TFD_TIMER_ABSTIME) && (flags & TFD_TIMER_CANCEL_ON_SET)) {
			if (!ctx->might_cancel) {
				ctx->might_cancel = true;
				spin_lock(&cancel_lock);
				list_add_rcu(&ctx->clist, &cancel_list);
				spin_unlock(&cancel_lock);
			}
		}
		spin_unlock(&ctx->cancel_lock);
	}

When a timerfd is freed (its fd is closed), the ``might_cancel`` flag of the
timerfd object is cleared, and the object is removed from the ``cancel_list``
and destroyed::

	int timerfd_release(struct inode *inode, struct file *file)
	{
		struct timerfd_ctx *ctx = file->private_data;

		spin_lock(&ctx->cancel_lock);
		if (ctx->might_cancel) {
			ctx->might_cancel = false;
			spin_lock(&cancel_lock);
			list_del_rcu(&ctx->clist);
			spin_unlock(&cancel_lock);
		}
		spin_unlock(&ctx->cancel_lock);

		hrtimer_cancel(&ctx->t.tmr);
		kfree_rcu(ctx, rcu);
		return 0;
	}

If the ``CLOCK_REALTIME`` clock is set, for example by a time server, the
hrtimer framework calls ``timerfd_clock_was_set()`` which walks the
``cancel_list`` and wakes up processes waiting on the timerfd. While iterating
the ``cancel_list``, the ``might_cancel`` flag is consulted to skip stale
objects::

	void timerfd_clock_was_set(void)
	{
		struct timerfd_ctx *ctx;
		unsigned long flags;

		rcu_read_lock();
		list_for_each_entry_rcu(ctx, &cancel_list, clist) {
			if (!ctx->might_cancel)
				continue;
			spin_lock_irqsave(&ctx->wqh.lock, flags);
			if (ctx->moffs != ktime_mono_to_real(0)) {
				ctx->moffs = KTIME_MAX;
				ctx->ticks++;
				wake_up_locked_poll(&ctx->wqh, EPOLLIN);
			}
			spin_unlock_irqrestore(&ctx->wqh.lock, flags);
		}
		rcu_read_unlock();
	}

The key point here is that, because RCU traversal of the ``cancel_list``
happens while objects are being added to and removed from the list, the
traversal can sometimes step on an object that has already been removed from
the list. In this example, it is better to skip such stale objects using a flag.


Summary
-------

Read-mostly list-based data structures that can tolerate stale data are
the most amenable to use of RCU.  The simplest case is where entries are
either added to or deleted from the data structure (or atomically modified
in place), but non-atomic in-place modifications can be handled by making
a copy, updating the copy, then replacing the original with the copy.
If stale data cannot be tolerated, then a *deleted* flag may be used
in conjunction with a per-entry spinlock in order to allow the search
function to reject newly deleted data.

.. _quick_quiz_answer:

Answer to Quick Quiz:
	For the deleted-flag technique to be helpful, why is it necessary
	to hold the per-entry lock while returning from the search function?

	If the search function drops the per-entry lock before returning,
	then the caller will be processing stale data in any case.  If it
	is really OK to be processing stale data, then you don't need a
	*deleted* flag.  If processing stale data really is a problem,
	then you need to hold the per-entry lock across all of the code
	that uses the value that was returned.

:ref:`Back to Quick Quiz <quick_quiz>`