.. SPDX-License-Identifier: GPL-2.0

====================================================================
Reference-count design for elements of lists/arrays protected by RCU
====================================================================


Please note that the percpu-ref feature is likely your first
stop if you need to combine reference counts and RCU. Please see
include/linux/percpu-refcount.h for more information. However, in
those unusual cases where percpu-ref would consume too much memory,
please read on.
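
For illustration, a minimal percpu-ref sketch might look like the
following, where struct gadget and gadget_release() are hypothetical
names and the percpu_ref_*() functions are those declared in
include/linux/percpu-refcount.h::

    struct gadget {
        struct percpu_ref ref;
        /* ... payload ... */
    };

    /* Runs once the last reference is dropped after percpu_ref_kill(). */
    void gadget_release(struct percpu_ref *ref)
    {
        struct gadget *g = container_of(ref, struct gadget, ref);

        kfree(g);
    }

    int gadget_init(struct gadget *g)
    {
        /* Counter starts at one, in the fast per-CPU mode. */
        return percpu_ref_init(&g->ref, gadget_release, 0, GFP_KERNEL);
    }

Readers then pair percpu_ref_get() with percpu_ref_put(), and teardown
invokes percpu_ref_kill(), which switches the counter to atomic mode
and drops the initial reference, so that gadget_release() runs once the
last reference is gone.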

------------------------------------------------------------------------

Reference counting on elements of lists which are protected by traditional
reader/writer spinlocks or semaphores is straightforward:

CODE LISTING A::

    1.                                  2.
    add()                               search_and_reference()
    {                                   {
        alloc_object                        read_lock(&list_lock);
        ...                                 search_for_element
        atomic_set(&el->rc, 1);             atomic_inc(&el->rc);
        write_lock(&list_lock);             ...
        add_element                         read_unlock(&list_lock);
        ...                                 ...
        write_unlock(&list_lock);       }
    }

    3.                                  4.
    release_referenced()                delete()
    {                                   {
        ...                                 write_lock(&list_lock);
        if (atomic_dec_and_test(&el->rc))   ...
            kfree(el);
        ...                                 remove_element
    }                                       write_unlock(&list_lock);
                                            ...
                                            if (atomic_dec_and_test(&el->rc))
                                                kfree(el);
                                            ...
                                        }
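
For concreteness, here is one way the pseudocode of listing A might be
fleshed out. The struct element, its key field, and the elements list
head are hypothetical illustrations; only el->rc and list_lock appear
in the listing itself::

    struct element {
        struct list_head list;
        atomic_t rc;
        int key;
        /* ... payload ... */
    };

    static LIST_HEAD(elements);
    static DEFINE_RWLOCK(list_lock);

    void add(struct element *el)
    {
        atomic_set(&el->rc, 1);         /* Initial reference. */
        write_lock(&list_lock);
        list_add(&el->list, &elements);
        write_unlock(&list_lock);
    }

    struct element *search_and_reference(int key)
    {
        struct element *el;

        read_lock(&list_lock);
        list_for_each_entry(el, &elements, list) {
            if (el->key == key) {
                /* Safe: read_lock() excludes delete(). */
                atomic_inc(&el->rc);
                read_unlock(&list_lock);
                return el;
            }
        }
        read_unlock(&list_lock);
        return NULL;
    }

The release_referenced() and delete() columns then follow the listing
unchanged: because searches and deletions exclude each other via
list_lock, the counter cannot reach zero while a search is still
examining the element.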

If this list/array is made lock-free using RCU, that is, by changing the
write_lock() in add() and delete() to spin_lock() and the read_lock() in
search_and_reference() to rcu_read_lock(), then the atomic_inc() in
search_and_reference() could potentially take a reference to an element
that has already been deleted from the list/array. Use
atomic_inc_not_zero() in this scenario as follows:

CODE LISTING B::

    1.                                  2.
    add()                               search_and_reference()
    {                                   {
        alloc_object                        rcu_read_lock();
        ...                                 search_for_element
        atomic_set(&el->rc, 1);             if (!atomic_inc_not_zero(&el->rc)) {
        spin_lock(&list_lock);                  rcu_read_unlock();
                                                return FAIL;
        add_element                         }
        ...                                 ...
        spin_unlock(&list_lock);            rcu_read_unlock();
    }                                   }
    3.                                  4.
    release_referenced()                delete()
    {                                   {
        ...                                 spin_lock(&list_lock);
        if (atomic_dec_and_test(&el->rc))   ...
            call_rcu(&el->head, el_free);   remove_element
        ...                                 spin_unlock(&list_lock);
    }                                       ...
                                            if (atomic_dec_and_test(&el->rc))
                                                call_rcu(&el->head, el_free);
                                            ...
                                        }
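
Assuming the same hypothetical struct element as above, now also
carrying a struct rcu_head named head and linked via list_add_rcu()
and list_del_rcu(), the reader side of listing B might be written as::

    struct element *search_and_reference(int key)
    {
        struct element *el;

        rcu_read_lock();
        list_for_each_entry_rcu(el, &elements, list) {
            if (el->key == key) {
                /*
                 * The element may already have been removed and its
                 * initial reference dropped, so the counter can
                 * legitimately be zero here.
                 */
                if (!atomic_inc_not_zero(&el->rc))
                    break;              /* Treat as not found. */
                rcu_read_unlock();
                return el;
            }
        }
        rcu_read_unlock();
        return NULL;
    }

A NULL return from this function plays the role of FAIL in the listing.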

Sometimes, a reference to the element needs to be obtained in the
update (write) stream. In such cases, atomic_inc_not_zero() might be
overkill, since the update-side spinlock is already held: one might
instead use atomic_inc().
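
For example, an updater that already holds list_lock knows that delete()
cannot run concurrently, so the counter cannot fall to zero under it.
A hypothetical sketch, reusing the listings' search_for_element
pseudocode as a helper::

    struct element *update_side_get(int key)
    {
        struct element *el;

        spin_lock(&list_lock);
        el = search_for_element(key);   /* Plain traversal; lock held. */
        if (el)
            atomic_inc(&el->rc);        /* Cannot race with delete(). */
        spin_unlock(&list_lock);
        return el;
    }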

It is not always convenient to deal with "FAIL" in the
search_and_reference() code path. In such cases, the
atomic_dec_and_test() may be moved from delete() to el_free()
as follows:

CODE LISTING C::

    1.                                  2.
    add()                               search_and_reference()
    {                                   {
        alloc_object                        rcu_read_lock();
        ...                                 search_for_element
        atomic_set(&el->rc, 1);             atomic_inc(&el->rc);
        spin_lock(&list_lock);              ...

        add_element                         rcu_read_unlock();
        ...                             }
        spin_unlock(&list_lock);        4.
    }                                   delete()
    3.                                  {
    release_referenced()                    spin_lock(&list_lock);
    {                                       ...
        ...                                 remove_element
        if (atomic_dec_and_test(&el->rc))   spin_unlock(&list_lock);
            kfree(el);                      ...
        ...                                 call_rcu(&el->head, el_free);
    }                                       ...
    5.                                  }
    void el_free(struct rcu_head *rhp)
    {
        release_referenced();
    }
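
In real code, el_free() recovers the enclosing element from the
rcu_head and drops the reference directly, for example::

    void el_free(struct rcu_head *rhp)
    {
        struct element *el = container_of(rhp, struct element, head);

        if (atomic_dec_and_test(&el->rc))
            kfree(el);
    }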

The key point is that the initial reference added by add() is not removed
until after a grace period has elapsed following removal. This means that
search_and_reference() cannot find this element, which means that the value
of el->rc cannot increase. Thus, once it reaches zero, there are no
readers that can or ever will be able to reference the element. The
element can therefore safely be freed. This in turn guarantees that if
any reader finds the element, that reader may safely acquire a reference
without checking the value of the reference counter.

A clear advantage of the RCU-based pattern in listing C over the one
in listing B is that any call to search_and_reference() that locates
a given object will succeed in obtaining a reference to that object,
even given a concurrent invocation of delete() for that same object.
Similarly, a clear advantage of both listings B and C over listing A is
that a call to delete() is not delayed even if there are an arbitrarily
large number of calls to search_and_reference() searching for the same
object that delete() was invoked on. Instead, all that is delayed is
the eventual invocation of kfree(), which is usually not a problem on
modern computer systems, even the small ones.

In cases where delete() can sleep, synchronize_rcu() can be called from
delete(), so that el_free() can be subsumed into delete() as follows::

    4.
    delete()
    {
        spin_lock(&list_lock);
        ...
        remove_element
        spin_unlock(&list_lock);
        ...
        synchronize_rcu();
        if (atomic_dec_and_test(&el->rc))
            kfree(el);
        ...
    }

As additional examples in the kernel, the pattern in listing C is used by
reference counting of struct pid, while the pattern in listing B is used by
struct posix_acl.