Orange Pi5 kernel

^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   1) This document provides "recipes", that is, litmus tests for commonly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   2) occurring situations, as well as a few that illustrate subtly broken but
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   3) attractive nuisances.  Many of these recipes include example code from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   4) v5.7 of the Linux kernel.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   5) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   6) The first section covers simple special cases, the second section
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   7) takes off the training wheels to cover more involved examples,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   8) and the third section provides a few rules of thumb.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   9) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  10) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  11) Simple special cases
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  12) ====================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  13) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  14) This section presents two simple special cases, the first being where
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  15) there is only one CPU or only one memory location is accessed, and the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  16) second being use of that old concurrency workhorse, locking.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  17) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  18) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  19) Single CPU or single memory location
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  20) ------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  21) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  22) If there is only one CPU on the one hand or only one variable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  23) on the other, the code will execute in order.  There are (as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  24) usual) some things to be careful of:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  25) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  26) 1.	Some aspects of the C language are unordered.  For example,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  27) 	in the expression "f(x) + g(y)", the order in which f and g are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  28) 	called is not defined; the object code is allowed to use either
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  29) 	order or even to interleave the computations.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  30) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  31) 2.	Compilers are permitted to use the "as-if" rule.  That is, a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  32) 	compiler can emit whatever code it likes for normal accesses,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  33) 	as long as the results of a single-threaded execution appear
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  34) 	just as if the compiler had followed all the relevant rules.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  35) 	To see this, compile with a high level of optimization and run
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  36) 	the debugger on the resulting binary.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  37) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  38) 3.	If there is only one variable but multiple CPUs, that variable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  39) 	must be properly aligned and all accesses to that variable must
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  40) 	be full sized.	Variables that straddle cachelines or pages void
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  41) 	your full-ordering warranty, as do undersized accesses that load
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  42) 	from or store to only part of the variable.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  43) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  44) 4.	If there are multiple CPUs, accesses to shared variables should
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  45) 	use READ_ONCE() and WRITE_ONCE() or stronger to prevent load/store
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  46) 	tearing, load/store fusing, and invented loads and stores.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  47) 	There are exceptions to this rule, including:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  48) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  49) 	i.	When there is no possibility of a given shared variable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  50) 		being updated by some other CPU, for example, while
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  51) 		holding the update-side lock, reads from that variable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  52) 		need not use READ_ONCE().
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  53) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  54) 	ii.	When there is no possibility of a given shared variable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  55) 		being either read or updated by other CPUs, for example,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  56) 		when running during early boot, reads from that variable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  57) 		need not use READ_ONCE() and writes to that variable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  58) 		need not use WRITE_ONCE().
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  59) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  60) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  61) Locking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  62) -------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  63) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  64) Locking is well-known and straightforward, at least if you don't think
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  65) about it too hard.  And the basic rule is indeed quite simple: Any CPU that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  66) has acquired a given lock sees any changes previously seen or made by any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  67) CPU before it released that same lock.  Note that this statement is a bit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  68) stronger than "Any CPU holding a given lock sees all changes made by any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  69) CPU during the time that CPU was holding this same lock".  For example,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  70) consider the following pair of code fragments:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  71) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  72) 	/* See MP+polocks.litmus. */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  73) 	void CPU0(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  74) 	{
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  75) 		WRITE_ONCE(x, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  76) 		spin_lock(&mylock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  77) 		WRITE_ONCE(y, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  78) 		spin_unlock(&mylock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  79) 	}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  80) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  81) 	void CPU1(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  82) 	{
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  83) 		spin_lock(&mylock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  84) 		r0 = READ_ONCE(y);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  85) 		spin_unlock(&mylock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  86) 		r1 = READ_ONCE(x);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  87) 	}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  88) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  89) The basic rule guarantees that if CPU0() acquires mylock before CPU1(),
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  90) then both r0 and r1 must be set to the value 1.  This also has the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  91) consequence that if the final value of r0 is equal to 1, then the final
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  92) value of r1 must also be equal to 1.  In contrast, the weaker rule would
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  93) say nothing about the final value of r1.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  94) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  95) The converse to the basic rule also holds, as illustrated by the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  96) following litmus test:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  97) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  98) 	/* See MP+porevlocks.litmus. */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  99) 	void CPU0(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) 	{
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) 		r0 = READ_ONCE(y);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) 		spin_lock(&mylock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) 		r1 = READ_ONCE(x);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) 		spin_unlock(&mylock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) 	}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) 	void CPU1(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) 	{
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) 		spin_lock(&mylock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) 		WRITE_ONCE(x, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) 		spin_unlock(&mylock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) 		WRITE_ONCE(y, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) 	}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) This converse to the basic rule guarantees that if CPU0() acquires
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) mylock before CPU1(), then both r0 and r1 must be set to the value 0.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) This also has the consequence that if the final value of r1 is equal
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) to 0, then the final value of r0 must also be equal to 0.  In contrast,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) the weaker rule would say nothing about the final value of r0.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) These examples show only a single pair of CPUs, but the effects of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) locking basic rule extend across multiple acquisitions of a given lock
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) across multiple CPUs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) However, it is not necessarily the case that accesses ordered by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) locking will be seen as ordered by CPUs not holding that lock.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) Consider this example:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) 	/* See Z6.0+pooncelock+pooncelock+pombonce.litmus. */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) 	void CPU0(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) 	{
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) 		spin_lock(&mylock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) 		WRITE_ONCE(x, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) 		WRITE_ONCE(y, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) 		spin_unlock(&mylock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) 	}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) 	void CPU1(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) 	{
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) 		spin_lock(&mylock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) 		r0 = READ_ONCE(y);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) 		WRITE_ONCE(z, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) 		spin_unlock(&mylock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) 	}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146) 	void CPU2(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) 	{
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) 		WRITE_ONCE(z, 2);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) 		smp_mb();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150) 		r1 = READ_ONCE(x);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151) 	}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153) Counter-intuitive though it might be, it is quite possible to have
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) the final value of r0 be 1, the final value of z be 2, and the final
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) value of r1 be 0.  The reason for this surprising outcome is that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) CPU2() never acquired the lock, and thus did not benefit from the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) lock's ordering properties.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159) Ordering can be extended to CPUs not holding the lock by careful use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) of smp_mb__after_spinlock():
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162) 	/* See Z6.0+pooncelock+poonceLock+pombonce.litmus. */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) 	void CPU0(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164) 	{
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) 		spin_lock(&mylock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) 		WRITE_ONCE(x, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) 		WRITE_ONCE(y, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) 		spin_unlock(&mylock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) 	}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171) 	void CPU1(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) 	{
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) 		spin_lock(&mylock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174) 		smp_mb__after_spinlock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) 		r0 = READ_ONCE(y);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) 		WRITE_ONCE(z, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177) 		spin_unlock(&mylock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178) 	}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180) 	void CPU2(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181) 	{
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182) 		WRITE_ONCE(z, 2);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183) 		smp_mb();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184) 		r1 = READ_ONCE(x);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185) 	}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187) This addition of smp_mb__after_spinlock() strengthens the lock acquisition
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188) sufficiently to rule out the counter-intuitive outcome.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191) Taking off the training wheels
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192) ==============================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194) This section looks at more complex examples, including message passing,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195) load buffering, release-acquire chains, store buffering.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196) Many classes of litmus tests have abbreviated names, which may be found
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197) here: https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200) Message passing (MP)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201) --------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203) The MP pattern has one CPU execute a pair of stores to a pair of variables
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204) and another CPU execute a pair of loads from this same pair of variables,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205) but in the opposite order.  The goal is to avoid the counter-intuitive
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 206) outcome in which the first load sees the value written by the second store
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 207) but the second load does not see the value written by the first store.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 208) In the absence of any ordering, this goal may not be met, as can be seen
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 209) in the MP+poonceonces.litmus litmus test.  This section therefore looks at
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 210) a number of ways of meeting this goal.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 211) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 212) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 213) Release and acquire
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 214) ~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 215) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 216) Use of smp_store_release() and smp_load_acquire() is one way to force
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 217) the desired MP ordering.  The general approach is shown below:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 218) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 219) 	/* See MP+pooncerelease+poacquireonce.litmus. */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 220) 	void CPU0(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 221) 	{
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 222) 		WRITE_ONCE(x, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 223) 		smp_store_release(&y, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 224) 	}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 225) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 226) 	void CPU1(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 227) 	{
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 228) 		r0 = smp_load_acquire(&y);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 229) 		r1 = READ_ONCE(x);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 230) 	}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 231) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 232) The smp_store_release() macro orders any prior accesses against the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 233) store, while the smp_load_acquire macro orders the load against any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 234) subsequent accesses.  Therefore, if the final value of r0 is the value 1,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 235) the final value of r1 must also be the value 1.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 236) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 237) The init_stack_slab() function in lib/stackdepot.c uses release-acquire
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 238) in this way to safely initialize of a slab of the stack.  Working out
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 239) the mutual-exclusion design is left as an exercise for the reader.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 240) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 241) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 242) Assign and dereference
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 243) ~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 244) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 245) Use of rcu_assign_pointer() and rcu_dereference() is quite similar to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 246) use of smp_store_release() and smp_load_acquire(), except that both
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 247) rcu_assign_pointer() and rcu_dereference() operate on RCU-protected
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 248) pointers.  The general approach is shown below:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 249) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 250) 	/* See MP+onceassign+derefonce.litmus. */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 251) 	int z;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 252) 	int *y = &z;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 253) 	int x;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 254) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 255) 	void CPU0(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 256) 	{
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 257) 		WRITE_ONCE(x, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 258) 		rcu_assign_pointer(y, &x);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 259) 	}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 260) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 261) 	void CPU1(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 262) 	{
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 263) 		rcu_read_lock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 264) 		r0 = rcu_dereference(y);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 265) 		r1 = READ_ONCE(*r0);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 266) 		rcu_read_unlock();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 267) 	}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 268) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 269) In this example, if the final value of r0 is &x then the final value of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 270) r1 must be 1.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 271) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 272) The rcu_assign_pointer() macro has the same ordering properties as does
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 273) smp_store_release(), but the rcu_dereference() macro orders the load only
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 274) against later accesses that depend on the value loaded.  A dependency
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 275) is present if the value loaded determines the address of a later access
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 276) (address dependency, as shown above), the value written by a later store
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 277) (data dependency), or whether or not a later store is executed in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 278) first place (control dependency).  Note that the term "data dependency"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 279) is sometimes casually used to cover both address and data dependencies.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 280) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 281) In lib/math/prime_numbers.c, the expand_to_next_prime() function invokes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 282) rcu_assign_pointer(), and the next_prime_number() function invokes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 283) rcu_dereference().  This combination mediates access to a bit vector
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 284) that is expanded as additional primes are needed.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 285) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 286) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 287) Write and read memory barriers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 288) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 289) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 290) It is usually better to use smp_store_release() instead of smp_wmb()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 291) and to use smp_load_acquire() instead of smp_rmb().  However, the older
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 292) smp_wmb() and smp_rmb() APIs are still heavily used, so it is important
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 293) to understand their use cases.  The general approach is shown below:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 294) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 295) 	/* See MP+fencewmbonceonce+fencermbonceonce.litmus. */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 296) 	void CPU0(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 297) 	{
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 298) 		WRITE_ONCE(x, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 299) 		smp_wmb();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 300) 		WRITE_ONCE(y, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 301) 	}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 302) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 303) 	void CPU1(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 304) 	{
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 305) 		r0 = READ_ONCE(y);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 306) 		smp_rmb();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 307) 		r1 = READ_ONCE(x);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 308) 	}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 309) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 310) The smp_wmb() macro orders prior stores against later stores, and the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 311) smp_rmb() macro orders prior loads against later loads.  Therefore, if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 312) the final value of r0 is 1, the final value of r1 must also be 1.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 313) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 314) The xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 315) the following write-side code fragment:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 316) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 317) 	log->l_curr_block -= log->l_logBBsize;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 318) 	ASSERT(log->l_curr_block >= 0);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 319) 	smp_wmb();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 320) 	log->l_curr_cycle++;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 321) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 322) And the xlog_valid_lsn() function in fs/xfs/xfs_log_priv.h contains
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 323) the corresponding read-side code fragment:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 324) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 325) 	cur_cycle = READ_ONCE(log->l_curr_cycle);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 326) 	smp_rmb();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 327) 	cur_block = READ_ONCE(log->l_curr_block);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 328) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 329) Alternatively, consider the following comment in function
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 330) perf_output_put_handle() in kernel/events/ring_buffer.c:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 331) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 332) 	 *   kernel				user
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 333) 	 *
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 334) 	 *   if (LOAD ->data_tail) {		LOAD ->data_head
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 335) 	 *			(A)		smp_rmb()	(C)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 336) 	 *	STORE $data			LOAD $data
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 337) 	 *	smp_wmb()	(B)		smp_mb()	(D)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 338) 	 *	STORE ->data_head		STORE ->data_tail
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 339) 	 *   }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 340) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 341) The B/C pairing is an example of the MP pattern using smp_wmb() on the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 342) write side and smp_rmb() on the read side.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 343) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 344) Of course, given that smp_mb() is strictly stronger than either smp_wmb()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 345) or smp_rmb(), any code fragment that would work with smp_rmb() and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 346) smp_wmb() would also work with smp_mb() replacing either or both of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 347) weaker barriers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 348) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 349) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 350) Load buffering (LB)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 351) -------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 352) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 353) The LB pattern has one CPU load from one variable and then store to a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 354) second, while another CPU loads from the second variable and then stores
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 355) to the first.  The goal is to avoid the counter-intuitive situation where
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 356) each load reads the value written by the other CPU's store.  In the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 357) absence of any ordering it is quite possible that this may happen, as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 358) can be seen in the LB+poonceonces.litmus litmus test.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 359) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 360) One way of avoiding the counter-intuitive outcome is through the use of a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 361) control dependency paired with a full memory barrier:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 362) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 363) 	/* See LB+fencembonceonce+ctrlonceonce.litmus. */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 364) 	void CPU0(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 365) 	{
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 366) 		r0 = READ_ONCE(x);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 367) 		if (r0)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 368) 			WRITE_ONCE(y, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 369) 	}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 370) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 371) 	void CPU1(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 372) 	{
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 373) 		r1 = READ_ONCE(y);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 374) 		smp_mb();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 375) 		WRITE_ONCE(x, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 376) 	}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 377) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 378) This pairing of a control dependency in CPU0() with a full memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 379) barrier in CPU1() prevents r0 and r1 from both ending up equal to 1.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 380) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 381) The A/D pairing from the ring-buffer use case shown earlier also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 382) illustrates LB.  Here is a repeat of the comment in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 383) perf_output_put_handle() in kernel/events/ring_buffer.c, showing a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 384) control dependency on the kernel side and a full memory barrier on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 385) the user side:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 386) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 387) 	 *   kernel				user
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 388) 	 *
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 389) 	 *   if (LOAD ->data_tail) {		LOAD ->data_head
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 390) 	 *			(A)		smp_rmb()	(C)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 391) 	 *	STORE $data			LOAD $data
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 392) 	 *	smp_wmb()	(B)		smp_mb()	(D)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 393) 	 *	STORE ->data_head		STORE ->data_tail
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 394) 	 *   }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 395) 	 *
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 396) 	 * Where A pairs with D, and B pairs with C.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 397) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 398) The kernel's control dependency between the load from ->data_tail
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 399) and the store to data combined with the user's full memory barrier
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 400) between the load from data and the store to ->data_tail prevents
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 401) the counter-intuitive outcome where the kernel overwrites the data
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 402) before the user gets done loading it.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 403) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 404) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 405) Release-acquire chains
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 406) ----------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 407) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 408) Release-acquire chains are a low-overhead, flexible, and easy-to-use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 409) method of maintaining order.  However, they do have some limitations that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 410) need to be fully understood.  Here is an example that maintains order:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 411) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 412) 	/* See ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus. */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 413) 	void CPU0(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 414) 	{
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 415) 		WRITE_ONCE(x, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 416) 		smp_store_release(&y, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 417) 	}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 418) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 419) 	void CPU1(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 420) 	{
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 421) 		r0 = smp_load_acquire(y);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 422) 		smp_store_release(&z, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 423) 	}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 424) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 425) 	void CPU2(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 426) 	{
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 427) 		r1 = smp_load_acquire(z);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 428) 		r2 = READ_ONCE(x);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 429) 	}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 430) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 431) In this case, if r0 and r1 both have final values of 1, then r2 must
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 432) also have a final value of 1.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 433) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 434) The ordering in this example is stronger than it needs to be.  For
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 435) example, ordering would still be preserved if CPU1()'s smp_load_acquire()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 436) invocation was replaced with READ_ONCE().
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 437) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 438) It is tempting to assume that CPU0()'s store to x is globally ordered
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 439) before CPU1()'s store to z, but this is not the case:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 440) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 441) 	/* See Z6.0+pooncerelease+poacquirerelease+mbonceonce.litmus. */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 442) 	void CPU0(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 443) 	{
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 444) 		WRITE_ONCE(x, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 445) 		smp_store_release(&y, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 446) 	}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 447) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 448) 	void CPU1(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 449) 	{
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 450) 		r0 = smp_load_acquire(y);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 451) 		smp_store_release(&z, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 452) 	}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 453) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 454) 	void CPU2(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 455) 	{
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 456) 		WRITE_ONCE(z, 2);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 457) 		smp_mb();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 458) 		r1 = READ_ONCE(x);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 459) 	}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 460) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 461) One might hope that if the final value of r0 is 1 and the final value
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 462) of z is 2, then the final value of r1 must also be 1, but it really is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 463) possible for r1 to have the final value of 0.  The reason, of course,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 464) is that in this version, CPU2() is not part of the release-acquire chain.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 465) This situation is accounted for in the rules of thumb below.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 466) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 467) Despite this limitation, release-acquire chains are low-overhead as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 468) well as simple and powerful, at least as memory-ordering mechanisms go.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 469) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 470) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 471) Store buffering
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 472) ---------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 473) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 474) Store buffering can be thought of as upside-down load buffering, so
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 475) that one CPU first stores to one variable and then loads from a second,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 476) while another CPU stores to the second variable and then loads from the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 477) first.  Preserving order requires nothing less than full barriers:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 478) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 479) 	/* See SB+fencembonceonces.litmus. */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 480) 	void CPU0(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 481) 	{
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 482) 		WRITE_ONCE(x, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 483) 		smp_mb();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 484) 		r0 = READ_ONCE(y);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 485) 	}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 486) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 487) 	void CPU1(void)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 488) 	{
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 489) 		WRITE_ONCE(y, 1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 490) 		smp_mb();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 491) 		r1 = READ_ONCE(x);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 492) 	}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 493) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 494) Omitting either smp_mb() will allow both r0 and r1 to have final
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 495) values of 0, but providing both full barriers as shown above prevents
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 496) this counter-intuitive outcome.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 497) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 498) This pattern most famously appears as part of Dekker's locking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 499) algorithm, but it has a much more practical use within the Linux kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 500) of ordering wakeups.  The following comment taken from waitqueue_active()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 501) in include/linux/wait.h shows the canonical pattern:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 502) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 503)  *      CPU0 - waker                    CPU1 - waiter
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 504)  *
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 505)  *                                      for (;;) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 506)  *      @cond = true;                     prepare_to_wait(&wq_head, &wait, state);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 507)  *      smp_mb();                         // smp_mb() from set_current_state()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 508)  *      if (waitqueue_active(wq_head))         if (@cond)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 509)  *        wake_up(wq_head);                      break;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 510)  *                                        schedule();
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 511)  *                                      }
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 512)  *                                      finish_wait(&wq_head, &wait);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 513) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 514) On CPU0, the store is to @cond and the load is in waitqueue_active().
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 515) On CPU1, prepare_to_wait() contains both a store to wq_head and a call
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 516) to set_current_state(), which contains an smp_mb() barrier; the load is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 517) "if (@cond)".  The full barriers prevent the undesirable outcome where
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 518) CPU1 puts the waiting task to sleep and CPU0 fails to wake it up.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 519) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 520) Note that use of locking can greatly simplify this pattern.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 521) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 522) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 523) Rules of thumb
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 524) ==============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 525) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 526) There might seem to be no pattern governing what ordering primitives are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 527) needed in which situations, but this is not the case.  There is a pattern
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 528) based on the relation between the accesses linking successive CPUs in a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 529) given litmus test.  There are three types of linkage:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 530) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 531) 1.	Write-to-read, where the next CPU reads the value that the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 532) 	previous CPU wrote.  The LB litmus-test patterns contain only
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 533) 	this type of relation.	In formal memory-modeling texts, this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 534) 	relation is called "reads-from" and is usually abbreviated "rf".
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 535) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 536) 2.	Read-to-write, where the next CPU overwrites the value that the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 537) 	previous CPU read.  The SB litmus test contains only this type
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 538) 	of relation.  In formal memory-modeling texts, this relation is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 539) 	often called "from-reads" and is sometimes abbreviated "fr".
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 540) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 541) 3.	Write-to-write, where the next CPU overwrites the value written
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 542) 	by the previous CPU.  The Z6.0 litmus test pattern contains a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 543) 	write-to-write relation between the last access of CPU1() and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 544) 	the first access of CPU2().  In formal memory-modeling texts,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 545) 	this relation is often called "coherence order" and is sometimes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 546) 	abbreviated "co".  In the C++ standard, it is instead called
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 547) 	"modification order" and often abbreviated "mo".
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 548) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 549) The strength of memory ordering required for a given litmus test to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 550) avoid a counter-intuitive outcome depends on the types of relations
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 551) linking the memory accesses for the outcome in question:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 552) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 553) o	If all links are write-to-read links, then the weakest
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 554) 	possible ordering within each CPU suffices.  For example, in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 555) 	the LB litmus test, a control dependency was enough to do the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 556) 	job.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 557) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 558) o	If all but one of the links are write-to-read links, then a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 559) 	release-acquire chain suffices.  Both the MP and the ISA2
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 560) 	litmus tests illustrate this case.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 561) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 562) o	If more than one of the links are something other than
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 563) 	write-to-read links, then a full memory barrier is required
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 564) 	between each successive pair of non-write-to-read links.  This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 565) 	case is illustrated by the Z6.0 litmus tests, both in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 566) 	locking and in the release-acquire sections.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 567) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 568) However, if you find yourself having to stretch these rules of thumb
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 569) to fit your situation, you should consider creating a litmus test and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 570) running it on the model.