===========================================================================
Proper Locking Under a Preemptible Kernel: Keeping Kernel Code Preempt-Safe
===========================================================================

:Author: Robert Love <rml@tech9.net>


Introduction
============


A preemptible kernel creates new locking issues. The issues are the same as
those under SMP: concurrency and reentrancy. Thankfully, the Linux preemptible
kernel model leverages existing SMP locking mechanisms. Thus, the kernel
requires explicit additional locking in only a very few additional situations.

This document is for all kernel hackers. Developing code in the kernel
requires protecting against these situations.


RULE #1: Per-CPU data structures need explicit protection
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


Two similar problems arise. An example code snippet::

	struct this_needs_locking tux[NR_CPUS];
	tux[smp_processor_id()] = some_value;
	/* task is preempted here... */
	something = tux[smp_processor_id()];

First, since the data is per-CPU, it may not have explicit SMP locking, but
may otherwise require it. Second, when a preempted task is finally
rescheduled, the previous value of smp_processor_id may not equal the current
one. You must protect these situations by disabling preemption around them.

You can also use get_cpu() and put_cpu(), which disable and re-enable
preemption, respectively.

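For example, a minimal sketch of keeping such an access on one CPU; the
structure and variable names are the hypothetical ones from the snippet
above, while get_cpu() and put_cpu() are the real interfaces::

	struct this_needs_locking tux[NR_CPUS];
	int cpu;

	cpu = get_cpu();	/* returns this CPU's id and disables preemption */
	tux[cpu] = some_value;
	/* the task cannot be preempted or migrated to another CPU here */
	something = tux[cpu];
	put_cpu();		/* re-enables preemption */

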
RULE #2: CPU state must be protected
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


Under preemption, the state of the CPU must be protected. This is arch-
dependent, but includes CPU structures and state not preserved over a context
switch. For example, on x86, entering and exiting FPU mode is now a critical
section that must occur while preemption is disabled. Think what would happen
if the kernel is executing a floating-point instruction and is then preempted.
Remember, the kernel does not save FPU state except for user tasks. Therefore,
upon preemption, the FPU registers will be sold to the lowest bidder. Thus,
preemption must be disabled around such regions.

Note that some FPU functions are already explicitly preempt-safe. For example,
kernel_fpu_begin() and kernel_fpu_end() will disable and enable preemption.

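A minimal sketch of such a region, assuming some kernel code that wants to
use FPU/SIMD instructions (the work inside is only a placeholder)::

	kernel_fpu_begin();	/* disables preemption, FPU is now usable in the kernel */
	/* ... floating-point/SIMD work; we cannot be preempted here ... */
	kernel_fpu_end();	/* gives up the FPU and re-enables preemption */
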

RULE #3: Lock acquire and release must be performed by same task
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


A lock acquired in one task must be released by the same task. This
means you can't do oddball things like acquire a lock and go off to
play while another task releases it. If you want to do something
like this, acquire and release the lock in the same code path and
have the caller wait on an event signaled by the other task.

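For instance, a minimal sketch of that pattern using a completion; the lock,
completion and function names are hypothetical::

	static DEFINE_MUTEX(foo_lock);
	static DECLARE_COMPLETION(foo_done);

	/* the worker acquires and releases the lock itself... */
	static void foo_worker(void)
	{
		mutex_lock(&foo_lock);
		/* ... do the protected work ... */
		mutex_unlock(&foo_lock);
		complete(&foo_done);		/* ...and signals the waiter */
	}

	/* the caller never touches the lock, it just waits for the event */
	static void foo_caller(void)
	{
		wait_for_completion(&foo_done);
	}
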

Solution
========


Data protection under preemption is achieved by disabling preemption for the
duration of the critical region.

::

	preempt_enable()              decrement the preempt counter
	preempt_disable()             increment the preempt counter
	preempt_enable_no_resched()   decrement, but do not immediately preempt
	preempt_check_resched()       if needed, reschedule
	preempt_count()               return the preempt counter

These functions are nestable. In other words, you can call preempt_disable
n times in a code path, and preemption will not be reenabled until the n-th
call to preempt_enable. The preempt statements are defined to nothing if
preemption is not enabled.

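For example, a short sketch of the nesting behaviour::

	preempt_disable();	/* count: 0 -> 1, preemption now disabled */
	preempt_disable();	/* count: 1 -> 2 */
	/* ... */
	preempt_enable();	/* count: 2 -> 1, still not preemptible */
	preempt_enable();	/* count: 1 -> 0, preemption possible again */
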
Note that you do not need to explicitly prevent preemption if you are holding
any locks or if interrupts are disabled, since preemption is implicitly
disabled in those cases.

But keep in mind that 'irqs disabled' is a fundamentally unsafe way of
disabling preemption - any cond_resched() or cond_resched_lock() might trigger
a reschedule if the preempt count is 0. A simple printk() might trigger a
reschedule. So use this implicit preemption-disabling property only if you
know that the affected codepath does not do any of this. The best policy is
to use this only for small, atomic code that you wrote and which calls no
complex functions.

Example::

	cpucache_t *cc; /* this is per-CPU */
	preempt_disable();
	cc = cc_data(searchp);
	if (cc && cc->avail) {
		__free_block(searchp, cc_entry(cc), cc->avail);
		cc->avail = 0;
	}
	preempt_enable();
	return 0;

Notice how the preemption statements must encompass every reference to the
critical variables. Another example::

	int buf[NR_CPUS];
	set_cpu_val(buf);
	if (buf[smp_processor_id()] == -1) printk(KERN_INFO "wee!\n");
	spin_lock(&buf_lock);
	/* ... */

This code is not preempt-safe, but see how easily we can fix it by simply
moving the spin_lock up two lines.

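A sketch of the fixed version follows; buf_lock and set_cpu_val are the
hypothetical names from the snippet above, and holding the spinlock
implicitly disables preemption across both the update and the read::

	int buf[NR_CPUS];
	spin_lock(&buf_lock);	/* preemption is now implicitly disabled */
	set_cpu_val(buf);
	if (buf[smp_processor_id()] == -1) printk(KERN_INFO "wee!\n");
	/* ... */
	spin_unlock(&buf_lock);

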
Preventing preemption using interrupt disabling
===============================================


It is possible to prevent a preemption event using local_irq_disable and
local_irq_save. Note, when doing so, you must be very careful not to cause
an event that would set need_resched and result in a preemption check. When
in doubt, rely on locking or explicit preemption disabling.

Note that as of 2.5, interrupt disabling is only per-CPU (i.e. local).

An additional concern is proper usage of local_irq_disable and local_irq_save.
These may be used to protect from preemption; however, on exit, if preemption
may be enabled, a test to see if preemption is required should be made. If
these are called from the spin_lock and read/write lock macros, the right
thing is done. They may also be called within a spin-lock protected region;
however, if they are ever called outside of this context, a test for
preemption should be made. Do note that calls from interrupt context or
bottom halves/tasklets are also protected by preemption locks and so may use
the versions which do not check preemption.
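
For example, a minimal sketch of such an explicit test when the interrupt
disabling is done outside of the lock macros; preempt_check_resched() is the
reschedule check from the table above::

	unsigned long flags;

	local_irq_save(flags);
	/* ... touch preempt-sensitive or per-CPU state ... */
	local_irq_restore(flags);
	preempt_check_resched();	/* we may have missed a preemption point */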