.. SPDX-License-Identifier: GPL-2.0

.. _kernel_hacking_locktypes:

==========================
Lock types and their rules
==========================
Introduction
============

The kernel provides a variety of locking primitives which can be divided
into two categories:

 - Sleeping locks
 - Spinning locks

This document conceptually describes these lock types and provides rules
for their nesting, including the rules for use under PREEMPT_RT.
Lock categories
===============

Sleeping locks
--------------

Sleeping locks can only be acquired in preemptible task context.

Although implementations allow try_lock() from other contexts, it is
necessary to carefully evaluate the safety of unlock() as well as of
try_lock().  Furthermore, it is also necessary to evaluate the debugging
versions of these primitives.  In short, don't acquire sleeping locks from
other contexts unless there is no other option.
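As a brief illustration of this rule, here is a minimal sketch of
sleeping-lock usage in preemptible task context; the my_data structure and
my_data_update() function are hypothetical, not part of any kernel API::

  #include <linux/mutex.h>

  struct my_data {
          struct mutex lock;
          int value;
  };

  /* Called only from preemptible task context. */
  static void my_data_update(struct my_data *d, int value)
  {
          mutex_lock(&d->lock);           /* may sleep while contended */
          d->value = value;
          mutex_unlock(&d->lock);         /* released by the acquiring task */
  }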
Sleeping lock types:

 - mutex
 - rt_mutex
 - semaphore
 - rw_semaphore
 - ww_mutex
 - percpu_rw_semaphore

On PREEMPT_RT kernels, these lock types are converted to sleeping locks:

 - spinlock_t
 - rwlock_t

Spinning locks
--------------

 - raw_spinlock_t
 - bit spinlocks

On non-PREEMPT_RT kernels, these lock types are also spinning locks:

 - spinlock_t
 - rwlock_t
Spinning locks implicitly disable preemption and the lock / unlock functions
can have suffixes which apply further protections:

 ===================  ====================================================
 _bh()                Disable / enable bottom halves (soft interrupts)
 _irq()               Disable / enable interrupts
 _irqsave/restore()   Save and disable / restore interrupt disabled state
 ===================  ====================================================
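As a hedged sketch of how these suffixes are used (the lock and counter
names are hypothetical): use _bh() when the data is also touched from
softirq context, and _irqsave() when it is also touched from hard
interrupt context::

  #include <linux/spinlock.h>

  static DEFINE_SPINLOCK(stats_lock);     /* hypothetical lock */
  static unsigned long stats_count;

  /* Data shared with softirq context: block bottom halves. */
  static void stats_inc_bh(void)
  {
          spin_lock_bh(&stats_lock);
          stats_count++;
          spin_unlock_bh(&stats_lock);
  }

  /* Data shared with hard interrupt context: save and restore the
     interrupt disabled state. */
  static void stats_inc_irq(void)
  {
          unsigned long flags;

          spin_lock_irqsave(&stats_lock, flags);
          stats_count++;
          spin_unlock_irqrestore(&stats_lock, flags);
  }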
Owner semantics
===============

The aforementioned lock types except semaphores have strict owner
semantics:

  The context (task) that acquired the lock must release it.

rw_semaphores have a special interface which allows non-owner release for
readers.
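One such interface is the down_read_non_owner() / up_read_non_owner()
pair, which allows the reader lock to be released by a task other than the
acquirer.  A hedged sketch, with hypothetical lock and function names::

  #include <linux/rwsem.h>

  static DECLARE_RWSEM(data_sem);         /* hypothetical rw_semaphore */

  /* Task A takes the reader lock but does not release it itself,
     e.g. because completion happens in another context. */
  static void start_io(void)
  {
          down_read_non_owner(&data_sem);
          /* hand the protected data off to another task */
  }

  /* Task B (a different task) releases the reader lock. */
  static void finish_io(void)
  {
          up_read_non_owner(&data_sem);
  }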
rtmutex
=======

RT-mutexes are mutexes with support for priority inheritance (PI).

PI has limitations on non-PREEMPT_RT kernels due to preemption and
interrupt disabled sections.
PI clearly cannot preempt preemption-disabled or interrupt-disabled
regions of code, even on PREEMPT_RT kernels.  Instead, PREEMPT_RT kernels
execute most such regions of code in preemptible task context, especially
interrupt handlers and soft interrupts.  This conversion allows spinlock_t
and rwlock_t to be implemented via RT-mutexes.
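For reference, a minimal in-kernel rt_mutex usage sketch; the lock and
function names are hypothetical.  A task blocking on the rt_mutex boosts
the priority of the current owner::

  #include <linux/rtmutex.h>

  static DEFINE_RT_MUTEX(ctrl_lock);      /* hypothetical rt_mutex */

  static void update_ctrl_state(void)
  {
          rt_mutex_lock(&ctrl_lock);      /* owner inherits waiters' priority */
          /* critical section runs in preemptible task context */
          rt_mutex_unlock(&ctrl_lock);
  }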
semaphore
=========

semaphore is a counting semaphore implementation.

Semaphores are often used for both serialization and waiting, but new use
cases should instead use separate serialization and wait mechanisms, such
as mutexes and completions.
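For example, code that would previously have used a semaphore to wait for
an event can use a completion for the waiting part.  A hedged sketch with
hypothetical names::

  #include <linux/completion.h>

  static DECLARE_COMPLETION(setup_done);  /* hypothetical completion */

  /* Waiting side: blocks until the event is signalled. */
  static void wait_for_setup(void)
  {
          wait_for_completion(&setup_done);
  }

  /* Signalling side: wakes up a waiter. */
  static void signal_setup_done(void)
  {
          complete(&setup_done);
  }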
semaphores and PREEMPT_RT
-------------------------

PREEMPT_RT does not change the semaphore implementation because counting
semaphores have no concept of owners, thus preventing PREEMPT_RT from
providing priority inheritance for semaphores.  After all, an unknown
owner cannot be boosted.  As a consequence, blocking on semaphores can
result in priority inversion.
rw_semaphore
============

rw_semaphore is a multiple readers and single writer lock mechanism.

On non-PREEMPT_RT kernels the implementation is fair, thus preventing
writer starvation.

rw_semaphore complies by default with the strict owner semantics, but there
exist special-purpose interfaces that allow non-owner release for readers.
These interfaces work independent of the kernel configuration.
rw_semaphore and PREEMPT_RT
---------------------------

PREEMPT_RT kernels map rw_semaphore to a separate rt_mutex-based
implementation, thus changing the fairness:

 Because an rw_semaphore writer cannot grant its priority to multiple
 readers, a preempted low-priority reader will continue holding its lock,
 thus starving even high-priority writers.  In contrast, because readers
 can grant their priority to a writer, a preempted low-priority writer will
 have its priority boosted until it releases the lock, thus preventing that
 writer from starving readers.
raw_spinlock_t and spinlock_t
=============================

raw_spinlock_t
--------------

raw_spinlock_t is a strict spinning lock implementation in all kernels,
including PREEMPT_RT kernels.  Use raw_spinlock_t only in real critical
core code, low-level interrupt handling and places where disabling
preemption or interrupts is required, for example, to safely access
hardware state.  raw_spinlock_t can sometimes also be used when the
critical section is tiny, thus avoiding RT-mutex overhead.
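A hedged sketch of the tiny-critical-section case; the lock, the shadow
register and hw_update() are hypothetical.  raw_spin_lock_irqsave() spins
with interrupts disabled on all configurations, so the section must stay
minimal::

  #include <linux/spinlock.h>

  static DEFINE_RAW_SPINLOCK(hw_lock);    /* hypothetical low-level lock */
  static u32 hw_shadow_reg;

  /* Tiny critical section guarding hypothetical hardware state. */
  static void hw_update(u32 val)
  {
          unsigned long flags;

          raw_spin_lock_irqsave(&hw_lock, flags);
          hw_shadow_reg = val;            /* must not sleep or allocate here */
          raw_spin_unlock_irqrestore(&hw_lock, flags);
  }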
spinlock_t
----------

The semantics of spinlock_t change with the state of PREEMPT_RT.

On a non-PREEMPT_RT kernel spinlock_t is mapped to raw_spinlock_t and has
exactly the same semantics.
spinlock_t and PREEMPT_RT
-------------------------

On a PREEMPT_RT kernel spinlock_t is mapped to a separate implementation
based on rt_mutex which changes the semantics:

 - Preemption is not disabled.

 - The hard interrupt related suffixes for spin_lock / spin_unlock
   operations (_irq, _irqsave / _irqrestore) do not affect the CPU's
   interrupt disabled state.

 - The soft interrupt related suffix (_bh()) still disables softirq
   handlers.

   Non-PREEMPT_RT kernels disable preemption to get this effect.

   PREEMPT_RT kernels use a per-CPU lock for serialization which keeps
   preemption disabled.  The lock disables softirq handlers and also
   prevents reentrancy due to task preemption.
PREEMPT_RT kernels preserve all other spinlock_t semantics:

 - Tasks holding a spinlock_t do not migrate.  Non-PREEMPT_RT kernels
   avoid migration by disabling preemption.  PREEMPT_RT kernels instead
   disable migration, which ensures that pointers to per-CPU variables
   remain valid even if the task is preempted.

 - Task state is preserved across spinlock acquisition, ensuring that the
   task-state rules apply to all kernel configurations.  Non-PREEMPT_RT
   kernels leave task state untouched.  However, PREEMPT_RT must change
   task state if the task blocks during acquisition.  Therefore, it saves
   the current task state before blocking and the corresponding lock wakeup
   restores it, as shown below::
    task->state = TASK_INTERRUPTIBLE
     lock()
       block()
         task->saved_state = task->state
         task->state = TASK_UNINTERRUPTIBLE
         schedule()
                                       lock wakeup
                                         task->state = task->saved_state
   Other types of wakeups would normally unconditionally set the task state
   to RUNNING, but that does not work here because the task must remain
   blocked until the lock becomes available.  Therefore, when a non-lock
   wakeup attempts to awaken a task blocked waiting for a spinlock, it
   instead sets the saved state to RUNNING.  Then, when the lock
   acquisition completes, the lock wakeup sets the task state to the saved
   state, in this case setting it to RUNNING::
    task->state = TASK_INTERRUPTIBLE
     lock()
       block()
         task->saved_state = task->state
         task->state = TASK_UNINTERRUPTIBLE
         schedule()
                                       non lock wakeup
                                         task->saved_state = TASK_RUNNING

                                       lock wakeup
                                         task->state = task->saved_state

   This ensures that the real wakeup cannot be lost.
rwlock_t
========

rwlock_t is a multiple readers and single writer lock mechanism.

Non-PREEMPT_RT kernels implement rwlock_t as a spinning lock and the
suffix rules of spinlock_t apply accordingly.  The implementation is fair,
thus preventing writer starvation.
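A hedged usage sketch with hypothetical names: multiple readers may hold
the lock concurrently, while a writer requires exclusive access::

  #include <linux/spinlock.h>

  static DEFINE_RWLOCK(table_lock);       /* hypothetical rwlock_t */
  static int table[16];

  static int table_read(int idx)
  {
          int val;

          read_lock(&table_lock);         /* concurrent readers allowed */
          val = table[idx];
          read_unlock(&table_lock);
          return val;
  }

  static void table_write(int idx, int val)
  {
          write_lock(&table_lock);        /* exclusive writer access */
          table[idx] = val;
          write_unlock(&table_lock);
  }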
rwlock_t and PREEMPT_RT
-----------------------

PREEMPT_RT kernels map rwlock_t to a separate rt_mutex-based
implementation, thus changing semantics:

 - All the spinlock_t changes also apply to rwlock_t.

 - Because an rwlock_t writer cannot grant its priority to multiple
   readers, a preempted low-priority reader will continue holding its lock,
   thus starving even high-priority writers.  In contrast, because readers
   can grant their priority to a writer, a preempted low-priority writer
   will have its priority boosted until it releases the lock, thus
   preventing that writer from starving readers.
PREEMPT_RT caveats
==================

spinlock_t and rwlock_t
-----------------------

These changes in spinlock_t and rwlock_t semantics on PREEMPT_RT kernels
have a few implications.  For example, on a non-PREEMPT_RT kernel the
following code sequence works as expected::

   local_irq_disable();
   spin_lock(&lock);

and is fully equivalent to::

   spin_lock_irq(&lock);

The same applies to rwlock_t and the _irqsave() suffix variants.
On a PREEMPT_RT kernel this code sequence breaks because RT-mutex requires
a fully preemptible context.  Instead, use spin_lock_irq() or
spin_lock_irqsave() and their unlock counterparts.  In cases where the
interrupt disabling and locking must remain separate, PREEMPT_RT offers a
local_lock mechanism.  Acquiring the local_lock pins the task to a CPU,
allowing things like per-CPU interrupt disabled locks to be acquired.
However, this approach should be used only where absolutely necessary.
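A hedged sketch of the local_lock approach; the cpu_stats structure and
stats_add() are hypothetical.  The per-CPU data is protected by a
local_lock_t instead of a bare local_irq_disable()::

  #include <linux/percpu.h>
  #include <linux/local_lock.h>

  struct cpu_stats {                      /* hypothetical per-CPU data */
          local_lock_t lock;
          unsigned long count;
  };

  static DEFINE_PER_CPU(struct cpu_stats, cpu_stats) = {
          .lock = INIT_LOCAL_LOCK(lock),
  };

  static void stats_add(unsigned long n)
  {
          unsigned long flags;

          /* On non-PREEMPT_RT this disables interrupts; on PREEMPT_RT it
             acquires a per-CPU lock and pins the task to the CPU. */
          local_lock_irqsave(&cpu_stats.lock, flags);
          this_cpu_add(cpu_stats.count, n);
          local_unlock_irqrestore(&cpu_stats.lock, flags);
  }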
raw_spinlock_t on RT
--------------------

Acquiring a raw_spinlock_t disables preemption and possibly also
interrupts, so the critical section must avoid acquiring a regular
spinlock_t or rwlock_t, for example, the critical section must avoid
allocating memory.  Thus, on a non-PREEMPT_RT kernel the following code
works perfectly::

  raw_spin_lock(&lock);
  p = kmalloc(sizeof(*p), GFP_ATOMIC);
But this code fails on PREEMPT_RT kernels because the memory allocator is
fully preemptible and therefore cannot be invoked from truly atomic
contexts.  However, it is perfectly fine to invoke the memory allocator
while holding normal non-raw spinlocks because they do not disable
preemption on PREEMPT_RT kernels::

  spin_lock(&lock);
  p = kmalloc(sizeof(*p), GFP_ATOMIC);
bit spinlocks
-------------

PREEMPT_RT cannot substitute bit spinlocks because a single bit is too
small to accommodate an RT-mutex.  Therefore, the semantics of bit
spinlocks are preserved on PREEMPT_RT kernels, so that the raw_spinlock_t
caveats also apply to bit spinlocks.
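A hedged sketch with a hypothetical state word and lock bit:
bit_spin_lock() spins on a single bit of the word and, like
raw_spinlock_t, keeps the critical section atomic even on PREEMPT_RT::

  #include <linux/bit_spinlock.h>

  #define MY_LOCK_BIT     0               /* hypothetical lock bit */

  static unsigned long my_state;          /* bit 0 used as a lock */

  static void state_update(void)
  {
          bit_spin_lock(MY_LOCK_BIT, &my_state);
          /* atomic context even on PREEMPT_RT: no sleeping locks,
             no memory allocation */
          bit_spin_unlock(MY_LOCK_BIT, &my_state);
  }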
Some bit spinlocks are replaced with regular spinlock_t for PREEMPT_RT
using conditional (#ifdef'ed) code changes at the usage site.  In contrast,
usage-site changes are not needed for the spinlock_t substitution.
Instead, conditionals in header files and the core locking implementation
enable the compiler to do the substitution transparently.
Lock type nesting rules
=======================

The most basic rules are:

 - Lock types of the same lock category (sleeping, spinning) can nest
   arbitrarily as long as they respect the general lock ordering rules to
   prevent deadlocks.

 - Sleeping lock types cannot nest inside spinning lock types.

 - Spinning lock types can nest inside sleeping lock types.

These constraints apply both in PREEMPT_RT and otherwise.
The fact that PREEMPT_RT changes the lock category of spinlock_t and
rwlock_t from spinning to sleeping means that they cannot be acquired while
holding a raw spinlock.  This results in the following nesting ordering:

  1) Sleeping locks
  2) spinlock_t and rwlock_t
  3) raw_spinlock_t and bit spinlocks

Lockdep will complain if these constraints are violated, both in
PREEMPT_RT and otherwise.
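For illustration, a hedged sketch with hypothetical locks that respects
this ordering: a sleeping lock is taken first, then a spinlock_t, then a
raw_spinlock_t::

  #include <linux/mutex.h>
  #include <linux/spinlock.h>

  static DEFINE_MUTEX(cfg_mutex);         /* level 1: sleeping lock */
  static DEFINE_SPINLOCK(list_lock);      /* level 2: spinlock_t */
  static DEFINE_RAW_SPINLOCK(hw_lock);    /* level 3: raw_spinlock_t */

  static void valid_nesting(void)
  {
          mutex_lock(&cfg_mutex);
          spin_lock(&list_lock);          /* spinning inside sleeping: OK */
          raw_spin_lock(&hw_lock);        /* raw inside spinlock_t: OK */
          raw_spin_unlock(&hw_lock);
          spin_unlock(&list_lock);
          mutex_unlock(&cfg_mutex);

          /* Invalid would be the reverse, e.g. taking spinlock_t (a
             sleeping lock on PREEMPT_RT) inside a raw_spinlock_t
             critical section. */
  }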