Documentation/locking/locktypes.rst

.. SPDX-License-Identifier: GPL-2.0

.. _kernel_hacking_locktypes:

==========================
Lock types and their rules
==========================

Introduction
============

The kernel provides a variety of locking primitives which can be divided
into three categories:

 - Sleeping locks
 - CPU local locks
 - Spinning locks

This document conceptually describes these lock types and provides rules
for their nesting, including the rules for use under PREEMPT_RT.


Lock categories
===============

Sleeping locks
--------------

Sleeping locks can only be acquired in preemptible task context.

Although implementations allow try_lock() from other contexts, it is
necessary to carefully evaluate the safety of unlock() as well as of
try_lock().  Furthermore, it is also necessary to evaluate the debugging
versions of these primitives.  In short, don't acquire sleeping locks from
other contexts unless there is no other option.

Sleeping lock types:

 - mutex
 - rt_mutex
 - semaphore
 - rw_semaphore
 - ww_mutex
 - percpu_rw_semaphore

On PREEMPT_RT kernels, these lock types are converted to sleeping locks:

 - local_lock
 - spinlock_t
 - rwlock_t


CPU local locks
---------------

 - local_lock

On non-PREEMPT_RT kernels, local_lock functions are wrappers around
preemption and interrupt disabling primitives. Contrary to other locking
mechanisms, disabling preemption or interrupts is a purely CPU-local
concurrency control mechanism and is not suited for inter-CPU concurrency
control.


Spinning locks
--------------

 - raw_spinlock_t
 - bit spinlocks

On non-PREEMPT_RT kernels, these lock types are also spinning locks:

 - spinlock_t
 - rwlock_t

Spinning locks implicitly disable preemption and the lock / unlock functions
can have suffixes which apply further protections:

 ===================  ====================================================
 _bh()                Disable / enable bottom halves (soft interrupts)
 _irq()               Disable / enable interrupts
 _irqsave/restore()   Save and disable / restore interrupt disabled state
 ===================  ====================================================
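
The difference between the _irq() and the _irqsave() / _irqrestore()
suffixes can be pictured with a small userspace model.  This is only a
sketch: irqs_disabled_model and the model_*() helpers are made up for
illustration and are not kernel interfaces.  The model assumes a spinning
lock, where the suffixes really change the interrupt state of the CPU.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the CPU's interrupt-disabled flag; not a kernel API. */
static bool irqs_disabled_model;

/* _irq() style lock: unconditionally disable interrupts. */
static void model_lock_irq(void)
{
	irqs_disabled_model = true;
}

/* _irq() style unlock: unconditionally enable interrupts again. */
static void model_unlock_irq(void)
{
	/* Unlocking with interrupts already enabled would be a bug. */
	assert(irqs_disabled_model);
	irqs_disabled_model = false;
}

/* _irqsave() style lock: remember the previous state, then disable. */
static bool model_lock_irqsave(void)
{
	bool flags = irqs_disabled_model;

	irqs_disabled_model = true;
	return flags;
}

/* _irqrestore() style unlock: put back exactly what was saved. */
static void model_unlock_irqrestore(bool flags)
{
	irqs_disabled_model = flags;
}
```

In this model, a model_lock_irq() / model_unlock_irq() pair that runs
inside an already interrupt-disabled region leaves interrupts enabled
afterwards, while the save / restore pair returns to the state it found.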


Owner semantics
===============

The aforementioned lock types except semaphores have strict owner
semantics:

  The context (task) that acquired the lock must release it.

rw_semaphores have a special interface which allows non-owner release for
readers.


rtmutex
=======

RT-mutexes are mutexes with support for priority inheritance (PI).

PI has limitations on non-PREEMPT_RT kernels due to preemption and
interrupt disabled sections.

PI clearly cannot preempt preemption-disabled or interrupt-disabled
regions of code, even on PREEMPT_RT kernels.  Instead, PREEMPT_RT kernels
execute most such regions of code in preemptible task context, especially
interrupt handlers and soft interrupts.  This conversion allows spinlock_t
and rwlock_t to be implemented via RT-mutexes.
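
The basic idea of priority inheritance can be sketched with a toy
userspace model.  Everything here is an assumption made for illustration:
struct task, struct pi_lock and the pi_lock_*() helpers are hypothetical,
a larger number means a higher priority, and waiter queues and chained
boosting are left out.  It is not how the kernel implements rt_mutex.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: a larger number means a higher priority (sketch assumption). */
struct task {
	int prio;	/* the task's own priority */
	int eff_prio;	/* priority after any inherited boost */
};

struct pi_lock {
	struct task *owner;	/* NULL while the lock is free */
};

/* Returns 1 if the lock was taken, 0 if the caller has to wait. */
static int pi_lock_acquire(struct pi_lock *l, struct task *t)
{
	if (l->owner == NULL) {
		l->owner = t;
		return 1;
	}
	/* Contended: the known owner inherits the waiter's priority. */
	if (t->eff_prio > l->owner->eff_prio)
		l->owner->eff_prio = t->eff_prio;
	return 0;
}

/* Strict owner semantics: only the owner releases, which ends the boost. */
static void pi_lock_release(struct pi_lock *l, struct task *t)
{
	assert(l->owner == t);
	t->eff_prio = t->prio;
	l->owner = NULL;
}
```

The boost in pi_lock_acquire() is only possible because the lock records
its owner, which is the property the owner semantics above provide.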


semaphore
=========

semaphore is a counting semaphore implementation.

Semaphores are often used for both serialization and waiting, but new use
cases should instead use separate serialization and wait mechanisms, such
as mutexes and completions.

semaphores and PREEMPT_RT
-------------------------

PREEMPT_RT does not change the semaphore implementation because counting
semaphores have no concept of owners, thus preventing PREEMPT_RT from
providing priority inheritance for semaphores.  After all, an unknown
owner cannot be boosted. As a consequence, blocking on semaphores can
result in priority inversion.


rw_semaphore
============

rw_semaphore is a multiple readers and single writer lock mechanism.

On non-PREEMPT_RT kernels the implementation is fair, thus preventing
writer starvation.

rw_semaphore complies by default with the strict owner semantics, but there
exist special-purpose interfaces that allow non-owner release for readers.
These interfaces work independently of the kernel configuration.

rw_semaphore and PREEMPT_RT
---------------------------

PREEMPT_RT kernels map rw_semaphore to a separate rt_mutex-based
implementation, thus changing the fairness:

 Because an rw_semaphore writer cannot grant its priority to multiple
 readers, a preempted low-priority reader will continue holding its lock,
 thus starving even high-priority writers.  In contrast, because readers
 can grant their priority to a writer, a preempted low-priority writer will
 have its priority boosted until it releases the lock, thus preventing that
 writer from starving readers.


local_lock
==========

local_lock provides a named scope to critical sections which are protected
by disabling preemption or interrupts.

On non-PREEMPT_RT kernels local_lock operations map to the preemption and
interrupt disabling and enabling primitives:

 ===============================  ======================
 local_lock(&llock)               preempt_disable()
 local_unlock(&llock)             preempt_enable()
 local_lock_irq(&llock)           local_irq_disable()
 local_unlock_irq(&llock)         local_irq_enable()
 local_lock_irqsave(&llock)       local_irq_save()
 local_unlock_irqrestore(&llock)  local_irq_restore()
 ===============================  ======================

The named scope of local_lock has two advantages over the regular
primitives:

  - The lock name allows static analysis and also clearly documents the
    protection scope, while the regular primitives are scopeless and
    opaque.

  - If lockdep is enabled, the local_lock gains a lockmap which allows
    validating the correctness of the protection. This can detect cases
    where e.g. a function using preempt_disable() as protection mechanism
    is invoked from interrupt or soft-interrupt context. Aside from that,
    lockdep_assert_held(&llock) works as with any other locking primitive.

local_lock and PREEMPT_RT
-------------------------

PREEMPT_RT kernels map local_lock to a per-CPU spinlock_t, thus changing
semantics:

  - All spinlock_t changes also apply to local_lock.

local_lock usage
----------------

local_lock should be used in situations where disabling preemption or
interrupts is the appropriate form of concurrency control to protect
per-CPU data structures on a non-PREEMPT_RT kernel.

local_lock is not suitable to protect against preemption or interrupts on a
PREEMPT_RT kernel due to the PREEMPT_RT specific spinlock_t semantics.


raw_spinlock_t and spinlock_t
=============================

raw_spinlock_t
--------------

raw_spinlock_t is a strict spinning lock implementation in all kernels,
including PREEMPT_RT kernels.  Use raw_spinlock_t only in real critical
core code, low-level interrupt handling and places where disabling
preemption or interrupts is required, for example, to safely access
hardware state.  raw_spinlock_t can sometimes also be used when the
critical section is tiny, thus avoiding RT-mutex overhead.

spinlock_t
----------

The semantics of spinlock_t change with the state of PREEMPT_RT.

On a non-PREEMPT_RT kernel spinlock_t is mapped to raw_spinlock_t and has
exactly the same semantics.

spinlock_t and PREEMPT_RT
-------------------------

On a PREEMPT_RT kernel spinlock_t is mapped to a separate implementation
based on rt_mutex which changes the semantics:

 - Preemption is not disabled.

 - The hard interrupt related suffixes for spin_lock / spin_unlock
   operations (_irq, _irqsave / _irqrestore) do not affect the CPU's
   interrupt disabled state.

 - The soft interrupt related suffix (_bh()) still disables softirq
   handlers.

   Non-PREEMPT_RT kernels disable preemption to get this effect.

   PREEMPT_RT kernels use a per-CPU lock for serialization which keeps
   preemption enabled. The lock disables softirq handlers and also
   prevents reentrancy due to task preemption.

PREEMPT_RT kernels preserve all other spinlock_t semantics:

 - Tasks holding a spinlock_t do not migrate.  Non-PREEMPT_RT kernels
   avoid migration by disabling preemption.  PREEMPT_RT kernels instead
   disable migration, which ensures that pointers to per-CPU variables
   remain valid even if the task is preempted.

 - Task state is preserved across spinlock acquisition, ensuring that the
   task-state rules apply to all kernel configurations.  Non-PREEMPT_RT
   kernels leave task state untouched.  However, PREEMPT_RT must change
   task state if the task blocks during acquisition.  Therefore, it saves
   the current task state before blocking and the corresponding lock wakeup
   restores it, as shown below::

    task->state = TASK_INTERRUPTIBLE
     lock()
       block()
         task->saved_state = task->state
         task->state = TASK_UNINTERRUPTIBLE
         schedule()
                                        lock wakeup
                                          task->state = task->saved_state

   Other types of wakeups would normally unconditionally set the task state
   to RUNNING, but that does not work here because the task must remain
   blocked until the lock becomes available.  Therefore, when a non-lock
   wakeup attempts to awaken a task blocked waiting for a spinlock, it
   instead sets the saved state to RUNNING.  Then, when the lock
   acquisition completes, the lock wakeup sets the task state to the saved
   state, in this case setting it to RUNNING::

    task->state = TASK_INTERRUPTIBLE
     lock()
       block()
         task->saved_state = task->state
         task->state = TASK_UNINTERRUPTIBLE
         schedule()
                                        non lock wakeup
                                          task->saved_state = TASK_RUNNING

                                        lock wakeup
                                          task->state = task->saved_state

   This ensures that the real wakeup cannot be lost.
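
The two diagrams above can be replayed with a small userspace model.  This
is a sketch under stated assumptions, not kernel code: struct task, the
blocked_on_lock flag and the three helper functions are hypothetical names
that merely mirror the steps in the diagrams.

```c
#include <assert.h>

/* Toy model: the names mirror the diagrams, not kernel definitions. */
enum task_state { TASK_RUNNING, TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE };

struct task {
	enum task_state state;
	enum task_state saved_state;
	int blocked_on_lock;	/* nonzero while waiting for the lock */
};

/* The task blocks in lock(): stash the caller's state, then sleep. */
static void block_on_lock(struct task *t)
{
	t->saved_state = t->state;
	t->state = TASK_UNINTERRUPTIBLE;
	t->blocked_on_lock = 1;
}

/* A regular wakeup must not end the lock wait, so it updates saved_state. */
static void non_lock_wakeup(struct task *t)
{
	if (t->blocked_on_lock)
		t->saved_state = TASK_RUNNING;
	else
		t->state = TASK_RUNNING;
}

/* The lock became available: restore whatever state was saved. */
static void lock_wakeup(struct task *t)
{
	assert(t->blocked_on_lock);
	t->state = t->saved_state;
	t->blocked_on_lock = 0;
}
```

Calling block_on_lock() followed by lock_wakeup() reproduces the first
diagram.  Inserting non_lock_wakeup() between the two reproduces the second
one, where the task ends up in the RUNNING state once it owns the lock.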


rwlock_t
========

rwlock_t is a multiple readers and single writer lock mechanism.

Non-PREEMPT_RT kernels implement rwlock_t as a spinning lock and the
suffix rules of spinlock_t apply accordingly. The implementation is fair,
thus preventing writer starvation.

rwlock_t and PREEMPT_RT
-----------------------

PREEMPT_RT kernels map rwlock_t to a separate rt_mutex-based
implementation, thus changing semantics:

 - All the spinlock_t changes also apply to rwlock_t.

 - Because an rwlock_t writer cannot grant its priority to multiple
   readers, a preempted low-priority reader will continue holding its lock,
   thus starving even high-priority writers.  In contrast, because readers
   can grant their priority to a writer, a preempted low-priority writer
   will have its priority boosted until it releases the lock, thus
   preventing that writer from starving readers.


PREEMPT_RT caveats
==================

local_lock on RT
----------------

The mapping of local_lock to spinlock_t on PREEMPT_RT kernels has a few
implications. For example, on a non-PREEMPT_RT kernel the following code
sequence works as expected::

  local_lock_irq(&local_lock);
  raw_spin_lock(&lock);

and is fully equivalent to::

   raw_spin_lock_irq(&lock);

On a PREEMPT_RT kernel this code sequence breaks because local_lock_irq()
is mapped to a per-CPU spinlock_t which disables neither interrupts nor
preemption. The following code sequence works correctly on both
PREEMPT_RT and non-PREEMPT_RT kernels::

  local_lock_irq(&local_lock);
  spin_lock(&lock);

Another caveat with local locks is that each local_lock has a specific
protection scope. So the following substitution is wrong::

  func1()
  {
    local_irq_save(flags);    -> local_lock_irqsave(&local_lock_1, flags);
    func3();
    local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock_1, flags);
  }

  func2()
  {
    local_irq_save(flags);    -> local_lock_irqsave(&local_lock_2, flags);
    func3();
    local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock_2, flags);
  }

  func3()
  {
    lockdep_assert_irqs_disabled();
    access_protected_data();
  }

On a non-PREEMPT_RT kernel this works correctly, but on a PREEMPT_RT kernel
local_lock_1 and local_lock_2 are distinct and cannot serialize the callers
of func3(). Also the lockdep assert will trigger on a PREEMPT_RT kernel
because local_lock_irqsave() does not disable interrupts due to the
PREEMPT_RT-specific semantics of spinlock_t. The correct substitution is::

  func1()
  {
    local_irq_save(flags);    -> local_lock_irqsave(&local_lock, flags);
    func3();
    local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock, flags);
  }

  func2()
  {
    local_irq_save(flags);    -> local_lock_irqsave(&local_lock, flags);
    func3();
    local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock, flags);
  }

  func3()
  {
    lockdep_assert_held(&local_lock);
    access_protected_data();
  }


spinlock_t and rwlock_t
-----------------------

The changes in spinlock_t and rwlock_t semantics on PREEMPT_RT kernels
have a few implications.  For example, on a non-PREEMPT_RT kernel the
following code sequence works as expected::

   local_irq_disable();
   spin_lock(&lock);

and is fully equivalent to::

   spin_lock_irq(&lock);

The same applies to rwlock_t and the _irqsave() suffix variants.

On a PREEMPT_RT kernel this code sequence breaks because RT-mutex requires
a fully preemptible context.  Instead, use spin_lock_irq() or
spin_lock_irqsave() and their unlock counterparts.  In cases where the
interrupt disabling and locking must remain separate, PREEMPT_RT offers a
local_lock mechanism.  Acquiring the local_lock pins the task to a CPU,
allowing things like per-CPU interrupt disabled locks to be acquired.
However, this approach should be used only where absolutely necessary.

A typical scenario is protection of per-CPU variables in thread context::

  struct foo *p = get_cpu_ptr(&var1);

  spin_lock(&p->lock);
  p->count += this_cpu_read(var2);

This is correct code on a non-PREEMPT_RT kernel, but on a PREEMPT_RT kernel
this breaks. The PREEMPT_RT-specific change of spinlock_t semantics does
not allow acquiring p->lock because get_cpu_ptr() implicitly disables
preemption. The following substitution works on both kernels::

  struct foo *p;

  migrate_disable();
  p = this_cpu_ptr(&var1);
  spin_lock(&p->lock);
  p->count += this_cpu_read(var2);

On a non-PREEMPT_RT kernel migrate_disable() maps to preempt_disable()
which makes the above code fully equivalent. On a PREEMPT_RT kernel
migrate_disable() ensures that the task is pinned on the current CPU which
in turn guarantees that the per-CPU accesses to var1 and var2 stay on the
same CPU.

The migrate_disable() substitution is not valid for the following
scenario::

  func()
  {
    struct foo *p;

    migrate_disable();
    p = this_cpu_ptr(&var1);
    p->val = func2();
    migrate_enable();
  }

While correct on a non-PREEMPT_RT kernel, this breaks on PREEMPT_RT because
here migrate_disable() does not protect against reentrancy from a
preempting task. A correct substitution for this case is::

  func()
  {
    struct foo *p;

    local_lock(&foo_lock);
    p = this_cpu_ptr(&var1);
    p->val = func2();
    local_unlock(&foo_lock);
  }

On a non-PREEMPT_RT kernel this protects against reentrancy by disabling
preemption. On a PREEMPT_RT kernel this is achieved by acquiring the
underlying per-CPU spinlock.


raw_spinlock_t on RT
--------------------

Acquiring a raw_spinlock_t disables preemption and possibly also
interrupts, so the critical section must avoid acquiring a regular
spinlock_t or rwlock_t.  For example, the critical section must avoid
allocating memory.  Thus, on a non-PREEMPT_RT kernel the following code
works perfectly::

  raw_spin_lock(&lock);
  p = kmalloc(sizeof(*p), GFP_ATOMIC);

But this code fails on PREEMPT_RT kernels because the memory allocator is
fully preemptible and therefore cannot be invoked from truly atomic
contexts.  However, it is perfectly fine to invoke the memory allocator
while holding normal non-raw spinlocks because they do not disable
preemption on PREEMPT_RT kernels::

  spin_lock(&lock);
  p = kmalloc(sizeof(*p), GFP_ATOMIC);


bit spinlocks
-------------

PREEMPT_RT cannot substitute bit spinlocks because a single bit is too
small to accommodate an RT-mutex.  Therefore, the semantics of bit
spinlocks are preserved on PREEMPT_RT kernels, so that the raw_spinlock_t
caveats also apply to bit spinlocks.

Some bit spinlocks are replaced with regular spinlock_t for PREEMPT_RT
using conditional (#ifdef'ed) code changes at the usage site.  In contrast,
usage-site changes are not needed for the spinlock_t substitution.
Instead, conditionals in header files and the core locking implementation
enable the compiler to do the substitution transparently.


Lock type nesting rules
=======================

The most basic rules are:

  - Lock types of the same lock category (sleeping, CPU local, spinning)
    can nest arbitrarily as long as they respect the general lock ordering
    rules to prevent deadlocks.

  - Sleeping lock types cannot nest inside CPU local and spinning lock types.

  - CPU local and spinning lock types can nest inside sleeping lock types.

  - Spinning lock types can nest inside all lock types.

These constraints apply both in PREEMPT_RT and otherwise.

The fact that PREEMPT_RT changes the lock category of spinlock_t and
rwlock_t from spinning to sleeping and substitutes local_lock with a
per-CPU spinlock_t means that they cannot be acquired while holding a raw
spinlock.  This results in the following nesting ordering:

  1) Sleeping locks
  2) spinlock_t, rwlock_t, local_lock
  3) raw_spinlock_t and bit spinlocks

Lockdep will complain if these constraints are violated, both in
PREEMPT_RT and otherwise.
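
The nesting ordering can be written down as a tiny rank check.  This is an
illustrative userspace sketch, not how lockdep works: enum lock_rank and
may_nest() are hypothetical names, and the sketch assumes that locks of the
same level may nest as long as the general lock ordering rules are
respected.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical ranks following the nesting ordering: outermost first. */
enum lock_rank {
	RANK_SLEEPING = 1,	/* mutex, rw_semaphore, ... */
	RANK_SPIN_OR_LOCAL = 2,	/* spinlock_t, rwlock_t, local_lock */
	RANK_RAW = 3,		/* raw_spinlock_t, bit spinlocks */
};

/* A lock may be taken only if its rank is not lower than the held one. */
static bool may_nest(enum lock_rank held, enum lock_rank wanted)
{
	assert(held >= RANK_SLEEPING && held <= RANK_RAW);
	assert(wanted >= RANK_SLEEPING && wanted <= RANK_RAW);
	return wanted >= held;
}
```

For example, may_nest(RANK_SLEEPING, RANK_RAW) is true, while
may_nest(RANK_RAW, RANK_SPIN_OR_LOCAL) is false, matching the rule that a
spinlock_t cannot be acquired while holding a raw spinlock.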