tools/memory-model/Documentation/simple.txt

   1 This document provides options for those wishing to keep their
   2 memory-ordering lives simple, as is necessary for those whose domain
   3 is complex.  After all, there are bugs other than memory-ordering bugs,
   4 and the time spent gaining memory-ordering knowledge is not available
   5 for gaining domain knowledge.  Furthermore Linux-kernel memory model
   6 (LKMM) is quite complex, with subtle differences in code often having
   7 dramatic effects on correctness.
   8
   9 The options near the beginning of this list are quite simple.  The idea
  10 is not that kernel hackers don't already know about them, but rather
  11 that they might need the occasional reminder.
  12
  13 Please note that this is a generic guide, and that specific subsystems
  14 will often have special requirements or idioms.  For example, developers
  15 of MMIO-based device drivers will often need to use mb(), rmb(), and
  16 wmb(), and therefore might find smp_mb(), smp_rmb(), and smp_wmb()
  17 to be more natural than smp_load_acquire() and smp_store_release().
  18 On the other hand, those coming in from other environments will likely
  19 be more familiar with these last two.
  20
  21
  22 Single-threaded code
  23 ====================
  24
  25 In single-threaded code, there is no reordering, at least assuming
  26 that your toolchain and hardware are working correctly.  In addition,
  27 it is generally a mistake to assume your code will only run in a single
  28 threaded context as the kernel can enter the same code path on multiple
  29 CPUs at the same time.  One important exception is a function that makes
  30 no external data references.
  31
  32 In the general case, you will need to take explicit steps to ensure that
  33 your code really is executed within a single thread that does not access
  34 shared variables.  A simple way to achieve this is to define a global lock
  35 that you acquire at the beginning of your code and release at the end,
  36 taking care to ensure that all references to your code's shared data are
  37 also carried out under that same lock.  Because only one thread can hold
  38 this lock at a given time, your code will be executed single-threaded.
  39 This approach is called "code locking".
  40
  41 Code locking can severely limit both performance and scalability, so it
  42 should be used with caution, and only on code paths that execute rarely.
  43 After all, a huge amount of effort was required to remove the Linux
  44 kernel's old "Big Kernel Lock", so let's please be very careful about
  45 adding new "little kernel locks".
  46
  47 One of the advantages of locking is that, in happy contrast with the
  48 year 1981, almost all kernel developers are very familiar with locking.
  49 The Linux kernel's lockdep (CONFIG_PROVE_LOCKING=y) is very helpful with
  50 the formerly feared deadlock scenarios.
  51
  52 Please use the standard locking primitives provided by the kernel rather
  53 than rolling your own.  For one thing, the standard primitives interact
  54 properly with lockdep.  For another thing, these primitives have been
  55 tuned to deal better with high contention.  And for one final thing, it is
  56 surprisingly hard to correctly code production-quality lock acquisition
  57 and release functions.  After all, even simple non-production-quality
  58 locking functions must carefully prevent both the CPU and the compiler
  59 from moving code in either direction across the locking function.
  60
  61 Despite the scalability limitations of single-threaded code, RCU
  62 takes this approach for much of its grace-period processing and also
  63 for early-boot operation.  The reason RCU is able to scale despite
  64 single-threaded grace-period processing is use of batching, where all
  65 updates that accumulated during one grace period are handled by the
  66 next one.  In other words, slowing down grace-period processing makes
  67 it more efficient.  Nor is RCU unique:  Similar batching optimizations
  68 are used in many I/O operations.
  69
  70
  71 Packaged code
  72 =============
  73
  74 Even if performance and scalability concerns prevent your code from
  75 being completely single-threaded, it is often possible to use library
  76 functions that handle the concurrency nearly or entirely on their own.
  77 This approach delegates any LKMM worries to the library maintainer.
  78
  79 In the kernel, what is the "library"?  Quite a bit.  It includes the
  80 contents of the lib/ directory, much of the include/linux/ directory along
  81 with a lot of other heavily used APIs.  But heavily used examples include
  82 the list macros (for example, include/linux/{,rcu}list.h), workqueues,
  83 smp_call_function(), and the various hash tables and search trees.
  84
  85
  86 Data locking
  87 ============
  88
  89 With code locking, we use single-threaded code execution to guarantee
  90 serialized access to the data that the code is accessing.  However,
  91 we can also achieve this by instead associating the lock with specific
  92 instances of the data structures.  This creates a "critical section"
  93 in the code execution that will execute as though it is single threaded.
  94 By placing all the accesses and modifications to a shared data structure
  95 inside a critical section, we ensure that the execution context that
  96 holds the lock has exclusive access to the shared data.
  97
  98 The poster boy for this approach is the hash table, where placing a lock
  99 in each hash bucket allows operations on different buckets to proceed
 100 concurrently.  This works because the buckets do not overlap with each
 101 other, so that an operation on one bucket does not interfere with any
 102 other bucket.
 103
 104 As the number of buckets increases, data locking scales naturally.
 105 In particular, if the amount of data increases with the number of CPUs,
 106 increasing the number of buckets as the number of CPUs increase results
 107 in a naturally scalable data structure.
 108
 109
 110 Per-CPU processing
 111 ==================
 112
 113 Partitioning processing and data over CPUs allows each CPU to take
 114 a single-threaded approach while providing excellent performance and
 115 scalability.  Of course, there is no free lunch:  The dark side of this
 116 excellence is substantially increased memory footprint.
 117
 118 In addition, it is sometimes necessary to occasionally update some global
 119 view of this processing and data, in which case something like locking
 120 must be used to protect this global view.  This is the approach taken
 121 by the percpu_counter infrastructure. In many cases, there are already
 122 generic/library variants of commonly used per-cpu constructs available.
 123 Please use them rather than rolling your own.
 124
 125 RCU uses DEFINE_PER_CPU*() declaration to create a number of per-CPU
 126 data sets.  For example, each CPU does private quiescent-state processing
 127 within its instance of the per-CPU rcu_data structure, and then uses data
 128 locking to report quiescent states up the grace-period combining tree.
 129
 130
 131 Packaged primitives: Sequence locking
 132 =====================================
 133
 134 Lockless programming is considered by many to be more difficult than
 135 lock-based programming, but there are a few lockless design patterns that
 136 have been built out into an API.  One of these APIs is sequence locking.
 137 Although this APIs can be used in extremely complex ways, there are simple
 138 and effective ways of using it that avoid the need to pay attention to
 139 memory ordering.
 140
 141 The basic keep-things-simple rule for sequence locking is "do not write
 142 in read-side code".  Yes, you can do writes from within sequence-locking
 143 readers, but it won't be so simple.  For example, such writes will be
 144 lockless and should be idempotent.
 145
 146 For more sophisticated use cases, LKMM can guide you, including use
 147 cases involving combining sequence locking with other synchronization
 148 primitives.  (LKMM does not yet know about sequence locking, so it is
 149 currently necessary to open-code it in your litmus tests.)
 150
 151 Additional information may be found in include/linux/seqlock.h.
 152
 153 Packaged primitives: RCU
 154 ========================
 155
 156 Another lockless design pattern that has been baked into an API
 157 is RCU.  The Linux kernel makes sophisticated use of RCU, but the
 158 keep-things-simple rules for RCU are "do not write in read-side code"
 159 and "do not update anything that is visible to and accessed by readers",
 160 and "protect updates with locking".
 161
 162 These rules are illustrated by the functions foo_update_a() and
 163 foo_get_a() shown in Documentation/RCU/whatisRCU.rst.  Additional
 164 RCU usage patterns maybe found in Documentation/RCU and in the
 165 source code.
 166
 167
 168 Packaged primitives: Atomic operations
 169 ======================================
 170
 171 Back in the day, the Linux kernel had three types of atomic operations:
 172
 173 1.      Initialization and read-out, such as atomic_set() and atomic_read().
 174
 175 2.      Operations that did not return a value and provided no ordering,
 176         such as atomic_inc() and atomic_dec().
 177
 178 3.      Operations that returned a value and provided full ordering, such as
 179         atomic_add_return() and atomic_dec_and_test().  Note that some
 180         value-returning operations provide full ordering only conditionally.
 181         For example, cmpxchg() provides ordering only upon success.
 182
 183 More recent kernels have operations that return a value but do not
 184 provide full ordering.  These are flagged with either a _relaxed()
 185 suffix (providing no ordering), or an _acquire() or _release() suffix
 186 (providing limited ordering).
 187
 188 Additional information may be found in these files:
 189
 190 Documentation/atomic_t.txt
 191 Documentation/atomic_bitops.txt
 192 Documentation/core-api/refcount-vs-atomic.rst
 193
 194 Reading code using these primitives is often also quite helpful.
 195
 196
 197 Lockless, fully ordered
 198 =======================
 199
 200 When using locking, there often comes a time when it is necessary
 201 to access some variable or another without holding the data lock
 202 that serializes access to that variable.
 203
 204 If you want to keep things simple, use the initialization and read-out
 205 operations from the previous section only when there are no racing
 206 accesses.  Otherwise, use only fully ordered operations when accessing
 207 or modifying the variable.  This approach guarantees that code prior
 208 to a given access to that variable will be seen by all CPUs has having
 209 happened before any code following any later access to that same variable.
 210
 211 Please note that per-CPU functions are not atomic operations and
 212 hence they do not provide any ordering guarantees at all.
 213
 214 If the lockless accesses are frequently executed reads that are used
 215 only for heuristics, or if they are frequently executed writes that
 216 are used only for statistics, please see the next section.
 217
 218
 219 Lockless statistics and heuristics
 220 ==================================
 221
 222 Unordered primitives such as atomic_read(), atomic_set(), READ_ONCE(), and
 223 WRITE_ONCE() can safely be used in some cases.  These primitives provide
 224 no ordering, but they do prevent the compiler from carrying out a number
 225 of destructive optimizations (for which please see the next section).
 226 One example use for these primitives is statistics, such as per-CPU
 227 counters exemplified by the rt_cache_stat structure's routing-cache
 228 statistics counters.  Another example use case is heuristics, such as
 229 the jiffies_till_first_fqs and jiffies_till_next_fqs kernel parameters
 230 controlling how often RCU scans for idle CPUs.
 231
 232 But be careful.  "Unordered" really does mean "unordered".  It is all
 233 too easy to assume ordering, and this assumption must be avoided when
 234 using these primitives.
 235
 236
 237 Don't let the compiler trip you up
 238 ==================================
 239
 240 It can be quite tempting to use plain C-language accesses for lockless
 241 loads from and stores to shared variables.  Although this is both
 242 possible and quite common in the Linux kernel, it does require a
 243 surprising amount of analysis, care, and knowledge about the compiler.
 244 Yes, some decades ago it was not unfair to consider a C compiler to be
 245 an assembler with added syntax and better portability, but the advent of
 246 sophisticated optimizing compilers mean that those days are long gone.
 247 Today's optimizing compilers can profoundly rewrite your code during the
 248 translation process, and have long been ready, willing, and able to do so.
 249
 250 Therefore, if you really need to use C-language assignments instead of
 251 READ_ONCE(), WRITE_ONCE(), and so on, you will need to have a very good
 252 understanding of both the C standard and your compiler.  Here are some
 253 introductory references and some tooling to start you on this noble quest:
 254
 255 Who's afraid of a big bad optimizing compiler?
 256         https://lwn.net/Articles/793253/
 257 Calibrating your fear of big bad optimizing compilers
 258         https://lwn.net/Articles/799218/
 259 Concurrency bugs should fear the big bad data-race detector (part 1)
 260         https://lwn.net/Articles/816850/
 261 Concurrency bugs should fear the big bad data-race detector (part 2)
 262         https://lwn.net/Articles/816854/
 263
 264
 265 More complex use cases
 266 ======================
 267
 268 If the alternatives above do not do what you need, please look at the
 269 recipes-pairs.txt file to peel off the next layer of the memory-ordering
 270 onion.