===================
Reliable Stacktrace
===================

This document outlines basic information about reliable stacktracing.

.. Table of Contents:

.. contents:: :local:

1. Introduction
===============

The kernel livepatch consistency model relies on accurately identifying which
functions may have live state and therefore may not be safe to patch. One way
to identify which functions are live is to use a stacktrace.

Existing stacktrace code may not always give an accurate picture of all
functions with live state, and best-effort approaches which can be helpful for
debugging are unsound for livepatching. Livepatching depends on architectures
to provide a *reliable* stacktrace which ensures it never omits any live
functions from a trace.


2. Requirements
===============

Architectures must implement one of the reliable stacktrace functions.
Architectures using CONFIG_ARCH_STACKWALK must implement
'arch_stack_walk_reliable', and other architectures must implement
'save_stack_trace_tsk_reliable'.

Principally, the reliable stacktrace function must ensure that either:

* The trace includes all functions that the task may be returned to, and the
  return code is zero to indicate that the trace is reliable.

* The return code is non-zero to indicate that the trace is not reliable.

.. note::
   In some cases it is legitimate to omit specific functions from the trace,
   but all other functions must be reported. These cases are described in
   further detail below.

Secondly, the reliable stacktrace function must be robust to cases where
the stack or other unwind state is corrupt or otherwise unreliable. The
function should attempt to detect such cases and return a non-zero error
code, and should not get stuck in an infinite loop or access memory in
an unsafe way. Specific cases are described in further detail below.
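
The two requirements above can be illustrated with a minimal user-space
sketch of a frame-pointer walk. This is not kernel code: 'struct
frame_record', 'walk_reliable' and 'count_entry' are hypothetical names, the
'consume_fn' type is only modelled on the kernel's consume callback, and the
frame layout (saved frame pointer followed by return address, with a zero
frame pointer marking the final record) is an assumption made for
illustration.

.. code-block:: c

   #include <errno.h>
   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   /* Hypothetical frame record: saved frame pointer, then return address. */
   struct frame_record {
       uintptr_t next_fp;
       uintptr_t ret_addr;
   };

   /* Callback type modelled on the kernel's consume callback. */
   typedef bool (*consume_fn)(void *cookie, unsigned long addr);

   /* Example consumer: counts reported entries through an int cookie. */
   bool count_entry(void *cookie, unsigned long addr)
   {
       (void)addr;
       ++*(int *)cookie;
       return true;
   }

   /*
    * Walk frame records that must lie within [stack_lo, stack_hi).
    * Returns 0 only if the walk reached the final record (next_fp == 0);
    * any inconsistency yields -EINVAL so the trace is treated as unreliable.
    */
   int walk_reliable(uintptr_t fp, uintptr_t stack_lo, uintptr_t stack_hi,
                     consume_fn consume, void *cookie)
   {
       while (fp) {
           const struct frame_record *rec;

           /* Reject misaligned or out-of-bounds frame pointers. */
           if (fp % sizeof(uintptr_t))
               return -EINVAL;
           if (fp < stack_lo || fp >= stack_hi ||
               stack_hi - fp < sizeof(*rec))
               return -EINVAL;

           rec = (const struct frame_record *)fp;
           if (!consume(cookie, (unsigned long)rec->ret_addr))
               return -EINVAL;

           /* Records must move strictly up the stack, so the loop ends. */
           if (rec->next_fp && rec->next_fp <= fp)
               return -EINVAL;
           fp = rec->next_fp;
       }
       return 0;
   }

The strictly-increasing check is one simple way to rule out infinite loops on
a corrupt frame chain; a real unwinder would also need the checks described
in the following sections.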


3. Compile-time analysis
========================

To ensure that kernel code can be correctly unwound in all cases,
architectures may need to verify that code has been compiled in a manner
expected by the unwinder. For example, an unwinder may expect that
functions manipulate the stack pointer in a limited way, or that all
functions use specific prologue and epilogue sequences. Architectures
with such requirements should verify the kernel compilation using
objtool.

In some cases, an unwinder may require metadata to correctly unwind.
Where necessary, this metadata should be generated at build time using
objtool.


4. Considerations
=================

The unwinding process varies across architectures, their respective procedure
call standards, and kernel configurations. This section describes common
details that architectures should consider.

4.1 Identifying successful termination
--------------------------------------

Unwinding may terminate early for a number of reasons, including:

* Stack or frame pointer corruption.

* Missing unwind support for an uncommon scenario, or a bug in the unwinder.

* Dynamically generated code (e.g. eBPF) or foreign code (e.g. EFI runtime
  services) not following the conventions expected by the unwinder.

To ensure that this does not result in functions being omitted from the trace,
even if not caught by other checks, it is strongly recommended that
architectures verify that a stacktrace ends at an expected location, e.g.

* Within a specific function that is an entry point to the kernel.

* At a specific location on a stack expected for a kernel entry point.

* On a specific stack expected for a kernel entry point (e.g. if the
  architecture has separate task and IRQ stacks).
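
As a rough illustration of the second check in the list above, the sketch
below compares the last frame pointer reached by an unwind against the
location where the entry frame is expected. The structure, its fields and the
fixed offset from the top of the stack are all hypothetical; where the entry
frame actually lives is architecture specific.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* Hypothetical description of where a task's outermost frame must be. */
   struct task_stack_info {
       uintptr_t stack_base;      /* lowest address of the task stack */
       uintptr_t stack_size;      /* size of the task stack in bytes */
       uintptr_t final_frame_off; /* assumed offset of the entry frame
                                     below the top of the stack */
   };

   /* True only if the unwind stopped exactly at the expected entry frame. */
   bool unwind_ended_at_entry(const struct task_stack_info *info,
                              uintptr_t last_fp)
   {
       uintptr_t top = info->stack_base + info->stack_size;

       return last_fp == top - info->final_frame_off;
   }

An unwinder using a check like this would report the trace as unreliable
whenever the function returns false, even if every earlier step appeared to
succeed.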

4.2 Identifying unwindable code
-------------------------------

Unwinding typically relies on code following specific conventions (e.g.
manipulating a frame pointer), but there can be code which may not follow these
conventions and may require special handling in the unwinder, e.g.

* Exception vectors and entry assembly.

* Procedure Linkage Table (PLT) entries and veneer functions.

* Trampoline assembly (e.g. ftrace, kprobes).

* Dynamically generated code (e.g. eBPF, optprobe trampolines).

* Foreign code (e.g. EFI runtime services).

To ensure that such cases do not result in functions being omitted from a
trace, it is strongly recommended that architectures positively identify code
which is known to be reliable to unwind from, and reject unwinding from all
other code.

Kernel code, including modules and eBPF, can be distinguished from foreign code
using '__kernel_text_address()'. Checking for this also helps to detect stack
corruption.

There are several ways an architecture may identify kernel code which is deemed
unreliable to unwind from, e.g.

* Placing such code into special linker sections, and rejecting unwinding from
  any code in these sections.

* Identifying specific portions of code using bounds information.
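
The combination of positive identification and section-based rejection
described above might look like the following user-space sketch. The
'addr_range' type and 'pc_is_unwindable' are hypothetical; in the kernel the
text check would be done with '__kernel_text_address()' and the ranges would
come from linker-provided section bounds.

.. code-block:: c

   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   /* Hypothetical half-open address range [start, end). */
   struct addr_range {
       uintptr_t start;
       uintptr_t end;
   };

   static bool in_range(const struct addr_range *r, uintptr_t addr)
   {
       return addr >= r->start && addr < r->end;
   }

   /*
    * Return true only for addresses inside known text and outside every
    * range marked as unreliable to unwind from (e.g. entry assembly or
    * trampolines placed in a special section).
    */
   bool pc_is_unwindable(uintptr_t pc, const struct addr_range *text,
                         const struct addr_range *unreliable,
                         size_t nr_unreliable)
   {
       size_t i;

       if (!in_range(text, pc))
           return false; /* foreign code, or a corrupt value */

       for (i = 0; i < nr_unreliable; i++) {
           if (in_range(&unreliable[i], pc))
               return false;
       }
       return true;
   }

The default is rejection: an address is only accepted once it has been
positively identified, so unknown code can never be silently unwound through.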

4.3 Unwinding across interrupts and exceptions
----------------------------------------------

At function call boundaries the stack and other unwind state are expected to be
in a consistent state suitable for reliable unwinding, but this may not be the
case part-way through a function. For example, during a function prologue or
epilogue a frame pointer may be transiently invalid, or during the function
body the return address may be held in an arbitrary general purpose register.
For some architectures this may change at runtime as a result of dynamic
instrumentation.

If an interrupt or other exception is taken while the stack or other unwind
state is in an inconsistent state, it may not be possible to reliably unwind,
and it may not be possible to identify whether such unwinding will be reliable.
See below for examples.

Architectures which cannot identify when it is reliable to unwind such cases
(or where it is never reliable) must reject unwinding across exception
boundaries. Note that it may be reliable to unwind across certain
exceptions (e.g. IRQ) but unreliable to unwind across other exceptions
(e.g. NMI).

Architectures which can identify when it is reliable to unwind such cases (or
have no such cases) should attempt to unwind across exception boundaries, as
doing so can prevent unnecessarily stalling livepatch consistency checks and
permits livepatch transitions to complete more quickly.

4.4 Rewriting of return addresses
---------------------------------

Some trampolines temporarily modify the return address of a function in order
to intercept when that function returns with a return trampoline, e.g.

* An ftrace trampoline may modify the return address so that function graph
  tracing can intercept returns.

* A kprobes (or optprobes) trampoline may modify the return address so that
  kretprobes can intercept returns.

When this happens, the original return address will not be in its usual
location. For trampolines which are not subject to live patching, where an
unwinder can reliably determine the original return address and no unwind state
is altered by the trampoline, the unwinder may report the original return
address in place of the trampoline and report this as reliable. Otherwise, an
unwinder must report these cases as unreliable.

Special care is required when identifying the original return address, as this
information is not in a consistent location for the duration of the entry
trampoline or return trampoline. For example, considering the x86_64
'return_to_handler' return trampoline:

.. code-block:: none

   SYM_CODE_START(return_to_handler)
           UNWIND_HINT_EMPTY
           subq  $24, %rsp

           /* Save the return values */
           movq %rax, (%rsp)
           movq %rdx, 8(%rsp)
           movq %rbp, %rdi

           call ftrace_return_to_handler

           movq %rax, %rdi
           movq 8(%rsp), %rdx
           movq (%rsp), %rax
           addq $24, %rsp
           JMP_NOSPEC rdi
   SYM_CODE_END(return_to_handler)

While the traced function runs, its return address on the stack points to
the start of return_to_handler, and the original return address is stored in
the task's cur_ret_stack. During this time the unwinder can find the return
address using ftrace_graph_ret_addr().

When the traced function returns to return_to_handler, there is no longer a
return address on the stack, though the original return address is still stored
in the task's cur_ret_stack. Within ftrace_return_to_handler(), the original
return address is removed from cur_ret_stack and is transiently moved
arbitrarily by the compiler before being returned in rax. The return_to_handler
trampoline moves this into rdi before jumping to it.

Architectures might not always be able to unwind such sequences, such as when
ftrace_return_to_handler() has removed the address from cur_ret_stack, and the
location of the return address cannot be reliably determined.

It is recommended that architectures unwind cases where return_to_handler has
not yet been returned to, but architectures are not required to unwind from the
middle of return_to_handler and can report this as unreliable. Architectures
are not required to unwind from other trampolines which modify the return
address.
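
The recovery step for the case where the traced function is still running can
be modelled in user space as below. This is a simplified sketch and does not
reproduce the kernel's data structures or the signature of
ftrace_graph_ret_addr(): 'struct ret_stack' and 'recover_ret_addr' are
hypothetical, and entries are assumed to be consumed from most recent to
oldest as the unwinder moves from the innermost frame outwards.

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>

   #define RET_STACK_DEPTH 8

   /* Hypothetical stand-in for the per-task return stack used by fgraph. */
   struct ret_stack {
       uintptr_t orig_ret[RET_STACK_DEPTH]; /* original return addresses */
       int depth;                           /* number of valid entries */
   };

   /*
    * Translate a return address found on the stack. *idx counts how many
    * entries this unwind has consumed so far. Returns the original return
    * address, or 0 if 'ret' is the trampoline but no entry remains, which
    * the caller should treat as unreliable.
    */
   uintptr_t recover_ret_addr(const struct ret_stack *rs, int *idx,
                              uintptr_t ret, uintptr_t trampoline)
   {
       int i;

       if (ret != trampoline)
           return ret; /* this return address was not rewritten */

       i = rs->depth - 1 - *idx;
       if (i < 0)
           return 0;   /* original address cannot be determined */

       (*idx)++;
       return rs->orig_ret[i];
   }

Keeping the consumed-entry count in the unwind state, rather than modifying
the return stack, lets the unwinder inspect a task without disturbing it.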

4.5 Obscuring of return addresses
---------------------------------

Some trampolines do not rewrite the return address in order to intercept
returns, but do transiently clobber the return address or other unwind state.

For example, the x86_64 implementation of optprobes patches the probed function
with a JMP instruction which targets the associated optprobe trampoline. When
the probe is hit, the CPU will branch to the optprobe trampoline, and the
address of the probed function is not held in any register or on the stack.

Similarly, the arm64 implementation of DYNAMIC_FTRACE_WITH_REGS patches traced
functions with the following:

.. code-block:: none

   MOV X9, X30
   BL <trampoline>

The MOV saves the link register (X30) into X9 to preserve the return address
before the BL clobbers the link register and branches to the trampoline. At the
start of the trampoline, the return address of the traced function is in X9
rather than the link register as would usually be the case.

Architectures must ensure that unwinders either reliably unwind such cases or
report the unwinding as unreliable.

4.6 Link register unreliability
-------------------------------

On some architectures, 'call' instructions place the return address into a
link register, and 'return' instructions consume the return address from the
link register without modifying the register. On these architectures software
must save the return address to the stack prior to making a function call. Over
the duration of a function call, the return address may be held in the link
register alone, on the stack alone, or in both locations.

Unwinders typically assume the link register is always live, but this
assumption can lead to unreliable stack traces. For example, consider the
following arm64 assembly for a simple function:

.. code-block:: none

   function:
           STP X29, X30, [SP, #-16]!
           MOV X29, SP
           BL <other_function>
           LDP X29, X30, [SP], #16
           RET

At entry to the function, the link register (X30) points to the caller, and the
frame pointer (X29) points to the caller's frame including the caller's return
address. The first two instructions create a new stackframe and update the
frame pointer, and at this point the link register and the frame pointer both
describe this function's return address. A trace at this point may describe
this function twice, and if the function return is being traced, the unwinder
may consume two entries from the fgraph return stack rather than one entry.

The BL invokes 'other_function' with the link register pointing to this
function's LDP and the frame pointer pointing to this function's stackframe.
When 'other_function' returns, the link register is left pointing at the
instruction after the BL, and so a trace at this point could result in
'function' appearing twice in the backtrace.

Similarly, a function may deliberately clobber the LR, e.g.

.. code-block:: none

   caller:
           STP X29, X30, [SP, #-16]!
           MOV X29, SP
           ADR LR, <callee>
           BLR LR
           LDP X29, X30, [SP], #16
           RET

The ADR places the address of 'callee' into the LR, before the BLR branches to
this address. If a trace is made immediately after the ADR, 'callee' will
appear to be the parent of 'caller', rather than the child.

Due to cases such as the above, it may only be possible to reliably consume a
link register value at a function call boundary. Architectures where this is
the case must reject unwinding across exception boundaries unless they can
reliably identify when the LR or stack value should be used (e.g. using
metadata generated by objtool).
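
One way to picture such metadata is a table of PC ranges recording where the
return address lives in each range. The sketch below is a user-space model
under that assumption; the 'ra_hint' format and 'ra_at_exception' are
hypothetical and do not describe any metadata that objtool actually emits.

.. code-block:: c

   #include <errno.h>
   #include <stddef.h>
   #include <stdint.h>

   /* Hypothetical per-range metadata: where the return address lives. */
   enum ra_location { RA_IN_LR, RA_ON_STACK };

   struct ra_hint {
       uintptr_t start;      /* first PC covered (inclusive) */
       uintptr_t end;        /* end of the range (exclusive) */
       enum ra_location loc;
   };

   /*
    * Pick the return address for a PC interrupted by an exception.
    * Returns 0 and stores the address in *ra, or -EINVAL if no hint covers
    * the PC, in which case the trace must be reported as unreliable.
    */
   int ra_at_exception(uintptr_t pc, uintptr_t lr, uintptr_t stack_ra,
                       const struct ra_hint *hints, size_t nr_hints,
                       uintptr_t *ra)
   {
       size_t i;

       for (i = 0; i < nr_hints; i++) {
           if (pc < hints[i].start || pc >= hints[i].end)
               continue;
           *ra = (hints[i].loc == RA_IN_LR) ? lr : stack_ra;
           return 0;
       }
       return -EINVAL;
   }

For the simple 'function' shown earlier, hints of this kind would mark the LR
as the source before the frame record is set up and after the LDP, and the
stack as the source in between, including after 'other_function' has returned
and left a stale value in the LR.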