Documentation/bpf/classic_vs_extended.rst

   1
   2 ===================
   3 Classic BPF vs eBPF
   4 ===================
   5
   6 eBPF is designed to be JITed with one to one mapping, which can also open up
   7 the possibility for GCC/LLVM compilers to generate optimized eBPF code through
   8 an eBPF backend that performs almost as fast as natively compiled code.
   9
  10 Some core changes of the eBPF format from classic BPF:
  11
  12 - Number of registers increase from 2 to 10:
  13
  14   The old format had two registers A and X, and a hidden frame pointer. The
  15   new layout extends this to be 10 internal registers and a read-only frame
  16   pointer. Since 64-bit CPUs are passing arguments to functions via registers
  17   the number of args from eBPF program to in-kernel function is restricted
  18   to 5 and one register is used to accept return value from an in-kernel
  19   function. Natively, x86_64 passes first 6 arguments in registers, aarch64/
  20   sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved
  21   registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers.
  22
  23   Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64,
  24   etc, and eBPF calling convention maps directly to ABIs used by the kernel on
  25   64-bit architectures.
  26
  27   On 32-bit architectures JIT may map programs that use only 32-bit arithmetic
  28   and may let more complex programs to be interpreted.
  29
  30   R0 - R5 are scratch registers and eBPF program needs spill/fill them if
  31   necessary across calls. Note that there is only one eBPF program (== one
  32   eBPF main routine) and it cannot call other eBPF functions, it can only
  33   call predefined in-kernel functions, though.
  34
  35 - Register width increases from 32-bit to 64-bit:
  36
  37   Still, the semantics of the original 32-bit ALU operations are preserved
  38   via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower
  39   subregisters that zero-extend into 64-bit if they are being written to.
  40   That behavior maps directly to x86_64 and arm64 subregister definition, but
  41   makes other JITs more difficult.
  42
  43   32-bit architectures run 64-bit eBPF programs via interpreter.
  44   Their JITs may convert BPF programs that only use 32-bit subregisters into
  45   native instruction set and let the rest being interpreted.
  46
  47   Operation is 64-bit, because on 64-bit architectures, pointers are also
  48   64-bit wide, and we want to pass 64-bit values in/out of kernel functions,
  49   so 32-bit eBPF registers would otherwise require to define register-pair
  50   ABI, thus, there won't be able to use a direct eBPF register to HW register
  51   mapping and JIT would need to do combine/split/move operations for every
  52   register in and out of the function, which is complex, bug prone and slow.
  53   Another reason is the use of atomic 64-bit counters.
  54
  55 - Conditional jt/jf targets replaced with jt/fall-through:
  56
  57   While the original design has constructs such as ``if (cond) jump_true;
  58   else jump_false;``, they are being replaced into alternative constructs like
  59   ``if (cond) jump_true; /* else fall-through */``.
  60
  61 - Introduces bpf_call insn and register passing convention for zero overhead
  62   calls from/to other kernel functions:
  63
  64   Before an in-kernel function call, the eBPF program needs to
  65   place function arguments into R1 to R5 registers to satisfy calling
  66   convention, then the interpreter will take them from registers and pass
  67   to in-kernel function. If R1 - R5 registers are mapped to CPU registers
  68   that are used for argument passing on given architecture, the JIT compiler
  69   doesn't need to emit extra moves. Function arguments will be in the correct
  70   registers and BPF_CALL instruction will be JITed as single 'call' HW
  71   instruction. This calling convention was picked to cover common call
  72   situations without performance penalty.
  73
  74   After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has
  75   a return value of the function. Since R6 - R9 are callee saved, their state
  76   is preserved across the call.
  77
  78   For example, consider three C functions::
  79
  80     u64 f1() { return (*_f2)(1); }
  81     u64 f2(u64 a) { return f3(a + 1, a); }
  82     u64 f3(u64 a, u64 b) { return a - b; }
  83
  84   GCC can compile f1, f3 into x86_64::
  85
  86     f1:
  87         movl $1, %edi
  88         movq _f2(%rip), %rax
  89         jmp  *%rax
  90     f3:
  91         movq %rdi, %rax
  92         subq %rsi, %rax
  93         ret
  94
  95   Function f2 in eBPF may look like::
  96
  97     f2:
  98         bpf_mov R2, R1
  99         bpf_add R1, 1
 100         bpf_call f3
 101         bpf_exit
 102
 103   If f2 is JITed and the pointer stored to ``_f2``. The calls f1 -> f2 -> f3 and
 104   returns will be seamless. Without JIT, __bpf_prog_run() interpreter needs to
 105   be used to call into f2.
 106
 107   For practical reasons all eBPF programs have only one argument 'ctx' which is
 108   already placed into R1 (e.g. on __bpf_prog_run() startup) and the programs
 109   can call kernel functions with up to 5 arguments. Calls with 6 or more arguments
 110   are currently not supported, but these restrictions can be lifted if necessary
 111   in the future.
 112
 113   On 64-bit architectures all register map to HW registers one to one. For
 114   example, x86_64 JIT compiler can map them as ...
 115
 116   ::
 117
 118     R0 - rax
 119     R1 - rdi
 120     R2 - rsi
 121     R3 - rdx
 122     R4 - rcx
 123     R5 - r8
 124     R6 - rbx
 125     R7 - r13
 126     R8 - r14
 127     R9 - r15
 128     R10 - rbp
 129
 130   ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing
 131   and rbx, r12 - r15 are callee saved.
 132
 133   Then the following eBPF pseudo-program::
 134
 135     bpf_mov R6, R1 /* save ctx */
 136     bpf_mov R2, 2
 137     bpf_mov R3, 3
 138     bpf_mov R4, 4
 139     bpf_mov R5, 5
 140     bpf_call foo
 141     bpf_mov R7, R0 /* save foo() return value */
 142     bpf_mov R1, R6 /* restore ctx for next call */
 143     bpf_mov R2, 6
 144     bpf_mov R3, 7
 145     bpf_mov R4, 8
 146     bpf_mov R5, 9
 147     bpf_call bar
 148     bpf_add R0, R7
 149     bpf_exit
 150
 151   After JIT to x86_64 may look like::
 152
 153     push %rbp
 154     mov %rsp,%rbp
 155     sub $0x228,%rsp
 156     mov %rbx,-0x228(%rbp)
 157     mov %r13,-0x220(%rbp)
 158     mov %rdi,%rbx
 159     mov $0x2,%esi
 160     mov $0x3,%edx
 161     mov $0x4,%ecx
 162     mov $0x5,%r8d
 163     callq foo
 164     mov %rax,%r13
 165     mov %rbx,%rdi
 166     mov $0x6,%esi
 167     mov $0x7,%edx
 168     mov $0x8,%ecx
 169     mov $0x9,%r8d
 170     callq bar
 171     add %r13,%rax
 172     mov -0x228(%rbp),%rbx
 173     mov -0x220(%rbp),%r13
 174     leaveq
 175     retq
 176
 177   Which is in this example equivalent in C to::
 178
 179     u64 bpf_filter(u64 ctx)
 180     {
 181         return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9);
 182     }
 183
 184   In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64
 185   arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper
 186   registers and place their return value into ``%rax`` which is R0 in eBPF.
 187   Prologue and epilogue are emitted by JIT and are implicit in the
 188   interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve
 189   them across the calls as defined by calling convention.
 190
 191   For example the following program is invalid::
 192
 193     bpf_mov R1, 1
 194     bpf_call foo
 195     bpf_mov R0, R1
 196     bpf_exit
 197
 198   After the call the registers R1-R5 contain junk values and cannot be read.
 199   An in-kernel verifier.rst is used to validate eBPF programs.
 200
 201 Also in the new design, eBPF is limited to 4096 insns, which means that any
 202 program will terminate quickly and will only call a fixed number of kernel
 203 functions. Original BPF and eBPF are two operand instructions,
 204 which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT.
 205
 206 The input context pointer for invoking the interpreter function is generic,
 207 its content is defined by a specific use case. For seccomp register R1 points
 208 to seccomp_data, for converted BPF filters R1 points to a skb.
 209
 210 A program, that is translated internally consists of the following elements::
 211
 212   op:16, jt:8, jf:8, k:32    ==>    op:8, dst_reg:4, src_reg:4, off:16, imm:32
 213
 214 So far 87 eBPF instructions were implemented. 8-bit 'op' opcode field
 215 has room for new instructions. Some of them may use 16/24/32 byte encoding. New
 216 instructions must be multiple of 8 bytes to preserve backward compatibility.
 217
 218 eBPF is a general purpose RISC instruction set. Not every register and
 219 every instruction are used during translation from original BPF to eBPF.
 220 For example, socket filters are not using ``exclusive add`` instruction, but
 221 tracing filters may do to maintain counters of events, for example. Register R9
 222 is not used by socket filters either, but more complex filters may be running
 223 out of registers and would have to resort to spill/fill to stack.
 224
 225 eBPF can be used as a generic assembler for last step performance
 226 optimizations, socket filters and seccomp are using it as assembler. Tracing
 227 filters may use it as assembler to generate code from kernel. In kernel usage
 228 may not be bounded by security considerations, since generated eBPF code
 229 may be optimizing internal code path and not being exposed to the user space.
 230 Safety of eBPF can come from the verifier.rst. In such use cases as
 231 described, it may be used as safe instruction set.
 232
 233 Just like the original BPF, eBPF runs within a controlled environment,
 234 is deterministic and the kernel can easily prove that. The safety of the program
 235 can be determined in two steps: first step does depth-first-search to disallow
 236 loops and other CFG validation; second step starts from the first insn and
 237 descends all possible paths. It simulates execution of every insn and observes
 238 the state change of registers and stack.
 239
 240 opcode encoding
 241 ===============
 242
 243 eBPF is reusing most of the opcode encoding from classic to simplify conversion
 244 of classic BPF to eBPF.
 245
 246 For arithmetic and jump instructions the 8-bit 'code' field is divided into three
 247 parts::
 248
 249   +----------------+--------+--------------------+
 250   |   4 bits       |  1 bit |   3 bits           |
 251   | operation code | source | instruction class  |
 252   +----------------+--------+--------------------+
 253   (MSB)                                      (LSB)
 254
 255 Three LSB bits store instruction class which is one of:
 256
 257   ===================     ===============
 258   Classic BPF classes     eBPF classes
 259   ===================     ===============
 260   BPF_LD    0x00          BPF_LD    0x00
 261   BPF_LDX   0x01          BPF_LDX   0x01
 262   BPF_ST    0x02          BPF_ST    0x02
 263   BPF_STX   0x03          BPF_STX   0x03
 264   BPF_ALU   0x04          BPF_ALU   0x04
 265   BPF_JMP   0x05          BPF_JMP   0x05
 266   BPF_RET   0x06          BPF_JMP32 0x06
 267   BPF_MISC  0x07          BPF_ALU64 0x07
 268   ===================     ===============
 269
 270 The 4th bit encodes the source operand ...
 271
 272     ::
 273
 274         BPF_K     0x00
 275         BPF_X     0x08
 276
 277  * in classic BPF, this means::
 278
 279         BPF_SRC(code) == BPF_X - use register X as source operand
 280         BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
 281
 282  * in eBPF, this means::
 283
 284         BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand
 285         BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
 286
 287 ... and four MSB bits store operation code.
 288
 289 If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of::
 290
 291   BPF_ADD   0x00
 292   BPF_SUB   0x10
 293   BPF_MUL   0x20
 294   BPF_DIV   0x30
 295   BPF_OR    0x40
 296   BPF_AND   0x50
 297   BPF_LSH   0x60
 298   BPF_RSH   0x70
 299   BPF_NEG   0x80
 300   BPF_MOD   0x90
 301   BPF_XOR   0xa0
 302   BPF_MOV   0xb0  /* eBPF only: mov reg to reg */
 303   BPF_ARSH  0xc0  /* eBPF only: sign extending shift right */
 304   BPF_END   0xd0  /* eBPF only: endianness conversion */
 305
 306 If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of::
 307
 308   BPF_JA    0x00  /* BPF_JMP only */
 309   BPF_JEQ   0x10
 310   BPF_JGT   0x20
 311   BPF_JGE   0x30
 312   BPF_JSET  0x40
 313   BPF_JNE   0x50  /* eBPF only: jump != */
 314   BPF_JSGT  0x60  /* eBPF only: signed '>' */
 315   BPF_JSGE  0x70  /* eBPF only: signed '>=' */
 316   BPF_CALL  0x80  /* eBPF BPF_JMP only: function call */
 317   BPF_EXIT  0x90  /* eBPF BPF_JMP only: function return */
 318   BPF_JLT   0xa0  /* eBPF only: unsigned '<' */
 319   BPF_JLE   0xb0  /* eBPF only: unsigned '<=' */
 320   BPF_JSLT  0xc0  /* eBPF only: signed '<' */
 321   BPF_JSLE  0xd0  /* eBPF only: signed '<=' */
 322
 323 So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF
 324 and eBPF. There are only two registers in classic BPF, so it means A += X.
 325 In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly,
 326 BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and analogous
 327 src_reg = (u32) src_reg ^ (u32) imm32 in eBPF.
 328
 329 Classic BPF is using BPF_MISC class to represent A = X and X = A moves.
 330 eBPF is using BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no
 331 BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean
 332 exactly the same operations as BPF_ALU, but with 64-bit wide operands
 333 instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.:
 334 dst_reg = dst_reg + src_reg
 335
 336 Classic BPF wastes the whole BPF_RET class to represent a single ``ret``
 337 operation. Classic BPF_RET | BPF_K means copy imm32 into return register
 338 and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT
 339 in eBPF means function exit only. The eBPF program needs to store return
 340 value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is used as
 341 BPF_JMP32 to mean exactly the same operations as BPF_JMP, but with 32-bit wide
 342 operands for the comparisons instead.
 343
 344 For load and store instructions the 8-bit 'code' field is divided as::
 345
 346   +--------+--------+-------------------+
 347   | 3 bits | 2 bits |   3 bits          |
 348   |  mode  |  size  | instruction class |
 349   +--------+--------+-------------------+
 350   (MSB)                             (LSB)
 351
 352 Size modifier is one of ...
 353
 354 ::
 355
 356   BPF_W   0x00    /* word */
 357   BPF_H   0x08    /* half word */
 358   BPF_B   0x10    /* byte */
 359   BPF_DW  0x18    /* eBPF only, double word */
 360
 361 ... which encodes size of load/store operation::
 362
 363  B  - 1 byte
 364  H  - 2 byte
 365  W  - 4 byte
 366  DW - 8 byte (eBPF only)
 367
 368 Mode modifier is one of::
 369
 370   BPF_IMM     0x00  /* used for 32-bit mov in classic BPF and 64-bit in eBPF */
 371   BPF_ABS     0x20
 372   BPF_IND     0x40
 373   BPF_MEM     0x60
 374   BPF_LEN     0x80  /* classic BPF only, reserved in eBPF */
 375   BPF_MSH     0xa0  /* classic BPF only, reserved in eBPF */
 376   BPF_ATOMIC  0xc0  /* eBPF only, atomic operations */