Andrii Nakryiko [Thu, 16 Feb 2023 04:59:53 +0000 (20:59 -0800)]
selftests/bpf: Convert test_global_funcs test to test_loader framework
Convert 17 test_global_funcs subtests into the test_loader framework for
easier maintenance and a more declarative way to define expected
failures/successes.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230216045954.3002473-3-andrii@kernel.org
Andrii Nakryiko [Thu, 16 Feb 2023 04:59:52 +0000 (20:59 -0800)]
bpf: Fix global subprog context argument resolution logic
A KPROBE program's user-facing context type is defined as the typedef
bpf_user_pt_regs_t. This leads to a problem when trying to pass a
kprobe/uprobe/usdt context argument into a global subprog, as the kernel
always strips away mods and typedefs of the user-supplied type, but takes
the expected type from bpf_ctx_convert as is, which causes a mismatch.
The current way to work around this is to define a fake struct with the
same name as the expected typedef:
struct bpf_user_pt_regs_t {};
__noinline int my_global_subprog(struct bpf_user_pt_regs_t *ctx) { ... }
This patch fixes the issue by resolving the expected type if it's not
a struct. It still leaves the above workaround functional for backwards
compatibility.
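As an illustration only (a hedged sketch, not part of the patch; the section
and program names are made up), a kprobe program can now pass its context
directly to a global subprog typed with the real typedef:

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>

	/* With the fix, the verifier resolves the bpf_user_pt_regs_t typedef
	 * when matching the context argument, so no dummy struct is needed.
	 * The subprog body is intentionally trivial.
	 */
	__noinline int my_global_subprog(bpf_user_pt_regs_t *regs)
	{
		return 0;
	}

	SEC("kprobe")
	int handle_kprobe(bpf_user_pt_regs_t *ctx)
	{
		return my_global_subprog(ctx);
	}

	char _license[] SEC("license") = "GPL";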
Fixes: 91cc1a99740e ("bpf: Annotate context types")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230216045954.3002473-2-andrii@kernel.org
Hengqi Chen [Tue, 14 Feb 2023 15:26:33 +0000 (15:26 +0000)]
LoongArch, bpf: Use 4 instructions for function address in JIT
This patch fixes the following issue of function calls in JIT, like:
[ 29.346981] multi-func JIT bug 105 != 103
The issue can be reproduced by running the "inline simple bpf_loop call"
verifier test.
This is because we are emitting 2-4 instructions for 64-bit immediate moves.
During the first pass of JIT, the placeholder address is zero, so two
instructions are emitted for it. In the extra pass, the function address is
in XKVRANGE, so four instructions are emitted for it. This changes the
instruction index in the JIT context. Let's always use 4 instructions for a
function address in JIT, so that the instruction sequences don't change
between the first pass and the extra pass for function calls.
Fixes: 5dc615520c4d ("LoongArch: Add BPF JIT support")
Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Link: https://lore.kernel.org/bpf/20230214152633.2265699-1-hengqi.chen@gmail.com
Martin KaFai Lau [Fri, 17 Feb 2023 00:41:48 +0000 (16:41 -0800)]
bpf: bpf_fib_lookup should not return neigh in NUD_FAILED state
The bpf_fib_lookup() helper not only looks up the fib (i.e. route)
but also looks up the neigh. Before returning the neigh, the helper
does not check for NUD_VALID. When a neigh state (neigh->nud_state)
is in NUD_FAILED, its dmac (neigh->ha) could be all zeros. The helper
still returns SUCCESS instead of NO_NEIGH in this case. Because of the
SUCCESS return value, the bpf prog directly uses the returned dmac
and ends up filling all zeros into the eth header.
This patch checks for NUD_VALID and returns NO_NEIGH if the neigh is
not valid.
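For illustration, a hedged sketch (an illustrative XDP forwarder, not taken
from this patch) of how a program consumes the return value, which with this
change reports NO_NEIGH instead of SUCCESS with a zeroed dmac:

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>

	SEC("xdp")
	int fwd_prog(struct xdp_md *ctx)
	{
		void *data = (void *)(long)ctx->data;
		void *data_end = (void *)(long)ctx->data_end;
		struct ethhdr *eth = data;
		struct bpf_fib_lookup params = {};
		long rc;

		if ((void *)(eth + 1) > data_end)
			return XDP_DROP;

		params.ifindex = ctx->ingress_ifindex;
		/* ... fill in family and src/dst addresses from the packet ... */

		rc = bpf_fib_lookup(ctx, &params, sizeof(params), 0);
		if (rc != BPF_FIB_LKUP_RET_SUCCESS)
			/* covers BPF_FIB_LKUP_RET_NO_NEIGH for an unresolved neigh */
			return XDP_PASS;

		/* dmac/smac are only trusted on SUCCESS */
		__builtin_memcpy(eth->h_dest, params.dmac, sizeof(eth->h_dest));
		__builtin_memcpy(eth->h_source, params.smac, sizeof(eth->h_source));
		return bpf_redirect(params.ifindex, 0);
	}

	char _license[] SEC("license") = "GPL";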
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230217004150.2980689-3-martin.lau@linux.dev
Martin KaFai Lau [Fri, 17 Feb 2023 00:41:47 +0000 (16:41 -0800)]
bpf: Disable bh in bpf_test_run for xdp and tc prog
Some of the bpf helpers require bh to be disabled, e.g. the bpf_fib_lookup
helper that will be used in a later selftest. In particular, it
calls ___neigh_lookup_noref, which expects bh to be disabled.
This patch disables bh before calling bpf_prog_run[_xdp], so
the testing prog can also use those helpers.
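A minimal sketch of the idea (not the exact bpf_test_run() diff; the wrapper
name below is made up):

	#include <linux/filter.h>
	#include <linux/bottom_half.h>

	/* Bracket the program invocation with local_bh_disable()/enable() so
	 * helpers that expect bh-disabled context, such as bpf_fib_lookup(),
	 * can be exercised by the program under test.
	 */
	static u32 run_prog_bh_disabled(const struct bpf_prog *prog, void *ctx)
	{
		u32 ret;

		local_bh_disable();
		ret = bpf_prog_run(prog, ctx);
		local_bh_enable();

		return ret;
	}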
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230217004150.2980689-2-martin.lau@linux.dev
Maciej Fijalkowski [Wed, 15 Feb 2023 14:33:09 +0000 (15:33 +0100)]
xsk: check IFF_UP earlier in Tx path
Xsk Tx can be triggered via either the sendmsg() or the poll() syscall. These
two paths share a call to the common function xsk_xmit(), which has two
sanity checks within. A pseudo code example showing the two paths:
__xsk_sendmsg() :                       xsk_poll():
if (unlikely(!xsk_is_bound(xs)))        if (unlikely(!xsk_is_bound(xs)))
        return -ENXIO;                          return mask;
if (unlikely(need_wait))                (...)
        return -EOPNOTSUPP;             xsk_xmit()
mark napi id
(...)
xsk_xmit()

xsk_xmit():
if (unlikely(!(xs->dev->flags & IFF_UP)))
        return -ENETDOWN;
if (unlikely(!xs->tx))
        return -ENOBUFS;
As can be observed above, in the sendmsg() path the napi id can be marked on
an interface that was not brought up, and this causes a NULL ptr
dereference:
[31757.505631] BUG: kernel NULL pointer dereference, address: 0000000000000018
[31757.512710] #PF: supervisor read access in kernel mode
[31757.517936] #PF: error_code(0x0000) - not-present page
[31757.523149] PGD 0 P4D 0
[31757.525726] Oops: 0000 [#1] PREEMPT SMP NOPTI
[31757.530154] CPU: 26 PID: 95641 Comm: xdpsock Not tainted 6.2.0-rc5+ #40
[31757.536871] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
[31757.547457] RIP: 0010:xsk_sendmsg+0xde/0x180
[31757.551799] Code: 00 75 a2 48 8b 00 a8 04 75 9b 84 d2 74 69 8b 85 14 01 00 00 85 c0 75 1b 48 8b 85 28 03 00 00 48 8b 80 98 00 00 00 48 8b 40 20 <8b> 40 18 89 85 14 01 00 00 8b bd 14 01 00 00 81 ff 00 01 00 00 0f
[31757.570840] RSP: 0018:ffffc90034f27dc0 EFLAGS: 00010246
[31757.576143] RAX: 0000000000000000 RBX: ffffc90034f27e18 RCX: 0000000000000000
[31757.583389] RDX: 0000000000000001 RSI: ffffc90034f27e18 RDI: ffff88984cf3c100
[31757.590631] RBP: ffff88984714a800 R08: ffff88984714a800 R09: 0000000000000000
[31757.597877] R10: 0000000000000001 R11: 0000000000000000 R12: 00000000fffffffa
[31757.605123] R13: 0000000000000000 R14: 0000000000000003 R15: 0000000000000000
[31757.612364] FS:  00007fb4c5931180(0000) GS:ffff88afdfa00000(0000) knlGS:0000000000000000
[31757.620571] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[31757.626406] CR2: 0000000000000018 CR3: 000000184b41c003 CR4: 00000000007706e0
[31757.633648] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[31757.640894] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[31757.648139] PKRU: 55555554
[31757.650894] Call Trace:
[31757.653385] <TASK>
[31757.655524] sock_sendmsg+0x8f/0xa0
[31757.659077] ? sockfd_lookup_light+0x12/0x70
[31757.663416] __sys_sendto+0xfc/0x170
[31757.667051] ? do_sched_setscheduler+0xdb/0x1b0
[31757.671658] __x64_sys_sendto+0x20/0x30
[31757.675557] do_syscall_64+0x38/0x90
[31757.679197] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[31757.687969] Code: 8e f6 ff 44 8b 4c 24 2c 4c 8b 44 24 20 41 89 c4 44 8b 54 24 28 48 8b 54 24 18 b8 2c 00 00 00 48 8b 74 24 10 8b 7c 24 08 0f 05 <48> 3d 00 f0 ff ff 77 3a 44 89 e7 48 89 44 24 08 e8 b5 8e f6 ff 48
[31757.707007] RSP: 002b:00007ffd49c73c70 EFLAGS: 00000293 ORIG_RAX: 000000000000002c
[31757.714694] RAX: ffffffffffffffda RBX: 000055a996565380 RCX: 00007fb4c5727c16
[31757.721939] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
[31757.729184] RBP: 0000000000000040 R08: 0000000000000000 R09: 0000000000000000
[31757.736429] R10: 0000000000000040 R11: 0000000000000293 R12: 0000000000000000
[31757.743673] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[31757.754940] </TASK>
To fix this, let's make xsk_xmit a function that will be responsible for
generic Tx, where RCU is handled accordingly, and pull the sanity checks
and the xs->zc handling out of it. Move the sanity checks to __xsk_sendmsg()
and xsk_poll().
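A rough sketch of the resulting shape, based on the pseudo code above
(simplified, not the actual diff):

	static int __xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
	{
		struct xdp_sock *xs = xdp_sk(sock->sk);

		/* sanity checks first, before the napi id is marked */
		if (unlikely(!xsk_is_bound(xs)))
			return -ENXIO;
		if (unlikely(!(xs->dev->flags & IFF_UP)))
			return -ENETDOWN;
		if (unlikely(!xs->tx))
			return -ENOBUFS;

		/* ... mark napi id and hand off to the generic Tx path ... */
		return 0;
	}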
Fixes: ca2e1a627035 ("xsk: Mark napi_id on sendmsg()")
Fixes: 18b1ab7aa76b ("xsk: Fix race at socket teardown")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20230215143309.13145-1-maciej.fijalkowski@intel.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Taichi Nishimura [Thu, 16 Feb 2023 08:55:37 +0000 (17:55 +0900)]
Fix typos in selftest/bpf files
Ran a spell checker on the files in selftest/bpf and fixed the typos it found.
Signed-off-by: Taichi Nishimura <awkrail01@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/bpf/20230216085537.519062-1-awkrail01@gmail.com
Ilya Leoshkevich [Tue, 14 Feb 2023 23:12:18 +0000 (00:12 +0100)]
selftests/bpf: Use bpf_{btf,link,map,prog}_get_info_by_fd()
Use the new type-safe wrappers around bpf_obj_get_info_by_fd().
Fix a prog/map mixup in prog_holds_map().
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230214231221.249277-6-iii@linux.ibm.com
Ilya Leoshkevich [Tue, 14 Feb 2023 23:12:17 +0000 (00:12 +0100)]
samples/bpf: Use bpf_{btf,link,map,prog}_get_info_by_fd()
Use the new type-safe wrappers around bpf_obj_get_info_by_fd().
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230214231221.249277-5-iii@linux.ibm.com
Ilya Leoshkevich [Tue, 14 Feb 2023 23:12:16 +0000 (00:12 +0100)]
bpftool: Use bpf_{btf,link,map,prog}_get_info_by_fd()
Use the new type-safe wrappers around bpf_obj_get_info_by_fd().
Split the bpf_obj_get_info_by_fd() call in build_btf_type_table() in
two, since knowing the type helps with the Memory Sanitizer.
Improve map_parse_fd_and_info() type safety by using
struct bpf_map_info * instead of void * for info.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20230214231221.249277-4-iii@linux.ibm.com
Ilya Leoshkevich [Tue, 14 Feb 2023 23:12:15 +0000 (00:12 +0100)]
libbpf: Use bpf_{btf,link,map,prog}_get_info_by_fd()
Use the new type-safe wrappers around bpf_obj_get_info_by_fd().
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230214231221.249277-3-iii@linux.ibm.com
Ilya Leoshkevich [Tue, 14 Feb 2023 23:12:14 +0000 (00:12 +0100)]
libbpf: Introduce bpf_{btf,link,map,prog}_get_info_by_fd()
These are type-safe wrappers around bpf_obj_get_info_by_fd(). They
found one problem in selftests, and are also useful for adding
Memory Sanitizer annotations.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230214231221.249277-2-iii@linux.ibm.com
Alexander Lobakin [Wed, 15 Feb 2023 18:54:40 +0000 (19:54 +0100)]
bpf, test_run: fix &xdp_frame misplacement for LIVE_FRAMES
&xdp_buff and &xdp_frame are bound in a way that
xdp_buff->data_hard_start == xdp_frame
It's always the case and e.g. xdp_convert_buff_to_frame() relies on
this.
IOW, the following:
for (u32 i = 0; i < 0xdead; i++) {
	xdpf = xdp_convert_buff_to_frame(&xdp);
	xdp_convert_frame_to_buff(xdpf, &xdp);
}
shouldn't ever modify @xdpf's contents or the pointer itself.
However, "live packet" code wrongly treats &xdp_frame as part of its
context placed *before* the data_hard_start. With such flow,
data_hard_start is sizeof(*xdpf) off to the right and no longer points
to the XDP frame.
Instead of replacing `sizeof(ctx)` with `offsetof(ctx, xdpf)` in several
places and praying that there are no more miscalcs left somewhere in the
code, unionize ::frm with ::data in a flex array, so that both starts
pointing to the actual data_hard_start and the XDP frame actually starts
being a part of it, i.e. a part of the headroom, not the context.
A nice side effect is that the maximum frame size for this mode gets
increased by 40 bytes, as xdp_buff::frame_sz includes everything from
data_hard_start (-> includes xdpf already) to the end of XDP/skb shared
info.
Also update %MAX_PKT_SIZE accordingly in the selftests code. Leave it
hardcoded for 64 bit && 4k pages, it can be made more flexible later on.
Minor: align `&head->data` with how `head->frm` is assigned for
consistency.
Minor #2: rename 'frm' to 'frame' in &xdp_page_head while at it for
clarity.
(was found while testing XDP traffic generator on ice, which calls
xdp_convert_frame_to_buff() for each XDP frame)
Fixes: b530e9e1063e ("bpf: Add "live packet" mode for XDP in BPF_PROG_RUN")
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20230215185440.4126672-1-aleksander.lobakin@intel.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Andrii Nakryiko [Thu, 16 Feb 2023 00:29:32 +0000 (16:29 -0800)]
Merge branch 'New benchmark for hashmap lookups'
Anton Protopopov says:
====================
Add a new benchmark for hashmap lookups and fix several typos.
In commit 3 I've patched the bench utility so that command line options
can now be reused by different benchmarks.
The benchmark itself is added in the last commit, commit 7. I was using this
benchmark to test map lookup performance when using a different hash
function [1]. When run with --quiet, the results can be easily plotted [2].
The results provided by the benchmark look reasonable and match the results
of my other benchmarks (which required patching the kernel to get actual
statistics on map lookups).
Links:
[1] https://fosdem.org/2023/schedule/event/bpf_hashing/
[2] https://github.com/aspsk/bpf-bench/tree/master/hashmap-bench
Changes,
v1->v2:
- percpu_times_index[] is of wrong size (Martin)
- use base 0 for strtol (Andrii)
- just use -q without argument (Andrii)
- use less hacks when parsing arguments (Andrii)
====================
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Anton Protopopov [Mon, 13 Feb 2023 09:15:19 +0000 (09:15 +0000)]
selftest/bpf/benchs: Add benchmark for hashmap lookups
Add a new benchmark which measures the speed of hashmap lookup operations.
A user can control the following parameters of the benchmark:
* key_size (max 1024): the key size to use
* max_entries: the hashmap max entries
* nr_entries: the number of entries to insert/lookup
* nr_loops: the number of loops for the benchmark
* map_flags: the hashmap flags passed to BPF_MAP_CREATE
The BPF program performing the benchmarks calls two nested bpf_loop:
bpf_loop(nr_loops/nr_entries)
    bpf_loop(nr_entries)
        bpf_map_lookup()
So the nr_loops determines the number of actual map lookups. All lookups are
successful.
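A hedged sketch of the BPF side (illustrative names, map layout and section,
not the actual selftest source):

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>

	struct {
		__uint(type, BPF_MAP_TYPE_HASH);
		__uint(max_entries, 65536);
		__type(key, __u32);
		__type(value, __u64);
	} hashmap SEC(".maps");

	const volatile __u32 nr_entries = 4096;
	const volatile __u32 nr_loops = 1000000;

	static int lookup_cb(__u32 i, void *ctx)
	{
		bpf_map_lookup_elem(&hashmap, &i);
		return 0;
	}

	static int outer_cb(__u32 i, void *ctx)
	{
		bpf_loop(nr_entries, lookup_cb, NULL, 0);
		return 0;
	}

	SEC("tp/syscalls/sys_enter_getpgid")
	int benchmark(void *ctx)
	{
		bpf_loop(nr_loops / nr_entries, outer_cb, NULL, 0);
		return 0;
	}

	char _license[] SEC("license") = "GPL";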
Example (the output is generated on an AMD Ryzen 9 3950X machine):
for nr_entries in `seq 4096 4096 65536`; do echo -n "$((nr_entries*100/65536))% full: "; sudo ./bench -d2 -a bpf-hashmap-lookup --key_size=4 --nr_entries=$nr_entries --max_entries=65536 --nr_loops=1000000 --map_flags=0x40 | grep cpu; done
6% full: cpu01: lookup 50.739M ± 0.018M events/sec (approximated from 32 samples of ~19ms)
12% full: cpu01: lookup 47.751M ± 0.015M events/sec (approximated from 32 samples of ~20ms)
18% full: cpu01: lookup 45.153M ± 0.013M events/sec (approximated from 32 samples of ~22ms)
25% full: cpu01: lookup 43.826M ± 0.014M events/sec (approximated from 32 samples of ~22ms)
31% full: cpu01: lookup 41.971M ± 0.012M events/sec (approximated from 32 samples of ~23ms)
37% full: cpu01: lookup 41.034M ± 0.015M events/sec (approximated from 32 samples of ~24ms)
43% full: cpu01: lookup 39.946M ± 0.012M events/sec (approximated from 32 samples of ~25ms)
50% full: cpu01: lookup 38.256M ± 0.014M events/sec (approximated from 32 samples of ~26ms)
56% full: cpu01: lookup 36.580M ± 0.018M events/sec (approximated from 32 samples of ~27ms)
62% full: cpu01: lookup 36.252M ± 0.012M events/sec (approximated from 32 samples of ~27ms)
68% full: cpu01: lookup 35.200M ± 0.012M events/sec (approximated from 32 samples of ~28ms)
75% full: cpu01: lookup 34.061M ± 0.009M events/sec (approximated from 32 samples of ~29ms)
81% full: cpu01: lookup 34.374M ± 0.010M events/sec (approximated from 32 samples of ~29ms)
87% full: cpu01: lookup 33.244M ± 0.011M events/sec (approximated from 32 samples of ~30ms)
93% full: cpu01: lookup 32.182M ± 0.013M events/sec (approximated from 32 samples of ~31ms)
100% full: cpu01: lookup 31.497M ± 0.016M events/sec (approximated from 32 samples of ~31ms)
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230213091519.1202813-8-aspsk@isovalent.com
Anton Protopopov [Mon, 13 Feb 2023 09:15:18 +0000 (09:15 +0000)]
selftest/bpf/benchs: Print less if the quiet option is set
The bench utility will print
Setting up benchmark '<bench-name>'...
Benchmark '<bench-name>' started.
on startup to stdout. Suppress this output if the --quiet option is given.
This makes it simpler for a script to parse the benchmark output.
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230213091519.1202813-7-aspsk@isovalent.com
Anton Protopopov [Mon, 13 Feb 2023 09:15:17 +0000 (09:15 +0000)]
selftest/bpf/benchs: Make quiet option common
The "local-storage-tasks-trace" benchmark has a `--quiet` option. Move it to
the list of common options, so that the main code and other benchmarks can use
(new) env.quiet variable. Patch the run_bench_local_storage_rcu_tasks_trace.sh
helper script accordingly.
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230213091519.1202813-6-aspsk@isovalent.com
Anton Protopopov [Mon, 13 Feb 2023 09:15:16 +0000 (09:15 +0000)]
selftest/bpf/benchs: Remove an unused header
The benchs/bench_bpf_hashmap_full_update.c doesn't set a custom argp,
so it shouldn't include the <argp.h> header.
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230213091519.1202813-5-aspsk@isovalent.com
Anton Protopopov [Mon, 13 Feb 2023 09:15:15 +0000 (09:15 +0000)]
selftest/bpf/benchs: Enhance argp parsing
To parse the command line, the bench utility uses the argp_parse() function.
This function takes as an argument a parent 'struct argp' structure which
defines common command line options and an array of children 'struct argp'
structures which define additional command line options for particular
benchmarks. This implementation doesn't allow benchmarks to share option
names, e.g., if two benchmarks want to use, say, the --option option, then
only one of them will succeed (the first one encountered in the array). It
would be convenient if the same option names could be used in different
benchmarks (with the same semantics, e.g., --nr_loops=N).
Fix this by calling the argp_parse() function twice. The first call is the
same as it was before, with all children argps, and helps to find the
benchmark name and to print a combined help message if anything is wrong.
Given the name, we can call argp_parse() a second time, but now the children
array points only to the selected benchmark, thus always calling the correct
parsers. (If there's no benchmark-specific list of arguments, then only one
call to argp_parse() is done.)
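A rough sketch of the two-pass idea (all names below are hypothetical
stand-ins for bench.c internals, not the actual patch):

	#include <argp.h>
	#include <stddef.h>

	/* hypothetical declarations provided elsewhere */
	extern const struct argp_option common_opts[];
	extern error_t parse_common_arg(int key, char *arg, struct argp_state *state);
	extern const struct argp_child all_bench_children[];
	extern const struct argp_child *children_for_bench(const char *name);
	extern const char *selected_bench_name;   /* discovered during pass 1 */

	static void parse_cmdline_args(int argc, char **argv)
	{
		struct argp argp = {
			.options = common_opts,
			.parser = parse_common_arg,
			.children = all_bench_children,
		};

		/* pass 1: all children; finds the benchmark name and can print
		 * a combined help message if anything is wrong
		 */
		argp_parse(&argp, argc, argv, 0, NULL, NULL);

		/* pass 2: only the selected benchmark's parsers, if it has any,
		 * so different benchmarks may reuse the same option names
		 */
		argp.children = children_for_bench(selected_bench_name);
		if (argp.children)
			argp_parse(&argp, argc, argv, 0, NULL, NULL);
	}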
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230213091519.1202813-4-aspsk@isovalent.com
Anton Protopopov [Mon, 13 Feb 2023 09:15:14 +0000 (09:15 +0000)]
selftest/bpf/benchs: Make a function static in bpf_hashmap_full_update
The hashmap_report_final callback function defined in the
benchs/bench_bpf_hashmap_full_update.c file should be static.
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230213091519.1202813-3-aspsk@isovalent.com
Anton Protopopov [Mon, 13 Feb 2023 09:15:13 +0000 (09:15 +0000)]
selftest/bpf/benchs: Fix a typo in bpf_hashmap_full_update
To call the bpf_hashmap_full_update benchmark, one should say:
bench bpf-hashmap-ful-update
The patch adds a missing 'l' to the benchmark name.
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230213091519.1202813-2-aspsk@isovalent.com
Alexei Starovoitov [Wed, 15 Feb 2023 23:40:06 +0000 (15:40 -0800)]
Merge branch 'Use __GFP_ZERO in bpf memory allocator'
Hou Tao says:
====================
From: Hou Tao <houtao1@huawei.com>
Hi,
The patchset tries to fix the hard-lockup problem found when checking how htab
handles element reuse in the bpf memory allocator. The immediate reuse of
freed elements will reinitialize special fields (e.g., bpf_spin_lock) in the
htab map value, and it may corrupt a lookup procedure with the BPF_F_LOCK flag,
which acquires the bpf spin-lock during value copying, and lead to a hard-lockup
as shown in patch #2. Patch #1 fixes it by using __GFP_ZERO when allocating
the object from slab, and the behavior is similar to the preallocated
hash-table case. Please see individual patches for more details. And comments
are always welcome.
Regards,
Change Log:
v1:
* Use __GFP_ZERO instead of ctor to avoid retpoline overhead (from Alexei)
* Add comments for check_and_init_map_value() (from Alexei)
* split __GFP_ZERO patches out of the original patchset to unblock
the development work of others.
RFC: https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Hou Tao [Wed, 15 Feb 2023 08:21:32 +0000 (16:21 +0800)]
selftests/bpf: Add test case for element reuse in htab map
The reinitialization of the spin-lock in the map value after immediate reuse
may corrupt a lookup with the BPF_F_LOCK flag and result in a hard lock-up,
so add one test case to demonstrate the problem.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230215082132.3856544-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Hou Tao [Wed, 15 Feb 2023 08:21:31 +0000 (16:21 +0800)]
bpf: Zeroing allocated object from slab in bpf memory allocator
Currently the freed element in bpf memory allocator may be immediately
reused, for htab map the reuse will reinitialize special fields in map
value (e.g., bpf_spin_lock), but lookup procedure may still access
these special fields, and it may lead to hard-lockup as shown below:
NMI backtrace for cpu 16
CPU: 16 PID: 2574 Comm: htab.bin Tainted: G L 6.1.0+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
RIP: 0010:queued_spin_lock_slowpath+0x283/0x2c0
......
Call Trace:
<TASK>
copy_map_value_locked+0xb7/0x170
bpf_map_copy_value+0x113/0x3c0
__sys_bpf+0x1c67/0x2780
__x64_sys_bpf+0x1c/0x20
do_syscall_64+0x30/0x60
entry_SYSCALL_64_after_hwframe+0x46/0xb0
......
</TASK>
For the htab map, just like the preallocated case, there is no need to
initialize these special fields in the map value again once these fields
have been initialized. For a preallocated htab map, these fields are
initialized through __GFP_ZERO in bpf_map_area_alloc(), so do the
same thing for the non-preallocated htab in the bpf memory allocator. And
there is no need to use __GFP_ZERO for the per-cpu bpf memory allocator,
because __alloc_percpu_gfp() does it implicitly.
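For illustration only (not the patch itself), this is the kind of map value
layout involved; userspace then reads it with
bpf_map_lookup_elem_flags(map_fd, &key, &val, BPF_F_LOCK), which makes the
kernel take the embedded lock while copying the value:

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>

	struct map_value {
		struct bpf_spin_lock lock;
		long data;
	};

	struct {
		__uint(type, BPF_MAP_TYPE_HASH);
		__uint(max_entries, 128);
		__type(key, int);
		__type(value, struct map_value);
	} htab SEC(".maps");

	char _license[] SEC("license") = "GPL";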
Fixes: 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230215082132.3856544-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Alexei Starovoitov [Wed, 15 Feb 2023 19:48:48 +0000 (11:48 -0800)]
Merge branch 'Improvements for BPF_ST tracking by verifier '
Eduard Zingerman says:
====================
This patch-set is a part of preparation work for -mcpu=v4 option for
BPF C compiler (discussed in [1]). Among other things -mcpu=v4 should
enable generation of BPF_ST instruction by the compiler.
- Patches #1,2 adjust verifier to track values of constants written to
stack using BPF_ST. Currently these are tracked imprecisely, unlike
the writes using BPF_STX, e.g.:
fp[-8] = 42;      currently verifier assumes that fp[-8]=mmmmmmmm
                  after such instruction, where m stands for "misc",
                  just a note that something is written at fp[-8].
r1 = 42;          verifier tracks r1=42 after this instruction.
fp[-8] = r1;      verifier tracks fp[-8]=42 after this instruction.
This patch makes both cases equivalent.
- Patches #3,4 adjust verifier.c:check_stack_write_fixed_off() to
preserve STACK_ZERO marks when BPF_ST writes zero. Currently these
are replaced by STACK_MISC, unlike zero writes using BPF_STX, e.g.:
... stack range [X,Y] is marked as STACK_ZERO ...
r0 = ... variable offset pointer to stack with range [X,Y] ...
fp[r0] = 0;       currently verifier marks range [X,Y] as
                  STACK_MISC for such instructions.
r1 = 0;
fp[r0] = r1;      verifier keeps STACK_ZERO marks for range [X,Y].
This patch makes both cases equivalent.
Motivating example for patch #1 could be found at [3].
Previous version of the patch-set is here [2], the changes are:
- Explicit initialization of fake register parent link is removed from
verifier.c:check_stack_write_fixed_off() as parent links are now
correctly handled by verifier.c:save_register_state().
- Original patch #1 is split in patches #1 & #3.
- Missing test case added for patch #3
verifier.c:check_stack_write_fixed_off() adjustment.
- Test cases are updated to use .prog_type = BPF_PROG_TYPE_SK_LOOKUP,
which requires return value to be in the range [0,1] (original test
cases assumed that such range is always required, which is not true).
- Original patch #3 with changes allowing BPF_ST writes to context is
withheld for now, w/o compiler support for BPF_ST it requires some
creative testing.
- Original patch #5 is removed from the patch-set. This patch
contained adjustments to expected verifier error messages in some
tests, necessary when C compiler generates BPF_ST instruction
instead of BPF_STX (changes to expected instruction indices). These
changes are not necessary yet.
[1] https://lore.kernel.org/bpf/01515302-c37d-2ee5-c950-2f556a4caad0@meta.com/
[2] https://lore.kernel.org/bpf/20221231163122.1360813-1-eddyz87@gmail.com/
[3] https://lore.kernel.org/bpf/f1e4282bf00aa21a72fc5906f8c3be1ae6c94a5e.camel@gmail.com/
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Eduard Zingerman [Tue, 14 Feb 2023 23:20:30 +0000 (01:20 +0200)]
selftests/bpf: check if BPF_ST with variable offset preserves STACK_ZERO
A test case to verify that a variable offset BPF_ST instruction
preserves STACK_ZERO marks when it writes zeros, e.g. in the following
situation:
*(u64*)(r10 - 8) = 0 ; STACK_ZERO marks for fp[-8]
r0 = random(-7, -1) ; some random number in range of [-7, -1]
r0 += r10 ; r0 is now variable offset pointer to stack
*(u8*)(r0) = 0 ; BPF_ST writing zero, STACK_ZERO mark for
; fp[-8] should be preserved.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230214232030.1502829-5-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Eduard Zingerman [Tue, 14 Feb 2023 23:20:29 +0000 (01:20 +0200)]
bpf: BPF_ST with variable offset should preserve STACK_ZERO marks
BPF_STX instruction preserves STACK_ZERO marks for variable offset
writes in situations like below:
*(u64*)(r10 - 8) = 0 ; STACK_ZERO marks for fp[-8]
r0 = random(-7, -1) ; some random number in range of [-7, -1]
r0 += r10 ; r0 is now a variable offset pointer to stack
r1 = 0
*(u8*)(r0) = r1 ; BPF_STX writing zero, STACK_ZERO mark for
; fp[-8] is preserved
This commit updates verifier.c:check_stack_write_var_off() to process
BPF_ST in a similar manner, e.g. the following example:
*(u64*)(r10 - 8) = 0 ; STACK_ZERO marks for fp[-8]
r0 = random(-7, -1) ; some random number in range of [-7, -1]
r0 += r10 ; r0 is now variable offset pointer to stack
*(u8*)(r0) = 0 ; BPF_ST writing zero, STACK_ZERO mark for
; fp[-8] is preserved
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230214232030.1502829-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Eduard Zingerman [Tue, 14 Feb 2023 23:20:28 +0000 (01:20 +0200)]
selftests/bpf: check if verifier tracks constants spilled by BPF_ST_MEM
Check that verifier tracks the value of 'imm' spilled to stack by
BPF_ST_MEM instruction. Cover the following cases:
- write of non-zero constant to stack;
- write of a zero constant to stack.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230214232030.1502829-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Eduard Zingerman [Tue, 14 Feb 2023 23:20:27 +0000 (01:20 +0200)]
bpf: track immediate values written to stack by BPF_ST instruction
For aligned stack writes using the BPF_ST instruction, track stored values
in the same way BPF_STX is handled, e.g. make sure that the following
commands produce similar verifier knowledge:
fp[-8] = 42;          r1 = 42;
                      fp[-8] = r1;
This covers two cases:
- non-null values written to stack are stored as spill of fake
registers;
- null values written to stack are stored as STACK_ZERO marks.
Previously both cases above used STACK_MISC marks instead.
Some verifier test cases relied on the old logic to obtain STACK_MISC
marks for some stack values. These test cases are updated in the same
commit to avoid failures during bisect.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230214232030.1502829-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Alexei Starovoitov [Tue, 14 Feb 2023 23:50:51 +0000 (15:50 -0800)]
selftests/bpf: Fix map_kptr test.
The compiler is optimizing out the majority of unref_ptr reads/writes, so the test
wasn't testing much. For example, one could delete the '__kptr' tag from
'struct prog_test_ref_kfunc __kptr *unref_ptr;' and the test would still "pass".
Convert it to volatile stores. Confirmed by comparing bpf asm before/after.
Fixes: 2cbc469a6fc3 ("selftests/bpf: Add C tests for kptr")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230214235051.22938-1-alexei.starovoitov@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Björn Töpel [Tue, 14 Feb 2023 16:12:53 +0000 (17:12 +0100)]
selftests/bpf: Cross-compile bpftool
When the BPF selftests are cross-compiled, only a host version of
bpftool is built. This version of bpftool is used on the host side to
generate various intermediates, e.g., skeletons.
The test runners also use bpftool, so the Makefile will symlink
bpftool from the selftest/bpf root, where the test runners will look for
the tool:
| $(Q)ln -sf $(if $2,..,.)/tools/build/bpftool/bootstrap/bpftool \
| $(OUTPUT)/$(if $2,$2/)bpftool
There are two problems for cross-compilation builds:
1. There is no native (cross-compilation target) build of bpftool
2. The bootstrap/bpftool is never cross-compiled (by design)
Make sure that a native/cross-compiled version of bpftool is built,
and if CROSS_COMPILE is set, symlink the native/non-bootstrap version.
Acked-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
Link: https://lore.kernel.org/r/20230214161253.183458-1-bjorn@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
David Vernet [Tue, 14 Feb 2023 22:35:53 +0000 (16:35 -0600)]
bpf, docs: Add myself to BPF docs MAINTAINERS entry
In commit 7e2a9ebe8126 ("docs, bpf: Ensure IETF's BPF mailing list gets
copied for ISA doc changes"), a new MAINTAINERS entry was added for any
BPF IETF documentation updates for the ongoing standardization process.
I've been making it a point to try and review as many BPF documentation
patches as possible, and have made a commitment to Alexei to
consistently review BPF standardization patches going forward. This
patch adds my name as a reviewer to the MAINTAINERS entry for the
standardization effort.
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230214223553.78353-1-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tiezhu Yang [Wed, 15 Feb 2023 11:01:07 +0000 (19:01 +0800)]
selftests/bpf: Fix build error for LoongArch
There is a build error when running make -C tools/testing/selftests/bpf/
on LoongArch:
BINARY test_verifier
In file included from test_verifier.c:27:
tools/include/uapi/linux/bpf_perf_event.h:14:28: error: field 'regs' has incomplete type
14 | bpf_user_pt_regs_t regs;
| ^~~~
make: *** [Makefile:577: tools/testing/selftests/bpf/test_verifier] Error 1
make: Leaving directory 'tools/testing/selftests/bpf'
Add missing uapi header for LoongArch to use the following definition:
typedef struct user_pt_regs bpf_user_pt_regs_t;
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Link: https://lore.kernel.org/r/1676458867-22052-1-git-send-email-yangtiezhu@loongson.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Bagas Sanjaya [Wed, 15 Feb 2023 12:32:52 +0000 (19:32 +0700)]
Documentation: bpf: Add missing line break separator in node_data struct code block
Stephen Rothwell reported htmldocs warning when merging bpf-next tree,
which was the same warning as reported by kernel test robot:
Documentation/bpf/graph_ds_impl.rst:62: ERROR: Error in "code-block" directive:
maximum 1 argument(s) allowed, 12 supplied.
The error is due to Sphinx confusing the node_data struct declaration with
a code-block directive option.
Fix the warning by separating the code-block marker from the node_data struct
declaration.
Link: https://lore.kernel.org/linux-next/20230215144505.4751d823@canb.auug.org.au/
Link: https://lore.kernel.org/linux-doc/202302151123.wUE5FYFx-lkp@intel.com/
Fixes: c31315c3aa0929 ("bpf, documentation: Add graph documentation for non-owning refs")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Link: https://lore.kernel.org/r/20230215123253.41552-3-bagasdotme@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Alexei Starovoitov [Tue, 14 Feb 2023 06:01:14 +0000 (22:01 -0800)]
Revert "bpf: Add --skip_encoding_btf_inconsistent_proto, --btf_gen_optimized to pahole flags for v1.25"
This reverts commit 0243d3dfe274832aa0a16214499c208122345173.
pahole 1.25 is too aggressive in removing functions.
With a clang-compiled kernel the following is seen:
WARN: resolve_btfids: unresolved symbol tcp_reno_cong_avoid
WARN: resolve_btfids: unresolved symbol dctcp_update_alpha
WARN: resolve_btfids: unresolved symbol cubictcp_cong_avoid
WARN: resolve_btfids: unresolved symbol bpf_xdp_metadata_rx_timestamp
WARN: resolve_btfids: unresolved symbol bpf_xdp_metadata_rx_hash
WARN: resolve_btfids: unresolved symbol bpf_task_kptr_get
WARN: resolve_btfids: unresolved symbol bpf_task_acquire_not_zero
WARN: resolve_btfids: unresolved symbol bpf_rdonly_cast
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_test_static_unused_arg
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_test_ref
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_test_pass_ctx
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_test_pass2
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_test_pass1
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_test_mem_len_pass1
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_test_mem_len_fail2
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_test_mem_len_fail1
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_test_kptr_get
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_test_fail3
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_test_fail2
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_test_acquire
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_test2
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_test1
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_memb_release
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_memb1_release
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_int_mem_release
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Joanne Koong [Tue, 14 Feb 2023 05:13:32 +0000 (21:13 -0800)]
selftests/bpf: Clean up dynptr prog_tests
Clean up prog_tests/dynptr.c by removing the unneeded "expected_err_msg"
in the dynptr_tests struct, which is a remnant from converting the failing
test cases to use the generic verification tester.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://lore.kernel.org/r/20230214051332.4007131-2-joannelkoong@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Joanne Koong [Tue, 14 Feb 2023 05:13:31 +0000 (21:13 -0800)]
selftests/bpf: Clean up user_ringbuf, cgrp_kfunc, kfunc_dynptr_param tests
Clean up user_ringbuf, cgrp_kfunc, and kfunc_dynptr_param tests to use
the generic verification tester for checking verifier rejections.
The generic verification tester uses btf_decl_tag-based annotations
for verifying that the tests fail with the expected log messages.
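As an illustration of the annotation style (a made-up program; the __msg()
string is indicative of a real verifier error, not a quote from any of the
converted tests):

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>
	#include "bpf_misc.h"

	SEC("?tc")
	__failure __msg("invalid access to packet")
	int example_should_fail(struct __sk_buff *ctx)
	{
		void *data = (void *)(long)ctx->data;

		/* read packet data without checking against data_end, which
		 * the verifier rejects; the tester matches __msg() against
		 * the verifier log
		 */
		return *(int *)data;
	}

	char _license[] SEC("license") = "GPL";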
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Acked-by: David Vernet <void@manifault.com>
Reviewed-by: Roberto Sassu <roberto.sassu@huawei.com>
Link: https://lore.kernel.org/r/20230214051332.4007131-1-joannelkoong@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Alexei Starovoitov [Tue, 14 Feb 2023 03:31:14 +0000 (19:31 -0800)]
Merge branch 'BPF rbtree next-gen datastructure'
Dave Marchevsky says:
====================
This series adds a rbtree datastructure following the "next-gen
datastructure" precedent set by recently-added linked-list [0]. This is
a reimplementation of previous rbtree RFC [1] to use kfunc + kptr
instead of adding a new map type. This series adds a smaller set of API
functions than that RFC - just the minimum needed to support current
cgfifo example scheduler in ongoing sched_ext effort [2], namely:
bpf_rbtree_add
bpf_rbtree_remove
bpf_rbtree_first
The meat of this series is bugfixes and verifier infra work to support
these API functions. Adding more rbtree kfuncs in future patches should
be straightforward as a result.
First, the series refactors and extends linked_list's release_on_unlock
logic. The concept of "reference to node that was added to data
structure" is formalized as "non-owning reference". From linked_list's
perspective this non-owning reference after
linked_list_push_{front,back} has the same semantics as release_on_unlock,
with the addition of writes to such references being valid in the
critical section. Such references are no longer marked PTR_UNTRUSTED.
Patches 2 and 13 go into more detail.
The series then adds rbtree API kfuncs and necessary verifier support
for them - namely support for callback args to kfuncs and some
non-owning reference interactions that linked_list didn't need.
BPF rbtree uses struct rb_root_cached + existing rbtree lib under the
hood. From the BPF program writer's perspective, a BPF rbtree is very
similar to existing linked list. Consider the following example:
struct node_data {
	long key;
	long data;
	struct bpf_rb_node node;
};

static bool less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
{
	struct node_data *node_a;
	struct node_data *node_b;

	node_a = container_of(a, struct node_data, node);
	node_b = container_of(b, struct node_data, node);

	return node_a->key < node_b->key;
}

private(A) struct bpf_spin_lock glock;
private(A) struct bpf_rb_root groot __contains(node_data, node);

/* ... in BPF program */
struct node_data *n, *m;
struct bpf_rb_node *res;

n = bpf_obj_new(typeof(*n));
if (!n)
	/* skip */

n->key = 5;
n->data = 10;

bpf_spin_lock(&glock);
bpf_rbtree_add(&groot, &n->node, less);
bpf_spin_unlock(&glock);

bpf_spin_lock(&glock);
res = bpf_rbtree_first(&groot);
if (!res)
	/* skip */

res = bpf_rbtree_remove(&groot, res);
if (!res)
	/* skip */

bpf_spin_unlock(&glock);

m = container_of(res, struct node_data, node);
bpf_obj_drop(m);
Some obvious similarities:
* Special bpf_rb_root and bpf_rb_node types have same semantics
as bpf_list_head and bpf_list_node, respectively
* __contains is used to associate a node type with a root
* The spin_lock associated with a rbtree must be held when using
rbtree API kfuncs
* Nodes are allocated via bpf_obj_new and dropped via bpf_obj_drop
* Rbtree takes ownership of node lifetime when a node is added.
Removing a node gives ownership back to the program, requiring a
bpf_obj_drop before program exit
Some new additions as well:
* Support for callbacks in kfunc args is added to enable 'less'
callback use above
* bpf_rbtree_first is the first graph API function to return a
non-owning reference instead of converting an arg from own->non-own
* Because all references to nodes already added to the rbtree are
non-owning, bpf_rbtree_remove must accept such a reference in order
to remove it from the tree
Summary of patches:
Patches 1 - 5 implement the meat of rbtree-specific support in this
series, gradually building up to implemented kfuncs that verify as
expected.
Patch 6 adds the bpf_rbtree_{add,first,remove} to bpf_experimental.h.
Patch 7 adds tests, Patch 9 adds documentation.
[0]: lore.kernel.org/bpf/20221118015614.2013203-1-memxor@gmail.com
[1]: lore.kernel.org/bpf/20220830172759.4069786-1-davemarchevsky@fb.com
[2]: lore.kernel.org/bpf/20221130082313.3241517-1-tj@kernel.org
Changelog:
v5 -> v6: lore.kernel.org/bpf/20230212092715.1422619-1-davemarchevsky@fb.com/
Patch #'s below refer to the patch's number in v5 unless otherwise stated.
* General / Patch 1
* Rebase onto latest bpf-next: "bpf: Migrate release_on_unlock logic to non-owning ref semantics"
* This was Patch 1 of v4, was applied, not included in v6
* Patch 3 - "bpf: Add bpf_rbtree_{add,remove,first} kfuncs"
* Use bpf_callback_t instead of plain-C fn ptr for bpf_rbtree_add. This
necessitated having bpf_rbtree_add duplicate rbtree_add's functionality.
Wrapper function was used w/ internal __bpf_rbtree_add helper so that
bpf_experimental.h proto could continue to use plain-C fn ptr so BPF progs
could benefit from typechecking (Alexei)
v4 -> v5: lore.kernel.org/bpf/20230209174144.3280955-1-davemarchevsky@fb.com/
Patch #'s below refer to the patch's number in v4 unless otherwise stated.
* General
* Rebase onto latest bpf-next: "Merge branch 'bpf, mm: introduce cgroup.memory=nobpf'"
* Patches 1-3 are squashed into "bpf: Migrate release_on_unlock logic to non-owning ref semantics".
* Added type_is_non_owning_ref helper (Alexei)
* Use a NON_OWN_REF type flag instead of separate bool (Alexei)
* Patch 8 - "bpf: Special verifier handling for bpf_rbtree_{remove, first}"
* When doing btf_parse_fields, reject structs with both bpf_list_node and
bpf_rb_node fields. This is a temporary measure that can be removed after
"collection identity" followup. See comment added in btf_parse_fields for
more detail (Kumar, Alexei)
* Add linked_list BTF test exercising check added to btf_parse_fields
* Minor changes and moving around of some reg type checks due to NON_OWN_REF type flag
introduction
* Patch 10 - "selftests/bpf: Add rbtree selftests"
* Migrate failure tests to RUN_TESTS, __failure, __msg() framework (Alexei)
v3 -> v4: lore.kernel.org/bpf/20230131180016.3368305-1-davemarchevsky@fb.com/
Patch #'s below refer to the patch's number in v3 unless otherwise stated.
* General
* Don't base this series on "bpf: Refactor release_regno searching logic",
which was submitted separately as a refactor.
* Rebase onto latest bpf-next: "samples/bpf: Add openat2() enter/exit tracepoint to syscall_tp sample"
* Patch 2 - "bpf: Improve bpf_reg_state space usage for non-owning ref lock"
* print_verifier_state change was adding redundant comma after "non_own_ref",
fix it to put comma in correct place
* invalidate_non_owning_refs no longer needs to take bpf_active_lock param,
since any non-owning ref reg in env's cur_state is assumed to use that
state's active_lock (Alexei)
* invalidate_non_owning_refs' reg loop should check that the reg being
inspected is a PTR_TO_BTF_ID before checking reg->non_owning_ref_lock,
since that field is part of a union and may be filled w/ meaningless bytes
if reg != PTR_TO_BTF_ID (Alexei)
* Patch 3 - "selftests/bpf: Update linked_list tests for non-owning ref semantics"
* Change the string searched for by the following tests:
* linked_list/incorrect_node_off1
* linked_list/double_push_front
* linked_list/double_push_back
necessary due to rebase / dropping of "release_regno searching logic" patch
(see "General" changes)
* Patch 8 - "bpf: Special verifier handling for bpf_rbtree_{remove, first}"
* Just call invalidate_non_owning_refs w/ env instead of env, lock. (see
Patch 2 changes)
* Patch 11 - "bpf, documentation: Add graph documentation for non-owning refs"
* Fix documentation formatting and improve content (David)
* v3's version of patch 11 was missing some changes, v4's patch 11 is still
addressing David's feedback from v2
v2 -> v3: lore.kernel.org/bpf/20221217082506.1570898-1-davemarchevsky@fb.com/
Patch #'s below refer to the patch's number in v2 unless otherwise stated.
* Patch 1 - "bpf: Support multiple arg regs w/ ref_obj_id for kfuncs"
* No longer needed as v3 doesn't have multiple ref_obj_id arg regs
* The refactoring pieces were submitted separately
(https://lore.kernel.org/bpf/20230121002417.1684602-1-davemarchevsky@fb.com/)
* Patch 2 - "bpf: Migrate release_on_unlock logic to non-owning ref semantics"
* Remove KF_RELEASE_NON_OWN flag from list API push methods, just match
against specific kfuncs for now (Alexei, David)
* Separate "release non owning reference" logic from KF_RELEASE logic
(Alexei, David)
* reg_find_field_offset now correctly tests 'rec' instead of 'reg' after
calling reg_btf_record (Dan Carpenter)
* New patch added after Patch 2 - "bpf: Improve bpf_reg_state space usage for non-owning ref lock"
* Eliminates extra bpf_reg_state memory usage by using a bool instead of
copying lock identity
* Patch 4 - "bpf: rename list_head -> graph_root in field info types"
* v2's version was applied to bpf-next, not including in respins
* Patch 6 - "bpf: Add bpf_rbtree_{add,remove,first} kfuncs"
* Remove KF_RELEASE_NON_OWN flag from rbtree_add, just add it to specific
kfunc matching (Alexei, David)
* Patch 9 - "bpf: Special verifier handling for bpf_rbtree_{remove, first}"
* Remove KF_INVALIDATE_NON_OWN kfunc flag, just match against specific kfunc
for now (Alexei, David)
* Patch 11 - "libbpf: Make BTF mandatory if program BTF has spin_lock or alloc_obj type"
* Drop for now, will submit separately
* Patch 12 - "selftests/bpf: Add rbtree selftests"
* Some expected-failure tests have different error messages due to "release
non-owning reference logic" being separated from KF_RELEASE logic in Patch
2 changes
* Patch 13 - "bpf, documentation: Add graph documentation for non-owning refs"
* Fix documentation formatting and improve content (David)
v1 -> v2: lore.kernel.org/bpf/20221206231000.3180914-1-davemarchevsky@fb.com/
Series-wide changes:
* Rename datastructure_{head,node,api} -> graph_{root,node,api} (Alexei)
* "graph datastructure" in patch summaries to refer to linked_list + rbtree
instead of "next-gen datastructure" (Alexei)
* Move from hacky marking of non-owning references as PTR_UNTRUSTED to
cleaner implementation (Alexei)
* Add invalidation of non-owning refs to rbtree_remove (Kumar, Alexei)
Patch #'s below refer to the patch's number in v1 unless otherwise stated.
Note that in v1 most of the meaty verifier changes were in the latter half
of the series. Here, about half of that complexity has been moved to
"bpf: Migrate release_on_unlock logic to non-owning ref semantics" - was Patch
3 in v1.
* Patch 1 - "bpf: Loosen alloc obj test in verifier's reg_btf_record"
* Was applied, dropped from further iterations
* Patch 2 - "bpf: map_check_btf should fail if btf_parse_fields fails"
* Dropped in favor of verifier check-on-use: when some normal verifier
checking expects the map to have btf_fields correctly parsed, it won't
find any and verification will fail
* New patch added before Patch 3 - "bpf: Support multiple arg regs w/ ref_obj_id for kfuncs"
* Addition of KF_RELEASE_NON_OWN flag, which requires KF_RELEASE, and tagging
of bpf_list_push_{front,back} KF_RELEASE | KF_RELEASE_NON_OWN, means that
list-in-list push_{front,back} will trigger "only one ref_obj_id arg reg"
logic. This is because "head" arg to those functions can be a list-in-list,
which itself can be an owning reference with ref_obj_id. So need to
support multiple ref_obj_id for release kfuncs.
* Patch 3 - "bpf: Minor refactor of ref_set_release_on_unlock"
* Now a major refactor w/ a rename to reflect this
* "bpf: Migrate release_on_unlock logic to non-owning ref semantics"
* Replaces release_on_unlock with active_lock logic as discussed in v1
* New patch added after Patch 3 - "selftests/bpf: Update linked_list tests for non_owning_ref logic"
* Removes "write after push" linked_list failure tests - no longer failure
scenarios.
* Patch 4 - "bpf: rename list_head -> datastructure_head in field info types"
* rename to graph_root instead. Similar renamings across the series - see
series-wide changes.
* Patch 5 - "bpf: Add basic bpf_rb_{root,node} support"
* OWNER_FIELD_MASK -> GRAPH_ROOT_MASK, OWNEE_FIELD_MASK -> GRAPH_NODE_MASK,
and change of "owner"/"ownee" in big btf_check_and_fixup_fields comment to
"root"/"node" (Alexei)
* Patch 6 - "bpf: Add bpf_rbtree_{add,remove,first} kfuncs"
* bpf_rbtree_remove can no longer return NULL. v2 continues v1's "use type
system to prevent remove of node that isn't in a datastructure" approach,
so rbtree_remove should never have been able to return NULL
* Patch 7 - "bpf: Add support for bpf_rb_root and bpf_rb_node in kfunc args"
* is_bpf_datastructure_api_kfunc -> is_bpf_graph_api_kfunc (Alexei)
* Patch 8 - "bpf: Add callback validation to kfunc verifier logic"
* Explicitly disallow rbtree_remove in rbtree callback
* Explicitly disallow bpf_spin_{lock,unlock} call in rbtree callback,
preventing possibility of "unbalanced" unlock (Alexei)
* Patch 10 - "bpf, x86: BPF_PROBE_MEM handling for insn->off < 0"
* Now that non-owning refs aren't marked PTR_UNTRUSTED it's not necessary to
include this patch as part of the series
* After conversation w/ Alexei, did another pass and submitted as an
independent series (lore.kernel.org/bpf/20221213182726.325137-1-davemarchevsky@fb.com/)
* Patch 13 - "selftests/bpf: Add rbtree selftests"
* Since bpf_rbtree_remove can no longer return null, remove null checks
* Remove test confirming that rbtree_first isn't allowed in callback. We want
this to be possible
* Add failure test confirming that rbtree_remove's new non-owning reference
invalidation behavior behaves as expected
* Add SEC("license") to rbtree_btf_fail__* progs. They were previously
failing due to lack of this section. Now they're failing for correct
reasons.
* rbtree_btf_fail__add_wrong_type.c - add locking around rbtree_add, rename
the bpf prog to something reasonable
* New patch added after patch 13 - "bpf, documentation: Add graph documentation for non-owning refs"
* Summarizes details of owning and non-owning refs which we hashed out in
v1
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Dave Marchevsky [Tue, 14 Feb 2023 00:40:17 +0000 (16:40 -0800)]
bpf, documentation: Add graph documentation for non-owning refs
It is difficult to intuit the semantics of owning and non-owning
references from verifier code. In order to keep the high-level details
from being lost in the mailing list, this patch adds documentation
explaining semantics and details.
The target audience of doc added in this patch is folks working on BPF
internals, as there's focus on "what should the verifier do here". Via
reorganization or copy-and-paste, much of the content can probably be
repurposed for BPF program writer audience as well.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230214004017.2534011-9-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Dave Marchevsky [Tue, 14 Feb 2023 00:40:16 +0000 (16:40 -0800)]
selftests/bpf: Add rbtree selftests
This patch adds selftests exercising the logic changed/added in the
previous patches in the series. A variety of successful and unsuccessful
rbtree usages are validated:
Success:
* Add some nodes, let map_value bpf_rbtree_root destructor clean them
up
* Add some nodes, remove one using the non-owning ref leftover by
successful rbtree_add() call
* Add some nodes, remove one using the non-owning ref returned by
rbtree_first() call
Failure:
* BTF where bpf_rb_root owns bpf_list_node should fail to load
* BTF where node of type X is added to tree containing nodes of type Y
should fail to load
* No calling rbtree api functions in 'less' callback for rbtree_add
* No releasing lock in 'less' callback for rbtree_add
* No removing a node which hasn't been added to any tree
* No adding a node which has already been added to a tree
* No escaping of non-owning references past their lock's
critical section
* No escaping of non-owning references past other invalidation points
(rbtree_remove)
These tests mostly focus on rbtree-specific additions, but some of the
failure cases revalidate scenarios common to both linked_list and rbtree
which are covered in the former's tests. Better to be a bit redundant in
case linked_list and rbtree semantics deviate over time.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230214004017.2534011-8-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Dave Marchevsky [Tue, 14 Feb 2023 00:40:15 +0000 (16:40 -0800)]
bpf: Add bpf_rbtree_{add,remove,first} decls to bpf_experimental.h
These kfuncs will be used by selftests in the following patches.
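For reference, an approximation of the declarations (signatures inferred from
the usage shown in the series' cover letter, not copied verbatim from
bpf_experimental.h):

	/* approximate declarations, inferred from the cover-letter example */
	extern void bpf_rbtree_add(struct bpf_rb_root *root, struct bpf_rb_node *node,
				   bool (less)(struct bpf_rb_node *a, const struct bpf_rb_node *b)) __ksym;

	extern struct bpf_rb_node *bpf_rbtree_remove(struct bpf_rb_root *root,
						     struct bpf_rb_node *node) __ksym;

	extern struct bpf_rb_node *bpf_rbtree_first(struct bpf_rb_root *root) __ksym;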
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230214004017.2534011-7-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Dave Marchevsky [Tue, 14 Feb 2023 00:40:14 +0000 (16:40 -0800)]
bpf: Special verifier handling for bpf_rbtree_{remove, first}
Newly-added bpf_rbtree_{remove,first} kfuncs have some special properties
that require handling in the verifier:
* both bpf_rbtree_remove and bpf_rbtree_first return the type containing
the bpf_rb_node field, with the offset set to that field's offset,
instead of a struct bpf_rb_node *
* mark_reg_graph_node helper added in previous patch generalizes
this logic, use it
* bpf_rbtree_remove's node input is a node that's been inserted
in the tree - a non-owning reference.
* bpf_rbtree_remove must invalidate non-owning references in order to
avoid aliasing issue. Use previously-added
invalidate_non_owning_refs helper to mark this function as a
non-owning ref invalidation point.
* Unlike other functions, which convert one of their input arg regs to
non-owning reference, bpf_rbtree_first takes no arguments and just
returns a non-owning reference (possibly null)
* For now verifier logic for this is special-cased instead of
adding new kfunc flag.
This patch, along with the previous one, complete special verifier
handling for all rbtree API functions added in this series.
With functional verifier handling of rbtree_remove, under current
non-owning reference scheme, a node type with both bpf_{list,rb}_node
fields could cause the verifier to accept programs which remove such
nodes from collections they haven't been added to.
In order to prevent this, this patch adds a check to btf_parse_fields
which rejects structs with both bpf_{list,rb}_node fields. This is a
temporary measure that can be removed after "collection identity"
followup. See comment added in btf_parse_fields. A linked_list BTF test
exercising the new check is added in this patch as well.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230214004017.2534011-6-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Dave Marchevsky [Tue, 14 Feb 2023 00:40:13 +0000 (16:40 -0800)]
bpf: Add callback validation to kfunc verifier logic
Some BPF helpers take a callback function which the helper calls. For
each helper that takes such a callback, there's a special call to
__check_func_call with a callback-state-setting callback that sets up
verifier bpf_func_state for the callback's frame.
kfuncs don't have any of this infrastructure yet, so let's add it in
this patch, following existing helper pattern as much as possible. To
validate functionality of this added plumbing, this patch adds
callback handling for the bpf_rbtree_add kfunc and hopes to lay
groundwork for future graph datastructure callbacks.
In the "general plumbing" category we have:
* check_kfunc_call doing callback verification right before clearing
CALLER_SAVED_REGS, exactly like check_helper_call
* recognition of func_ptr BTF types in kfunc args as
KF_ARG_PTR_TO_CALLBACK + propagation of subprogno for this arg type
In the "rbtree_add / graph datastructure-specific plumbing" category:
* Since bpf_rbtree_add must be called while the spin_lock associated
with the tree is held, don't complain when the callback's func_state
doesn't unlock it by frame exit
* Mark the rbtree_add callback's args with ref_set_non_owning
to prevent rbtree API functions from being called in the callback.
Semantically this makes sense, as less() takes no ownership of its
args when determining which comes first (see the sketch below).
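A hedged sketch of the callback shape being verified; node_data, groot
and glock are illustrative names, not identifiers from this patch:

static bool less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
{
        struct node_data *node_a = container_of(a, struct node_data, node);
        struct node_data *node_b = container_of(b, struct node_data, node);

        /* both args are non-owning refs inside the callback */
        return node_a->key < node_b->key;
}

/* ... and at the call site, under the tree's lock: */
bpf_spin_lock(&glock);
bpf_rbtree_add(&groot, &n->node, less); /* callback frame verified like helper callbacks */
bpf_spin_unlock(&glock);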
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230214004017.2534011-5-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Dave Marchevsky [Tue, 14 Feb 2023 00:40:12 +0000 (16:40 -0800)]
bpf: Add support for bpf_rb_root and bpf_rb_node in kfunc args
Now that we find bpf_rb_root and bpf_rb_node in structs, let's give args
that contain those types special classification and properly handle
these types when checking kfunc args.
"Properly handling" these types largely requires generalizing similar
handling for bpf_list_{head,node}, with little new logic added in this
patch.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230214004017.2534011-4-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Dave Marchevsky [Tue, 14 Feb 2023 00:40:11 +0000 (16:40 -0800)]
bpf: Add bpf_rbtree_{add,remove,first} kfuncs
This patch adds implementations of bpf_rbtree_{add,remove,first}
and teaches the verifier about their BTF_IDs as well as those of
bpf_rb_{root,node}.
All three kfuncs have some nonstandard component to their verification
that needs to be addressed in future patches before programs can
properly use them:
* bpf_rbtree_add: Takes 'less' callback, need to verify it
* bpf_rbtree_first: Returns ptr_to_node_type(off=rb_node_off) instead
of ptr_to_rb_node(off=0). Return value ref is
non-owning.
* bpf_rbtree_remove: Returns ptr_to_node_type(off=rb_node_off) instead
of ptr_to_rb_node(off=0). 2nd arg (node) is a
non-owning reference.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230214004017.2534011-3-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Dave Marchevsky [Tue, 14 Feb 2023 00:40:10 +0000 (16:40 -0800)]
bpf: Add basic bpf_rb_{root,node} support
This patch adds special BPF_RB_{ROOT,NODE} btf_field_types similar to
BPF_LIST_{HEAD,NODE}, adds the necessary plumbing to detect the new
types, and adds bpf_rb_root_free function for freeing bpf_rb_root in
map_values.
structs bpf_rb_root and bpf_rb_node are opaque types meant to
obscure structs rb_root_cached and rb_node, respectively.
btf_struct_access will prevent BPF programs from touching these special
fields automatically now that they're recognized.
btf_check_and_fixup_fields now groups list_head and rb_root together as
"graph root" fields and {list,rb}_node as "graph node", and does the same
ownership cycle checking as before. Note that this function does _not_
prevent ownership type mixups (e.g. rb_root owning list_node) - that's
handled by btf_parse_graph_root.
After this patch, a bpf program can have a struct bpf_rb_root in a
map_value, but not add anything to nor do anything useful with it.
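A sketch of what a program can now declare (but not yet usefully
operate on); the __contains annotation mirrors the existing
bpf_list_head convention and the struct names are assumptions:

struct node_data {
        long data;
        struct bpf_rb_node node;
};

struct map_value {
        struct bpf_spin_lock lock;
        struct bpf_rb_root root __contains(node_data, node);
};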
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230214004017.2534011-2-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Dave Marchevsky [Sun, 12 Feb 2023 09:27:07 +0000 (01:27 -0800)]
bpf: Migrate release_on_unlock logic to non-owning ref semantics
This patch introduces non-owning reference semantics to the verifier,
specifically linked_list API kfunc handling. release_on_unlock logic for
refs is refactored - with small functional changes - to implement these
semantics, and bpf_list_push_{front,back} are migrated to use them.
When a list node is pushed to a list, the program still has a pointer to
the node:
n = bpf_obj_new(typeof(*n));
bpf_spin_lock(&l);
bpf_list_push_back(&l, n);
/* n still points to the just-added node */
bpf_spin_unlock(&l);
What the verifier considers n to be after the push, and thus what can be
done with n, are changed by this patch.
Common properties both before/after this patch:
* After push, n is only a valid reference to the node until end of
critical section
* After push, n cannot be pushed to any list
* After push, the program can read the node's fields using n
Before:
* After push, n retains the ref_obj_id which it received on
bpf_obj_new, but the associated bpf_reference_state's
release_on_unlock field is set to true
* release_on_unlock field and associated logic is used to implement
"n is only a valid ref until end of critical section"
* After push, n cannot be written to; the node must be removed from
the list before writing to its fields
* After push, n is marked PTR_UNTRUSTED
After:
* After push, n's ref is released and ref_obj_id set to 0. NON_OWN_REF
type flag is added to reg's type, indicating that it's a non-owning
reference.
* NON_OWN_REF flag and logic are used to implement "n is only a
valid ref until end of critical section"
* n can be written to (except for special fields e.g. bpf_list_node,
timer, ...); see the snippet after this list
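As an illustrative follow-on to the example above under the new
semantics (the 'data' field name is hypothetical):

n = bpf_obj_new(typeof(*n));
bpf_spin_lock(&l);
bpf_list_push_back(&l, n);
n->data = 42;           /* now allowed: n is a non-owning reference */
bpf_spin_unlock(&l);
/* past this point n may no longer be used; the unlock invalidated it */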
Summary of specific implementation changes to achieve the above:
* release_on_unlock field, ref_set_release_on_unlock helper, and logic
to "release on unlock" based on that field are removed
* The anonymous active_lock struct used by bpf_verifier_state is
pulled out into a named struct bpf_active_lock.
* NON_OWN_REF type flag is introduced along with verifier logic
changes to handle non-owning refs
* Helpers are added to use NON_OWN_REF flag to implement non-owning
ref semantics as described above
* invalidate_non_owning_refs - helper to clobber all non-owning refs
matching a particular bpf_active_lock identity. Replaces
release_on_unlock logic in process_spin_lock.
* ref_set_non_owning - set NON_OWN_REF type flag after doing some
sanity checking
* ref_convert_owning_non_owning - convert owning reference w/
specified ref_obj_id to non-owning references. Set NON_OWN_REF
flag for each reg with that ref_obj_id and 0-out its ref_obj_id
* Update linked_list selftests to account for minor semantic
differences introduced by this patch
* Writes to a release_on_unlock node ref are not allowed, while
writes to non-owning reference pointees are. As a result the
linked_list "write after push" failure tests are no longer scenarios
that should fail.
* The test##missing_lock##op and test##incorrect_lock##op
macro-generated failure tests need to have a valid node argument in
order to have the same error output as before. Otherwise
verification will fail early and the expected error output won't be seen.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230212092715.1422619-2-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann [Mon, 13 Feb 2023 18:13:13 +0000 (19:13 +0100)]
Merge branch 'xdp-ice-mbuf'
Alexander Lobakin says:
====================
The set grew from the poor performance of %BPF_F_TEST_XDP_LIVE_FRAMES
when the ice-backed device is a sender. Initially there were around
3.3 Mpps / thread, while I have 5.5 on skb-based pktgen ...
After fixing 0005 first (0004 is a prereq for it; strange that nobody
noticed it earlier), I started catching random OOMs. This is how 0002
(and partially 0001) appeared.
0003 is a suggestion from Maciej to not waste time on refactoring dead
lines. 0006 is a "cherry on top" to get away with the final 6.7 Mpps.
4.5 of 6 are fixes, but only the first three are tagged, since it then
starts being tricky. I may backport them manually later on.
TL;DR for the series is that shortcuts are good, but only as long as
they don't make the driver miss important things. %XDP_TX is purely
driver-local, however .ndo_xdp_xmit() is not, and sometimes assumptions
can be unsafe there.
With that series and also one core code patch[0], "live frames" and
xdp-trafficgen are now safe'n'fast on ice (probably more to come).
[0] https://lore.kernel.org/all/20230209172827.874728-1-alexandr.lobakin@intel.com
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Alexander Lobakin [Fri, 10 Feb 2023 17:06:18 +0000 (18:06 +0100)]
ice: Micro-optimize .ndo_xdp_xmit() path
After the recent mbuf changes, ice_xmit_xdp_ring() became a 3-liner.
It makes no sense to keep it global in a different file than its caller.
Move it just next to the sole call site and mark static. Also, it
doesn't need a full xdp_convert_frame_to_buff(). Save several cycles
and fill only the fields used by __ice_xmit_xdp_ring() later on.
Finally, since it doesn't modify @xdpf anyhow, mark the argument const
to save some more (whole -11 bytes of .text! :D).
Thanks to one fewer jump and fewer calcs as well, this yields as many as
6.7 Mpps per queue. `xdp.data_hard_start = xdpf` is fully intentional
again (see xdp_convert_buff_to_frame()) and just works when there are
no source device's driver issues.
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20230210170618.1973430-7-alexandr.lobakin@intel.com
Alexander Lobakin [Fri, 10 Feb 2023 17:06:17 +0000 (18:06 +0100)]
ice: Fix freeing XDP frames backed by Page Pool
As already mentioned, freeing any &xdp_frame via page_frag_free() is
wrong, as it assumes the frame is backed by either an order-0 page or
a page with no "patrons" behind it, while in fact frames backed by
Page Pool can be redirected to a device whose driver doesn't use it.
Keep storing a pointer to the raw buffer and then freeing it
unconditionally via page_frag_free() for %XDP_TX frames, but introduce
a separate type in the enum for frames coming through .ndo_xdp_xmit(),
and free them via xdp_return_frame_bulk(). Note that saving xdpf as
xdp_buff->data_hard_start is intentional and is always true when
everything is configured properly.
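A rough sketch of the cleanup dispatch described above; the buffer-type
enum and field names are assumptions about the driver, not quoted code
(tx_buf and the xdp_frame_bulk 'bq' come from the cleaning loop):

switch (tx_buf->type) {
case ICE_TX_BUF_XDP_TX:                         /* driver-local XDP_TX frames */
        page_frag_free(tx_buf->raw_buf);
        break;
case ICE_TX_BUF_XDP_XMIT:                       /* frames from .ndo_xdp_xmit() */
        xdp_return_frame_bulk(tx_buf->xdpf, &bq);
        break;
}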
After this change, %XDP_REDIRECT from a Page Pool based driver to ice
becomes zero-alloc as it should be and horrendous 3.3 Mpps / queue
turn into 6.6, hehe.
Let it go with no "Fixes:" tag as it spans across a good 5+ commits and
can't be trivially backported.
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20230210170618.1973430-6-alexandr.lobakin@intel.com
Alexander Lobakin [Fri, 10 Feb 2023 17:06:16 +0000 (18:06 +0100)]
ice: Robustify cleaning/completing XDP Tx buffers
When queueing frames from a Page Pool for redirecting to a device backed
by the ice driver, `perf top` shows heavy load on page_alloc() and
page_frag_free(), despite that on a properly working system it must be
fully or at least almost zero-alloc. The problem is in fact a bit deeper
and arises from how ice cleans up completed Tx buffers.
The story so far: when cleaning/freeing the resources related to
a particular completed Tx frame (skbs, DMA mappings etc.), ice uses some
heuristics only without setting any type explicitly (except for dummy
Flow Director packets, which are marked via ice_tx_buf::tx_flags).
This kinda works, but only up to some point. For example, currently ice
assumes that each frame coming to __ice_xmit_xdp_ring() is backed by
either a plain order-0 page or a plain page frag, while it may also be
backed by Page Pool or any other possible memory models introduced in
future. This means any &xdp_frame must be freed properly via
xdp_return_frame() family with no assumptions.
In order to do that, the whole heuristics must be replaced with setting
the Tx buffer/frame type explicitly, just how it's always been done via
an enum. Let us reuse 16 bits from ::tx_flags -- 1 bit-and instr won't
hurt much -- especially given that sometimes there was a check for
%ICE_TX_FLAGS_DUMMY_PKT, which is now turned from a flag to an enum
member. The rest of the changes are straightforward, and most of them are
just a conversion to rely on the type set in &ice_tx_buf rather than
on some secondary properties.
For now, no functional changes are intended; the change only prepares the
ground for freeing XDP frames properly in the next step. And it must
be done atomically/synchronously to not break stuff.
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20230210170618.1973430-5-alexandr.lobakin@intel.com
Alexander Lobakin [Fri, 10 Feb 2023 17:06:15 +0000 (18:06 +0100)]
ice: Remove two impossible branches on XDP Tx cleaning
The tagged commit started sending %XDP_TX frames from the XSk Rx ring
directly without converting them to an &xdp_frame. However, when XSk is
enabled on a queue pair, it has its separate Tx cleaning functions, so
neither ice_clean_xdp_irq() nor ice_unmap_and_free_tx_buf() ever happens
there.
Remove impossible branches in order to reduce the diffstat of the
upcoming change.
Fixes:
a24b4c6e9aab ("ice: xsk: Do not convert to buff to frame for XDP_TX")
Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20230210170618.1973430-4-alexandr.lobakin@intel.com
Alexander Lobakin [Fri, 10 Feb 2023 17:06:14 +0000 (18:06 +0100)]
ice: Fix XDP Tx ring overrun
Sometimes, under heavy XDP Tx traffic, e.g. when using XDP traffic
generator (%BPF_F_TEST_XDP_LIVE_FRAMES), the machine can catch OOM due
to the driver not freeing all of the pages passed to it by
.ndo_xdp_xmit().
It turned out that during the development of the tagged commit, the check
which ensures that we have a free descriptor to queue a frame moved
into the branch taken only when a buffer has frags. Otherwise, we
only run a cleaning cycle, but don't check anything.
At the same time, there can be situations when the driver gets new frames to send,
but there are no buffers that can be cleaned/completed and the ring has
no free slots. It's very rare, but still possible (> 6.5 Mpps per ring).
The driver then fills the next buffer/descriptor, effectively
overwriting the data, which still needs to be freed.
Restore the check after the cleaning routine to make sure there is a
slot to queue a new frame. When there are frags, there still will be a
separate check that we can place all of them, but if the ring is full,
there's no point in wasting any more time.
(minor: make `!ready_frames` unlikely since it happens ~1-2 times per
billion frames)
Fixes:
3246a10752a7 ("ice: Add support for XDP multi-buffer on Tx side")
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20230210170618.1973430-3-alexandr.lobakin@intel.com
Alexander Lobakin [Fri, 10 Feb 2023 17:06:13 +0000 (18:06 +0100)]
ice: fix ice_tx_ring::xdp_tx_active underflow
xdp_tx_active is used to indicate whether an XDP ring has any %XDP_TX
frames queued to shortcut processing Tx cleaning for XSk-enabled queues.
When !XSk, it simply indicates whether the ring has any queued frames in
general.
It gets increased on each frame placed onto the ring and counts the
whole frame, not each frag. However, currently it gets decremented in
ice_clean_xdp_tx_buf(), which is called per each buffer, i.e. per each
frag. Thus, on completing multi-frag frames, an underflow happens.
Move the decrement to the outer function and do it once per frame, not
buf. Also, do that on the stack and update the ring counter after the
loop is done to save several cycles.
XSk rings are fine since there are no frags at the moment.
Fixes:
3246a10752a7 ("ice: Add support for XDP multi-buffer on Tx side")
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20230210170618.1973430-2-alexandr.lobakin@intel.com
Ilya Leoshkevich [Wed, 8 Feb 2023 23:12:11 +0000 (00:12 +0100)]
selftests/bpf: Fix out-of-srctree build
Building BPF selftests out of srctree fails with:
make: *** No rule to make target '/linux-build//ima_setup.sh', needed by 'ima_setup.sh'. Stop.
The culprit is the rule that defines convenient shorthands like
"make test_progs", which builds $(OUTPUT)/test_progs. These shorthands
make sense only for binaries that are built though; scripts that live
in the source tree do not end up in $(OUTPUT).
Therefore drop $(TEST_PROGS) and $(TEST_PROGS_EXTENDED) from the rule.
The issue has existed for a while, but it became a problem only after commit
d68ae4982cb7 ("selftests/bpf: Install all required files to run selftests"),
which added dependencies on these scripts.
Fixes:
03dcb78460c2 ("selftests/bpf: Add simple per-test targets to Makefile")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230208231211.283606-1-iii@linux.ibm.com
Alan Maguire [Thu, 9 Feb 2023 13:28:51 +0000 (13:28 +0000)]
bpf: Add --skip_encoding_btf_inconsistent_proto, --btf_gen_optimized to pahole flags for v1.25
v1.25 of pahole supports filtering out functions with multiple inconsistent
function prototypes or optimized-out parameters from the BTF representation.
These present problems because there is no additional info in BTF saying which
inconsistent prototype matches which function instance to help guide attachment,
and functions with optimized-out parameters can lead to incorrect assumptions
about register contents.
So for now, filter out such functions while adding BTF representations for
functions that have "."-suffixes (foo.isra.0) but not optimized-out parameters.
This patch assumes that the below-linked changes land in pahole for v1.25.
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/1675790102-23037-1-git-send-email-alan.maguire@oracle.com
Link: https://lore.kernel.org/bpf/1675949331-27935-1-git-send-email-alan.maguire@oracle.com
Alexei Starovoitov [Sat, 11 Feb 2023 02:59:57 +0000 (18:59 -0800)]
Merge branch 'bpf, mm: introduce cgroup.memory=nobpf'
Yafang Shao says:
====================
The bpf memory accounting has some known problems in container
environments:
- The container memory usage is not consistent if there's pinned bpf
program
After the container restarts, the leftover bpf programs won't be accounted
to the new generation, so the memory usage of the container is not
consistent. This issue can be resolved by introducing a selectable
memcg, but we don't have an agreement on the solution yet. See also
the discussions at https://lwn.net/Articles/905150/ .
- The leftover non-preallocated bpf map can't be limited
The leftover bpf map will be reparented, and thus it will be limited by
the parent, rather than the container itself. Furthermore, if the
parent is destroyed, it will be limited by its parent's parent, and so
on. It can also be resolved by introducing a selectable memcg.
- The memory dynamically allocated in bpf prog is charged into root memcg
only
Nowadays the bpf prog can dynamically allocate memory, for example via
bpf_obj_new(), but it only allocates from the global bpf_mem_alloc
pool, so it will be charged into the root memcg only. That needs to be
addressed by a new proposal.
So let's give the container user an option to disable bpf memory accounting.
The idea of "cgroup.memory=nobpf" is originally from Tejun[1].
[1]. https://lwn.net/ml/linux-mm/YxjOawzlgE458ezL@slm.duckdns.org/
Changes,
v1->v2:
- squash patches (Roman)
- commit log improvement in patch #2. (Johannes)
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yafang Shao [Fri, 10 Feb 2023 15:47:34 +0000 (15:47 +0000)]
bpf: allow to disable bpf prog memory accounting
We can simply disable the bpf prog memory accounting by not setting
GFP_ACCOUNT.
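A sketch of the approach, assuming a runtime toggle named
memcg_bpf_enabled() (name inferred from the cover letter, not quoted
from the patch):

static inline gfp_t bpf_memcg_flags(gfp_t flags)
{
        if (memcg_bpf_enabled())
                return flags | __GFP_ACCOUNT;
        return flags;
}

Allocation sites would then use e.g. bpf_memcg_flags(GFP_USER | __GFP_NOWARN)
instead of hard-coding __GFP_ACCOUNT.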
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lore.kernel.org/r/20230210154734.4416-5-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yafang Shao [Fri, 10 Feb 2023 15:47:33 +0000 (15:47 +0000)]
bpf: allow to disable bpf map memory accounting
We can simply set root memcg as the map's memcg to disable bpf memory
accounting. bpf_map_area_alloc is a little special as it gets the memcg
from current rather than from the map, so we need to disable GFP_ACCOUNT
specifically for it.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lore.kernel.org/r/20230210154734.4416-4-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yafang Shao [Fri, 10 Feb 2023 15:47:32 +0000 (15:47 +0000)]
bpf: use bpf_map_kvcalloc in bpf_local_storage
Introduce a new helper bpf_map_kvcalloc() for the memory allocation in
bpf_local_storage(). Then the allocation will charge the memory to the
map's memcg instead of current's, though currently they are the same thing
as it is only used in the map creation path now. Charging the map's
memory to the map's own memcg makes the ownership clearer.
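The presumed shape of the new helper, by analogy with
bpf_map_kmalloc_node(): charge the allocation to the map's memcg rather
than current's:

void *bpf_map_kvcalloc(struct bpf_map *map, size_t n, size_t size,
                       gfp_t flags);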
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lore.kernel.org/r/20230210154734.4416-3-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yafang Shao [Fri, 10 Feb 2023 15:47:31 +0000 (15:47 +0000)]
mm: memcontrol: add new kernel parameter cgroup.memory=nobpf
Add a new kernel parameter, cgroup.memory=nobpf, to allow users to disable
bpf memory accounting. This is a preparation for the followup patch.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lore.kernel.org/r/20230210154734.4416-2-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann [Fri, 10 Feb 2023 22:40:31 +0000 (23:40 +0100)]
docs, bpf: Ensure IETF's BPF mailing list gets copied for ISA doc changes
Given BPF is increasingly being used beyond just the Linux kernel, with
implementations in NICs and other hardware, Windows, etc, there is an
ongoing effort to document and standardize parts of the existing BPF
infrastructure such as its ISA. As "source of truth" we decided some
time ago to rely on the in-tree documentation, in particular, starting
out with the Documentation/bpf/instruction-set.rst as a base for later
RFC drafts on the ISA. Therefore, we want to ensure that changes to that
document have bpf@ietf.org in Cc, so add a MAINTAINERS file entry with
a section on documents related to standardization efforts. For now, this
only relates to instruction-set.rst, and later additional files will be
added.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Dave Thaler <dthaler@microsoft.com>
Cc: bpf@ietf.org
Link: https://datatracker.ietf.org/doc/bofreq-thaler-bpf-ebpf/
Link: https://lore.kernel.org/r/57619c0dd8e354d82bf38745f99405e3babdc970.1676068387.git.daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jakub Kicinski [Sat, 11 Feb 2023 01:51:27 +0000 (17:51 -0800)]
Daniel Borkmann says:
====================
pull-request: bpf-next 2023-02-11
We've added 96 non-merge commits during the last 14 day(s) which contain
a total of 152 files changed, 4884 insertions(+), 962 deletions(-).
There is a minor conflict in drivers/net/ethernet/intel/ice/ice_main.c
between commit
5b246e533d01 ("ice: split probe into smaller functions")
from the net-next tree and commit
66c0e13ad236 ("drivers: net: turn on
XDP features") from the bpf-next tree. Remove the hunk given ice_cfg_netdev()
is otherwise there a 2nd time, and add XDP features to the existing
ice_cfg_netdev() one:
[...]
ice_set_netdev_features(netdev);
netdev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT |
NETDEV_XDP_ACT_XSK_ZEROCOPY;
ice_set_ops(netdev);
[...]
Stephen's merge conflict mail:
https://lore.kernel.org/bpf/20230207101951.21a114fa@canb.auug.org.au/
The main changes are:
1) Add support for BPF trampoline on s390x which finally allows to remove many
test cases from the BPF CI's DENYLIST.s390x, from Ilya Leoshkevich.
2) Add multi-buffer XDP support to ice driver, from Maciej Fijalkowski.
3) Add capability to export the XDP features supported by the NIC.
Along with that, add an XDP compliance test tool,
from Lorenzo Bianconi & Marek Majtyka.
4) Add __bpf_kfunc tag for marking kernel functions as kfuncs,
from David Vernet.
5) Add a deep dive documentation about the verifier's register
liveness tracking algorithm, from Eduard Zingerman.
6) Fix and follow-up cleanups for resolve_btfids to be compiled
as a host program to avoid cross compile issues,
from Jiri Olsa & Ian Rogers.
7) Batch of fixes to the BPF selftest for xdp_hw_metadata which resulted
when testing on different NICs, from Jesper Dangaard Brouer.
8) Fix libbpf to better detect kernel version code on Debian, from Hao Xiang.
9) Extend libbpf to add an option for when the perf buffer should
wake up, from Jon Doron.
10) Follow-up fix on xdp_metadata selftest to just consume on TX
completion, from Stanislav Fomichev.
11) Extend the kfuncs.rst document with description on kfunc
lifecycle & stability expectations, from David Vernet.
12) Fix bpftool prog profile to skip attaching to offline CPUs,
from Tonghao Zhang.
====================
Link: https://lore.kernel.org/r/20230211002037.8489-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Randy Dunlap [Thu, 9 Feb 2023 07:13:44 +0000 (23:13 -0800)]
Documentation: isdn: correct spelling
Correct spelling problems for Documentation/isdn/ as reported
by codespell.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20230209071400.31476-9-rdunlap@infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sat, 11 Feb 2023 00:23:05 +0000 (16:23 -0800)]
Merge branch 'net-move-more-duplicate-code-of-ovs-and-tc-conntrack-into-nf_conntrack_ovs'
Xin Long says:
====================
net: move more duplicate code of ovs and tc conntrack into nf_conntrack_ovs
We've moved some duplicate code into nf_nat_ovs in:
"net: eliminate the duplicate code in the ct nat functions of ovs and tc"
This patchset addresses more code duplication in the conntrack code of ovs
and tc, then creates nf_conntrack_ovs for them; four functions will
be extracted and moved into it:
nf_ct_handle_fragments()
nf_ct_skb_network_trim()
nf_ct_helper()
nf_ct_add_helper()
====================
Link: https://lore.kernel.org/r/cover.1675810210.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long [Tue, 7 Feb 2023 22:52:10 +0000 (17:52 -0500)]
net: extract nf_ct_handle_fragments to nf_conntrack_ovs
Now handle_fragments() in OVS and TC have similar code, and
this patch removes the duplicate code by moving the function
to nf_conntrack_ovs.
Note that skb_clear_hash(skb) or skb->ignore_df = 1 should be
done only when defrag returns 0, as it does in other places
in the kernel.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long [Tue, 7 Feb 2023 22:52:09 +0000 (17:52 -0500)]
net: sched: move frag check and tc_skb_cb update out of handle_fragments
This patch has no functional changes and just moves frag check and
tc_skb_cb update out of handle_fragments, to make it easier to move
the duplicate code from handle_fragments() into nf_conntrack_ovs later.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long [Tue, 7 Feb 2023 22:52:08 +0000 (17:52 -0500)]
openvswitch: move key and ovs_cb update out of handle_fragments
This patch has no functional changes and just moves key and ovs_cb update
out of handle_fragments, and skb_clear_hash() and skb->ignore_df change
into handle_fragments(), to make it easier to move the duplicate code
from handle_fragments() into nf_conntrack_ovs later.
Note that it changes to pass info->family to handle_fragments() instead
of key for the packet type check, as info->family is set according to
key->eth.type in ovs_ct_copy_action() when creating the action.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long [Tue, 7 Feb 2023 22:52:07 +0000 (17:52 -0500)]
net: extract nf_ct_skb_network_trim function to nf_conntrack_ovs
There is almost the same code in ovs_skb_network_trim() and
tcf_ct_skb_network_trim(); this patch extracts it into a function,
nf_ct_skb_network_trim(), and moves the function to nf_conntrack_ovs.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long [Tue, 7 Feb 2023 22:52:06 +0000 (17:52 -0500)]
net: create nf_conntrack_ovs for ovs and tc use
Similar to nf_nat_ovs created by Commit
ebddb1404900 ("net: move the
nat function to nf_nat_ovs for ovs and tc"), this patch creates
nf_conntrack_ovs to get these functions shared by OVS and TC only.
nf_ct_helper() and nf_ct_add_helper() are moved from nf_conntrack_helper
in this patch, and more will be moved in the following patches.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ilya Leoshkevich [Fri, 10 Feb 2023 00:12:01 +0000 (01:12 +0100)]
libbpf: Fix alen calculation in libbpf_nla_dump_errormsg()
The code assumes that everything that follows the nlmsgerr consists of nlattrs.
When calculating their size, it does not account for the initial
nlmsghdr. This may lead to accessing uninitialized memory.
Fixes:
bbf48c18ee0c ("libbpf: add error reporting in XDP")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230210001210.395194-8-iii@linux.ibm.com
Ilya Leoshkevich [Fri, 10 Feb 2023 00:12:00 +0000 (01:12 +0100)]
selftests/bpf: Attach to fopen()/fclose() in attach_probe
malloc() and free() may be completely replaced by sanitizers; use
fopen() and fclose() instead.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230210001210.395194-7-iii@linux.ibm.com
Ilya Leoshkevich [Fri, 10 Feb 2023 00:11:59 +0000 (01:11 +0100)]
selftests/bpf: Attach to fopen()/fclose() in uprobe_autoattach
malloc() and free() may be completely replaced by sanitizers; use
fopen() and fclose() instead.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230210001210.395194-6-iii@linux.ibm.com
Ilya Leoshkevich [Fri, 10 Feb 2023 00:11:58 +0000 (01:11 +0100)]
selftests/bpf: Forward SAN_CFLAGS and SAN_LDFLAGS to runqslower and libbpf
To get useful results from the Memory Sanitizer, all code running in a
process needs to be instrumented. When building tests with other
sanitizers, it's not strictly necessary, but is also helpful.
So make sure runqslower and libbpf are compiled with SAN_CFLAGS and
linked with SAN_LDFLAGS.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230210001210.395194-5-iii@linux.ibm.com
Ilya Leoshkevich [Fri, 10 Feb 2023 00:11:57 +0000 (01:11 +0100)]
selftests/bpf: Split SAN_CFLAGS and SAN_LDFLAGS
Memory Sanitizer requires passing different options to CFLAGS and
LDFLAGS: besides the mandatory -fsanitize=memory, one needs to pass
header and library paths, and passing -L to a compilation step
triggers -Wunused-command-line-argument. So introduce a separate
variable for linker flags. Use $(SAN_CFLAGS) as a default in order to
avoid complicating the ASan usage.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230210001210.395194-4-iii@linux.ibm.com
Ilya Leoshkevich [Fri, 10 Feb 2023 00:11:56 +0000 (01:11 +0100)]
tools: runqslower: Add EXTRA_CFLAGS and EXTRA_LDFLAGS support
This makes it possible to add sanitizer flags.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230210001210.395194-3-iii@linux.ibm.com
Ilya Leoshkevich [Fri, 10 Feb 2023 00:11:55 +0000 (01:11 +0100)]
selftests/bpf: Quote host tools
Using HOSTCC="ccache clang" breaks building the tests, since, when it's
forwarded to e.g. bpftool, the child make sees HOSTCC=ccache and
"clang" is considered a target. Fix by quoting it, and also HOSTLD and
HOSTAR for consistency.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230210001210.395194-2-iii@linux.ibm.com
Jakub Kicinski [Thu, 9 Feb 2023 06:06:42 +0000 (22:06 -0800)]
net: skbuff: drop the word head from skb cache
skbuff_head_cache is misnamed (perhaps for historical reasons?)
because it does not hold heads. Head is the buffer which skb->data
points to, and also where shinfo lives. struct sk_buff is a metadata
structure, not the head.
Eric recently added skb_small_head_cache (which allocates actual
head buffers), let that serve as an excuse to finally clean this up :)
Leave the user-space visible name intact, it could possibly be uAPI.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 10 Feb 2023 08:06:33 +0000 (08:06 +0000)]
Merge branch 'net-ipa-GSI'
Alex Elder says:
====================
net: ipa: prepare for GSI register updates
An upcoming series (or two) will convert the definitions of GSI
registers used by IPA so they use the "IPA reg" mechanism to specify
register offsets and their fields. This will simplify implementing
the fairly large number of changes required in GSI registers to
support more than 32 GSI channels (introduced in IPA v5.0).
A few minor problems and inconsistencies were found, and they're
fixed here. The last three patches in this series change the
"ipa_reg" code to separate the IPA-specific part (the base virtual
address, basically) from the generic register part, and the now-
generic code is renamed to use just "reg_" or "REG_" as a prefix
rather than "ipa_reg" or "IPA_REG_".
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 8 Feb 2023 20:56:53 +0000 (14:56 -0600)]
net: ipa: generalize register field functions
Rename functions related to register fields so they don't appear to
be IPA-specific, and move their definitions into "reg.h":
ipa_reg_fmask() -> reg_fmask()
ipa_reg_bit() -> reg_bit()
ipa_reg_field_max() -> reg_field_max()
ipa_reg_encode() -> reg_encode()
ipa_reg_decode() -> reg_decode()
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 8 Feb 2023 20:56:52 +0000 (14:56 -0600)]
net: ipa: generalize register offset functions
Rename ipa_reg_offset() to be reg_offset() and move its definition
to "reg.h". Rename ipa_reg_n_offset() to be reg_n_offset() also.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 8 Feb 2023 20:56:51 +0000 (14:56 -0600)]
net: ipa: start generalizing "ipa_reg"
IPA register definitions have evolved with each new version. The
changes required to support more than 32 endpoints in IPA v5.0 made
it best to define a unified mechanism for defining registers and
their fields.
GSI register definitions, meanwhile, have remained fairly stable.
And even as the total number of IPA endpoints goes beyond 32, the
number of GSI channels on a given EE that underlie endpoints still
remains 32 or less.
Despite that, GSI v3.0 (which is used with IPA v5.0) extends the
number of channels (and events) it supports to be about 256, and as
a result, many GSI register definitions must change significantly.
To address this, we'll use the same "ipa_reg" mechanism to define
the GSI registers.
As a first step in generalizing the "ipa_reg" to also support GSI
registers, isolate the definitions of the "ipa_reg" and "ipa_regs"
structure types (and some supporting macros) into a new header file,
and remove the "ipa_" and "IPA_" from symbol names.
Separate the IPA register ID validity checking from the generic
check that a register ID is in range. Aside from that, this is
intended to have no functional effect on the code.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 8 Feb 2023 20:56:50 +0000 (14:56 -0600)]
net: ipa: GSI register cleanup
Move some static inline function definitions out of "gsi_reg.h" and
into "gsi.c", which is the only place they're used. Rename them so
their names identify the register they're associated with.
Move the gsi_channel_type enumerated type definition below the
offset and field definitions for the CH_C_CNTXT_0 register where
it's used.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 8 Feb 2023 20:56:49 +0000 (14:56 -0600)]
net: ipa: use bitmasks for GSI IRQ values
There are seven GSI interrupt types that can be signaled by a single
GSI IRQ. These are represented in a bitmask, and the gsi_irq_type_id
enumerated type defines what each bit position represents.
Similarly, the global and general GSI interrupt types each have a set
of conditions they signal, and both types have an enumerated type
that defines which bit represents each condition.
When used, these enumerated values are passed as an argument to BIT()
in *all* cases. So clean up the code a little bit by defining the
enumerated type values as one-bit masks rather than bit positions.
Rename gsi_general_id to be gsi_general_irq_id for consistency.
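An illustrative "after" form of the change: enumerators become one-bit
masks usable directly rather than positions passed to BIT(). Only a few
entries are shown and the exact names may differ from the driver:

enum gsi_irq_type_id {
        GSI_CH_CTRL     = BIT(0),       /* was 0x0, used as BIT(GSI_CH_CTRL) */
        GSI_EV_CTRL     = BIT(1),       /* was 0x1 */
        GSI_GLOB_EE     = BIT(2),       /* was 0x2 */
        /* ... */
};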
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 8 Feb 2023 20:56:48 +0000 (14:56 -0600)]
net: ipa: tighten up IPA register validity checking
When checking the validity of an IPA register ID, compare it against
all possible ipa_reg_id values.
Rename the function ipa_reg_id_valid() to be specific about what's
being checked.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 8 Feb 2023 20:56:47 +0000 (14:56 -0600)]
net: ipa: add some new IPA versions
Soon IPA v5.0+ will be supported, and when that happens we will be
able to enable support for the SDX65 (IPA v5.0), SM8450 (IPA v5.1),
and SM8550 (IPA v5.5).
Fix the comment about the GSI version used for IPA v3.1.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 8 Feb 2023 20:56:46 +0000 (14:56 -0600)]
net: ipa: get rid of ipa->reg_addr
The reg_addr field in the IPA structure is set but never used.
Get rid of it.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 8 Feb 2023 20:56:45 +0000 (14:56 -0600)]
net: ipa: generic command param fix
Starting at IPA v4.11, the GSI_GENERIC_COMMAND GSI register got a
new PARAMS field. The code that encodes a value into that field
sets it unconditionally, which is wrong.
We currently only provide 0 as the field's value, so this error has
no real effect. Still, it's a bug, so let's fix it.
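A hedged sketch of the fix direction (the field mask name is an
assumption, not quoted from the patch): only encode the PARAMS field on
IPA versions that define it.

if (gsi->version >= IPA_VERSION_4_11)
        val |= u32_encode_bits(params, GENERIC_PARAMS_FMASK);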
Fix an (unrelated) incorrect comment as well. Fields in the
ERROR_LOG GSI register actually *are* defined for IPA versions
prior to v3.5.1.
Fixes:
fe68c43ce388 ("net: ipa: support enhanced channel flow control")
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Horatiu Vultur [Wed, 8 Feb 2023 13:08:39 +0000 (14:08 +0100)]
net: microchip: vcap: Add tc flower keys for lan966x
Add the following TC flower filter keys to lan966x for IS2:
- ipv4_addr (sip and dip)
- ipv6_addr (sip and dip)
- control (IPv4 fragments)
- portnum (tcp and udp port numbers)
- basic (L3 and L4 protocol)
- vlan (outer vlan tag info)
- tcp (tcp flags)
- ip (tos field)
As the parsing of these keys is similar between lan966x and sparx5, move
the code in a separate file to be shared by these 2 chips. And put the
specific parsing outside of the common functions.
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 10 Feb 2023 08:00:05 +0000 (08:00 +0000)]
Merge tag 'rxrpc-next-20230208' of git://git./linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc development
Here are some miscellaneous changes for rxrpc:
(1) Use consume_skb() rather than kfree_skb_reason().
(2) Fix unnecessary waking when poking an already-poked call.
(3) Add ack.rwind to the rxrpc_tx_ack tracepoint as this indicates how
many incoming DATA packets we're telling the peer that we are
currently willing to accept on this call.
(4) Reduce duplicate ACK transmission. We send ACKs to let the peer know
that we're increasing the receive window (ack.rwind) as we consume
packets locally. Normal ACK transmission is triggered in three places
and that leads to duplicates being sent.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Clément Léger [Wed, 8 Feb 2023 16:12:49 +0000 (17:12 +0100)]
net: pcs: rzn1-miic: remove unused struct members and use miic variable
Remove unused bulk clocks struct from the miic state and use an already
existing miic variable in miic_config().
Signed-off-by: Clément Léger <clement.leger@bootlin.com>
Link: https://lore.kernel.org/r/20230208161249.329631-1-clement.leger@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Andy Shevchenko [Wed, 8 Feb 2023 13:31:53 +0000 (15:31 +0200)]
openvswitch: Use string_is_terminated() helper
Use the string_is_terminated() helper instead of a specific memchr() call.
This better shows the intention of the call.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230208133153.22528-3-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Andy Shevchenko [Wed, 8 Feb 2023 13:31:52 +0000 (15:31 +0200)]
genetlink: Use string_is_terminated() helper
Use the string_is_terminated() helper instead of a specific memchr() call.
This better shows the intention of the call.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230208133153.22528-2-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Andy Shevchenko [Wed, 8 Feb 2023 13:31:51 +0000 (15:31 +0200)]
string_helpers: Move string_is_valid() to the header
Move string_is_valid() to the header for wider use.
While at it, rename to string_is_terminated() to be
precise about its semantics.
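The presumed shape of the renamed helper in string_helpers.h:

static inline bool string_is_terminated(const char *s, int len)
{
        return memchr(s, '\0', len) ? true : false;
}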
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230208133153.22528-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Horatiu Vultur [Wed, 8 Feb 2023 11:44:06 +0000 (12:44 +0100)]
net: micrel: Cable Diagnostics feature for lan8841 PHY
Add support for cable diagnostics in lan8841 PHY. It has the same
registers layout as lan8814 PHY, therefore reuse the functionality.
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20230208114406.1666671-1-horatiu.vultur@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Huanhuan Wang [Wed, 8 Feb 2023 09:10:00 +0000 (10:10 +0100)]
nfp: support IPsec offloading for NFP3800
Add IPsec offloading support for NFP3800. Include data
plane and control plane.
Data plane: add IPsec packet process flow in NFP3800
datapath (NFDk).
Control plane: add an algorithm support distinction flow
in xfrm hook function xdo_dev_state_add(), as NFP3800 has
a different set of supported IPsec algorithms.
This matches existing support for the NFP6000/NFP4000 and
their NFD3 datapath.
In addition, fix up the md_bytes calculation for the NFD3 datapath
to make sure the two datapaths are kept in sync.
Signed-off-by: Huanhuan Wang <huanhuan.wang@corigine.com>
Reviewed-by: Niklas Söderlund <niklas.soderlund@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230208091000.4139974-1-simon.horman@corigine.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 10 Feb 2023 01:45:57 +0000 (17:45 -0800)]
Merge branch 'net-introduce-rps_default_mask'
Paolo Abeni says:
====================
net: introduce rps_default_mask
Real-time setups try hard to ensure proper isolation between time
critical applications and e.g. network processing performed by the
network stack in softirq and RPS is used to move the softirq
activity away from the isolated core.
If the network configuration is dynamic, with netns and devices
routinely created at run-time, enforcing the correct RPS setting
on each newly created device while avoiding transient bad configuration
becomes complex.
Additionally, when multi-queue devices are involved, configuring RPS
in user-space on each queue easily becomes very expensive, e.g.
some setups use veths with per-CPU queues.
This series tries to address the above by introducing a new
sysctl knob: rps_default_mask. The new sysctl entry allows
configuring a netns-wide RPS mask, to be enforced from receive
queue creation time without any further per-device configuration
required.
Additionally, a simple self-test is introduced to check the
rps_default_mask behavior.
====================
Link: https://lore.kernel.org/r/cover.1675789134.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Tue, 7 Feb 2023 18:44:58 +0000 (19:44 +0100)]
self-tests: introduce self-tests for RPS default mask
Ensure that RPS default mask changes take place on
all newly created netns/devices and don't affect
existing ones.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Tue, 7 Feb 2023 18:44:57 +0000 (19:44 +0100)]
net: introduce default_rps_mask netns attribute
If RPS is enabled, this allows configuring a default RPS
mask, which is effective from receive queue creation time.
A default RPS mask allows the system admin to ensure proper
isolation, avoiding races at network namespace or device
creation time.
The default RPS mask is initially empty, and can be
modified via a newly added sysctl entry.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Tue, 7 Feb 2023 18:44:56 +0000 (19:44 +0100)]
net-sysctl: factor-out rps mask manipulation helpers
Will simplify the following patch. No functional change
intended.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>