Daan De Meyer [Wed, 11 Oct 2023 18:51:09 +0000 (20:51 +0200)]
documentation/bpf: Document cgroup unix socket address hooks
Update the documentation to mention the new cgroup unix sockaddr
hooks.
Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Link: https://lore.kernel.org/r/20231011185113.140426-8-daan.j.demeyer@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Daan De Meyer [Wed, 11 Oct 2023 18:51:08 +0000 (20:51 +0200)]
bpftool: Add support for cgroup unix socket address hooks
Add the necessary plumbing to hook up the new cgroup unix sockaddr
hooks into bpftool.
Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Acked-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/r/20231011185113.140426-7-daan.j.demeyer@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Daan De Meyer [Wed, 11 Oct 2023 18:51:07 +0000 (20:51 +0200)]
libbpf: Add support for cgroup unix socket address hooks
Add the necessary plumbing to hook up the new cgroup unix sockaddr
hooks into libbpf.
Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Link: https://lore.kernel.org/r/20231011185113.140426-6-daan.j.demeyer@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Daan De Meyer [Wed, 11 Oct 2023 18:51:06 +0000 (20:51 +0200)]
bpf: Implement cgroup sockaddr hooks for unix sockets
These hooks allow intercepting connect(), getsockname(),
getpeername(), sendmsg() and recvmsg() for unix sockets. The unix
socket hooks get write access to the address length because the
address length is not fixed when dealing with unix sockets and
needs to be modified when a unix socket address is modified by
the hook. Because abstract unix socket addresses start with a
NUL byte, we cannot recalculate the socket address length in
kernelspace after running the hook by calculating the length of the
unix socket path using strlen().
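To illustrate why strlen() cannot recover the length, here is a minimal
user-space sketch (not the kernel code from this patch) that builds an
abstract address. The helper name make_abstract_addr() is hypothetical;
only struct sockaddr_un, socklen_t and offsetof() come from the standard
headers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/*
 * Hypothetical helper: fill `addr` with an abstract unix address
 * (a leading NUL byte followed by `name`) and return the sockaddr
 * length that has to be passed along with it.
 * Returns 0 if the name does not fit into sun_path.
 */
static socklen_t make_abstract_addr(struct sockaddr_un *addr,
                                    const char *name, size_t name_len)
{
    assert(addr != NULL && name != NULL);

    if (name_len + 1 > sizeof(addr->sun_path))
        return 0;

    memset(addr, 0, sizeof(*addr));
    addr->sun_family = AF_UNIX;
    addr->sun_path[0] = '\0';                   /* marks the address as abstract */
    memcpy(addr->sun_path + 1, name, name_len); /* name is not NUL-terminated */

    /* The length covers the family, the leading NUL and the name bytes. */
    return (socklen_t)(offsetof(struct sockaddr_un, sun_path) + 1 + name_len);
}
```

For a four-byte name the returned length is
offsetof(struct sockaddr_un, sun_path) + 5, while strlen() on sun_path
reports 0 because of the leading NUL byte. This is why the length has to
travel with the address instead of being derived from it.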
These hooks can be used when users want to multiplex syscalls to a
single unix socket across multiple different processes behind the scenes
by redirecting the connect() and other syscalls to process-specific
sockets.
We do not implement support for intercepting bind() because when
using bind() with unix sockets with a pathname address, this creates
an inode in the filesystem which must be cleaned up. If we rewrite
the address, the user might try to clean up the wrong file, leaking
the socket in the filesystem where it is never cleaned up. Until we
figure out a solution for this (and a use case for intercepting bind()),
we opt to not allow rewriting the sockaddr in bind() calls.
We also implement recvmsg() support for connected streams so that
after a connect() that is modified by a sockaddr hook, any corresponding
recvmsg() on the connected socket can also be modified to make the
connected program think it is connected to the "intended" remote.
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Link: https://lore.kernel.org/r/20231011185113.140426-5-daan.j.demeyer@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Daan De Meyer [Wed, 11 Oct 2023 18:51:05 +0000 (20:51 +0200)]
bpf: Add bpf_sock_addr_set_sun_path() to allow writing unix sockaddr from bpf
As prep for adding unix socket support to the cgroup sockaddr hooks,
let's add a kfunc bpf_sock_addr_set_sun_path() that allows modifying a unix
sockaddr from bpf. While this is already possible for AF_INET and AF_INET6,
we'll need this kfunc when we add unix socket support since modifying the
address for those requires modifying both the address and the sockaddr
length.
Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Link: https://lore.kernel.org/r/20231011185113.140426-4-daan.j.demeyer@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Daan De Meyer [Wed, 11 Oct 2023 18:51:04 +0000 (20:51 +0200)]
bpf: Propagate modified uaddrlen from cgroup sockaddr programs
As prep for adding unix socket support to the cgroup sockaddr hooks,
let's propagate the sockaddr length back to the caller after running
a bpf cgroup sockaddr hook program. While not important for AF_INET or
AF_INET6, the sockaddr length is important when working with AF_UNIX
sockaddrs as the size of the sockaddr cannot be determined just from the
address family or the sockaddr's contents.
__cgroup_bpf_run_filter_sock_addr() is modified to take the uaddrlen as
an input/output argument. After running the program, the modified sockaddr
length is stored in the uaddrlen pointer.
Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Link: https://lore.kernel.org/r/20231011185113.140426-3-daan.j.demeyer@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Daan De Meyer [Wed, 11 Oct 2023 18:51:03 +0000 (20:51 +0200)]
selftests/bpf: Add missing section name tests for getpeername/getsockname
These were missed when these hooks were first added so add them now
instead to make sure every sockaddr hook has a matching section name
test.
Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Link: https://lore.kernel.org/r/20231011185113.140426-2-daan.j.demeyer@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Martin KaFai Lau [Mon, 9 Oct 2023 22:35:00 +0000 (15:35 -0700)]
Merge branch 'bpf: Fix src IP addr related limitation in bpf_*_fib_lookup()'
Martynas Pumputis says:
====================
The patchset fixes the limitation of bpf_*_fib_lookup() helper, which
prevents it from being used in BPF dataplanes with network interfaces
which have more than one IP addr. See the first patch for more details.
Thanks!
* v2->v3: Address Martin KaFai Lau's feedback
* v1->v2: Use IPv6 stubs to fix compilation when CONFIG_IPV6=m.
====================
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Martynas Pumputis [Sat, 7 Oct 2023 08:14:15 +0000 (10:14 +0200)]
selftests/bpf: Add BPF_FIB_LOOKUP_SRC tests
This patch extends the existing fib_lookup test suite by adding two test
cases (for each IP family):
* Test source IP selection from the egressing netdev.
* Test source IP selection when an IP route has a preferred src IP addr.
Signed-off-by: Martynas Pumputis <m@lambda.lt>
Link: https://lore.kernel.org/r/20231007081415.33502-3-m@lambda.lt
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Martynas Pumputis [Sat, 7 Oct 2023 08:14:14 +0000 (10:14 +0200)]
bpf: Derive source IP addr via bpf_*_fib_lookup()
Extend the bpf_fib_lookup() helper by making it return the source
IPv4/IPv6 address if the BPF_FIB_LOOKUP_SRC flag is set.
For example, the following snippet can be used to derive the desired
source IP address:
    struct bpf_fib_lookup p = { .ipv4_dst = ip4->daddr };
    ret = bpf_skb_fib_lookup(skb, &p, sizeof(p),
                             BPF_FIB_LOOKUP_SRC | BPF_FIB_LOOKUP_SKIP_NEIGH);
    if (ret != BPF_FIB_LKUP_RET_SUCCESS)
        return TC_ACT_SHOT;
    /* p.ipv4_src now contains the source address */
The inability to derive the proper source address may cause malfunctions
in BPF-based dataplanes for hosts containing netdevs with more than one
routable IP address or for multi-homed hosts.
For example, Cilium implements packet masquerading in BPF. If an
egressing netdev to which Cilium's BPF prog is attached has
multiple IP addresses, then only one [hardcoded] IP address can be used for
masquerading. This breaks connectivity if any other IP address should have
been selected instead, for example, when both a public and a private
address are attached to the same egress interface.
The change was tested with Cilium [1].
Nikolay Aleksandrov helped to figure out the IPv6 addr selection.
[1]: https://github.com/cilium/cilium/pull/28283
Signed-off-by: Martynas Pumputis <m@lambda.lt>
Link: https://lore.kernel.org/r/20231007081415.33502-2-m@lambda.lt
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Ian Rogers [Sat, 7 Oct 2023 04:44:39 +0000 (21:44 -0700)]
bpftool: Align bpf_load_and_run_opts insns and data
A C string lacks alignment so use aligned arrays to avoid potential
alignment problems. Switch to using sizeof (less 1 for the \0
terminator) rather than a hardcoded size constant.
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20231007044439.25171-2-irogers@google.com
Ian Rogers [Sat, 7 Oct 2023 04:44:38 +0000 (21:44 -0700)]
bpftool: Align output skeleton ELF code
libbpf accesses the ELF data requiring at least 8 byte alignment,
however, the data is generated into a C string that doesn't guarantee
alignment. Fix this by assigning to an aligned char array. Use sizeof
on the array, less one for the \0 terminator, rather than generating a
constant.
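The idea can be sketched in plain C as follows. This is an illustration
under assumptions, not the code bpftool generates: the names demo_blob,
demo_blob_size() and demo_blob_is_aligned() are hypothetical, the four
bytes stand in for real ELF data, and the aligned attribute is the
GCC/Clang spelling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for embedded ELF data. Assigning the string
 * literal to an explicitly aligned array guarantees 8-byte alignment,
 * which a bare string literal does not.
 * The literal is split so that 'E' is not parsed as part of "\x7f".
 */
static const char demo_blob[] __attribute__((__aligned__(8))) = "\x7f" "ELF";

/* The array holds the payload plus the implicit trailing NUL. */
static_assert(sizeof(demo_blob) == 5, "4 payload bytes plus NUL terminator");

/* Payload size derived from the array itself, less the \0 terminator. */
static size_t demo_blob_size(void)
{
    return sizeof(demo_blob) - 1;
}

/* Returns 1 if the blob starts on an 8-byte boundary. */
static int demo_blob_is_aligned(void)
{
    return ((uintptr_t)demo_blob % 8) == 0;
}
```

Deriving the size with sizeof keeps the length and the data in one place,
so there is no separately generated constant that can go out of sync.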
Fixes: a6cc6b34b93e ("bpftool: Provide a helper method for accessing skeleton's embedded ELF data")
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20231007044439.25171-1-irogers@google.com
David Vernet [Wed, 4 Oct 2023 16:23:39 +0000 (11:23 -0500)]
selftests/bpf: Test pinning bpf timer to a core
Now that we support pinning a BPF timer to the current core, we should
test it with some selftests. This patch adds two new testcases to the
timer suite, which verify that a BPF timer, both with and without
BPF_F_TIMER_ABS, can be pinned to the calling core with BPF_F_TIMER_CPU_PIN.
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20231004162339.200702-3-void@manifault.com
David Vernet [Wed, 4 Oct 2023 16:23:38 +0000 (11:23 -0500)]
bpf: Add ability to pin bpf timer to calling CPU
BPF supports creating high resolution timers using bpf_timer_* helper
functions. Currently, only the BPF_F_TIMER_ABS flag is supported, which
specifies that the timeout should be interpreted as absolute time. It
would also be useful to be able to pin that timer to a core, for
example, if you wanted to make a subset of cores run without timer
interrupts and only have the timer be invoked on a single core.
This patch adds support for this with a new BPF_F_TIMER_CPU_PIN flag.
When specified, the HRTIMER_MODE_PINNED flag is passed to
hrtimer_start(). A subsequent patch will update selftests to validate.
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20231004162339.200702-2-void@manifault.com
Kees Cook [Fri, 6 Oct 2023 20:17:00 +0000 (13:17 -0700)]
bpf: Annotate struct bpf_stack_map with __counted_by
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for
array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle [1], add __counted_by for struct bpf_stack_map.
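As a rough user-space sketch of what such an annotation looks like, the
following declares a flexible array member tied to its element count.
The names DEMO_COUNTED_BY, struct demo_buckets and demo_buckets_alloc()
are hypothetical and are not the kernel's definitions; the macro expands
to nothing on compilers that do not provide the attribute.

```c
#include <assert.h>
#include <stdlib.h>

/* Use the attribute only where the compiler advertises it. */
#if defined(__has_attribute)
# if __has_attribute(__counted_by__)
#  define DEMO_COUNTED_BY(member) __attribute__((__counted_by__(member)))
# endif
#endif
#ifndef DEMO_COUNTED_BY
# define DEMO_COUNTED_BY(member)
#endif

/* Hypothetical structure with a flexible array sized by n_buckets. */
struct demo_buckets {
    unsigned int n_buckets;
    void *buckets[] DEMO_COUNTED_BY(n_buckets);
};

/* Allocate room for n zeroed bucket pointers and record the count. */
static struct demo_buckets *demo_buckets_alloc(unsigned int n)
{
    struct demo_buckets *b;

    assert(n <= 1024); /* arbitrary demo limit, keeps the size small */

    b = calloc(1, sizeof(*b) + n * sizeof(b->buckets[0]));
    if (b)
        b->n_buckets = n; /* set the count before touching buckets[] */
    return b;
}
```

With the attribute available and bounds checking enabled, an access such
as b->buckets[n] can be flagged at run time, since the compiler knows the
array holds n_buckets elements.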
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
Link: https://lore.kernel.org/bpf/20231006201657.work.531-kees@kernel.org
Geliang Tang [Fri, 6 Oct 2023 10:32:16 +0000 (18:32 +0800)]
selftests/bpf: Add pairs_redir_to_connected helper
Extract duplicate code from these four functions
unix_redir_to_connected()
udp_redir_to_connected()
inet_unix_redir_to_connected()
unix_inet_redir_to_connected()
to generate a new helper pairs_redir_to_connected(). Create the
different socketpairs in these four functions, then pass the
socketpairs info to the new common helper to do the connections.
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Link: https://lore.kernel.org/r/54bb28dcf764e7d4227ab160883931d2173f4f3d.1696588133.git.geliang.tang@suse.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Andrii Nakryiko [Fri, 6 Oct 2023 17:57:44 +0000 (10:57 -0700)]
selftests/bpf: Don't truncate #test/subtest field
We currently expect up to a three-digit number of tests and subtests, so:
#999/999: some_test/some_subtest: ...
is the largest test/subtest we can see. If we happen to cross into the
1000s, the current logic will just truncate everything after the 7th
character. This patch fixes this truncation and allows going way higher
(up to 31 characters in total). We still nicely align test numbers:
#60/66 core_reloc_btfgen/type_based___incompat:OK
#60/67 core_reloc_btfgen/type_based___fn_wrong_args:OK
#60/68 core_reloc_btfgen/type_id:OK
#60/69 core_reloc_btfgen/type_id___missing_targets:OK
#60/70 core_reloc_btfgen/enumval:OK
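One way to get this behaviour, shown here only as a sketch and not as
the selftest runner's actual code, is to format the number into a
scratch buffer first and then print it left-aligned with a minimum
field width. The names demo_format_subtest() and DEMO_NUM_WIDTH, and the
width of 7, are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define DEMO_NUM_WIDTH 7 /* assumed minimum column width for "N/M" */

/*
 * Format "#<test>/<subtest> <name>" into `out`. The number column is
 * padded to DEMO_NUM_WIDTH but never truncated, because a printf field
 * width is only a minimum. Returns what snprintf() returns.
 */
static int demo_format_subtest(char *out, size_t out_sz,
                               int test, int subtest, const char *name)
{
    char num[32];

    assert(out != NULL && out_sz > 0 && name != NULL);

    snprintf(num, sizeof(num), "%d/%d", test, subtest);
    return snprintf(out, out_sz, "#%-*s %s", DEMO_NUM_WIDTH, num, name);
}
```

Short numbers such as 60/66 are padded so that names line up, while
1000/1000 simply takes the extra room it needs.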
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20231006175744.3136675-3-andrii@kernel.org
Andrii Nakryiko [Fri, 6 Oct 2023 17:57:43 +0000 (10:57 -0700)]
selftests/bpf: Support building selftests in optimized -O2 mode
Add support for building selftests with the -O2 level of optimization,
which allows detecting more compiler warnings (like lots of potentially
uninitialized usage), and is also useful for getting faster-running
tests for some CPU-intensive tests.
One can build optimized versions of libbpf and selftests by running:
$ make RELEASE=1
There is a measurable speed up of about 10 seconds for me locally,
though it's mostly capped by non-parallelized serial tests. User CPU
time goes down by about 40 seconds in total, from 1m10s to 0m28s.
Unoptimized build (-O0)
=======================
Summary: 430/3544 PASSED, 25 SKIPPED, 4 FAILED
real 1m59.937s
user 1m10.877s
sys 3m14.880s
Optimized build (-O2)
=====================
Summary: 425/3543 PASSED, 25 SKIPPED, 9 FAILED
real 1m50.540s
user 0m28.406s
sys 3m13.198s
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20231006175744.3136675-2-andrii@kernel.org
Andrii Nakryiko [Fri, 6 Oct 2023 17:57:42 +0000 (10:57 -0700)]
selftests/bpf: Fix compiler warnings reported in -O2 mode
Fix a bunch of potentially uninitialized variable usage warnings that are
reported by GCC in -O2 mode. Also silence the overzealous
stringop-truncation class of warnings.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20231006175744.3136675-1-andrii@kernel.org
Yafang Shao [Thu, 5 Oct 2023 08:41:23 +0000 (08:41 +0000)]
bpf: Inherit system settings for CPU security mitigations
Currently, there exists a system-wide setting related to CPU security
mitigations, denoted as 'mitigations='. When set to 'mitigations=off', it
deactivates all optional CPU mitigations. Therefore, if a system-wide
'mitigations=off' setting is in effect, it should inherently bypass the
Spectre v1 and Spectre v4 mitigations in the BPF subsystem.
Please note that there is also a more specific 'nospectre_v1' setting on
x86 and ppc architectures, though it is not currently exported. For the
time being, let's disregard more fine-grained options.
This idea emerged during our discussion about potential Spectre v1 attacks
with Luis [0].
[0] https://lore.kernel.org/bpf/b4fc15f7-b204-767e-ebb9-fdb4233961fb@iogearbox.net
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Song Liu <song@kernel.org>
Acked-by: KP Singh <kpsingh@kernel.org>
Cc: Luis Gerhorst <gerhorst@cs.fau.de>
Link: https://lore.kernel.org/bpf/20231005084123.1338-1-laoar.shao@gmail.com
Akihiko Odaki [Thu, 5 Oct 2023 07:21:36 +0000 (16:21 +0900)]
bpf: Fix the comment for bpf_restore_data_end()
The comment used to say:
> Restore data saved by bpf_compute_data_pointers().
But bpf_compute_data_pointers() does not save the data;
bpf_compute_and_save_data_end() does.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20231005072137.29870-1-akihiko.odaki@daynix.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Geliang Tang [Thu, 5 Oct 2023 07:21:51 +0000 (15:21 +0800)]
selftests/bpf: Enable CONFIG_VSOCKETS in config
CONFIG_VSOCKETS is required by BPF selftests, otherwise we get errors
like this:
./test_progs:socket_loopback_reuseport:386: socket:
Address family not supported by protocol
socket_loopback_reuseport:FAIL:386
./test_progs:vsock_unix_redir_connectible:1496:
vsock_socketpair_connectible() failed
vsock_unix_redir_connectible:FAIL:1496
So this patch enables it in tools/testing/selftests/bpf/config.
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Link: https://lore.kernel.org/r/472e73d285db2ea59aca9bbb95eb5d4048327588.1696490003.git.geliang.tang@suse.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Andrii Nakryiko [Wed, 4 Oct 2023 20:35:07 +0000 (13:35 -0700)]
Merge branch 'selftest/bpf, riscv: Improved cross-building support'
Björn Töpel says:
====================
From: Björn Töpel <bjorn@rivosinc.com>
Yet another "more cross-building support for RISC-V" series.
An example of how to invoke a gen_tar build:
| make ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu- CC=riscv64-linux-gnu-gcc \
| HOSTCC=gcc O=/workspace/kbuild FORMAT= \
| SKIP_TARGETS="arm64 ia64 powerpc sparc64 x86 sgx" -j $(($(nproc)-1)) \
| -C tools/testing/selftests gen_tar
Björn
====================
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Björn Töpel [Wed, 4 Oct 2023 12:27:21 +0000 (14:27 +0200)]
selftests/bpf: Add uprobe_multi to gen_tar target
The uprobe_multi program was not picked up for the gen_tar target. Fix
by adding it to TEST_GEN_FILES.
Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20231004122721.54525-4-bjorn@kernel.org
Björn Töpel [Wed, 4 Oct 2023 12:27:20 +0000 (14:27 +0200)]
selftests/bpf: Enable lld usage for RISC-V
RISC-V has proper lld support. Use that, similar to what x86 does, for
urandom_read et al.
Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231004122721.54525-3-bjorn@kernel.org
Björn Töpel [Wed, 4 Oct 2023 12:27:19 +0000 (14:27 +0200)]
selftests/bpf: Add cross-build support for urandom_read et al
Some userland programs in the BPF test suite, e.g. urandom_read, are
missing cross-build support. Add cross-build support for these
programs.
Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231004122721.54525-2-bjorn@kernel.org
Andrii Nakryiko [Wed, 4 Oct 2023 20:18:49 +0000 (13:18 -0700)]
Merge branch 'libbpf/selftests syscall wrapper fixes for RISC-V'
Björn Töpel says:
====================
From: Björn Töpel <bjorn@rivosinc.com>
Commit 08d0ce30e0e4 ("riscv: Implement syscall wrappers") introduced
some regressions in libbpf, and the kselftests BPF suite, which are
fixed with these three patches.
Note that there's an outstanding fix [1] for ftrace syscall tracing
which is also a fallout from the commit above.
Björn
[1] https://lore.kernel.org/linux-riscv/20231003182407.32198-1-alexghiti@rivosinc.com/
Alexandre Ghiti (1):
libbpf: Fix syscall access arguments on riscv
====================
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Björn Töpel [Wed, 4 Oct 2023 11:09:05 +0000 (13:09 +0200)]
selftests/bpf: Define SYS_NANOSLEEP_KPROBE_NAME for riscv
Add missing sys_nanosleep name for RISC-V, which is used by some tests
(e.g. attach_probe).
Fixes: 08d0ce30e0e4 ("riscv: Implement syscall wrappers")
Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/bpf/20231004110905.49024-4-bjorn@kernel.org
Björn Töpel [Wed, 4 Oct 2023 11:09:04 +0000 (13:09 +0200)]
selftests/bpf: Define SYS_PREFIX for riscv
SYS_PREFIX was missing for RISC-V, which made a couple of kprobe
tests fail.
Add missing SYS_PREFIX for RISC-V.
Fixes: 08d0ce30e0e4 ("riscv: Implement syscall wrappers")
Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/bpf/20231004110905.49024-3-bjorn@kernel.org
Alexandre Ghiti [Wed, 4 Oct 2023 11:09:03 +0000 (13:09 +0200)]
libbpf: Fix syscall access arguments on riscv
Since commit 08d0ce30e0e4 ("riscv: Implement syscall wrappers"), riscv
selects ARCH_HAS_SYSCALL_WRAPPER so let's use the generic implementation
of PT_REGS_SYSCALL_REGS().
Fixes: 08d0ce30e0e4 ("riscv: Implement syscall wrappers")
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/bpf/20231004110905.49024-2-bjorn@kernel.org
Daniel Borkmann [Wed, 4 Oct 2023 13:26:02 +0000 (15:26 +0200)]
Merge branch 'bpf-xsk-sh-umem'
Tushar Vyavahare says:
====================
Implement a test for the SHARED_UMEM feature in this patch set and make
necessary changes/improvements. Ensure that the framework now supports
different streams for different sockets.
v2->v3:
- Set the sock_num at the end of the while loop.
- Declare xsk at the top of the while loop.
v1->v2:
- Remove generate_mac_addresses() and generate mac addresses based on
the number of sockets in __test_spec_init() function. [Magnus]
- Update Makefile to include find_bit.c for compiling xskxceiver.
- Add bitmap_full() function to verify all bits are set to break the
while loop in the receive_pkts() and send_pkts() functions.
- Replace __test_and_set_bit() function with __set_bit() function.
- Add single return check for wait_for_tx_completion() function call.
Patch series summary:
1: Move the packet stream from the ifobject struct to the xsk_socket_info
struct to enable the use of different streams for different sockets.
This will facilitate the sending and receiving of data from multiple
sockets simultaneously using the SHARED_XDP_UMEM feature.
It gives the flexibility to send/receive individual traffic on a
particular socket.
2: Rename the header file to a generic name so that it can be used by all
future XDP programs.
3: Move the src_mac and dst_mac fields from the ifobject structure to the
xsk_socket_info structure to achieve per-socket MAC address assignment.
This is required in order to steer traffic to various sockets in
subsequent patches.
4: Improve the receive_pkt() function to enable it to receive packets from
multiple sockets. Define a sock_num variable to iterate through all the
sockets in the Rx path. Add nb_valid_entries to check that all the
expected number of packets are received.
5: The pkt_set() function no longer needs the umem parameter. This commit
removes the umem parameter from the pkt_set() function.
6: Iterate over all the sockets in the send pkts function. Update
send_pkts() to handle multiple sockets for sending packets. Multiple TX
sockets are utilized alternately based on the batch size to improve
packet transmission.
7: Modify xsk_update_xskmap() to accept the index as an argument, enabling
the addition of multiple sockets to xskmap.
8: Add a new test for testing shared umem feature. This is accomplished by
adding a new XDP program and using the multiple sockets. The new XDP
program redirects the packets based on the destination MAC address.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tushar Vyavahare [Wed, 27 Sep 2023 13:52:41 +0000 (19:22 +0530)]
selftests/xsk: Add a test for shared umem feature
Add a new test for testing shared umem feature. This is accomplished by
adding a new XDP program and using the multiple sockets.
The new XDP program redirects the packets based on the destination MAC
address.
Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20230927135241.2287547-9-tushar.vyavahare@intel.com
Tushar Vyavahare [Wed, 27 Sep 2023 13:52:40 +0000 (19:22 +0530)]
selftests/xsk: Modify xsk_update_xskmap() to accept the index as an argument
Modify xsk_update_xskmap() to accept the index as an argument, enabling
the addition of multiple sockets to xskmap.
Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20230927135241.2287547-8-tushar.vyavahare@intel.com
Tushar Vyavahare [Wed, 27 Sep 2023 13:52:39 +0000 (19:22 +0530)]
selftests/xsk: Iterate over all the sockets in the send pkts function
Update send_pkts() to handle multiple sockets for sending packets.
Multiple TX sockets are utilized alternately based on the batch size to
improve packet transmission.
Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20230927135241.2287547-7-tushar.vyavahare@intel.com
Tushar Vyavahare [Wed, 27 Sep 2023 13:52:38 +0000 (19:22 +0530)]
selftests/xsk: Remove unnecessary parameter from pkt_set() function call
The pkt_set() function no longer needs the umem parameter. This commit
removes the umem parameter from the pkt_set() function.
Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20230927135241.2287547-6-tushar.vyavahare@intel.com
Tushar Vyavahare [Wed, 27 Sep 2023 13:52:37 +0000 (19:22 +0530)]
selftests/xsk: Iterate over all the sockets in the receive pkts function
Improve the receive_pkt() function to enable it to receive packets from
multiple sockets. Define a sock_num variable to iterate through all the
sockets in the Rx path. Add nb_valid_entries to check that all the
expected number of packets are received.
Revise the function __receive_pkts() to only inspect the receive ring
once, handle any received packets, and promptly return. Implement a bitmap
to store the value of number of sockets. Update Makefile to include
find_bit.c for compiling xskxceiver.
Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20230927135241.2287547-5-tushar.vyavahare@intel.com
Tushar Vyavahare [Wed, 27 Sep 2023 13:52:36 +0000 (19:22 +0530)]
selftests/xsk: Move src_mac and dst_mac to the xsk_socket_info
Move the src_mac and dst_mac fields from the ifobject structure to the
xsk_socket_info structure to achieve per-socket MAC address assignment.
This is required in order to steer traffic to various sockets in
subsequent patches.
Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20230927135241.2287547-4-tushar.vyavahare@intel.com
Tushar Vyavahare [Wed, 27 Sep 2023 13:52:35 +0000 (19:22 +0530)]
selftests/xsk: Rename xsk_xdp_metadata.h to xsk_xdp_common.h
Rename the header file to a generic name so that it can be used by all
future XDP programs. Ensure that the xsk_xdp_common.h header file includes
include guards.
Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20230927135241.2287547-3-tushar.vyavahare@intel.com
Tushar Vyavahare [Wed, 27 Sep 2023 13:52:34 +0000 (19:22 +0530)]
selftests/xsk: Move pkt_stream to the xsk_socket_info
Move the packet stream from the ifobject struct to the xsk_socket_info
struct to enable the use of different streams for different sockets. This
will facilitate the sending and receiving of data from multiple sockets
simultaneously using the SHARED_XDP_UMEM feature.
Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20230927135241.2287547-2-tushar.vyavahare@intel.com
Hengqi Chen [Fri, 29 Sep 2023 15:59:54 +0000 (15:59 +0000)]
libbpf: Allow Golang symbols in uprobe secdef
Golang symbols in ELF files differ from C/C++ symbols in that
they contain special characters like '*', '(' and ')'.
With generics, things get more complicated, as there are
symbols like:
github.com/cilium/ebpf/internal.(*Deque[go.shape.interface { Format(fmt.State, int32); TypeName() string;github.com/cilium/ebpf/btf.copy() github.com/cilium/ebpf/btf.Type}]).Grow
Match such symbols using `%m[^\n]` in sscanf, which only
excludes newlines, which typically do not appear in ELF
symbols. This should work in most use-cases and also
works for unicode letters in identifiers. If newlines do
show up in ELF symbols, users can still attach to such a
symbol by specifying bpf_uprobe_opts::func_name.
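The effect of `%m[^\n]` can be tried out with a small stand-alone
program. The sketch below is an assumption about how such a section name
could be split, not libbpf's actual parser: the helper
demo_parse_uprobe_sec() and the "type/binary:function" layout used here
are illustrative. It relies on the POSIX `m` assignment-allocation
modifier of sscanf(), as provided by glibc.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: split "type/binary:function" into three
 * malloc()ed strings. The function part is matched with %m[^\n], so it
 * may contain '*', '(', ')', spaces and other special characters.
 * Returns the number of fields converted; the caller frees the strings
 * (fields that were not converted stay NULL).
 */
static int demo_parse_uprobe_sec(const char *sec, char **type,
                                 char **binary, char **func)
{
    assert(sec != NULL && type != NULL && binary != NULL && func != NULL);

    *type = *binary = *func = NULL;
    return sscanf(sec, "%m[^/]/%m[^:]:%m[^\n]", type, binary, func);
}
```

Given "uprobe//proc/self/exe:main.(*T).Grow", the third field comes back
as "main.(*T).Grow" with the punctuation intact, which a narrower
character class limited to identifier characters would cut short.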
A working example can be found at this repo ([0]).
[0]: https://github.com/chenhengqi/libbpf-go-symbols
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230929155954.92448-1-hengqi.chen@gmail.com
Ruowen Qin [Wed, 27 Sep 2023 04:50:30 +0000 (23:50 -0500)]
samples/bpf: Add -fsanitize=bounds to userspace programs
The sanitizer flag, which is supported by both clang and gcc, would make
it easier to debug array index out-of-bounds problems in these programs.
Make the Makefile smarter to detect ubsan support from the compiler and
add the '-fsanitize=bounds' flag accordingly.
Suggested-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Jinghao Jia <jinghao@linux.ibm.com>
Signed-off-by: Jinghao Jia <jinghao7@illinois.edu>
Signed-off-by: Ruowen Qin <ruowenq2@illinois.edu>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20230927045030.224548-2-ruowenq2@illinois.edu
Andrii Nakryiko [Mon, 25 Sep 2023 23:37:45 +0000 (16:37 -0700)]
Merge branch 'bpf: Add missed stats for kprobes'
Jiri Olsa says:
====================
hi,
at the moment we can't retrieve the number of missed kprobe
executions and subsequent execution of BPF programs.
This patchset adds:
- counting of missed execution on attach layer for:
. kprobes attached through perf link (kprobe/ftrace)
. kprobes attached through kprobe.multi link (fprobe)
- counting of recursion_misses for BPF kprobe programs
It's still technically possible to create kprobe without perf link (using
SET_BPF perf ioctl) in which case we don't have a way to retrieve the kprobe's
'missed' count. However both libbpf and cilium/ebpf libraries use perf link
if it's available, and for old kernels without perf link support we can use
BPF program to retrieve the kprobe missed count.
v3 changes:
- added acks [Song]
- make test_missed not serial [Andrii]
Also available at:
https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
bpf/missed_stats
thanks,
jirka
====================
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Jiri Olsa [Wed, 20 Sep 2023 21:31:45 +0000 (23:31 +0200)]
selftests/bpf: Add test for recursion counts of perf event link tracepoint
Adding selftest that puts kprobe on bpf_fentry_test1 that calls bpf_printk
and invokes bpf_trace_printk tracepoint. The bpf_trace_printk tracepoint
has test[234] programs attached to it.
Because kprobe execution goes through bpf_prog_active check, programs
attached to the tracepoint will fail the recursion check and increment the
recursion_misses stats.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Reviewed-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20230920213145.1941596-10-jolsa@kernel.org
Jiri Olsa [Wed, 20 Sep 2023 21:31:44 +0000 (23:31 +0200)]
selftests/bpf: Add test for recursion counts of perf event link kprobe
Adding selftest that puts kprobe.multi on bpf_fentry_test1 that
calls bpf_kfunc_common_test kfunc which has 3 perf event kprobes
and 1 kprobe.multi attached.
Because fprobe (kprobe.multi attach layer) does not have strict
recursion check the kprobe's bpf_prog_active check is hit for test2-5.
Disabling this test for arm64, because there's no fprobe support yet.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Reviewed-by: Song Liu <song@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230920213145.1941596-9-jolsa@kernel.org
Jiri Olsa [Wed, 20 Sep 2023 21:31:43 +0000 (23:31 +0200)]
selftests/bpf: Add test for missed counts of perf event link kprobe
Adding test that puts kprobe on bpf_fentry_test1 that calls
bpf_kfunc_common_test kfunc, which has also kprobe on.
The latter won't get triggered due to kprobe recursion check
and kprobe missed counter is incremented.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230920213145.1941596-8-jolsa@kernel.org
Jiri Olsa [Wed, 20 Sep 2023 21:31:42 +0000 (23:31 +0200)]
bpftool: Display missed count for kprobe perf link
Adding 'missed' field to display missed counts for kprobes
attached by perf event link, like:
# bpftool link
5: perf_event prog 82
kprobe
ffffffff815203e0 ksys_write
6: perf_event prog 83
kprobe
ffffffff811d1e50 scheduler_tick missed 682217
# bpftool link -jp
[{
"id": 5,
"type": "perf_event",
"prog_id": 82,
"retprobe": false,
"addr": 18446744071584220128,
"func": "ksys_write",
"offset": 0,
"missed": 0
},{
"id": 6,
"type": "perf_event",
"prog_id": 83,
"retprobe": false,
"addr": 18446744071580753488,
"func": "scheduler_tick",
"offset": 0,
"missed": 693469
}
]
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20230920213145.1941596-7-jolsa@kernel.org
Jiri Olsa [Wed, 20 Sep 2023 21:31:41 +0000 (23:31 +0200)]
bpftool: Display missed count for kprobe_multi link
Adding 'missed' field to display missed counts for kprobes
attached by kprobe multi link, like:
# bpftool link
5: kprobe_multi prog 76
kprobe.multi func_cnt 1 missed 1
addr func [module]
ffffffffa039c030 fp3_test [fprobe_test]
# bpftool link -jp
[{
"id": 5,
"type": "kprobe_multi",
"prog_id": 76,
"retprobe": false,
"func_cnt": 1,
"missed": 1,
"funcs": [{
"addr": 18446744072102723632,
"func": "fp3_test",
"module": "fprobe_test"
}
]
}
]
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20230920213145.1941596-6-jolsa@kernel.org
Jiri Olsa [Wed, 20 Sep 2023 21:31:40 +0000 (23:31 +0200)]
bpf: Count missed stats in trace_call_bpf
Increase misses stats in case bpf array execution is skipped
because of recursion check in trace_call_bpf.
Adding bpf_prog_inc_misses_counters that increase misses
counts for all bpf programs in bpf_prog_array.
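As a rough illustration of the idea (not the kernel code), the sketch below models a recursion guard on a single CPU: when the guard is already held, the whole program array is skipped and every program in it gets its miss counter bumped. The types `toy_prog` and `toy_prog_array`, the counter `toy_prog_active`, and the functions `toy_inc_misses_counters` and `toy_trace_call` are hypothetical stand-ins, not kernel symbols.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for struct bpf_prog and struct bpf_prog_array. */
struct toy_prog {
	uint64_t runs;   /* times the program actually ran */
	uint64_t misses; /* times it was skipped by the recursion check */
};

struct toy_prog_array {
	struct toy_prog **progs;
	size_t cnt;
};

/* Models the per-CPU bpf_prog_active counter (one CPU only in this sketch). */
static int toy_prog_active;

/* Bump the miss counter of every program in the array. */
static void toy_inc_misses_counters(struct toy_prog_array *arr)
{
	for (size_t i = 0; i < arr->cnt; i++)
		arr->progs[i]->misses++;
}

/*
 * Run the array unless another program is already active.
 * Returns 1 if the programs ran, 0 if the run was skipped.
 */
static int toy_trace_call(struct toy_prog_array *arr)
{
	int ran = 0;

	if (++toy_prog_active != 1) {
		/* Recursion detected: count the miss for all programs. */
		toy_inc_misses_counters(arr);
		goto out;
	}
	for (size_t i = 0; i < arr->cnt; i++)
		arr->progs[i]->runs++;
	ran = 1;
out:
	toy_prog_active--;
	return ran;
}
```

Setting `toy_prog_active` to 1 before calling `toy_trace_call` simulates a nested invocation; every program in the array then records one miss and no run.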
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Reviewed-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20230920213145.1941596-5-jolsa@kernel.org
Jiri Olsa [Wed, 20 Sep 2023 21:31:39 +0000 (23:31 +0200)]
bpf: Add missed value to kprobe perf link info
Add missed value to kprobe attached through perf link info to
hold the stats of missed kprobe handler execution.
The kprobe's missed counter gets incremented when kprobe handler
is not executed due to another kprobe running on the same cpu.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230920213145.1941596-4-jolsa@kernel.org
Jiri Olsa [Wed, 20 Sep 2023 21:31:38 +0000 (23:31 +0200)]
bpf: Add missed value to kprobe_multi link info
Add missed value to kprobe_multi link info to hold the stats of missed
kprobe_multi probe.
The missed counter gets incremented when fprobe fails the recursion
check or there's no rethook available for return probe. In either
case the attached bpf program is not executed.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Reviewed-by: Song Liu <song@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230920213145.1941596-3-jolsa@kernel.org
Jiri Olsa [Wed, 20 Sep 2023 21:31:37 +0000 (23:31 +0200)]
bpf: Count stats for kprobe_multi programs
Adding support to gather missed stats for kprobe_multi
programs due to bpf_prog_active protection.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Reviewed-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20230920213145.1941596-2-jolsa@kernel.org
Andrii Nakryiko [Mon, 25 Sep 2023 23:22:43 +0000 (16:22 -0700)]
Merge branch 'add libbpf getters for individual ringbuffers'
Martin Kelly says:
====================
This patch series adds a new ring__ API to libbpf exposing getters for
accessing the individual ringbuffers inside a struct ring_buffer. This is
useful for polling individually, getting available data, or similar use
cases. The API looks like this, and was roughly proposed by Andrii Nakryiko
in another thread:
Getting a ring struct:
struct ring *ring_buffer__ring(struct ring_buffer *rb, unsigned int idx);
Using the ring struct:
unsigned long ring__consumer_pos(const struct ring *r);
unsigned long ring__producer_pos(const struct ring *r);
size_t ring__avail_data_size(const struct ring *r);
size_t ring__size(const struct ring *r);
int ring__map_fd(const struct ring *r);
int ring__consume(struct ring *r);
Changes in v2:
- Addressed all feedback from Andrii Nakryiko
====================
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Martin Kelly [Mon, 25 Sep 2023 21:50:45 +0000 (14:50 -0700)]
selftests/bpf: Add tests for ring__consume
Add tests for new API ring__consume.
Signed-off-by: Martin Kelly <martin.kelly@crowdstrike.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230925215045.2375758-15-martin.kelly@crowdstrike.com
Martin Kelly [Mon, 25 Sep 2023 21:50:44 +0000 (14:50 -0700)]
libbpf: Add ring__consume
Add ring__consume to consume a single ringbuffer, analogous to
ring_buffer__consume.
Signed-off-by: Martin Kelly <martin.kelly@crowdstrike.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230925215045.2375758-14-martin.kelly@crowdstrike.com
Martin Kelly [Mon, 25 Sep 2023 21:50:43 +0000 (14:50 -0700)]
selftests/bpf: Add tests for ring__map_fd
Add tests for the new API ring__map_fd.
Signed-off-by: Martin Kelly <martin.kelly@crowdstrike.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230925215045.2375758-13-martin.kelly@crowdstrike.com
Martin Kelly [Mon, 25 Sep 2023 21:50:42 +0000 (14:50 -0700)]
libbpf: Add ring__map_fd
Add ring__map_fd to get the file descriptor underlying a given
ringbuffer.
Signed-off-by: Martin Kelly <martin.kelly@crowdstrike.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230925215045.2375758-12-martin.kelly@crowdstrike.com
Martin Kelly [Mon, 25 Sep 2023 21:50:41 +0000 (14:50 -0700)]
selftests/bpf: Add tests for ring__size
Add tests for the new API ring__size.
Signed-off-by: Martin Kelly <martin.kelly@crowdstrike.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230925215045.2375758-11-martin.kelly@crowdstrike.com
Martin Kelly [Mon, 25 Sep 2023 21:50:40 +0000 (14:50 -0700)]
libbpf: Add ring__size
Add ring__size to get the total size of a given ringbuffer.
Signed-off-by: Martin Kelly <martin.kelly@crowdstrike.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230925215045.2375758-10-martin.kelly@crowdstrike.com
Martin Kelly [Mon, 25 Sep 2023 21:50:39 +0000 (14:50 -0700)]
selftests/bpf: Add tests for ring__avail_data_size
Add test for the new API ring__avail_data_size.
Signed-off-by: Martin Kelly <martin.kelly@crowdstrike.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230925215045.2375758-9-martin.kelly@crowdstrike.com
Martin Kelly [Mon, 25 Sep 2023 21:50:38 +0000 (14:50 -0700)]
libbpf: Add ring__avail_data_size
Add ring__avail_data_size for querying the currently available data in
the ringbuffer, similar to the BPF_RB_AVAIL_DATA flag in
bpf_ringbuf_query. This is racy during ongoing operations but is still
useful for overall information on how a ringbuffer is behaving.
Signed-off-by: Martin Kelly <martin.kelly@crowdstrike.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230925215045.2375758-8-martin.kelly@crowdstrike.com
Martin Kelly [Mon, 25 Sep 2023 21:50:37 +0000 (14:50 -0700)]
selftests/bpf: Add tests for ring__*_pos
Add tests for the new APIs ring__producer_pos and ring__consumer_pos.
Signed-off-by: Martin Kelly <martin.kelly@crowdstrike.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230925215045.2375758-7-martin.kelly@crowdstrike.com
Martin Kelly [Mon, 25 Sep 2023 21:50:36 +0000 (14:50 -0700)]
libbpf: Add ring__producer_pos, ring__consumer_pos
Add APIs to get the producer and consumer position for a given
ringbuffer.
Signed-off-by: Martin Kelly <martin.kelly@crowdstrike.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230925215045.2375758-6-martin.kelly@crowdstrike.com
Martin Kelly [Mon, 25 Sep 2023 21:50:35 +0000 (14:50 -0700)]
selftests/bpf: Add tests for ring_buffer__ring
Add tests for the new API ring_buffer__ring.
Signed-off-by: Martin Kelly <martin.kelly@crowdstrike.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230925215045.2375758-5-martin.kelly@crowdstrike.com
Martin Kelly [Mon, 25 Sep 2023 21:50:34 +0000 (14:50 -0700)]
libbpf: Add ring_buffer__ring
Add a new function ring_buffer__ring, which exposes struct ring * to the
user, representing a single ringbuffer.
Signed-off-by: Martin Kelly <martin.kelly@crowdstrike.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230925215045.2375758-4-martin.kelly@crowdstrike.com
Martin Kelly [Mon, 25 Sep 2023 21:50:33 +0000 (14:50 -0700)]
libbpf: Switch rings to array of pointers
Switch rb->rings to be an array of pointers instead of a contiguous
block. This allows for each ring pointer to be stable after
ring_buffer__add is called, which allows us to expose struct ring * to
the user without gotchas. Without this change, the realloc in
ring_buffer__add could invalidate a struct ring *, making it unsafe to
give to the user.
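The pointer-stability argument can be sketched as follows. This is a simplified model, not the libbpf code: `toy_ring`, `toy_ring_buffer` and the `toy_*` functions are hypothetical names. Only the array of pointers is reallocated when a ring is added, so a `struct toy_ring *` handed out earlier keeps pointing at the same object.

```c
#include <stdlib.h>

struct toy_ring {
	int map_fd;
};

struct toy_ring_buffer {
	struct toy_ring **rings; /* array of pointers, one allocation per ring */
	int ring_cnt;
};

/* Append a ring. Only the pointer array may move; ring objects stay put. */
static int toy_ring_buffer_add(struct toy_ring_buffer *rb, int map_fd)
{
	struct toy_ring **tmp, *r;

	tmp = realloc(rb->rings, (rb->ring_cnt + 1) * sizeof(*tmp));
	if (!tmp)
		return -1;
	rb->rings = tmp;

	r = calloc(1, sizeof(*r));
	if (!r)
		return -1;
	r->map_fd = map_fd;
	rb->rings[rb->ring_cnt++] = r;
	return 0;
}

/* Return the ring at idx, or NULL if idx is out of range. */
static struct toy_ring *toy_ring_buffer_ring(struct toy_ring_buffer *rb,
					     unsigned int idx)
{
	if (idx >= (unsigned int)rb->ring_cnt)
		return NULL;
	return rb->rings[idx];
}

/* Release every ring and the pointer array itself. */
static void toy_ring_buffer_free(struct toy_ring_buffer *rb)
{
	for (int i = 0; i < rb->ring_cnt; i++)
		free(rb->rings[i]);
	free(rb->rings);
	rb->rings = NULL;
	rb->ring_cnt = 0;
}
```

With a contiguous `struct toy_ring *rings` block instead, the `realloc` call could move every ring, and any pointer obtained before the call would dangle.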
Signed-off-by: Martin Kelly <martin.kelly@crowdstrike.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230925215045.2375758-3-martin.kelly@crowdstrike.com
Martin Kelly [Mon, 25 Sep 2023 21:50:32 +0000 (14:50 -0700)]
libbpf: Refactor cleanup in ring_buffer__add
Refactor the cleanup code in ring_buffer__add to use a unified err_out
label. This reduces code duplication, as well as plugging a potential
leak if mmap_sz != (__u64)(size_t)mmap_sz (currently this would miss
unmapping tmp because ringbuf_unmap_ring isn't called).
Signed-off-by: Martin Kelly <martin.kelly@crowdstrike.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230925215045.2375758-2-martin.kelly@crowdstrike.com
Andrii Nakryiko [Fri, 22 Sep 2023 21:18:56 +0000 (14:18 -0700)]
Merge branch 'libbpf: Support symbol versioning for uprobe'
Hengqi Chen says:
====================
Dynamic symbols in shared library may have the same name, for example:
$ nm -D /lib/x86_64-linux-gnu/libc.so.6 | grep rwlock_wrlock
000000000009b1a0 T __pthread_rwlock_wrlock@GLIBC_2.2.5
000000000009b1a0 T pthread_rwlock_wrlock@@GLIBC_2.34
000000000009b1a0 T pthread_rwlock_wrlock@GLIBC_2.2.5
$ readelf -W --dyn-syms /lib/x86_64-linux-gnu/libc.so.6 | grep rwlock_wrlock
706: 000000000009b1a0 878 FUNC GLOBAL DEFAULT 15 __pthread_rwlock_wrlock@GLIBC_2.2.5
2568: 000000000009b1a0 878 FUNC GLOBAL DEFAULT 15 pthread_rwlock_wrlock@@GLIBC_2.34
2571: 000000000009b1a0 878 FUNC GLOBAL DEFAULT 15 pthread_rwlock_wrlock@GLIBC_2.2.5
There are two pthread_rwlock_wrlock symbols in libc.so .dynsym section.
The one with @@ is the default version, the other is hidden.
Note that the version info is stored in .gnu.version and .gnu.version_d
sections of libc and the two symbols are at the _same_ offset.
Currently, specifying `pthread_rwlock_wrlock`, `pthread_rwlock_wrlock@@GLIBC_2.34`
or `pthread_rwlock_wrlock@GLIBC_2.2.5` in bpf_uprobe_opts::func_name won't work,
because there are two `pthread_rwlock_wrlock` entries in the .dynsym section without
the version suffix and both have global binding.
We could solve this by introducing symbol versioning ([0]). So that users can
specify func, func@LIB_VERSION or func@@LIB_VERSION to attach a uprobe.
This patchset resolves symbol conflicts and add symbol versioning for uprobe.
- Patch 1 resolves symbol conflicts at the same offset
- Patch 2 adds symbol versioning for dynsym
- Patch 3 adds selftests for the above changes
Changes from v3:
- Address comments from Andrii
Changes from v2:
- Add uretprobe selftest (Alan)
- Check symbol exact match (Alan)
- Fix typo (Jiri)
Changes from v1:
- Address comments from Alan and Jiri
- Add selftests (Someone reminds me that there is an attempt at [1]
and part of the selftest code from Andrii is taken from there)
[0]: https://refspecs.linuxfoundation.org/LSB_5.0.0/LSB-Core-generic/LSB-Core-generic/symversion.html
[1]: https://lore.kernel.org/lkml/CAEf4BzZTrjjyyOm3ak9JsssPSh6T_ZmGd677a2rt5e5rBLUrpQ@mail.gmail.com/
====================
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Hengqi Chen [Mon, 18 Sep 2023 02:48:13 +0000 (02:48 +0000)]
selftests/bpf: Add tests for symbol versioning for uprobe
This exercises the newly added dynsym symbol versioning logic.
Now we accept symbols in form of func, func@LIB_VERSION or
func@@LIB_VERSION.
The test relies on liburandom_read.so. For liburandom_read.so, we have:
$ nm -D liburandom_read.so
w __cxa_finalize@GLIBC_2.17
w __gmon_start__
w _ITM_deregisterTMCloneTable
w _ITM_registerTMCloneTable
0000000000000000 A LIBURANDOM_READ_1.0.0
0000000000000000 A LIBURANDOM_READ_2.0.0
000000000000081c T urandlib_api@@LIBURANDOM_READ_2.0.0
0000000000000814 T urandlib_api@LIBURANDOM_READ_1.0.0
0000000000000824 T urandlib_api_sameoffset@LIBURANDOM_READ_1.0.0
0000000000000824 T urandlib_api_sameoffset@@LIBURANDOM_READ_2.0.0
000000000000082c T urandlib_read_without_sema@@LIBURANDOM_READ_1.0.0
00000000000007c4 T urandlib_read_with_sema@@LIBURANDOM_READ_1.0.0
0000000000011018 D urandlib_read_with_sema_semaphore@@LIBURANDOM_READ_1.0.0
For `urandlib_api`, specifying `urandlib_api` will cause a conflict because
there are two symbols named urandlib_api and both are global bind.
For `urandlib_api_sameoffset`, there are also two symbols in the .so, but
both are at the same offset and essentially they refer to the same function
so no conflict.
Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20230918024813.237475-4-hengqi.chen@gmail.com
Hengqi Chen [Mon, 18 Sep 2023 02:48:12 +0000 (02:48 +0000)]
libbpf: Support symbol versioning for uprobe
In the current implementation, we assume that a symbol found in the .dynsym
section has a version suffix and use it to compare with the symbol the user
supplied. According to the spec ([0]), this assumption is incorrect: the version
info of dynamic symbols is stored in the .gnu.version and .gnu.version_d sections
of ELF objects. For example:
$ nm -D /lib/x86_64-linux-gnu/libc.so.6 | grep rwlock_wrlock
000000000009b1a0 T __pthread_rwlock_wrlock@GLIBC_2.2.5
000000000009b1a0 T pthread_rwlock_wrlock@@GLIBC_2.34
000000000009b1a0 T pthread_rwlock_wrlock@GLIBC_2.2.5
$ readelf -W --dyn-syms /lib/x86_64-linux-gnu/libc.so.6 | grep rwlock_wrlock
706: 000000000009b1a0 878 FUNC GLOBAL DEFAULT 15 __pthread_rwlock_wrlock@GLIBC_2.2.5
2568: 000000000009b1a0 878 FUNC GLOBAL DEFAULT 15 pthread_rwlock_wrlock@@GLIBC_2.34
2571: 000000000009b1a0 878 FUNC GLOBAL DEFAULT 15 pthread_rwlock_wrlock@GLIBC_2.2.5
In this case, specifying pthread_rwlock_wrlock@@GLIBC_2.34 or
pthread_rwlock_wrlock@GLIBC_2.2.5 in bpf_uprobe_opts::func_name won't work,
because the qualified name does NOT match `pthread_rwlock_wrlock` (without
version suffix) in the .dynsym section.
This commit implements the symbol versioning for dynsym and allows user to
specify symbol in the following forms:
- func
- func@LIB_VERSION
- func@@LIB_VERSION
In case of symbol conflicts, error out and users should resolve it by
specifying a qualified name.
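The three accepted forms can be told apart with a small parser. The sketch below is only an illustration under assumed names: `toy_sym_req` and `toy_parse_sym` are hypothetical and are not part of libbpf. It splits a user-supplied string into the bare name, the requested version (if any), and whether the default version (`@@`) was asked for.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical parsed form of a user-supplied symbol specification. */
struct toy_sym_req {
	char name[64];     /* bare symbol name */
	char version[64];  /* empty if no version was requested */
	bool want_default; /* true for the func@@LIB_VERSION form */
};

/* Split "func", "func@VER" or "func@@VER". Returns 0 on success, -1 on error. */
static int toy_parse_sym(const char *spec, struct toy_sym_req *out)
{
	const char *at = strchr(spec, '@');
	size_t name_len = at ? (size_t)(at - spec) : strlen(spec);
	const char *ver = "";

	out->want_default = false;
	if (at) {
		ver = at + 1;
		if (*ver == '@') {
			out->want_default = true;
			ver++;
		}
		if (*ver == '\0')
			return -1; /* "func@" or "func@@" with no version */
	}
	if (name_len == 0 || name_len >= sizeof(out->name) ||
	    strlen(ver) >= sizeof(out->version))
		return -1;

	memcpy(out->name, spec, name_len);
	out->name[name_len] = '\0';
	strcpy(out->version, ver);
	return 0;
}
```

A real matcher would then compare `name` against the .dynsym entry and `version` against the entry's version information, which is the part this sketch leaves out.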
[0]: https://refspecs.linuxfoundation.org/LSB_5.0.0/LSB-Core-generic/LSB-Core-generic/symversion.html
Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20230918024813.237475-3-hengqi.chen@gmail.com
Hengqi Chen [Mon, 18 Sep 2023 02:48:11 +0000 (02:48 +0000)]
libbpf: Resolve symbol conflicts at the same offset for uprobe
Dynamic symbols in shared library may have the same name, for example:
$ nm -D /lib/x86_64-linux-gnu/libc.so.6 | grep rwlock_wrlock
000000000009b1a0 T __pthread_rwlock_wrlock@GLIBC_2.2.5
000000000009b1a0 T pthread_rwlock_wrlock@@GLIBC_2.34
000000000009b1a0 T pthread_rwlock_wrlock@GLIBC_2.2.5
$ readelf -W --dyn-syms /lib/x86_64-linux-gnu/libc.so.6 | grep rwlock_wrlock
706: 000000000009b1a0 878 FUNC GLOBAL DEFAULT 15 __pthread_rwlock_wrlock@GLIBC_2.2.5
2568: 000000000009b1a0 878 FUNC GLOBAL DEFAULT 15 pthread_rwlock_wrlock@@GLIBC_2.34
2571: 000000000009b1a0 878 FUNC GLOBAL DEFAULT 15 pthread_rwlock_wrlock@GLIBC_2.2.5
Currently, users can't attach a uprobe to pthread_rwlock_wrlock because
there are two symbols named pthread_rwlock_wrlock, both with global
binding, and libbpf considers this a conflict.
Since both of them are at the same offset we could accept one of them
harmlessly. Note that we already do this in elf_resolve_syms_offsets.
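The rule described here (duplicates are harmless when they resolve to the same offset) can be sketched as below. This is a simplified model with hypothetical names (`toy_sym`, `toy_find_func_offset`); it ignores symbol binding and versions, which the real lookup also has to consider.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, stripped-down view of a .dynsym entry. */
struct toy_sym {
	const char *name;
	unsigned long offset;
};

/*
 * Look up name in syms. Returns 0 and stores the offset on success,
 * -1 if the symbol is missing, -2 if two matches disagree on the offset.
 */
static int toy_find_func_offset(const struct toy_sym *syms, size_t cnt,
				const char *name, unsigned long *off)
{
	unsigned long found_off = 0;
	int found = 0;

	for (size_t i = 0; i < cnt; i++) {
		if (strcmp(syms[i].name, name) != 0)
			continue;
		if (found && syms[i].offset != found_off)
			return -2; /* same name, different code: real conflict */
		found_off = syms[i].offset;
		found = 1;
	}
	if (!found)
		return -1;
	*off = found_off;
	return 0;
}
```

With the two `pthread_rwlock_wrlock` entries at 0x9b1a0 shown above, the lookup succeeds; two entries with the same name at different offsets would still be reported as a conflict.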
Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20230918024813.237475-2-hengqi.chen@gmail.com
Tiezhu Yang [Tue, 19 Sep 2023 08:25:37 +0000 (16:25 +0800)]
bpf, docs: Add loongarch64 as arch supporting BPF JIT
As BPF JIT support for loongarch64 was added about one year ago
with commit 5dc615520c4d ("LoongArch: Add BPF JIT support"), it
is appropriate to add loongarch64 as arch supporting BPF JIT in
bpf and sysctl docs as well.
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Link: https://lore.kernel.org/r/1695111937-19697-1-git-send-email-yangtiezhu@loongson.cn
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Jinghao Jia [Sun, 17 Sep 2023 21:42:20 +0000 (16:42 -0500)]
samples/bpf: syscall_tp_user: Fix array out-of-bound access
Commit 06744f24696e ("samples/bpf: Add openat2() enter/exit tracepoint
to syscall_tp sample") added two more eBPF programs to support the
openat2() syscall. However, it did not increase the size of the array
that holds the corresponding bpf_links. This leads to an out-of-bound
access on that array in the bpf_object__for_each_program loop and could
corrupt other variables on the stack. On our testing QEMU, it corrupts
the map1_fds array and causes the sample to fail:
# ./syscall_tp
prog #0: map ids 4 5
verify map:4 val: 5
map_lookup failed: Bad file descriptor
Dynamically allocate the array based on the number of programs reported
by libbpf to prevent similar inconsistencies in the future.
Fixes: 06744f24696e ("samples/bpf: Add openat2() enter/exit tracepoint to syscall_tp sample")
Signed-off-by: Jinghao Jia <jinghao@linux.ibm.com>
Signed-off-by: Ruowen Qin <ruowenq2@illinois.edu>
Signed-off-by: Jinghao Jia <jinghao7@illinois.edu>
Link: https://lore.kernel.org/r/20230917214220.637721-4-jinghao7@illinois.edu
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jinghao Jia [Sun, 17 Sep 2023 21:42:19 +0000 (16:42 -0500)]
samples/bpf: syscall_tp_user: Rename num_progs into nr_tests
The variable name num_progs causes confusion because that variable
really controls the number of rounds the test should be executed.
Rename num_progs into nr_tests for the sake of clarity.
Signed-off-by: Jinghao Jia <jinghao@linux.ibm.com>
Signed-off-by: Ruowen Qin <ruowenq2@illinois.edu>
Signed-off-by: Jinghao Jia <jinghao7@illinois.edu>
Link: https://lore.kernel.org/r/20230917214220.637721-3-jinghao7@illinois.edu
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Alexei Starovoitov [Thu, 21 Sep 2023 21:22:01 +0000 (14:22 -0700)]
Merge branch 'implement-cpuv4-support-for-s390x'
Ilya Leoshkevich says:
====================
Implement cpuv4 support for s390x
v1: https://lore.kernel.org/bpf/20230830011128.1415752-1-iii@linux.ibm.com/
v1 -> v2:
- Redo Disable zero-extension for BPF_MEMSX as Puranjay and Alexei
suggested.
- Drop the bpf_ct_insert_entry() patch, it went in via the bpf tree.
- Rebase, don't apply A-bs because there were fixed conflicts.
Hi,
This series adds the cpuv4 support to the s390x eBPF JIT.
Patches 1-3 are preliminary bugfixes.
Patches 4-8 implement the new instructions.
Patches 9-10 enable the tests.
Best regards,
Ilya
====================
Link: https://lore.kernel.org/r/20230919101336.2223655-1-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Ilya Leoshkevich [Tue, 19 Sep 2023 10:09:12 +0000 (12:09 +0200)]
selftests/bpf: Trim DENYLIST.s390x
Enable all selftests, except the 2 that have to do with the userspace
unwinding, and the new exceptions test, in the s390x CI.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230919101336.2223655-11-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Ilya Leoshkevich [Tue, 19 Sep 2023 10:09:11 +0000 (12:09 +0200)]
selftests/bpf: Enable the cpuv4 tests for s390x
Now that all the cpuv4 support is in place, enable the tests.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230919101336.2223655-10-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Ilya Leoshkevich [Tue, 19 Sep 2023 10:09:10 +0000 (12:09 +0200)]
s390/bpf: Implement signed division
Implement the cpuv4 signed division. It is encoded as unsigned
division, but with off field set to 1. s390x has the necessary
instructions: dsgfr, dsgf and dsgr.
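The encoding rule can be illustrated with a small interpreter-style helper. This is a sketch, not the s390x JIT: `toy_bpf_div64` is a hypothetical name, and the handling of the `INT64_MIN / -1` case is an assumption made here to keep the C code free of undefined behaviour.

```c
#include <stdint.h>

/*
 * 64-bit BPF division. off == 1 selects the cpuv4 signed variant,
 * any other value falls back to the classic unsigned division.
 * Division by zero yields 0, as BPF defines it.
 */
static uint64_t toy_bpf_div64(uint64_t dst, uint64_t src, int16_t off)
{
	if (src == 0)
		return 0;

	if (off == 1) {
		int64_t a = (int64_t)dst;
		int64_t b = (int64_t)src;

		/* Assumption of this sketch: avoid the overflowing case in C. */
		if (a == INT64_MIN && b == -1)
			return (uint64_t)INT64_MIN;
		return (uint64_t)(a / b);
	}
	return dst / src;
}
```

For example, dividing the register value that represents -10 by 3 gives -3 with `off` set to 1, but a large positive quotient with `off` set to 0, because the same bits are then treated as an unsigned number.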
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230919101336.2223655-9-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Ilya Leoshkevich [Tue, 19 Sep 2023 10:09:09 +0000 (12:09 +0200)]
s390/bpf: Implement unconditional jump with 32-bit offset
Implement the cpuv4 unconditional jump with 32-bit offset, which is
encoded as BPF_JMP32 | BPF_JA and stores the offset in the imm field.
Reuse the existing BPF_JMP | BPF_JA logic.
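A minimal sketch of how the two jump encodings differ only in where the offset lives. The struct `toy_insn`, the `TOY_*` constants (which mirror the values of BPF_JMP, BPF_JMP32 and BPF_JA) and the function `toy_ja_target` are hypothetical names used for illustration; this is not JIT code.

```c
#include <stdint.h>

#define TOY_BPF_JMP   0x05 /* instruction class: 64-bit jumps */
#define TOY_BPF_JMP32 0x06 /* instruction class: 32-bit jumps */
#define TOY_BPF_JA    0x00 /* opcode: unconditional jump */

/* Hypothetical, reduced view of a BPF instruction. */
struct toy_insn {
	uint8_t code;
	int16_t off;
	int32_t imm;
};

/*
 * Index of the instruction executed after the unconditional jump at
 * index pc, or -1 if insn is not an unconditional jump. Offsets are
 * relative to the instruction following the jump.
 */
static long toy_ja_target(const struct toy_insn *insn, long pc)
{
	if (insn->code == (TOY_BPF_JMP32 | TOY_BPF_JA))
		return pc + 1 + insn->imm; /* cpuv4: 32-bit offset in imm */
	if (insn->code == (TOY_BPF_JMP | TOY_BPF_JA))
		return pc + 1 + insn->off; /* classic: 16-bit offset in off */
	return -1;
}
```

Once the offset has been picked from the right field, the rest of the jump handling is identical, which is why the existing logic can be reused.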
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230919101336.2223655-8-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Ilya Leoshkevich [Tue, 19 Sep 2023 10:09:08 +0000 (12:09 +0200)]
s390/bpf: Implement unconditional byte swap
Implement the cpuv4 unconditional byte swap, which is encoded as
BPF_ALU64 | BPF_END | BPF_FROM_LE. Since s390x is big-endian, it's
the same as the existing BPF_ALU | BPF_END | BPF_FROM_LE.
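What an unconditional byte swap computes can be sketched in portable C. The helper `toy_bswap` is a hypothetical name; the width argument plays the role of the instruction's immediate (16, 32 or 64), and the result is zero-extended to 64 bits.

```c
#include <stdint.h>

/*
 * Reverse the byte order of the low `bits` bits of v, where bits is
 * 16, 32 or 64. Upper bits of the result are zero. Any other width
 * returns v unchanged.
 */
static uint64_t toy_bswap(uint64_t v, int bits)
{
	uint64_t out = 0;
	int bytes = bits / 8;

	if (bits != 16 && bits != 32 && bits != 64)
		return v;

	for (int i = 0; i < bytes; i++) {
		out = (out << 8) | (v & 0xff); /* lowest byte becomes highest */
		v >>= 8;
	}
	return out;
}
```

On a big-endian machine a "from little-endian" conversion has to perform exactly this swap, which is why the two encodings can share one implementation on s390x.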
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230919101336.2223655-7-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Ilya Leoshkevich [Tue, 19 Sep 2023 10:09:07 +0000 (12:09 +0200)]
s390/bpf: Implement BPF_MEMSX
Implement the cpuv4 load with sign-extension, which is encoded as
BPF_MEMSX (and, for internal use cases only, BPF_PROBE_MEMSX).
This is the same as BPF_MEM and BPF_PROBE_MEM, but with sign
extension instead of zero extension, and s390x has the necessary
instructions: lgb, lgh and lgf.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230919101336.2223655-6-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Ilya Leoshkevich [Tue, 19 Sep 2023 10:09:06 +0000 (12:09 +0200)]
s390/bpf: Implement BPF_MOV | BPF_X with sign-extension
Implement the cpuv4 register-to-register move with sign extension. It
is distinguished from the normal moves by non-zero values in
insn->off, which determine the source size. s390x has instructions to
deal with all of them: lbr, lhr, lgbr, lghr and lgfr.
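The role of insn->off as a source-size selector can be sketched for the 64-bit case as follows. `toy_movsx64` is a hypothetical helper, not JIT code; the 32-bit ALU variants are left out, and the narrowing casts rely on the usual two's complement behaviour of gcc and clang.

```c
#include <stdint.h>

/*
 * 64-bit register move. off selects the source width to sign-extend
 * from (8, 16 or 32); off == 0 is the plain move.
 */
static uint64_t toy_movsx64(uint64_t src, int16_t off)
{
	switch (off) {
	case 8:
		return (uint64_t)(int64_t)(int8_t)(uint8_t)src;
	case 16:
		return (uint64_t)(int64_t)(int16_t)(uint16_t)src;
	case 32:
		return (uint64_t)(int64_t)(int32_t)(uint32_t)src;
	default:
		return src;
	}
}
```

For instance, a source value of 0xff moved with `off` set to 8 becomes all ones (the 64-bit representation of -1), while the same value moved with `off` set to 0 stays 0xff.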
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230919101336.2223655-5-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Ilya Leoshkevich [Tue, 19 Sep 2023 10:09:05 +0000 (12:09 +0200)]
selftests/bpf: Add big-endian support to the ldsx test
Prepare the ldsx test to run on big-endian systems by adding the
necessary endianness checks around narrow memory accesses.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230919101336.2223655-4-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Ilya Leoshkevich [Tue, 19 Sep 2023 10:09:04 +0000 (12:09 +0200)]
selftests/bpf: Unmount the cgroup2 work directory
test_progs -t bind_perm,bpf_obj_pinning/mounted-str-rel fails when
the selftests directory is mounted under /mnt, which is a reasonable
thing to do when sharing the selftests residing on the host with a
virtual machine, e.g., using 9p.
The reason is that cgroup2 is mounted at /mnt and not unmounted,
causing subsequent tests that need to access the selftests directory
to fail.
Fix by unmounting it. The kernel maintains a mount stack, so this
reveals what was mounted there before. Introduce cgroup_workdir_mounted
in order to maintain idempotency. Make it thread-local in order to
support test_progs -j.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230919101336.2223655-3-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Ilya Leoshkevich [Tue, 19 Sep 2023 10:09:03 +0000 (12:09 +0200)]
bpf: Disable zero-extension for BPF_MEMSX
On the architectures that use bpf_jit_needs_zext(), e.g., s390x, the
verifier incorrectly inserts a zero-extension after BPF_MEMSX, leading
to miscompilations like the one below:
24: 89 1a ff fe 00 00 00 00 "r1 = *(s16 *)(r10 - 2);" # zext_dst set
0x3ff7fdb910e: lgh %r2,-2(%r13,%r0) # load halfword
0x3ff7fdb9114: llgfr %r2,%r2 # wrong!
25: 65 10 00 03 00 00 7f ff if r1 s> 32767 goto +3 <l0_1> # check_cond_jmp_op()
Disable such zero-extensions. The JITs need to insert sign-extension
themselves, if necessary.
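Why the extra zero-extension breaks the comparison can be reproduced in plain C. The helpers `toy_load_s16`, `toy_zext32` and `toy_sgt` are hypothetical names that model the sign-extending load, the stray 32-bit zero-extension and the signed compare from the listing above; they are not verifier or JIT code.

```c
#include <stdint.h>
#include <string.h>

/* Sign-extending 16-bit load, as "r1 = *(s16 *)(r10 - 2)" expects. */
static uint64_t toy_load_s16(const void *p)
{
	int16_t v;

	memcpy(&v, p, sizeof(v));
	return (uint64_t)(int64_t)v;
}

/* What a stray zero-extension of the low 32 bits does afterwards. */
static uint64_t toy_zext32(uint64_t reg)
{
	return reg & 0xffffffffULL;
}

/* Signed 64-bit compare, as used by "if r1 s> 32767". */
static int toy_sgt(uint64_t reg, int64_t imm)
{
	return (int64_t)reg > imm;
}
```

Loading the 16-bit value -2 gives 0xfffffffffffffffe, which is not greater than 32767. After the zero-extension the register holds 0xfffffffe, a large positive number, so the same branch is now taken.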
Suggested-by: Puranjay Mohan <puranjay12@gmail.com>
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Puranjay Mohan <puranjay12@gmail.com>
Link: https://lore.kernel.org/r/20230919101336.2223655-2-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Paolo Abeni [Thu, 21 Sep 2023 19:49:45 +0000 (21:49 +0200)]
Merge git://git./linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.
No conflicts.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Linus Torvalds [Thu, 21 Sep 2023 18:28:16 +0000 (11:28 -0700)]
Merge tag 'net-6.6-rc3' of git://git./linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from netfilter and bpf.
Current release - regressions:
- bpf: adjust size_index according to the value of KMALLOC_MIN_SIZE
- netfilter: fix entries val in rule reset audit log
- eth: stmmac: fix incorrect rxq|txq_stats reference
Previous releases - regressions:
- ipv4: fix null-deref in ipv4_link_failure
- netfilter:
- fix several GC related issues
- fix race between IPSET_CMD_CREATE and IPSET_CMD_SWAP
- eth: team: fix null-ptr-deref when team device type is changed
- eth: i40e: fix VF VLAN offloading when port VLAN is configured
- eth: ionic: fix 16bit math issue when PAGE_SIZE >= 64KB
Previous releases - always broken:
- core: fix ETH_P_1588 flow dissector
- mptcp: fix several connection hang-up conditions
- bpf:
- avoid deadlock when using queue and stack maps from NMI
- add override check to kprobe multi link attach
- hsr: properly parse HSRv1 supervisor frames.
- eth: igc: fix infinite initialization loop with early XDP redirect
- eth: octeon_ep: fix tx dma unmap len values in SG
- eth: hns3: fix GRE checksum offload issue"
* tag 'net-6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (87 commits)
sfc: handle error pointers returned by rhashtable_lookup_get_insert_fast()
igc: Expose tx-usecs coalesce setting to user
octeontx2-pf: Do xdp_do_flush() after redirects.
bnxt_en: Flush XDP for bnxt_poll_nitroa0()'s NAPI
net: ena: Flush XDP packets on error.
net/handshake: Fix memory leak in __sock_create() and sock_alloc_file()
net: hinic: Fix warning-hinic_set_vlan_fliter() warn: variable dereferenced before check 'hwdev'
netfilter: ipset: Fix race between IPSET_CMD_CREATE and IPSET_CMD_SWAP
netfilter: nf_tables: fix memleak when more than 255 elements expired
netfilter: nf_tables: disable toggling dormant table state more than once
vxlan: Add missing entries to vxlan_get_size()
net: rds: Fix possible NULL-pointer dereference
team: fix null-ptr-deref when team device type is changed
net: bridge: use DEV_STATS_INC()
net: hns3: add 5ms delay before clear firmware reset irq source
net: hns3: fix fail to delete tc flower rules during reset issue
net: hns3: only enable unicast promisc when mac table full
net: hns3: fix GRE checksum offload issue
net: hns3: add cmdq check for vf periodic service task
net: stmmac: fix incorrect rxq|txq_stats reference
...
Linus Torvalds [Thu, 21 Sep 2023 17:15:26 +0000 (10:15 -0700)]
Merge tag 'v6.6-rc3.vfs.ctime.revert' of git://git./linux/kernel/git/vfs/vfs
Pull finegrained timestamp reverts from Christian Brauner:
"Earlier this week we sent a few minor fixes for the multi-grained
timestamp work in [1]. While we were polishing those up after Linus
realized that there might be a nicer way to fix them we received a
regression report in [2] that fine grained timestamps break gnulib
tests and thus possibly other tools.
The kernel will elide fine-grain timestamp updates when no one is
actively querying for them to avoid performance impacts. So a sequence
like write(f1) stat(f2) write(f2) stat(f2) write(f1) stat(f1) may
result in timestamp f1 to be older than the final f2 timestamp even
though f1 was last written to but the second write didn't update the
timestamp.
Such plot holes can lead to subtle bugs when programs compare
timestamps. For example, the nap() function in [2] will estimate that
it needs to wait one ns on a fine-grain timestamp enabled filesystem
between subsequent calls to observe a timestamp change. But in general
we don't update timestamps with more than one jiffie if we think that
no one is actively querying for fine-grain timestamps to avoid
performance impacts.
While discussing various fixes the decision was to go back to the
drawing board and ultimately to explore a solution that involves only
exposing such fine-grained timestamps to nfs internally and never to
userspace.
As there are multiple solutions discussed, the honest thing to do here
is not to fix this up or disable it but to cleanly revert. The general
infrastructure will probably come back but there is no reason to keep
this code in mainline.
The general changes to timestamp handling are valid and a good cleanup
that will stay. The revert is fully bisectable"
Link: https://lore.kernel.org/all/20230918-hirte-neuzugang-4c2324e7bae3@brauner
Link: https://lore.kernel.org/all/bf0524debb976627693e12ad23690094e4514303.camel@linuxfromscratch.org
* tag 'v6.6-rc3.vfs.ctime.revert' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
Revert "fs: add infrastructure for multigrain timestamps"
Revert "btrfs: convert to multigrain timestamps"
Revert "ext4: switch to multigrain timestamps"
Revert "xfs: switch to multigrain timestamps"
Revert "tmpfs: add support for multigrain timestamps"
Linus Torvalds [Thu, 21 Sep 2023 15:39:24 +0000 (08:39 -0700)]
Merge tag 'powerpc-6.6-2' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- A fix for breakpoint handling which was using get_user() while atomic
- Fix the Power10 HASHCHK handler which was using get_user() while
atomic
- A few build fixes for issues caused by recent changes
Thanks to Benjamin Gray, Christophe Leroy, Kajol Jain, and Naveen N Rao.
* tag 'powerpc-6.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/dexcr: Move HASHCHK trap handler
powerpc/82xx: Select FSL_SOC
powerpc: Fix build issue with LD_DEAD_CODE_DATA_ELIMINATION and FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY
powerpc/watchpoints: Annotate atomic context in more places
powerpc/watchpoint: Disable pagefaults when getting user instruction
powerpc/watchpoints: Disable preemption in thread_change_pc()
powerpc/perf/hv-24x7: Update domain value check
Linus Torvalds [Thu, 21 Sep 2023 15:27:42 +0000 (08:27 -0700)]
Merge tag 'for-linus-6.6a-rc3-tag' of git://git./linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- remove some unused functions in the Xen event channel handling
- fix a regression (introduced during the merge window) when booting as
Xen PV guest
- small cleanup removing another strncpy() instance
* tag 'for-linus-6.6a-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/efi: refactor deprecated strncpy
x86/xen: allow nesting of same lazy mode
x86/xen: move paravirt lazy code
arm/xen: remove lazy mode related definitions
xen: simplify evtchn_do_upcall() call maze
Linus Torvalds [Thu, 21 Sep 2023 15:21:23 +0000 (08:21 -0700)]
Merge tag 'fixes-2023-09-21' of git://git./linux/kernel/git/rppt/memblock
Pull memblock test fixes from Mike Rapoport:
"Fix several compilation errors and warnings in memblock tests"
* tag 'fixes-2023-09-21' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
memblock tests: fix warning ‘struct seq_file’ declared inside parameter list
memblock tests: fix warning: "__ALIGN_KERNEL" redefined
memblock tests: Fix compilation errors.
Linus Torvalds [Thu, 21 Sep 2023 15:13:15 +0000 (08:13 -0700)]
Merge tag 'sound-6.6-rc3' of git://git./linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A large collection of fixes around this time.
All small and mostly trivial fixes.
- Lots of fixes for the new -Wformat-truncation warnings
- A fix in ALSA rawmidi core regression and UMP handling
- Series of Cirrus codec fixes
- ASoC Intel and Realtek codec fixes
- Usual HD- and USB-audio quirks and AMD ASoC quirks"
* tag 'sound-6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (64 commits)
ALSA: hda/realtek - ALC287 Realtek I2S speaker platform support
ALSA: hda: cs35l56: Use the new RUNTIME_PM_OPS() macro
ALSA: usb-audio: scarlett_gen2: Fix another -Wformat-truncation warning
ALSA: rawmidi: Fix NULL dereference at proc read
ASoC: SOF: core: Only call sof_ops_free() on remove if the probe was successful
ASoC: SOF: Intel: MTL: Reduce the DSP init timeout
ASoC: cs42l43: Add shared IRQ flag for shutters
ASoC: imx-audmix: Fix return error with devm_clk_get()
ASoC: hdaudio.c: Add missing check for devm_kstrdup
ALSA: riptide: Fix -Wformat-truncation warning for longname string
ALSA: cs4231: Fix -Wformat-truncation warning for longname string
ALSA: ad1848: Fix -Wformat-truncation warning for longname string
ALSA: hda: generic: Check potential mixer name string truncation
ALSA: cmipci: Fix -Wformat-truncation warning
ALSA: firewire: Fix -Wformat-truncation warning for MIDI stream names
ALSA: firewire: Fix -Wformat-truncation warning for longname string
ALSA: xen: Fix -Wformat-truncation warning
ALSA: opti9x: Fix -Wformat-truncation warning
ALSA: es1688: Fix -Wformat-truncation warning
ALSA: cs4236: Fix -Wformat-truncation warning
...
Linus Torvalds [Thu, 21 Sep 2023 15:10:47 +0000 (08:10 -0700)]
Merge tag 'hwmon-for-v6.6-rc3' of git://git./linux/kernel/git/groeck/linux-staging
Pull hwmon fix from Guenter Roeck:
"One patch to drop a non-existent alarm attribute in the nct6775 driver"
* tag 'hwmon-for-v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (nct6775) Fix non-existent ALARM warning
Colin Ian King [Tue, 19 Sep 2023 09:36:06 +0000 (10:36 +0100)]
net: dsa: sja1105: make read-only const arrays static
Don't populate read-only const arrays on the stack, instead make them
static.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230919093606.24446-1-colin.i.king@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
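The sja1105 change above follows a common C pattern. A minimal sketch of
that pattern, not the driver code itself: the function name and table
contents below are hypothetical and chosen only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lookup helper; the values are made up for illustration. */
uint32_t speed_to_mbps(size_t idx)
{
	/*
	 * 'static const' places the table in read-only data once, at build
	 * time. With plain 'const', the compiler may have to rebuild the
	 * array on the stack every time the function is called.
	 */
	static const uint32_t mbps[] = { 10, 100, 1000, 2500 };

	if (idx >= sizeof(mbps) / sizeof(mbps[0]))
		return 0;	/* out of range */
	return mbps[idx];
}
```

The lookup behaves the same either way; only the storage of the table
changes.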
Yang Li [Tue, 19 Sep 2023 01:03:05 +0000 (09:03 +0800)]
netdev: Remove unneeded semicolon
./drivers/dpll/dpll_netlink.c:847:3-4: Unneeded semicolon
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=6605
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202309190540.RFwfIgO7-lkp@intel.com/
Link: https://lore.kernel.org/r/20230919010305.120991-1-yang.lee@linux.alibaba.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Paolo Abeni [Thu, 21 Sep 2023 10:34:02 +0000 (12:34 +0200)]
Merge branch 'vsock-virtio-vhost-msg_zerocopy-preparations'
Arseniy Krasnov says:
====================
vsock/virtio/vhost: MSG_ZEROCOPY preparations
This patchset is the first of three parts of another big patchset for
MSG_ZEROCOPY flag support:
https://lore.kernel.org/netdev/20230701063947.3422088-1-AVKrasnov@sberdevices.ru/
During review of this series, Stefano Garzarella <sgarzare@redhat.com>
suggested splitting it into three parts to simplify review and merging:
1) virtio and vhost updates (for fragged skbs) <--- this patchset
2) AF_VSOCK updates (allows to enable MSG_ZEROCOPY mode and read
tx completions) and update for Documentation/.
3) Updates for tests and utils.
This series enables handling of fragged skbs in the virtio and vhost
parts. The new logic won't be triggered, because the SO_ZEROCOPY option
is still impossible to enable at this moment (the next bunch of patches
from the big set above will enable it).
====================
Link: https://lore.kernel.org/r/20230916130918.4105122-1-avkrasnov@salutedevices.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Arseniy Krasnov [Sat, 16 Sep 2023 13:09:18 +0000 (16:09 +0300)]
vsock/virtio: MSG_ZEROCOPY flag support
This adds handling of MSG_ZEROCOPY flag on transmission path:
1) If this flag is set and zerocopy transmission is possible (enabled
in socket options and transport allows zerocopy), then non-linear
skb will be created and filled with the pages of user's buffer.
Pages of user's buffer are locked in memory by 'get_user_pages()'.
2) Changes how the skb owner is set: instead of 'skb_set_owner_sk_safe()'
it calls 'skb_set_owner_w()'. The reason for this change is that
'__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc' of the socket, so
to decrease this field correctly, the proper skb destructor is needed:
'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
3) Adds a new callback to 'struct virtio_transport': 'can_msgzerocopy'.
If this callback is set, the transport needs an extra check to be able
to send the provided number of buffers in zerocopy mode. Currently, the
only transport that needs this callback set is virtio, because this
transport adds new buffers to the virtio queue and we need to check
that the number of these buffers is less than the size of the queue (as
required by the virtio spec). The vhost and loopback transports don't
need this check.
Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Arseniy Krasnov [Sat, 16 Sep 2023 13:09:17 +0000 (16:09 +0300)]
vsock/virtio: non-linear skb handling for tap
For the tap device, a new skb is created and data from the current skb
is copied to it. This adds copying of data from a non-linear skb to the
new skb.
Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Arseniy Krasnov [Sat, 16 Sep 2023 13:09:16 +0000 (16:09 +0300)]
vsock/virtio: support to send non-linear skb
For a non-linear skb, use the pages from its fragment array as buffers
in the virtio tx queue. These pages are already pinned by
'get_user_pages()' when such an skb is created.
Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Arseniy Krasnov [Sat, 16 Sep 2023 13:09:15 +0000 (16:09 +0300)]
vsock/virtio/vhost: read data from non-linear skb
This is a preparation patch for MSG_ZEROCOPY support. It adds handling
of non-linear skbs by replacing direct calls of 'memcpy_to_msg()' with
'skb_copy_datagram_iter()'. The main advantage of the latter is that it
can handle the paged part of the skb by using 'kmap()' on each page, but
if there are no pages in the skb, it behaves like a simple copy to the
iov iterator. This patch also adds a new field to the control block of
the skb: this value holds the current offset in the skb from which to
read the next portion of data (regardless of whether the skb is linear
or not). The idea behind this field is that 'skb_copy_datagram_iter()'
handles both types of skb internally; it just needs an offset from which
to copy data from the given skb. This offset is incremented on each read
from the skb. This approach simplifies handling of both linear and
non-linear skbs, because otherwise we would need to call 'skb_pull()'
after reading data from a linear skb, while in the non-linear case we
would need to update 'data_len'.
Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
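The offset-based read described in the commit above can be illustrated
with a small userspace model. This is only a sketch under stated
assumptions: 'struct model_skb' and 'model_read()' are hypothetical
names, the "skb" has a single paged fragment, and none of this is the
kernel API. It shows why one read cursor is enough for both the linear
and the paged part.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an skb: a linear head plus one paged fragment. */
struct model_skb {
	const char *head;	/* linear part */
	size_t head_len;
	const char *frag;	/* paged part (may be empty) */
	size_t frag_len;
	size_t offset;		/* read cursor across head + frag */
};

/*
 * Copy up to 'len' bytes starting at skb->offset into 'dst' and advance
 * the cursor. Returns the number of bytes copied.
 */
size_t model_read(struct model_skb *skb, char *dst, size_t len)
{
	size_t copied = 0;

	while (copied < len) {
		size_t off = skb->offset;
		const char *src;
		size_t avail;

		if (off < skb->head_len) {
			/* Cursor is still inside the linear part. */
			src = skb->head + off;
			avail = skb->head_len - off;
		} else if (off - skb->head_len < skb->frag_len) {
			/* Cursor is inside the paged part. */
			src = skb->frag + (off - skb->head_len);
			avail = skb->frag_len - (off - skb->head_len);
		} else {
			break;	/* nothing left to read */
		}

		if (avail > len - copied)
			avail = len - copied;
		memcpy(dst + copied, src, avail);
		copied += avail;
		skb->offset += avail;
	}
	return copied;
}
```

The caller never needs to know which part of the buffer a read touches;
it only calls 'model_read()' again and the cursor carries the state.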
Paolo Abeni [Thu, 21 Sep 2023 09:09:44 +0000 (11:09 +0200)]
Merge tag 'nf-23-09-20' of https://git./linux/kernel/git/netfilter/nf
Florian Westphal says:
====================
netfilter updates for net
The following three patches fix regressions in the netfilter subsystem:
1. Reject attempts to repeatedly toggle the 'dormant' flag in a single
transaction. Doing so makes nf_tables lose track of the real state
vs. the desired state. This ends with an attempt to unregister hooks
that were never registered in the first place, which yields a splat.
2. Fix element counting in the new nftables garbage collection infra
that came with 6.5: more than 255 expired elements wrap a counter,
which results in a memory leak.
3. Since 6.4 ipset can BUG when a set is renamed while a CREATE command
is in progress. Fix from Jozsef Kadlecsik.
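The counter wrap mentioned in item 2 can be modelled in userspace. This
is a sketch that assumes an 8-bit counter and a batch limit of 256; the
function names and the limit are hypothetical, and this is not the
nf_tables code.

```c
#include <stdint.h>

/*
 * Count how often the "batch is full" check fires while adding 'n'
 * elements. With an 8-bit counter the value wraps from 255 to 0, so a
 * limit of 256 is never reached.
 */
unsigned int flushes_with_u8(unsigned int n, unsigned int limit)
{
	uint8_t count = 0;
	unsigned int flushes = 0;

	for (unsigned int i = 0; i < n; i++) {
		if (count >= limit) {
			flushes++;
			count = 0;
		}
		count++;	/* wraps 255 -> 0 */
	}
	return flushes;
}

/* Same logic with a 16-bit counter, which can represent the limit. */
unsigned int flushes_with_u16(unsigned int n, unsigned int limit)
{
	uint16_t count = 0;
	unsigned int flushes = 0;

	for (unsigned int i = 0; i < n; i++) {
		if (count >= limit) {
			flushes++;
			count = 0;
		}
		count++;
	}
	return flushes;
}
```

In this model, adding 300 elements with a limit of 256 never triggers
the check in the 8-bit variant, while the 16-bit variant triggers it
once.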
* tag 'nf-23-09-20' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: ipset: Fix race between IPSET_CMD_CREATE and IPSET_CMD_SWAP
netfilter: nf_tables: fix memleak when more than 255 elements expired
netfilter: nf_tables: disable toggling dormant table state more than once
====================
Link: https://lore.kernel.org/r/20230920084156.4192-1-fw@strlen.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>