Sidong Yang [Sat, 16 Nov 2024 08:10:52 +0000 (17:10 +0900)]
libbpf: Change hash_combine parameters from long to unsigned long
hash_combine() could be trapped when compiled with a sanitizer, e.g. "zig cc"
or clang with the signed-integer-overflow option. This patch changes the
parameters and return type to unsigned long to remove the potential overflow.
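A hedged sketch of the shape of the change (libbpf internal helper; the
exact body is simplified here): with unsigned operands the multiply-and-add
wraps around instead of hitting the signed-overflow UB that sanitizers trap.
  static inline unsigned long hash_combine(unsigned long h, unsigned long value)
  {
          /* unsigned overflow is well-defined wraparound, no sanitizer trap */
          return h * 31 + value;
  }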
Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20241116081054.65195-1-sidong.yang@furiosa.ai
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Alexei Starovoitov [Sat, 16 Nov 2024 18:56:17 +0000 (10:56 -0800)]
selftests/bpf: Fix build error with llvm 19
llvm 19 fails to compile arena self test:
CLNG-BPF [test_progs] verifier_arena_large.bpf.o
progs/verifier_arena_large.c:90:24: error: unsupported signed division, please convert to unsigned div/mod.
90 | pg_idx = (pg - base) / PAGE_SIZE;
Though llvm <= 18 and llvm >= 20 don't have this issue,
fix the test to avoid the build error.
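A hedged sketch of the kind of change that avoids the error (the exact
upstream fix may differ): make the division operand unsigned so the BPF
backend emits an unsigned div.
  /* pg and base are arena pointers; casting the difference to an
   * unsigned type sidesteps llvm 19's "unsupported signed division" */
  pg_idx = ((__u64)pg - (__u64)base) / PAGE_SIZE;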
Reported-by: Jiri Olsa <olsajiri@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jiri Olsa [Fri, 15 Nov 2024 11:58:43 +0000 (12:58 +0100)]
libbpf: Fix memory leak in bpf_program__attach_uprobe_multi
Andrii reported a memory leak detected by Coverity on an error path
in bpf_program__attach_uprobe_multi. Fix that by moving
the check earlier, before the offsets allocations.
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241115115843.694337-1-jolsa@kernel.org
Andrii Nakryiko [Fri, 15 Nov 2024 00:13:03 +0000 (16:13 -0800)]
bpf: use common instruction history across all states
Instead of allocating and copying instruction history each time we
enqueue child verifier state, switch to a model where we use one common
dynamically sized array of instruction history entries across all states.
The key observation for proving this correct is that instruction
history is only relevant while the state is active: either it is the
current state (and thus we are actively modifying instruction history
and no other state can interfere with us), or it is a checkpointed state
with some children still active (either enqueued or being current).
In the latter case our portion of instruction history is finalized and
won't change or grow, so as long as we keep it immutable until the state
is finalized, we are good.
Now, when a state is finalized and put into the state hash for potential
future pruning lookups, instruction history is not used anymore. This is
because instruction history is only used by precision marking logic, and
we never modify precision markings for finalized states.
So, instead of each state having its own small instruction history, we
keep a global dynamically-sized instruction history, where each state in
current DFS path from root to active state remembers its portion of
instruction history. Current state can append to this history, but
cannot modify any of its parent histories.
Async callback state enqueueing, while logically detached from the parent
state, is still part of the verification backtracking tree, so it has to
follow the same schema as normal state checkpoints.
Because the insn_hist array can be grown through realloc, states don't
keep pointers; they instead maintain two indices, [start, end), into the
global instruction history array. End is an exclusive index, so
`start == end` means there is no relevant instruction history.
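A minimal user-space sketch of this indexing scheme (all identifiers here
are illustrative, not the verifier's actual names):
  #include <stdlib.h>

  struct state {
          unsigned int hist_start; /* first entry of this state's slice */
          unsigned int hist_end;   /* exclusive; start == end means empty */
  };

  struct env {
          unsigned int *hist;      /* one shared, dynamically grown array */
          unsigned int hist_cnt, hist_cap;
  };

  /* Only the current (deepest) state may append; parent slices stay
   * immutable. Growth goes through realloc, which is why states keep
   * indices rather than pointers into env->hist. */
  static int hist_push(struct env *env, struct state *cur, unsigned int insn_idx)
  {
          if (env->hist_cnt == env->hist_cap) {
                  unsigned int cap = env->hist_cap ? env->hist_cap * 2 : 64;
                  unsigned int *p = realloc(env->hist, cap * sizeof(*p));

                  if (!p)
                          return -1;
                  env->hist = p;
                  env->hist_cap = cap;
          }
          env->hist[env->hist_cnt++] = insn_idx;
          cur->hist_end = env->hist_cnt;
          return 0;
  }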
This eliminates a lot of allocations and minimizes overall memory usage.
For instance, running a worst-case test from [0] (but without the
heuristics-based fix [1]), it took 12.5 minutes until we got -ENOMEM.
With the changes in this patch the whole test succeeds in 10 minutes
(very slow, so the heuristics from [1] are important, of course).
To further validate correctness, veristat-based comparison was performed for
Meta production BPF objects and BPF selftests objects. In both cases there
were no differences *at all* in terms of verdict or instruction and state
counts, providing good confidence in the change.
Having this low-memory-overhead solution of keeping dynamic
per-instruction history cheaply opens up some new possibilities, like
keeping extra information for literally every single validated
instruction. This will be used for simplifying precision backpropagation
logic in follow up patches.
[0] https://lore.kernel.org/bpf/20241029172641.1042523-2-eddyz87@gmail.com/
[1] https://lore.kernel.org/bpf/20241029172641.1042523-1-eddyz87@gmail.com/
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20241115001303.277272-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yonghong Song [Fri, 15 Nov 2024 06:03:54 +0000 (22:03 -0800)]
bpf: Add necessary migrate_disable to range_tree.
When running bpf selftest (./test_progs -j), the following warnings
showed up:
$ ./test_progs -t arena_atomics
...
BUG: using smp_processor_id() in preemptible [00000000] code: kworker/u19:0/12501
caller is bpf_mem_free+0x128/0x330
...
Call Trace:
<TASK>
dump_stack_lvl
check_preemption_disabled
bpf_mem_free
range_tree_destroy
arena_map_free
bpf_map_free_deferred
process_scheduled_works
...
For selftests arena_htab and arena_list, similar smp_processor_id() BUGs are
dumped, and the following are two stack traces:
<TASK>
dump_stack_lvl
check_preemption_disabled
bpf_mem_alloc
range_tree_set
arena_map_alloc
map_create
...
<TASK>
dump_stack_lvl
check_preemption_disabled
bpf_mem_alloc
range_tree_clear
arena_vm_fault
do_pte_missing
handle_mm_fault
do_user_addr_fault
...
Add migrate_{disable,enable}() around related bpf_mem_{alloc,free}()
calls to fix the issue.
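A hedged sketch of the pattern (allocator handle bpf_global_ma assumed for
illustration; the actual patch touches several range_tree call sites):
  struct range_node *new;

  migrate_disable();
  new = bpf_mem_alloc(&bpf_global_ma, sizeof(struct range_node));
  migrate_enable();
  if (!new)
          return -ENOMEM;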
Fixes: b795379757eb ("bpf: Introduce range_tree data structure and use it in bpf arena")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20241115060354.2832495-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Viktor Malik [Fri, 15 Nov 2024 08:25:48 +0000 (09:25 +0100)]
bpf: Do not alloc arena on unsupported arches
Do not allocate BPF arena on arches that do not support it, instead
return EOPNOTSUPP. This is useful to prevent bugs such as soft lockups
while trying to free the arena which we have witnessed on ppc64le [1].
[1] https://lore.kernel.org/bpf/4afdcb50-13f2-4772-8db1-3fd02bd985b3@redhat.com/
Signed-off-by: Viktor Malik <vmalik@redhat.com>
Link: https://lore.kernel.org/r/20241115082548.74972-1-vmalik@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Ihor Solodrai [Fri, 15 Nov 2024 00:38:55 +0000 (00:38 +0000)]
selftests/bpf: Set test path for token/obj_priv_implicit_token_envvar
token/obj_priv_implicit_token_envvar test may fail in an environment
where the process executing tests cannot write to the root path.
Example:
https://github.com/libbpf/libbpf/actions/runs/11844507007/job/33007897936
Change default path used by the test to /tmp/bpf-token-fs, and make it
runtime configurable via an environment variable.
Signed-off-by: Ihor Solodrai <ihor.solodrai@pm.me>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241115003853.864397-1-ihor.solodrai@pm.me
Andrii Nakryiko [Wed, 13 Nov 2024 21:51:10 +0000 (13:51 -0800)]
Merge branch 'bpf-range_tree-for-bpf-arena'
Alexei Starovoitov says:
====================
bpf: range_tree for bpf arena
From: Alexei Starovoitov <ast@kernel.org>
Introduce range_tree (interval tree plus rbtree) to track
unallocated ranges in bpf arena and replace maple_tree with it.
This is a step towards making bpf_arena_alloc/free_pages non-sleepable.
The previous approach of reusing drm_mm to replace maple_tree reached
a dead end, since sizeof(struct drm_mm_node) = 168 and
sizeof(struct maple_node) = 256, while
sizeof(struct range_node) = 64 as introduced in this patch.
Not only is it smaller, but the algorithm also splits and merges
adjacent ranges. Ultimate performance doesn't matter.
The main objective of range_tree is to work in contexts
where kmalloc/kfree are not safe. It achieves that via bpf_mem_alloc.
====================
Link: https://patch.msgid.link/20241108025616.17625-1-alexei.starovoitov@gmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Alexei Starovoitov [Fri, 8 Nov 2024 02:56:16 +0000 (18:56 -0800)]
selftests/bpf: Add a test for arena range tree algorithm
Add a test that verifies specific behavior of the arena range tree
algorithm and adjust the existing big_alloc1 test due to the use
of global data in the arena.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20241108025616.17625-3-alexei.starovoitov@gmail.com
Alexei Starovoitov [Fri, 8 Nov 2024 02:56:15 +0000 (18:56 -0800)]
bpf: Introduce range_tree data structure and use it in bpf arena
Introduce the range_tree data structure and use it in bpf arena to track
ranges of allocated pages. Conceptually, range_tree is a large bitmap,
implemented as an interval tree plus rbtree, where a contiguous sequence
of bits represents unallocated pages.
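A hedged sketch of the node layout this implies (field names assumed for
illustration; per the cover letter the real struct is 64 bytes): each node
describes one contiguous run of unallocated pages and is linked into two
trees, one keyed by range for lookup/merging and one keyed by size for
allocation.
  struct range_node {
          struct rb_node rn_rbnode;     /* interval tree, keyed by [start, last] */
          struct rb_node rb_range_size; /* rbtree, keyed by range length */
          u32 rn_start;                 /* first unallocated page in the run */
          u32 rn_last;                  /* last unallocated page, inclusive */
  };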
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20241108025616.17625-2-alexei.starovoitov@gmail.com
Alexei Starovoitov [Wed, 13 Nov 2024 20:51:15 +0000 (12:51 -0800)]
Merge git://git./pub/scm/linux/kernel/git/bpf/bpf
Cross-merge bpf fixes after downstream PR.
In particular to bring the fix in
commit aa30eb3260b2 ("bpf: Force checkpoint when jmp history is too long").
The follow up verifier work depends on it.
And the fix in
commit 6801cf7890f2 ("selftests/bpf: Use -4095 as the bad address for bits iterator").
It's fixing instability of BPF CI on s390 arch.
No conflicts.
Adjacent changes in:
Auto-merging arch/Kconfig
Auto-merging kernel/bpf/helpers.c
Auto-merging kernel/bpf/memalloc.c
Auto-merging kernel/bpf/verifier.c
Auto-merging mm/slab_common.c
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Zhu Jun [Mon, 11 Nov 2024 06:15:14 +0000 (22:15 -0800)]
samples/bpf: Remove unused variable in xdp2skb_meta_kern.c
The variable is never referenced in the code, just remove it.
Signed-off-by: Zhu Jun <zhujun2@cmss.chinamobile.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241111061514.3257-1-zhujun2@cmss.chinamobile.com
Zhu Jun [Mon, 11 Nov 2024 06:23:12 +0000 (22:23 -0800)]
samples/bpf: Remove unused variables in tc_l2_redirect_kern.c
These variables are never referenced in the code, just remove them.
Signed-off-by: Zhu Jun <zhujun2@cmss.chinamobile.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241111062312.3541-1-zhujun2@cmss.chinamobile.com
Luo Yifan [Tue, 12 Nov 2024 07:37:01 +0000 (15:37 +0800)]
bpftool: Cast variable `var` to long long
When the SIGNED condition is met, the variable `var` should be cast to
`long long` instead of `unsigned long long`.
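An illustrative sketch of the fix (context simplified from bpftool's
dumper; is_signed stands in for the SIGNED condition):
  if (is_signed)
          printf("%lld", (long long)var);   /* sign-extends correctly */
  else
          printf("%llu", (unsigned long long)var);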
Signed-off-by: Luo Yifan <luoyifan@cmss.chinamobile.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Quentin Monnet <qmo@kernel.org>
Link: https://lore.kernel.org/bpf/20241112073701.283362-1-luoyifan@cmss.chinamobile.com
Linus Torvalds [Wed, 13 Nov 2024 17:14:19 +0000 (09:14 -0800)]
Merge tag 'bpf-fixes' of git://git./linux/kernel/git/bpf/bpf
Pull bpf fixes from Daniel Borkmann:
- Fix a mismatching RCU unlock flavor in bpf_out_neigh_v6 (Jiawei Ye)
- Fix BPF sockmap with kTLS to reject vsock and unix sockets upon kTLS
context retrieval (Zijian Zhang)
- Fix BPF bits iterator selftest for s390x (Hou Tao)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Fix mismatched RCU unlock flavour in bpf_out_neigh_v6
bpf: Add sk_is_inet and IS_ICSK check in tls_sw_has_ctx_tx/rx
selftests/bpf: Use -4095 as the bad address for bits iterator
Linus Torvalds [Wed, 13 Nov 2024 17:09:00 +0000 (09:09 -0800)]
Merge tag 'loongarch-fixes-6.12-2' of git://git./linux/kernel/git/chenhuacai/linux-loongson
Pull LoongArch fixes from Huacai Chen:
- setup logical-physical CPU mapping for all possible CPUs, in order to
avoid CPU hotplug issues
- fix some KASAN bugs
- fix AP booting issue in VM mode
- some trivial cleanups
* tag 'loongarch-fixes-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
LoongArch: Fix AP booting issue in VM mode
LoongArch: Add WriteCombine shadow mapping in KASAN
LoongArch: Disable KASAN if PGDIR_SIZE is too large for cpu_vabits
LoongArch: Make KASAN work with 5-level page-tables
LoongArch: Define a default value for VM_DATA_DEFAULT_FLAGS
LoongArch: Fix early_numa_add_cpu() usage for FDT systems
LoongArch: For all possible CPUs setup logical-physical CPU mapping
Linus Torvalds [Wed, 13 Nov 2024 16:58:11 +0000 (08:58 -0800)]
Merge tag 'mm-hotfixes-stable-2024-11-12-16-39' of git://git./linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"10 hotfixes, 7 of which are cc:stable. 7 are MM, 3 are not. All
singletons"
* tag 'mm-hotfixes-stable-2024-11-12-16-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: swapfile: fix cluster reclaim work crash on rotational devices
selftests: hugetlb_dio: fixup check for initial conditions to skip in the start
mm/thp: fix deferred split queue not partially_mapped: fix
mm/gup: avoid an unnecessary allocation call for FOLL_LONGTERM cases
nommu: pass NULL argument to vma_iter_prealloc()
ocfs2: fix UBSAN warning in ocfs2_verify_volume()
nilfs2: fix null-ptr-deref in block_dirty_buffer tracepoint
nilfs2: fix null-ptr-deref in block_touch_buffer tracepoint
mm: page_alloc: move mlocked flag clearance into free_pages_prepare()
mm: count zeromap read and set for swapout and swapin
Leon Hwang [Thu, 7 Nov 2024 13:45:28 +0000 (21:45 +0800)]
bpf, x86: Propagate tailcall info only for subprogs
In x64 JIT, propagate tailcall info only for subprogs, not for helpers
or kfuncs.
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20241107134529.8602-2-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Alexei Starovoitov [Wed, 13 Nov 2024 01:13:47 +0000 (17:13 -0800)]
Merge branch 'add-kernel-symbol-for-struct_ops-trampoline'
Xu Kuohai says:
====================
Add kernel symbol for struct_ops trampoline.
Without a kernel symbol for the struct_ops trampoline, the unwinder may
produce unexpected stacktraces. For example, the x86 ORC and FP
unwinders stop the stacktrace on a struct_ops trampoline address since
there is no kernel symbol for the address.
v4:
- Add a separate cleanup patch to remove unused member rcu from
bpf_struct_ops_map (patch 1)
- Use funcs_cnt instead of btf_type_vlen(vt) for links memory
calculation in .map_mem_usage (patch 2)
- Include ksyms[] memory in map_mem_usage (patch 3)
- Various fixes in patch 3 (Thanks to Martin)
v3: https://lore.kernel.org/bpf/20241111121641.2679885-1-xukuohai@huaweicloud.com/
- Add a separate cleanup patch to replace links_cnt with funcs_cnt
- Allocate ksyms on-demand in update_elem() to stay with the links
allocation way
- Set ksym name to prog__<struct_ops_name>_<member_name>
v2: https://lore.kernel.org/bpf/20241101111948.1570547-1-xukuohai@huaweicloud.com/
- Refine the commit message for clarity and fix a test bot warning
v1: https://lore.kernel.org/bpf/20241030111533.907289-1-xukuohai@huaweicloud.com/
====================
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20241112145849.3436772-1-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Xu Kuohai [Tue, 12 Nov 2024 14:58:49 +0000 (22:58 +0800)]
bpf: Add kernel symbol for struct_ops trampoline
Without kernel symbols for struct_ops trampoline, the unwinder may
produce unexpected stacktraces.
For example, the x86 ORC and FP unwinders check if an IP is in kernel
text by verifying the presence of the IP's kernel symbol. When a
struct_ops trampoline address is encountered, the unwinder stops due
to the absence of a symbol, resulting in an incomplete stacktrace that
consists only of direct and indirect child functions called from the
trampoline.
The arm64 unwinder is another example. While the arm64 unwinder can
proceed across a struct_ops trampoline address, the corresponding
symbol name is displayed as "unknown", which is confusing.
Thus, add kernel symbol for struct_ops trampoline. The name is
bpf__<struct_ops_name>_<member_name>, where <struct_ops_name> is the
type name of the struct_ops, and <member_name> is the name of
the member that the trampoline is linked to.
Below is a comparison of stacktraces captured on x86 by perf record,
before and after this patch.
Before:
ffffffff8116545d __lock_acquire+0xad ([kernel.kallsyms])
ffffffff81167fcc lock_acquire+0xcc ([kernel.kallsyms])
ffffffff813088f4 __bpf_prog_enter+0x34 ([kernel.kallsyms])
After:
ffffffff811656bd __lock_acquire+0x30d ([kernel.kallsyms])
ffffffff81167fcc lock_acquire+0xcc ([kernel.kallsyms])
ffffffff81309024 __bpf_prog_enter+0x34 ([kernel.kallsyms])
ffffffffc000d7e9 bpf__tcp_congestion_ops_cong_avoid+0x3e ([kernel.kallsyms])
ffffffff81f250a5 tcp_ack+0x10d5 ([kernel.kallsyms])
ffffffff81f27c66 tcp_rcv_established+0x3b6 ([kernel.kallsyms])
ffffffff81f3ad03 tcp_v4_do_rcv+0x193 ([kernel.kallsyms])
ffffffff81d65a18 __release_sock+0xd8 ([kernel.kallsyms])
ffffffff81d65af4 release_sock+0x34 ([kernel.kallsyms])
ffffffff81f15c4b tcp_sendmsg+0x3b ([kernel.kallsyms])
ffffffff81f663d7 inet_sendmsg+0x47 ([kernel.kallsyms])
ffffffff81d5ab40 sock_write_iter+0x160 ([kernel.kallsyms])
ffffffff8149c67b vfs_write+0x3fb ([kernel.kallsyms])
ffffffff8149caf6 ksys_write+0xc6 ([kernel.kallsyms])
ffffffff8149cb5d __x64_sys_write+0x1d ([kernel.kallsyms])
ffffffff81009200 x64_sys_call+0x1d30 ([kernel.kallsyms])
ffffffff82232d28 do_syscall_64+0x68 ([kernel.kallsyms])
ffffffff8240012f entry_SYSCALL_64_after_hwframe+0x76 ([kernel.kallsyms])
Fixes: 85d33df357b6 ("bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20241112145849.3436772-4-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Xu Kuohai [Tue, 12 Nov 2024 14:58:48 +0000 (22:58 +0800)]
bpf: Use function pointers count as struct_ops links count
Only function pointers in a struct_ops structure can be linked to bpf
progs, so set the links count to the function pointers count, instead
of the total members count in the structure.
Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20241112145849.3436772-3-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Xu Kuohai [Tue, 12 Nov 2024 14:58:47 +0000 (22:58 +0800)]
bpf: Remove unused member rcu from bpf_struct_ops_map
The rcu member in bpf_struct_ops_map is not used after
commit b671c2067a04 ("bpf: Retire the struct_ops map kvalue->refcnt.")
Remove it.
Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20241112145849.3436772-2-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Linus Torvalds [Wed, 13 Nov 2024 00:39:34 +0000 (16:39 -0800)]
Merge tag 'for_linus' of git://git./linux/kernel/git/mst/vhost
Pull virtio fix from Michael Tsirkin:
"A last minute mlx5 bugfix"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vdpa/mlx5: Fix PA offset with unaligned starting iotlb map
Alexei Starovoitov [Wed, 13 Nov 2024 00:26:25 +0000 (16:26 -0800)]
Merge branch 'bpf-support-private-stack-for-bpf-progs'
Yonghong Song says:
====================
bpf: Support private stack for bpf progs
The main motivation for private stack comes from the nested scheduler in
sched-ext from Tejun. The basic idea is that
- each cgroup will have its own associated bpf program,
- a bpf program of a parent cgroup will call bpf programs
in immediate child cgroups.
Let us say we have the following cgroup hierarchy:
root_cg (prog0):
cg1 (prog1):
cg11 (prog11):
cg111 (prog111)
cg112 (prog112)
cg12 (prog12):
cg121 (prog121)
cg122 (prog122)
cg2 (prog2):
cg21 (prog21)
cg22 (prog22)
cg23 (prog23)
In the above example, prog0 will call a kfunc which will call prog1 and
prog2 to get sched info for cg1 and cg2 and then the information is
summarized and sent back to prog0. Similarly, prog11 and prog12 will be
invoked in the kfunc and the result will be summarized and sent back to
prog1, etc. The following illustrates a possible call sequence:
... -> bpf prog A -> kfunc -> ops.<callback_fn> (bpf prog B) ...
Currently, for each thread, the x86 kernel allocates a 16KB stack. Each
bpf program (including its subprograms) has a maximum 512B stack size to
avoid potential stack overflow. Nested bpf programs further increase the
risk of stack overflow. To avoid potential stack overflow caused by bpf
programs, this patch set supports private stack: bpf program stack space
is allocated at jit time. Using private stack for bpf progs can reduce
or avoid potential kernel stack overflow.
Currently private stack is applied to tracing programs like kprobe/uprobe,
perf_event, tracepoint, raw tracepoint and struct_ops progs.
Tracing progs enable private stack if any subprog stack size is more
than a threshold (i.e. 64B). Struct-ops progs enable private stack
based on the particular struct_ops implementation, which can enable
private stack before per-insn verification. Struct-ops progs get the
same treatment as tracing progs w.r.t. when to enable private stack.
For all these progs, the kernel will do a recursion check (no nesting
per prog per cpu) to ensure that the private stack won't be overwritten.
The bpf_prog_aux struct has a callback func recursion_detected() which
can be implemented by kernel subsystem to synchronously detect recursion,
report error, etc.
Only x86_64 arch supports private stack now. It can be extended to other
archs later. Please see each individual patch for details.
Change logs:
v11 -> v12:
link: https://lore.kernel.org/bpf/20241109025312.148539-1-yonghong.song@linux.dev/
- Fix a bug where allocated percpu space is less than actual private stack.
- Add guard memory (before and after actual prog stack) to detect potential
underflow/overflow.
v10 -> v11:
link: https://lore.kernel.org/bpf/20241107024138.3355687-1-yonghong.song@linux.dev/
- Use two bool variables, priv_stack_requested (used by struct-ops only) and
jits_use_priv_stack, in order to make code cleaner.
- Set env->prog->aux->jits_use_priv_stack to true if any subprog uses private stack.
This is for struct-ops use case to kick in recursion protection.
v9 -> v10:
link: https://lore.kernel.org/bpf/20241104193455.3241859-1-yonghong.song@linux.dev/
- Simplify handling of async cbs by making async cb related progs use the
normal kernel stack.
- Do percpu allocation in jit instead of verifier.
v8 -> v9:
link: https://lore.kernel.org/bpf/20241101030950.2677215-1-yonghong.song@linux.dev/
- Use enum to express priv stack mode.
- Use bits in bpf_subprog_info struct to do subprog recursion check between
main/async and async subprogs.
- Fix potential memory leak.
- Rename recursion detection func from recursion_skipped() to recursion_detected().
v7 -> v8:
link: https://lore.kernel.org/bpf/20241029221637.264348-1-yonghong.song@linux.dev/
- Add recursion_skipped() callback func to bpf_prog->aux structure such that if
a recursion miss happened and bpf_prog->aux->recursion_skipped is not NULL, the
callback fn will be called so the subsystem can do proper action based on their
respective design.
v6 -> v7:
link: https://lore.kernel.org/bpf/20241020191341.2104841-1-yonghong.song@linux.dev/
- Going back to doing private stack allocation per prog instead of per subtree.
This can simplify the implementation and avoid verifier complexity.
- Handle potential nested subprog run if async callback exists.
- Use struct_ops->check_member() callback to set whether a particular struct-ops
prog wants private stack or not.
v5 -> v6:
link: https://lore.kernel.org/bpf/20241017223138.3175885-1-yonghong.song@linux.dev/
- Instead of using (or not using) private stack at struct_ops level,
each prog in struct_ops can decide whether to use private stack or not.
v4 -> v5:
link: https://lore.kernel.org/bpf/20241010175552.1895980-1-yonghong.song@linux.dev/
- Remove bpf_prog_call() related implementation.
- Allow (opt-in) private stack for sched-ext progs.
v3 -> v4:
link: https://lore.kernel.org/bpf/20240926234506.1769256-1-yonghong.song@linux.dev/
There is a long discussion in the above v3 link trying to allow private
stack to be used by kernel functions in order to simplify implementation.
But unfortunately we didn't find a workable solution yet, so we return
to the approach where private stack is only used by bpf programs.
- Add bpf_prog_call() kfunc.
v2 -> v3:
- Instead of per-subprog private stack allocation, allocate private
stacks at main prog or callback entry prog. Subprogs not main or callback
progs will increment the inherited stack pointer to be their
frame pointer.
- Private stack allows each prog's max stack size to be 512 bytes, instead
of the whole prog hierarchy being limited to 512 bytes.
- Add some tests.
====================
Link: https://lore.kernel.org/r/20241112163902.2223011-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yonghong Song [Tue, 12 Nov 2024 16:39:38 +0000 (08:39 -0800)]
selftests/bpf: Add struct_ops prog private stack tests
Add three tests for struct_ops using private stack.
./test_progs -t struct_ops_private_stack
#336/1 struct_ops_private_stack/private_stack:OK
#336/2 struct_ops_private_stack/private_stack_fail:OK
#336/3 struct_ops_private_stack/private_stack_recur:OK
#336 struct_ops_private_stack:OK
The following is a snippet of a struct_ops check_member() implementation:
u32 moff = __btf_member_bit_offset(t, member) / 8;
switch (moff) {
case offsetof(struct bpf_testmod_ops3, test_1):
prog->aux->priv_stack_requested = true;
prog->aux->recursion_detected = test_1_recursion_detected;
fallthrough;
default:
break;
}
return 0;
The first test is with two different nested callback functions where the
first prog has more than 512 byte stack size (including subprogs) with
private stack enabled.
The second test is a negative test where the second prog has more than 512
byte stack size without private stack enabled.
The third test is the same callback function recursing itself. At run time,
the jit trampoline recursion check kicks in to prevent the recursion. The
recursion_detected() callback function is implemented by the bpf_testmod;
the following message in dmesg
bpf_testmod: oh no, recursing into test_1, recursion_misses 1
demonstrates that the callback function is indeed triggered when a
recursion miss happens.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20241112163938.2225528-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yonghong Song [Tue, 12 Nov 2024 16:39:33 +0000 (08:39 -0800)]
bpf: Support private stack for struct_ops progs
For struct_ops progs, whether a particular prog uses private stack
depends on prog->aux->priv_stack_requested setting before actual
insn-level verification for that prog. One particular implementation
is to piggyback on struct_ops->check_member(). The next patch has
an example of this. The struct_ops->check_member() sets
prog->aux->priv_stack_requested to true, which enables private stack
usage.
The struct_ops prog follows the same rule as kprobe/tracing progs after
function bpf_enable_priv_stack(). For example, even if a struct_ops prog
requests private stack, it could still use the normal kernel stack if
the stack size is small (< 64 bytes).
Similar to tracing progs, a nested run of the same prog on the same cpu
will be skipped. A field (recursion_detected()) is added to the
bpf_prog_aux structure. If bpf_prog->aux->recursion_detected is
implemented by the struct_ops subsystem and a nested same cpu/prog run
happens, the function will be triggered to report an error, collect
related info, etc.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20241112163933.2224962-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yonghong Song [Tue, 12 Nov 2024 16:39:27 +0000 (08:39 -0800)]
selftests/bpf: Add tracing prog private stack tests
Some private stack tests are added, including:
- main prog only with stack size greater than BPF_PSTACK_MIN_SIZE.
- main prog only with stack size smaller than BPF_PSTACK_MIN_SIZE.
- prog with one subprog having MAX_BPF_STACK stack size and another
subprog having non-zero small stack size.
- prog with callback function.
- prog with exception in main prog or subprog.
- prog with async callback without nesting
- prog with async callback with possible nesting
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20241112163927.2224750-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yonghong Song [Tue, 12 Nov 2024 16:39:22 +0000 (08:39 -0800)]
bpf, x86: Support private stack in jit
Private stack is allocated in function bpf_int_jit_compile() with
alignment 8. The private stack allocation size includes the stack size
determined by the verifier and additional space to protect against stack
overflow and underflow. See the illustration below:
---> memory address increasing
[8 bytes to protect overflow] [normal stack] [8 bytes to protect underflow]
If overflow/underflow is detected, kernel messages will be
emitted in dmesg like
BPF private stack overflow/underflow detected for prog Fx
BPF Private stack overflow/underflow detected for prog bpf_prog_a41699c234a1567a_subprog1x
Those messages were generated when I made some changes to the jitted code
to intentionally cause overflow for some progs.
For the jited prog, the x86 register R9 (X86_REG_R9) is used to replace
the bpf frame register (BPF_REG_10). The private stack is used per
subprog per cpu. X86_REG_R9 is saved and restored around every
func call (not including tailcall) to maintain the correctness of
X86_REG_R9.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20241112163922.2224385-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yonghong Song [Tue, 12 Nov 2024 16:39:17 +0000 (08:39 -0800)]
bpf, x86: Avoid repeated usage of bpf_prog->aux->stack_depth
Refactor the code to avoid repeated usage of bpf_prog->aux->stack_depth
in the do_jit() func. If the private stack is used, the stack_depth will
be 0 for that prog. The refactoring makes it easy to adjust stack_depth.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20241112163917.2224189-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yonghong Song [Tue, 12 Nov 2024 16:39:12 +0000 (08:39 -0800)]
bpf: Enable private stack for eligible subprogs
If private stack is used by any subprog, set that subprog's
prog->aux->jits_use_priv_stack to true so the jit can later allocate
private stack for that subprog properly.
Also set env->prog->aux->jits_use_priv_stack to true if
any subprog uses private stack. This covers the case of a
single main prog (no subprogs) using private stack, and
also the case of later struct-ops progs where
env->prog->aux->jits_use_priv_stack will enable the recursion
check if any subprog uses private stack.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20241112163912.2224007-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yonghong Song [Tue, 12 Nov 2024 16:39:07 +0000 (08:39 -0800)]
bpf: Find eligible subprogs for private stack support
Private stack will be allocated with the percpu allocator at jit time.
To avoid complexity at runtime, only one copy of private stack is
available per cpu per prog. So a runtime recursion check is necessary
to avoid stack corruption.
Current private stack only supports kprobe/perf_event/tp/raw_tp, which
have recursion checks in the kernel, and prog types that use the bpf
trampoline recursion check. For trampoline related prog types,
currently only tracing progs have recursion checking.
To avoid complexity, all async_cb subprogs use the normal kernel stack,
including those subprogs used by both the main prog subtree and the
async_cb subtree. Any prog having a tail call also uses the kernel stack.
To avoid jit penalty with private stack support, a subprog stack
size threshold is set such that private stack is supported only if the
stack size is no less than the threshold. The current threshold
is 64 bytes. This avoids jit penalty if the stack usage is small.
A useless 'continue' is also removed from a loop in func
check_max_stack_depth().
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20241112163907.2223839-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Johannes Weiner [Thu, 7 Nov 2024 14:08:36 +0000 (09:08 -0500)]
mm: swapfile: fix cluster reclaim work crash on rotational devices
syzbot and Daan report a NULL pointer crash in the new full swap cluster
reclaim work:
> Oops: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN PTI
> KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
> CPU: 1 UID: 0 PID: 51 Comm: kworker/1:1 Not tainted 6.12.0-rc6-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
> Workqueue: events swap_reclaim_work
> RIP: 0010:__list_del_entry_valid_or_report+0x20/0x1c0 lib/list_debug.c:49
> Code: 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 48 89 fe 48 83 c7 08 48 83 ec 18 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 19 01 00 00 48 89 f2 48 8b 4e 08 48 b8 00 00 00
> RSP: 0018:ffffc90000bb7c30 EFLAGS: 00010202
> RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffff88807b9ae078
> RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000008
> RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000001 R11: 000000000000004f R12: dffffc0000000000
> R13: ffffffffffffffb8 R14: ffff88807b9ae000 R15: ffffc90003af1000
> FS:  0000000000000000(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fffaca68fb8 CR3: 00000000791c8000 CR4: 00000000003526f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> <TASK>
> __list_del_entry_valid include/linux/list.h:124 [inline]
> __list_del_entry include/linux/list.h:215 [inline]
> list_move_tail include/linux/list.h:310 [inline]
> swap_reclaim_full_clusters+0x109/0x460 mm/swapfile.c:748
> swap_reclaim_work+0x2e/0x40 mm/swapfile.c:779
The syzbot console output indicates a virtual environment where swapfile
is on a rotational device. In this case, clusters aren't actually used,
and si->full_clusters is not initialized. Daan's report is from qemu, so
likely rotational too.
Make sure to only schedule the cluster reclaim work when clusters are
actually in use.
Link: https://lkml.kernel.org/r/20241107142335.GB1172372@cmpxchg.org
Link: https://lore.kernel.org/lkml/672ac50b.050a0220.2edce.1517.GAE@google.com/
Link: https://github.com/systemd/systemd/issues/35044
Fixes: 5168a68eb78f ("mm, swap: avoid over reclaim of full clusters")
Reported-by: syzbot+078be8bfa863cb9e0c6b@syzkaller.appspotmail.com
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Cc: Kairui Song <ryncsn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Si-Wei Liu [Mon, 21 Oct 2024 13:40:39 +0000 (16:40 +0300)]
vdpa/mlx5: Fix PA offset with unaligned starting iotlb map
When calculating the physical address range based on the iotlb and mr
[start,end) ranges, the offset of mr->start relative to map->start
is not taken into account. This leads to some incorrect and duplicate
mappings.
For the case when mr->start < map->start the code is already correct:
the range in [mr->start, map->start) was handled by a different
iteration.
Fixes: 94abbccdf291 ("vdpa/mlx5: Add shared memory registration code")
Cc: stable@vger.kernel.org
Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20241021134040.975221-2-dtatulea@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Alexei Starovoitov [Tue, 12 Nov 2024 21:53:27 +0000 (13:53 -0800)]
Merge branch 'selftests-bpf-fix-for-bpf_signal-stalls-watchdog-for-test_progs'
Eduard Zingerman says:
====================
selftests/bpf: fix for bpf_signal stalls, watchdog for test_progs
Test case 'bpf_signal' had recently been reported to stall, both on
the mailing list [1] and in CI [2]. The stall is caused by the CPU cycles
perf event not being delivered within the expected time frame before the
test process enters a system call and waits indefinitely.
This patch-set addresses the issue in several ways:
- A watchdog timer is added to test_progs.c runner:
- it prints current sub-test name to stderr if sub-test takes longer
than 10 seconds to finish;
- it terminates process executing sub-test if sub-test takes longer
than 120 seconds to finish.
- The test case is updated to await the perf event notification with a
timeout and a few retries; this serves two purposes:
- busy loops longer to increase the time frame for CPU cycles event
generation/delivery;
- makes a timeout, not a stall, the worst case scenario.
- The test case is updated to lower the frequency of perf events, as a
high frequency of such events caused event generation throttling,
which in turn delayed event delivery by an amount of time sufficient
to cause test case failure.
Note:
librt pthread-based timer API is used to implement watchdog timer.
I chose this API over SIGALRM because signal handler execution
within test process context was sufficient to trigger perf event
delivery for send_signal/send_signal_nmi_thread_remote test case,
w/o any additional changes. Thus I concluded that SIGALRM based
implementation interferes with tests execution.
[1] https://lore.kernel.org/bpf/CAP01T75OUeE8E-Lw9df84dm8ag2YmHW619f1DmPSVZ5_O89+Bg@mail.gmail.com/
[2] https://github.com/kernel-patches/bpf/actions/runs/11791485271/job/32843996871
====================
Link: https://lore.kernel.org/r/20241112110906.3045278-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Eduard Zingerman [Tue, 12 Nov 2024 11:09:06 +0000 (03:09 -0800)]
selftests/bpf: update send_signal to lower perf events frequency
Similar to commit [1], sample perf events less often in
test_send_signal_nmi(). This should reduce perf event throttling.
[1] 7015843afcaf ("selftests/bpf: Fix send_signal test with nested CONFIG_PARAVIRT")
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20241112110906.3045278-5-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Eduard Zingerman [Tue, 12 Nov 2024 11:09:05 +0000 (03:09 -0800)]
selftests/bpf: allow send_signal test to timeout
The following invocation:
$ t1=send_signal/send_signal_perf_thread_remote \
t2=send_signal/send_signal_nmi_thread_remote \
./test_progs -t $t1,$t2
Leads to send_signal_nmi_thread_remote being stuck
on line 180:
/* wait for result */
err = read(pipe_c2p[0], buf, 1);
In this test case:
- perf event PERF_COUNT_HW_CPU_CYCLES is created for parent process;
- BPF program is attached to perf event, and sends a signal to child
process when event occurs;
- parent program burns some CPU in busy loop and calls read() to get
notification from child that it received a signal.
The perf event is declared with .sample_period = 1.
This forces perf to throttle events, and under some unclear conditions
the event does not always occur while the parent is in the busy loop.
After the parent enters the read() system call, CPU cycles events won't
be generated for the parent anymore. Thus, if the perf event has not
occurred already, the test is stuck.
This commit updates the parent to wait for notification with a timeout,
doing several iterations of busy loop + read_with_timeout().
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20241112110906.3045278-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Eduard Zingerman [Tue, 12 Nov 2024 11:09:04 +0000 (03:09 -0800)]
selftests/bpf: add read_with_timeout() utility function
int read_with_timeout(int fd, char *buf, size_t count, long usec)
Like a regular read(2), but allows specifying a timeout in
microseconds. Returns -EAGAIN on timeout.
Implemented using select().
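A minimal sketch of such a helper matching the described behavior,
assuming the select() approach from the commit message (not necessarily
the exact upstream code):
  #include <errno.h>
  #include <sys/select.h>
  #include <unistd.h>

  int read_with_timeout(int fd, char *buf, size_t count, long usec)
  {
          struct timeval tv = { .tv_sec = usec / 1000000, .tv_usec = usec % 1000000 };
          fd_set rfds;
          int err;

          FD_ZERO(&rfds);
          FD_SET(fd, &rfds);
          err = select(fd + 1, &rfds, NULL, NULL, &tv);
          if (err < 0)
                  return -errno;
          if (err == 0)
                  return -EAGAIN; /* timeout, fd never became readable */
          return (int)read(fd, buf, count);
  }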
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20241112110906.3045278-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Eduard Zingerman [Tue, 12 Nov 2024 11:09:03 +0000 (03:09 -0800)]
selftests/bpf: watchdog timer for test_progs
This commit provides a watchdog timer that sets a limit on how long a
single sub-test could run:
- if sub-test runs for 10 seconds, the name of the test is printed
(currently the name of the test is printed only after it finishes);
- if sub-test runs for 120 seconds, the running thread is terminated
with SIGSEGV (to trigger crash_handler() and get a stack trace).
Specifically:
- the timer is armed on each call to run_one_test();
- re-armed at each call to test__start_subtest();
- is stopped when exiting run_one_test().
Default timeout could be overridden using '-w' or '--watchdog-timeout'
options. Value 0 can be used to turn the timer off.
Here is an example execution:
$ ./ssh-exec.sh ./test_progs -w 5 -t \
send_signal/send_signal_perf_thread_remote,send_signal/send_signal_nmi_thread_remote
WATCHDOG: test case send_signal/send_signal_nmi_thread_remote executes for 5 seconds, terminating with SIGSEGV
Caught signal #11!
Stack trace:
./test_progs(crash_handler+0x1f)[0x9049ef]
/lib64/libc.so.6(+0x40d00)[0x7f1f1184fd00]
/lib64/libc.so.6(read+0x4a)[0x7f1f1191cc4a]
./test_progs[0x720dd3]
./test_progs[0x71ef7a]
./test_progs(test_send_signal+0x1db)[0x71edeb]
./test_progs[0x9066c5]
./test_progs(main+0x5ed)[0x9054ad]
/lib64/libc.so.6(+0x2a088)[0x7f1f11839088]
/lib64/libc.so.6(__libc_start_main+0x8b)[0x7f1f1183914b]
./test_progs(_start+0x25)[0x527385]
#292 send_signal:FAIL
test_send_signal_common:PASS:reading pipe 0 nsec
test_send_signal_common:PASS:reading pipe error: size 0 0 nsec
test_send_signal_common:PASS:incorrect result 0 nsec
test_send_signal_common:PASS:pipe_write 0 nsec
test_send_signal_common:PASS:setpriority 0 nsec
The timer is implemented using the timer_{create,settime} librt API.
Internally librt uses pthreads for SIGEV_THREAD timers,
so this change adds a background timer thread to the test process.
Because of this a few checks in tests 'bpf_iter' and 'iters'
need an update to account for an extra thread.
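A hedged sketch of such a SIGEV_THREAD watchdog (names and structure are
illustrative, not the actual test_progs code; link with -lrt):
  #include <signal.h>
  #include <stdio.h>
  #include <time.h>

  static timer_t watchdog;

  static void watchdog_fire(union sigval sv)
  {
          /* runs on the librt-created background thread */
          fprintf(stderr, "WATCHDOG: sub-test is taking too long\n");
  }

  static int watchdog_init(void)
  {
          struct sigevent sev = {
                  .sigev_notify = SIGEV_THREAD,
                  .sigev_notify_function = watchdog_fire,
          };

          return timer_create(CLOCK_MONOTONIC, &sev, &watchdog);
  }

  /* re-arm at sub-test start; a zero timeout disarms the timer */
  static void watchdog_arm(long timeout_sec)
  {
          struct itimerspec its = { .it_value = { .tv_sec = timeout_sec } };

          timer_settime(watchdog, 0, &its, NULL);
  }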
For the parallelized scenario the watchdog is also created for each
worker fork. If one of the workers gets stuck, it would be terminated
by a watchdog. In theory, this might lead to a scenario where all
workers are exhausted; however, this should not be a problem for
server_main(), as it would exit with some of the tests not run.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20241112110906.3045278-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Linus Torvalds [Tue, 12 Nov 2024 21:35:13 +0000 (13:35 -0800)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"x86 and selftests fixes.
x86:
- When emulating a guest TLB flush for a nested guest, flush vpid01,
not vpid02, if L2 is active but VPID is disabled in vmcs12, i.e. if
L2 and L1 are sharing VPID '0' (from L1's perspective).
- Fix a bug in the SNP initialization flow where KVM would return '0'
to userspace instead of -errno on failure.
- Move the Intel PT virtualization (i.e. outputting host trace to
host buffer and guest trace to guest buffer) behind CONFIG_BROKEN.
- Fix memory leak on failure of KVM_SEV_SNP_LAUNCH_START
- Fix a bug where KVM fails to inject an interrupt from the IRR after
KVM_SET_LAPIC.
Selftests:
- Increase the timeout for the memslot performance selftest to avoid
false failures on arm64 and nested x86 platforms.
- Fix a goof in the guest_memfd selftest where a for-loop initialized
a bit mask to zero instead of BIT(0).
- Disable strict aliasing when building KVM selftests to prevent the
compiler from treating things like "u64 *" to "uint64_t *" cases as
undefined behavior, which can lead to nasty, hard to debug
failures.
- Force -march=x86-64-v2 for KVM x86 selftests if and only if the
uarch is supported by the compiler.
- Fix broken compilation of kvm selftests after a header sync in
tools/"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: VMX: Bury Intel PT virtualization (guest/host mode) behind CONFIG_BROKEN
KVM: x86: Unconditionally set irr_pending when updating APICv state
kvm: svm: Fix gctx page leak on invalid inputs
KVM: selftests: use X86_MEMTYPE_WB instead of VMX_BASIC_MEM_TYPE_WB
KVM: SVM: Propagate error from snp_guest_req_init() to userspace
KVM: nVMX: Treat vpid01 as current if L2 is active, but with VPID disabled
KVM: selftests: Don't force -march=x86-64-v2 if it's unsupported
KVM: selftests: Disable strict aliasing
KVM: selftests: fix unintentional noop test in guest_memfd_test.c
KVM: selftests: memslot_perf_test: increase guest sync timeout
Linus Torvalds [Tue, 12 Nov 2024 21:21:07 +0000 (13:21 -0800)]
Merge tag 'for-6.12/dm-fixes-3' of git://git./linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mikulas Patocka:
- fix warnings about duplicate slab cache names
* tag 'for-6.12/dm-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm-cache: fix warnings about duplicate slab caches
dm-bufio: fix warnings about duplicate slab caches
Linus Torvalds [Tue, 12 Nov 2024 21:06:31 +0000 (13:06 -0800)]
Merge tag 'integrity-v6.12' of git://git./linux/kernel/git/zohar/linux-integrity
Pull integrity fixes from Mimi Zohar:
"One bug fix, one performance improvement, and the use of
static_assert:
- The bug fix addresses "only a cosmetic change" commit, which didn't
take into account the original 'ima' template definition.
- The performance improvement limits the atomic_read()"
* tag 'integrity-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
integrity: Use static_assert() to check struct sizes
evm: stop avoidably reading i_writecount in evm_file_release
ima: fix buffer overrun in ima_eventdigest_init_common
Linus Torvalds [Tue, 12 Nov 2024 21:01:09 +0000 (13:01 -0800)]
Merge tag 'landlock-6.12-rc7' of git://git./linux/kernel/git/mic/linux
Pull landlock fixes from Mickaël Salaün:
"This fixes issues in the Landlock's sandboxer sample and
documentation, slightly refactors helpers (required for ongoing patch
series), and improve/fix a feature merged in v6.12 (signal and
abstract UNIX socket scoping)"
* tag 'landlock-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux:
landlock: Optimize scope enforcement
landlock: Refactor network access mask management
landlock: Refactor filesystem access mask management
samples/landlock: Clarify option parsing behaviour
samples/landlock: Refactor help message
samples/landlock: Fix port parsing in sandboxer
landlock: Fix grammar issues in documentation
landlock: Improve documentation of previous limitations
Donet Tom [Sun, 10 Nov 2024 06:49:03 +0000 (00:49 -0600)]
selftests: hugetlb_dio: fixup check for initial conditions to skip in the start
This test verifies that a hugepage, used as a user buffer for DIO
operations, is correctly freed upon unmapping. To test this, we read the
count of free hugepages before and after the mmap, DIO, and munmap
operations, then check if the free hugepage count is the same.
Reading free hugepages before the test was removed by commit
0268d4579901 ('selftests: hugetlb_dio: check for initial conditions to
skip at the start'), causing the test to always fail.
This patch adds back reading the free hugepages before starting the test.
With this patch, the tests are now passing.
Test results without this patch:
./tools/testing/selftests/mm/hugetlb_dio
TAP version 13
1..4
# No. Free pages before allocation : 0
# No. Free pages after munmap : 100
not ok 1 : Huge pages not freed!
# No. Free pages before allocation : 0
# No. Free pages after munmap : 100
not ok 2 : Huge pages not freed!
# No. Free pages before allocation : 0
# No. Free pages after munmap : 100
not ok 3 : Huge pages not freed!
# No. Free pages before allocation : 0
# No. Free pages after munmap : 100
not ok 4 : Huge pages not freed!
# Totals: pass:0 fail:4 xfail:0 xpass:0 skip:0 error:0
Test results with this patch:
/tools/testing/selftests/mm/hugetlb_dio
TAP version 13
1..4
# No. Free pages before allocation : 100
# No. Free pages after munmap : 100
ok 1 : Huge pages freed successfully !
# No. Free pages before allocation : 100
# No. Free pages after munmap : 100
ok 2 : Huge pages freed successfully !
# No. Free pages before allocation : 100
# No. Free pages after munmap : 100
ok 3 : Huge pages freed successfully !
# No. Free pages before allocation : 100
# No. Free pages after munmap : 100
ok 4 : Huge pages freed successfully !
# Totals: pass:4 fail:0 xfail:0 xpass:0 skip:0 error:0
Link: https://lkml.kernel.org/r/20241110064903.23626-1-donettom@linux.ibm.com
Fixes: 0268d4579901 ("selftests: hugetlb_dio: check for initial conditions to skip in the start")
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hugh Dickins [Sun, 10 Nov 2024 21:11:21 +0000 (13:11 -0800)]
mm/thp: fix deferred split queue not partially_mapped: fix
Though even more elusive than before, list_del corruption has still been
seen on THP's deferred split queue.
The idea in commit e66f3185fa04 was right, but its implementation wrong.
The context omitted an important comment just before the critical test:
"split_folio() removes folio from list on success." In ignoring that
comment, when a THP split succeeded, the code went on to release the
preceding safe folio, preserving instead an irrelevant (formerly head)
folio: which gives no safety because it's not on the list. Fix the logic.
Link: https://lkml.kernel.org/r/3c995a30-31ce-0998-1b9f-3a2cb9354c91@google.com
Fixes: e66f3185fa04 ("mm/thp: fix deferred split queue not partially_mapped")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Usama Arif <usamaarif642@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
John Hubbard [Tue, 5 Nov 2024 03:29:44 +0000 (19:29 -0800)]
mm/gup: avoid an unnecessary allocation call for FOLL_LONGTERM cases
commit 53ba78de064b ("mm/gup: introduce check_and_migrate_movable_folios()")
created a new constraint on the pin_user_pages*() API family: a potentially
large internal allocation must now occur, for FOLL_LONGTERM cases.
A user-visible consequence has now appeared: user space can no longer pin
more than 2GB of memory on x86_64. That's because, on a 4KB
PAGE_SIZE system, when user space tries to (indirectly, via a device
driver that calls pin_user_pages()) pin 2GB, this requires an allocation
of a folio pointers array of MAX_PAGE_ORDER size, which is the limit for
kmalloc().
In addition to the directly visible effect described above, there is also
the problem of adding an unnecessary allocation. The **pages array
argument has already been allocated, and there is no need for a redundant
**folios array allocation in this case.
Fix this by avoiding the new allocation entirely. This is done by
referring to either the original page[i] within **pages, or to the
associated folio. Thanks to David Hildenbrand for suggesting this
approach and for providing the initial implementation (which I've tested
and adjusted slightly) as well.
[jhubbard@nvidia.com: whitespace tweak, per David]
Link: https://lkml.kernel.org/r/131cf9c8-ebc0-4cbb-b722-22fa8527bf3c@nvidia.com
[jhubbard@nvidia.com: bypass pofs_get_folio(), per Oscar]
Link: https://lkml.kernel.org/r/c1587c7f-9155-45be-bd62-1e36c0dd6923@nvidia.com
Link: https://lkml.kernel.org/r/20241105032944.141488-2-jhubbard@nvidia.com
Fixes: 53ba78de064b ("mm/gup: introduce check_and_migrate_movable_folios()")
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Vivek Kasireddy <vivek.kasireddy@intel.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dongwon Kim <dongwon.kim@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Junxiao Chang <junxiao.chang@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bibo Mao [Tue, 12 Nov 2024 08:35:39 +0000 (16:35 +0800)]
LoongArch: Fix AP booting issue in VM mode
Native IPI is used for AP booting, because it is the booting interface
between OS and BIOS firmware. The paravirt IPI is only used inside OS,
and native IPI is necessary to boot AP.
When booting AP, we write the kernel entry address in the HW mailbox of
AP and send IPI interrupt to it. AP executes idle instruction and waits
for interrupts or SW events, then clears IPI interrupt and jumps to the
kernel entry from HW mailbox.
Between writing HW mailbox and sending IPI, AP can be woken up by SW
events and jumps to the kernel entry, so ACTION_BOOT_CPU IPI interrupt
will keep pending during AP booting. And a native IPI interrupt handler
needs to be registered so that it can clear the pending native IPI, else
there will be endless interrupts during the AP booting stage.
Here native IPI interrupt is initialized even if paravirt IPI is used.
Cc: stable@vger.kernel.org
Fixes: 74c16b2e2b0c ("LoongArch: KVM: Add PV IPI support on guest side")
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Kanglong Wang [Tue, 12 Nov 2024 08:35:39 +0000 (16:35 +0800)]
LoongArch: Add WriteCombine shadow mapping in KASAN
Currently, the kernel couldn't boot when ARCH_IOREMAP, ARCH_WRITECOMBINE
and KASAN are enabled together, because DMW2 is now used by the kernel
(configured as 0xa000000000000000 for WriteCombine) but KASAN has no
segment mapping for it. Fix this issue by adding the relevant
definitions for WriteCombine (DMW2) in KASAN.
Cc: stable@vger.kernel.org
Fixes: 8e02c3b782ec ("LoongArch: Add writecombine support for DMW-based ioremap()")
Signed-off-by: Kanglong Wang <wangkanglong@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Huacai Chen [Tue, 12 Nov 2024 08:35:39 +0000 (16:35 +0800)]
LoongArch: Disable KASAN if PGDIR_SIZE is too large for cpu_vabits
If PGDIR_SIZE is too large for cpu_vabits, KASAN_SHADOW_END will
overflow UINTPTR_MAX because KASAN_SHADOW_START/KASAN_SHADOW_END are
aligned up by PGDIR_SIZE. And then the overflowed KASAN_SHADOW_END looks
like a user space address.
For example, PGDIR_SIZE of CONFIG_4KB_4LEVEL is 2^39, which is too large
for Loongson-2K series whose cpu_vabits = 39.
Since CONFIG_4KB_4LEVEL is completely legal for CPUs with cpu_vabits <=
39, we just disable KASAN via early return in kasan_init(). Otherwise we
get a boot failure.
Moreover, we change KASAN_SHADOW_END from the first address after KASAN
shadow area to the last address in KASAN shadow area, in order to avoid
the end address exactly overflow to 0 (which is a legal case). We don't
need to worry about alignment because pgd_addr_end() can handle it.
Cc: stable@vger.kernel.org
Reviewed-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Huacai Chen [Tue, 12 Nov 2024 08:35:39 +0000 (16:35 +0800)]
LoongArch: Make KASAN work with 5-level page-tables
Make KASAN work with 5-level page-tables, including:
1. Implement and use __pgd_none() and kasan_p4d_offset().
2. As done in kasan_pmd_populate() and kasan_pte_populate(), restrict
the loop conditions of kasan_p4d_populate() and kasan_pud_populate()
to avoid unnecessary population.
Cc: stable@vger.kernel.org
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Yuli Wang [Tue, 12 Nov 2024 08:35:39 +0000 (16:35 +0800)]
LoongArch: Define a default value for VM_DATA_DEFAULT_FLAGS
This is a trivial cleanup; commit c62da0c35d58518d ("mm/vma: define a
default value for VM_DATA_DEFAULT_FLAGS") has unified default values of
VM_DATA_DEFAULT_FLAGS across different platforms.
Apply the same consistency to LoongArch.
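For reference, a sketch of the generic default that the cited commit
introduced in <linux/mm.h> (quoted from memory, so treat the exact flag
spelling as an assumption); with it in place, an architecture only
defines the macro when it needs a different policy:
	#ifndef VM_DATA_DEFAULT_FLAGS		/* arch can override this */
	#define VM_DATA_DEFAULT_FLAGS	VM_DATA_FLAGS_EXEC
	#endif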
Suggested-by: Wentao Guan <guanwentao@uniontech.com>
Signed-off-by: Yuli Wang <wangyuli@uniontech.com>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Huacai Chen [Tue, 12 Nov 2024 08:35:36 +0000 (16:35 +0800)]
LoongArch: Fix early_numa_add_cpu() usage for FDT systems
early_numa_add_cpu() operates on the physical CPU id rather than the
logical CPU id, so use cpuid instead of cpu.
Cc: stable@vger.kernel.org
Fixes: 3de9c42d02a79a5 ("LoongArch: Add all CPUs enabled by fdt to NUMA node 0")
Reported-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Huacai Chen [Tue, 12 Nov 2024 08:35:36 +0000 (16:35 +0800)]
LoongArch: For all possible CPUs setup logical-physical CPU mapping
In order to support ACPI-based physical CPU hotplug, cpu_logical_map()
is supposed to work for all "possible" CPUs, because some drivers want
to use it for all "possible" CPUs, while currently we only set up the
logical-physical mapping for "present" CPUs. This lack of mapping also
prevents cpu_to_node() from working for hot-added CPUs.
All "possible" CPUs are listed in MADT, and the "present" subset is
marked as ACPI_MADT_ENABLED. To setup logical-physical CPU mapping for
all possible CPUs and keep present CPUs continuous in cpu_present_mask,
we parse MADT twice. The first pass handles CPUs with ACPI_MADT_ENABLED
and the second pass handles CPUs without ACPI_MADT_ENABLED.
The global flag (cpu_enumerated) is removed because acpi_map_cpu() calls
cpu_number_map() rather than set_processor_mask() now.
Reported-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Andrii Nakryiko [Tue, 12 Nov 2024 04:14:57 +0000 (20:14 -0800)]
Merge branch 'libbpf-stringify-error-codes-in-log-messages'
Mykyta Yatsenko says:
====================
libbpf: stringify error codes in log messages
From: Mykyta Yatsenko <yatsenko@meta.com>
Libbpf may report an error in two ways:
1. A numeric errno
2. Errno's text representation, returned by strerror()
Both can be confusing for users: the numeric code requires people to
know how to look up its meaning, and strerror() may be too generic and
unclear.
These patches modify libbpf error reporting by swapping numeric codes
and strerror with the standard short error name, for example:
"failed to attach: -22" becomes "failed to attach: -EINVAL".
====================
Link: https://patch.msgid.link/20241111212919.368971-1-mykyta.yatsenko5@gmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Mykyta Yatsenko [Mon, 11 Nov 2024 21:29:19 +0000 (21:29 +0000)]
libbpf: Stringify errno in log messages in the remaining code
Convert numeric error codes into the string representations in log
messages in the rest of libbpf source files.
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241111212919.368971-5-mykyta.yatsenko5@gmail.com
Mykyta Yatsenko [Mon, 11 Nov 2024 21:29:18 +0000 (21:29 +0000)]
libbpf: Stringify errno in log messages in btf*.c
Convert numeric error codes into the string representations in log
messages in btf.c and btf_dump.c.
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241111212919.368971-4-mykyta.yatsenko5@gmail.com
Mykyta Yatsenko [Mon, 11 Nov 2024 21:29:17 +0000 (21:29 +0000)]
libbpf: Stringify errno in log messages in libbpf.c
Convert numeric error codes into the string representations in log
messages in libbpf.c.
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241111212919.368971-3-mykyta.yatsenko5@gmail.com
Mykyta Yatsenko [Mon, 11 Nov 2024 21:29:16 +0000 (21:29 +0000)]
libbpf: Introduce errstr() for stringifying errno
Add function errstr(int err) that allows converting numeric error codes
into string representations.
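A minimal, self-contained sketch of the idea; libbpf's actual errstr()
covers far more codes and is built differently, so the names here are
illustrative only:
	#include <errno.h>
	#include <stdio.h>

	static const char *errstr(int err)
	{
		if (err > 0)
			err = -err;	/* normalize to negative errno */
		switch (err) {
		case -EINVAL: return "-EINVAL";
		case -ENOENT: return "-ENOENT";
		case -EPERM:  return "-EPERM";
		default:      return "(unknown errno)";
		}
	}

	int main(void)
	{
		printf("failed to attach: %s\n", errstr(-EINVAL));
		return 0;
	}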
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241111212919.368971-2-mykyta.yatsenko5@gmail.com
Menglong Dong [Mon, 11 Nov 2024 12:49:11 +0000 (20:49 +0800)]
bpf: Replace the document for PTR_TO_BTF_ID_OR_NULL
Commit c25b2ae13603 ("bpf: Replace PTR_TO_XXX_OR_NULL with PTR_TO_XXX |
PTR_MAYBE_NULL") moved the fields around and misplaced the
documentation for "PTR_TO_BTF_ID_OR_NULL". So, let's move it back to
the proper place.
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241111124911.1436911-1-dongml2@chinatelecom.cn
Luo Yifan [Mon, 11 Nov 2024 02:10:04 +0000 (10:10 +0800)]
tools/bpf: Fix the wrong format specifier in bpf_jit_disasm
There is a static checker warning that the %d in format string is
mismatched with the corresponding argument type, which could result in
incorrect printed data. This patch fixes it.
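An illustration of the general bug class (not the actual bpf_jit_disasm
code): a %d specifier paired with a wider unsigned argument is undefined
behavior, and the fix is to make the specifier match the type:
	#include <stdio.h>

	int main(void)
	{
		unsigned long image_len = 4096;

		/* printf("len %d\n", image_len);   wrong: %d expects int */
		printf("len %lu\n", image_len);  /* matching specifier */
		return 0;
	}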
Signed-off-by: Luo Yifan <luoyifan@cmss.chinamobile.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241111021004.272293-1-luoyifan@cmss.chinamobile.com
Florian Schmaus [Sat, 2 Nov 2024 10:04:51 +0000 (11:04 +0100)]
kbuild,bpf: Pass make jobs' value to pahole
Pass the value of make's -j/--jobs argument to pahole, to avoid out of
memory errors and make pahole respect the "jobs" value of make.
On systems with little memory but many cores, invoking pahole using -j
without argument potentially creates too many pahole instances,
causing an out-of-memory situation. Instead, we should pass make's
"jobs" value as an argument to pahole's -j, which is likely configured
to be (much) lower than the actual core count on such systems.
If make was invoked without -j, either via cmdline or MAKEFLAGS, then
JOBS will be simply empty, resulting in the existing behavior, as
expected.
Signed-off-by: Florian Schmaus <flo@geekplace.eu>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Link: https://lore.kernel.org/bpf/20241102100452.793970-1-flo@geekplace.eu
Hajime Tazaki [Fri, 8 Nov 2024 22:28:34 +0000 (07:28 +0900)]
nommu: pass NULL argument to vma_iter_prealloc()
When deleting a vma entry from a maple tree, NULL has to be passed to
vma_iter_prealloc() in order to calculate the internal state of the
tree, but the wrong argument was passed. As a result, nommu kernels
crashed upon accessing a vma iterator, such as acct_collect() reading
the size of vma entries after do_munmap().
This commit fixes the issue by passing the right argument to the
preallocation call.
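A hedged sketch of the one-argument fix (surrounding code simplified;
the exact call site may differ): for a deletion, the maple tree must
preallocate for a NULL entry, not for the vma being removed.
	vma_iter_config(&vmi, vma->vm_start, vma->vm_end);
	if (vma_iter_prealloc(&vmi, NULL))	/* was: vma_iter_prealloc(&vmi, vma) */
		return -ENOMEM;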
Link: https://lkml.kernel.org/r/20241108222834.3625217-1-thehajime@gmail.com
Fixes: b5df09226450 ("mm: set up vma iterator for vma_iter_prealloc() calls")
Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dmitry Antipov [Wed, 6 Nov 2024 09:21:00 +0000 (12:21 +0300)]
ocfs2: fix UBSAN warning in ocfs2_verify_volume()
Syzbot has reported the following splat triggered by UBSAN:
UBSAN: shift-out-of-bounds in fs/ocfs2/super.c:2336:10
shift exponent 32768 is too large for 32-bit type 'int'
CPU: 2 UID: 0 PID: 5255 Comm: repro Not tainted
6.12.0-rc4-syzkaller-00047-gc2ee9f594da8 #0
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x241/0x360
? __pfx_dump_stack_lvl+0x10/0x10
? __pfx__printk+0x10/0x10
? __asan_memset+0x23/0x50
? lockdep_init_map_type+0xa1/0x910
__ubsan_handle_shift_out_of_bounds+0x3c8/0x420
ocfs2_fill_super+0xf9c/0x5750
? __pfx_ocfs2_fill_super+0x10/0x10
? __pfx_validate_chain+0x10/0x10
? __pfx_validate_chain+0x10/0x10
? validate_chain+0x11e/0x5920
? __lock_acquire+0x1384/0x2050
? __pfx_validate_chain+0x10/0x10
? string+0x26a/0x2b0
? widen_string+0x3a/0x310
? string+0x26a/0x2b0
? bdev_name+0x2b1/0x3c0
? pointer+0x703/0x1210
? __pfx_pointer+0x10/0x10
? __pfx_format_decode+0x10/0x10
? __lock_acquire+0x1384/0x2050
? vsnprintf+0x1ccd/0x1da0
? snprintf+0xda/0x120
? __pfx_lock_release+0x10/0x10
? do_raw_spin_lock+0x14f/0x370
? __pfx_snprintf+0x10/0x10
? set_blocksize+0x1f9/0x360
? sb_set_blocksize+0x98/0xf0
? setup_bdev_super+0x4e6/0x5d0
mount_bdev+0x20c/0x2d0
? __pfx_ocfs2_fill_super+0x10/0x10
? __pfx_mount_bdev+0x10/0x10
? vfs_parse_fs_string+0x190/0x230
? __pfx_vfs_parse_fs_string+0x10/0x10
legacy_get_tree+0xf0/0x190
? __pfx_ocfs2_mount+0x10/0x10
vfs_get_tree+0x92/0x2b0
do_new_mount+0x2be/0xb40
? __pfx_do_new_mount+0x10/0x10
__se_sys_mount+0x2d6/0x3c0
? __pfx___se_sys_mount+0x10/0x10
? do_syscall_64+0x100/0x230
? __x64_sys_mount+0x20/0xc0
do_syscall_64+0xf3/0x230
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f37cae96fda
Code: 48 8b 0d 51 ce 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1e ce 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007fff6c1aa228 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00007fff6c1aa240 RCX: 00007f37cae96fda
RDX: 00000000200002c0 RSI: 0000000020000040 RDI: 00007fff6c1aa240
RBP: 0000000000000004 R08: 00007fff6c1aa280 R09: 0000000000000000
R10: 00000000000008c0 R11: 0000000000000206 R12: 00000000000008c0
R13: 00007fff6c1aa280 R14: 0000000000000003 R15: 0000000001000000
</TASK>
For a really damaged superblock, the value of 'i_super.s_blocksize_bits'
may exceed the maximum possible shift for an underlying 'int'. So add an
extra check whether the aforementioned field represents the valid block
size, which is 512 bytes, 1K, 2K, or 4K.
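A hedged sketch of the added validation (locals simplified; the real
check lives in ocfs2_verify_volume()): refuse any s_blocksize_bits
outside the 512-byte..4K range before using it as a shift count.
	u32 blksz_bits = le32_to_cpu(di->id2.i_super.s_blocksize_bits);

	if (blksz_bits < 9 || blksz_bits > 12)	/* 1 << 9 = 512 ... 1 << 12 = 4K */
		return -EINVAL;
	blksz = 1U << blksz_bits;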
Link: https://lkml.kernel.org/r/20241106092100.2661330-1-dmantipov@yandex.ru
Fixes: ccd979bdbce9 ("[PATCH] OCFS2: The Second Oracle Cluster Filesystem")
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Reported-by: syzbot+56f7cd1abe4b8e475180@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=56f7cd1abe4b8e475180
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryusuke Konishi [Wed, 6 Nov 2024 16:07:33 +0000 (01:07 +0900)]
nilfs2: fix null-ptr-deref in block_dirty_buffer tracepoint
When using the "block:block_dirty_buffer" tracepoint, mark_buffer_dirty()
may cause a NULL pointer dereference, or a general protection fault when
KASAN is enabled.
This happens because, since the tracepoint was added in
mark_buffer_dirty(), it references the dev_t member bh->b_bdev->bd_dev
regardless of whether the buffer head has a pointer to a block_device
structure.
In the current implementation, nilfs_grab_buffer(), which grabs a
buffer to read (or create) a block of metadata including b-tree node
blocks, does not set the block device itself; each of its caller block
reading functions does so, but only if the buffer is not in the
"uptodate" state. However, if the uptodate flag is set on a folio/page,
its buffer heads are detached by try_to_free_buffers(), and new buffer
heads are then attached by create_empty_buffers(), the uptodate flag
may be restored to each buffer without the block device being set in
bh->b_bdev. If mark_buffer_dirty() is called later in that state, it
results in the bug mentioned above.
Fix this issue by making nilfs_grab_buffer() always set the block device
of the super block structure to the buffer head, regardless of the state
of the buffer's uptodate flag.
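A hedged sketch of the fix inside nilfs_grab_buffer() (surrounding
lookup and creation code omitted):
	/* Always attach the super block's device, not only on the
	 * !buffer_uptodate() path, so bh->b_bdev is never NULL when the
	 * block tracepoints dereference it. */
	bh->b_bdev = inode->i_sb->s_bdev;
	return bh;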
Link: https://lkml.kernel.org/r/20241106160811.3316-3-konishi.ryusuke@gmail.com
Fixes: 5305cb830834 ("block: add block_{touch|dirty}_buffer tracepoint")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ubisectech Sirius <bugreport@valiantsec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryusuke Konishi [Wed, 6 Nov 2024 16:07:32 +0000 (01:07 +0900)]
nilfs2: fix null-ptr-deref in block_touch_buffer tracepoint
Patch series "nilfs2: fix null-ptr-deref bugs on block tracepoints".
This series fixes null pointer dereference bugs that occur when using
nilfs2 and two block-related tracepoints.
This patch (of 2):
It has been reported that when using "block:block_touch_buffer"
tracepoint, touch_buffer() called from __nilfs_get_folio_block() causes a
NULL pointer dereference, or a general protection fault when KASAN is
enabled.
This happens because, since the tracepoint was added in touch_buffer(),
it references the dev_t member bh->b_bdev->bd_dev regardless of whether
the buffer head has a pointer to a block_device structure. In the current
implementation, the block_device structure is set after the function
returns to the caller.
Here, touch_buffer() is used to mark the folio/page that owns the buffer
head as accessed, but the common search helper for folio/page used by the
caller function was optimized to mark the folio/page as accessed when it
was reimplemented a long time ago, eliminating the need to call
touch_buffer() here in the first place.
So this solves the issue by eliminating the touch_buffer() call itself.
Link: https://lkml.kernel.org/r/20241106160811.3316-1-konishi.ryusuke@gmail.com
Link: https://lkml.kernel.org/r/20241106160811.3316-2-konishi.ryusuke@gmail.com
Fixes: 5305cb830834 ("block: add block_{touch|dirty}_buffer tracepoint")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: Ubisectech Sirius <bugreport@valiantsec.com>
Closes: https://lkml.kernel.org/r/86bd3013-887e-4e38-960f-ca45c657f032.bugreport@valiantsec.com
Reported-by: syzbot+9982fb8d18eba905abe2@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=9982fb8d18eba905abe2
Tested-by: syzbot+9982fb8d18eba905abe2@syzkaller.appspotmail.com
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Roman Gushchin [Wed, 6 Nov 2024 19:53:54 +0000 (19:53 +0000)]
mm: page_alloc: move mlocked flag clearance into free_pages_prepare()
Syzbot reported a bad page state problem caused by a page being freed
using free_page() still having a mlocked flag at free_pages_prepare()
stage:
BUG: Bad page state in process syz.5.504 pfn:61f45
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x61f45
flags: 0xfff00000080204(referenced|workingset|mlocked|node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000080204 0000000000000000 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), pid 8443, tgid 8442 (syz.5.504), ts 201884660643, free_ts 201499827394
set_page_owner include/linux/page_owner.h:32 [inline]
post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1537
prep_new_page mm/page_alloc.c:1545 [inline]
get_page_from_freelist+0x303f/0x3190 mm/page_alloc.c:3457
__alloc_pages_noprof+0x292/0x710 mm/page_alloc.c:4733
alloc_pages_mpol_noprof+0x3e8/0x680 mm/mempolicy.c:2265
kvm_coalesced_mmio_init+0x1f/0xf0 virt/kvm/coalesced_mmio.c:99
kvm_create_vm virt/kvm/kvm_main.c:1235 [inline]
kvm_dev_ioctl_create_vm virt/kvm/kvm_main.c:5488 [inline]
kvm_dev_ioctl+0x12dc/0x2240 virt/kvm/kvm_main.c:5530
__do_compat_sys_ioctl fs/ioctl.c:1007 [inline]
__se_compat_sys_ioctl+0x510/0xc90 fs/ioctl.c:950
do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
__do_fast_syscall_32+0xb4/0x110 arch/x86/entry/common.c:386
do_fast_syscall_32+0x34/0x80 arch/x86/entry/common.c:411
entry_SYSENTER_compat_after_hwframe+0x84/0x8e
page last free pid 8399 tgid 8399 stack trace:
reset_page_owner include/linux/page_owner.h:25 [inline]
free_pages_prepare mm/page_alloc.c:1108 [inline]
free_unref_folios+0xf12/0x18d0 mm/page_alloc.c:2686
folios_put_refs+0x76c/0x860 mm/swap.c:1007
free_pages_and_swap_cache+0x5c8/0x690 mm/swap_state.c:335
__tlb_batch_free_encoded_pages mm/mmu_gather.c:136 [inline]
tlb_batch_pages_flush mm/mmu_gather.c:149 [inline]
tlb_flush_mmu_free mm/mmu_gather.c:366 [inline]
tlb_flush_mmu+0x3a3/0x680 mm/mmu_gather.c:373
tlb_finish_mmu+0xd4/0x200 mm/mmu_gather.c:465
exit_mmap+0x496/0xc40 mm/mmap.c:1926
__mmput+0x115/0x390 kernel/fork.c:1348
exit_mm+0x220/0x310 kernel/exit.c:571
do_exit+0x9b2/0x28e0 kernel/exit.c:926
do_group_exit+0x207/0x2c0 kernel/exit.c:1088
__do_sys_exit_group kernel/exit.c:1099 [inline]
__se_sys_exit_group kernel/exit.c:1097 [inline]
__x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1097
x64_sys_call+0x2634/0x2640 arch/x86/include/generated/asm/syscalls_64.h:232
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Modules linked in:
CPU: 0 UID: 0 PID: 8442 Comm: syz.5.504 Not tainted 6.12.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
bad_page+0x176/0x1d0 mm/page_alloc.c:501
free_page_is_bad mm/page_alloc.c:918 [inline]
free_pages_prepare mm/page_alloc.c:1100 [inline]
free_unref_page+0xed0/0xf20 mm/page_alloc.c:2638
kvm_destroy_vm virt/kvm/kvm_main.c:1327 [inline]
kvm_put_kvm+0xc75/0x1350 virt/kvm/kvm_main.c:1386
kvm_vcpu_release+0x54/0x60 virt/kvm/kvm_main.c:4143
__fput+0x23f/0x880 fs/file_table.c:431
task_work_run+0x24f/0x310 kernel/task_work.c:239
exit_task_work include/linux/task_work.h:43 [inline]
do_exit+0xa2f/0x28e0 kernel/exit.c:939
do_group_exit+0x207/0x2c0 kernel/exit.c:1088
__do_sys_exit_group kernel/exit.c:1099 [inline]
__se_sys_exit_group kernel/exit.c:1097 [inline]
__ia32_sys_exit_group+0x3f/0x40 kernel/exit.c:1097
ia32_sys_call+0x2624/0x2630 arch/x86/include/generated/asm/syscalls_32.h:253
do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
__do_fast_syscall_32+0xb4/0x110 arch/x86/entry/common.c:386
do_fast_syscall_32+0x34/0x80 arch/x86/entry/common.c:411
entry_SYSENTER_compat_after_hwframe+0x84/0x8e
RIP: 0023:0xf745d579
Code: Unable to access opcode bytes at 0xf745d54f.
RSP: 002b:00000000f75afd6c EFLAGS: 00000206 ORIG_RAX: 00000000000000fc
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 00000000ffffff9c RDI: 00000000f744cff4
RBP: 00000000f717ae61 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
</TASK>
The problem was originally introduced by commit b109b87050df
("mm/munlock: replace clear_page_mlock() by final clearance"): it was
focused on handling pagecache and anonymous memory and wasn't suitable
for lower level get_page()/free_page() APIs used, for example, by KVM,
as with this reproducer.
Fix it by moving the mlocked flag clearance down to free_pages_prepare().
The bug itself is fairly old and harmless (aside from generating these
warnings and a small memory leak - "bad" pages are stopped from being
allocated again).
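A hedged sketch of the relocated clearance (statistics handling
simplified; treat names beyond those in the log as assumptions):
free_pages_prepare() is common to all freeing paths, so a stray mlocked
flag is cleared before the bad-page checks run.
	static bool free_pages_prepare(struct page *page, unsigned int order)
	{
		struct folio *folio = page_folio(page);

		/* ... */
		if (unlikely(folio_test_mlocked(folio))) {
			long nr_pages = folio_nr_pages(folio);

			__folio_clear_mlocked(folio);
			zone_stat_mod_folio(folio, NR_MLOCK, -nr_pages);
			count_vm_events(UNEVICTABLE_PGCLEARED, nr_pages);
		}
		/* ... existing PAGE_FLAGS_CHECK_AT_FREE handling follows ... */
		return true;
	}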
Link: https://lkml.kernel.org/r/20241106195354.270757-1-roman.gushchin@linux.dev
Fixes: b109b87050df ("mm/munlock: replace clear_page_mlock() by final clearance")
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Reported-by: syzbot+e985d3026c4fd041578e@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6729f475.050a0220.701a.0019.GAE@google.com
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Linus Torvalds [Mon, 11 Nov 2024 22:09:57 +0000 (14:09 -0800)]
Merge tag 'sched_ext-for-6.12-rc7-fixes' of git://git./linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- The fair sched class currently has a bug where its balance() returns
true, telling the sched core that it has tasks to run, but its
pick_task() then returns NULL. This makes the sched core call
sched_ext's pick_task() without a preceding balance(), which can lead
to stalls in partial mode. For now, work around it by detecting the
condition and forcing the CPU to go through another scheduling cycle.
- Add a missing newline to an error message and fix drgn introspection
tool which went out of sync.
* tag 'sched_ext-for-6.12-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()
sched_ext: Update scx_show_state.py to match scx_ops_bypass_depth's new type
sched_ext: Add a missing newline at the end of an error message
Linus Torvalds [Mon, 11 Nov 2024 17:06:17 +0000 (09:06 -0800)]
Merge tag 'for_linus' of git://git./linux/kernel/git/mst/vhost
Pull virtio fixes from Michael Tsirkin:
"Several small bugfixes all over the place"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vdpa/mlx5: Fix error path during device add
vp_vdpa: fix id_table array not null terminated error
virtio_pci: Fix admin vq cleanup by using correct info pointer
vDPA/ifcvf: Fix pci_read_config_byte() return code handling
Fix typo in vringh_test.c
vdpa: solidrun: Fix UB bug with devres
vsock/virtio: Initialization of the dangling pointer occurring in vsk->trans
Alexei Starovoitov [Sat, 9 Nov 2024 23:39:03 +0000 (15:39 -0800)]
Merge branch 'refactor-lock-management'
Kumar Kartikeya Dwivedi says:
====================
Refactor lock management
This set refactors lock management in the verifier in preparation for
spin locks that can be acquired multiple times. In addition, the
special-case reference leak logic for callbacks, which is no longer
necessary, is dropped. See patches for details.
Changelog:
----------
v5 -> v6
v5: https://lore.kernel.org/bpf/20241109225243.2306756-1-memxor@gmail.com
* Move active_locks mutation to {acquire,release}_lock_state (Alexei)
v4 -> v5
v4: https://lore.kernel.org/bpf/20241109074347.1434011-1-memxor@gmail.com
* Make active_locks part of bpf_func_state (Alexei)
* Remove unneeded in_callback_fn logic for references
v3 -> v4
v3: https://lore.kernel.org/bpf/20241104151716.2079893-1-memxor@gmail.com
* Address comments from Alexei
* Drop struct bpf_active_lock definition
* Name enum type, expand definition to multiple lines
* s/REF_TYPE_BPF_LOCK/REF_TYPE_LOCK/g
* Change active_lock type to int
* Fix type of 'type' in acquire_lock_state
* Filter by taking type explicitly in find_lock_state
* WARN for default case in refsafe switch statement
v2 -> v3
v2: https://lore.kernel.org/bpf/20241103212252.547071-1-memxor@gmail.com
* Rebase on bpf-next to resolve merge conflict
v1 -> v2
v1: https://lore.kernel.org/bpf/20241103205856.345580-1-memxor@gmail.com
* Fix refsafe state comparison to check callback_ref and ptr separately.
====================
Link: https://lore.kernel.org/r/20241109231430.2475236-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Kumar Kartikeya Dwivedi [Sat, 9 Nov 2024 23:14:30 +0000 (15:14 -0800)]
bpf: Drop special callback reference handling
Logic to prevent callbacks from acquiring new references for the program
(i.e. leaving acquired references), and releasing caller references
(i.e. those acquired in parent frames) was introduced in commit
9d9d00ac29d0 ("bpf: Fix reference state management for synchronous callbacks").
This was necessary because back then, the verifier simulated each
callback once (that could potentially be executed N times, where N can
be zero). This meant that callbacks that left lingering resources or
cleared caller resources could do it more than once, operating on
undefined state or leaking memory.
With the fixes to callback verification in commit ab5cfac139ab
("bpf: verify callbacks as if they are called unknown number of times"),
all of this extra logic is no longer necessary. Hence, drop it as part
of this commit.
Cc: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241109231430.2475236-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Kumar Kartikeya Dwivedi [Sat, 9 Nov 2024 23:14:29 +0000 (15:14 -0800)]
bpf: Refactor active lock management
When bpf_spin_lock was introduced originally, there was deliberation on
whether to use an array of lock IDs, but since bpf_spin_lock is limited
to holding a single lock at any given time, we've been using a single ID
to identify the held lock.
In preparation for introducing spin locks that can be taken multiple
times, introduce support for acquiring multiple lock IDs. For this
purpose, reuse the acquired_refs array and store both lock and pointer
references. We tag the entry with REF_TYPE_PTR or REF_TYPE_LOCK to
disambiguate and find the relevant entry. The ptr field is used to track
the map_ptr or btf (for bpf_obj_new allocations) to ensure locks can be
matched with protected fields within the same "allocation", i.e.
bpf_obj_new object or map value.
The struct active_lock is changed to an int as the state is part of the
acquired_refs array, and we only need active_lock as a cheap way of
detecting lock presence.
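A hedged sketch of the tagged entry layout this describes (field names
approximate, not lifted from the patch): lock and pointer references
share one array, distinguished by type, with ptr scoping each lock to
its owning allocation.
	enum ref_state_type {
		REF_TYPE_PTR,
		REF_TYPE_LOCK,
	};

	struct bpf_reference_state {
		enum ref_state_type type;	/* REF_TYPE_PTR or REF_TYPE_LOCK */
		int id;				/* unique id of the acquired object */
		int insn_idx;			/* instruction that acquired it */
		void *ptr;			/* map_ptr or btf owning the lock */
	};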
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241109231430.2475236-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Viktor Malik [Thu, 7 Nov 2024 11:52:31 +0000 (12:52 +0100)]
selftests/bpf: skip the timer_lockup test for single-CPU nodes
The timer_lockup test needs 2 CPUs to work, on single-CPU nodes it fails
to set thread affinity to CPU 1 since it doesn't exist:
# ./test_progs -t timer_lockup
test_timer_lockup:PASS:timer_lockup__open_and_load 0 nsec
test_timer_lockup:PASS:pthread_create thread1 0 nsec
test_timer_lockup:PASS:pthread_create thread2 0 nsec
timer_lockup_thread:PASS:cpu affinity 0 nsec
timer_lockup_thread:FAIL:cpu affinity unexpected error: 22 (errno 0)
test_timer_lockup:PASS: 0 nsec
#406 timer_lockup:FAIL
Skip the test if only 1 CPU is available.
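A hedged sketch of the skip logic (test__skip() is the test_progs
harness helper; the CPU-count query shown is one reasonable choice, not
necessarily the one the patch uses):
	#include <sys/sysinfo.h>

	void test_timer_lockup(void)
	{
		if (get_nprocs() < 2) {
			test__skip();	/* single-CPU node: nothing to race */
			return;
		}
		/* ... create the two threads pinned to CPUs 0 and 1 ... */
	}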
Signed-off-by: Viktor Malik <vmalik@redhat.com>
Fixes: 50bd5a0c658d1 ("selftests/bpf: Add timer lockup selftest")
Tested-by: Philo Lu <lulie@linux.alibaba.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241107115231.75200-1-vmalik@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Alexei Starovoitov [Fri, 8 Nov 2024 19:40:16 +0000 (11:40 -0800)]
Merge branch 'fix-lockdep-warning-for-htab-of-map'
Hou Tao says:
====================
The patch set fixes a lockdep warning for htab of map. The
warning is found when running test_maps. The warning occurs when
htab_put_fd_value() attempts to acquire map_idr_lock to free the map id
of the inner map while already holding the bucket lock (raw_spinlock_t).
The fix moves the invocation of free_htab_elem() after
htab_unlock_bucket() and adds a test case to verify the solution.
====================
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Hou Tao [Wed, 6 Nov 2024 06:35:42 +0000 (14:35 +0800)]
selftests/bpf: Test the update operations for htab of maps
Add test cases to verify the following four update operations on htab of
maps don't trigger lockdep warning:
(1) add then delete
(2) add, overwrite, then delete
(3) add, then lookup_and_delete
(4) add two elements, then lookup_and_delete_batch
Test cases are added for pre-allocated and non-preallocated htab of maps
respectively.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241106063542.357743-4-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Hou Tao [Wed, 6 Nov 2024 06:35:41 +0000 (14:35 +0800)]
selftests/bpf: Move ENOTSUPP from bpf_util.h
Moving the definition of ENOTSUPP into bpf_util.h to remove the
duplicated definitions in multiple files.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241106063542.357743-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Hou Tao [Wed, 6 Nov 2024 06:35:40 +0000 (14:35 +0800)]
bpf: Call free_htab_elem() after htab_unlock_bucket()
For htab of maps, when the map is removed from the htab, it may hold the
last reference of the map. bpf_map_fd_put_ptr() will invoke
bpf_map_free_id() to free the id of the removed map element. However,
bpf_map_fd_put_ptr() is invoked while holding a bucket lock
(raw_spin_lock_t), and bpf_map_free_id() attempts to acquire map_idr_lock
(spinlock_t), triggering the following lockdep warning:
=============================
[ BUG: Invalid wait context ]
6.11.0-rc4+ #49 Not tainted
-----------------------------
test_maps/4881 is trying to lock:
ffffffff84884578 (map_idr_lock){+...}-{3:3}, at: bpf_map_free_id.part.0+0x21/0x70
other info that might help us debug this:
context-{5:5}
2 locks held by test_maps/4881:
#0: ffffffff846caf60 (rcu_read_lock){....}-{1:3}, at: bpf_fd_htab_map_update_elem+0xf9/0x270
#1: ffff888149ced148 (&htab->lockdep_key#2){....}-{2:2}, at: htab_map_update_elem+0x178/0xa80
stack backtrace:
CPU: 0 UID: 0 PID: 4881 Comm: test_maps Not tainted 6.11.0-rc4+ #49
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ...
Call Trace:
<TASK>
dump_stack_lvl+0x6e/0xb0
dump_stack+0x10/0x20
__lock_acquire+0x73e/0x36c0
lock_acquire+0x182/0x450
_raw_spin_lock_irqsave+0x43/0x70
bpf_map_free_id.part.0+0x21/0x70
bpf_map_put+0xcf/0x110
bpf_map_fd_put_ptr+0x9a/0xb0
free_htab_elem+0x69/0xe0
htab_map_update_elem+0x50f/0xa80
bpf_fd_htab_map_update_elem+0x131/0x270
htab_map_update_elem+0x50f/0xa80
bpf_fd_htab_map_update_elem+0x131/0x270
bpf_map_update_value+0x266/0x380
__sys_bpf+0x21bb/0x36b0
__x64_sys_bpf+0x45/0x60
x64_sys_call+0x1b2a/0x20d0
do_syscall_64+0x5d/0x100
entry_SYSCALL_64_after_hwframe+0x76/0x7e
One way to fix the lockdep warning is using raw_spinlock_t for
map_idr_lock as well. However, bpf_map_alloc_id() invokes
idr_alloc_cyclic() after acquiring map_idr_lock, it will trigger a
similar lockdep warning because the slab's lock (s->cpu_slab->lock) is
still a spinlock.
Instead of changing map_idr_lock's type, fix the issue by invoking
htab_put_fd_value() after htab_unlock_bucket(). However, only deferring
the invocation of htab_put_fd_value() is not enough, because the old
map pointers in an htab of maps can not be saved during batched
deletion. Therefore, also defer the invocation of free_htab_elem(), so
these to-be-freed elements can be linked together, similar to the lru
map.
There are four callers for ->map_fd_put_ptr:
(1) alloc_htab_elem() (through htab_put_fd_value())
It invokes ->map_fd_put_ptr() under a raw_spinlock_t. The invocation of
htab_put_fd_value() can not simply be moved after htab_unlock_bucket(),
because the old element has already been stashed in htab->extra_elems.
It may be reused immediately after htab_unlock_bucket(), and the
invocation of htab_put_fd_value() after htab_unlock_bucket() may then
release the newly-added element incorrectly. Therefore, save the map
pointer of the old element for htab of maps before unlocking the
bucket, and release the map_ptr after unlock. Besides the map pointer,
the same should be done for the special fields in the old element as
well.
(2) free_htab_elem() (through htab_put_fd_value())
Its caller includes __htab_map_lookup_and_delete_elem(),
htab_map_delete_elem() and __htab_map_lookup_and_delete_batch().
For htab_map_delete_elem(), simply invoke free_htab_elem() after
htab_unlock_bucket(). For __htab_map_lookup_and_delete_batch(), just
like the lru map, link the to-be-freed elements into a node_to_free
list and invoke free_htab_elem() for these elements after unlock. It is
safe to reuse batch_flink as the link for node_to_free, because these
elements have been removed from the hash llist.
Because htab of maps doesn't support the lookup_and_delete operation,
__htab_map_lookup_and_delete_elem() doesn't have the problem, so it is
kept as is.
(3) fd_htab_map_free()
It invokes ->map_fd_put_ptr without raw_spinlock_t.
(4) bpf_fd_htab_map_update_elem()
It invokes ->map_fd_put_ptr without raw_spinlock_t.
After moving free_htab_elem() outside the htab bucket lock scope, use
pcpu_freelist_push() instead of __pcpu_freelist_push() to disable irqs
before freeing elements, and protect the invocations of
bpf_mem_cache_free() with a migrate_{disable|enable} pair.
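A hedged sketch of the reordering in the htab_map_delete_elem() path
(locals, declarations, and error handling trimmed): unlink under the
bucket lock, but free only after the raw spinlock is released.
	b = __select_bucket(htab, hash);
	head = &b->head;

	ret = htab_lock_bucket(htab, b, hash, &flags);
	if (ret)
		return ret;

	l = lookup_elem_raw(head, hash, key, key_size);
	if (l)
		hlist_nulls_del_rcu(&l->hash_node);	/* unlink under the lock */
	else
		ret = -ENOENT;

	htab_unlock_bucket(htab, b, hash, flags);

	if (l)
		free_htab_elem(htab, l);	/* may drop the last inner-map ref */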
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241106063542.357743-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Andrii Nakryiko [Fri, 8 Nov 2024 18:20:46 +0000 (10:20 -0800)]
Merge branch 'bpf-add-uprobe-session-support'
Jiri Olsa says:
====================
bpf: Add uprobe session support
hi,
this patchset is adding support for session uprobe attachment and
using it through bpf link for bpf programs.
The session means that the uprobe consumer is executed on entry
and return of probed function with additional control:
- entry callback can control execution of the return callback
- entry and return callbacks can share data/cookie
Uprobe changes (on top of perf/core [1]) are posted here [2].
This patchset is based on bpf-next/master and will be merged once
we pull [2] into bpf-next/master.
v9 changes:
- rebased on bpf-next/master with perf/core tag merged (thanks Peter!)
thanks,
jirka
[1] git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/core
[2] https://lore.kernel.org/bpf/20241018202252.693462-1-jolsa@kernel.org/T/#ma43c549c4bf684ca1b17fa638aa5e7cbb46893e9
---
====================
Link: https://lore.kernel.org/r/20241108134544.480660-1-jolsa@kernel.org
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Jiri Olsa [Fri, 8 Nov 2024 13:45:44 +0000 (14:45 +0100)]
selftests/bpf: Add threads to consumer test
With the recent uprobe fix [1], the sync time after unregistering a
uprobe is much longer, which prolongs the consumer test that creates
and destroys hundreds of uprobes.
This change adds 16 threads (which fits the test logic) and speeds up
the test.
Before the change:
# perf stat --null ./test_progs -t uprobe_multi_test/consumers
#421/9 uprobe_multi_test/consumers:OK
#421 uprobe_multi_test:OK
Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED
Performance counter stats for './test_progs -t uprobe_multi_test/consumers':
28.818778973 seconds time elapsed
0.745518000 seconds user
0.919186000 seconds sys
After the change:
# perf stat --null ./test_progs -t uprobe_multi_test/consumers 2>&1
#421/9 uprobe_multi_test/consumers:OK
#421 uprobe_multi_test:OK
Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED
Performance counter stats for './test_progs -t uprobe_multi_test/consumers':
3.504790814 seconds time elapsed
0.012141000 seconds user
0.751760000 seconds sys
[1] commit 87195a1ee332 ("uprobes: switch to RCU Tasks Trace flavor for better performance")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-14-jolsa@kernel.org
Jiri Olsa [Fri, 8 Nov 2024 13:45:43 +0000 (14:45 +0100)]
selftests/bpf: Add uprobe sessions to consumer test
Adding uprobe session consumers to the consumer test, so we get the
session into the test mix.
In addition, scaling down the test to have just 1 uprobe and
1 uretprobe; otherwise the test time grows and is unsuitable for CI
even with threads.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-13-jolsa@kernel.org
Jiri Olsa [Fri, 8 Nov 2024 13:45:42 +0000 (14:45 +0100)]
selftests/bpf: Add uprobe session single consumer test
Testing that the session ret_handler bypass works on a single uprobe
with multiple consumers, each with a different session ignore-return
value.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-12-jolsa@kernel.org
Jiri Olsa [Fri, 8 Nov 2024 13:45:41 +0000 (14:45 +0100)]
selftests/bpf: Add kprobe session verifier test for return value
Making sure kprobe.session program can return only [0,1] values.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-11-jolsa@kernel.org
Jiri Olsa [Fri, 8 Nov 2024 13:45:40 +0000 (14:45 +0100)]
selftests/bpf: Add uprobe session verifier test for return value
Making sure uprobe.session program can return only [0,1] values.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-10-jolsa@kernel.org
Jiri Olsa [Fri, 8 Nov 2024 13:45:39 +0000 (14:45 +0100)]
selftests/bpf: Add uprobe session recursive test
Adding a uprobe session test that verifies the cookie value is stored
properly when a single uprobe-ed function is executed recursively.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-9-jolsa@kernel.org
Jiri Olsa [Fri, 8 Nov 2024 13:45:38 +0000 (14:45 +0100)]
selftests/bpf: Add uprobe session cookie test
Adding a uprobe session test that verifies the cookie value gets
properly propagated from the entry to the return program.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-8-jolsa@kernel.org
Jiri Olsa [Fri, 8 Nov 2024 13:45:37 +0000 (14:45 +0100)]
selftests/bpf: Add uprobe session test
Adding uprobe session test and testing that the entry program
return value controls execution of the return probe program.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-7-jolsa@kernel.org
Jiri Olsa [Fri, 8 Nov 2024 13:45:36 +0000 (14:45 +0100)]
libbpf: Add support for uprobe multi session attach
Adding support to attach a program in uprobe session mode with the
bpf_program__attach_uprobe_multi function.
Adding a session bool to the bpf_uprobe_multi_opts struct that allows
loading and attaching the bpf program via a uprobe session, so the
attachment creates a uprobe multi session.
Also adding a new program loader section that allows:
SEC("uprobe.session/bpf_fentry_test*")
and loads/attaches the uprobe program as a uprobe session.
Adding a sleepable hook (uprobe.session.s) as well.
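A hedged usage sketch of the new option (the binary path, symbol, and
program/skeleton names are placeholders, not from the patch):
	struct bpf_link *link;
	LIBBPF_OPTS(bpf_uprobe_multi_opts, opts,
		.session = true,	/* one program handles entry and return */
	);

	link = bpf_program__attach_uprobe_multi(skel->progs.malloc_session,
						-1 /* any pid */,
						"/usr/lib/libc.so.6",
						"malloc", &opts);
	if (!link)
		return -errno;	/* libbpf sets errno on failure */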
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-6-jolsa@kernel.org
Jiri Olsa [Fri, 8 Nov 2024 13:45:35 +0000 (14:45 +0100)]
bpf: Add support for uprobe multi session context
Placing bpf_session_run_ctx layer in between bpf_run_ctx and
bpf_uprobe_multi_run_ctx, so the session data can be retrieved
from uprobe_multi link.
Plus granting session kfuncs access to uprobe session programs.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-5-jolsa@kernel.org
Jiri Olsa [Fri, 8 Nov 2024 13:45:34 +0000 (14:45 +0100)]
bpf: Add support for uprobe multi session attach
Adding support to attach a BPF program for the entry and return probe
of the same function. This is a common use case which currently
requires creating two uprobe multi links.
Adding a new BPF_TRACE_UPROBE_SESSION attach type that instructs the
kernel to attach a single link program to both the entry and exit
probe.
Execution of the BPF program on the return probe can be controlled by
the entry BPF program's return value: returning zero runs the
return-probe program, returning non-zero skips it.
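A hedged sketch of such a session program, pairing with the libbpf
attach sketch in the entry above; it assumes the bpf_session_is_return()
kfunc is available to uprobe sessions (per the session-context patch)
and uses placeholder names:
	#include <vmlinux.h>
	#include <bpf/bpf_helpers.h>

	char LICENSE[] SEC("license") = "GPL";

	extern bool bpf_session_is_return(void) __ksym;

	int traced_pid;	/* set by userspace before attach */

	SEC("uprobe.session")
	int malloc_session(struct pt_regs *ctx)
	{
		if (bpf_session_is_return())
			return 0;	/* return half: must return 0 */
		/* entry half: non-zero skips the return probe for this call */
		return (bpf_get_current_pid_tgid() >> 32) != traced_pid;
	}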
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-4-jolsa@kernel.org
Jiri Olsa [Fri, 8 Nov 2024 13:45:33 +0000 (14:45 +0100)]
bpf: Force uprobe bpf program to always return 0
As suggested by Andrii, make uprobe multi bpf programs always return 0,
so they can't force uprobe removal.
Keeping the int return type for uprobe_prog_run, because it will be
used in the following session changes.
Fixes: 89ae89f53d20 ("bpf: Add multi uprobe link")
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-3-jolsa@kernel.org
Jiri Olsa [Fri, 8 Nov 2024 13:45:32 +0000 (14:45 +0100)]
bpf: Allow return values 0 and 1 for kprobe session
The kprobe session program can return only 0 or 1; instruct the
verifier to check for that.
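A hedged sketch of where such a check plausibly lands in the verifier's
check_return_code() (the exact switch arm is an assumption):
	case BPF_PROG_TYPE_KPROBE:
		if (env->prog->expected_attach_type == BPF_TRACE_KPROBE_SESSION) {
			range = retval_range(0, 1);	/* only 0 or 1 allowed */
			break;
		}
		return 0;	/* other kprobe flavors: any return value */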
Fixes: 535a3692ba72 ("bpf: Add support for kprobe session attach")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-2-jolsa@kernel.org
Jiri Olsa [Thu, 7 Nov 2024 09:43:37 +0000 (10:43 +0100)]
selftests/bpf: Fix uprobe consumer test (again)
The new uprobe changes bring some new behaviour that we need to reflect
in the consumer test. Now a pending uprobe instance in the kernel can
survive longer and thus might call uretprobe consumer callbacks in
some situations in which, previously, such a callback would have been
omitted. We now need to take that into account in uprobe-multi
consumer tests.
The idea is that the uretprobe under test either stayed attached from
before to after (uret_stays + test_bit), or the uretprobe instance
survived and we have a uretprobe active after (uret_survives +
test_bit).
uret_survives just states that a uretprobe survives if there are *any*
uretprobes both before and after (overlapping or not, it doesn't
matter) and the uprobe was attached before.
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241107094337.3848210-1-jolsa@kernel.org
Abhinav Saxena [Thu, 7 Nov 2024 06:37:08 +0000 (23:37 -0700)]
bpf: Remove trailing whitespace in verifier.rst
Remove trailing whitespace in Documentation/bpf/verifier.rst.
Signed-off-by: Abhinav Saxena <xandfury@gmail.com>
Link: https://lore.kernel.org/r/20241107063708.106340-2-xandfury@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Viktor Malik [Fri, 1 Nov 2024 08:27:11 +0000 (09:27 +0100)]
selftests/bpf: Allow building with extra flags
In order to specify extra compilation or linking flags to BPF selftests,
it is possible to set EXTRA_CFLAGS and EXTRA_LDFLAGS from the command
line. The problem is that they are not propagated to sub-make calls
(runqslower, bpftool, libbpf); in the best case they are not applied,
and in the worst case they cause the entire build to fail.
Propagate EXTRA_CFLAGS and EXTRA_LDFLAGS to the sub-makes.
This, for instance, allows building selftests as PIE with
$ make EXTRA_CFLAGS='-fPIE' EXTRA_LDFLAGS='-pie'
Without this change, the command would fail because libbpf.a would not
be built with -fPIE and other PIE binaries would not link against it.
The only problem is that we have to explicitly provide empty
EXTRA_CFLAGS='' and EXTRA_LDFLAGS='' to the builds of kernel modules,
as we don't want to build modules with flags used for userspace (the
above example would fail because the kernel doesn't support PIE).
Signed-off-by: Viktor Malik <vmalik@redhat.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Mikulas Patocka [Mon, 11 Nov 2024 15:51:02 +0000 (16:51 +0100)]
dm-cache: fix warnings about duplicate slab caches
The commit 4c39529663b9 adds a warning about duplicate cache names if
CONFIG_DEBUG_VM is selected. These warnings are triggered by the dm-cache
code.
The dm-cache code allocates a slab cache for each device. This commit
changes it to allocate just one slab cache in the module init function.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 4c39529663b9 ("slab: Warn on duplicate cache names when DEBUG_VM=y")
Mikulas Patocka [Mon, 11 Nov 2024 15:48:18 +0000 (16:48 +0100)]
dm-bufio: fix warnings about duplicate slab caches
The commit 4c39529663b9 adds a warning about duplicate cache names if
CONFIG_DEBUG_VM is selected. These warnings are triggered by the dm-bufio
code. The dm-bufio code allocates a slab cache with each client. It is
not possible to preallocate the caches in the module init function
because the size of auxiliary per-buffer data is not known at this point.
So, this commit changes dm-bufio so that it appends a unique atomic value
to the cache name, to avoid the warnings.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 4c39529663b9 ("slab: Warn on duplicate cache names when DEBUG_VM=y")
Barry Song [Thu, 7 Nov 2024 01:12:46 +0000 (14:12 +1300)]
mm: count zeromap read and set for swapout and swapin
When the proportion of folios from the zeromap is small, missing their
accounting may not significantly impact profiling. However, it's easy to
construct a scenario where this becomes an issue—for example, allocating
1 GB of memory, writing zeros from userspace, followed by MADV_PAGEOUT,
and then swapping it back in. In this case, the swap-out and swap-in
counts seem to vanish into a black hole, potentially causing semantic
ambiguity.
On the other hand, Usama reported that zero-filled pages can exceed 10%
in workloads utilizing zswap, while Hailong noted that some apps on
Android have more than 6% zero-filled pages. Before commit
0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap"),
both zswap and zRAM
implemented similar optimizations, leading to these optimized-out pages
being counted in either zswap or zRAM counters (with pswpin/pswpout also
increasing for zRAM). With zeromap functioning prior to both zswap and
zRAM, userspace will no longer detect these swap-out and swap-in actions.
We have three ways to address this:
1. Introduce a dedicated counter specifically for the zeromap.
2. Use pswpin/pswpout accounting, treating the zero map as a standard
backend. This approach aligns with zRAM's current handling of
same-page fills at the device level. However, it would mean losing the
optimized-out page counters previously available in zRAM and would not
align with systems using zswap. Additionally, as noted by Nhat Pham,
pswpin/pswpout counters apply only to I/O done directly to the backend
device.
3. Count zeromap pages under zswap, aligning with system behavior when
zswap is enabled. However, this would not be consistent with zRAM, nor
would it align with systems lacking both zswap and zRAM.
Given the complications with options 2 and 3, this patch selects
option 1.
We can find these counters from /proc/vmstat (counters for the whole
system) and memcg's memory.stat (counters for the interested memcg).
For example:
$ grep -E 'swpin_zero|swpout_zero' /proc/vmstat
swpin_zero 1648
swpout_zero 33536
$ grep -E 'swpin_zero|swpout_zero' /sys/fs/cgroup/system.slice/memory.stat
swpin_zero 3905
swpout_zero 3985
This patch does not address any specific zeromap bug, but the missing
swpout and swpin counts for zero-filled pages can be highly confusing and
may mislead user-space agents that rely on changes in these counters as
indicators. Therefore, we add a Fixes tag to encourage the inclusion of
this counter in any kernel versions with zeromap.
Many thanks to Kanchana for the contribution of changing
count_objcg_event() to count_objcg_events() to support large folios[1],
which has now been incorporated into this patch.
[1] https://lkml.kernel.org/r/20241001053222.6944-5-kanchana.p.sridhar@intel.com
Link: https://lkml.kernel.org/r/20241107011246.59137-1-21cnbao@gmail.com
Fixes: 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap")
Co-developed-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Hailong Liu <hailong.liu@oppo.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Linus Torvalds [Sun, 10 Nov 2024 22:19:35 +0000 (14:19 -0800)]
Linux 6.12-rc7
Linus Torvalds [Sun, 10 Nov 2024 22:16:28 +0000 (14:16 -0800)]
Merge tag 'clk-fixes-for-linus' of git://git./linux/kernel/git/clk/linux
Pull clk fixes from Stephen Boyd:
"A handful of Qualcomm clk driver fixes:
- Correct flags for X Elite USB MP GDSC and pcie pipediv2 clocks
- Fix alpha PLL post_div mask for the cases where width is not
specified
- Avoid hangs in the SM8350 video driver (venus) by setting HW_CTRL
trigger feature on the video clocks"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: qcom: gcc-x1e80100: Fix USB MP SS1 PHY GDSC pwrsts flags
clk: qcom: gcc-x1e80100: Fix halt_check for pipediv2 clocks
clk: qcom: clk-alpha-pll: Fix pll post div mask when width is not set
clk: qcom: videocc-sm8350: use HW_CTRL_TRIGGER for vcodec GDSCs
Linus Torvalds [Sun, 10 Nov 2024 22:13:05 +0000 (14:13 -0800)]
Merge tag 'i2c-for-6.12-rc7' of git://git./linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"i2c-host fixes for v6.12-rc7 (from Andi):
- Fix designware incorrect behavior when concluding a transmission
- Fix Mule multiplexer error value evaluation"
* tag 'i2c-for-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: designware: do not hold SCL low when I2C_DYNAMIC_TAR_UPDATE is not set
i2c: muxes: Fix return value check in mule_i2c_mux_probe()
Trond Myklebust [Fri, 13 Sep 2024 17:57:04 +0000 (13:57 -0400)]
filemap: Fix bounds checking in filemap_read()
If the caller supplies an iocb->ki_pos value that is close to the
filesystem upper limit, and an iterator with a count that causes us to
overflow that limit, then filemap_read() enters an infinite loop.
This behaviour was discovered when testing xfstests generic/525 with the
"localio" optimisation for loopback NFS mounts.
Reported-by: Mike Snitzer <snitzer@kernel.org>
Fixes: c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
Tested-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 10 Nov 2024 17:37:47 +0000 (09:37 -0800)]
Merge tag 'irq_urgent_for_v6.12_rc7' of git://git./linux/kernel/git/tip/tip
Pull irq fix from Borislav Petkov:
- Make sure GICv3 controller interrupt activation doesn't race with a
concurrent deactivation due to propagation delays of the register
write
* tag 'irq_urgent_for_v6.12_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-v3: Force propagation of the active state with a read-back