linux-2.6-microblaze.git
6 months agobpf: Add support for kprobe session cookie
Jiri Olsa [Tue, 30 Apr 2024 11:28:26 +0000 (13:28 +0200)]
bpf: Add support for kprobe session cookie

Adding support for cookie within the session of kprobe multi
entry and return program.

The session cookie is u64 value and can be retrieved be new
kfunc bpf_session_cookie, which returns pointer to the cookie
value. The bpf program can use the pointer to store (on entry)
and load (on return) the value.

The cookie value is implemented via fprobe feature that allows
to share values between entry and return ftrace fprobe callbacks.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240430112830.1184228-4-jolsa@kernel.org
6 months agobpf: Add support for kprobe session context
Jiri Olsa [Tue, 30 Apr 2024 11:28:25 +0000 (13:28 +0200)]
bpf: Add support for kprobe session context

Adding struct bpf_session_run_ctx object to hold session related
data, which is atm is_return bool and data pointer coming in
following changes.

Placing bpf_session_run_ctx layer in between bpf_run_ctx and
bpf_kprobe_multi_run_ctx so the session data can be retrieved
regardless of if it's kprobe_multi or uprobe_multi link, which
support is coming in future. This way both kprobe_multi and
uprobe_multi can use same kfuncs to access the session data.

Adding bpf_session_is_return kfunc that returns true if the
bpf program is executed from the exit probe of the kprobe multi
link attached in wrapper mode. It returns false otherwise.

Adding new kprobe hook for kprobe program type.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240430112830.1184228-3-jolsa@kernel.org
6 months agobpf: Add support for kprobe session attach
Jiri Olsa [Tue, 30 Apr 2024 11:28:24 +0000 (13:28 +0200)]
bpf: Add support for kprobe session attach

Adding support to attach bpf program for entry and return probe
of the same function. This is common use case which at the moment
requires to create two kprobe multi links.

Adding new BPF_TRACE_KPROBE_SESSION attach type that instructs
kernel to attach single link program to both entry and exit probe.

It's possible to control execution of the bpf program on return
probe simply by returning zero or non zero from the entry bpf
program execution to execute or not the bpf program on return
probe respectively.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240430112830.1184228-2-jolsa@kernel.org
6 months agoselftests/bpf: Drop an unused local variable
Benjamin Tissoires [Tue, 30 Apr 2024 10:43:26 +0000 (12:43 +0200)]
selftests/bpf: Drop an unused local variable

Some copy/paste leftover, this is never used.

Fixes: e3d9eac99afd ("selftests/bpf: wq: add bpf_wq_init() checks")
Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20240430-bpf-next-v3-3-27afe7f3b17c@kernel.org
6 months agobpf: Do not walk twice the hash map on free
Benjamin Tissoires [Tue, 30 Apr 2024 10:43:25 +0000 (12:43 +0200)]
bpf: Do not walk twice the hash map on free

If someone stores both a timer and a workqueue in a hash map, on free, we
would walk it twice.

Add a check in htab_free_malloced_timers_or_wq and free the timers and
workqueues if they are present.

Fixes: 246331e3f1ea ("bpf: allow struct bpf_wq to be embedded in arraymaps and hashmaps")
Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20240430-bpf-next-v3-2-27afe7f3b17c@kernel.org
6 months agobpf: Do not walk twice the map on free
Benjamin Tissoires [Tue, 30 Apr 2024 10:43:24 +0000 (12:43 +0200)]
bpf: Do not walk twice the map on free

If someone stores both a timer and a workqueue in a map, on free
we would walk it twice.

Add a check in array_map_free_timers_wq and free the timers and
workqueues if they are present.

Fixes: 246331e3f1ea ("bpf: allow struct bpf_wq to be embedded in arraymaps and hashmaps")
Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20240430-bpf-next-v3-1-27afe7f3b17c@kernel.org
6 months agoselftests/bpf: validate nulled-out struct_ops program is handled properly
Andrii Nakryiko [Sun, 28 Apr 2024 03:09:54 +0000 (20:09 -0700)]
selftests/bpf: validate nulled-out struct_ops program is handled properly

Add a selftests validating that it's possible to have some struct_ops
callback set declaratively, then disable it (by setting to NULL)
programmatically. Libbpf should detect that such program should
not be loaded. Otherwise, it will unnecessarily fail the loading
when the host kernel does not have the type information.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240428030954.3918764-2-andrii@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
6 months agolibbpf: handle nulled-out program in struct_ops correctly
Andrii Nakryiko [Sun, 28 Apr 2024 03:09:53 +0000 (20:09 -0700)]
libbpf: handle nulled-out program in struct_ops correctly

If struct_ops has one of program callbacks set declaratively and host
kernel is old and doesn't support this callback, libbpf will allow to
load such struct_ops as long as that callback was explicitly nulled-out
(presumably through skeleton). This is all working correctly, except we
won't reset corresponding program slot to NULL before bailing out, which
will lead to libbpf not detecting that BPF program has to be not
auto-loaded. Fix this by unconditionally resetting corresponding program
slot to NULL.

Fixes: c911fc61a7ce ("libbpf: Skip zeroed or null fields if not found in the kernel type.")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240428030954.3918764-1-andrii@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
6 months agobpf: Include linux/types.h for u32
Dmitrii Bundin [Sat, 20 Apr 2024 04:24:57 +0000 (07:24 +0300)]
bpf: Include linux/types.h for u32

Inclusion of the header linux/btf_ids.h relies on indirect inclusion of
the header linux/types.h. Including it directly on the top level helps
to avoid potential problems if linux/types.h hasn't been included
before.

The main motivation to introduce this it is to avoid similar problems that
have shown up in the bpftool where GNU libc indirectly pulls
linux/types.h causing compile error of the form:

   error: unknown type name 'u32'
                             u32 cnt;
                             ^~~

The bpftool compile error was fixed in
62248b22d01e ("tools/resolve_btfids: fix build with musl libc").

Signed-off-by: Dmitrii Bundin <dmitrii.bundin.a@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240420042457.3198883-1-dmitrii.bundin.a@gmail.com
6 months agoMerge branch 'free-strdup-memory-in-selftests'
Andrii Nakryiko [Mon, 29 Apr 2024 23:17:16 +0000 (16:17 -0700)]
Merge branch 'free-strdup-memory-in-selftests'

Geliang Tang says:

====================
Free strdup memory in selftests

From: Geliang Tang <tanggeliang@kylinos.cn>

Two fixes to free strdup memory in selftests to avoid memory leaks.
====================

Link: https://lore.kernel.org/r/cover.1714374022.git.tanggeliang@kylinos.cn
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
6 months agoselftests/bpf: Free strdup memory in veristat
Geliang Tang [Mon, 29 Apr 2024 07:07:34 +0000 (15:07 +0800)]
selftests/bpf: Free strdup memory in veristat

The strdup() function returns a pointer to a new string which is a
duplicate of the string "input". Memory for the new string is obtained
with malloc(), and need to be freed with free().

This patch adds these missing "free(input)" in parse_stats() to avoid
memory leak in veristat.c.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/ded44f8865cd7f337f52fc5fb0a5fbed7d6bd641.1714374022.git.tanggeliang@kylinos.cn
6 months agoselftests/bpf: Free strdup memory in test_sockmap
Geliang Tang [Mon, 29 Apr 2024 07:07:33 +0000 (15:07 +0800)]
selftests/bpf: Free strdup memory in test_sockmap

The strdup() function returns a pointer to a new string which is a
duplicate of the string "ptr". Memory for the new string is obtained
with malloc(), and need to be freed with free().

This patch adds these missing "free(ptr)" in check_whitelist() and
check_blacklist() to avoid memory leaks in test_sockmap.c.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/b76f2f4c550aebe4ab8ea73d23c4cbe4f06ea996.1714374022.git.tanggeliang@kylinos.cn
6 months agoselftests/bpf: Run cgroup1_hierarchy test in own mount namespace
Viktor Malik [Mon, 29 Apr 2024 11:23:11 +0000 (13:23 +0200)]
selftests/bpf: Run cgroup1_hierarchy test in own mount namespace

The cgroup1_hierarchy test uses setup_classid_environment to setup
cgroupv1 environment. The problem is that the environment is set in
/sys/fs/cgroup and therefore, if not run under an own mount namespace,
effectively deletes all system cgroups:

    $ ls /sys/fs/cgroup | wc -l
    27
    $ sudo ./test_progs -t cgroup1_hierarchy
    #41/1    cgroup1_hierarchy/test_cgroup1_hierarchy:OK
    #41/2    cgroup1_hierarchy/test_root_cgid:OK
    #41/3    cgroup1_hierarchy/test_invalid_level:OK
    #41/4    cgroup1_hierarchy/test_invalid_cgid:OK
    #41/5    cgroup1_hierarchy/test_invalid_hid:OK
    #41/6    cgroup1_hierarchy/test_invalid_cgrp_name:OK
    #41/7    cgroup1_hierarchy/test_invalid_cgrp_name2:OK
    #41/8    cgroup1_hierarchy/test_sleepable_prog:OK
    #41      cgroup1_hierarchy:OK
    Summary: 1/8 PASSED, 0 SKIPPED, 0 FAILED
    $ ls /sys/fs/cgroup | wc -l
    1

To avoid this, run setup_cgroup_environment first which will create an
own mount namespace. This only affects the cgroupv1_hierarchy test as
all other cgroup1 test progs already run setup_cgroup_environment prior
to running setup_classid_environment.

Also add a comment to the header of setup_classid_environment to warn
against this invalid usage in future.

Fixes: 360769233cc9 ("selftests/bpf: Add selftests for cgroup1 hierarchy")
Signed-off-by: Viktor Malik <vmalik@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240429112311.402497-1-vmalik@redhat.com
6 months agobpf: Switch to krealloc_array()
Andy Shevchenko [Mon, 29 Apr 2024 12:00:05 +0000 (15:00 +0300)]
bpf: Switch to krealloc_array()

Let the krealloc_array() copy the original data and
check for a multiplication overflow.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20240429120005.3539116-1-andriy.shevchenko@linux.intel.com
6 months agobpf: Use struct_size()
Andy Shevchenko [Mon, 29 Apr 2024 12:13:22 +0000 (15:13 +0300)]
bpf: Use struct_size()

Use struct_size() instead of hand writing it.
This is less verbose and more robust.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20240429121323.3818497-1-andriy.shevchenko@linux.intel.com
6 months agosamples/bpf: Add valid info for VMLINUX_BTF
Tao Chen [Sun, 28 Apr 2024 16:10:32 +0000 (00:10 +0800)]
samples/bpf: Add valid info for VMLINUX_BTF

When I use the command 'make M=samples/bpf' to compile samples/bpf code
in ubuntu 22.04, the error info occured:
Cannot find a vmlinux for VMLINUX_BTF at any of "  /home/ubuntu/code/linux/vmlinux",
build the kernel or set VMLINUX_BTF or VMLINUX_H variable

Others often encounter this kind of issue, new kernel has the vmlinux, so we can
set the path in error info which seems more intuitive, like:
Cannot find a vmlinux for VMLINUX_BTF at any of "  /home/ubuntu/code/linux/vmlinux",
buiild the kernel or set VMLINUX_BTF like "VMLINUX_BTF=/sys/kernel/btf/vmlinux" or
VMLINUX_H variable

Signed-off-by: Tao Chen <chen.dylane@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240428161032.239043-1-chen.dylane@gmail.com
6 months agobpf: Fix verifier assumptions about socket->sk
Alexei Starovoitov [Sat, 27 Apr 2024 00:25:44 +0000 (17:25 -0700)]
bpf: Fix verifier assumptions about socket->sk

The verifier assumes that 'sk' field in 'struct socket' is valid
and non-NULL when 'socket' pointer itself is trusted and non-NULL.
That may not be the case when socket was just created and
passed to LSM socket_accept hook.
Fix this verifier assumption and adjust tests.

Reported-by: Liam Wisehart <liamwisehart@meta.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Fixes: 6fcd486b3a0a ("bpf: Refactor RCU enforcement in the verifier.")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20240427002544.68803-1-alexei.starovoitov@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
6 months agoMerge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf...
Jakub Kicinski [Mon, 29 Apr 2024 18:59:20 +0000 (11:59 -0700)]
Merge tag 'for-netdev' of https://git./linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-04-29

We've added 147 non-merge commits during the last 32 day(s) which contain
a total of 158 files changed, 9400 insertions(+), 2213 deletions(-).

The main changes are:

1) Add an internal-only BPF per-CPU instruction for resolving per-CPU
   memory addresses and implement support in x86 BPF JIT. This allows
   inlining per-CPU array and hashmap lookups
   and the bpf_get_smp_processor_id() helper, from Andrii Nakryiko.

2) Add BPF link support for sk_msg and sk_skb programs, from Yonghong Song.

3) Optimize x86 BPF JIT's emit_mov_imm64, and add support for various
   atomics in bpf_arena which can be JITed as a single x86 instruction,
   from Alexei Starovoitov.

4) Add support for passing mark with bpf_fib_lookup helper,
   from Anton Protopopov.

5) Add a new bpf_wq API for deferring events and refactor sleepable
   bpf_timer code to keep common code where possible,
   from Benjamin Tissoires.

6) Fix BPF_PROG_TEST_RUN infra with regards to bpf_dummy_struct_ops programs
   to check when NULL is passed for non-NULLable parameters,
   from Eduard Zingerman.

7) Harden the BPF verifier's and/or/xor value tracking,
   from Harishankar Vishwanathan.

8) Introduce crypto kfuncs to make BPF programs able to utilize the kernel
   crypto subsystem, from Vadim Fedorenko.

9) Various improvements to the BPF instruction set standardization doc,
   from Dave Thaler.

10) Extend libbpf APIs to partially consume items from the BPF ringbuffer,
    from Andrea Righi.

11) Bigger batch of BPF selftests refactoring to use common network helpers
    and to drop duplicate code, from Geliang Tang.

12) Support bpf_tail_call_static() helper for BPF programs with GCC 13,
    from Jose E. Marchesi.

13) Add bpf_preempt_{disable,enable}() kfuncs in order to allow a BPF
    program to have code sections where preemption is disabled,
    from Kumar Kartikeya Dwivedi.

14) Allow invoking BPF kfuncs from BPF_PROG_TYPE_SYSCALL programs,
    from David Vernet.

15) Extend the BPF verifier to allow different input maps for a given
    bpf_for_each_map_elem() helper call in a BPF program, from Philo Lu.

16) Add support for PROBE_MEM32 and bpf_addr_space_cast instructions
    for riscv64 and arm64 JITs to enable BPF Arena, from Puranjay Mohan.

17) Shut up a false-positive KMSAN splat in interpreter mode by unpoison
    the stack memory, from Martin KaFai Lau.

18) Improve xsk selftest coverage with new tests on maximum and minimum
    hardware ring size configurations, from Tushar Vyavahare.

19) Various ReST man pages fixes as well as documentation and bash completion
    improvements for bpftool, from Rameez Rehman & Quentin Monnet.

20) Fix libbpf with regards to dumping subsequent char arrays,
    from Quentin Deslandes.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (147 commits)
  bpf, docs: Clarify PC use in instruction-set.rst
  bpf_helpers.h: Define bpf_tail_call_static when building with GCC
  bpf, docs: Add introduction for use in the ISA Internet Draft
  selftests/bpf: extend BPF_SOCK_OPS_RTT_CB test for srtt and mrtt_us
  bpf: add mrtt and srtt as BPF_SOCK_OPS_RTT_CB args
  selftests/bpf: dummy_st_ops should reject 0 for non-nullable params
  bpf: check bpf_dummy_struct_ops program params for test runs
  selftests/bpf: do not pass NULL for non-nullable params in dummy_st_ops
  selftests/bpf: adjust dummy_st_ops_success to detect additional error
  bpf: mark bpf_dummy_struct_ops.test_1 parameter as nullable
  selftests/bpf: Add ring_buffer__consume_n test.
  bpf: Add bpf_guard_preempt() convenience macro
  selftests: bpf: crypto: add benchmark for crypto functions
  selftests: bpf: crypto skcipher algo selftests
  bpf: crypto: add skcipher to bpf crypto
  bpf: make common crypto API for TC/XDP programs
  bpf: update the comment for BTF_FIELDS_MAX
  selftests/bpf: Fix wq test.
  selftests/bpf: Use make_sockaddr in test_sock_addr
  selftests/bpf: Use connect_to_addr in test_sock_addr
  ...
====================

Link: https://lore.kernel.org/r/20240429131657.19423-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonet: phy: micrel: Add support for PTP_PF_EXTTS for lan8814
Horatiu Vultur [Fri, 26 Apr 2024 14:02:24 +0000 (16:02 +0200)]
net: phy: micrel: Add support for PTP_PF_EXTTS for lan8814

Extend the PTP programmable gpios to implement also PTP_PF_EXTTS
function. The pins can be configured to capture both of rising
and falling edge. Once the event is seen, then an interrupt is
generated and the LTC is saved in the registers.
On lan8814 only GPIO 3 can be configured for this.

This was tested using:
ts2phc -m -l 7 -s generic -f ts2phc.cfg

Where the configuration was the following:
    ---
    [global]
    ts2phc.pin_index  3

    [eth0]
    ---

Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoMerge branch 'dsa-realtek-leds'
David S. Miller [Mon, 29 Apr 2024 12:35:41 +0000 (13:35 +0100)]
Merge branch 'dsa-realtek-leds'

Luiz Angelo Daros de Luca says:

====================
net: dsa: realtek: fix LED support for rtl8366

This series fixes the LED support for rtl8366. The existing code was not
tested in a device with switch LEDs and it was using a flawed logic.

The driver now keeps the default LED configuration if nothing requests a
different behavior. This may be enough for most devices. This can be
achieved either by omitting the LED from the device-tree or configuring
all LEDs in a group with the default state set to "keep".

The hardware trigger for LEDs in Realtek switches is shared among all
LEDs in a group. This behavior doesn't align well with the Linux LED
API, which controls LEDs individually. Once the OS changes the
brightness of a LED in a group still triggered by the hardware, the
entire group switches to software-controlled LEDs, even for those not
metioned in the device-tree. This shared behavior also prevents
offloading the trigger to the hardware as it would require an
orchestration between LEDs in a group, not currently present in the LED
API.

The assertion of device hardware reset during driver removal was removed
because it was causing an issue with the LED release code. Devres
devices are released after the driver's removal is executed. Asserting
the reset at that point was causing timeout errors during LED release
when it attempted to turn off the LED.

To: Linus Walleij <linus.walleij@linaro.org>
To: Alvin Šipraga <alsi@bang-olufsen.dk>
To: Andrew Lunn <andrew@lunn.ch>
To: Florian Fainelli <f.fainelli@gmail.com>
To: Vladimir Oltean <olteanv@gmail.com>
To: David S. Miller <davem@davemloft.net>
To: Eric Dumazet <edumazet@google.com>
To: Jakub Kicinski <kuba@kernel.org>
To: Paolo Abeni <pabeni@redhat.com>
To: Rob Herring <robh+dt@kernel.org>
To: Krzysztof Kozlowski <krzysztof.kozlowski+dt@linaro.org>
To: Conor Dooley <conor+dt@kernel.org>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: devicetree@vger.kernel.org
Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
Changes in v2:
- Fixed commit message formatting
- Added GROUP to LED group enum values. With that, moved the code that
  disables LED into a new function to keep 80-collumn limit.
- Dropped unused enable argument in rb8366rb_get_port_led()
- Fixed variable order in rtl8366rb_setup_led()
- Removed redundant led group test in rb8366rb_{g,s}et_port_led()
- Initialize ret as 0 in rtl8366rb_setup_leds()
- Updated comments related to LED blinking and setup
- Link to v1: https://lore.kernel.org/r/20240310-realtek-led-v1-0-4d9813ce938e@gmail.com

Changes in v1:
- Rebased on new relatek DSA drivers
- Improved commit messages
- Added commit to remove the reset assert during .remove
- Link to RFC: https://lore.kernel.org/r/20240106184651.3665-1-luizluca@gmail.com
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agonet: dsa: realtek: add LED drivers for rtl8366rb
Luiz Angelo Daros de Luca [Sat, 27 Apr 2024 05:11:30 +0000 (02:11 -0300)]
net: dsa: realtek: add LED drivers for rtl8366rb

This commit introduces LED drivers for rtl8366rb, enabling LEDs to be
described in the device tree using the same format as qca8k. Each port
can configure up to 4 LEDs.

If all LEDs in a group use the default state "keep", they will use the
default behavior after a reset. Changing the brightness of one LED,
either manually or by a trigger, will disable the default hardware
trigger and switch the entire LED group to manually controlled LEDs.
Once in this mode, there is no way to revert to hardware-controlled LEDs
(except by resetting the switch).

Software triggers function as expected with manually controlled LEDs.

Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agonet: dsa: realtek: do not assert reset on remove
Luiz Angelo Daros de Luca [Sat, 27 Apr 2024 05:11:29 +0000 (02:11 -0300)]
net: dsa: realtek: do not assert reset on remove

The necessity of asserting the reset on removal was previously
questioned, as DSA's own cleanup methods should suffice to prevent
traffic leakage[1].

When a driver has subdrivers controlled by devres, they will be
unregistered after the main driver's .remove is executed. If it asserts
a reset, the subdrivers will be unable to communicate with the hardware
during their cleanup. For LEDs, this means that they will fail to turn
off, resulting in a timeout error.

[1] https://lore.kernel.org/r/20240123215606.26716-9-luizluca@gmail.com/

Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agonet: dsa: realtek: keep default LED state in rtl8366rb
Luiz Angelo Daros de Luca [Sat, 27 Apr 2024 05:11:28 +0000 (02:11 -0300)]
net: dsa: realtek: keep default LED state in rtl8366rb

This switch family supports four LEDs for each of its six ports. Each
LED group is composed of one of these four LEDs from all six ports. LED
groups can be configured to display hardware information, such as link
activity, or manually controlled through a bitmap in registers
RTL8366RB_LED_0_1_CTRL_REG and RTL8366RB_LED_2_3_CTRL_REG.

After a reset, the default LED group configuration for groups 0 to 3
indicates, respectively, link activity, link at 1000M, 100M, and 10M, or
RTL8366RB_LED_CTRL_REG as 0x5432. These configurations are commonly used
for LED indications. However, the driver was replacing that
configuration to use manually controlled LEDs (RTL8366RB_LED_FORCE)
without providing a way for the OS to control them. The default
configuration is deemed more useful than fixed, uncontrollable turned-on
LEDs.

The driver was enabling/disabling LEDs during port_enable/disable.
However, these events occur when the port is administratively controlled
(up or down) and are not related to link presence. Additionally, when a
port N was disabled, the driver was turning off all LEDs for group N,
not only the corresponding LED for port N in any of those 4 groups. In
such cases, if port 0 was brought down, the LEDs for all ports in LED
group 0 would be turned off. As another side effect, the driver was
wrongly warning that port 5 didn't have an LED ("no LED for port 5").
Since showing the administrative state of ports is not an orthodox way
to use LEDs, it was not worth it to fix it and all this code was
dropped.

The code to disable LEDs was simplified only changing each LED group to
the RTL8366RB_LED_OFF state. Registers RTL8366RB_LED_0_1_CTRL_REG and
RTL8366RB_LED_2_3_CTRL_REG are only used when the corresponding LED
group is configured with RTL8366RB_LED_FORCE and they don't need to be
cleaned. The code still references an LED controlled by
RTL8366RB_INTERRUPT_CONTROL_REG, but as of now, no test device has
actually used it. Also, some magic numbers were replaced by macros.

Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoipv6: introduce dst_rt6_info() helper
Eric Dumazet [Fri, 26 Apr 2024 15:19:52 +0000 (15:19 +0000)]
ipv6: introduce dst_rt6_info() helper

Instead of (struct rt6_info *)dst casts, we can use :

 #define dst_rt6_info(_ptr) \
         container_of_const(_ptr, struct rt6_info, dst)

Some places needed missing const qualifiers :

ip6_confirm_neigh(), ipv6_anycast_destination(),
ipv6_unicast_destination(), has_gateway()

v2: added missing parts (David Ahern)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agobpf, docs: Clarify PC use in instruction-set.rst
Dave Thaler [Fri, 26 Apr 2024 23:11:26 +0000 (16:11 -0700)]
bpf, docs: Clarify PC use in instruction-set.rst

This patch elaborates on the use of PC by expanding the PC acronym,
explaining the units, and the relative position to which the offset
applies.

Signed-off-by: Dave Thaler <dthaler1968@googlemail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20240426231126.5130-1-dthaler1968@gmail.com
6 months agoMerge branch 'mlxsw-events-processing-performance'
David S. Miller [Mon, 29 Apr 2024 09:47:05 +0000 (10:47 +0100)]
Merge branch 'mlxsw-events-processing-performance'

Petr Machata says:

====================
mlxsw: Improve events processing performance

Amit Cohen writes:

Spectrum ASICs only support a single interrupt, it means that all the
events are handled by one IRQ (interrupt request) handler.

Currently, we schedule a tasklet to handle events in EQ, then we also use
tasklet for CQ, SDQ and RDQ. Tasklet runs in softIRQ (software IRQ)
context, and will be run on the same CPU which scheduled it. It means that
today we have one CPU which handles all the packets (both network packets
and EMADs) from hardware.

The existing implementation is not efficient and can be improved.

Measuring latency of EMADs in the driver (without the time in FW) shows
that latency is increased by factor of 28 (x28) when network traffic is
handled by the driver.

Measuring throughput in CPU shows that CPU can handle ~35% less packets
of specific flow when corrupted packets are also handled by the driver.
There are cases that these values even worse, we measure decrease of ~44%
packet rate.

This can be improved if network packet and EMADs will be handled in
parallel by several CPUs, and more than that, if different types of traffic
will be handled in parallel. We can achieve this using NAPI.

This set converts the driver to process completions from hardware via NAPI.
The idea is to add NAPI instance per CQ (which is mapped 1:1 to SDQ/RDQ),
which means that each DQ can be handled separately. we have DQ for EMADs
and DQs for each trap group (like LLDP, BGP, L3 drops, etc..). See more
details in commit messages.

An additional improvement which is done as part of this set is related to
doorbells' ring. The idea is to handle small chunks of Rx packets (which
is also recommended using NAPI) and ring doorbells once per chunk. This
reduces the access to hardware which is expensive (time wise) and might
take time because of memory barriers.

With this set we can see better performance.
To summerize:

EMADs latency:
+------------------------------------------------------------------------+
|                  | Before this set           | Now                     |
|------------------|---------------------------|-------------------------|
| Increased factor | x28                       | x1.5                    |
+------------------------------------------------------------------------+
Note that we can see even measurements that show better latency when
traffic is handled by the driver.

Throughput:
+------------------------------------------------------------------------+
|             | Before this set            | Now                         |
|-------------|----------------------------|-----------------------------|
| Reduced     | 35%                        | 6%                          |
| packet rate |                            |                             |
+------------------------------------------------------------------------+

Additional improvements are planned - use page pool for buffer allocations
and avoid cache miss of each SKB using napi_build_skb().

Patch set overview:
Patches #1-#2 improve access to hardware by reducing dorbells' rings
Patch #3-#4 are preaparations for NAPI usage
Patch #5 converts the driver to use NAPI
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agomlxsw: pci: Use NAPI for event processing
Amit Cohen [Fri, 26 Apr 2024 12:42:26 +0000 (14:42 +0200)]
mlxsw: pci: Use NAPI for event processing

Spectrum ASICs only support a single interrupt, that means that all the
events are handled by one IRQ (interrupt request) handler. Once an
interrupt is received, we schedule tasklet to handle events from EQ and
then schedule tasklets to handle completions from CQs. Tasklet runs in
softIRQ (software IRQ) context, and will be run on the same CPU which
scheduled it. That means that today we use only one CPU to handle all the
packets (both network packets and EMADs) from hardware.

This can be improved using NAPI. The idea is to use NAPI instance per
CQ, which is mapped 1:1 to DQ (RDQ or SDQ). NAPI poll method can be run
in kernel thread, so then the driver will be able to handle WQEs in several
CPUs. Convert the existing code to use NAPI APIs.

Add NAPI instance as part of 'struct mlxsw_pci_queue' and initialize it
as part of CQs initialization. Set the appropriate poll method and dummy
net device, according to queue number, similar to tasklet setup. For CQs
which are used for completions of RDQ, use Rx poll method and
'napi_dev_rx', which is set as 'threaded'. It means that Rx poll method
will run in kernel context, so several RDQs will be handled in parallel.
For CQs which are used for completions of SDQ, use Tx poll method and
'napi_dev_tx', this method will run in softIRQ context, as it is
recommended in NAPI documentation, as Tx packets' processing is short task.

Convert mlxsw_pci_cq_{rx,tx}_tasklet() to poll methods. Handle 'budget'
argument - ignore it in Tx poll method, as it is recommended to not limit
Tx processing. For Rx processing, handle up to 'budget' completions.
Return 'work_done' which is the amount of completions that were handled.

Handle the following cases:
1. After processing 'budget' completions, the driver still has work to do:
   Return work-done = budget. In that case, the NAPI instance will be
   polled again (without the need to be rescheduled). Do not re-arm the
   queue, as NAPI will handle the reschedule, so we do not have to involve
   hardware to send an additional interrupt for the completions that should
   be processed.

2. Event processing has been completed:
   Call napi_complete_done() to mark NAPI processing as completed, which
   means that the poll method will not be rescheduled. Re-arm the queue,
   as all completions were handled.

   In case that poll method handled exactly 'budget' completions, return
   work-done = budget -1, to distinguish from the case that driver still
   has completions to handle. Otherwise, return the amount of completions
   that were handled.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agomlxsw: pci: Reorganize 'mlxsw_pci_queue' structure
Amit Cohen [Fri, 26 Apr 2024 12:42:25 +0000 (14:42 +0200)]
mlxsw: pci: Reorganize 'mlxsw_pci_queue' structure

The next patch will set the driver to use NAPI for event processing. Then
tasklet mechanism will be used only for EQ. Reorganize 'mlxsw_pci_queue'
to hold EQ and CQ attributes in a union. For now, add tasklet for both EQ
and CQ. This will be changed in the next patch, as 'tasklet_struct' will be
replaced with NAPI instance.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agomlxsw: pci: Initialize dummy net devices for NAPI
Amit Cohen [Fri, 26 Apr 2024 12:42:24 +0000 (14:42 +0200)]
mlxsw: pci: Initialize dummy net devices for NAPI

mlxsw will use NAPI for event processing in a next patch. As preparation,
add two dummy net devices and initialize them.

NAPI instance should be attached to net device. Usually each queue is used
by a single net device in network drivers, so the mapping between net
device to NAPI instance is intuitive. In our case, Rx queues are not per
port, they are per trap-group. Tx queues are mapped to net devices, but we
do not have a separate queue for each local port, several ports share the
same queue.

Use init_dummy_netdev() to initialize dummy net devices for NAPI.

To run NAPI poll method in a kernel thread, the net device which NAPI
instance is attached to should be marked as 'threaded'. It is
recommended to handle Tx packets in softIRQ context, as usually this is
a short task - just free the Tx packet which has been transmitted.
Rx packets handling is more complicated task, so drivers can use a
dedicated kernel thread to process them. It allows processing packets from
different Rx queues in parallel. We would like to handle only Rx packets in
kernel threads, which means that we will use two dummy net devices
(one for Rx and one for Tx). Set only one of them with 'threaded' as it
will be used for Rx processing. Do not fail in case that setting 'threaded'
fails, as it is better to use regular softIRQ NAPI rather than preventing
the driver from loading.

Note that the net devices are initialized with init_dummy_netdev(), so
they are not registered, which means that they will not be visible to user.
It will not be possible to change 'threaded' configuration from user
space, but it is reasonable in our case, as there is no another
configuration which makes sense, considering that user has no influence
on the usage of each queue.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agomlxsw: pci: Ring RDQ and CQ doorbells once per several completions
Amit Cohen [Fri, 26 Apr 2024 12:42:23 +0000 (14:42 +0200)]
mlxsw: pci: Ring RDQ and CQ doorbells once per several completions

Currently, for each CQE in CQ, we ring CQ doorbell, then handle RDQ and
ring RDQ doorbell. Finally we ring CQ arm doorbell - once per CQ tasklet.

The idea of ringing CQ doorbell before RDQ doorbell, is to be sure that
when we post new WQE (after RDQ is handled), there is an available CQE.
This was done because of a hardware bug as part of
commit c9ebea04cb1b ("mlxsw: pci: Ring CQ's doorbell before RDQ's").

There is no real reason to ring RDQ and CQ doorbells for each completion,
it is better to handle several completions and reduce number of ringings,
as access to hardware is expensive (time wise) and might take time because
of memory barriers.

A previous patch changed CQ tasklet to handle up to 64 Rx packets. With
this limitation, we can ring CQ and RDQ doorbells once per CQ tasklet.
The counters of the doorbells are increased by the amount of packets
that we handled, then the device will know for which completion to send
an additional event.

To avoid reordering CQ and RDQ doorbells' ring, let the tasklet to ring
also RDQ doorbell, mlxsw_pci_cqe_rdq_handle() handles the counter but
does not ring the doorbell.

Note that with this change there is no need to copy the CQE, as we ring CQ
doorbell only after Rx packet processing (which uses the CQE) is done.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agomlxsw: pci: Handle up to 64 Rx completions in tasklet
Amit Cohen [Fri, 26 Apr 2024 12:42:22 +0000 (14:42 +0200)]
mlxsw: pci: Handle up to 64 Rx completions in tasklet

We can get many completions in one interrupt. Currently, the CQ tasklet
handles up to half queue size completions, and then arms the hardware to
generate additional events, which means that in case that there were
additional completions that we did not handle, we will get immediately an
additional interrupt to handle the rest.

The decision to handle up to half of the queue size is arbitrary and was
determined in 2015, when mlxsw driver was added to the kernel. One
additional fact that should be taken into account is that while WQEs
from RDQ are handled, the CPU that handles the tasklet is dedicated for
this task, which means that we might hold the CPU for a long time.

Handle WQEs in smaller chucks, then arm CQ doorbell to notify the hardware
to send additional notifications. Set the chunk size to 64 as this number
is recommended using NAPI and the driver will use NAPI in a next patch.
Note that for now we use ARM doorbell to retrigger CQ tasklet, but with
NAPI it will be more efficient as software will reschedule the poll
method and we will not involve hardware for that.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoipv6: use call_rcu_hurry() in fib6_info_release()
Eric Dumazet [Fri, 26 Apr 2024 10:47:22 +0000 (10:47 +0000)]
ipv6: use call_rcu_hurry() in fib6_info_release()

This is a followup of commit c4e86b4363ac ("net: add two more
call_rcu_hurry()")

fib6_info_destroy_rcu() is calling nexthop_put() or fib6_nh_release()

We must not delay it too much or risk unregister_netdevice/ref_tracker
traces because references to netdev are not released in time.

This should speedup device/netns dismantles when CONFIG_RCU_LAZY=y

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoinet: use call_rcu_hurry() in inet_free_ifa()
Eric Dumazet [Fri, 26 Apr 2024 07:02:02 +0000 (07:02 +0000)]
inet: use call_rcu_hurry() in inet_free_ifa()

This is a followup of commit c4e86b4363ac ("net: add two more
call_rcu_hurry()")

Our reference to ifa->ifa_dev must be freed ASAP
to release the reference to the netdev the same way.

inet_rcu_free_ifa()

in_dev_put()
 -> in_dev_finish_destroy()
   -> netdev_put()

This should speedup device/netns dismantles when CONFIG_RCU_LAZY=y

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agonet: give more chances to rcu in netdev_wait_allrefs_any()
Eric Dumazet [Fri, 26 Apr 2024 06:42:22 +0000 (06:42 +0000)]
net: give more chances to rcu in netdev_wait_allrefs_any()

This came while reviewing commit c4e86b4363ac ("net: add two more
call_rcu_hurry()").

Paolo asked if adding one synchronize_rcu() would help.

While synchronize_rcu() does not help, making sure to call
rcu_barrier() before msleep(wait) is definitely helping
to make sure lazy call_rcu() are completed.

Instead of waiting ~100 seconds in my tests, the ref_tracker
splats occurs one time only, and netdev_wait_allrefs_any()
latency is reduced to the strict minimum.

Ideally we should audit our call_rcu() users to make sure
no refcount (or cascading call_rcu()) is held too long,
because rcu_barrier() is quite expensive.

Fixes: 0e4be9e57e8c ("net: use exponential backoff in netdev_wait_allrefs")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/all/28bbf698-befb-42f6-b561-851c67f464aa@kernel.org/T/#m76d73ed6b03cd930778ac4d20a777f22a08d6824
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agonet: ethernet: ti: am65-cpsw-qos: Add support to taprio for past base_time
Tanmay Patil [Thu, 25 Apr 2024 10:31:42 +0000 (16:01 +0530)]
net: ethernet: ti: am65-cpsw-qos: Add support to taprio for past base_time

If the base-time for taprio is in the past, start the schedule at the time
of the form "base_time + N*cycle_time" where N is the smallest possible
integer such that the above time is in the future.

Signed-off-by: Tanmay Patil <t-patil@ti.com>
Signed-off-by: Chintan Vankar <c-vankar@ti.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 months agotools: ynl: don't append doc of missing type directly to the type
Jakub Kicinski [Fri, 26 Apr 2024 00:31:11 +0000 (17:31 -0700)]
tools: ynl: don't append doc of missing type directly to the type

When using YNL in tests appending the doc string to the type
name makes it harder to check that we got the correct error.
Put the doc under a separate key.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20240426003111.359285-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agoMerge branch 'selftests-drv-net-round-some-sharp-edges'
Jakub Kicinski [Fri, 26 Apr 2024 23:10:27 +0000 (16:10 -0700)]
Merge branch 'selftests-drv-net-round-some-sharp-edges'

Jakub Kicinski says:

====================
selftests: drv-net: round some sharp edges

I had to explain how to run the driver tests twice already.
Improve the README so we can just point to it.
Improve the config validation.

v1: https://lore.kernel.org/r/20240424221444.4194069-1-kuba@kernel.org/
====================

Link: https://lore.kernel.org/r/20240425222341.309778-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agoselftests: drv-net: validate the environment
Jakub Kicinski [Thu, 25 Apr 2024 22:23:41 +0000 (15:23 -0700)]
selftests: drv-net: validate the environment

Throw a slightly more helpful exception when env variables
are partially populated. Prior to this change we'd get
a dictionary key exception somewhere later on.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240425222341.309778-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agoselftests: drv-net: reimplement the config parser
Jakub Kicinski [Thu, 25 Apr 2024 22:23:40 +0000 (15:23 -0700)]
selftests: drv-net: reimplement the config parser

The shell lexer is not helping much, do very basic parsing
manually.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240425222341.309778-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agoselftests: drv-net: extend the README with more info and example
Jakub Kicinski [Thu, 25 Apr 2024 22:23:39 +0000 (15:23 -0700)]
selftests: drv-net: extend the README with more info and example

Add more info to the README. It's also now copied to GitHub for
increased visibility:

 https://github.com/linux-netdev/nipa/wiki/Running-driver-tests

Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240425222341.309778-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agotcp: fix tcp_grow_skb() vs tstamps
Eric Dumazet [Thu, 25 Apr 2024 19:34:50 +0000 (19:34 +0000)]
tcp: fix tcp_grow_skb() vs tstamps

I forgot to call tcp_skb_collapse_tstamp() in the
case we consume the second skb in write queue.

Neal suggested to create a common helper used by tcp_mtu_probe()
and tcp_grow_skb().

Fixes: 8ee602c63520 ("tcp: try to send bigger TSO packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20240425193450.411640-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agonet: dsa: lan9303: use ethtool_puts() for lan9303_get_strings()
Justin Stitt [Thu, 25 Apr 2024 01:19:13 +0000 (01:19 +0000)]
net: dsa: lan9303: use ethtool_puts() for lan9303_get_strings()

This pattern of strncpy with some pointer arithmetic setting fixed-sized
intervals with string literal data is a bit weird so let's use
ethtool_puts() as this has more obvious behavior and is less-error
prone.

Nicely, we also get to drop a usage of the now deprecated strncpy() [1].

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings
Link: https://github.com/KSPP/linux/issues/90
Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Justin Stitt <justinstitt@google.com>
Link: https://lore.kernel.org/r/20240425-strncpy-drivers-net-dsa-lan9303-core-c-v4-1-9fafd419d7bb@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agobpf_helpers.h: Define bpf_tail_call_static when building with GCC
Jose E. Marchesi [Fri, 26 Apr 2024 14:51:58 +0000 (16:51 +0200)]
bpf_helpers.h: Define bpf_tail_call_static when building with GCC

The definition of bpf_tail_call_static in tools/lib/bpf/bpf_helpers.h
is guarded by a preprocessor check to assure that clang is recent
enough to support it.  This patch updates the guard so the function is
compiled when using GCC 13 or later as well.

Tested in bpf-next master. No regressions.

Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240426145158.14409-1-jose.marchesi@oracle.com
7 months agoMerge branch 'implement-reset-reason-mechanism-to-detect'
Paolo Abeni [Fri, 26 Apr 2024 13:34:04 +0000 (15:34 +0200)]
Merge branch 'implement-reset-reason-mechanism-to-detect'

Jason Xing says:

====================
Implement reset reason mechanism to detect

From: Jason Xing <kernelxing@tencent.com>

In production, there are so many cases about why the RST skb is sent but
we don't have a very convenient/fast method to detect the exact underlying
reasons.

RST is implemented in two kinds: passive kind (like tcp_v4_send_reset())
and active kind (like tcp_send_active_reset()). The former can be traced
carefully 1) in TCP, with the help of drop reasons, which is based on
Eric's idea[1], 2) in MPTCP, with the help of reset options defined in
RFC 8684. The latter is relatively independent, which should be
implemented on our own, such as active reset reasons which can not be
replace by skb drop reason or something like this.

In this series, I focus on the fundamental implement mostly about how
the rstreason mechanism works and give the detailed passive part as an
example, not including the active reset part. In future, we can go
further and refine those NOT_SPECIFIED reasons.

Here are some examples when tracing:
<idle>-0       [002] ..s1.  1830.262425: tcp_send_reset: skbaddr=x
        skaddr=x src=x dest=x state=x reason=NOT_SPECIFIED
<idle>-0       [002] ..s1.  1830.262425: tcp_send_reset: skbaddr=x
        skaddr=x src=x dest=x state=x reason=NO_SOCKET

[1]
Link: https://lore.kernel.org/all/CANn89iJw8x-LqgsWOeJQQvgVg6DnL5aBRLi10QN2WBdr+X4k=w@mail.gmail.com/
====================

Link: https://lore.kernel.org/r/20240425031340.46946-1-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 months agorstreason: make it work in trace world
Jason Xing [Thu, 25 Apr 2024 03:13:40 +0000 (11:13 +0800)]
rstreason: make it work in trace world

At last, we should let it work by introducing this reset reason in
trace world.

One of the possible expected outputs is:
... tcp_send_reset: skbaddr=xxx skaddr=xxx src=xxx dest=xxx
state=TCP_ESTABLISHED reason=NOT_SPECIFIED

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 months agomptcp: introducing a helper into active reset logic
Jason Xing [Thu, 25 Apr 2024 03:13:39 +0000 (11:13 +0800)]
mptcp: introducing a helper into active reset logic

Since we have mapped every mptcp reset reason definition in enum
sk_rst_reason, introducing a new helper can cover some missing places
where we have already set the subflow->reset_reason.

Note: using SK_RST_REASON_NOT_SPECIFIED is the same as
SK_RST_REASON_MPTCP_RST_EUNSPEC. They are both unknown. So we can convert
it directly.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 months agomptcp: support rstreason for passive reset
Jason Xing [Thu, 25 Apr 2024 03:13:38 +0000 (11:13 +0800)]
mptcp: support rstreason for passive reset

It relies on what reset options in the skb are as rfc8684 says. Reusing
this logic can save us much energy. This patch replaces most of the prior
NOT_SPECIFIED reasons.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 months agotcp: support rstreason for passive reset
Jason Xing [Thu, 25 Apr 2024 03:13:37 +0000 (11:13 +0800)]
tcp: support rstreason for passive reset

Reuse the dropreason logic to show the exact reason of tcp reset,
so we can finally display the corresponding item in enum sk_reset_reason
instead of reinventing new reset reasons. This patch replaces all
the prior NOT_SPECIFIED reasons.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 months agorstreason: prepare for active reset
Jason Xing [Thu, 25 Apr 2024 03:13:36 +0000 (11:13 +0800)]
rstreason: prepare for active reset

Like what we did to passive reset:
only passing possible reset reason in each active reset path.

No functional changes.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 months agorstreason: prepare for passive reset
Jason Xing [Thu, 25 Apr 2024 03:13:35 +0000 (11:13 +0800)]
rstreason: prepare for passive reset

Adjust the parameter and support passing reason of reset which
is for now NOT_SPECIFIED. No functional changes.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 months agonet: introduce rstreason to detect why the RST is sent
Jason Xing [Thu, 25 Apr 2024 03:13:34 +0000 (11:13 +0800)]
net: introduce rstreason to detect why the RST is sent

Add a new standalone file for the easy future extension to support
both active reset and passive reset in the TCP/DCCP/MPTCP protocols.

This patch only does the preparations for reset reason mechanism,
nothing else changes.

The reset reasons are divided into three parts:
1) reuse drop reasons for passive reset in TCP
2) our own independent reasons which aren't relying on other reasons at all
3) reuse MP_TCPRST option for MPTCP

The benefits of a standalone reset reason are listed here:
1) it can cover more than one case, such as reset reasons in MPTCP,
active reset reasons.
2) people can easily/fastly understand and maintain this mechanism.
3) we get unified format of output with prefix stripped.
4) more new reset reasons are on the way
...

I will implement the basic codes of active/passive reset reason in
those three protocols, which are not complete for this moment. For
passive reset part in TCP, I only introduce the NO_SOCKET common case
which could be set as an example.

After this series applied, it will have the ability to open a new
gate to let other people contribute more reasons into it :)

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 months agoigc: Add Tx hardware timestamp request for AF_XDP zero-copy packet
Song Yoong Siang [Wed, 24 Apr 2024 21:02:54 +0000 (14:02 -0700)]
igc: Add Tx hardware timestamp request for AF_XDP zero-copy packet

This patch adds support to per-packet Tx hardware timestamp request to
AF_XDP zero-copy packet via XDP Tx metadata framework. Please note that
user needs to enable Tx HW timestamp capability via igc_ioctl() with
SIOCSHWTSTAMP cmd before sending xsk Tx hardware timestamp request.

Same as implementation in RX timestamp XDP hints kfunc metadata, Timer 0
(adjustable clock) is used in xsk Tx hardware timestamp. i225/i226 have
four sets of timestamping registers. *skb and *xsk_tx_buffer pointers
are used to indicate whether the timestamping register is already occupied.

Furthermore, a boolean variable named xsk_pending_ts is used to hold the
transmit completion until the tx hardware timestamp is ready. This is
because, for i225/i226, the timestamp notification event comes some time
after the transmit completion event. The driver will retrigger hardware irq
to clean the packet after retrieve the tx hardware timestamp.

Besides, xsk_meta is added into struct igc_tx_timestamp_request as a hook
to the metadata location of the transmit packet. When the Tx timestamp
interrupt is fired, the interrupt handler will copy the value of Tx hwts
into metadata location via xsk_tx_metadata_complete().

This patch is tested with tools/testing/selftests/bpf/xdp_hw_metadata
on Intel ADL-S platform. Below are the test steps and results.

Test Step 1: Run xdp_hw_metadata app
 ./xdp_hw_metadata <iface> > /dev/shm/result.log

Test Step 2: Enable Tx hardware timestamp
 hwstamp_ctl -i <iface> -t 1 -r 1

Test Step 3: Run ptp4l and phc2sys for time synchronization

Test Step 4: Generate UDP packets with 1ms interval for 10s
 trafgen --dev <iface> '{eth(da=<addr>), udp(dp=9091)}' -t 1ms -n 10000

Test Step 5: Rerun Step 1-3 with 10s iperf3 as background traffic

Test Step 6: Rerun Step 1-4 with 10s iperf3 as background traffic

Based on iperf3 results below, the impact of holding tx completion to
throughput is not observable.

Result of last UDP packet (no. 10000) in Step 4:
poll: 1 (0) skip=99 fail=0 redir=10000
xsk_ring_cons__peek: 1
0x5640a37972d0: rx_desc[9999]->addr=f2110 addr=f2110 comp_addr=f2110 EoP
rx_hash: 0x2049BE1D with RSS type:0x1
HW RX-time:   1679819246792971268 (sec:1679819246.7930) delta to User RX-time sec:0.0000 (14.990 usec)
XDP RX-time:   1679819246792981987 (sec:1679819246.7930) delta to User RX-time sec:0.0000 (4.271 usec)
No rx_vlan_tci or rx_vlan_proto, err=-95
0x5640a37972d0: ping-pong with csum=ab19 (want 315b) csum_start=34 csum_offset=6
0x5640a37972d0: complete tx idx=9999 addr=f010
HW TX-complete-time:   1679819246793036971 (sec:1679819246.7930) delta to User TX-complete-time sec:0.0001 (77.656 usec)
XDP RX-time:   1679819246792981987 (sec:1679819246.7930) delta to User TX-complete-time sec:0.0001 (132.640 usec)
HW RX-time:   1679819246792971268 (sec:1679819246.7930) delta to HW TX-complete-time sec:0.0001 (65.703 usec)
0x5640a37972d0: complete rx idx=10127 addr=f2110

Result of iperf3 without tx hwts request in step 5:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  2.74 GBytes  2.36 Gbits/sec    0             sender
[  5]   0.00-10.05  sec  2.74 GBytes  2.34 Gbits/sec                  receiver

Result of iperf3 running parallel with trafgen command in step 6:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  2.74 GBytes  2.36 Gbits/sec    0             sender
[  5]   0.00-10.04  sec  2.74 GBytes  2.34 Gbits/sec                  receiver

Co-developed-by: Lai Peter Jun Ann <jun.ann.lai@intel.com>
Signed-off-by: Lai Peter Jun Ann <jun.ann.lai@intel.com>
Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Tested-by: Naama Meir <naamax.meir@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://lore.kernel.org/r/20240424210256.3440903-1-anthony.l.nguyen@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 months agoMerge branch 'selftests-virtio_net-introduce-initial-testing-infrastructure'
Paolo Abeni [Fri, 26 Apr 2024 11:26:56 +0000 (13:26 +0200)]
Merge branch 'selftests-virtio_net-introduce-initial-testing-infrastructure'

Jiri Pirko says:

====================
selftests: virtio_net: introduce initial testing infrastructure

This patchset aims at introducing very basic initial infrastructure
for virtio_net testing, namely it focuses on virtio feature testing.

The first patch adds support for debugfs for virtio devices, allowing
user to filter features to pretend to be driver that is not capable
of the filtered feature.

Example:
$ cat /sys/bus/virtio/devices/virtio0/features
1110010111111111111101010000110010000000100000000000000000000000
$ echo "5" >/sys/kernel/debug/virtio/virtio0/filter_feature_add
$ cat /sys/kernel/debug/virtio/virtio0/filter_features
5
$ echo "virtio0" > /sys/bus/virtio/drivers/virtio_net/unbind
$ echo "virtio0" > /sys/bus/virtio/drivers/virtio_net/bind
$ cat /sys/bus/virtio/devices/virtio0/features
1110000111111111111101010000110010000000100000000000000000000000

Leverage that in the last patch that lays ground for virtio_net
selftests testing, including very basic F_MAC feature test.

To run this, do:
$ make -C tools/testing/selftests/ TARGETS=drivers/net/virtio_net/ run_tests

It is assumed, as with lot of other selftests in the net group,
that there are netdevices connected back-to-back. In this case,
two virtio_net devices connected back to back. If you use "tap" qemu
netdevice type, to configure this loop on a hypervisor, one may use
this script:

DEV1="$1"
DEV2="$2"

sudo tc qdisc add dev $DEV1 clsact
sudo tc qdisc add dev $DEV2 clsact
sudo tc filter add dev $DEV1 ingress protocol all pref 1 matchall action mirred egress redirect dev $DEV2
sudo tc filter add dev $DEV2 ingress protocol all pref 1 matchall action mirred egress redirect dev $DEV1
sudo ip link set $DEV1 up
sudo ip link set $DEV2 up

Another possibility is to use virtme-ng like this:
$ vng --network=loop
or directly:
$ vng --network=loop -- make -C tools/testing/selftests/ TARGETS=drivers/net/virtio_net/ run_tests

"loop" network type will take care of creating two "hubport" qemu netdevs
putting them into a single hub.

To do it manually with qemu, pass following command line options:
-nic hubport,hubid=1,id=nd0,model=virtio-net-pci
-nic hubport,hubid=1,id=nd1,model=virtio-net-pci
====================

Link: https://lore.kernel.org/r/20240424104049.3935572-1-jiri@resnulli.us
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 months agoselftests: virtio_net: add initial tests
Jiri Pirko [Wed, 24 Apr 2024 10:40:49 +0000 (12:40 +0200)]
selftests: virtio_net: add initial tests

Introduce initial tests for virtio_net driver. Focus on feature testing
leveraging previously introduced debugfs feature filtering
infrastructure. Add very basic ping and F_MAC feature tests.

To run this, do:
$ make -C tools/testing/selftests/ TARGETS=drivers/net/virtio_net/ run_tests

Run it on a system with 2 virtio_net devices connected back-to-back
on the hypervisor.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Tested-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 months agoselftests: forwarding: add wait_for_dev() helper
Jiri Pirko [Wed, 24 Apr 2024 10:40:48 +0000 (12:40 +0200)]
selftests: forwarding: add wait_for_dev() helper

The existing setup_wait*() helper family check the status of the
interface to be up. Introduce wait_for_dev() to wait for the netdevice
to appear, for example after test script does manual device bind.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 months agoselftests: forwarding: add check_driver() helper
Jiri Pirko [Wed, 24 Apr 2024 10:40:47 +0000 (12:40 +0200)]
selftests: forwarding: add check_driver() helper

Add a helper to be used to check if the netdevice is backed by specified
driver.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 months agoselftests: forwarding: add ability to assemble NETIFS array by driver name
Jiri Pirko [Wed, 24 Apr 2024 10:40:46 +0000 (12:40 +0200)]
selftests: forwarding: add ability to assemble NETIFS array by driver name

Allow driver tests to work without specifying the netdevice names.
Introduce a possibility to search for available netdevices according to
set driver name. Allow test to specify the name by setting
NETIF_FIND_DRIVER variable.

Note that user overrides this either by passing netdevice names on the
command line or by declaring NETIFS array in custom forwarding.config
configuration file.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 months agovirtio: add debugfs infrastructure to allow to debug virtio features
Jiri Pirko [Wed, 24 Apr 2024 10:40:45 +0000 (12:40 +0200)]
virtio: add debugfs infrastructure to allow to debug virtio features

Currently there is no way for user to set what features the driver
should obey or not, it is hard wired in the code.

In order to be able to debug the device behavior in case some feature is
disabled, introduce a debugfs infrastructure with couple of files
allowing user to see what features the device advertises and
to set filter for features used by driver.

Example:
$cat /sys/bus/virtio/devices/virtio0/features
1110010111111111111101010000110010000000100000000000000000000000
$ echo "5" >/sys/kernel/debug/virtio/virtio0/filter_feature_add
$ cat /sys/kernel/debug/virtio/virtio0/filter_features
5
$ echo "virtio0" > /sys/bus/virtio/drivers/virtio_net/unbind
$ echo "virtio0" > /sys/bus/virtio/drivers/virtio_net/bind
$ cat /sys/bus/virtio/devices/virtio0/features
1110000111111111111101010000110010000000100000000000000000000000

Note that sysfs "features" now already exists, this patch does not
touch it.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 months agoMerge branch 'net-hsr-add-support-for-hsr-san-redbox'
Paolo Abeni [Fri, 26 Apr 2024 10:04:45 +0000 (12:04 +0200)]
Merge branch 'net-hsr-add-support-for-hsr-san-redbox'

Lukasz Majewski says:

====================
net: hsr: Add support for HSR-SAN (RedBOX)

This patch set provides v6 of HSR-SAN (RedBOX) as well as hsr_redbox.sh
test script.

The most straightforward way to test those patches is to use buildroot
(2024.02.01) to create rootfs and QEMU based environment to run x86_64
Linux.

Then one shall run hsr_redbox.sh and hsr_ping.sh from
tools/testing/selftests/net/hsr.
====================

Link: https://lore.kernel.org/r/20240423124908.2073400-1-lukma@denx.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 months agotest: hsr: Add test for HSR RedBOX (HSR-SAN) mode of operation
Lukasz Majewski [Tue, 23 Apr 2024 12:49:08 +0000 (14:49 +0200)]
test: hsr: Add test for HSR RedBOX (HSR-SAN) mode of operation

This patch adds hsr_redbox.sh script to test if HSR-SAN mode of operation
works correctly.

Signed-off-by: Lukasz Majewski <lukma@denx.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 months agotest: hsr: Extract version agnostic information from ping command output
Lukasz Majewski [Tue, 23 Apr 2024 12:49:07 +0000 (14:49 +0200)]
test: hsr: Extract version agnostic information from ping command output

Current code checks if ping command output match hardcoded pattern:
"10 packets transmitted, 10 received, 0% packet loss,".

Such approach will work only from one ping program version (for which
this test has been originally written).
This patch address problem when ping with different summary output
like "10 packets transmitted, 10 packets received, 0% packet" is
used to run this test - for example one from busybox (as the test
system runs in QEMU with rootfs created with buildroot).

The fix is to modify output of ping command to be agnostic to ping
version used on the platform.

Signed-off-by: Lukasz Majewski <lukma@denx.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 months agotest: hsr: Move common code to hsr_common.sh file
Lukasz Majewski [Tue, 23 Apr 2024 12:49:06 +0000 (14:49 +0200)]
test: hsr: Move common code to hsr_common.sh file

Some of the code already present in the hsr_ping.sh test program can be
moved to a separate script file, so it can be reused by other HSR
functionality (like HSR-SAN) tests.

Signed-off-by: Lukasz Majewski <lukma@denx.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 months agotest: hsr: Remove script code already implemented in lib.sh
Lukasz Majewski [Tue, 23 Apr 2024 12:49:05 +0000 (14:49 +0200)]
test: hsr: Remove script code already implemented in lib.sh

Some parts (like netns creation and cleanup) of hsr_ping.sh script are
already implemented in ../lib.sh common script, so can be replaced by it.

Signed-off-by: Lukasz Majewski <lukma@denx.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 months agonet: hsr: Provide RedBox support (HSR-SAN)
Lukasz Majewski [Tue, 23 Apr 2024 12:49:04 +0000 (14:49 +0200)]
net: hsr: Provide RedBox support (HSR-SAN)

Introduce RedBox support (HSR-SAN to be more precise) for HSR networks.
Following traffic reduction optimizations have been implemented:
- Do not send HSR supervisory frames to Port C (interlink)
- Do not forward to HSR ring frames addressed to Port C
- Do not forward to Port C frames from HSR ring
- Do not send duplicate HSR frame to HSR ring when destination is Port C

The corresponding patch to modify iptable2 sources has already been sent:
https://lore.kernel.org/netdev/20240308145729.490863-1-lukma@denx.de/T/

Testing procedure (veth and netns):
-----------------------------------
One shall run:
linux-vanila/tools/testing/selftests/net/hsr/hsr_redbox.sh
(Detailed description of the setup one can find in the test
script file).

Testing procedure (real hardware):
----------------------------------
The EVB-KSZ9477 has been used for testing on net-next branch
(SHA1: 5fc68320c1fb3c7d456ddcae0b4757326a043e6f).

Ports 4/5 were used for SW managed HSR (hsr1) as first hsr0 for ports 1/2
(with HW offloading for ksz9477) was created. Port 3 has been used as
interlink port (single USB-ETH dongle).

Configuration - RedBox (EVB-KSZ9477):
if link set lan1 down;ip link set lan2 down
ip link add name hsr0 type hsr slave1 lan1 slave2 lan2 supervision 45 version 1
ip link add name hsr1 type hsr slave1 lan4 slave2 lan5 interlink lan3 supervision 45 version 1
ip link set lan4 up;ip link set lan5 up
ip link set lan3 up
ip addr add 192.168.0.11/24 dev hsr1
ip link set hsr1 up

Configuration - DAN-H (EVB-KSZ9477):

ip link set lan1 down;ip link set lan2 down
ip link add name hsr0 type hsr slave1 lan1 slave2 lan2 supervision 45 version 1
ip link add name hsr1 type hsr slave1 lan4 slave2 lan5 supervision 45 version 1
ip link set lan4 up;ip link set lan5 up
ip addr add 192.168.0.12/24 dev hsr1
ip link set hsr1 up

This approach uses only SW based HSR devices (hsr1).

--------------          -----------------       ------------
DAN-H  Port5 | <------> | Port5         |       |
       Port4 | <------> | Port4   Port3 | <---> | PC
             |          | (RedBox)      |       | (USB-ETH)
EVB-KSZ9477  |          | EVB-KSZ9477   |       |
--------------          -----------------       ------------

Signed-off-by: Lukasz Majewski <lukma@denx.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 months agonet/sched: fix false lockdep warning on qdisc root lock
Davide Caratti [Thu, 18 Apr 2024 13:50:11 +0000 (15:50 +0200)]
net/sched: fix false lockdep warning on qdisc root lock

Xiumei and Christoph reported the following lockdep splat, complaining of
the qdisc root lock being taken twice:

 ============================================
 WARNING: possible recursive locking detected
 6.7.0-rc3+ #598 Not tainted
 --------------------------------------------
 swapper/2/0 is trying to acquire lock:
 ffff888177190110 (&sch->q.lock){+.-.}-{2:2}, at: __dev_queue_xmit+0x1560/0x2e70

 but task is already holding lock:
 ffff88811995a110 (&sch->q.lock){+.-.}-{2:2}, at: __dev_queue_xmit+0x1560/0x2e70

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&sch->q.lock);
   lock(&sch->q.lock);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 5 locks held by swapper/2/0:
  #0: ffff888135a09d98 ((&in_dev->mr_ifc_timer)){+.-.}-{0:0}, at: call_timer_fn+0x11a/0x510
  #1: ffffffffaaee5260 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x2c0/0x1ed0
  #2: ffffffffaaee5200 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x209/0x2e70
  #3: ffff88811995a110 (&sch->q.lock){+.-.}-{2:2}, at: __dev_queue_xmit+0x1560/0x2e70
  #4: ffffffffaaee5200 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x209/0x2e70

 stack backtrace:
 CPU: 2 PID: 0 Comm: swapper/2 Not tainted 6.7.0-rc3+ #598
 Hardware name: Red Hat KVM, BIOS 1.13.0-2.module+el8.3.0+7353+9de0a3cc 04/01/2014
 Call Trace:
  <IRQ>
  dump_stack_lvl+0x4a/0x80
  __lock_acquire+0xfdd/0x3150
  lock_acquire+0x1ca/0x540
  _raw_spin_lock+0x34/0x80
  __dev_queue_xmit+0x1560/0x2e70
  tcf_mirred_act+0x82e/0x1260 [act_mirred]
  tcf_action_exec+0x161/0x480
  tcf_classify+0x689/0x1170
  prio_enqueue+0x316/0x660 [sch_prio]
  dev_qdisc_enqueue+0x46/0x220
  __dev_queue_xmit+0x1615/0x2e70
  ip_finish_output2+0x1218/0x1ed0
  __ip_finish_output+0x8b3/0x1350
  ip_output+0x163/0x4e0
  igmp_ifc_timer_expire+0x44b/0x930
  call_timer_fn+0x1a2/0x510
  run_timer_softirq+0x54d/0x11a0
  __do_softirq+0x1b3/0x88f
  irq_exit_rcu+0x18f/0x1e0
  sysvec_apic_timer_interrupt+0x6f/0x90
  </IRQ>

This happens when TC does a mirred egress redirect from the root qdisc of
device A to the root qdisc of device B. As long as these two locks aren't
protecting the same qdisc, they can be acquired in chain: add a per-qdisc
lockdep key to silence false warnings.
This dynamic key should safely replace the static key we have in sch_htb:
it was added to allow enqueueing to the device "direct qdisc" while still
holding the qdisc root lock.

v2: don't use static keys anymore in HTB direct qdiscs (thanks Eric Dumazet)

CC: Maxim Mikityanskiy <maxim@isovalent.com>
CC: Xiumei Mu <xmu@redhat.com>
Reported-by: Christoph Paasch <cpaasch@apple.com>
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/451
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Link: https://lore.kernel.org/r/7dc06d6158f72053cf877a82e2a7a5bd23692faa.1713448007.git.dcaratti@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 months agoMerge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next...
Jakub Kicinski [Fri, 26 Apr 2024 03:00:54 +0000 (20:00 -0700)]
Merge branch '40GbE' of git://git./linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
net: intel: start The Great Code Dedup + Page Pool for iavf

Alexander Lobakin says:

Here's a two-shot: introduce {,Intel} Ethernet common library (libeth and
libie) and switch iavf to Page Pool. Details are in the commit messages;
here's a summary:

Not a secret there's a ton of code duplication between two and more Intel
ethernet modules. Before introducing new changes, which would need to be
copied over again, start decoupling the already existing duplicate
functionality into a new module, which will be shared between several
Intel Ethernet drivers. The first name that came to my mind was
"libie" -- "Intel Ethernet common library". Also this sounds like
"lovelie" (-> one word, no "lib I E" pls) and can be expanded as
"lib Internet Explorer" :P
The "generic", pure-software part is placed separately, so that it can be
easily reused in any driver by any vendor without linking to the Intel
pre-200G guts. In a few words, it's something any modern driver does the
same way, but nobody moved it level up (yet).
The series is only the beginning. From now on, adding every new feature
or doing any good driver refactoring will remove much more lines than add
for quite some time. There's a basic roadmap with some deduplications
planned already, not speaking of that touching every line now asks:
"can I share this?". The final destination is very ambitious: have only
one unified driver for at least i40e, ice, iavf, and idpf with a struct
ops for each generation. That's never gonna happen, right? But you still
can at least try.
PP conversion for iavf lands within the same series as these two are tied
closely. libie will support Page Pool model only, so that a driver can't
use much of the lib until it's converted. iavf is only the example, the
rest will eventually be converted soon on a per-driver basis. That is
when it gets really interesting. Stay tech.

* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  MAINTAINERS: add entry for libeth and libie
  iavf: switch to Page Pool
  iavf: pack iavf_ring more efficiently
  libeth: add Rx buffer management
  page_pool: add DMA-sync-for-CPU inline helper
  page_pool: constify some read-only function arguments
  slab: introduce kvmalloc_array_node() and kvcalloc_node()
  iavf: drop page splitting and recycling
  iavf: kill "legacy-rx" for good
  net: intel: introduce {, Intel} Ethernet common library
====================

Link: https://lore.kernel.org/r/20240424203559.3420468-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agoMerge branch 'net-lan966x-flower-validate-control-flags'
Jakub Kicinski [Fri, 26 Apr 2024 02:36:39 +0000 (19:36 -0700)]
Merge branch 'net-lan966x-flower-validate-control-flags'

Asbjørn Sloth Tønnesen says:

====================
net: lan966x: flower: validate control flags

This series adds flower control flags validation to the
lan966x driver, and changes it from assuming that it handles
all control flags, to instead reject rules if they have
masked any unknown/unsupported control flags.

v1: https://lore.kernel.org/netdev/20240423102720.228728-1-ast@fiberby.net/
====================

Link: https://lore.kernel.org/r/20240424125347.461995-1-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agonet: lan966x: flower: check for unsupported control flags
Asbjørn Sloth Tønnesen [Wed, 24 Apr 2024 12:53:40 +0000 (12:53 +0000)]
net: lan966x: flower: check for unsupported control flags

Use flow_rule_is_supp_control_flags() to reject filters with
unsupported control flags.

In case any unsupported control flags are masked,
flow_rule_is_supp_control_flags() sets a NL extended
error message, and we return -EOPNOTSUPP.

Only compile-tested.

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://lore.kernel.org/r/20240424125347.461995-4-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agonet: lan966x: flower: rename goto in lan966x_tc_flower_handler_control_usage()
Asbjørn Sloth Tønnesen [Wed, 24 Apr 2024 12:53:39 +0000 (12:53 +0000)]
net: lan966x: flower: rename goto in lan966x_tc_flower_handler_control_usage()

Rename goto label, as the error message is specific to the fragment flags.

Only compile-tested.

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://lore.kernel.org/r/20240424125347.461995-3-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agonet: lan966x: flower: add extack to lan966x_tc_flower_handler_control_usage()
Asbjørn Sloth Tønnesen [Wed, 24 Apr 2024 12:53:38 +0000 (12:53 +0000)]
net: lan966x: flower: add extack to lan966x_tc_flower_handler_control_usage()

Define extack locally, to reduce line lengths and aid future users.

Only compile-tested.

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://lore.kernel.org/r/20240424125347.461995-2-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agoMerge branch 'net-sparx5-flower-validate-control-flags'
Jakub Kicinski [Fri, 26 Apr 2024 02:35:10 +0000 (19:35 -0700)]
Merge branch 'net-sparx5-flower-validate-control-flags'

Asbjørn Sloth Tønnesen says:

====================
net: sparx5: flower: validate control flags

This series adds flower control flags validation to the
sparx5 driver, and changes it from assuming that it handles
all control flags, to instead reject rules if they have
masked any unknown/unsupported control flags.
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Tested-by: Daniel Machon <daniel.machon@microchip.com>
v1: https://lore.kernel.org/netdev/20240423102728.228765-1-ast@fiberby.net/
====================

Link: https://lore.kernel.org/r/20240424121632.459022-1-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agonet: sparx5: flower: check for unsupported control flags
Asbjørn Sloth Tønnesen [Wed, 24 Apr 2024 12:16:25 +0000 (12:16 +0000)]
net: sparx5: flower: check for unsupported control flags

Use flow_rule_is_supp_control_flags() to reject filters with
unsupported control flags.

In case any unsupported control flags are masked,
flow_rule_is_supp_control_flags() sets a NL extended
error message, and we return -EOPNOTSUPP.

Only compile-tested.

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Tested-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://lore.kernel.org/r/20240424121632.459022-5-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agonet: sparx5: flower: remove goto in sparx5_tc_flower_handler_control_usage()
Asbjørn Sloth Tønnesen [Wed, 24 Apr 2024 12:16:24 +0000 (12:16 +0000)]
net: sparx5: flower: remove goto in sparx5_tc_flower_handler_control_usage()

Remove goto, as it's only used once, and the error message is
specific to that context.

Only compile tested.

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Tested-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://lore.kernel.org/r/20240424121632.459022-4-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agonet: sparx5: flower: add extack to sparx5_tc_flower_handler_control_usage()
Asbjørn Sloth Tønnesen [Wed, 24 Apr 2024 12:16:23 +0000 (12:16 +0000)]
net: sparx5: flower: add extack to sparx5_tc_flower_handler_control_usage()

Define extack locally, to reduce line lengths and aid future users.

Only compile tested.

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Tested-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://lore.kernel.org/r/20240424121632.459022-3-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agonet: sparx5: flower: only do lookup if fragment flags are set
Asbjørn Sloth Tønnesen [Wed, 24 Apr 2024 12:16:22 +0000 (12:16 +0000)]
net: sparx5: flower: only do lookup if fragment flags are set

The fragment lookup should only be performed, when
at least one of the fragment flags are set.

This change was deliberately not included in commit
68aba00483c7 ("net: sparx5: flower: fix fragment flags handling")
as it's only needed for future proffing the code, since
"mask" is currently only set in conjunction with the
fragment flags.
(The 3rd flag FLOW_DIS_ENCAPSULATION is only used with "key")

Only compile tested.

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Tested-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://lore.kernel.org/r/20240424121632.459022-2-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agonet: wwan: t7xx: Un-embed dummy device
Breno Leitao [Wed, 24 Apr 2024 16:11:07 +0000 (09:11 -0700)]
net: wwan: t7xx: Un-embed dummy device

Embedding net_device into structures prohibits the usage of flexible
arrays in the net_device structure. For more details, see the discussion
at [1].

Un-embed the net_device from the private struct by converting it
into a pointer. Then use the leverage the new alloc_netdev_dummy()
helper to allocate and initialize dummy devices.

[1] https://lore.kernel.org/all/20240229225910.79e224cf@kernel.org/

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240424161108.3397057-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agobpf, docs: Add introduction for use in the ISA Internet Draft
Dave Thaler [Mon, 22 Apr 2024 19:09:42 +0000 (12:09 -0700)]
bpf, docs: Add introduction for use in the ISA Internet Draft

The proposed intro paragraph text is derived from the first paragraph
of the IETF BPF WG charter at https://datatracker.ietf.org/wg/bpf/about/

Signed-off-by: Dave Thaler <dthaler1968@gmail.com>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20240422190942.24658-1-dthaler1968@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 months agoMerge branch 'net-microchip-correct-spelling-in-comments'
Jakub Kicinski [Fri, 26 Apr 2024 02:13:28 +0000 (19:13 -0700)]
Merge branch 'net-microchip-correct-spelling-in-comments'

Simon Horman says:

====================
net: microchip: Correct spelling in comments

Correct spelling in comments in Microchip drivers.
Flagged by codespell.

v1: https://lore.kernel.org/r/20240419-lan743x-confirm-v1-0-2a087617a3e5@kernel.org
====================

Link: https://lore.kernel.org/r/20240424-lan743x-confirm-v2-0-f0480542e39f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agonet: sparx5: Correct spelling in comments
Simon Horman [Wed, 24 Apr 2024 15:13:26 +0000 (16:13 +0100)]
net: sparx5: Correct spelling in comments

Correct spelling in comments, as flagged by codespell.

Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://lore.kernel.org/r/20240424-lan743x-confirm-v2-4-f0480542e39f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agonet: encx24j600: Correct spelling in comments
Simon Horman [Wed, 24 Apr 2024 15:13:25 +0000 (16:13 +0100)]
net: encx24j600: Correct spelling in comments

Correct spelling in comments, as flagged by codespell.

Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://lore.kernel.org/r/20240424-lan743x-confirm-v2-3-f0480542e39f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agonet: lan966x: Correct spelling in comments
Simon Horman [Wed, 24 Apr 2024 15:13:24 +0000 (16:13 +0100)]
net: lan966x: Correct spelling in comments

Correct spelling in comments, as flagged by codespell.

Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://lore.kernel.org/r/20240424-lan743x-confirm-v2-2-f0480542e39f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agonet: lan743x: Correct spelling in comments
Simon Horman [Wed, 24 Apr 2024 15:13:23 +0000 (16:13 +0100)]
net: lan743x: Correct spelling in comments

Correct spelling in comments, as flagged by codespell.

Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://lore.kernel.org/r/20240424-lan743x-confirm-v2-1-f0480542e39f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agor8152: replace dev_info with dev_dbg for loading firmware
Hayes Wang [Wed, 24 Apr 2024 08:45:32 +0000 (16:45 +0800)]
r8152: replace dev_info with dev_dbg for loading firmware

Someone complains the message appears continuously. This occurs
because the device is woken from UPS mode, and the driver re-loads
the firmware.

When the device enters runtime suspend and cable is unplugged, the
device would enter UPS mode. If the runtime resume occurs, and the
device is woken from UPS mode, the driver has to re-load the firmware
and causes the message. If someone wakes the device continuously, the
message would be shown continuously, too. Use dev_dbg to avoid it.

Note that, the function could be called before register_netdev(), so I
don't use netif_info() or netif_dbg().

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Link: https://lore.kernel.org/r/20240424084532.159649-1-hayeswang@realtek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agonet: usb: ax88179_178a: Add check for usbnet_get_endpoints()
Ma Ke [Wed, 24 Apr 2024 06:56:34 +0000 (14:56 +0800)]
net: usb: ax88179_178a: Add check for usbnet_get_endpoints()

To avoid the failure of usbnet_get_endpoints(), we should check the
return value of the usbnet_get_endpoints().

Signed-off-by: Ma Ke <make_ruc2021@163.com>
Reviewed-by: Hariprasad Kelam <hkelam@marvell.com>
Link: https://lore.kernel.org/r/20240424065634.1870027-1-make_ruc2021@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agonet: sfp: add quirk for ATS SFP-GE-T 1000Base-TX module
Daniel Golle [Tue, 23 Apr 2024 09:00:25 +0000 (11:00 +0200)]
net: sfp: add quirk for ATS SFP-GE-T 1000Base-TX module

Add quirk for ATS SFP-GE-T 1000Base-TX module.

This copper module comes with broken TX_FAULT indicator which must be
ignored for it to work.

Co-authored-by: Josef Schlehofer <pepe.schlehofer@gmail.com>
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
[ rebased on top of net-next ]
Signed-off-by: Marek Behún <kabel@kernel.org>
Link: https://lore.kernel.org/r/20240423090025.29231-1-kabel@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agonet: sfp: enhance quirk for Fibrestore 2.5G copper SFP module
Marek Behún [Tue, 23 Apr 2024 08:50:39 +0000 (10:50 +0200)]
net: sfp: enhance quirk for Fibrestore 2.5G copper SFP module

Enhance the quirk for Fibrestore 2.5G copper SFP module. The original
commit e27aca3760c0 ("net: sfp: add quirk for FS's 2.5G copper SFP")
introducing the quirk says that the PHY is inaccessible, but that is
not true.

The module uses Rollball protocol to talk to the PHY, and needs a 4
second wait before probing it, same as FS 10G module.

The PHY inside the module is Realtek RTL8221B-VB-CG PHY. The realtek
driver recently gained support to set it up via clause 45 accesses.

Signed-off-by: Marek Behún <kabel@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240423085039.26957-2-kabel@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agonet: sfp: update comment for FS SFP-10G-T quirk
Marek Behún [Tue, 23 Apr 2024 08:50:38 +0000 (10:50 +0200)]
net: sfp: update comment for FS SFP-10G-T quirk

Update the comment for the Fibrestore SFP-10G-T module: since commit
e9301af385e7 ("net: sfp: fix PHY discovery for FS SFP-10G-T module")
we also do a 4 second wait before probing the PHY.

Fixes: e9301af385e7 ("net: sfp: fix PHY discovery for FS SFP-10G-T module")
Signed-off-by: Marek Behún <kabel@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240423085039.26957-1-kabel@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agonet: add two more call_rcu_hurry()
Eric Dumazet [Tue, 23 Apr 2024 20:54:08 +0000 (20:54 +0000)]
net: add two more call_rcu_hurry()

I had failures with pmtu.sh selftests lately,
with netns dismantles firing ref_tracking alerts [1].

After much debugging, I found that some queued
rcu callbacks were delayed by minutes, because
of CONFIG_RCU_LAZY=y option.

Joel Fernandes had a similar issue in the past,
fixed with commit 483c26ff63f4 ("net: Use call_rcu_hurry()
for dst_release()")

In this commit, I make sure nexthop_free_rcu()
and free_fib_info_rcu() are not delayed too much
because they both can release device references.

tools/testing/selftests/net/pmtu.sh no longer fails.

Traces were:

[  968.179860] ref_tracker: veth_A-R1@00000000d0ff3fe2 has 3/5 users at
                    dst_alloc+0x76/0x160
                    ip6_dst_alloc+0x25/0x80
                    ip6_pol_route+0x2a8/0x450
                    ip6_pol_route_output+0x1f/0x30
                    fib6_rule_lookup+0x163/0x270
                    ip6_route_output_flags+0xda/0x190
                    ip6_dst_lookup_tail.constprop.0+0x1d0/0x260
                    ip6_dst_lookup_flow+0x47/0xa0
                    udp_tunnel6_dst_lookup+0x158/0x210
                    vxlan_xmit_one+0x4c2/0x1550 [vxlan]
                    vxlan_xmit+0x52d/0x14f0 [vxlan]
                    dev_hard_start_xmit+0x7b/0x1e0
                    __dev_queue_xmit+0x20b/0xe40
                    ip6_finish_output2+0x2ea/0x6e0
                    ip6_finish_output+0x143/0x320
                    ip6_output+0x74/0x140

[  968.179860] ref_tracker: veth_A-R1@00000000d0ff3fe2 has 1/5 users at
                    netdev_get_by_index+0xc0/0xe0
                    fib6_nh_init+0x1a9/0xa90
                    rtm_new_nexthop+0x6fa/0x1580
                    rtnetlink_rcv_msg+0x155/0x3e0
                    netlink_rcv_skb+0x61/0x110
                    rtnetlink_rcv+0x19/0x20
                    netlink_unicast+0x23f/0x380
                    netlink_sendmsg+0x1fc/0x430
                    ____sys_sendmsg+0x2ef/0x320
                    ___sys_sendmsg+0x86/0xd0
                    __sys_sendmsg+0x67/0xc0
                    __x64_sys_sendmsg+0x21/0x30
                    x64_sys_call+0x252/0x2030
                    do_syscall_64+0x6c/0x190
                    entry_SYSCALL_64_after_hwframe+0x76/0x7e

[  968.179860] ref_tracker: veth_A-R1@00000000d0ff3fe2 has 1/5 users at
                    ipv6_add_dev+0x136/0x530
                    addrconf_notify+0x19d/0x770
                    notifier_call_chain+0x65/0xd0
                    raw_notifier_call_chain+0x1a/0x20
                    call_netdevice_notifiers_info+0x54/0x90
                    register_netdevice+0x61e/0x790
                    veth_newlink+0x230/0x440
                    __rtnl_newlink+0x7d2/0xaa0
                    rtnl_newlink+0x4c/0x70
                    rtnetlink_rcv_msg+0x155/0x3e0
                    netlink_rcv_skb+0x61/0x110
                    rtnetlink_rcv+0x19/0x20
                    netlink_unicast+0x23f/0x380
                    netlink_sendmsg+0x1fc/0x430
                    ____sys_sendmsg+0x2ef/0x320
                    ___sys_sendmsg+0x86/0xd0
....
[ 1079.316024]  ? show_regs+0x68/0x80
[ 1079.316087]  ? __warn+0x8c/0x140
[ 1079.316103]  ? ref_tracker_free+0x1a0/0x270
[ 1079.316117]  ? report_bug+0x196/0x1c0
[ 1079.316135]  ? handle_bug+0x42/0x80
[ 1079.316149]  ? exc_invalid_op+0x1c/0x70
[ 1079.316162]  ? asm_exc_invalid_op+0x1f/0x30
[ 1079.316193]  ? ref_tracker_free+0x1a0/0x270
[ 1079.316208]  ? _raw_spin_unlock+0x1a/0x40
[ 1079.316222]  ? free_unref_page+0x126/0x1a0
[ 1079.316239]  ? destroy_large_folio+0x69/0x90
[ 1079.316251]  ? __folio_put+0x99/0xd0
[ 1079.316276]  dst_dev_put+0x69/0xd0
[ 1079.316308]  fib6_nh_release_dsts.part.0+0x3d/0x80
[ 1079.316327]  fib6_nh_release+0x45/0x70
[ 1079.316340]  nexthop_free_rcu+0x131/0x170
[ 1079.316356]  rcu_do_batch+0x1ee/0x820
[ 1079.316370]  ? rcu_do_batch+0x179/0x820
[ 1079.316388]  rcu_core+0x1aa/0x4d0
[ 1079.316405]  rcu_core_si+0x12/0x20
[ 1079.316417]  __do_softirq+0x13a/0x3dc
[ 1079.316435]  __irq_exit_rcu+0xa3/0x110
[ 1079.316449]  irq_exit_rcu+0x12/0x30
[ 1079.316462]  sysvec_apic_timer_interrupt+0x5b/0xe0
[ 1079.316474]  asm_sysvec_apic_timer_interrupt+0x1f/0x30
[ 1079.316569] RIP: 0033:0x7f06b65c63f0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240423205408.39632-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agoMerge branch 'bpf: add mrtt and srtt as ctx->args for BPF_SOCK_OPS_RTT_CB'
Martin KaFai Lau [Thu, 25 Apr 2024 20:26:21 +0000 (13:26 -0700)]
Merge branch 'bpf: add mrtt and srtt as ctx->args for BPF_SOCK_OPS_RTT_CB'

Philo Lu says:

====================
These provides more information about tcp RTT estimation. The selftest for
BPF_SOCK_OPS_RTT_CB is extended for the added args.

changelogs
-> v1:
- extend rtt selftest for added args (suggested by Stanislav)
====================

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
7 months agoselftests/bpf: extend BPF_SOCK_OPS_RTT_CB test for srtt and mrtt_us
Philo Lu [Thu, 25 Apr 2024 16:17:24 +0000 (00:17 +0800)]
selftests/bpf: extend BPF_SOCK_OPS_RTT_CB test for srtt and mrtt_us

Because srtt and mrtt_us are added as args in bpf_sock_ops at
BPF_SOCK_OPS_RTT_CB, a simple check is added to make sure they are both
non-zero.

$ ./test_progs -t tcp_rtt
  #373     tcp_rtt:OK
  Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED

Suggested-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240425161724.73707-3-lulie@linux.alibaba.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
7 months agobpf: add mrtt and srtt as BPF_SOCK_OPS_RTT_CB args
Philo Lu [Thu, 25 Apr 2024 16:17:23 +0000 (00:17 +0800)]
bpf: add mrtt and srtt as BPF_SOCK_OPS_RTT_CB args

Two important arguments in RTT estimation, mrtt and srtt, are passed to
tcp_bpf_rtt(), so that bpf programs get more information about RTT
computation in BPF_SOCK_OPS_RTT_CB.

The difference between bpf_sock_ops->srtt_us and the srtt here is: the
former is an old rtt before update, while srtt passed by tcp_bpf_rtt()
is that after update.

Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240425161724.73707-2-lulie@linux.alibaba.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
7 months agoMerge branch 'check-bpf_dummy_struct_ops-program-params-for-test-runs'
Alexei Starovoitov [Thu, 25 Apr 2024 19:40:20 +0000 (12:40 -0700)]
Merge branch 'check-bpf_dummy_struct_ops-program-params-for-test-runs'

Eduard Zingerman says:

====================
check bpf_dummy_struct_ops program params for test runs

When doing BPF_PROG_TEST_RUN for bpf_dummy_struct_ops programs,
execution should be rejected when NULL is passed for non-nullable
params, because for such params verifier assumes that such params are
never NULL and thus might optimize out NULL checks.

This problem was reported by Jose E. Marchesi in off-list discussion.
The code generated by GCC for dummy_st_ops_success/test_1() function
differs from LLVM variant in a way that allows verifier to remove the
NULL check. The test dummy_st_ops/dummy_init_ret_value actually sets
the 'state' parameter to NULL, thus GCC-generated version of the test
triggers NULL pointer dereference when BPF program is executed.

This patch-set addresses the issue in the following steps:
- patch #1 marks bpf_dummy_struct_ops.test_1 parameter as nullable,
  for verifier to have correct assumptions about test_1() programs;
- patch #2 modifies dummy_st_ops/dummy_init_ret_value to trigger NULL
  dereference with both GCC and LLVM (if patch #1 is not applied);
- patch #3 adjusts a few dummy_st_ops test cases to avoid passing NULL
  for 'state' parameter of test_2() and test_sleepable() functions,
  as parameters of these functions are not marked as nullable;
- patch #4 adjusts bpf_dummy_struct_ops to reject test execution of
  programs if NULL is passed for non-nullable parameter;
- patch #5 adds a test to verify logic from patch #4.
====================

Link: https://lore.kernel.org/r/20240424012821.595216-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 months agoselftests/bpf: dummy_st_ops should reject 0 for non-nullable params
Eduard Zingerman [Wed, 24 Apr 2024 01:28:21 +0000 (18:28 -0700)]
selftests/bpf: dummy_st_ops should reject 0 for non-nullable params

Check if BPF_PROG_TEST_RUN for bpf_dummy_struct_ops programs
rejects execution if NULL is passed for non-nullable parameter.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240424012821.595216-6-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 months agobpf: check bpf_dummy_struct_ops program params for test runs
Eduard Zingerman [Wed, 24 Apr 2024 01:28:20 +0000 (18:28 -0700)]
bpf: check bpf_dummy_struct_ops program params for test runs

When doing BPF_PROG_TEST_RUN for bpf_dummy_struct_ops programs,
reject execution when NULL is passed for non-nullable params.
For programs with non-nullable params verifier assumes that
such params are never NULL and thus might optimize out NULL checks.

Suggested-by: Kui-Feng Lee <sinquersw@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240424012821.595216-5-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 months agoselftests/bpf: do not pass NULL for non-nullable params in dummy_st_ops
Eduard Zingerman [Wed, 24 Apr 2024 01:28:19 +0000 (18:28 -0700)]
selftests/bpf: do not pass NULL for non-nullable params in dummy_st_ops

dummy_st_ops.test_2 and dummy_st_ops.test_sleepable do not have their
'state' parameter marked as nullable. Update dummy_st_ops.c to avoid
passing NULL for such parameters, as the next patch would allow kernel
to enforce this restriction.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240424012821.595216-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 months agoselftests/bpf: adjust dummy_st_ops_success to detect additional error
Eduard Zingerman [Wed, 24 Apr 2024 01:28:18 +0000 (18:28 -0700)]
selftests/bpf: adjust dummy_st_ops_success to detect additional error

As reported by Jose E. Marchesi in off-list discussion, GCC and LLVM
generate slightly different code for dummy_st_ops_success/test_1():

  SEC("struct_ops/test_1")
  int BPF_PROG(test_1, struct bpf_dummy_ops_state *state)
  {
   int ret;

   if (!state)
   return 0xf2f3f4f5;

   ret = state->val;
   state->val = 0x5a;
   return ret;
  }

  GCC-generated                  LLVM-generated
  ----------------------------   ---------------------------
  0: r1 = *(u64 *)(r1 + 0x0)     0: w0 = -0xd0c0b0b
  1: if r1 == 0x0 goto 5f        1: r1 = *(u64 *)(r1 + 0x0)
  2: r0 = *(s32 *)(r1 + 0x0)     2: if r1 == 0x0 goto 6f
  3: *(u32 *)(r1 + 0x0) = 0x5a   3: r0 = *(u32 *)(r1 + 0x0)
  4: exit                        4: w2 = 0x5a
  5: r0 = -0xd0c0b0b             5: *(u32 *)(r1 + 0x0) = r2
  6: exit                        6: exit

If the 'state' argument is not marked as nullable in
net/bpf/bpf_dummy_struct_ops.c, the verifier would assume that
'r1 == 0x0' is never true:
- for the GCC version, this means that instructions #5-6 would be
  marked as dead and removed;
- for the LLVM version, all instructions would be marked as live.

The test dummy_st_ops/dummy_init_ret_value actually sets the 'state'
parameter to NULL.

Therefore, when the 'state' argument is not marked as nullable,
the GCC-generated version of the code would trigger a NULL pointer
dereference at instruction #3.

This patch updates the test_1() test case to always follow a shape
similar to the GCC-generated version above, in order to verify whether
the 'state' nullability is marked correctly.

Reported-by: Jose E. Marchesi <jemarch@gnu.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240424012821.595216-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 months agobpf: mark bpf_dummy_struct_ops.test_1 parameter as nullable
Eduard Zingerman [Wed, 24 Apr 2024 01:28:17 +0000 (18:28 -0700)]
bpf: mark bpf_dummy_struct_ops.test_1 parameter as nullable

Test case dummy_st_ops/dummy_init_ret_value passes NULL as the first
parameter of the test_1() function. Mark this parameter as nullable to
make verifier aware of such possibility.
Otherwise, NULL check in the test_1() code:

      SEC("struct_ops/test_1")
      int BPF_PROG(test_1, struct bpf_dummy_ops_state *state)
      {
            if (!state)
                    return ...;

            ... access state ...
      }

Might be removed by verifier, thus triggering NULL pointer dereference
under certain conditions.

Reported-by: Jose E. Marchesi <jemarch@gnu.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240424012821.595216-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 months agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Jakub Kicinski [Thu, 25 Apr 2024 19:40:48 +0000 (12:40 -0700)]
Merge git://git./linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/ethernet/ti/icssg/icssg_prueth.c

net/mac80211/chan.c
  89884459a0b9 ("wifi: mac80211: fix idle calculation with multi-link")
  87f5500285fb ("wifi: mac80211: simplify ieee80211_assign_link_chanctx()")
https://lore.kernel.org/all/20240422105623.7b1fbda2@canb.auug.org.au/

net/unix/garbage.c
  1971d13ffa84 ("af_unix: Suppress false-positive lockdep splat for spin_lock() in __unix_gc().")
  4090fa373f0e ("af_unix: Replace garbage collection algorithm.")

drivers/net/ethernet/ti/icssg/icssg_prueth.c
drivers/net/ethernet/ti/icssg/icssg_common.c
  4dcd0e83ea1d ("net: ti: icssg-prueth: Fix signedness bug in prueth_init_rx_chns()")
  e2dc7bfd677f ("net: ti: icssg-prueth: Move common functions into a separate file")

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agotcp: avoid premature drops in tcp_add_backlog()
Eric Dumazet [Tue, 23 Apr 2024 12:56:20 +0000 (12:56 +0000)]
tcp: avoid premature drops in tcp_add_backlog()

While testing TCP performance with latest trees,
I saw suspect SOCKET_BACKLOG drops.

tcp_add_backlog() computes its limit with :

    limit = (u32)READ_ONCE(sk->sk_rcvbuf) +
            (u32)(READ_ONCE(sk->sk_sndbuf) >> 1);
    limit += 64 * 1024;

This does not take into account that sk->sk_backlog.len
is reset only at the very end of __release_sock().

Both sk->sk_backlog.len and sk->sk_rmem_alloc could reach
sk_rcvbuf in normal conditions.

We should double sk->sk_rcvbuf contribution in the formula
to absorb bubbles in the backlog, which happen more often
for very fast flows.

This change maintains decent protection against abuses.

Fixes: c377411f2494 ("net: sk_add_backlog() take rmem_alloc into account")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240423125620.3309458-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 months agoMerge tag 'wireless-next-2024-04-24' of git://git.kernel.org/pub/scm/linux/kernel...
Jakub Kicinski [Thu, 25 Apr 2024 18:49:35 +0000 (11:49 -0700)]
Merge tag 'wireless-next-2024-04-24' of git://git./linux/kernel/git/wireless/wireless-next

Kalle Valo says:

====================
wireless-next patches for v6.10

The second "new features" pull request for v6.10 with changes both in
stack and in drivers. This time the pull request is rather small and
nothing special standing out except maybe that we have several
kernel-doc fixes. Great to see that we are getting warning free
wireless code (until new warnings are added).

Major changes:

rtl8xxxu:
 * enable Management Frame Protection (MFP) support

rtw88:
 * disable unsupported interface type of mesh point for all chips, and only
   support station mode for SDIO chips.

* tag 'wireless-next-2024-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (63 commits)
  wifi: mac80211: handle link ID during management Tx
  wifi: mac80211: handle sdata->u.ap.active flag with MLO
  wifi: cfg80211: add return docs for regulatory functions
  wifi: cfg80211: make some regulatory functions void
  wifi: mac80211: add return docs for sta_info_flush()
  wifi: mac80211: keep mac80211 consistent on link activation failure
  wifi: mac80211: simplify ieee80211_assign_link_chanctx()
  wifi: mac80211: reserve chanctx during find
  wifi: cfg80211: fix cfg80211 function kernel-doc
  wifi: mac80211_hwsim: Use wider regulatory for custom for 6GHz tests
  wifi: iwlwifi: mvm: Don't allow EMLSR when the RSSI is low
  wifi: iwlwifi: mvm: disable EMLSR when we suspend with wowlan
  wifi: iwlwifi: mvm: get periodic statistics in EMLSR
  wifi: iwlwifi: mvm: don't recompute EMLSR mode in can_activate_links
  wifi: iwlwifi: mvm: implement EMLSR prevention mechanism.
  wifi: iwlwifi: mvm: exit EMLSR upon missed beacon
  wifi: iwlwifi: mvm: init vif works only once
  wifi: iwlwifi: mvm: Add helper functions to update EMLSR status
  wifi: iwlwifi: mvm: Implement new link selection algorithm
  wifi: iwlwifi: mvm: move EMLSR/links code
  ...
====================

Link: https://lore.kernel.org/r/20240424100122.217AEC113CE@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>