Simon Schippers [Thu, 6 Nov 2025 17:56:15 +0000 (18:56 +0100)]
usbnet: Add support for Byte Queue Limits (BQL)
In the current implementation, usbnet uses a fixed tx_qlen of:
USB2: 60 * 1518 bytes = 91.08 KB
USB3: 60 * 5 * 1518 bytes = 454.80 KB
Such large transmit queues can be problematic, especially for cellular
modems. For example, with a typical celluar link speed of 10 Mbit/s, a
fully occupied USB3 transmit queue results in:
454.80 KB / (10 Mbit/s / 8 bit/byte) = 363.84 ms
of additional latency.
This patch adds support for Byte Queue Limits (BQL) [1] to dynamically
manage the transmit queue size and reduce latency without sacrificing
throughput.
Testing was performed on various devices using the usbnet driver for
packet transmission:
- DELOCK 66045: USB3 to 2.5 GbE adapter (ax88179_178a)
- DELOCK 61969: USB2 to 1 GbE adapter (asix)
- Quectel RM520: 5G modem (qmi_wwan)
- USB2 Android tethering (cdc_ncm)
No performance degradation was observed for iperf3 TCP or UDP traffic,
while latency for a prioritized ping application was significantly
reduced. For example, using the USB3 to 2.5 GbE adapter, which was fully
utilized by iperf3 UDP traffic, the prioritized ping was improved from
1.6 ms to 0.6 ms. With the same setup but with a 100 Mbit/s Ethernet
connection, the prioritized ping was improved from 35 ms to 5 ms.
[1] https://lwn.net/Articles/469652/
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251106175615.26948-1-simon.schippers@tu-dortmund.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Breno Leitao [Fri, 7 Nov 2025 10:36:59 +0000 (02:36 -0800)]
tg3: Fix num of RX queues being reported by ethtool
Using num_online_cpus() to report number of queues is actually not
correct, as reported by Michael[1].
netif_get_num_default_rss_queues() was used to replace num_online_cpus()
in the past, but tg3 ethtool callbacks didn't get converted. Doing it
now.
Link: https://lore.kernel.org/all/CACKFLim7ruspmqvjr6bNRq5Z_XXVk3vVaLZOons7kMCzsEG23A@mail.gmail.com/#t
Signed-off-by: Breno Leitao <leitao@debian.org>
Suggested-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20251107-tg3_counts-v1-1-337fe5c8ccb7@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 11 Nov 2025 01:11:09 +0000 (17:11 -0800)]
Merge branch 'net-dsa-b53-add-support-for-bcm5389-97-98-and-bcm63xx-arl-formats'
Jonas Gorski says:
====================
net: dsa: b53: add support for BCM5389/97/98 and BCM63XX ARL formats
Currently b53 assumes that all switches apart from BCM5325/5365 use the
same ARL formats, but there are actually multiple formats in use.
Older switches use a format apparently introduced with BCM5387/BCM5389,
while newer chips use a format apparently introduced with BCM5395.
Note that these numbers are not linear, BCM5397/BCM5398 use the older
format.
In addition to that the switches integrated into BCM63XX SoCs use their
own format. While accessing these normal read/write ARL entries are the
same format as BCM5389 one, the search format is different.
So in order to support all these different format, split all code
accessing these entries into chip-family specific functions, and collect
them in appropriate arl ops structs to keep the code cleaner.
Sent as net-next since the ARL accesses have never worked before, and
the extensive refactoring might be too much to warrant a fix.
====================
Link: https://patch.msgid.link/20251107080749.26936-1-jonas.gorski@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jonas Gorski [Fri, 7 Nov 2025 08:07:49 +0000 (09:07 +0100)]
net: dsa: b53: add support for bcm63xx ARL entry format
The ARL registers of BCM63XX embedded switches are somewhat unique. The
normal ARL table access registers have the same format as BCM5389, but
the ARL search registers differ:
* SRCH_CTL is at the same offset of BCM5389, but 16 bits wide. It does
not have more fields, just needs to be accessed by a 16 bit read.
* SRCH_RSLT_MACVID and SRCH_RSLT are aligned to 32 bit, and have shifted
offsets.
* SRCH_RSLT has a different format than the normal ARL data entry
register.
* There is only one set of ENTRY_N registers, implying a 1 bin layout.
So add appropriate ops for bcm63xx and let it use it.
Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20251107080749.26936-9-jonas.gorski@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jonas Gorski [Fri, 7 Nov 2025 08:07:48 +0000 (09:07 +0100)]
net: dsa: b53: add support for 5389/5397/5398 ARL entry format
BCM5389, BCM5397 and BCM5398 use a different ARL entry format with just
a 16 bit fwdentry register, as well as different search control and data
offsets.
So add appropriate ops for them and switch those chips to use them.
Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20251107080749.26936-8-jonas.gorski@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jonas Gorski [Fri, 7 Nov 2025 08:07:47 +0000 (09:07 +0100)]
net: dsa: b53: move ARL entry functions into ops struct
Now that the differences in ARL entry formats are neatly contained into
functions per chip family, wrap them into an ops struct and add wrapper
functions to access them.
Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20251107080749.26936-7-jonas.gorski@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jonas Gorski [Fri, 7 Nov 2025 08:07:46 +0000 (09:07 +0100)]
net: dsa: b53: split reading search entry into their own functions
Split reading search entries into a function for each format.
Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20251107080749.26936-6-jonas.gorski@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jonas Gorski [Fri, 7 Nov 2025 08:07:45 +0000 (09:07 +0100)]
net: dsa: b53: provide accessors for accessing ARL_SRCH_CTL
In order to more easily support more formats, move accessing
ARL_SRCH_CTL into helper functions to contain the differences.
Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20251107080749.26936-5-jonas.gorski@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jonas Gorski [Fri, 7 Nov 2025 08:07:44 +0000 (09:07 +0100)]
net: dsa: b53: move writing ARL entries into their own functions
Move writing ARL entries into individual functions for each format.
Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20251107080749.26936-4-jonas.gorski@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jonas Gorski [Fri, 7 Nov 2025 08:07:43 +0000 (09:07 +0100)]
net: dsa: b53: move reading ARL entries into their own function
Instead of duplicating the whole code iterating over all bins for
BCM5325, factor out reading and parsing the entry into its own
functions, and name it the modern one after the first chip with that ARL
format, (BCM53)95.
Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20251107080749.26936-3-jonas.gorski@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jonas Gorski [Fri, 7 Nov 2025 08:07:42 +0000 (09:07 +0100)]
net: dsa: b53: b53_arl_read{,25}(): use the entry for comparision
Align the b53_arl_read{,25}() functions by consistently using the
parsed arl entry instead of parsing the raw registers again.
Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20251107080749.26936-2-jonas.gorski@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 11 Nov 2025 00:43:51 +0000 (16:43 -0800)]
Merge tag 'for-netdev' of https://git./linux/kernel/git/bpf/bpf-next
Martin KaFai Lau says:
====================
pull-request: bpf-next 2025-11-10
We've added 19 non-merge commits during the last 3 day(s) which contain
a total of 22 files changed, 1345 insertions(+), 197 deletions(-).
The main changes are:
1) Preserve skb metadata after a TC BPF program has changed the skb,
from Jakub Sitnicki.
This allows a TC program at the end of a TC filter chain to still see
the skb metadata, even if another TC program at the front of the chain
has changed the skb using BPF helpers.
2) Initial af_smc bpf_struct_ops support to control the smc specific
syn/synack options, from D. Wythe.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
bpf/selftests: Add selftest for bpf_smc_hs_ctrl
net/smc: bpf: Introduce generic hook for handshake flow
bpf: Export necessary symbols for modules with struct_ops
selftests/bpf: Cover skb metadata access after bpf_skb_change_proto
selftests/bpf: Cover skb metadata access after change_head/tail helper
selftests/bpf: Cover skb metadata access after bpf_skb_adjust_room
selftests/bpf: Cover skb metadata access after vlan push/pop helper
selftests/bpf: Expect unclone to preserve skb metadata
selftests/bpf: Dump skb metadata on verification failure
selftests/bpf: Verify skb metadata in BPF instead of userspace
bpf: Make bpf_skb_change_head helper metadata-safe
bpf: Make bpf_skb_change_proto helper metadata-safe
bpf: Make bpf_skb_adjust_room metadata-safe
bpf: Make bpf_skb_vlan_push helper metadata-safe
bpf: Make bpf_skb_vlan_pop helper metadata-safe
vlan: Make vlan_remove_tag return nothing
bpf: Unclone skb head on bpf_dynptr_write to skb metadata
net: Preserve metadata on pskb_expand_head
net: Helper to move packet data and metadata after skb_push/pull
====================
Link: https://patch.msgid.link/20251110232427.3929291-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Niklas Söderlund [Fri, 7 Nov 2025 20:01:00 +0000 (21:01 +0100)]
net: ravb: Correct bad check of timestamp control flags
When converting the Renesas network drivers to use flags from enum
hwtstamp_rx_filters to control when to timestamp packages instead of a
driver specific schema with bit-wise flags an error was made.
The bit-wise driver specific flags correct logic to set get_ts was:
q: RAVB_BE + tstamp_rx_ctrl: 0 => 0
q: RAVB_NC + tstamp_rx_ctrl: 0 => 0
q: RAVB_BE + tstamp_rx_ctrl: RAVB_RXTSTAMP_TYPE_V2_L2_EVENT => 0
q: RAVB_NC + tstamp_rx_ctrl: RAVB_RXTSTAMP_TYPE_V2_L2_EVENT => 1
q: RAVB_BE + tstamp_rx_ctrl: RAVB_RXTSTAMP_TYPE_ALL => 1
q: RAVB_NC + tstamp_rx_ctrl: RAVB_RXTSTAMP_TYPE_ALL => 1
The converted logic to use enum flags mapped tstamp_rx_ctrl as
0 to HWTSTAMP_FILTER_NONE
RAVB_RXTSTAMP_TYPE_V2_L2_EVENT to HWTSTAMP_FILTER_PTP_V2_L2_EVENT
RAVB_RXTSTAMP_TYPE_ALL to HWTSTAMP_FILTER_ALL
But the logic was incorrectly changed to:
q: RAVB_BE + tstamp_rx_ctrl: HWTSTAMP_FILTER_NONE => 1 (error)
q: RAVB_NC + tstamp_rx_ctrl: HWTSTAMP_FILTER_NONE => 0
q: RAVB_BE + tstamp_rx_ctrl: HWTSTAMP_FILTER_PTP_V2_L2_EVENT => 0
q: RAVB_NC + tstamp_rx_ctrl: HWTSTAMP_FILTER_PTP_V2_L2_EVENT => 1
q: RAVB_BE + tstamp_rx_ctrl: HWTSTAMP_FILTER_ALL => 1
q: RAVB_NC + tstamp_rx_ctrl: HWTSTAMP_FILTER_ALL => 0 (error)
This change restores the converted flag check to the correct logic of
the bit-wise driver specific flags.
Reported-by: Simon Horman <horms@kernel.org>
Closes: https://lore.kernel.org/linux-renesas-soc/aQ4xSv9629XF-Bt3@horms.kernel.org/
Fixes:
16e2e6cf75e6 ("net: ravb: Use common defines for time stamping control")
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Link: https://patch.msgid.link/20251107200100.3637869-1-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Zhongqiu Han [Fri, 7 Nov 2025 07:45:33 +0000 (15:45 +0800)]
ptp: ocp: Document sysfs output format for backward compatibility
Add a comment to ptp_ocp_tty_show() explaining that the sysfs output
intentionally does not include a trailing newline. This is required for
backward compatibility with existing userspace software that reads the
sysfs attribute and uses the value directly as a device path.
A previous attempt to add a newline to align with common kernel
conventions broke userspace applications that were opening device paths
like "/dev/ttyS4\n" instead of "/dev/ttyS4", resulting in ENOENT errors.
This comment prevents future attempts to "fix" this behavior, which would
break existing userspace applications.
Link: https://lore.kernel.org/netdev/20251030124519.1828058-1-zhongqiu.han@oss.qualcomm.com/
Link: https://lore.kernel.org/netdev/aef3b850-5f38-4c28-a018-3b0006dc2f08@linux.dev/
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20251107074533.416048-1-zhongqiu.han@oss.qualcomm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kuniyuki Iwashima [Thu, 6 Nov 2025 22:34:06 +0000 (22:34 +0000)]
sctp: Don't inherit do_auto_asconf in sctp_clone_sock().
syzbot reported list_del(&sp->auto_asconf_list) corruption
in sctp_destroy_sock().
The repro calls setsockopt(SCTP_AUTO_ASCONF, 1) to a SCTP
listener, calls accept(), and close()s the child socket.
setsockopt(SCTP_AUTO_ASCONF, 1) sets sp->do_auto_asconf
to 1 and links sp->auto_asconf_list to a per-netns list.
Both fields are placed after sp->pd_lobby in struct sctp_sock,
and sctp_copy_descendant() did not copy the fields before the
cited commit.
Also, sctp_clone_sock() did not set them explicitly.
In addition, sctp_auto_asconf_init() is called from
sctp_sock_migrate(), but it initialises the fields only
conditionally.
The two fields relied on __GFP_ZERO added in sk_alloc(),
but sk_clone() does not use it.
Let's clear newsp->do_auto_asconf in sctp_clone_sock().
[0]:
list_del corruption. prev->next should be
ffff8880799e9148, but was
ffff8880799e8808. (prev=
ffff88803347d9f8)
kernel BUG at lib/list_debug.c:64!
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 0 UID: 0 PID: 6008 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/02/2025
RIP: 0010:__list_del_entry_valid_or_report+0x15a/0x190 lib/list_debug.c:62
Code: e8 7b 26 71 fd 43 80 3c 2c 00 74 08 4c 89 ff e8 7c ee 92 fd 49 8b 17 48 c7 c7 80 0a bf 8b 48 89 de 4c 89 f9 e8 07 c6 94 fc 90 <0f> 0b 4c 89 f7 e8 4c 26 71 fd 43 80 3c 2c 00 74 08 4c 89 ff e8 4d
RSP: 0018:
ffffc90003067ad8 EFLAGS:
00010246
RAX:
000000000000006d RBX:
ffff8880799e9148 RCX:
b056988859ee6e00
RDX:
0000000000000000 RSI:
0000000000000202 RDI:
0000000000000000
RBP:
dffffc0000000000 R08:
ffffc90003067807 R09:
1ffff9200060cf00
R10:
dffffc0000000000 R11:
fffff5200060cf01 R12:
1ffff1100668fb3f
R13:
dffffc0000000000 R14:
ffff88803347d9f8 R15:
ffff88803347d9f8
FS:
00005555823e5500(0000) GS:
ffff88812613e000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
0000200000000480 CR3:
00000000741ce000 CR4:
00000000003526f0
Call Trace:
<TASK>
__list_del_entry_valid include/linux/list.h:132 [inline]
__list_del_entry include/linux/list.h:223 [inline]
list_del include/linux/list.h:237 [inline]
sctp_destroy_sock+0xb4/0x370 net/sctp/socket.c:5163
sk_common_release+0x75/0x310 net/core/sock.c:3961
sctp_close+0x77e/0x900 net/sctp/socket.c:1550
inet_release+0x144/0x190 net/ipv4/af_inet.c:437
__sock_release net/socket.c:662 [inline]
sock_close+0xc3/0x240 net/socket.c:1455
__fput+0x44c/0xa70 fs/file_table.c:468
task_work_run+0x1d4/0x260 kernel/task_work.c:227
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
exit_to_user_mode_loop+0xe9/0x130 kernel/entry/common.c:43
exit_to_user_mode_prepare include/linux/irq-entry-common.h:225 [inline]
syscall_exit_to_user_mode_work include/linux/entry-common.h:175 [inline]
syscall_exit_to_user_mode include/linux/entry-common.h:210 [inline]
do_syscall_64+0x2bd/0xfa0 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Fixes:
16942cf4d3e3 ("sctp: Use sk_clone() in sctp_accept().")
Reported-by: syzbot+ba535cb417f106327741@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/
690d2185.
a70a0220.22f260.000e.GAE@google.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20251106223418.1455510-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Martin KaFai Lau [Mon, 10 Nov 2025 19:10:09 +0000 (11:10 -0800)]
Merge branch 'net-smc-introduce-smc_hs_ctrl'
D. Wythe says:
====================
net/smc: Introduce smc_hs_ctrl
This patch aims to introduce BPF injection capabilities for SMC and
includes a self-test to ensure code stability.
Since the SMC protocol isn't ideal for every situation, especially
short-lived ones, most applications can't guarantee the absence of
such scenarios. Consequently, applications may need specific strategies
to decide whether to use SMC. For example, an application might limit SMC
usage to certain IP addresses or ports.
To maintain the principle of transparent replacement, we want applications
to remain unaffected even if they need specific SMC strategies. In other
words, they should not require recompilation of their code.
Additionally, we need to ensure the scalability of strategy implementation.
While using socket options or sysctl might be straightforward, it could
complicate future expansions.
Fortunately, BPF addresses these concerns effectively. Users can write
their own strategies in eBPF to determine whether to use SMC, and they can
easily modify those strategies in the future.
This is a rework of the series from [1]. Changes since [1] are limited to
the SMC parts:
1. Rename smc_ops to smc_hs_ctrl and change interface name.
2. Squash SMC patches, removing standalone non-BPF hook capability.
3. Fix typos
[1]: https://lore.kernel.org/bpf/
20250123015942.94810-1-alibuda@linux.alibaba.com/#t
v2 -> v1:
- Removed the fixes patch, which have already been merged on current branch.
- Fixed compilation warning of smc_call_hsbpf() when CONFIG_SMC_HS_CTRL_BPF
is not enabled.
- Changed the default value of CONFIG_SMC_HS_CTRL_BPF to Y.
- Fix typo and renamed some variables
v3 -> v2:
- Removed the libbpf patch, which have already been merged on current branch.
- Fixed sparse warning of smc_call_hsbpf() and xchg().
v4 -> v3:
- Rebased on latest bpf-next, updated SMC loopback config from SMC_LO to DIBS_LO
per upstream changes.
v5 -> v4:
- Removed the redundant sk parameter from smc_call_hsbpf
- Reject registration when bpf_link is set, link support will be added in the
future.
- Updated selftests with new test heplers.
====================
Link: https://patch.msgid.link/20251107035632.115950-1-alibuda@linux.alibaba.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
D. Wythe [Fri, 7 Nov 2025 03:56:32 +0000 (11:56 +0800)]
bpf/selftests: Add selftest for bpf_smc_hs_ctrl
This tests introduces a tiny smc_hs_ctrl for filtering SMC connections
based on IP pairs, and also adds a realistic topology model to verify it.
Also, we can only use SMC loopback under CI test, so an additional
configuration needs to be enabled.
Follow the steps below to run this test.
make -C tools/testing/selftests/bpf
cd tools/testing/selftests/bpf
sudo ./test_progs -t smc
Results shows:
Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Tested-by: Saket Kumar Bhaskar <skb99@linux.ibm.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Link: https://patch.msgid.link/20251107035632.115950-4-alibuda@linux.alibaba.com
D. Wythe [Fri, 7 Nov 2025 03:56:31 +0000 (11:56 +0800)]
net/smc: bpf: Introduce generic hook for handshake flow
The introduction of IPPROTO_SMC enables eBPF programs to determine
whether to use SMC based on the context of socket creation, such as
network namespaces, PID and comm name, etc.
As a subsequent enhancement, to introduce a new generic hook that
allows decisions on whether to use SMC or not at runtime, including
but not limited to local/remote IP address or ports.
User can write their own implememtion via bpf_struct_ops now to choose
whether to use SMC or not before TCP 3rd handshake to be comleted.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Link: https://patch.msgid.link/20251107035632.115950-3-alibuda@linux.alibaba.com
D. Wythe [Fri, 7 Nov 2025 03:56:30 +0000 (11:56 +0800)]
bpf: Export necessary symbols for modules with struct_ops
Exports three necessary symbols for implementing struct_ops with
tristate subsystem.
To hold or release refcnt of struct_ops refcnt by inline funcs
bpf_try_module_get and bpf_module_put which use bpf_struct_ops_get(put)
conditionally.
And to copy obj name from one to the other with effective checks by
bpf_obj_name_cpy.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251107035632.115950-2-alibuda@linux.alibaba.com
Martin KaFai Lau [Mon, 10 Nov 2025 18:52:33 +0000 (10:52 -0800)]
Merge branch 'make-tc-bpf-helpers-preserve-skb-metadata'
Jakub Sitnicki says:
====================
Make TC BPF helpers preserve skb metadata
Changes in v4:
- Fix copy-paste bug in check_metadata() test helper (AI review)
- Add "out of scope" section (at the bottom)
- Link to v3: https://lore.kernel.org/r/
20251026-skb-meta-rx-path-v3-0-
37cceebb95d3@cloudflare.com
Changes in v3:
- Use the already existing BPF_STREAM_STDERR const in tests (Martin)
- Unclone skb head on bpf_dynptr_write to skb metadata (patch 3) (Martin)
- Swap order of patches 1 & 2 to refer to skb_postpush_data_move() in docs
- Mention in skb_data_move() docs how to move just the metadata
- Note in pskb_expand_head() docs to move metadata after skb_push() (Jakub)
- Link to v2: https://lore.kernel.org/r/
20251019-skb-meta-rx-path-v2-0-
f9a58f3eb6d6@cloudflare.com
Changes in v2:
- Tweak WARN_ON_ONCE check in skb_data_move() (patch 2)
- Convert all tests to verify skb metadata in BPF (patches 9-10)
- Add test coverage for modified BPF helpers (patches 12-15)
- Link to RFCv1: https://lore.kernel.org/r/
20250929-skb-meta-rx-path-v1-0-
de700a7ab1cb@cloudflare.com
This patch set continues our work [1] to allow BPF programs and user-space
applications to attach multiple bytes of metadata to packets via the
XDP/skb metadata area.
The focus of this patch set it to ensure that skb metadata remains intact
when packets pass through a chain of TC BPF programs that call helpers
which operate on skb head.
Currently, several helpers that either adjust the skb->data pointer or
reallocate skb->head do not preserve metadata at its expected location,
that is immediately in front of the MAC header. These are:
- bpf_skb_adjust_room
- bpf_skb_change_head
- bpf_skb_change_proto
- bpf_skb_change_tail
- bpf_skb_vlan_pop
- bpf_skb_vlan_push
In TC BPF context, metadata must be moved whenever skb->data changes to
keep the skb->data_meta pointer valid. I don't see any way around
it. Creative ideas how to avoid that would be very welcome.
With that in mind, we can patch the helpers in at least two different ways:
1. Integrate metadata move into header move
Replace the existing memmove, which follows skb_push/pull, with a helper
that moves both headers and metadata in a single call. This avoids an
extra memmove but reduces transparency.
skb_pull(skb, len);
- memmove(skb->data, skb->data - len, n);
+ skb_postpull_data_move(skb, len, n);
skb->mac_header += len;
skb_push(skb, len)
- memmove(skb->data, skb->data + len, n);
+ skb_postpush_data_move(skb, len, n);
skb->mac_header -= len;
2. Move metadata separately
Add a dedicated metadata move after the header move. This is more
explicit but costs an additional memmove.
skb_pull(skb, len);
memmove(skb->data, skb->data - len, n);
+ skb_metadata_postpull_move(skb, len);
skb->mac_header += len;
skb_push(skb, len)
+ skb_metadata_postpush_move(skb, len);
memmove(skb->data, skb->data + len, n);
skb->mac_header -= len;
This patch set implements option (1), expecting that "you can have just one
memmove" will be the most obvious feedback, while readability is a,
somewhat subjective, matter of taste, which I don't claim to have ;-)
The structure of the patch set is as follows:
- patches 1-4 prepare ground for safe-proofing the BPF helpers
- patches 5-9 modify the BPF helpers to preserve skb metadata
- patches 10-11 prepare ground for metadata tests with BPF helper calls
- patches 12-16 adapt and expand tests to cover the made changes
Out of scope for this series:
- safe-proofing tunnel & tagging devices - VLAN, GRE, ...
(next in line, in development preview at [2])
- metadata access after packet foward
(to do after Rx path - once metadata reliably reaches sk_filter)
Thanks,
-jkbs
[1] https://lore.kernel.org/all/
20250814-skb-metadata-thru-dynptr-v7-0-
8a39e636e0fb@cloudflare.com/
[2] https://github.com/jsitnicki/linux/commits/skb-meta/safeproof-netdevs/
====================
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-0-5ceb08a9b37b@cloudflare.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:53 +0000 (21:19 +0100)]
selftests/bpf: Cover skb metadata access after bpf_skb_change_proto
Add a test to verify that skb metadata remains accessible after calling
bpf_skb_change_proto(), which modifies packet headroom to accommodate
different IP header sizes.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-16-5ceb08a9b37b@cloudflare.com
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:52 +0000 (21:19 +0100)]
selftests/bpf: Cover skb metadata access after change_head/tail helper
Add a test to verify that skb metadata remains accessible after calling
bpf_skb_change_head() and bpf_skb_change_tail(), which modify packet
headroom/tailroom and can trigger head reallocation.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-15-5ceb08a9b37b@cloudflare.com
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:51 +0000 (21:19 +0100)]
selftests/bpf: Cover skb metadata access after bpf_skb_adjust_room
Add a test to verify that skb metadata remains accessible after calling
bpf_skb_adjust_room(), which modifies the packet headroom and can trigger
head reallocation.
The helper expects an Ethernet frame carrying an IP packet so switch test
packet identification by source MAC address since we can no longer rely on
Ethernet proto being set to zero.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-14-5ceb08a9b37b@cloudflare.com
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:50 +0000 (21:19 +0100)]
selftests/bpf: Cover skb metadata access after vlan push/pop helper
Add a test to verify that skb metadata remains accessible after calling
bpf_skb_vlan_push() and bpf_skb_vlan_pop(), which modify the packet
headroom.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-13-5ceb08a9b37b@cloudflare.com
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:49 +0000 (21:19 +0100)]
selftests/bpf: Expect unclone to preserve skb metadata
Since pskb_expand_head() no longer clears metadata on unclone, update tests
for cloned packets to expect metadata to remain intact.
Also simplify the clone_dynptr_kept_on_{data,meta}_slice_write tests.
Creating an r/w dynptr slice is sufficient to trigger an unclone in the
prologue, so remove the extraneous writes to the data/meta slice.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-12-5ceb08a9b37b@cloudflare.com
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:48 +0000 (21:19 +0100)]
selftests/bpf: Dump skb metadata on verification failure
Add diagnostic output when metadata verification fails to help with
troubleshooting test failures. Introduce a check_metadata() helper that
prints both expected and received metadata to the BPF program's stderr
stream on mismatch. The userspace test reads and dumps this stream on
failure.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-11-5ceb08a9b37b@cloudflare.com
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:47 +0000 (21:19 +0100)]
selftests/bpf: Verify skb metadata in BPF instead of userspace
Move metadata verification into the BPF TC programs. Previously,
userspace read metadata from a map and verified it once at test end.
Now TC programs compare metadata directly using __builtin_memcmp() and
set a test_pass flag. This enables verification at multiple points during
test execution rather than a single final check.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-10-5ceb08a9b37b@cloudflare.com
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:46 +0000 (21:19 +0100)]
bpf: Make bpf_skb_change_head helper metadata-safe
Although bpf_skb_change_head() doesn't move packet data after skb_push(),
skb metadata still needs to be relocated. Use the dedicated helper to
handle it.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-9-5ceb08a9b37b@cloudflare.com
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:45 +0000 (21:19 +0100)]
bpf: Make bpf_skb_change_proto helper metadata-safe
bpf_skb_change_proto reuses the same headroom operations as
bpf_skb_adjust_room, already updated to handle metadata safely.
The remaining step is to ensure that there is sufficient headroom to
accommodate metadata on skb_push().
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-8-5ceb08a9b37b@cloudflare.com
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:44 +0000 (21:19 +0100)]
bpf: Make bpf_skb_adjust_room metadata-safe
bpf_skb_adjust_room() may push or pull bytes from skb->data. In both cases,
skb metadata must be moved accordingly to stay accessible.
Replace existing memmove() calls, which only move payload, with a helper
that also handles metadata. Reserve enough space for metadata to fit after
skb_push.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-7-5ceb08a9b37b@cloudflare.com
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:43 +0000 (21:19 +0100)]
bpf: Make bpf_skb_vlan_push helper metadata-safe
Use the metadata-aware helper to move packet bytes after skb_push(),
ensuring metadata remains valid after calling the BPF helper.
Also, take care to reserve sufficient headroom for metadata to fit.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-6-5ceb08a9b37b@cloudflare.com
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:42 +0000 (21:19 +0100)]
bpf: Make bpf_skb_vlan_pop helper metadata-safe
Use the metadata-aware helper to move packet bytes after skb_pull(),
ensuring metadata remains valid after calling the BPF helper.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-5-5ceb08a9b37b@cloudflare.com
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:41 +0000 (21:19 +0100)]
vlan: Make vlan_remove_tag return nothing
All callers ignore the return value.
Prepare to reorder memmove() after skb_pull() which is a common pattern.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-4-5ceb08a9b37b@cloudflare.com
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:40 +0000 (21:19 +0100)]
bpf: Unclone skb head on bpf_dynptr_write to skb metadata
Currently bpf_dynptr_from_skb_meta() marks the dynptr as read-only when
the skb is cloned, preventing writes to metadata.
Remove this restriction and unclone the skb head on bpf_dynptr_write() to
metadata, now that the metadata is preserved during uncloning. This makes
metadata dynptr consistent with skb dynptr, allowing writes regardless of
whether the skb is cloned.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-3-5ceb08a9b37b@cloudflare.com
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:39 +0000 (21:19 +0100)]
net: Preserve metadata on pskb_expand_head
pskb_expand_head() copies headroom, including skb metadata, into the newly
allocated head, but then clears the metadata. As a result, metadata is lost
when BPF helpers trigger an skb head reallocation.
Let the skb metadata remain in the newly created copy of head.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-2-5ceb08a9b37b@cloudflare.com
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:38 +0000 (21:19 +0100)]
net: Helper to move packet data and metadata after skb_push/pull
Lay groundwork for fixing BPF helpers available to TC(X) programs.
When skb_push() or skb_pull() is called in a TC(X) ingress BPF program, the
skb metadata must be kept in front of the MAC header. Otherwise, BPF
programs using the __sk_buff->data_meta pseudo-pointer lose access to it.
Introduce a helper that moves both metadata and a specified number of
packet data bytes together, suitable as a drop-in replacement for
memmove().
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-1-5ceb08a9b37b@cloudflare.com
Jakub Kicinski [Sat, 8 Nov 2025 03:15:36 +0000 (19:15 -0800)]
Merge branch '40GbE' of git://git./linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2025-11-06 (i40, ice, iavf)
Mohammad Heib introduces a new devlink parameter, max_mac_per_vf, for
controlling the maximum number of MAC address filters allowed by a VF. This
allows administrators to control the VF behavior in a more nuanced manner.
Aleksandr and Przemek add support for Receive Side Scaling of GTP to iAVF
for VFs running on E800 series ice hardware. This improves performance and
scalability for virtualized network functions in 5G and LTE deployments.
* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
iavf: add RSS support for GTP protocol via ethtool
ice: Extend PTYPE bitmap coverage for GTP encapsulated flows
ice: improve TCAM priority handling for RSS profiles
ice: implement GTP RSS context tracking and configuration
ice: add virtchnl definitions and static data for GTP RSS
ice: add flow parsing for GTP and new protocol field support
i40e: support generic devlink param "max_mac_per_vf"
devlink: Add new "max_mac_per_vf" generic device param
====================
Link: https://patch.msgid.link/20251106225321.1609605-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sat, 8 Nov 2025 03:05:51 +0000 (19:05 -0800)]
Merge branch 'net-stmmac-lpc18xx-and-sti-convert-to-set_phy_intf_sel'
Russell King says:
====================
net: stmmac: lpc18xx and sti: convert to set_phy_intf_sel()
This series converts lpc18xx and sti to use the new .set_phy_intf_sel()
method.
====================
Link: https://patch.msgid.link/aQyEs4DAZRWpAz32@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Thu, 6 Nov 2025 11:23:52 +0000 (11:23 +0000)]
net: stmmac: sti: use ->set_phy_intf_sel()
Rather than placing the phy_intf_sel() setup in the ->init() method,
move it to the new ->set_phy_intf_sel() method.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vGy5o-0000000DhQn-34JE@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Thu, 6 Nov 2025 11:23:47 +0000 (11:23 +0000)]
net: stmmac: sti: use stmmac_get_phy_intf_sel()
Use stmmac_get_phy_intf_sel() to decode the PHY interface mode to the
phy_intf_sel value, validate the result and use that to set the
control register to select the operating mode for the DWMAC core.
Note that when an unsupported interface mode is used, the array would
decode this to PHY_INTF_SEL_GMII_MII, so preserve this behaviour.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vGy5j-0000000DhQh-2e0x@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Thu, 6 Nov 2025 11:23:42 +0000 (11:23 +0000)]
net: stmmac: sti: use PHY_INTF_SEL_x directly
Use the PHY_INTF_SEL_x values directly rather than the driver private
ETH_PHY_SEL_x values. Move the FIELD_PREP() into sti_dwmac_set_mode().
Use dwmac->interface directly.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vGy5e-0000000DhQb-2B7I@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Thu, 6 Nov 2025 11:23:37 +0000 (11:23 +0000)]
net: stmmac: sti: use PHY_INTF_SEL_x to select PHY interface
Use the common dwmac definitions for the PHY interface selection field,
adding MII_PHY_SEL_VAL() temporarily to avoid line wrapping.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vGy5Z-0000000DhQV-1e2l@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Thu, 6 Nov 2025 11:23:32 +0000 (11:23 +0000)]
net: stmmac: lpc18xx: use ->set_phy_intf_sel()
Move the configuration of the dwmac PHY interface selection to the new
->set_phy_intf_sel() method.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vGy5U-0000000DhQP-19Hd@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Thu, 6 Nov 2025 11:23:27 +0000 (11:23 +0000)]
net: stmmac: lpc18xx: validate phy_intf_sel
Validate the phy_intf_sel value rather than the PHY interface mode.
This will allow us to transition to the ->set_phy_intf_sel() method.
Note that this will allow GMII as well as MII as the phy_intf_sel
value is the same for both.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vGy5P-0000000DhQJ-0Oi3@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Thu, 6 Nov 2025 11:23:21 +0000 (11:23 +0000)]
net: stmmac: lpc18xx: use stmmac_get_phy_intf_sel()
Use stmmac_get_phy_intf_sel() to decode the PHY interface mode to the
phy_intf_sel value, and use the result to program the ethernet mode.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vGy5J-0000000DhQD-46Ob@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Thu, 6 Nov 2025 11:23:16 +0000 (11:23 +0000)]
net: stmmac: lpc18xx: use PHY_INTF_SEL_x directly
Use the PHY_INTF_SEL_x values directly rather than the driver private
LPC18XX_CREG_CREG6_ETHMODE_x definitions, and convert
LPC18XX_CREG_CREG6_ETHMODE_MASK to use GENMASK().
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vGy5E-0000000DhQ7-3cuy@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Thu, 6 Nov 2025 11:23:11 +0000 (11:23 +0000)]
net: stmmac: lpc18xx: convert to PHY_INTF_SEL_x
Use the common dwmac definitions for the PHY interface selection field.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vGy59-0000000DhQ1-393H@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sat, 8 Nov 2025 03:02:42 +0000 (19:02 -0800)]
Merge branch 'net-use-skb_attempt_defer_free-in-napi_consume_skb'
Eric Dumazet says:
====================
net: use skb_attempt_defer_free() in napi_consume_skb()
There is a lack of NUMA awareness and more generally lack
of slab caches affinity on TX completion path.
Modern drivers are using napi_consume_skb(), hoping to cache sk_buff
in per-cpu caches so that they can be recycled in RX path.
Only use this if the skb was allocated on the same cpu,
otherwise use skb_attempt_defer_free() so that the skb
is freed on the original cpu.
This removes contention on SLUB spinlocks and data structures,
and this makes sure that recycled sk_buff have correct NUMA locality.
After this series, I get ~50% improvement for an UDP tx workload
on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).
I will later refactor skb_attempt_defer_free()
to no longer have to care of skb_shared() and skb_release_head_state().
====================
Link: https://patch.msgid.link/20251106202935.1776179-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Thu, 6 Nov 2025 20:29:35 +0000 (20:29 +0000)]
net: increase skb_defer_max default to 128
skb_defer_max value is very conservative, and can be increased
to avoid too many calls to kick_defer_list_purge().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20251106202935.1776179-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Thu, 6 Nov 2025 20:29:34 +0000 (20:29 +0000)]
net: fix napi_consume_skb() with alien skbs
There is a lack of NUMA awareness and more generally lack
of slab caches affinity on TX completion path.
Modern drivers are using napi_consume_skb(), hoping to cache sk_buff
in per-cpu caches so that they can be recycled in RX path.
Only use this if the skb was allocated on the same cpu,
otherwise use skb_attempt_defer_free() so that the skb
is freed on the original cpu.
This removes contention on SLUB spinlocks and data structures.
After this patch, I get ~50% improvement for an UDP tx workload
on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).
80 Mpps -> 120 Mpps.
Profiling one of the 32 cpus servicing NIC interrupts :
Before:
mpstat -P 511 1 1
Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
Average: 511 0.00 0.00 0.00 0.00 0.00 98.00 0.00 0.00 0.00 2.00
31.01% ksoftirqd/511 [kernel.kallsyms] [k] queued_spin_lock_slowpath
12.45% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath
5.60% ksoftirqd/511 [kernel.kallsyms] [k] __slab_free
3.31% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
3.27% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
2.95% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_start
2.52% ksoftirqd/511 [kernel.kallsyms] [k] fq_dequeue
2.32% ksoftirqd/511 [kernel.kallsyms] [k] read_tsc
2.25% ksoftirqd/511 [kernel.kallsyms] [k] build_detached_freelist
2.15% ksoftirqd/511 [kernel.kallsyms] [k] kmem_cache_free
2.11% swapper [kernel.kallsyms] [k] __slab_free
2.06% ksoftirqd/511 [kernel.kallsyms] [k] idpf_features_check
2.01% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
1.97% ksoftirqd/511 [kernel.kallsyms] [k] skb_release_data
1.52% ksoftirqd/511 [kernel.kallsyms] [k] sock_wfree
1.34% swapper [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
1.23% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
1.15% ksoftirqd/511 [kernel.kallsyms] [k] dma_unmap_page_attrs
1.11% swapper [kernel.kallsyms] [k] idpf_tx_splitq_start
1.03% swapper [kernel.kallsyms] [k] fq_dequeue
0.94% swapper [kernel.kallsyms] [k] kmem_cache_free
0.93% swapper [kernel.kallsyms] [k] read_tsc
0.81% ksoftirqd/511 [kernel.kallsyms] [k] napi_consume_skb
0.79% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
0.77% ksoftirqd/511 [kernel.kallsyms] [k] skb_free_head
0.76% swapper [kernel.kallsyms] [k] idpf_features_check
0.72% swapper [kernel.kallsyms] [k] skb_release_data
0.69% swapper [kernel.kallsyms] [k] build_detached_freelist
0.58% ksoftirqd/511 [kernel.kallsyms] [k] skb_release_head_state
0.56% ksoftirqd/511 [kernel.kallsyms] [k] __put_partials
0.55% ksoftirqd/511 [kernel.kallsyms] [k] kmem_cache_free_bulk
0.48% swapper [kernel.kallsyms] [k] sock_wfree
After:
mpstat -P 511 1 1
Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
Average: 511 0.00 0.00 0.00 0.00 0.00 51.49 0.00 0.00 0.00 48.51
19.10% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
13.86% swapper [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
10.80% swapper [kernel.kallsyms] [k] skb_attempt_defer_free
10.57% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
7.18% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath
6.69% swapper [kernel.kallsyms] [k] sock_wfree
5.55% swapper [kernel.kallsyms] [k] dma_unmap_page_attrs
3.10% swapper [kernel.kallsyms] [k] fq_dequeue
3.00% swapper [kernel.kallsyms] [k] skb_release_head_state
2.73% swapper [kernel.kallsyms] [k] read_tsc
2.48% swapper [kernel.kallsyms] [k] idpf_tx_splitq_start
1.20% swapper [kernel.kallsyms] [k] idpf_features_check
1.13% swapper [kernel.kallsyms] [k] napi_consume_skb
0.93% swapper [kernel.kallsyms] [k] idpf_vport_splitq_napi_poll
0.64% swapper [kernel.kallsyms] [k] native_send_call_func_single_ipi
0.60% swapper [kernel.kallsyms] [k] acpi_processor_ffh_cstate_enter
0.53% swapper [kernel.kallsyms] [k] io_idle
0.43% swapper [kernel.kallsyms] [k] netif_skb_features
0.41% swapper [kernel.kallsyms] [k] __direct_call_cpuidle_state_enter2
0.40% swapper [kernel.kallsyms] [k] native_irq_return_iret
0.40% swapper [kernel.kallsyms] [k] idpf_tx_buf_hw_update
0.36% swapper [kernel.kallsyms] [k] sched_clock_noinstr
0.34% swapper [kernel.kallsyms] [k] handle_softirqs
0.32% swapper [kernel.kallsyms] [k] net_rx_action
0.32% swapper [kernel.kallsyms] [k] dql_completed
0.32% swapper [kernel.kallsyms] [k] validate_xmit_skb
0.31% swapper [kernel.kallsyms] [k] skb_network_protocol
0.29% swapper [kernel.kallsyms] [k] skb_csum_hwoffload_help
0.29% swapper [kernel.kallsyms] [k] x2apic_send_IPI
0.28% swapper [kernel.kallsyms] [k] ktime_get
0.24% swapper [kernel.kallsyms] [k] __qdisc_run
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20251106202935.1776179-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Thu, 6 Nov 2025 20:29:33 +0000 (20:29 +0000)]
net: allow skb_release_head_state() to be called multiple times
Currently, only skb dst is cleared (thanks to skb_dst_drop())
Make sure skb->destructor, conntrack and extensions are cleared.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20251106202935.1776179-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Thu, 6 Nov 2025 08:55:00 +0000 (08:55 +0000)]
net: add prefetch() in skb_defer_free_flush()
skb_defer_free_flush() is becoming more important these days.
Add a prefetch operation to reduce latency a bit on some
platforms like AMD EPYC 7B12.
On more recent cpus, a stall happens when reading skb_shinfo().
Avoiding it will require a more elaborate strategy.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/20251106085500.2438951-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sat, 8 Nov 2025 02:54:07 +0000 (18:54 -0800)]
Merge branch 'psp-track-stats-from-core-and-provide-a-driver-stats-api'
Daniel Zahka says:
====================
psp: track stats from core and provide a driver stats api
This series introduces stats counters for psp. Device key rotations,
and so called 'stale-events' are common to all drivers and are tracked
by the core.
A driver facing api is provided for reporting stats required by the
"Implementation Requirements" section of the PSP Architecture
Specification. Drivers must implement these stats.
Lastly, implementations of the driver stats api for mlx5 and netdevsim
are included.
Here is the output of running the psp selftest suite and then
printing out stats with the ynl cli on system with a psp-capable CX7:
$ ./ksft-psp-stats/drivers/net/psp.py
TAP version 13
1..28
ok 1 psp.test_case # SKIP Test requires IPv4 connectivity
ok 2 psp.data_basic_send_v0_ip6
ok 3 psp.test_case # SKIP Test requires IPv4 connectivity
ok 4 psp.data_basic_send_v1_ip6
ok 5 psp.test_case # SKIP Test requires IPv4 connectivity
ok 6 psp.data_basic_send_v2_ip6 # SKIP ('PSP version not supported', 'hdr0-aes-gmac-128')
ok 7 psp.test_case # SKIP Test requires IPv4 connectivity
ok 8 psp.data_basic_send_v3_ip6 # SKIP ('PSP version not supported', 'hdr0-aes-gmac-256')
ok 9 psp.test_case # SKIP Test requires IPv4 connectivity
ok 10 psp.data_mss_adjust_ip6
ok 11 psp.dev_list_devices
ok 12 psp.dev_get_device
ok 13 psp.dev_get_device_bad
ok 14 psp.dev_rotate
ok 15 psp.dev_rotate_spi
ok 16 psp.assoc_basic
ok 17 psp.assoc_bad_dev
ok 18 psp.assoc_sk_only_conn
ok 19 psp.assoc_sk_only_mismatch
ok 20 psp.assoc_sk_only_mismatch_tx
ok 21 psp.assoc_sk_only_unconn
ok 22 psp.assoc_version_mismatch
ok 23 psp.assoc_twice
ok 24 psp.data_send_bad_key
ok 25 psp.data_send_disconnect
ok 26 psp.data_stale_key
ok 27 psp.removal_device_rx # XFAIL Test only works on netdevsim
ok 28 psp.removal_device_bi # XFAIL Test only works on netdevsim
# Totals: pass:19 fail:0 xfail:2 xpass:0 skip:7 error:0
#
# Responder logs (0):
# STDERR:
# Set PSP enable on device 1 to 0x3
# Set PSP enable on device 1 to 0x0
$ ynl --family psp --dump get-stats
[{'dev-id': 1,
'key-rotations': 5,
'rx-auth-fail': 21,
'rx-bad': 0,
'rx-bytes': 11844,
'rx-error': 0,
'rx-packets': 94,
'stale-events': 6,
'tx-bytes':
1128456,
'tx-error': 0,
'tx-packets': 780}]
====================
Link: https://patch.msgid.link/20251106002608.1578518-1-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Zahka [Thu, 6 Nov 2025 00:26:06 +0000 (16:26 -0800)]
netdevsim: implement psp device stats
For now only tx/rx packets/bytes are reported. This is not compliant
with the PSP Architecture Specification.
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20251106002608.1578518-6-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 6 Nov 2025 00:26:05 +0000 (16:26 -0800)]
net/mlx5e: Add PSP stats support for Rx/Tx flows
Add all statistics described under the "Implementation Requirements"
section of the PSP Architecture Specification:
Rx successfully decrypted PSP packets:
psp_rx_pkts : Number of packets decrypted successfully
psp_rx_bytes : Number of bytes decrypted successfully
Rx PSP authentication failure statistics:
psp_rx_pkts_auth_fail : Number of PSP packets that failed authentication
psp_rx_bytes_auth_fail : Number of PSP bytes that failed authentication
Rx PSP bad frame error statistics:
psp_rx_pkts_frame_err;
psp_rx_bytes_frame_err;
Rx PSP drop statistics:
psp_rx_pkts_drop : Number of PSP packets dropped
psp_rx_bytes_drop : Number of PSP bytes dropped
Tx successfully encrypted PSP packets:
psp_tx_pkts : Number of packets encrypted successfully
psp_tx_bytes : Number of bytes encrypted successfully
Tx drops:
tx_drop : Number of misc psp related drops
The above can be seen using the ynl cli:
./pyynl/cli.py --spec netlink/specs/psp.yaml --dump get-stats
Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20251106002608.1578518-5-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 6 Nov 2025 00:26:04 +0000 (16:26 -0800)]
psp: add stats from psp spec to driver facing api
Provide a driver api for reporting device statistics required by the
"Implementation Requirements" section of the PSP Architecture
Specification. Use a warning to ensure drivers report stats required
by the spec.
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20251106002608.1578518-4-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Zahka [Thu, 6 Nov 2025 00:26:03 +0000 (16:26 -0800)]
selftests: drv-net: psp: add assertions on core-tracked psp dev stats
Add assertions to existing test cases to cover key rotations and
'stale-events'.
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20251106002608.1578518-3-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 6 Nov 2025 00:26:02 +0000 (16:26 -0800)]
psp: report basic stats from the core
Track and report stats common to all psp devices from the core. A
'stale-event' is when the core marks the rx state of an active
psp_assoc as incapable of authenticating psp encapsulated data.
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20251106002608.1578518-2-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Heiner Kallweit [Wed, 5 Nov 2025 22:09:17 +0000 (23:09 +0100)]
net: phy: fixed_phy: shrink size of struct fixed_phy_status
All three members are effectively of type bool, so make this explicit
and shrink size of struct fixed_phy_status.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/9eca3d7e-fa64-4724-8fdc-f2c1a8f2ae8f@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sat, 8 Nov 2025 02:52:36 +0000 (18:52 -0800)]
Merge branch 'net-phy-add-open-alliance-tc14-10base-t1s-phy-cable-diagnostic-support'
Parthiban Veerasooran says:
====================
net: phy: Add Open Alliance TC14 10Base-T1S PHY cable diagnostic support
This patch series adds Open Alliance TC14 (OATC14) 10BASE-T1S cable
diagnostic feature support to the Linux kernel PHY subsystem and enable
this feature for Microchip LAN867x Rev.D0 PHYs. These patches provide
standardized cable test functionality for 10BASE-T1S Ethernet PHYs,
allowing users to perform cable diagnostics via ethtool.
Patch Summary:
1. add OATC14 10BASE-T1S PHY cable diagnostic support
- Implements support for the OATC14 cable diagnostic feature in
Clause 45 PHYs.
- Adds functions to start a cable test and retrieve its status,
mapping hardware results to ethtool codes.
- Exports these functions for use by PHY drivers.
- Open Alliance TC14 10BASE-T1S Advanced Diagnostic PHY Features.
https://opensig.org/wp-content/uploads/2025/06/OPEN_Alliance_10BASE-T1S_Advanced_PHY_features_for-automotive_Ethernet_V2.1b.pdf
2. add cable diagnostic support for LAN867x Rev.D0
- Integrates the generic OATC14 cable test functions into the
Microchip LAN867x Rev.D0 PHY driver.
- Enables ethtool cable diagnostics for this PHY, improving
troubleshooting and maintenance.
====================
Link: https://patch.msgid.link/20251105051213.50443-1-parthiban.veerasooran@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Parthiban Veerasooran [Wed, 5 Nov 2025 05:12:13 +0000 (10:42 +0530)]
net: phy: microchip_t1s:: add cable diagnostic support for LAN867x Rev.D0
Enable Open Alliance TC14 (OATC14) 10Base-T1S cable diagnostic feature
for Microchip LAN867x Rev.D0 PHY by implementing `cable_test_start` and
`cable_test_get_status` using the generic C45 functions. This allows the
`ethtool` utility to perform cable diagnostic tests directly on the PHY,
improving network troubleshooting and maintenance.
Signed-off-by: Parthiban Veerasooran <parthiban.veerasooran@microchip.com>
Link: https://patch.msgid.link/20251105051213.50443-3-parthiban.veerasooran@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Parthiban Veerasooran [Wed, 5 Nov 2025 05:12:12 +0000 (10:42 +0530)]
net: phy: phy-c45: add OATC14 10BASE-T1S PHY cable diagnostic support
Add support for Open Alliance TC14 (OATC14) 10BASE-T1S PHYs cable
diagnostic feature.
This patch implements:
- genphy_c45_oatc14_cable_test_start() to initiate a cable test
- genphy_c45_oatc14_cable_test_get_status() to retrieve test results
- Helper function to map PHY cable test status to ethtool result codes
- Function declarations and exports for use by PHY drivers
This enables ethtool to report ok, open, short, and undetectable cable
conditions on OATC14 10Base-T1S PHYs.
Open Alliance TC14 10BASE-T1S Advanced Diagnostic PHY Features
Specification ref:
https://opensig.org/wp-content/uploads/2025/06/OPEN_Alliance_10BASE-T1S_Advanced_PHY_features_for-automotive_Ethernet_V2.1b.pdf
Signed-off-by: Parthiban Veerasooran <parthiban.veerasooran@microchip.com>
Link: https://patch.msgid.link/20251105051213.50443-2-parthiban.veerasooran@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Erni Sri Satya Vennela [Wed, 5 Nov 2025 19:04:28 +0000 (11:04 -0800)]
net: mana: Fix incorrect speed reported by debugfs
Once the netshaper is created for MANA, the current bandwidth
is reported in debugfs like this:
$ sudo ./tools/net/ynl/pyynl/cli.py \
--spec Documentation/netlink/specs/net_shaper.yaml \
--do set \
--json '{"ifindex":'3',
"handle":{ "scope": "netdev", "id":'1' },
"bw-max":
200000000 }'
None
$ sudo cat /sys/kernel/debug/mana/1/vport0/current_speed
200
After the shaper is deleted, it is expected to report
the maximum speed supported by the SKU. But currently it is
reporting 0, which is incorrect.
Fix this inconsistency, by resetting apc->speed to apc->max_speed
during deletion of the shaper object. This will improve
readability and debuggability.
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/1762369468-32570-1-git-send-email-ernis@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Lorenzo Bianconi [Thu, 6 Nov 2025 12:53:23 +0000 (13:53 +0100)]
net: airoha: Add the capability to consume out-of-order DMA tx descriptors
EN7581 and AN7583 SoCs are capable of DMA mapping non-linear tx skbs on
non-consecutive DMA descriptors. This feature is useful when multiple
flows are queued on the same hw tx queue since it allows to fully utilize
the available tx DMA descriptors and to avoid the starvation of
high-priority flow we have in the current codebase due to head-of-line
blocking introduced by low-priority flows.
Tested-by: Xuegang Lu <xuegang.lu@airoha.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20251106-airoha-tx-linked-list-v2-1-0706d4a322bd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Thu, 6 Nov 2025 11:52:36 +0000 (11:52 +0000)]
tcp: add net.ipv4.tcp_comp_sack_rtt_percent
TCP SACK compression has been added in 2018 in commit
5d9f4262b7ea ("tcp: add SACK compression").
It is working great for WAN flows (with large RTT).
Wifi in particular gets a significant boost _when_ ACK are suppressed.
Add a new sysctl so that we can tune the very conservative 5 % value
that has been used so far in this formula, so that small RTT flows
can benefit from this feature.
delay = min ( 5 % of RTT, 1 ms)
This patch adds new tcp_comp_sack_rtt_percent sysctl
to ease experiments and tuning.
Given that we cap the delay to 1ms (tcp_comp_sack_delay_ns sysctl),
set the default value to 33 %.
Quoting Neal Cardwell ( https://lore.kernel.org/netdev/CADVnQymZ1tFnEA1Q=vtECs0=Db7zHQ8=+WCQtnhHFVbEOzjVnQ@mail.gmail.com/ )
The rationale for 33% is basically to try to facilitate pipelining,
where there are always at least 3 ACKs and 3 GSO/TSO skbs per SRTT, so
that the path can maintain a budget for 3 full-sized GSO/TSO skbs "in
flight" at all times:
+ 1 skb in the qdisc waiting to be sent by the NIC next
+ 1 skb being sent by the NIC (being serialized by the NIC out onto the wire)
+ 1 skb being received and aggregated by the receiver machine's
aggregation mechanism (some combination of LRO, GRO, and sack
compression)
Note that this is basically the same magic number (3) and the same
rationales as:
(a) tcp_tso_should_defer() ensuring that we defer sending data for no
longer than cwnd/tcp_tso_win_divisor (where tcp_tso_win_divisor = 3),
and
(b) bbr_quantization_budget() ensuring that cwnd is at least 3 GSO/TSO
skbs to maintain pipelining and full throughput at low RTTs
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20251106115236.3450026-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sat, 8 Nov 2025 02:05:28 +0000 (18:05 -0800)]
Merge branch 'tcp-clean-up-syn-ack-rto-code-and-apply-max-rto'
Kuniyuki Iwashima says:
====================
tcp: Clean up SYN+ACK RTO code and apply max RTO.
Patch 1 - 4 are misc cleanup.
Patch 5 applies max RTO to non-TFO SYN+ACK.
Patch 6 adds a test for max RTO of SYN+ACK.
====================
Link: https://patch.msgid.link/20251106003357.273403-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kuniyuki Iwashima [Thu, 6 Nov 2025 00:32:45 +0000 (00:32 +0000)]
selftest: packetdrill: Add max RTO test for SYN+ACK.
This script sets net.ipv4.tcp_rto_max_ms to 1000 and checks
if SYN+ACK RTO is capped at 1s for TFO and non-TFO.
Without the previous patch, the max RTO is applied to TFO
SYN+ACK only, and non-TFO SYN+ACK RTO increases exponentially.
# selftests: net/packetdrill: tcp_rto_synack_rto_max.pkt
# TAP version 13
# 1..2
# tcp_rto_synack_rto_max.pkt:46: error handling packet: timing error:
expected outbound packet at 5.091936 sec but happened at 6.107826 sec; tolerance 0.127974 sec
# script packet: 5.091936 S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
# actual packet: 6.107826 S. 0:0(0) ack 1 win 65535 <mss 1460,nop,nop,sackOK>
# not ok 1 ipv4
# tcp_rto_synack_rto_max.pkt:46: error handling packet: timing error:
expected outbound packet at 5.075901 sec but happened at 6.091841 sec; tolerance 0.127976 sec
# script packet: 5.075901 S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
# actual packet: 6.091841 S. 0:0(0) ack 1 win 65535 <mss 1460,nop,nop,sackOK>
# not ok 2 ipv6
# # Totals: pass:0 fail:2 xfail:0 xpass:0 skip:0 error:0
not ok 49 selftests: net/packetdrill: tcp_rto_synack_rto_max.pkt # exit=1
With the previous patch, all SYN+ACKs are retransmitted
after 1s.
# selftests: net/packetdrill: tcp_rto_synack_rto_max.pkt
# TAP version 13
# 1..2
# ok 1 ipv4
# ok 2 ipv6
# # Totals: pass:2 fail:0 xfail:0 xpass:0 skip:0 error:0
ok 49 selftests: net/packetdrill: tcp_rto_synack_rto_max.pkt
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251106003357.273403-7-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kuniyuki Iwashima [Thu, 6 Nov 2025 00:32:44 +0000 (00:32 +0000)]
tcp: Apply max RTO to non-TFO SYN+ACK.
Since commit
54a378f43425 ("tcp: add the ability to control
max RTO"), TFO SYN+ACK RTO is capped by the TFO full sk's
inet_csk(sk)->icsk_rto_max.
The value is inherited from the parent listener.
Let's apply the same cap to non-TFO SYN+ACK.
Note that req->rsk_listener is always non-NULL when we call
tcp_reqsk_timeout() in reqsk_timer_handler() or tcp_check_req().
It could be NULL for SYN cookie req, but we do not use
req->timeout then.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251106003357.273403-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kuniyuki Iwashima [Thu, 6 Nov 2025 00:32:43 +0000 (00:32 +0000)]
tcp: Remove timeout arg from reqsk_timeout().
reqsk_timeout() is always called with @timeout being TCP_RTO_MAX.
Let's remove the arg.
As a prep for the next patch, reqsk_timeout() is moved to tcp.h
and renamed to tcp_reqsk_timeout().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251106003357.273403-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kuniyuki Iwashima [Thu, 6 Nov 2025 00:32:42 +0000 (00:32 +0000)]
tcp: Remove redundant init for req->num_timeout.
Commit
5903123f662e ("tcp: Use BPF timeout setting for SYN ACK
RTO") introduced req->timeout and initialised it in 3 places:
1. reqsk_alloc() sets 0
2. inet_reqsk_alloc() sets TCP_TIMEOUT_INIT
3. tcp_conn_request() sets tcp_timeout_init()
1. has been always redundant as 2. overwrites it immediately.
2. was necessary for TFO SYN+ACK but is no longer needed after
commit
8ea731d4c2ce ("tcp: Make SYN ACK RTO tunable by BPF
programs with TFO").
3. was moved to reqsk_queue_hash_req() in the previous patch.
Now, we always initialise req->timeout just before scheduling
the SYN+ACK timer:
* For non-TFO SYN+ACK : reqsk_queue_hash_req()
* For TFO SYN+ACK : tcp_fastopen_create_child()
Let's remove the redundant initialisation of req->timeout in
reqsk_alloc() and inet_reqsk_alloc().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251106003357.273403-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kuniyuki Iwashima [Thu, 6 Nov 2025 00:32:41 +0000 (00:32 +0000)]
tcp: Remove timeout arg from reqsk_queue_hash_req().
inet_csk_reqsk_queue_hash_add() is no longer shared by DCCP.
We do not need to pass req->timeout down to reqsk_queue_hash_req().
Let's move tcp_timeout_init() from tcp_conn_request() to
reqsk_queue_hash_req().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251106003357.273403-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kuniyuki Iwashima [Thu, 6 Nov 2025 00:32:40 +0000 (00:32 +0000)]
tcp: Call tcp_syn_ack_timeout() directly.
Since DCCP has been removed, we do not need to use
request_sock_ops.syn_ack_timeout().
Let's call tcp_syn_ack_timeout() directly.
Now other function pointers of request_sock_ops are
protocol-dependent.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251106003357.273403-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 4 Nov 2025 23:23:44 +0000 (15:23 -0800)]
netlink: specs: netdev add missing stats to qstat-get
Add missing entries in C attribute list.
Link: https://patch.msgid.link/20251104232348.1954349-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 7 Nov 2025 01:38:27 +0000 (17:38 -0800)]
Merge branch 'net-renesas-cleanup-usage-of-gptp-flags'
Niklas Söderlund says:
====================
net: renesas: Cleanup usage of gPTP flags
This series aim is to prepare for future work that will enable the use
of gPTP on R-Car RAVB on Gen4. Currently RAVB have a dedicated gPTP
implementation supported on Gen2 and Gen3 (ravb_ptp.c). For Gen4 a new
implementation that is already upstream (rcar_gen4_ptp.c) and used by
other Gen4 devices such as RTSN and RSWITCH is needed.
Unfortunately the design of the Gen2/Gen3 RAVB driver where driver
specific flags to control gPTP behavior have been mimicked in RTSN and
RSWITCH. This was OK as there was no overlap between the two gPTP
implementations. Now that RAVB needs to be able to use both having to
translate between driver specific flags and common net code flags
becomes even more cumbersome as there are two sets of driver specific
flags to pick from.
This series cleans this up for all Renesas drivers using gPTP by
removing all driver specific flags and using the common flags directly.
This simplifies drivers while at the same time prepare RAVB to be
extended with Gen4 support.
Patch 1/7 is a drive by patch where RSWITCH specific define was added in
the wrong header. Patch 2/7 removes a short-cut used in RTSN and RSWITCH
that prevents extending Gen4 support to RAVB without fuss. While patch
3/7 to 7/7 rework the Renesas drivers to use the common flags instead of
driver specific ones.
There is no intentional behavior change and only a small rework in logic
in the RAVB driver. Looking at patch 3/7, 4/7 and 7/7 one can clearly
see how the code have been copied from RAVB to the later implementations
in RTSN and RSWITCH.
====================
Link: https://patch.msgid.link/20251104222420.882731-1-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Niklas Söderlund [Tue, 4 Nov 2025 22:24:20 +0000 (23:24 +0100)]
net: ravb: Use common defines for time stamping control
Instead of translating to/from driver specific flags for packet time
stamp control use the common flags directly. This simplifies the driver
as the translating code can be removed while at the same time making it
clear the flags are not flags written to hardware registers.
The change from a device specific bit-field track variable to the common
enum datatypes forces us to touch the ravb_rx_rcar_hwstamp() in a non
trivial way. To make this cleaner and easier to understand expand the
nested conditions.
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Link: https://patch.msgid.link/20251104222420.882731-8-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Niklas Söderlund [Tue, 4 Nov 2025 22:24:19 +0000 (23:24 +0100)]
net: ravb: Break out Rx hardware timestamping
Prepare for moving away from device specific bit-fields to track how to
do hardware Rx timestamping to using net common enums by breaking out
the timestamping to a helper function. This is done to create cleaner
code and prepare for easier changes improving the hardware timestapming.
There is no functional change.
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Link: https://patch.msgid.link/20251104222420.882731-7-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Niklas Söderlund [Tue, 4 Nov 2025 22:24:18 +0000 (23:24 +0100)]
net: rcar_gen4_ptp: Remove unused defines
The driver specific flags to control packet time stamps have all been
replaced by values from enum hwtstamp_tx_types and enum
hwtstamp_rx_filters. Remove the driver specific flags as there are no
more users.
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20251104222420.882731-6-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Niklas Söderlund [Tue, 4 Nov 2025 22:24:17 +0000 (23:24 +0100)]
net: rtsn: Use common defines for time stamping control
Instead of translating to/from driver specific flags for packet time
stamp control use the common flags directly. This simplifies the driver
as the translating code can be removed while at the same time making it
clear the flags are not flags written to hardware registers.
One thing to note is that the bit-wise and check in rtsn_rx() of
RCAR_GEN4_RXTSTAMP_TYPE_V2_L2_EVENT is replaced with a not set check of
HWTSTAMP_FILTER_NONE. This is okay as the bit of device specific event
replaced was set for all modes except HWTSTAMP_FILTER_NONE.
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Link: https://patch.msgid.link/20251104222420.882731-5-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Niklas Söderlund [Tue, 4 Nov 2025 22:24:16 +0000 (23:24 +0100)]
net: rswitch: Use common defines for time stamping control
Instead of translating to/from driver specific flags for packet time
stamp control use the common flags directly. This simplifies the driver
as the translating code can be removed while at the same time making it
clear the flags are not flags written to hardware registers.
One thing to note is that the bit-wise and check in rswitch_rx() of
RCAR_GEN4_RXTSTAMP_TYPE_V2_L2_EVENT is replaced with a not set check of
HWTSTAMP_FILTER_NONE. This is okay as the bit of device specific event
replaced was set for all modes except HWTSTAMP_FILTER_NONE.
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Link: https://patch.msgid.link/20251104222420.882731-4-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Niklas Söderlund [Tue, 4 Nov 2025 22:24:15 +0000 (23:24 +0100)]
net: rcar_gen4_ptp: Move control fields to users
The struct rcar_gen4_ptp_private provides two fields for convenience of
its users, tstamp_tx_ctrl and tstamp_rx_ctrl. These fields are not used
by the rcar_gen4_ptp driver itself but only by the drivers using it.
Upcoming work will enable the RAVB driver currently only supporting gPTP
on pre-Gen4 SoCs to use the Gen4 implementation as well. To facilitate
this the convenience of having these fields in struct
rcar_gen4_ptp_private becomes a problem as the RAVB driver already have
it's own driver specific fields for the same thing.
Move the fields from struct rcar_gen4_ptp_private to each driver using
the Gen4 gPTP clocks own private data structures. There is no functional
change.
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20251104222420.882731-3-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Niklas Söderlund [Tue, 4 Nov 2025 22:24:14 +0000 (23:24 +0100)]
net: rswitch: Move definition of S4 gPTP offset
The files rcar_gen4_ptp.{c,h} implements an abstraction of the gPTP
support implemented together with different other IP blocks. The first
device added which supported this was RSWITCH on R-Car S4.
While doing so the RSWITCH R-Car S4 specific offset was added to the
generic Gen4 gPTP header file. Move it to the RSWITCH driver to make it
clear it only applies to this driver.
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Link: https://patch.msgid.link/20251104222420.882731-2-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann [Fri, 31 Oct 2025 21:20:59 +0000 (22:20 +0100)]
netkit: Document fast vs slowpath members via macros
Instead of a comment, just use two cachline groups to document the intent
for members often accessed in fast or slow path.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20251031212103.310683-11-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann [Fri, 31 Oct 2025 21:20:55 +0000 (22:20 +0100)]
xsk: Move NETDEV_XDP_ACT_ZC into generic header
Move NETDEV_XDP_ACT_ZC into xdp_sock_drv.h header such that external code
can reuse it, and rename it into more generic NETDEV_XDP_ACT_XSK.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20251031212103.310683-7-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
FUJITA Tomonori [Wed, 5 Nov 2025 13:31:26 +0000 (22:31 +0900)]
net: phy: qt2025: Wait until PHY becomes ready
Wait until a PHY becomes ready in the probe callback by
using read_poll_timeout function.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Gary Guo <gary@garyguo.net>
Signed-off-by: FUJITA Tomonori <fujita.tomonori@gmail.com>
Link: https://patch.msgid.link/20251105133126.3221948-1-fujita.tomonori@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Breno Leitao [Wed, 5 Nov 2025 18:01:12 +0000 (10:01 -0800)]
tg3: extract GRXRINGS from .get_rxnfc
Commit
84eaf4359c36 ("net: ethtool: add get_rx_ring_count callback to
optimize RX ring queries") added specific support for GRXRINGS callback,
simplifying .get_rxnfc.
Remove the handling of GRXRINGS in .get_rxnfc() by moving it to the new
.get_rx_ring_count().
Given that tg3_get_rxnfc() only handles ETHTOOL_GRXRINGS, then this
function becomes useless now, and it is removed.
This also fixes the behavior for devices without MSIX support.
Previously, the function would return -EOPNOTSUPP, but now it correctly
returns 1.
The functionality remains the same: return the current queue count
if the device is running, otherwise return the minimum of online
CPUs and TG3_RSS_MAX_NUM_QS.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20251105-grxrings_v1-v1-1-54c2caafa1fd@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Aleksandr Loktionov [Thu, 30 Oct 2025 13:59:50 +0000 (14:59 +0100)]
iavf: add RSS support for GTP protocol via ethtool
Extend the iavf driver to support Receive Side Scaling (RSS)
configuration for GTP (GPRS Tunneling Protocol) flows using ethtool.
The implementation introduces new RSS flow segment headers and hash field
definitions for various GTP encapsulations, including:
- GTPC
- GTPU (IP, Extension Header, Uplink, Downlink)
- TEID-based hashing
The ethtool interface is updated to parse and apply these new flow types
and hash fields, enabling fine-grained traffic distribution for GTP-based
mobile workloads.
This enhancement improves performance and scalability for virtualized
network functions (VNFs) and user plane functions (UPFs) in 5G and LTE
deployments.
Reviewed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Przemek Kitszel [Thu, 30 Oct 2025 13:59:49 +0000 (14:59 +0100)]
ice: Extend PTYPE bitmap coverage for GTP encapsulated flows
Consolidate updates to the Protocol Type (PTYPE) bitmap definitions
across multiple flow types in the Intel ICE driver to support GTP
(GPRS Tunneling Protocol) encapsulated traffic.
Enable improved Receive Side Scaling (RSS) configuration for both user
and control plane GTP flows.
Cover a wide range of protocol and encapsulation scenarios, including:
- MAC OFOS and IL
- IPv4 and IPv6 (OFOS, IL, ALL, no-L4)
- TCP, SCTP, ICMP
- GRE OF
- GTPC (control plane)
Expand the PTYPE bitmap entries to improve classification and
distribution of GTP traffic across multiple queues, enhancing
performance and scalability in mobile network environments.
Co-developed-by: Dan Nowlin <dan.nowlin@intel.com>
Signed-off-by: Dan Nowlin <dan.nowlin@intel.com>
Co-developed-by: Qi Zhang <qi.z.zhang@intel.com>
Signed-off-by: Qi Zhang <qi.z.zhang@intel.com>
Co-developed-by: Jie Wang <jie1x.wang@intel.com>
Signed-off-by: Jie Wang <jie1x.wang@intel.com>
Co-developed-by: Junfeng Guo <junfeng.guo@intel.com>
Signed-off-by: Junfeng Guo <junfeng.guo@intel.com>
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Aleksandr Loktionov [Thu, 30 Oct 2025 13:59:48 +0000 (14:59 +0100)]
ice: improve TCAM priority handling for RSS profiles
Enhance TCAM priority logic to avoid conflicts between RSS profiles
with overlapping PTGs and attributes.
Track used PTG and attribute combinations.
Ensure higher-priority profiles override lower ones.
Add helper for setting TCAM flags and masks.
Ensure RSS rule consistency and prevent unintended matches.
Co-developed-by: Dan Nowlin <dan.nowlin@intel.com>
Signed-off-by: Dan Nowlin <dan.nowlin@intel.com>
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Aleksandr Loktionov [Thu, 30 Oct 2025 13:59:47 +0000 (14:59 +0100)]
ice: implement GTP RSS context tracking and configuration
This commit implements the core RSS context management and configuration
logic for GTP (GTPU) protocol support in VF RSS operations.
Key implementation features:
- GTPU hash context management with pre/post processing functions
- Context index calculation and mapping for different GTPU scenarios
- Integration with main RSS configuration flow via wrapper functions
- Support for IPv4/IPv6 GTPU RSS configurations
- Rollback mechanism for handling RSS rule conflicts
- Hash context reset and cleanup functionality
The implementation provides comprehensive GTPU RSS support by:
1. Adding ice_add_rss_cfg_pre_gtpu() for preprocessing GTPU contexts
2. Adding ice_add_rss_cfg_post_gtpu() for postprocessing configurations
3. Adding ice_calc_gtpu_ctx_idx() for context index calculation
4. Integrating GTPU logic into ice_add_rss_cfg_wrap() and
ice_rem_rss_cfg_wrap()
5. Supporting context tracking in VF hash_ctx structures
This completes the GTP RSS infrastructure enabling VFs to configure
RSS hashing on GTP-encapsulated traffic.
Co-developed-by: Dan Nowlin <dan.nowlin@intel.com>
Signed-off-by: Dan Nowlin <dan.nowlin@intel.com>
Co-developed-by: Jie Wang <jie1x.wang@intel.com>
Signed-off-by: Jie Wang <jie1x.wang@intel.com>
Co-developed-by: Junfeng Guo <junfeng.guo@intel.com>
Signed-off-by: Junfeng Guo <junfeng.guo@intel.com>
Co-developed-by: Qi Zhang <qi.z.zhang@intel.com>
Signed-off-by: Qi Zhang <qi.z.zhang@intel.com>
Co-developed-by: Ting Xu <ting.xu@intel.com>
Signed-off-by: Ting Xu <ting.xu@intel.com>
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Aleksandr Loktionov [Thu, 30 Oct 2025 13:59:46 +0000 (14:59 +0100)]
ice: add virtchnl definitions and static data for GTP RSS
Add virtchnl protocol header and field definitions for advanced RSS
configuration including GTPC, GTPU, L2TPv2, ECPRI, PPP, GRE, and IP
fragment headers.
- Define new virtchnl protocol header types
- Add RSS field selectors for tunnel protocols
- Extend static mapping arrays for protocol field matching
- Add L2TPv2 session ID and length+session ID field support
This provides the foundational definitions needed for VF RSS
configuration of tunnel protocols.
Co-developed-by: Dan Nowlin <dan.nowlin@intel.com>
Signed-off-by: Dan Nowlin <dan.nowlin@intel.com>
Co-developed-by: Jie Wang <jie1x.wang@intel.com>
Signed-off-by: Jie Wang <jie1x.wang@intel.com>
Co-developed-by: Junfeng Guo <junfeng.guo@intel.com>
Signed-off-by: Junfeng Guo <junfeng.guo@intel.com>
Co-developed-by: Qi Zhang <qi.z.zhang@intel.com>
Signed-off-by: Qi Zhang <qi.z.zhang@intel.com>
Co-developed-by: Ting Xu <ting.xu@intel.com>
Signed-off-by: Ting Xu <ting.xu@intel.com>
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Aleksandr Loktionov [Thu, 30 Oct 2025 13:59:45 +0000 (14:59 +0100)]
ice: add flow parsing for GTP and new protocol field support
Introduce new protocol header types and field sizes to support GTPU, GTPC
tunneling protocols.
- Add field size macros for GTP TEID, QFI, and other headers
- Extend ice_flow_field_info and enum definitions
- Update hash macros for new protocols
- Add support for IPv6 prefix matching and fragment headers
This patch lays the groundwork for enhanced RSS and flow classification
capabilities.
Co-developed-by: Dan Nowlin <dan.nowlin@intel.com>
Signed-off-by: Dan Nowlin <dan.nowlin@intel.com>
Co-developed-by: Junfeng Guo <junfeng.guo@intel.com>
Signed-off-by: Junfeng Guo <junfeng.guo@intel.com>
Co-developed-by: Ting Xu <ting.xu@intel.com>
Signed-off-by: Ting Xu <ting.xu@intel.com>
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jakub Kicinski [Thu, 6 Nov 2025 22:16:20 +0000 (14:16 -0800)]
Merge branch 'net-dsa-lantiq_gswip-add-support-for-maxlinear-gsw1xx-switch-family'
Daniel Golle says:
====================
net: dsa: lantiq_gswip: Add support for MaxLinear GSW1xx switch family
This patch series extends the existing lantiq_gswip DSA driver to
support the MaxLinear GSW1xx family of dedicated Ethernet switch ICs.
These switches are based on the same IP as the Lantiq/Intel GSWIP found
in VR9 and xRX MIPS router SoCs which are currently supported by the
lantiq_gswip driver, but they are dedicated ICs connected via MDIO
rather than built-in components of a SoC accessible via memory-mapped
I/O.
The series includes several improvements and refactoring to implement
support for GSW1xx switch ICs by reusing the existing lantiq_gswip
driver.
The GSW1xx family includes several variants:
- GSW120: 4 ports, 2 PHYs, RGMII & SGMII/2500Base-X
- GSW125: 4 ports, 2 PHYs, RGMII & SGMII/2500Base-X, industrial temperature
- GSW140: 6 ports, 4 PHYs, RGMII & SGMII/2500Base-X
- GSW141: 6 ports, 4 PHYs, RGMII & SGMII
- GSW145: 6 ports, 4 PHYs, RGMII & SGMII/2500Base-X, industrial temperature
Key features implemented:
- MDIO-based register access using regmap
- Support for SGMII/1000Base-X/2500Base-X SerDes interfaces
- Configurable RGMII delays via device tree properties
- Configurable RMII clock direction
- Energy Efficient Ethernet (EEE) support
- enabling/disabling learning
====================
Link: https://patch.msgid.link/cover.1762170107.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Golle [Mon, 3 Nov 2025 12:20:28 +0000 (12:20 +0000)]
net: dsa: add driver for MaxLinear GSW1xx switch family
Add driver for the MaxLinear GSW1xx family of Ethernet switch ICs which
are based on the same IP as the Lantiq/Intel GSWIP found in the Lantiq VR9
and Intel GRX MIPS router SoCs. The main difference is that instead of
using memory-mapped I/O to communicate with the host CPU these ICs are
connected via MDIO (or SPI, which isn't supported by this driver).
Implement the regmap API to access the switch registers over MDIO to allow
reusing lantiq_gswip_common for all core functionality.
The GSW1xx also comes with a SerDes port capable of 1000Base-X, SGMII and
2500Base-X, which can either be used to connect an external PHY or SFP
cage, or as the CPU port. Support for the SerDes interface is implemented
in this driver using the phylink_pcs interface.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Tested-by: Alexander Sverdlin <alexander.sverdlin@siemens.com>
Link: https://patch.msgid.link/b567ec1b4beb08fd37abf18b280c56d5d8253c26.1762170107.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Golle [Mon, 3 Nov 2025 12:20:20 +0000 (12:20 +0000)]
net: dsa: add tagging driver for MaxLinear GSW1xx switch family
Add support for a new DSA tagging protocol driver for the MaxLinear
GSW1xx switch family. The GSW1xx switches use a proprietary 8-byte
special tag inserted between the source MAC address and the EtherType
field to indicate the source and destination ports for frames
traversing the CPU port.
Implement the tag handling logic to insert the special tag on transmit
and parse it on receive.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: Alexander Sverdlin <alexander.sverdlin@siemens.com>
Tested-by: Alexander Sverdlin <alexander.sverdlin@siemens.com>
Link: https://patch.msgid.link/0e973ebfd9433c30c96f50670da9e9449a0d98f2.1762170107.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Golle [Mon, 3 Nov 2025 12:20:12 +0000 (12:20 +0000)]
dt-bindings: net: dsa: lantiq,gswip: add support for MaxLinear GSW1xx switches
Extend the Lantiq GSWIP device tree binding to also cover MaxLinear
GSW1xx switches which are based on the same hardware IP but connected
via MDIO instead of being memory-mapped.
Add compatible strings for MaxLinear GSW120, GSW125, GSW140, GSW141,
and GSW145 switches and adjust the schema to handle the different
connection methods with conditional properties.
Add MaxLinear GSW125 example showing MDIO-connected configuration.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/fc96f1dedb2b418a63e69960356dde7f6eb86424.1762170107.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Golle [Mon, 3 Nov 2025 12:20:05 +0000 (12:20 +0000)]
net: dsa: lantiq_gswip: allow adjusting MII delays
Currently the MII clk vs. data delay is configured based on the PHY
interface mode.
In addition to that add support for setting up MII delays using the
standard Device Tree properties 'tx-internal-delay-ps' and
'rx-internal-delay-ps', using the values determined by the PHY interface
mode as default to maintain backward compatibility with legacy device
trees.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/37203e831cff87dc46e5ef9e8cbd68fb8689773d.1762170107.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Golle [Mon, 3 Nov 2025 12:19:58 +0000 (12:19 +0000)]
dt-bindings: net: dsa: lantiq,gswip: add support for MII delay properties
Add support for standard tx-internal-delay-ps and rx-internal-delay-ps
properties on port nodes to allow fine-tuning of RGMII clock delays.
The GSWIP switch hardware supports delay values in 500 picosecond
increments from 0 to 3500 picoseconds, with a post-reset default of 2000
picoseconds for both TX and RX delays. The driver currently sets the
delay to 0 in case the PHY is setup to carry out the delay by the
corresponding interface modes ("rgmii-id", "rgmii-rxid", "rgmii-txid").
This corresponds to the driver changes that allow adjusting MII delays
using Device Tree properties instead of relying solely on the PHY
interface mode.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/9e007d4f85c2c6d69e0b91f3663d99e0f6fc8eac.1762170107.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Golle [Mon, 3 Nov 2025 12:19:47 +0000 (12:19 +0000)]
net: dsa: lantiq_gswip: add vendor property to setup MII refclk output
Read boolean Device Tree property "maxlinear,rmii-refclk-out" and switch
the RMII reference clock to be a clock output rather than an input if it
is set.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: Alexander Sverdlin <alexander.sverdlin@siemens.com>
Tested-by: Alexander Sverdlin <alexander.sverdlin@siemens.com>
Link: https://patch.msgid.link/947d14970f74f760e4a60c777aabee64e7e4f356.1762170107.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Golle [Mon, 3 Nov 2025 12:19:34 +0000 (12:19 +0000)]
dt-bindings: net: dsa: lantiq,gswip: add MaxLinear RMII refclk output property
Add support for the maxlinear,rmii-refclk-out boolean property on port
nodes to configure the RMII reference clock to be an output rather than
an input.
This property is only applicable for ports in RMII mode and allows the
switch to provide the reference clock for RMII-connected PHYs instead
of requiring an external clock source.
This corresponds to the driver changes that read this Device Tree
property to configure the RMII clock direction.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: Alexander Sverdlin <alexander.sverdlin@siemens.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/9813bb916ecce9bae366e6c50c081014fe5371ea.1762170107.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Golle [Mon, 3 Nov 2025 12:19:25 +0000 (12:19 +0000)]
net: dsa: lantiq_gswip: define and use GSWIP_TABLE_MAC_BRIDGE_VAL1_VALID
When adding FDB entries to the MAC bridge table on GSWIP 2.2 or later it
is needed to set an (undocumented) bit to mark the entry as valid. If this
bit isn't set for entries in the MAC bridge table, then those entries won't
be considered as valid MAC addresses.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/e02fe0d946c98920bc55b5f389a8f56382aae7df.1762170107.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>