linux-2.6-microblaze.git
5 weeks ago Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Paolo Abeni [Wed, 11 Feb 2026 14:14:35 +0000 (15:14 +0100)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Merge in late fixes in preparation for the net-next PR.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks ago bnge/bng_re: Add a new HSI
Vikas Gupta [Sun, 8 Feb 2026 17:29:25 +0000 (22:59 +0530)]
bnge/bng_re: Add a new HSI

The HSI is shared between the firmware and the driver and is
automatically generated.
Add a new HSI for the BNGE driver. The current HSI refers to BNXT,
which will become incompatible with ThorUltra devices as the
BNGE driver adds more features. The BNGE driver will not use the HSI
located in the bnxt folder.
Also, add an HSI for the ThorUltra RoCE driver.

Changes in v3:
- Fix in bng_roce_hsi.h reported by Jakub (AI review)
  https://lore.kernel.org/netdev/20260207051422.4181717-1-kuba@kernel.org/
- Add an entry in MAINTAINERS

Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Reviewed-by: Bhargava Chenna Marreddy <bhargava.marreddy@broadcom.com>
Link: https://patch.msgid.link/20260208172925.1861255-1-vikas.gupta@broadcom.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks ago net: macb: Fix tx/rx malfunction after phy link down and up
Kevin Hao [Sun, 8 Feb 2026 08:45:52 +0000 (16:45 +0800)]
net: macb: Fix tx/rx malfunction after phy link down and up

In commit 99537d5c476c ("net: macb: Relocate mog_init_rings() callback
from macb_mac_link_up() to macb_open()"), the mog_init_rings() callback
was moved from macb_mac_link_up() to macb_open() to resolve a deadlock
issue. However, this change introduced a tx/rx malfunction following
phy link down and up events. The issue arises from a mismatch between
the software queue->tx_head, queue->tx_tail, queue->rx_prepared_head,
and queue->rx_tail values and the hardware's internal tx/rx queue
pointers.

According to the Zynq UltraScale TRM [1], when tx/rx is disabled, the
internal tx queue pointer resets to the value in the tx queue base
address register, while the internal rx queue pointer remains unchanged.
The following is quoted from the Zynq UltraScale TRM:
  When transmit is disabled, with bit [3] of the network control register
  set low, the transmit-buffer queue pointer resets to point to the address
  indicated by the transmit-buffer queue base address register. Disabling
  receive does not have the same effect on the receive-buffer queue
  pointer.

Additionally, there is no need to reset the RBQP and TBQP registers in a
phy event callback. Therefore, move macb_init_buffers() to macb_open().
In a phy link up event, the only required action is to reset the tx
software head and tail pointers to align with the hardware's behavior.

[1] https://docs.amd.com/v/u/en-US/ug1085-zynq-ultrascale-trm

Fixes: 99537d5c476c ("net: macb: Relocate mog_init_rings() callback from macb_mac_link_up() to macb_open()")
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260208-macb-init-ring-v1-1-939a32c14635@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
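
The link-up handling described in the macb fix above can be sketched as a small user-space model. This is an illustration under stated assumptions, not the driver code: struct demo_queue, its hw_* fields and the demo_* functions are hypothetical names. The model follows the quoted TRM text, where disabling transmit rewinds the hardware tx pointer to the ring base while the rx pointer keeps its value, so only the software tx head and tail need to be reset.

```c
/* Hypothetical model of one queue; names only loosely follow the commit text. */
struct demo_queue {
	unsigned int tx_head;   /* software: next tx descriptor to fill */
	unsigned int tx_tail;   /* software: next tx descriptor to reclaim */
	unsigned int rx_tail;   /* software: next rx descriptor to process */
	unsigned int hw_tx_idx; /* modelled hardware tx queue pointer */
	unsigned int hw_rx_idx; /* modelled hardware rx queue pointer */
};

/* Model of disabling tx/rx: per the quoted TRM text, only the tx pointer rewinds. */
void demo_link_down(struct demo_queue *q)
{
	q->hw_tx_idx = 0;
	/* hw_rx_idx is deliberately left unchanged */
}

/* On link up, realign the software tx indices with the rewound hardware pointer. */
void demo_link_up(struct demo_queue *q)
{
	q->tx_head = 0;
	q->tx_tail = 0;
	/* rx indices stay as they are, matching the unchanged hardware rx pointer */
}

/* Returns 1 when software and modelled hardware agree on the next tx and rx slot. */
int demo_in_sync(const struct demo_queue *q)
{
	return q->tx_head == q->hw_tx_idx && q->rx_tail == q->hw_rx_idx;
}
```

In this model a link down followed by a link up leaves the queue in sync again, while skipping demo_link_up() leaves the software tx indices pointing at stale positions.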
5 weeks ago af_unix: Fix memleak of newsk in unix_stream_connect().
Kuniyuki Iwashima [Sat, 7 Feb 2026 23:22:34 +0000 (23:22 +0000)]
af_unix: Fix memleak of newsk in unix_stream_connect().

When prepare_peercred() fails in unix_stream_connect(),
unix_release_sock() is not called for newsk, and the memory
is leaked.

Let's move prepare_peercred() before unix_create1().

Fixes: fd0a109a0f6b ("net, pidfs: prepare for handing out pidfds for reaped sk->sk_peer_pid")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260207232236.2557549-1-kuniyu@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks ago net: ti: icssg-prueth: Add optional dependency on HSR
Kevin Hao [Sat, 7 Feb 2026 06:21:46 +0000 (14:21 +0800)]
net: ti: icssg-prueth: Add optional dependency on HSR

Commit 95540ad6747c ("net: ti: icssg-prueth: Add support for HSR frame
forward offload") introduced support for offloading HSR frame forwarding,
which relies on functions such as is_hsr_master() provided by the HSR
module. Although HSR provides stubs for configurations with HSR
disabled, this driver still requires an optional dependency on HSR.
Otherwise, build failures will occur when icssg-prueth is built-in
while HSR is configured as a module.
  ld.lld: error: undefined symbol: is_hsr_master
  >>> referenced by icssg_prueth.c:710 (drivers/net/ethernet/ti/icssg/icssg_prueth.c:710)
  >>>               drivers/net/ethernet/ti/icssg/icssg_prueth.o:(icssg_prueth_hsr_del_mcast) in archive vmlinux.a
  >>> referenced by icssg_prueth.c:681 (drivers/net/ethernet/ti/icssg/icssg_prueth.c:681)
  >>>               drivers/net/ethernet/ti/icssg/icssg_prueth.o:(icssg_prueth_hsr_add_mcast) in archive vmlinux.a
  >>> referenced by icssg_prueth.c:1812 (drivers/net/ethernet/ti/icssg/icssg_prueth.c:1812)
  >>>               drivers/net/ethernet/ti/icssg/icssg_prueth.o:(prueth_netdevice_event) in archive vmlinux.a

  ld.lld: error: undefined symbol: hsr_get_port_ndev
  >>> referenced by icssg_prueth.c:712 (drivers/net/ethernet/ti/icssg/icssg_prueth.c:712)
  >>>               drivers/net/ethernet/ti/icssg/icssg_prueth.o:(icssg_prueth_hsr_del_mcast) in archive vmlinux.a
  >>> referenced by icssg_prueth.c:712 (drivers/net/ethernet/ti/icssg/icssg_prueth.c:712)
  >>>               drivers/net/ethernet/ti/icssg/icssg_prueth.o:(icssg_prueth_hsr_del_mcast) in archive vmlinux.a
  >>> referenced by icssg_prueth.c:683 (drivers/net/ethernet/ti/icssg/icssg_prueth.c:683)
  >>>               drivers/net/ethernet/ti/icssg/icssg_prueth.o:(icssg_prueth_hsr_add_mcast) in archive vmlinux.a
  >>> referenced 1 more times

Fixes: 95540ad6747c ("net: ti: icssg-prueth: Add support for HSR frame forward offload")
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260207-icssg-dep-v3-1-8c47c1937f81@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks ago Merge branch 'net-dsa-initial-support-for-maxlinear-mxl862xx-switches'
Paolo Abeni [Wed, 11 Feb 2026 10:28:00 +0000 (11:28 +0100)]
Merge branch 'net-dsa-initial-support-for-maxlinear-mxl862xx-switches'

Daniel Golle says:

====================
net: dsa: initial support for MaxLinear MxL862xx switches

This series adds very basic DSA support for the MaxLinear MxL86252
(5x 2500Base-T PHYs) and MxL86282 (8x 2500Base-T PHYs) switches.
In addition to the 2.5G TP ports both switches also come with two
SerDes interfaces which can be used either to connect external PHYs
or SFP cages, or as CPU port when using the switch with this DSA driver.

MxL862xx integrates a firmware running on an embedded processor (based on
Zephyr RTOS). Host interaction uses a simple netlink-like API transported
over MDIO/MMD.

This series includes only what's needed to pass traffic between user
ports and the CPU port: relayed MDIO to internal PHYs, basic port
enable/disable, and CPU-port special tagging.

The SerDes interface of the CPU port is automatically configured by the
switch after reset using a board-specific configuration stored together
with the firmware in the flash chip attached to the switch, so no action
is needed from the driver to set up the interface mode of the CPU port.

Also MAC settings of the PHY ports are automatically configured, which
means the driver works fine with phylink_mac_ops being all no-op stubs.

Multiple follow up series will bring support for setting up the other
SerDes PCS interface (ie. not used for the CPU port), bridge, VLAN, ...
offloading, and support for using an 802.1Q-based special tag instead of
the proprietary 8-byte tag.
====================

Link: https://patch.msgid.link/cover.1770433307.git.daniel@makrotopia.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks ago net: dsa: add basic initial driver for MxL862xx switches
Daniel Golle [Sat, 7 Feb 2026 03:07:27 +0000 (03:07 +0000)]
net: dsa: add basic initial driver for MxL862xx switches

Add very basic DSA driver for MaxLinear's MxL862xx switches.

In contrast to previous MaxLinear switches the MxL862xx has a built-in
processor that runs a sophisticated firmware based on Zephyr RTOS.
Interaction between the host and the switch hence is organized using a
software API of that firmware rather than accessing hardware registers
directly.

Add descriptions of the most basic firmware API calls to access the
built-in MDIO bus hosting the 2.5GE PHYs, basic port control as well as
setting up the CPU port.

Implement a very basic DSA driver using that API which is sufficient to
get packets flowing between the user ports and the CPU port.

The firmware offers all features one would expect from modern switch
hardware; they are going to be added one by one in follow-up patch
series.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/ccde07e8cf33d8ae243000013b57cfaa2695e0a9.1770433307.git.daniel@makrotopia.org
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks ago net: mdio: add unlocked mdiodev C45 bus accessors
Daniel Golle [Sat, 7 Feb 2026 03:07:18 +0000 (03:07 +0000)]
net: mdio: add unlocked mdiodev C45 bus accessors

Add helper inline functions __mdiodev_c45_read() and
__mdiodev_c45_write(), which are the C45 equivalents of the existing
__mdiodev_read() and __mdiodev_write() added by commit e6a45700e7e1
("net: mdio: add unlocked mdiobus and mdiodev bus accessors")

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/8d1d55949a75a871d2a3b90e421de4bd58d77685.1770433307.git.daniel@makrotopia.org
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks ago net: dsa: add tag format for MxL862xx switches
Daniel Golle [Sat, 7 Feb 2026 03:07:11 +0000 (03:07 +0000)]
net: dsa: add tag format for MxL862xx switches

Add proprietary special tag format for the MaxLinear MxL862xx family of
switches. While using the same Ethertype as MaxLinear's GSW1xx switches,
the actual tag format differs significantly, hence we need a dedicated
tag driver for that.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/c64e6ddb6c93a4fac39f9ab9b2d8bf551a2b118d.1770433307.git.daniel@makrotopia.org
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks ago dt-bindings: net: dsa: add MaxLinear MxL862xx
Daniel Golle [Sat, 7 Feb 2026 03:07:04 +0000 (03:07 +0000)]
dt-bindings: net: dsa: add MaxLinear MxL862xx

Add documentation and an example for MaxLinear MxL86282 and MxL86252
switches.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/22a6a3c8c15b932ff4b7d0cd8863939f06a0c2b4.1770433307.git.daniel@makrotopia.org
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks ago selftests: drivers: net: hw: Modify toeplitz.c to poll for packets
Dimitri Daskalakis [Sat, 7 Feb 2026 01:30:18 +0000 (17:30 -0800)]
selftests: drivers: net: hw: Modify toeplitz.c to poll for packets

Prior to this the receiver would sleep for the configured timeout,
then attempt to receive as many packets as possible. This would result
in a large burst of packets, and we don't necessarily need that many samples.

The tests now run faster.

Before

 ok 12 toeplitz.test.rps_udp_ipv6
 # Totals: pass:12 fail:0 xfail:0 xpass:0 skip:0 error:0

 real 0m54.792s
 user 0m12.486s
 sys 0m10.887s

After

 ok 12 toeplitz.test.rps_udp_ipv6
 # Totals: pass:12 fail:0 xfail:0 xpass:0 skip:0 error:0

 real 0m36.892s
 user 0m4.203s
 sys 0m8.314s

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Dimitri Daskalakis <dimitri.daskalakis1@gmail.com>
Link: https://patch.msgid.link/20260207013018.551347-1-dimitri.daskalakis1@gmail.com
[pabeni@redhat.com: whitespaces fixes]
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
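
The change from "sleep, then drain" to polling described in the toeplitz commit above can be illustrated roughly as follows. This is a minimal sketch and not the selftest code: demo_poll_samples() is a hypothetical helper, and it reads one-byte samples from an arbitrary file descriptor as a stand-in for receiving packets. The point is that the receiver returns as soon as it has enough samples instead of always waiting for the full timeout.

```c
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper: read up to `want` one-byte samples from `fd`,
 * waiting at most `timeout_ms` for each one. Returns the number of
 * samples actually read.
 */
int demo_poll_samples(int fd, int want, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int got = 0;

	while (got < want) {
		char sample;
		int ready = poll(&pfd, 1, timeout_ms);

		if (ready <= 0)                 /* timeout or error: stop waiting */
			break;
		if (read(fd, &sample, 1) != 1)  /* EOF or read error */
			break;
		got++;
	}
	return got;
}
```

With data already queued, the helper returns immediately once `want` samples have arrived; the timeout is only paid when the sender goes quiet.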
5 weeks ago octeontx2-pf: Unregister devlink on probe failure
Hariprasad Kelam [Fri, 6 Feb 2026 18:26:45 +0000 (23:56 +0530)]
octeontx2-pf: Unregister devlink on probe failure

When probe fails after devlink registration, the missing devlink unregister
call causes a memory leak.

Fixes: 2da489432747 ("octeontx2-pf: devlink params support to set mcam entry count")
Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Link: https://patch.msgid.link/20260206182645.4032737-1-hkelam@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks ago net: renesas: rswitch: fix forwarding offload statemachine
Michael Dege [Fri, 6 Feb 2026 13:41:53 +0000 (14:41 +0100)]
net: renesas: rswitch: fix forwarding offload statemachine

A change of the port state of one port caused the state of another
port to change. This behavior was unintended.

Fixes: b7502b1043de ("net: renesas: rswitch: add offloading for L2 switching")
Signed-off-by: Michael Dege <michael.dege@renesas.com>
Link: https://patch.msgid.link/20260206-fix-offloading-statemachine-v3-1-07bfba07d03e@renesas.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks ago ionic: Rate limit unknown xcvr type messages
Eric Joyner [Fri, 6 Feb 2026 22:46:51 +0000 (14:46 -0800)]
ionic: Rate limit unknown xcvr type messages

Running ethtool repeatedly with a transceiver unknown to the driver or
firmware will cause the driver to spam the kernel logs with "unknown
xcvr type" messages which can distract from real issues; and this isn't
interesting information outside of debugging. Fix this by rate limiting
the output so that there are still notifications but not so many that
they flood the log.

Using dev_dbg_once() would reduce the number of messages further, but
this would miss the case where a different unknown transceiver type is
plugged in, and its status is requested.

Fixes: 4d03e00a2140 ("ionic: Add initial ethtool support")
Signed-off-by: Eric Joyner <eric.joyner@amd.com>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Link: https://patch.msgid.link/20260206224651.1491-1-eric.joyner@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
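
The rate limiting idea from the ionic commit above can be sketched as a window-plus-burst limiter. This is a user-space illustration under stated assumptions: struct demo_ratelimit and demo_ratelimit_ok() are hypothetical names, time is passed in as caller-defined ticks, and the driver itself presumably relies on the kernel's own rate-limited print helpers rather than anything like this.

```c
#include <stdbool.h>

/* Hypothetical limiter: allow at most `burst` messages per `interval` ticks. */
struct demo_ratelimit {
	unsigned long interval; /* window length, in caller-defined ticks */
	unsigned int burst;     /* messages allowed per window */
	unsigned long begin;    /* start of the current window */
	unsigned int printed;   /* messages emitted in the current window */
	unsigned int missed;    /* messages suppressed in the current window */
};

/* Returns true if a message may be emitted at time `now`. */
bool demo_ratelimit_ok(struct demo_ratelimit *rl, unsigned long now)
{
	if (now - rl->begin >= rl->interval) {
		/* window expired: start a new one and forget the old counts */
		rl->begin = now;
		rl->printed = 0;
		rl->missed = 0;
	}
	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}
	rl->missed++;
	return false;
}
```

Unlike a print-once scheme, a later window lets a message through again, which is what keeps a newly plugged unknown transceiver visible in the log.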
5 weeks ago Merge branch 'ipv6-tcp-no-longer-rebuild-fl6-at-each-transmit'
Jakub Kicinski [Wed, 11 Feb 2026 04:57:52 +0000 (20:57 -0800)]
Merge branch 'ipv6-tcp-no-longer-rebuild-fl6-at-each-transmit'

Eric Dumazet says:

====================
ipv6: tcp: no longer rebuild fl6 at each transmit

TCP v6 spends a good amount of time rebuilding a fresh fl6 at each
transmit in inet6_csk_xmit()/inet6_csk_route_socket().

TCP v4 caches the information in inet->cork.fl.u.ip4 instead.

This series changes TCP v6 to behave the same, saving cpu cycles
and reducing cache line misses and stack use.
====================

Link: https://patch.msgid.link/20260206173426.1638518-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago tcp: inet6_csk_xmit() optimization
Eric Dumazet [Fri, 6 Feb 2026 17:34:26 +0000 (17:34 +0000)]
tcp: inet6_csk_xmit() optimization

After prior patches, inet6_csk_xmit() can reuse inet->cork.fl.u.ip6
if __sk_dst_check() returns a valid dst.

Otherwise call inet6_csk_route_socket() to refresh inet->cork.fl.u.ip6
content and get a new dst.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260206173426.1638518-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
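
The fast path described in the inet6_csk_xmit() commit above follows a "reuse the cache while the route is valid, rebuild otherwise" pattern. A minimal sketch, assuming hypothetical stand-in types (struct demo_dst, struct demo_flow, struct demo_sock) and functions that only mirror the roles of __sk_dst_check() and inet6_csk_route_socket() rather than their real signatures:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a route entry and a socket with a cached flow key. */
struct demo_dst { bool obsolete; };
struct demo_flow { unsigned int daddr; unsigned int saddr; };
struct demo_sock {
	struct demo_dst *dst_cache;  /* cached route, may be NULL or stale */
	struct demo_flow cached_fl;  /* flow key kept across transmits */
	unsigned int daddr, saddr;
	unsigned int rebuilds;       /* counts slow-path executions */
	struct demo_dst route;       /* storage for the "looked up" route */
};

/* Slow path: refresh the cached flow key and obtain a new route. */
struct demo_dst *demo_route_socket(struct demo_sock *sk)
{
	sk->cached_fl.daddr = sk->daddr;
	sk->cached_fl.saddr = sk->saddr;
	sk->route.obsolete = false;
	sk->dst_cache = &sk->route;
	sk->rebuilds++;
	return sk->dst_cache;
}

/* Fast path: reuse the cached flow key while the cached route is still valid. */
struct demo_dst *demo_xmit_route(struct demo_sock *sk)
{
	struct demo_dst *dst = sk->dst_cache;

	if (dst != NULL && !dst->obsolete)
		return dst;             /* cached_fl is reused untouched */
	return demo_route_socket(sk);   /* no cached route, or it went stale */
}
```

Repeated transmits on a stable route touch only the cached key; the rebuild counter increases only when the cached route is missing or marked obsolete.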
5 weeks ago tcp: populate inet->cork.fl.u.ip6 in tcp_v6_syn_recv_sock()
Eric Dumazet [Fri, 6 Feb 2026 17:34:25 +0000 (17:34 +0000)]
tcp: populate inet->cork.fl.u.ip6 in tcp_v6_syn_recv_sock()

As explained in commit 85d05e281712 ("ipv6: change inet6_sk_rebuild_header()
to use inet->cork.fl.u.ip6"):

TCP v6 spends a good amount of time rebuilding a fresh fl6 at each
transmit in inet6_csk_xmit()/inet6_csk_route_socket().

TCP v4 caches the information in inet->cork.fl.u.ip4 instead.

After this patch, passive TCP ipv6 flows have correctly initialized
inet->cork.fl.u.ip6 structure.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260206173426.1638518-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago tcp: populate inet->cork.fl.u.ip6 in tcp_v6_connect()
Eric Dumazet [Fri, 6 Feb 2026 17:34:24 +0000 (17:34 +0000)]
tcp: populate inet->cork.fl.u.ip6 in tcp_v6_connect()

Instead of using private @fl6 and @final variables
use respectively inet->cork.fl.u.ip6 and np->final.

As explained in commit 85d05e281712 ("ipv6: change inet6_sk_rebuild_header()
to use inet->cork.fl.u.ip6"):

TCP v6 spends a good amount of time rebuilding a fresh fl6 at each
transmit in inet6_csk_xmit()/inet6_csk_route_socket().

TCP v4 caches the information in inet->cork.fl.u.ip4 instead.

After this patch, active TCP ipv6 flows have correctly initialized
inet->cork.fl.u.ip6 structure.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260206173426.1638518-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago ipv6: inet6_csk_xmit() and inet6_csk_update_pmtu() use inet->cork.fl.u.ip6
Eric Dumazet [Fri, 6 Feb 2026 17:34:23 +0000 (17:34 +0000)]
ipv6: inet6_csk_xmit() and inet6_csk_update_pmtu() use inet->cork.fl.u.ip6

Convert inet6_csk_route_socket() to use np->final instead of an
automatic variable to get rid of a stack canary.

Convert inet6_csk_xmit() and inet6_csk_update_pmtu() to use
inet->cork.fl.u.ip6 instead of @fl6 automatic variable.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260206173426.1638518-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago ipv6: use inet->cork.fl.u.ip6 and np->final in ip6_datagram_dst_update()
Eric Dumazet [Fri, 6 Feb 2026 17:34:22 +0000 (17:34 +0000)]
ipv6: use inet->cork.fl.u.ip6 and np->final in ip6_datagram_dst_update()

Get rid of @fl6 and @final variables in ip6_datagram_dst_update().

Use instead inet->cork.fl.u.ip6 and np->final so that a stack canary
is no longer needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260206173426.1638518-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago ipv6: use np->final in inet6_sk_rebuild_header()
Eric Dumazet [Fri, 6 Feb 2026 17:34:21 +0000 (17:34 +0000)]
ipv6: use np->final in inet6_sk_rebuild_header()

Instead of using an automatic variable, use np->final
to get rid of the stack canary in inet6_sk_rebuild_header().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260206173426.1638518-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago ipv6: add daddr/final storage in struct ipv6_pinfo
Eric Dumazet [Fri, 6 Feb 2026 17:34:20 +0000 (17:34 +0000)]
ipv6: add daddr/final storage in struct ipv6_pinfo

After commit b409a7f7176b ("ipv6: colocate inet6_cork in
inet_cork_full") we have room in ipv6_pinfo to hold daddr/final
in case they need to be populated in fl6_update_dst() calls.

This will allow stack canary removal in IPv6 tx fast paths.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260206173426.1638518-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago Merge tag 'nf-next-26-02-06' of https://git.kernel.org/pub/scm/linux/kernel/git/netfi...
Jakub Kicinski [Wed, 11 Feb 2026 04:25:38 +0000 (20:25 -0800)]
Merge tag 'nf-next-26-02-06' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Florian Westphal says:

====================
netfilter: updates for net-next

The following patchset contains Netfilter updates for *net-next*:

1) Fix net-next-only use-after-free bug in nf_tables rbtree set:
   Expired elements cannot be released right away after unlink anymore
   because there is no guarantee that the binary-search blob is going to
   be updated.  Spotted by syzkaller.

2) Fix esoteric bug in nf_queue with udp fraglist gro, broken since
   6.11. Patch 3 extends the nfqueue selftest for this.

4) Use dedicated slab for flowtable entries, currently the -512 cache
   is used, which is wasteful.  From Qingfang Deng.

5) Recent net-next update extended existing test for ip6ip6 tunnels, add
   the required /config entry.  Test still passed by accident because the
   previous test's network setup gets re-used, so also update the test so
   it will fail in case the ip6ip6 tunnel interface cannot be added.

6) Fix 'nft get element mytable myset { 1.2.3.4 }' on big endian
   platforms, this was broken since code was added in v5.1.

7) Fix nf_tables counter reset support on 32bit platforms, where counter
   reset may cause huge values to appear due to wraparound.
   Broken since reset feature was added in v6.11.  From Anders Grahn.

8-11) update nf_tables rbtree set type to detect partial
   overlaps.  This will eventually speed up nftables userspace: at this
   time userspace does a netlink dump of the set content which slows down
   incremental updates on interval sets.  From Pablo Neira Ayuso.

* tag 'nf-next-26-02-06' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: nft_set_rbtree: validate open interval overlap
  netfilter: nft_set_rbtree: validate element belonging to interval
  netfilter: nft_set_rbtree: check for partial overlaps in anonymous sets
  netfilter: nft_set_rbtree: fix bogus EEXIST with NLM_F_CREATE with null interval
  netfilter: nft_counter: fix reset of counters on 32bit archs
  netfilter: nft_set_hash: fix get operation on big endian
  selftests: netfilter: add IPV6_TUNNEL to config
  netfilter: flowtable: dedicated slab for flow entry
  selftests: netfilter: nft_queue.sh: add udp fraglist gro test case
  netfilter: nfnetlink_queue: do shared-unconfirmed check before segmentation
  netfilter: nft_set_rbtree: don't gc elements on insert
====================

Link: https://patch.msgid.link/20260206153048.17570-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago net: stmmac: qcom-ethqos: fix qcom_ethqos_serdes_powerup()
Russell King (Oracle) [Fri, 6 Feb 2026 17:19:21 +0000 (17:19 +0000)]
net: stmmac: qcom-ethqos: fix qcom_ethqos_serdes_powerup()

Add cleanup for failure paths in qcom_ethqos_serdes_powerup(). This
was missing calling phy_exit() and phy_power_off() at appropriate
failure points.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Reviewed-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Link: https://patch.msgid.link/E1voPUH-000000083ji-25FH@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
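
The kind of failure-path cleanup added by the qcom-ethqos commit above is commonly written with unwinding labels. The sketch below is a user-space model and not the driver function: struct demo_phy and all demo_* functions are hypothetical stand-ins for the PHY calls, and the third "configure" step is an assumption made only so that there is something to fail after power-on.

```c
/* Hypothetical model of a PHY; fail_at selects which step reports an error. */
struct demo_phy {
	int fail_at;     /* 0 = none, 1 = init, 2 = power on, 3 = configure */
	int initialised;
	int powered;
};

int demo_phy_init(struct demo_phy *p)
{
	if (p->fail_at == 1)
		return -1;
	p->initialised = 1;
	return 0;
}

int demo_phy_power_on(struct demo_phy *p)
{
	if (p->fail_at == 2)
		return -1;
	p->powered = 1;
	return 0;
}

int demo_phy_configure(struct demo_phy *p)
{
	return p->fail_at == 3 ? -1 : 0;
}

void demo_phy_power_off(struct demo_phy *p) { p->powered = 0; }
void demo_phy_exit(struct demo_phy *p) { p->initialised = 0; }

/* Bring the PHY up; on failure, undo exactly the steps that already succeeded. */
int demo_serdes_powerup(struct demo_phy *p)
{
	int ret;

	ret = demo_phy_init(p);
	if (ret)
		return ret;             /* nothing to undo yet */

	ret = demo_phy_power_on(p);
	if (ret)
		goto err_exit;          /* undo init only */

	ret = demo_phy_configure(p);
	if (ret)
		goto err_power_off;     /* undo power-on, then init */

	return 0;

err_power_off:
	demo_phy_power_off(p);
err_exit:
	demo_phy_exit(p);
	return ret;
}
```

The labels are ordered so that a later failure falls through every earlier cleanup step, which keeps each error path from leaving the PHY half initialised.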
5 weeks ago xfrm: reduce struct sec_path size
Paolo Abeni [Fri, 6 Feb 2026 17:14:36 +0000 (18:14 +0100)]
xfrm: reduce struct sec_path size

The mentioned struct has a hole and uses unnecessarily wide types to
store the MAC length and the indexes of very small arrays.

It's also embedded into the skb_extensions, and the latter, due
to recent CAN changes, may exceed the 192 bytes mark (3 cachelines
on x86_64 arch) on some reasonable configurations.

By reordering the sec_path fields, shrinking xfrm_offload.orig_mac_len
to 16 bits and sec_path.{len,olen,verified_cnt} to u8, we can save
16 bytes and keep skb_extensions size under control.

Before:

struct sec_path {
        int                        len;
        int                        olen;
        int                        verified_cnt;

        /* XXX 4 bytes hole, try to pack */
        struct xfrm_state *        xvec[6];
        struct xfrm_offload        ovec[1];

        /* size: 88, cachelines: 2, members: 5 */
        /* sum members: 84, holes: 1, sum holes: 4 */
        /* last cacheline: 24 bytes */
};

After:

struct sec_path {
        struct xfrm_state *        xvec[6];
        struct xfrm_offload        ovec[1];
        /* typedef u8 -> __u8 */ unsigned char              len;
        /* typedef u8 -> __u8 */ unsigned char              olen;
        /* typedef u8 -> __u8 */ unsigned char              verified_cnt;

        /* size: 72, cachelines: 2, members: 5 */
        /* padding: 1 */
        /* last cacheline: 8 bytes */
};

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Steffen Klassert <steffen.klassert@secunet.com>
Link: https://patch.msgid.link/83846bd2e3fa08899bd0162e41bfadfec95e82ef.1770398071.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
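
The effect of the sec_path repacking above can be reproduced with a reduced model. This is a sketch under stated assumptions: struct demo_path_before and struct demo_path_after are hypothetical, they keep only the three counters and the pointer array, and they leave out the embedded xfrm_offload member entirely, so the absolute sizes differ from the pahole output quoted in the commit.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified "before" layout: three int counters ahead of a pointer array. */
struct demo_path_before {
	int len;
	int olen;
	int verified_cnt;
	/* on LP64 targets a 4-byte hole follows: xvec needs 8-byte alignment */
	void *xvec[6];
};

/* Simplified "after" layout: pointers first, then one-byte counters. */
struct demo_path_after {
	void *xvec[6];
	uint8_t len;
	uint8_t olen;
	uint8_t verified_cnt;
	/* only tail padding remains */
};

/* Bytes saved by the reordering and narrowing on the current target. */
size_t demo_bytes_saved(void)
{
	return sizeof(struct demo_path_before) - sizeof(struct demo_path_after);
}
```

On a typical LP64 target such as x86_64 this model shrinks from 64 to 56 bytes. The 16 bytes quoted in the commit also include the narrower xfrm_offload.orig_mac_len, which this model omits.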
5 weeks ago Merge branch 'bnxt_en-add-rss-context-resource-check'
Jakub Kicinski [Wed, 11 Feb 2026 04:17:58 +0000 (20:17 -0800)]
Merge branch 'bnxt_en-add-rss-context-resource-check'

Michael Chan says:

====================
bnxt_en: Add RSS context resource check

Add missing logic to check that we have enough RSS contexts.  This
will make the recent change to increase the use of RSS contexts for
a larger RSS indirection table more complete.
====================

Link: https://patch.msgid.link/20260207235118.1987301-1-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago bnxt_en: Check RSS contexts in bnxt_need_reserve_rings()
Michael Chan [Sat, 7 Feb 2026 23:51:18 +0000 (15:51 -0800)]
bnxt_en: Check RSS contexts in bnxt_need_reserve_rings()

bnxt_need_reserve_rings() checks all resources except HW RSS contexts
to determine if a new reservation is required.  For completeness, add
the check for HW RSS contexts.  This makes the code more complete after
the recent commit to increase the number of RSS contexts for a larger
RSS indirection table:

Fixes: 51b9d3f948b8 ("bnxt_en: Use a larger RSS indirection table on P5_PLUS chips")
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20260207235118.1987301-3-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago bnxt_en: Refactor bnxt_need_reserve_rings()
Michael Chan [Sat, 7 Feb 2026 23:51:17 +0000 (15:51 -0800)]
bnxt_en: Refactor bnxt_need_reserve_rings()

bnxt_need_reserve_rings() checks 6 ring resources against the reserved
values to determine if a new reservation is needed.  Factor out the code
to collect the total resources into a new helper function
bnxt_get_total_resources() to make the code cleaner and easier to read.
Instead of individual scalar variables, use the struct bnxt_hw_rings to
hold all the ring resources.  Using the struct, hwr.cp replaces the nq
variable and the chip specific hwr.cp_p5 replaces cp on newer chips.

There is no change in behavior.  This will make it easier to check the
RSS context resource in the next patch.

Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20260207235118.1987301-2-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
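
The refactoring described above, where individual scalars are gathered into one struct and compared against the reserved values, can be sketched as follows. Everything here is hypothetical: struct demo_hw_rings is not the driver's struct bnxt_hw_rings, the member list is an assumption, and the real check may apply chip-specific conditions that a plain field-by-field comparison does not capture.

```c
#include <stdbool.h>

/* Hypothetical resource bundle; member names are illustrative only. */
struct demo_hw_rings {
	int tx;
	int rx;
	int cp;
	int stat;
	int vnic;
	int grp;
	int rss_ctx; /* the extra resource a follow-up check would cover */
};

/* Returns true when any wanted total differs from what is currently reserved. */
bool demo_need_reserve(const struct demo_hw_rings *want,
		       const struct demo_hw_rings *resv)
{
	return want->tx != resv->tx ||
	       want->rx != resv->rx ||
	       want->cp != resv->cp ||
	       want->stat != resv->stat ||
	       want->vnic != resv->vnic ||
	       want->grp != resv->grp ||
	       want->rss_ctx != resv->rss_ctx;
}
```

Keeping all totals in one struct means that covering a new resource, such as the RSS contexts here, only takes one more member and one more comparison.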
5 weeks ago mptcp: allow overridden write_space to be invoked
Geliang Tang [Fri, 6 Feb 2026 13:09:24 +0000 (14:09 +0100)]
mptcp: allow overridden write_space to be invoked

Future extensions with psock will override their own sk->sk_write_space
callback. This patch ensures that the overridden sk_write_space can be
invoked by MPTCP.

INDIRECT_CALL is used to keep the default path optimised.

Note that sk->sk_write_space was never called directly with MPTCP
sockets, so changing it to sk_stream_write_space in the init, and using
it from mptcp_write_space() is not supposed to change the current
behaviour.

This patch is shared early to ease discussions around future RFC and
avoid confusions with this "fix" that is needed for different future
extensions.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Gang Yan <yangang@kylinos.cn>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260206-net-next-mptcp-write_space-override-v2-1-e0b12be818c6@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
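
The "keep the default path optimised" remark in the mptcp commit above refers to comparing the callback pointer against the expected default so that the common case becomes a direct call. A simplified user-space analogue, assuming a GCC or Clang compiler for __builtin_expect(); DEMO_INDIRECT_CALL_1, struct demo_msk and the demo_* callbacks are hypothetical names and not the kernel macro or the MPTCP code:

```c
struct demo_msk;

/* If the pointer is the expected default, call it directly; else go indirect. */
#define DEMO_INDIRECT_CALL_1(f, f1, arg) \
	(__builtin_expect((f) == (f1), 1) ? f1(arg) : (f)(arg))

/* Hypothetical socket with an overridable write_space callback. */
struct demo_msk {
	void (*write_space)(struct demo_msk *sk);
	int default_calls;
	int override_calls;
};

/* Default callback, installed at init time. */
void demo_default_write_space(struct demo_msk *sk)
{
	sk->default_calls++;
}

/* Example of a callback that a later extension might install instead. */
void demo_override_write_space(struct demo_msk *sk)
{
	sk->override_calls++;
}

/* Invoke whatever callback is installed, favouring the default path. */
void demo_write_space(struct demo_msk *sk)
{
	DEMO_INDIRECT_CALL_1(sk->write_space, demo_default_write_space, sk);
}
```

Sockets that keep the default callback take the direct-call branch, while a socket with an overridden callback still has its own function invoked.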
5 weeks ago octeontx2-af: CGX: fix bitmap leaks
Bo Sun [Fri, 6 Feb 2026 13:09:24 +0000 (21:09 +0800)]
octeontx2-af: CGX: fix bitmap leaks

The RX/TX flow-control bitmaps (rx_fc_pfvf_bmap and tx_fc_pfvf_bmap)
are allocated by cgx_lmac_init() but never freed in cgx_lmac_exit().
Unbinding and rebinding the driver therefore triggers kmemleak:

    unreferenced object (size 16):
        backtrace:
          rvu_alloc_bitmap
          cgx_probe

Free both bitmaps during teardown.

Fixes: e740003874ed ("octeontx2-af: Flow control resource management")
Cc: stable@vger.kernel.org
Signed-off-by: Bo Sun <bo@mboxify.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Jijie Shao <shaojijie@huawei.com>
Link: https://patch.msgid.link/20260206130925.1087588-2-bo@mboxify.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago Merge branch 'net-netconsole-convert-to-nbcon-console-infrastructure'
Jakub Kicinski [Wed, 11 Feb 2026 03:51:59 +0000 (19:51 -0800)]
Merge branch 'net-netconsole-convert-to-nbcon-console-infrastructure'

Breno Leitao says:

====================
net: netconsole: convert to NBCON console infrastructure

This series adds support for the nbcon (new buffer console) infrastructure
to netconsole, enabling lock-free, priority-based console operations that
are safer in crash scenarios.

The implementation is introduced in three steps:

0) Extend printk to expose CPU and taskname (task->comm) where the
   printk originated from. (Thanks John and Petr for the support in
   getting this done)
1) Refactor the message fragmentation logic into a reusable helper function
2) Extend nbcon support to non-extended (basic) consoles using the same
   infrastructure.

The initial discussion about it appeared a while ago in [1], in order to
solve Mike's HARDIRQ-safe -> HARDIRQ-unsafe lock order warning, and the root
cause is that some hosts were calling IRQ unsafe locks from inside console
lock.

At that time, we didn't have the CON_NBCON_ATOMIC_UNSAFE yet. John
kindly implemented CON_NBCON_ATOMIC_UNSAFE in 187de7c212e5 ("printk:
nbcon: Allow unsafe write_atomic() for panic"), and now we can
implement netconsole on top of nbcon.

It is important to note that netconsole continues to call netpoll and the
network TX helpers with interrupts disabled, given that TX is called with
target_list_lock held.
====================

Link: https://patch.msgid.link/20260206-nbcon-v7-0-62bda69b1b41@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago netconsole: Use printk context for CPU and task information
Breno Leitao [Fri, 6 Feb 2026 12:45:32 +0000 (04:45 -0800)]
netconsole: Use printk context for CPU and task information

Use the CPU and task name captured at printk() time from
nbcon_write_context instead of querying the current execution context.
This provides accurate information about where the message originated,
rather than where netconsole happens to be running.

For CPU, use wctxt->cpu instead of raw_smp_processor_id().

For taskname, use wctxt->comm directly which contains the task
name captured at printk time.

This change ensures netconsole outputs reflect the actual context that
generated the log message, which is especially important when the
console driver runs asynchronously in a dedicated thread.

Reviewed-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260206-nbcon-v7-4-62bda69b1b41@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonetconsole: convert to NBCON console infrastructure
Breno Leitao [Fri, 6 Feb 2026 12:45:31 +0000 (04:45 -0800)]
netconsole: convert to NBCON console infrastructure

Convert netconsole from the legacy console API to the NBCON framework.
NBCON provides threaded printing which unblocks printk()s and flushes in
a thread, decoupling network TX from printk() when netconsole is
in use.

Since netconsole relies on the network stack which cannot safely operate
from all atomic contexts, mark both consoles with
CON_NBCON_ATOMIC_UNSAFE. (See discussion in [1])

CON_NBCON_ATOMIC_UNSAFE restricts write_atomic() usage to emergency
scenarios (panic), while regular messages are sent in threaded mode.

Implementation changes:
- Unify write_ext_msg() and write_msg() into netconsole_write()
- Add device_lock/device_unlock callbacks to manage target_list_lock
- Use nbcon_enter_unsafe()/nbcon_exit_unsafe() around network
  operations.
  - If nbcon_enter_unsafe() fails, just return, since netconsole has lost
    ownership of the console.
- Set write_thread and write_atomic callbacks (both use same function)

Link: https://lore.kernel.org/all/b2qps3uywhmjaym4mht2wpxul4yqtuuayeoq4iv4k3zf5wdgh3@tocu6c7mj4lt/
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260206-nbcon-v7-3-62bda69b1b41@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonetconsole: extract message fragmentation into send_msg_udp()
Breno Leitao [Fri, 6 Feb 2026 12:45:30 +0000 (04:45 -0800)]
netconsole: extract message fragmentation into send_msg_udp()

Extract the message fragmentation logic from write_msg() into a
dedicated send_msg_udp() function. This improves code readability
and prepares for future enhancements.

The new send_msg_udp() function handles splitting messages that
exceed MAX_PRINT_CHUNK into smaller fragments and sending them
sequentially. This function is placed before send_ext_msg_udp()
to maintain a logical ordering of related functions.
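
The splitting described above can be sketched in userspace C. This is a
minimal illustration, not the driver code: the value of MAX_PRINT_CHUNK,
the xmit_fn callback and the helper names are assumptions made for the
example.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed chunk limit; the commit names MAX_PRINT_CHUNK but not its value. */
#define MAX_PRINT_CHUNK 1000

/* Hypothetical transmit hook standing in for the real UDP send path. */
typedef void (*xmit_fn)(const char *buf, size_t len, void *ctx);

/*
 * Split msg into fragments of at most MAX_PRINT_CHUNK bytes and hand
 * them to xmit() in order. Returns the number of fragments sent.
 */
static size_t send_msg_chunked(const char *msg, size_t len,
                               xmit_fn xmit, void *ctx)
{
    size_t sent = 0;

    assert(msg != NULL && xmit != NULL);

    while (len > 0) {
        size_t frag = len < MAX_PRINT_CHUNK ? len : MAX_PRINT_CHUNK;

        xmit(msg, frag, ctx);
        msg += frag;
        len -= frag;
        sent++;
    }
    return sent;
}

/* Example sink: adds up the number of bytes it was given. */
static void count_bytes(const char *buf, size_t len, void *ctx)
{
    (void)buf;
    *(size_t *)ctx += len;
}
```

With these assumptions a 2500-byte message goes out as three fragments
(1000, 1000 and 500 bytes), and an empty message sends nothing.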

No functional changes - this is purely a refactoring commit.

Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260206-nbcon-v7-2-62bda69b1b41@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agoprintk: Add execution context (task name/CPU) to printk_info
Breno Leitao [Fri, 6 Feb 2026 12:45:29 +0000 (04:45 -0800)]
printk: Add execution context (task name/CPU) to printk_info

Extend struct printk_info to include the task name, pid, and CPU
number where printk messages originate. This information is captured
at vprintk_store() time and propagated through printk_message to
nbcon_write_context, making it available to nbcon console drivers.

This is useful for consoles like netconsole that want to include
execution context in their output, allowing correlation of messages
with specific tasks and CPUs regardless of where the console driver
actually runs.

The feature is controlled by CONFIG_PRINTK_EXECUTION_CTX, which is
automatically selected by CONFIG_NETCONSOLE_DYNAMIC. When disabled,
the helper functions compile to no-ops with no overhead.

Suggested-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Link: https://patch.msgid.link/20260206-nbcon-v7-1-62bda69b1b41@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agoMerge branch 'disable-interrupts-and-ensure-dbell-updation'
Paolo Abeni [Tue, 10 Feb 2026 14:58:04 +0000 (15:58 +0100)]
Merge branch 'disable-interrupts-and-ensure-dbell-updation'

Vimlesh Kumar says:

====================
disable interrupts and ensure dbell updation

Disable per ring interrupts when netdev goes down and ensure the dbell
BADDR update for both PFs and VFs by adding a wait and a check for the
updated value.

Resending based on discussion with reviewer.
====================

Link: https://patch.msgid.link/20260206111510.1045092-1-vimleshk@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agoocteon_ep_vf: ensure dbell BADDR updation
Vimlesh Kumar [Fri, 6 Feb 2026 11:15:08 +0000 (11:15 +0000)]
octeon_ep_vf: ensure dbell BADDR updation

Make sure the OUT DBELL base address reflects the
latest values written to it.

Fix:
Add a wait until the OUT DBELL base address register
is updated with the DMA ring descriptor address,
and modify the setup_oq function to properly
handle failures.

Fixes: 2c0c32c72be29 ("octeon_ep_vf: add hardware configuration APIs")
Signed-off-by: Sathesh Edara <sedara@marvell.com>
Signed-off-by: Shinas Rasheed <srasheed@marvell.com>
Signed-off-by: Vimlesh Kumar <vimleshk@marvell.com>
Link: https://patch.msgid.link/20260206111510.1045092-4-vimleshk@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agoocteon_ep: ensure dbell BADDR updation
Vimlesh Kumar [Fri, 6 Feb 2026 11:15:07 +0000 (11:15 +0000)]
octeon_ep: ensure dbell BADDR updation

Make sure the OUT DBELL base address reflects the
latest values written to it.

Fix:
Add a wait until the OUT DBELL base address register
is updated with the DMA ring descriptor address,
and modify the setup_oq function to properly
handle failures.

Fixes: 0807dc76f3bf5 ("octeon_ep: support Octeon CN10K devices")
Signed-off-by: Sathesh Edara <sedara@marvell.com>
Signed-off-by: Shinas Rasheed <srasheed@marvell.com>
Signed-off-by: Vimlesh Kumar <vimleshk@marvell.com>
Link: https://patch.msgid.link/20260206111510.1045092-3-vimleshk@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agoocteon_ep: disable per ring interrupts
Vimlesh Kumar [Fri, 6 Feb 2026 11:15:06 +0000 (11:15 +0000)]
octeon_ep: disable per ring interrupts

Disable the MSI-X per ring interrupt for every PF ring when PF
netdev goes down.

Fixes: 1f2c2d0cee023 ("octeon_ep: add hardware configuration APIs")
Signed-off-by: Sathesh Edara <sedara@marvell.com>
Signed-off-by: Shinas Rasheed <srasheed@marvell.com>
Signed-off-by: Vimlesh Kumar <vimleshk@marvell.com>
Link: https://patch.msgid.link/20260206111510.1045092-2-vimleshk@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet/mlx5e: remove declarations of mlx5e_shampo_{fill_umr,dealloc_hd}
Simon Horman [Fri, 6 Feb 2026 11:18:51 +0000 (11:18 +0000)]
net/mlx5e: remove declarations of mlx5e_shampo_{fill_umr,dealloc_hd}

These functions were recently removed by commit 24cf78c73831
("net/mlx5e: SHAMPO, Switch to header memcpy"), however,
their declarations were left behind.

This patch removes those declarations.

Flagged by review-prompts while I was exercising Orc mode locally.
Compile tested only.

Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Joe Damato <joe@dama.to>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260206-shampo-v1-1-75b20c6657e5@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: wan/fsl_ucc_hdlc: Fix dma_free_coherent() in uhdlc_memclean()
Thomas Fourier [Fri, 6 Feb 2026 08:53:33 +0000 (09:53 +0100)]
net: wan/fsl_ucc_hdlc: Fix dma_free_coherent() in uhdlc_memclean()

The priv->rx_buffer and priv->tx_buffer are alloc'd together as
contiguous buffers in uhdlc_init() but freed as two buffers in
uhdlc_memclean().

Change the cleanup to only call dma_free_coherent() once on the whole
buffer.
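
The ownership rule can be illustrated with a userspace analogue, where
malloc()/free() stand in for dma_alloc_coherent()/dma_free_coherent().
The struct and function names are hypothetical and chosen only for this
sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the driver's private data. */
struct hdlc_bufs {
    unsigned char *rx_buffer;   /* start of the single allocation */
    unsigned char *tx_buffer;   /* points into the same allocation */
    size_t rx_size;
    size_t tx_size;
};

/* Allocate RX and TX as one contiguous region. */
static int bufs_init(struct hdlc_bufs *b, size_t rx_size, size_t tx_size)
{
    assert(b != NULL);

    b->rx_buffer = malloc(rx_size + tx_size);
    if (!b->rx_buffer)
        return -1;
    b->tx_buffer = b->rx_buffer + rx_size;
    b->rx_size = rx_size;
    b->tx_size = tx_size;
    return 0;
}

/* Release the region with exactly one free of its start address. */
static void bufs_clean(struct hdlc_bufs *b)
{
    free(b->rx_buffer);
    b->rx_buffer = NULL;
    /* tx_buffer is not an allocation start, so it is only cleared. */
    b->tx_buffer = NULL;
}
```

Freeing tx_buffer on its own would hand the allocator an address it
never returned, which is the userspace counterpart of the mismatch
described above.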

Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Fixes: c19b6d246a35 ("drivers/net: support hdlc function for QE-UCC")
Cc: <stable@vger.kernel.org>
Signed-off-by: Thomas Fourier <fourier.thomas@gmail.com>
Link: https://patch.msgid.link/20260206085334.21195-2-fourier.thomas@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: dsa: eliminate local type for tc policers
Vladimir Oltean [Fri, 6 Feb 2026 07:54:21 +0000 (15:54 +0800)]
net: dsa: eliminate local type for tc policers

David Yang is saying that struct flow_action_entry in
include/net/flow_offload.h has gained new fields and DSA's struct
dsa_mall_policer_tc_entry, derived from that, isn't keeping up.
This structure is passed to drivers and they are completely oblivious to
the values of fields they don't see.

This has happened before, and almost always the solution was to make the
DSA layer thinner and use the upstream data structures. Here, the reason
why we didn't do that is because struct flow_action_entry :: police is
an anonymous structure.

That is easy enough to fix: just name those fields "struct
flow_action_police" and reference them from DSA.

Make the corresponding transformations to the two users (sja1105 and
felix): "rate_bytes_per_sec" -> "rate_bytes_ps".

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Co-developed-by: David Yang <mmyangfl@gmail.com>
Signed-off-by: David Yang <mmyangfl@gmail.com>
Link: https://patch.msgid.link/20260206075427.44733-1-mmyangfl@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agoserial: caif: fix use-after-free in caif_serial ldisc_close()
Jiayuan Chen [Fri, 6 Feb 2026 07:44:44 +0000 (15:44 +0800)]
serial: caif: fix use-after-free in caif_serial ldisc_close()

There is a use-after-free bug in caif_serial where handle_tx() may
access ser->tty after the tty has been freed.

The race condition occurs between ldisc_close() and packet transmission:

    CPU 0 (close)                     CPU 1 (xmit)
    -------------                     ------------
    ldisc_close()
      tty_kref_put(ser->tty)
      [tty may be freed here]
                     <-- race window -->
                                      caif_xmit()
                                        handle_tx()
                                          tty = ser->tty  // dangling ptr
                                          tty->ops->write() // UAF!
      schedule_work()
        ser_release()
          unregister_netdevice()

The root cause is that tty_kref_put() is called in ldisc_close() while
the network device is still active and can receive packets.

Since ser and tty have a 1:1 binding relationship with consistent
lifecycles (ser is allocated in ldisc_open and freed in ser_release
via unregister_netdevice, and each ser binds exactly one tty), we can
safely defer the tty reference release to ser_release() where the
network device is unregistered.

Fix this by moving tty_kref_put() from ldisc_close() to ser_release(),
after unregister_netdevice(). This ensures the tty reference is held
as long as the network device exists, preventing the UAF.

Note: We save ser->tty before unregister_netdevice() because ser is
embedded in netdev's private data and will be freed along with netdev
(needs_free_netdev = true).
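
The ordering constraint in the note above can be shown with a toy
userspace model. All types and helpers here are simplified stand-ins
(free() plays the role of unregister_netdevice() freeing the netdev and
its private data); none of this is the driver code.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy refcounted object standing in for the tty. */
struct tty {
    int refs;
};

/* Toy private data, embedded in the netdev allocation. */
struct ser_device {
    struct tty *tty;
};

struct netdev {
    struct ser_device priv;
};

static void tty_put(struct tty *t)
{
    assert(t->refs > 0);
    t->refs--;
}

/* Bind a tty to a freshly allocated netdev and take a reference. */
static struct netdev *ser_open(struct tty *t)
{
    struct netdev *nd = malloc(sizeof(*nd));

    if (!nd)
        return NULL;
    t->refs++;
    nd->priv.tty = t;
    return nd;
}

static void ser_release(struct netdev *nd)
{
    /* Save the pointer first: nd->priv is freed together with nd. */
    struct tty *t = nd->priv.tty;

    free(nd);       /* stands in for unregister_netdevice() */
    tty_put(t);     /* safe, uses the saved pointer */
}
```

Reading nd->priv.tty after the free() would be the same class of bug
the patch is fixing, just moved to a different place.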

How to reproduce: Add mdelay(500) at the beginning of ldisc_close()
to widen the race window, then run the reproducer program [1].

Note: There is a separate deadloop issue in handle_tx() when using
PORT_UNKNOWN serial ports (e.g., /dev/ttyS3 in QEMU without proper
serial backend). This deadloop exists even without this patch,
and is likely caused by inconsistency between uart_write_room() and
uart_write() in serial core. It has been addressed in a separate
patch [2].

KASAN report:

==================================================================
BUG: KASAN: slab-use-after-free in handle_tx+0x5d1/0x620
Read of size 1 at addr ffff8881131e1490 by task caif_uaf_trigge/9929

Call Trace:
 <TASK>
 dump_stack_lvl+0x10e/0x1f0
 print_report+0xd0/0x630
 kasan_report+0xe4/0x120
 handle_tx+0x5d1/0x620
 dev_hard_start_xmit+0x9d/0x6c0
 __dev_queue_xmit+0x6e2/0x4410
 packet_xmit+0x243/0x360
 packet_sendmsg+0x26cf/0x5500
 __sys_sendto+0x4a3/0x520
 __x64_sys_sendto+0xe0/0x1c0
 do_syscall_64+0xc9/0xf80
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f615df2c0d7

Allocated by task 9930:

Freed by task 64:

Last potentially related work creation:

The buggy address belongs to the object at ffff8881131e1000
 which belongs to the cache kmalloc-cg-2k of size 2048
The buggy address is located 1168 bytes inside of
 freed 2048-byte region [ffff8881131e1000, ffff8881131e1800)

The buggy address belongs to the physical page:
page_owner tracks the page as allocated
page last free pid 9778 tgid 9778 stack trace:

Memory state around the buggy address:
 ffff8881131e1380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff8881131e1400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff8881131e1480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                         ^
 ffff8881131e1500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff8881131e1580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
[1]: https://gist.github.com/mrpre/f683f244544f7b11e7fa87df9e6c2eeb
[2]: https://lore.kernel.org/linux-serial/20260204074327.226165-1-jiayuan.chen@linux.dev/T/#u

Reported-by: syzbot+827272712bd6d12c79a4@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000a4a7550611e234f5@google.com/T/
Fixes: 56e0ef527b18 ("drivers/net: caif: fix wrong rtnl_is_locked() usage")
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Reviewed-by: Jijie Shao <shaojijie@huawei.com>
Link: https://patch.msgid.link/20260206074450.154267-1-jiayuan.chen@linux.dev
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: ethernet: marvell: skge: remove incorrect conflicting PCI ID
Ethan Nelson-Moore [Fri, 6 Feb 2026 07:17:14 +0000 (23:17 -0800)]
net: ethernet: marvell: skge: remove incorrect conflicting PCI ID

The ID 1186:4302 is matched by both r8169 and skge. The same device ID
should not be in more than one driver, because in that case, which
driver is used is unpredictable. I downloaded the latest drivers for
all hardware revisions of the D-Link DGE-530T from D-Link's website,
and the only drivers which contain this ID are Realtek drivers.
Therefore, remove this device ID from skge.

In the kernel bug report which requested addition of this device ID,
someone created a patch to add the ID to skge. Then, it was pointed
out that this device is an "r8169 in disguise", and a patch was created
to add it to r8169. Somehow, both of these patches got merged. See the
link below.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=38862
Fixes: c074304c2bcf ("add pci-id for DGE-530T")
Cc: stable@vger.kernel.org
Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
Link: https://patch.msgid.link/20260206071724.15268-1-enelsonmoore@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agoxfrm: fix ip_rt_bug race in icmp_route_lookup reverse path
Jiayuan Chen [Fri, 6 Feb 2026 05:02:19 +0000 (13:02 +0800)]
xfrm: fix ip_rt_bug race in icmp_route_lookup reverse path

icmp_route_lookup() performs multiple route lookups to find a suitable
route for sending ICMP error messages, with special handling for XFRM
(IPsec) policies.

The lookup sequence is:
1. First, lookup output route for ICMP reply (dst = original src)
2. Pass through xfrm_lookup() for policy check
3. If blocked (-EPERM) or dst is not local, enter "reverse path"
4. In reverse path, call xfrm_decode_session_reverse() to get fl4_dec
   which reverses the original packet's flow (saddr<->daddr swapped)
5. If fl4_dec.saddr is local (we are the original destination), use
   __ip_route_output_key() for output route lookup
6. If fl4_dec.saddr is NOT local (we are a forwarding node), use
   ip_route_input() to simulate the reverse packet's input path
7. Finally, pass rt2 through xfrm_lookup() with XFRM_LOOKUP_ICMP flag

The bug occurs in step 6: ip_route_input() is called with fl4_dec.daddr
(original packet's source) as destination. If this address becomes local
between the initial check and ip_route_input() call (e.g., due to
concurrent "ip addr add"), ip_route_input() returns a LOCAL route with
dst.output set to ip_rt_bug.

This route is then used for ICMP output, causing dst_output() to call
ip_rt_bug(), triggering a WARN_ON:

 ------------[ cut here ]------------
 WARNING: net/ipv4/route.c:1275 at ip_rt_bug+0x21/0x30, CPU#1
 Call Trace:
  <TASK>
  ip_push_pending_frames+0x202/0x240
  icmp_push_reply+0x30d/0x430
  __icmp_send+0x1149/0x24f0
  ip_options_compile+0xa2/0xd0
  ip_rcv_finish_core+0x829/0x1950
  ip_rcv+0x2d7/0x420
  __netif_receive_skb_one_core+0x185/0x1f0
  netif_receive_skb+0x90/0x450
  tun_get_user+0x3413/0x3fb0
  tun_chr_write_iter+0xe4/0x220
  ...

Fix this by checking rt2->rt_type after ip_route_input(). If it's
RTN_LOCAL, the route cannot be used for output, so treat it as an error.

The reproducer requires kernel modification to widen the race window,
making it unsuitable as a selftest. It is available at:

  https://gist.github.com/mrpre/eae853b72ac6a750f5d45d64ddac1e81

Reported-by: syzbot+e738404dcd14b620923c@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000b1060905eada8881@google.com/T/
Closes: https://lore.kernel.org/r/20260128090523.356953-1-jiayuan.chen@linux.dev
Fixes: 8b7817f3a959 ("[IPSEC]: Add ICMP host relookup support")
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://patch.msgid.link/20260206050220.59642-1-jiayuan.chen@linux.dev
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agoMerge branch 'net-ftgmac100-various-probe-cleanups'
Paolo Abeni [Tue, 10 Feb 2026 12:40:52 +0000 (13:40 +0100)]
Merge branch 'net-ftgmac100-various-probe-cleanups'

Jacky Chou says:

====================
net: ftgmac100: Various probe cleanups

The probe function of the ftgmac100 is rather complex, due to the way
it has evolved over time, dealing with poor DT descriptions, and new
variants of the MAC.

Make use of DT match data to identify the MAC variant, rather than
looking at the compatible string all the time.

Make use of devm_ calls to simplify cleanup. This indirectly fixes
inconsistent goto label names.

Always probe the MDIO bus, when it exists. This simplifies the logic a
bit.

Move code into helpers to simplify probe.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jacky Chou <jacky_chou@aspeedtech.com>
====================

Link: https://patch.msgid.link/20260206-ftgmac-cleanup-v5-0-ad28a9067ea7@aspeedtech.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: ftgmac100: Use devm_mdiobus_alloc/devm_of_mdiobus_register
Jacky Chou [Fri, 6 Feb 2026 03:17:55 +0000 (11:17 +0800)]
net: ftgmac100: Use devm_mdiobus_alloc/devm_of_mdiobus_register

Make use of devm_ methods to allocate and register mdiobus to simplify
cleanup.

Signed-off-by: Jacky Chou <jacky_chou@aspeedtech.com>
Link: https://patch.msgid.link/20260206-ftgmac-cleanup-v5-15-ad28a9067ea7@aspeedtech.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: ftgmac100: Fix wrong netif_napi_del in release
Andrew Lunn [Fri, 6 Feb 2026 03:17:54 +0000 (11:17 +0800)]
net: ftgmac100: Fix wrong netif_napi_del in release

netif_napi_add() is called in open. There is a symmetric call to
netif_napi_del() in stop. Remove the wrong call to netif_napi_del() in
release.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jacky Chou <jacky_chou@aspeedtech.com>
Link: https://patch.msgid.link/20260206-ftgmac-cleanup-v5-14-ad28a9067ea7@aspeedtech.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: ftgmac100: Simplify condition on HW arbitration
Andrew Lunn [Fri, 6 Feb 2026 03:17:53 +0000 (11:17 +0800)]
net: ftgmac100: Simplify condition on HW arbitration

The MAC ID is sufficient to indicate this is an ast2600.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jacky Chou <jacky_chou@aspeedtech.com>
Link: https://patch.msgid.link/20260206-ftgmac-cleanup-v5-13-ad28a9067ea7@aspeedtech.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: ftgmac100: Remove redundant PHY_POLL
Andrew Lunn [Fri, 6 Feb 2026 03:17:52 +0000 (11:17 +0800)]
net: ftgmac100: Remove redundant PHY_POLL

When an MDIO bus is allocated, the irqs for each PHY are set to
polling. Remove the redundant code in the MAC driver which does the
same.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jacky Chou <jacky_chou@aspeedtech.com>
Link: https://patch.msgid.link/20260206-ftgmac-cleanup-v5-12-ad28a9067ea7@aspeedtech.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: ftgmac100: Move DT probe into a helper
Andrew Lunn [Fri, 6 Feb 2026 03:17:51 +0000 (11:17 +0800)]
net: ftgmac100: Move DT probe into a helper

By moving all the DT probe code into a helper, the complex if else if
else structure can be simplified. No functional change intended.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jacky Chou <jacky_chou@aspeedtech.com>
Link: https://patch.msgid.link/20260206-ftgmac-cleanup-v5-11-ad28a9067ea7@aspeedtech.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: ftgmac100: Simplify legacy MDIO setup
Andrew Lunn [Fri, 6 Feb 2026 03:17:50 +0000 (11:17 +0800)]
net: ftgmac100: Simplify legacy MDIO setup

There are old device trees which place the PHY nodes directly in the
MAC nodes, rather than within an MDIO container node.

The probe logic indicates that the use of NCSI and the legacy
placement of PHYs are mutually exclusive. Hence priv->use_ncsi cannot
be true, so there is no reason to set it false.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jacky Chou <jacky_chou@aspeedtech.com>
Link: https://patch.msgid.link/20260206-ftgmac-cleanup-v5-10-ad28a9067ea7@aspeedtech.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: ftgmac100: Always register the MDIO bus when it exists
Andrew Lunn [Fri, 6 Feb 2026 03:17:49 +0000 (11:17 +0800)]
net: ftgmac100: Always register the MDIO bus when it exists

The Aspeed 2400 and 2500, as well as the original Faraday version of the
MAC, have MDIO bus controllers as part of the MAC. Since it exists,
always registering it makes the code simpler, and causes no harm. If
there is no mdio node in device tree, of_mdiobus_register() will fall
back to mdiobus_register(), making it safe.

AST2600 uses an external MDIO controller and does not have an embedded
MDIO bus in the MAC. For such configurations, the legacy MII probe path
must not be entered without a registered mii_bus.

Add an explicit check to fail gracefully when no MDIO bus is present,
preventing a NULL pointer dereference while keeping the intended
behavior for platforms without embedded MDIO.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jacky Chou <jacky_chou@aspeedtech.com>
Link: https://patch.msgid.link/20260206-ftgmac-cleanup-v5-9-ad28a9067ea7@aspeedtech.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: ftgmac100: Move NCSI probe code into a helper
Andrew Lunn [Fri, 6 Feb 2026 03:17:48 +0000 (11:17 +0800)]
net: ftgmac100: Move NCSI probe code into a helper

To help reduce the complexity of the probe function, move the NCSI
probe code into a helper.

The refactoring results in improved cleanup of the fixed PHY in
error paths.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jacky Chou <jacky_chou@aspeedtech.com>
Link: https://patch.msgid.link/20260206-ftgmac-cleanup-v5-8-ad28a9067ea7@aspeedtech.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: ftgmac100: Simplify error handling for ftgmac100_initial_mac
Andrew Lunn [Fri, 6 Feb 2026 03:17:47 +0000 (11:17 +0800)]
net: ftgmac100: Simplify error handling for ftgmac100_initial_mac

ftgmac100_initial_mac() does not allocate any resources. All resources
allocated by the probe function up to this call use devm_ methods. So
just return the error code rather than use a goto.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jacky Chou <jacky_chou@aspeedtech.com>
Link: https://patch.msgid.link/20260206-ftgmac-cleanup-v5-7-ad28a9067ea7@aspeedtech.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: ftgmac100: Use devm_clk_get_enabled
Andrew Lunn [Fri, 6 Feb 2026 03:17:46 +0000 (11:17 +0800)]
net: ftgmac100: Use devm_clk_get_enabled

Make use of devm_ methods to request and enable clocks to simplify
cleanup.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jacky Chou <jacky_chou@aspeedtech.com>
Link: https://patch.msgid.link/20260206-ftgmac-cleanup-v5-6-ad28a9067ea7@aspeedtech.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: ftgmac100: Use devm_request_memory_region/devm_ioremap
Andrew Lunn [Fri, 6 Feb 2026 03:17:45 +0000 (11:17 +0800)]
net: ftgmac100: Use devm_request_memory_region/devm_ioremap

Make use of devm_ methods to request and remap the device memory to
simplify cleanup.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jacky Chou <jacky_chou@aspeedtech.com>
Link: https://patch.msgid.link/20260206-ftgmac-cleanup-v5-5-ad28a9067ea7@aspeedtech.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: ftgmac100: Use devm_alloc_etherdev()
Andrew Lunn [Fri, 6 Feb 2026 03:17:44 +0000 (11:17 +0800)]
net: ftgmac100: Use devm_alloc_etherdev()

Make use of devm_alloc_etherdev() to simplify cleanup.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jacky Chou <jacky_chou@aspeedtech.com>
Link: https://patch.msgid.link/20260206-ftgmac-cleanup-v5-4-ad28a9067ea7@aspeedtech.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: ftgmac100: Replace all of_device_is_compatible()
Andrew Lunn [Fri, 6 Feb 2026 03:17:43 +0000 (11:17 +0800)]
net: ftgmac100: Replace all of_device_is_compatible()

Now that the priv structure includes the MAC ID, make use of it
instead of the more expensive of_device_is_compatible().

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jacky Chou <jacky_chou@aspeedtech.com>
Link: https://patch.msgid.link/20260206-ftgmac-cleanup-v5-3-ad28a9067ea7@aspeedtech.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: ftgmac100: Add match data containing MAC ID
Andrew Lunn [Fri, 6 Feb 2026 03:17:42 +0000 (11:17 +0800)]
net: ftgmac100: Add match data containing MAC ID

The driver supports 4 different versions of the FTGMAC core.  Extend
the compatible matching to include match data, which indicates the
version of the MAC. Default to the initial Faraday device if DT is not
being used. Look up the match data early in probe to keep error handling
simple.
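
The idea of attaching a MAC ID to each compatible can be sketched as a
plain lookup table. This is a userspace model, not the driver: the enum
names and lookup_mac_id() are hypothetical, and the compatible strings
are quoted from memory of the DT binding, so treat them as assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical identifiers for the four MAC versions. */
enum mac_id {
    MAC_FARADAY,
    MAC_AST2400,
    MAC_AST2500,
    MAC_AST2600,
};

struct match_entry {
    const char *compatible;
    enum mac_id id;
};

static const struct match_entry match_table[] = {
    { "faraday,ftgmac100",  MAC_FARADAY },
    { "aspeed,ast2400-mac", MAC_AST2400 },
    { "aspeed,ast2500-mac", MAC_AST2500 },
    { "aspeed,ast2600-mac", MAC_AST2600 },
};

/*
 * Resolve the MAC ID once, early. A NULL compatible models "DT is not
 * being used" and falls back to the original Faraday device.
 * Returns 0 on success, -1 for an unknown compatible.
 */
static int lookup_mac_id(const char *compatible, enum mac_id *id)
{
    size_t i;

    assert(id != NULL);

    if (!compatible) {
        *id = MAC_FARADAY;
        return 0;
    }
    for (i = 0; i < sizeof(match_table) / sizeof(match_table[0]); i++) {
        if (strcmp(match_table[i].compatible, compatible) == 0) {
            *id = match_table[i].id;
            return 0;
        }
    }
    return -1;
}
```

Later code can then branch on the stored ID instead of repeating string
comparisons, and the only failure case is handled before any resource
has been acquired.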

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jacky Chou <jacky_chou@aspeedtech.com>
Link: https://patch.msgid.link/20260206-ftgmac-cleanup-v5-2-ad28a9067ea7@aspeedtech.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: ftgmac100: List all compatibles
Andrew Lunn [Fri, 6 Feb 2026 03:17:41 +0000 (11:17 +0800)]
net: ftgmac100: List all compatibles

As a step towards cleaning up the probe function, list each compatible
the driver supports.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jacky Chou <jacky_chou@aspeedtech.com>
Link: https://patch.msgid.link/20260206-ftgmac-cleanup-v5-1-ad28a9067ea7@aspeedtech.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: sunhme: Fix sbus regression
René Rebe [Thu, 5 Feb 2026 16:09:59 +0000 (17:09 +0100)]
net: sunhme: Fix sbus regression

Commit cc216e4b44ce ("net: sunhme: Switch SBUS to devres") replaced the
explicitly sized of_ioremap() calls (using the *_REG_SIZE constants such
as BMAC_REG_SIZE) with devm_platform_ioremap_resource(), which maps the
whole resource. However, this does not work on my Sun Ultra 2 with SBUS
HMEs:

hme f0072f38: error -EBUSY: can't request region for resource [mem 0x1ffe8c07000-0x1ffe8c0701f]
hme f0072f38: Cannot map TCVR registers.
hme f0072f38: probe with driver hme failed with error -16
hme f007ab44: error -EBUSY: can't request region for resource [mem 0x1ff28c07000-0x1ff28c0701f]
hme f007ab44: Cannot map TCVR registers.
hme f007ab44: probe with driver hme failed with error -16

It turns out the open-firmware resources overlap, at least on this
machine and PROM version:

hexdump /proc/device-tree/sbus@1f,0/SUNW,hme@2,8c00000/reg:
00 00 00 02 08 c0 00 00  00 00 01 08
00 00 00 02 08 c0 20 00  00 00 20 00
00 00 00 02 08 c0 40 00  00 00 20 00
00 00 00 02 08 c0 60 00  00 00 20 00
00 00 00 02 08 c0 70 00  00 00 00 20

And the driver previously explicitly mapped way smaller mmio regions:

/proc/iomem:
1ff28c00000-1ff28c00107 : HME Global Regs
1ff28c02000-1ff28c02033 : HME TX Regs
1ff28c04000-1ff28c0401f : HME RX Regs
1ff28c06000-1ff28c0635f : HME BIGMAC Regs
1ff28c07000-1ff28c0701f : HME Tranceiver Regs

Quirk this specific issue by truncating the previous resource to not
overlap into the TCVR registers.
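
The overlap can be checked by decoding the reg property shown above. The
sketch below assumes each entry is three big-endian 32-bit cells (slot,
offset, size), which is consistent with the hexdump and /proc/iomem
output, and it truncates an entry at the start of the next one. The
helper names are hypothetical, and the actual patch may compute the
truncated size differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct reg_entry {
    uint32_t slot;
    uint32_t offset;
    uint32_t size;
};

/* Read one big-endian 32-bit cell. */
static uint32_t be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Decode up to max (slot, offset, size) tuples; returns the count. */
static size_t parse_reg(const unsigned char *raw, size_t len,
                        struct reg_entry *out, size_t max)
{
    size_t n = len / 12;
    size_t i;

    assert(raw != NULL && out != NULL);

    if (n > max)
        n = max;
    for (i = 0; i < n; i++) {
        out[i].slot = be32(raw + i * 12);
        out[i].offset = be32(raw + i * 12 + 4);
        out[i].size = be32(raw + i * 12 + 8);
    }
    return n;
}

/*
 * Truncate every entry that runs into the start of the following one.
 * Assumes entries are sorted by ascending offset within a slot.
 * Returns the number of entries that were shortened.
 */
static size_t truncate_overlaps(struct reg_entry *r, size_t n)
{
    size_t fixed = 0;
    size_t i;

    for (i = 0; i + 1 < n; i++) {
        uint32_t gap;

        if (r[i].slot != r[i + 1].slot || r[i + 1].offset <= r[i].offset)
            continue;
        gap = r[i + 1].offset - r[i].offset;
        if (r[i].size > gap) {
            r[i].size = gap;
            fixed++;
        }
    }
    return fixed;
}
```

On the property above only the fourth entry (offset 0x08c06000, size
0x2000) reaches into the fifth (offset 0x08c07000), so it is the only
one shortened, to 0x1000 bytes.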

Fixes: cc216e4b44ce ("net: sunhme: Switch SBUS to devres")
Signed-off-by: René Rebe <rene@exactco.de>
Reviewed-by: Sean Anderson <seanga2@gmail.com>
Link: https://patch.msgid.link/20260205.170959.89574674688839340.rene@exactco.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agoMerge branch 'hsr-implement-more-robust-duplicate-discard-algorithm'
Paolo Abeni [Tue, 10 Feb 2026 11:02:31 +0000 (12:02 +0100)]
Merge branch 'hsr-implement-more-robust-duplicate-discard-algorithm'

Felix Maurer says:

====================
hsr: Implement more robust duplicate discard algorithm

The duplicate discard algorithms for PRP and HSR do not work reliably
with certain link faults. Especially with packet loss on one link, the
duplicate discard algorithms drop valid packets. For a more thorough
description see patches 4 (for PRP) and 6 (for HSR).

This patchset replaces the current algorithms (based on a drop window
for PRP and highest seen sequence number for HSR) with a single new one
that tracks the received sequence numbers individually (descriptions
again in patches 4 and 6).
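
As a rough illustration of tracking sequence numbers individually, the
sketch below keeps one bit per sequence number in blocks of 128, with 64
blocks per node (the figures mentioned in the RFC discussion quoted
further down). It is a simplification under stated assumptions: it uses
a direct-mapped array instead of the xarray the patches use, and it
leaves out block aging and the per-port state needed for HSR. All names
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SEQ_PER_BLOCK 128u
#define NUM_BLOCKS    64u

struct seq_block {
    bool valid;
    uint16_t index;                        /* seq / SEQ_PER_BLOCK */
    uint64_t seen[SEQ_PER_BLOCK / 64];     /* one bit per sequence number */
};

struct seq_tracker {
    struct seq_block blocks[NUM_BLOCKS];
};

static void tracker_init(struct seq_tracker *t)
{
    memset(t, 0, sizeof(*t));
}

/* Record seq for one node; returns true if it was already seen. */
static bool seq_seen_before(struct seq_tracker *t, uint16_t seq)
{
    uint16_t index = seq / SEQ_PER_BLOCK;
    unsigned int bit = seq % SEQ_PER_BLOCK;
    uint64_t mask = (uint64_t)1 << (bit % 64);
    struct seq_block *b;

    assert(t != NULL);
    b = &t->blocks[index % NUM_BLOCKS];

    if (!b->valid || b->index != index) {
        /* Slow path: the slot is reused for a new block of numbers. */
        memset(b->seen, 0, sizeof(b->seen));
        b->index = index;
        b->valid = true;
    }
    /* Fast path: one comparison and one bit test/set. */
    if (b->seen[bit / 64] & mask)
        return true;
    b->seen[bit / 64] |= mask;
    return false;
}
```

Because every number has its own bit, a frame that arrives late after a
loss on one link (for example 11 arriving after 12) is still accepted
once and only its duplicate is dropped.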

The changes will lead to higher memory usage and more work to do for
each packet. But I argue that this is an acceptable trade-off to make
for a more robust PRP and HSR behavior with faulty links. After all,
both protocols are to be used in environments where redundancy is needed
and people are willing to set up special network topologies to achieve
that.

Some more reasoning on the overhead and expected scale of the deployment
from the RFC discussion:

> As for the expected scale, there are two dimensions: the number of nodes
> in the network and the data rate with which they send.
>
> The number of nodes in the network affect the memory usage because each
> node now has the block buffer. For PRP that's 64 blocks * 32 byte =
> 2kbyte for each node in the node table. A PRP network doesn't have an
> explicit limit for the number of nodes. However, the whole network is a
> single layer-2 segment which shouldn't grow too large anyways. Even if
> one really tries to put 1000 nodes into the PRP network, the memory
> overhead (2Mbyte) is acceptable in my opinion.
>
> For HSR, the blocks would be larger because we need to track the
> sequence numbers per port. I expect 64 blocks * 80 byte = 5kbyte per
> node in the node table. There is no explicit limit for the size of an
> HSR ring either. But I expect them to be of limited size because the
> forwarding delays add up throughout the ring. I've seen vendors limiting
> the ring size to 50 nodes with 100Mbit/s links and 300 with 1Gbit/s
> links. In both cases I consider the memory overhead acceptable.
>
> The data rates are harder to reason about. In general, the data rates
> for HSR and PRP are limited because too high packet rates would lead to
> very fast re-use of the 16bit sequence numbers. The IEC 62439-3:2021
> mentions 100Mbit/s links and 1Gbit/s links. I don't expect HSR or PRP
> networks to scale out to, e.g., 10Gbit/s links with the current
> specification as this would mean that sequence numbers could repeat as
> often as every ~4ms. The default constants in the IEC standard, which we
> also use, are oriented at a 100Mbit/s network.
>
> In my tests with veth pairs, the CPU overhead didn't lead to
> significantly lower data rates. The main factor limiting the data rate
> at the moment, I assume, is the per-node spinlock that is taken for each
> received packet. IMHO, there is a lot more to gain in terms of CPU
> overhead from making this lock smaller or getting rid of it, than we
> lose with the more accurate duplicate discard algorithm in this patchset.
>
> The CPU overhead of the algorithm benefits from the fact that in high
> packet rate scenarios (where it really matters) many packets will have
> sequence numbers in already initialized blocks. These packets just have
> additionally: one xarray lookup, one comparison, and one bit setting. If
> a block needs to be initialized (once every 128 packets plus their 128
> duplicates if all sequence numbers are seen), we will have: one
> xa_erase, a bunch of memory writes, and one xa_store.
>
> In theory, all packets could end up in the slow path if a node sends
> every 128th packet to us. If this is sent from a well behaving node, the
> packet rate wouldn't be an issue anymore, though.
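
As a rough cross-check of the figures quoted above, the arithmetic can be
written down in a few lines. This is only a sketch: the function names are
made up for illustration, and the wrap-time estimate assumes back-to-back
minimum-size Ethernet frames (64 bytes plus 20 bytes of preamble and
inter-frame gap), which the discussion does not state explicitly.

```c
#include <stdint.h>

/* Per-node block buffer size in bytes: number of blocks * bytes per block. */
static uint64_t node_buf_bytes(uint64_t blocks, uint64_t block_bytes)
{
	return blocks * block_bytes;
}

/*
 * Shortest time, in microseconds, until the 16-bit sequence number space
 * repeats on a link of the given speed.
 * Assumption: minimum-size frames, 64 + 8 (preamble) + 12 (gap) = 84 bytes.
 */
static uint64_t seq_wrap_us(uint64_t link_bits_per_sec)
{
	const uint64_t bits_per_frame = 84 * 8;
	const uint64_t seq_space = 1ULL << 16;

	return seq_space * bits_per_frame * 1000000ULL / link_bits_per_sec;
}
```

Under these assumptions, node_buf_bytes(64, 32) gives 2048 bytes for PRP,
node_buf_bytes(64, 80) gives 5120 bytes for HSR, and seq_wrap_us() yields
about 4.4ms for a 10Gbit/s link and about 440ms for a 100Mbit/s link.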

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
====================

Link: https://patch.msgid.link/cover.1770299429.git.fmaurer@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agoMAINTAINERS: Assign hsr selftests to HSR
Felix Maurer [Thu, 5 Feb 2026 13:57:35 +0000 (14:57 +0100)]
MAINTAINERS: Assign hsr selftests to HSR

Despite the HSR subsystem being orphaned at the moment due to the original
maintainer being unreachable for a while, assign the selftests to the
subsystem for the future.

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/f4a356b96f5e0c99d9db3984ea62596c99a97469.1770299429.git.fmaurer@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agoselftests: hsr: Add more link fault tests for HSR
Felix Maurer [Thu, 5 Feb 2026 13:57:34 +0000 (14:57 +0100)]
selftests: hsr: Add more link fault tests for HSR

Run the packet loss and reordering tests also for both HSR versions. Now
they can be removed from the hsr_ping tests completely. The timeout needs
to be increased because there are 15 link fault test cases now, each of
them taking 5-6sec for the test plus at most 5sec for the HSR node tables
to get merged, and we also want some headroom to keep the test runs stable.

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/eb6f667d3804ce63d86f0ee3fbc0e0ac9e1a209a.1770299429.git.fmaurer@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agohsr: Implement more robust duplicate discard for HSR
Felix Maurer [Thu, 5 Feb 2026 13:57:33 +0000 (14:57 +0100)]
hsr: Implement more robust duplicate discard for HSR

The HSR duplicate discard algorithm had even more basic problems than those
described for PRP in the previous patch. It relied only on the last
received sequence number to decide if a new frame should be forwarded to
any port. This does not work correctly in any case where frames are
received out of order. The linked bug report claims that this can even
happen with perfectly fine links due to the order in which incoming frames
are processed (which can be unexpected on multi-core systems). The issue
also occasionally shows up in the HSR selftests. The main reason is that
the sequence number that was last forwarded to the master port may have
skipped a number which will in turn never be delivered to the host.
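
The failure mode can be reproduced with a simplified model of a filter that
only remembers the last accepted sequence number. The sketch below is not
the removed kernel code; the struct and function names are hypothetical and
the timeout handling is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of a filter that remembers only the last accepted sequence number. */
struct last_seq_filter {
	uint16_t last;
	bool valid;
};

/* True if a is before or equal to b in 16-bit wrap-around arithmetic. */
static bool seq_before_or_eq(uint16_t a, uint16_t b)
{
	return (int16_t)(uint16_t)(b - a) >= 0;
}

/* Returns true if the frame would be discarded as a duplicate. */
static bool last_seq_discard(struct last_seq_filter *f, uint16_t seq)
{
	if (f->valid && seq_before_or_eq(seq, f->last))
		return true;
	f->last = seq;
	f->valid = true;
	return false;
}
```

Feeding the sequence numbers 1, 3, 2 into this model accepts 1 and 3 but
discards 2, although 2 has never been delivered.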

As the problem (we accidentally skip over a sequence number that has not
been received but will be received in the future) is similar to PRP, we can
apply a similar solution. The duplicate discard algorithm based on the
"sparse bitmap" works well for HSR if it is extended to track one bitmap
for each port (A, B, master, interlink). To do this, change the sequence
number blocks to contain a flexible array member as the last member that
can keep chunks for as many bitmaps as we need. This design makes it easy
to reuse the same algorithm in a potential PRP RedBox implementation.

The duplicate discard algorithm functions are modified to deal with
sequence number blocks of different sizes and to correctly use the array of
bitmap chunks. There is one notable peculiarity for HSR: the port type
enumeration has a special value NONE (0). This leads to the number of port
types being 5 although only 4 actually exist. To save memory, remove the
NONE port from
the bitmap (by subtracting 1) when setting up the block buffer and when
accessing the bitmap chunks in the array.
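
A minimal sketch of such a block layout, assuming 128 sequence numbers per
block and hypothetical type and enumerator names (only the NONE value of 0
is taken from the description above), could look like this:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical port types; NONE has value 0 and gets no bitmap. */
enum port_type { PT_NONE = 0, PT_A, PT_B, PT_INTERLINK, PT_MASTER, PT_COUNT };

#define CHUNK_WORDS 2	/* 128 bits per chunk; an assumption of this sketch */

struct seq_chunk {
	uint64_t bits[CHUNK_WORDS];
};

struct seq_block {
	uint64_t time;			/* creation timestamp */
	uint16_t block_idx;		/* which part of the sequence space */
	struct seq_chunk chunks[];	/* one bitmap chunk per tracked port */
};

/* Size of a block that tracks nports ports. */
static size_t seq_block_size(unsigned int nports)
{
	return sizeof(struct seq_block) + nports * sizeof(struct seq_chunk);
}

/* Bitmap chunk for a port; PT_NONE is not stored, so shift the index by 1. */
static struct seq_chunk *seq_block_chunk(struct seq_block *b, enum port_type pt)
{
	return &b->chunks[pt - 1];
}
```

A block for HSR would then be allocated with seq_block_size(PT_COUNT - 1),
while a single-bitmap user would pass 1. With 16-byte chunks, four tracked
ports add 64 bytes of bitmap per block.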

Removing the old algorithm allows us to get rid of a few fields that are
not needed any more: time_out and seq_out for each port. We can also remove
some functions that were only necessary for the previous duplicate discard
algorithm.

The removal of seq_out is possible despite its previous usage in
hsr_register_frame_in: it was used to prevent updates to time_in when
"invalid" sequence numbers were received. With the new duplicate discard
algorithm, time_in has no relevance for the expiry of sequence numbers
anymore. They will expire based on the timestamps in the sequence number
blocks after at most 400ms. There is no need that a node "re-registers" to
"resume communication": after 400ms, all sequence numbers are accepted
again. Also, according to the IEC 62439-3:2021, all nodes are supposed to
send no traffic for 500ms after boot to lead exactly to this expiry of seen
sequence numbers. time_in is still used for pruning nodes from the node
table after no traffic has been received for 60sec. Pruning is only needed
if the node is really gone and has not been sending any traffic for that
period.

seq_out was also used to report the last incoming sequence number from a
node through netlink. I am not sure how useful this value is to userspace
at all, but it is now retrieved from the sequence number blocks. This number
can be outdated after node merging until a new block has been added.

Update the KUnit test for the PRP duplicate discard so that the node
allocation matches and expectations on the removed fields are removed.

Reported-by: Yoann Congal <yoann.congal@smile.fr>
Closes: https://lore.kernel.org/netdev/7d221a07-8358-4c0b-a09c-3b029c052245@smile.fr/
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/36dc3bc5bdb7e68b70bb5ef86f53ca95a3f35418.1770299429.git.fmaurer@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agoselftests: hsr: Add tests for more link faults with PRP
Felix Maurer [Thu, 5 Feb 2026 13:57:32 +0000 (14:57 +0100)]
selftests: hsr: Add tests for more link faults with PRP

Add tests where one link has different rates of packet loss or reorders
packets. PRP should still be able to recover from these link faults and
show no packet loss.  However, it is acceptable to receive some level of
duplicate packets. This matches the current specification (IEC
62439-3:2021) of the duplicate discard algorithm that requires it to be
"designed such that it never rejects a legitimate frame, while occasional
acceptance of a duplicate can be tolerated." The rate of acceptable
duplicates in this test is intentionally high (10%) to make the test
stable; the values I observed in the worst test cases (20% loss) are around
5% duplicates.

The duplicates occur because of the 10ms ping interval in the test. As
blocks expire after 400ms based on the timestamp of the first received
sequence number in the block, approx. every 40th packet will lead to a
new, clean block being used where the sequence number hasn't been seen
before. As this
occurs on both nodes in the test (for requests and replies), we observe
around 20 duplicate frames.

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/7b36506d3a80e53786fe56526cf6046c74dfeee1.1770299429.git.fmaurer@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agohsr: Implement more robust duplicate discard for PRP
Felix Maurer [Thu, 5 Feb 2026 13:57:31 +0000 (14:57 +0100)]
hsr: Implement more robust duplicate discard for PRP

The PRP duplicate discard algorithm does not work reliably with certain
link faults. Especially with packet loss on one link, the duplicate discard
algorithm drops valid packets which leads to packet loss on the PRP
interface where the link fault should in theory be perfectly recoverable by
PRP. This happens because the algorithm opens the drop window on the lossy
link, covering received and lost sequence numbers. If the other, non-lossy
link receives the duplicate for a lost frame, it is within the drop window
of the lossy link and therefore dropped.

Since IEC 62439-3:2012, a node has one sequence number counter for frames
it sends, instead of one sequence number counter for each destination.
Therefore, a node can not expect to receive contiguous sequence numbers
from a sender. A missing sequence number can be totally normal (if the
sender intermittently communicates with another node) or mean a frame was
lost.

The algorithm, as previously implemented in commit 05fd00e5e7b1 ("net: hsr:
Fix PRP duplicate detection"), was part of IEC 62439-3:2010 (HSRv0/PRPv0)
but was removed with IEC 62439-3:2012 (HSRv1/PRPv1). Since then, no
algorithm has been specified; the choice is left to implementers. The
algorithm should be "designed such that it never rejects a legitimate
frame, while occasional acceptance of a duplicate can be tolerated" (IEC
62439-3:2021).

For the duplicate discard algorithm, this means that 1) we need to track
the sequence numbers individually to account for non-contiguous sequence
numbers, and 2) we should always err on the side of accepting a duplicate
rather than dropping a valid frame.

The idea of the new algorithm is to store the seen sequence numbers in a
bitmap. To keep the size of the bitmap in control, we store it as a "sparse
bitmap" where the bitmap is split into blocks and not all blocks exist at
the same time. The sparse bitmap is implemented using an xarray that keeps
the references to the individual blocks and a backing ring buffer that
stores the actual blocks. New blocks are initialized in the buffer and
added to the xarray as needed when new frames arrive. Existing blocks are
removed in two conditions:
1. The block found for an arriving sequence number is old and therefore not
   relevant to the duplicate discard algorithm anymore, i.e., it has been
   added more than the entry forget time ago. In this case, the block is
   removed from the xarray and marked as forgotten (by setting its
   timestamp to 0).
2. Space is needed in the ring buffer for a new block. In this case, the
   block is removed from the xarray, if it hasn't already been forgotten
   (by 1.). Afterwards, the new block is initialized in its place.

This has the nice property that we can reliably track sequence numbers in
low traffic situations (where they expire based on their timestamp) and
more quickly forget sequence numbers in high traffic situations before they
potentially wrap over and repeat before they are expired.
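
The two removal conditions can be illustrated with a small userspace model.
This is a sketch under stated assumptions, not the kernel implementation: a
plain pointer table stands in for the xarray, the block size of 128
sequence numbers and the ring size of 64 blocks are taken from the cover
letter discussion, timestamps are plain millisecond counters that must be
non-zero, and all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SEQ_BLOCK_BITS	128	/* sequence numbers per block */
#define SEQ_BLOCKS	64	/* blocks in the ring buffer */
#define SEQ_BLOCK_IDXS	(65536 / SEQ_BLOCK_BITS)
#define ENTRY_FORGET_MS	400

struct seq_block {
	uint64_t time;		/* creation time in ms, 0 = forgotten */
	uint16_t block_idx;	/* seq / SEQ_BLOCK_BITS */
	uint64_t seen[SEQ_BLOCK_BITS / 64];
};

struct seq_tracker {		/* must be zero-initialized before use */
	struct seq_block ring[SEQ_BLOCKS];
	/* Stand-in for the xarray: block index -> block in ring, or NULL. */
	struct seq_block *lookup[SEQ_BLOCK_IDXS];
	unsigned int next;	/* next ring slot to (re)use */
};

/* Returns true if seq was already seen (duplicate), false if it is new. */
static bool seq_check_and_set(struct seq_tracker *t, uint16_t seq, uint64_t now)
{
	uint16_t idx = seq / SEQ_BLOCK_BITS;
	unsigned int bit = seq % SEQ_BLOCK_BITS;
	uint64_t mask = 1ULL << (bit % 64);
	struct seq_block *b = t->lookup[idx];

	/* Condition 1: the block found is too old, so forget it. */
	if (b && now - b->time > ENTRY_FORGET_MS) {
		t->lookup[idx] = NULL;
		b->time = 0;
		b = NULL;
	}

	if (!b) {
		/* Condition 2: reuse the oldest ring slot for the new block. */
		b = &t->ring[t->next];
		t->next = (t->next + 1) % SEQ_BLOCKS;
		if (b->time)	/* not yet forgotten by condition 1 */
			t->lookup[b->block_idx] = NULL;
		memset(b, 0, sizeof(*b));
		b->time = now;
		b->block_idx = idx;
		t->lookup[idx] = b;
	}

	if (b->seen[bit / 64] & mask)
		return true;
	b->seen[bit / 64] |= mask;
	return false;
}
```

In this model, a sequence number is reported as a duplicate only while its
block is still present; once the block has aged out or its slot has been
reused, the same number is accepted again, which errs on the side of
accepting a duplicate.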

When nodes are merged, the blocks are merged as well. The timestamp of a
merged block is set to the minimum of the two timestamps to never keep
around a seen sequence number for too long. The bitmaps are or'd to mark
all seen sequence numbers as seen.
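
Merging two blocks that cover the same part of the sequence number space
can be sketched as follows. The struct layout and names are hypothetical,
and both blocks are assumed to be live (not forgotten):

```c
#include <stdint.h>

#define BLOCK_WORDS 2	/* 128 bits per block; an assumption of this sketch */

struct merge_block {
	uint64_t time;			/* creation timestamp */
	uint64_t seen[BLOCK_WORDS];	/* bitmap of seen sequence numbers */
};

/* Merge src into dst: keep the older timestamp and OR the bitmaps. */
static void merge_blocks(struct merge_block *dst, const struct merge_block *src)
{
	int i;

	if (src->time < dst->time)
		dst->time = src->time;
	for (i = 0; i < BLOCK_WORDS; i++)
		dst->seen[i] |= src->seen[i];
}
```

Taking the minimum timestamp means the merged block expires no later than
either of its sources would have.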

All of this still happens under seq_out_lock, to prevent concurrent
access to the blocks.

The KUnit test for the algorithm is updated as well. The updates are done
in a way to match the original intent pretty closely. Currently, there is
much knowledge about the actual algorithm baked into the tests (especially
the expectations) which may need some redesign in the future.

Reported-by: Steffen Lindner <steffen.lindner@de.abb.com>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Steffen Lindner <steffen.lindner@de.abb.com>
Link: https://patch.msgid.link/8ce15a996099df2df5b700969a39e7df400e8dbb.1770299429.git.fmaurer@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agoselftests: hsr: Add tests for faulty links
Felix Maurer [Thu, 5 Feb 2026 13:57:30 +0000 (14:57 +0100)]
selftests: hsr: Add tests for faulty links

Add a test case that can support different types of faulty links for all
protocol versions (HSRv0, HSRv1, PRPv1). It starts with a baseline with
fully functional links. The first faulty case is one link being cut during
the ping. This test uses a different function for ping that sends more
packets in shorter intervals to stress the duplicate detection algorithms a
bit more and allow for future tests with other link faults (packet loss,
reordering, etc.).

As the link fault tests now cover the cut link for HSR and PRP, it can be
removed from the hsr_ping test. Note that the removed cut link test did not
really test the fault because do_ping_long takes about 1sec while the link
is only cut after a 3sec sleep.

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/dad52276e2c349ecb96168bef7e3001bf7becc81.1770299429.git.fmaurer@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agoselftests: hsr: Check duplicates on HSR with VLAN
Felix Maurer [Thu, 5 Feb 2026 13:57:29 +0000 (14:57 +0100)]
selftests: hsr: Check duplicates on HSR with VLAN

Previously the hsr_ping test only checked that all nodes in a VLAN are
reachable (using do_ping). Update the test to also check that there is no
packet loss and no duplicate packets by running the same tests for VLANs as
without VLANs (including using do_ping_long). This also adds tests for IPv6
over VLAN. To unify the test code, the topology without VLANs now uses IP
addresses from dead:beef:0::/64 to align with the 100.64.0.0/24 range for
IPv4. Error messages are updated across the board to make it easier to find
what actually failed.

Also update the VLAN test to only run in VLAN 2, as there is no need to
check if ping really works with VLAN IDs 2, 3, 4, and 5. This lowers the
number of long ping tests on VLANs to keep the overall test runtime in
bounds.

It's still necessary to bump the test timeout a bit, though: a long ping
test takes 1sec, do_ping_tests performs 12 of them, do_link_problem_tests
6, and the VLAN tests again 12. With some buffer for setup and waiting and
for two protocol versions, 90sec timeout seems reasonable.

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/e3ded0e2547b5f720524b62fabeb96debc579697.1770299429.git.fmaurer@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agoselftests: hsr: Add ping test for PRP
Felix Maurer [Thu, 5 Feb 2026 13:57:28 +0000 (14:57 +0100)]
selftests: hsr: Add ping test for PRP

Add a selftest for PRP that performs a basic ping test on IPv4 and IPv6,
over the plain PRP interface and a VLAN interface, similar to the existing
ping test for HSR. The test first checks reachability of the other node,
then checks for no loss and no duplicates.

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/4a342189e842d7308d037da72af566729ee75834.1770299429.git.fmaurer@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: atm: fix crash due to unvalidated vcc pointer in sigd_send()
Jiayuan Chen [Thu, 5 Feb 2026 09:54:51 +0000 (17:54 +0800)]
net: atm: fix crash due to unvalidated vcc pointer in sigd_send()

Reproducer available at [1].

The ATM send path (sendmsg -> vcc_sendmsg -> sigd_send) reads the vcc
pointer from msg->vcc and uses it directly without any validation. This
pointer comes from userspace via sendmsg() and can be arbitrarily forged:

    int fd = socket(AF_ATMSVC, SOCK_DGRAM, 0);
    ioctl(fd, ATMSIGD_CTRL);  // become ATM signaling daemon
    struct msghdr msg = { .msg_iov = &iov, ... };
    *(unsigned long *)(buf + 4) = 0xdeadbeef;  // fake vcc pointer
    sendmsg(fd, &msg, 0);  // kernel dereferences 0xdeadbeef

In normal operation, the kernel sends the vcc pointer to the signaling
daemon via sigd_enq() when processing operations like connect(), bind(),
or listen(). The daemon is expected to return the same pointer when
responding. However, a malicious daemon can send arbitrary pointer values.

Fix this by introducing find_get_vcc() which validates the pointer by
searching through vcc_hash (similar to how sigd_close() iterates over
all VCCs), and acquires a reference via sock_hold() if found.

Since struct atm_vcc embeds struct sock as its first member, they share
the same lifetime. Therefore using sock_hold/sock_put is sufficient to
keep the vcc alive while it is being used.

Note that there may be a race with sigd_close() which could mark the vcc
with various flags (e.g., ATM_VF_RELEASED) after find_get_vcc() returns.
However, sock_hold() guarantees the memory remains valid, so this race
only affects the logical state, not memory safety.
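
The validation idea can be modeled in userspace as a lookup that compares
the untrusted value against the addresses of known objects and takes a
reference on a match. The table, the struct and the function name below are
hypothetical stand-ins for vcc_hash, struct atm_vcc and find_get_vcc(), and
any locking the real hash table requires is omitted.

```c
#include <stddef.h>

#define VCC_TABLE_SIZE 32	/* arbitrary size for this sketch */

struct vcc_model {
	int refcnt;
};

/* Stand-in for the table of all live objects. */
static struct vcc_model *vcc_table[VCC_TABLE_SIZE];

/*
 * Return the registered object whose address equals the untrusted value,
 * with its reference count raised, or NULL if there is no such object.
 * The untrusted value is only compared, never dereferenced.
 */
static struct vcc_model *find_get_vcc_model(const void *untrusted)
{
	size_t i;

	for (i = 0; i < VCC_TABLE_SIZE; i++) {
		if (vcc_table[i] && (const void *)vcc_table[i] == untrusted) {
			vcc_table[i]->refcnt++;
			return vcc_table[i];
		}
	}
	return NULL;
}
```

A caller would drop the reference again once it is done with the object,
mirroring the sock_hold()/sock_put() pairing described above.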

[1]: https://gist.github.com/mrpre/1ba5949c45529c511152e2f4c755b0f3
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot+1f22cb1769f249df9fa0@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69039850.a70a0220.5b2ed.005d.GAE@google.com/T/
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Link: https://patch.msgid.link/20260205095501.131890-1-jiayuan.chen@linux.dev
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agoMerge branch 'net-fec-improve-xdp-copy-mode-and-add-af_xdp-zero-copy-support'
Paolo Abeni [Tue, 10 Feb 2026 09:58:22 +0000 (10:58 +0100)]
Merge branch 'net-fec-improve-xdp-copy-mode-and-add-af_xdp-zero-copy-support'

Wei Fang says:

====================
net: fec: improve XDP copy mode and add AF_XDP zero-copy support

This patch set optimizes the XDP copy mode logic as follows.

1. Separate the processing of RX XDP frames from fec_enet_rx_queue(),
and add a separate function fec_enet_rx_queue_xdp() for handling XDP
frames.

2. For TX XDP packets, use the batch sending method to avoid frequent
MMIO writes.

3. Use a switch statement to check the tx_buf type instead of
if...else... statements, making the cleanup logic of the TX BD ring
clearer and more efficient.

We compared the performance of XDP copy mode before and after applying
this patch set, and the results show that the performance has improved.

Before applying this patch set.
root@imx93evk:~# ./xdp-bench tx eth0
Summary                   396,868 rx/s                  0 err,drop/s
Summary                   396,024 rx/s                  0 err,drop/s

root@imx93evk:~# ./xdp-bench drop eth0
Summary                   684,781 rx/s                  0 err/s
Summary                   675,746 rx/s                  0 err/s

root@imx93evk:~# ./xdp-bench pass eth0
Summary                   208,552 rx/s                  0 err,drop/s
Summary                   208,654 rx/s                  0 err,drop/s

root@imx93evk:~# ./xdp-bench redirect eth0 eth0
eth0->eth0                311,210 rx/s                  0 err,drop/s      311,208 xmit/s
eth0->eth0                310,808 rx/s                  0 err,drop/s      310,809 xmit/s

After applying this patch set.
root@imx93evk:~# ./xdp-bench tx eth0
Summary                   425,778 rx/s                  0 err,drop/s
Summary                   426,042 rx/s                  0 err,drop/s

root@imx93evk:~# ./xdp-bench drop eth0
Summary                   698,351 rx/s                  0 err/s
Summary                   701,882 rx/s                  0 err/s

root@imx93evk:~# ./xdp-bench pass eth0
Summary                   210,348 rx/s                  0 err,drop/s
Summary                   210,016 rx/s                  0 err,drop/s

root@imx93evk:~# ./xdp-bench redirect eth0 eth0
eth0->eth0                354,407 rx/s                  0 err,drop/s      354,401 xmit/s
eth0->eth0                350,381 rx/s                  0 err,drop/s      350,389 xmit/s

This patch set also adds AF_XDP zero-copy support, and we tested
the performance on the i.MX93 platform with the xdpsock tool. The following is
the performance comparison of copy mode and zero-copy mode. It can be
seen that the performance of zero-copy mode is better than that of copy
mode.

1. MAC swap L2 forwarding
1.1 Zero-copy mode
root@imx93evk:~# ./xdpsock -i eth0 -l -z
 sock0@eth0:0 l2fwd xdp-drv
                   pps            pkts           1.00
rx                 414715         415455
tx                 414715         415455

1.2 Copy mode
root@imx93evk:~# ./xdpsock -i eth0 -l -c
 sock0@eth0:0 l2fwd xdp-drv
                   pps            pkts           1.00
rx                 356396         356609
tx                 356396         356609

2. TX only
2.1 Zero-copy mode
root@imx93evk:~# ./xdpsock -i eth0 -t -s 64 -z
 sock0@eth0:0 txonly xdp-drv
                   pps            pkts           1.00
rx                 0              0
tx                 1119573        1126720

2.2 Copy mode
root@imx93evk:~# ./xdpsock -i eth0 -t -s 64 -c
sock0@eth0:0 txonly xdp-drv
                   pps            pkts           1.00
rx                 0              0
tx                 406864         407616
====================

Link: https://patch.msgid.link/20260205085742.2685134-1-wei.fang@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: fec: add AF_XDP zero-copy support
Wei Fang [Thu, 5 Feb 2026 08:57:42 +0000 (16:57 +0800)]
net: fec: add AF_XDP zero-copy support

This patch adds AF_XDP zero-copy support for both TX and RX on the FEC
driver. It introduces new functions for XSK buffer allocation, RX/TX
queue processing in zero-copy mode, and XSK pool setup/teardown.

For RX, fec_alloc_rxq_buffers_zc() is added to allocate RX buffers from
XSK pool. And fec_enet_rx_queue_xsk() is used to process the frames from
the RX queue which is bound to the AF_XDP socket. Similar to the copy
mode, the zero-copy mode also supports XDP_TX, XDP_PASS, XDP_DROP and
XDP_REDIRECT actions. In addition, fec_enet_xsk_tx_xmit() is similar to
fec_enet_xdp_tx_xmit() and is used to handle XDP_TX action in zero-copy
mode.

For TX, there are two cases, one is the frames from the AF_XDP socket,
so fec_enet_xsk_xmit() is added to directly transmit the frames from
the socket and the buffer type is marked as FEC_TXBUF_T_XSK_XMIT. The
other one is the frames from the RX queue (XDP_TX action), the buffer
type is marked as FEC_TXBUF_T_XSK_TX. Therefore, fec_enet_tx_queue()
can correctly clean the TX queue based on the buffer type.

Also, some tests have been done on the i.MX93-EVK board with the xdpsock
tool; the results are as follows.

Env: i.MX93 connects to a packet generator, the link speed is 1Gbps, and
flow-control is off. The RX packet size is 64 bytes including FCS. Only
one RX queue (CPU) is used to receive frames.

1. MAC swap L2 forwarding
1.1 Zero-copy mode
root@imx93evk:~# ./xdpsock -i eth0 -l -z
 sock0@eth0:0 l2fwd xdp-drv
                   pps            pkts           1.00
rx                 414715         415455
tx                 414715         415455

1.2 Copy mode
root@imx93evk:~# ./xdpsock -i eth0 -l -c
 sock0@eth0:0 l2fwd xdp-drv
                   pps            pkts           1.00
rx                 356396         356609
tx                 356396         356609

2. TX only
2.1 Zero-copy mode
root@imx93evk:~# ./xdpsock -i eth0 -t -s 64 -z
 sock0@eth0:0 txonly xdp-drv
                   pps            pkts           1.00
rx                 0              0
tx                 1119573        1126720

2.2 Copy mode
root@imx93evk:~# ./xdpsock -i eth0 -t -s 64 -c
sock0@eth0:0 txonly xdp-drv
                   pps            pkts           1.00
rx                 0              0
tx                 406864         407616

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260205085742.2685134-16-wei.fang@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: fec: improve fec_enet_tx_queue()
Wei Fang [Thu, 5 Feb 2026 08:57:41 +0000 (16:57 +0800)]
net: fec: improve fec_enet_tx_queue()

To support AF_XDP zero-copy mode in the subsequent patch, the following
adjustments have been made to fec_enet_tx_queue().

1. Change the parameters of fec_enet_tx_queue().
2. Some variables are initialized at the time of declaration, and the
order of local variables is updated to follow the reverse xmas tree
style.
3. Remove the variable xdpf and add the variable tx_buf.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260205085742.2685134-15-wei.fang@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: fec: add fec_alloc_rxq_buffers_pp() to allocate buffers from page pool
Wei Fang [Thu, 5 Feb 2026 08:57:40 +0000 (16:57 +0800)]
net: fec: add fec_alloc_rxq_buffers_pp() to allocate buffers from page pool

Currently, the buffers of RX queue are allocated from the page pool. In
the subsequent patches to support XDP zero copy, the RX buffers will be
allocated from the UMEM. Therefore, extract fec_alloc_rxq_buffers_pp()
from fec_enet_alloc_rxq_buffers() and we will add another helper to
allocate RX buffers from UMEM for the XDP zero copy mode. In addition,
fec_alloc_rxq_buffers_pp() only initializes bdp->bufaddr and does not
initialize other fields of bdp, because these will be initialized in
fec_enet_bd_init().

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260205085742.2685134-14-wei.fang@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: fec: move xdp_rxq_info* APIs out of fec_enet_create_page_pool()
Wei Fang [Thu, 5 Feb 2026 08:57:39 +0000 (16:57 +0800)]
net: fec: move xdp_rxq_info* APIs out of fec_enet_create_page_pool()

Extract fec_xdp_rxq_info_reg() from fec_enet_create_page_pool() and move
it out of fec_enet_create_page_pool(), so that it can be reused in the
subsequent patches to support XDP zero copy mode.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260205085742.2685134-13-wei.fang@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: fec: remove the size parameter from fec_enet_create_page_pool()
Wei Fang [Thu, 5 Feb 2026 08:57:38 +0000 (16:57 +0800)]
net: fec: remove the size parameter from fec_enet_create_page_pool()

Remove the size parameter from fec_enet_create_page_pool(), since
rxq->bd.ring_size already contains this information.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260205085742.2685134-12-wei.fang@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: fec: use switch statement to check the type of tx_buf
Wei Fang [Thu, 5 Feb 2026 08:57:37 +0000 (16:57 +0800)]
net: fec: use switch statement to check the type of tx_buf

The tx_buf has three types: FEC_TXBUF_T_SKB, FEC_TXBUF_T_XDP_NDO and
FEC_TXBUF_T_XDP_TX. Currently, the driver uses 'if...else...' statements
to check the type and perform the corresponding processing. This is very
detrimental to future expansion. To support AF_XDP zero-copy mode, two
new types will be added in the future, continuing to use 'if...else...'
would be a very bad coding style. So the 'if...else...' statements in
the current driver are replaced with switch statements.
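
The dispatch pattern can be sketched as below. The enumerator names are
shortened stand-ins for the three types named above, and the counters are
placeholders for the real per-type cleanup, which this sketch does not try
to reproduce.

```c
/* Shortened stand-ins for the three existing buffer types. */
enum txbuf_type {
	TXBUF_T_SKB,
	TXBUF_T_XDP_NDO,
	TXBUF_T_XDP_TX,
};

/* Counters standing in for the per-type cleanup work. */
struct cleanup_stats {
	unsigned int skb;
	unsigned int xdp_ndo;
	unsigned int xdp_tx;
	unsigned int unknown;
};

/* Dispatch on the buffer type; a new type only needs one more case. */
static void tx_buf_cleanup(enum txbuf_type type, struct cleanup_stats *st)
{
	switch (type) {
	case TXBUF_T_SKB:
		st->skb++;	/* placeholder: free the skb */
		break;
	case TXBUF_T_XDP_NDO:
		st->xdp_ndo++;	/* placeholder: release the XDP frame */
		break;
	case TXBUF_T_XDP_TX:
		st->xdp_tx++;	/* placeholder: release the RX page */
		break;
	default:
		st->unknown++;	/* unexpected type */
		break;
	}
}
```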

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260205085742.2685134-11-wei.fang@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: fec: remove unnecessary NULL pointer check when clearing TX BD ring
Wei Fang [Thu, 5 Feb 2026 08:57:36 +0000 (16:57 +0800)]
net: fec: remove unnecessary NULL pointer check when clearing TX BD ring

The tx_buf pointer will not be NULL when its type is FEC_TXBUF_T_XDP_NDO or
FEC_TXBUF_T_XDP_TX. If the type is FEC_TXBUF_T_SKB, dev_kfree_skb_any()
performs the NULL pointer check itself. So the NULL pointer checks in
fec_enet_bd_init() and fec_enet_tx_queue() are unnecessary.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260205085742.2685134-10-wei.fang@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: fec: transmit XDP frames in bulk
Wei Fang [Thu, 5 Feb 2026 08:57:35 +0000 (16:57 +0800)]
net: fec: transmit XDP frames in bulk

Currently, the driver writes the ENET_TDAR register for every XDP frame
to trigger transmit start. Frequent MMIO writes consume more CPU cycles
and may reduce XDP TX performance, so transmit XDP frames in bulk.
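
The effect on the number of register writes can be shown with a toy model
in which a counter stands in for writes to ENET_TDAR. The ring structure
and function names are hypothetical and no real MMIO is performed.

```c
/* Toy TX ring: counts prepared descriptors and simulated doorbell writes. */
struct tx_ring_model {
	unsigned int queued;
	unsigned int doorbell_writes;
};

static void ring_queue_frame(struct tx_ring_model *r)
{
	r->queued++;		/* stands in for filling one descriptor */
}

static void ring_kick(struct tx_ring_model *r)
{
	r->doorbell_writes++;	/* stands in for one ENET_TDAR write */
}

/* One doorbell write per frame. */
static void xmit_per_frame(struct tx_ring_model *r, unsigned int n)
{
	unsigned int i;

	for (i = 0; i < n; i++) {
		ring_queue_frame(r);
		ring_kick(r);
	}
}

/* Queue all frames first, then ring the doorbell once. */
static void xmit_bulk(struct tx_ring_model *r, unsigned int n)
{
	unsigned int i;

	for (i = 0; i < n; i++)
		ring_queue_frame(r);
	if (n)
		ring_kick(r);
}
```

For a batch of 16 frames the per-frame variant performs 16 simulated
doorbell writes, while the bulk variant performs one.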

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260205085742.2685134-9-wei.fang@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: fec: add tx_qid parameter to fec_enet_xdp_tx_xmit()
Wei Fang [Thu, 5 Feb 2026 08:57:34 +0000 (16:57 +0800)]
net: fec: add tx_qid parameter to fec_enet_xdp_tx_xmit()

Remove fec_enet_xdp_get_tx_queue() from fec_enet_xdp_tx_xmit() and add
the tx_qid parameter to it. Then, calculate the TX queue ID for XDP_TX
frames in fec_enet_rx_queue_xdp(). This way, the TX queue ID only needs
to be calculated once for XDP_TX frames during each NAPI polling. And
since the number of RX queues and TX queues in FEC is generally equal,
the RX queue ID can be directly used as the TX queue ID. In exceptional
cases, fec_enet_xdp_get_tx_queue() is used to calculate the TX queue ID.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260205085742.2685134-8-wei.fang@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: fec: add fec_enet_rx_queue_xdp() for XDP path
Wei Fang [Thu, 5 Feb 2026 08:57:33 +0000 (16:57 +0800)]
net: fec: add fec_enet_rx_queue_xdp() for XDP path

Currently, the processing of XDP path packets and protocol stack packets
is mixed in fec_enet_rx_queue(), which makes the logic somewhat
confusing and debugging more difficult. Furthermore, each path carries
logic that the other does not need. For example, the kernel path does not
need to call xdp_init_buff(), and the XDP path does not support
swap_buffer() (because fec_enet_bpf() returns "-EOPNOTSUPP" for those
platforms which need swap_buffer()), and so on. This prevents XDP from
achieving its maximum performance. Therefore, XDP path packet processing
has been separated from fec_enet_rx_queue() by adding the
fec_enet_rx_queue_xdp() function to optimize XDP path logic and improve
XDP performance.

The XDP performance on the i.MX93 platform was compared before and after
applying this patch. Detailed results are as follows and we can see the
performance has been improved.

Env: i.MX93, packet size 64 bytes including FCS, only single core and RX
BD ring are used to receive packets, flow-control is off.

Before the patch is applied:
xdp-bench tx eth0
Summary                   396,868 rx/s                  0 err,drop/s
Summary                   396,024 rx/s                  0 err,drop/s

xdp-bench drop eth0
Summary                   684,781 rx/s                  0 err/s
Summary                   675,746 rx/s                  0 err/s

xdp-bench pass eth0
Summary                   208,552 rx/s                  0 err,drop/s
Summary                   208,654 rx/s                  0 err,drop/s

xdp-bench redirect eth0 eth0
eth0->eth0                311,210 rx/s                  0 err,drop/s      311,208 xmit/s
eth0->eth0                310,808 rx/s                  0 err,drop/s      310,809 xmit/s

After the patch is applied:
xdp-bench tx eth0
Summary                   409,975 rx/s                  0 err,drop/s
Summary                   411,073 rx/s                  0 err,drop/s

xdp-bench drop eth0
Summary                   700,681 rx/s                  0 err/s
Summary                   698,102 rx/s                  0 err/s

xdp-bench pass eth0
Summary                   211,356 rx/s                  0 err,drop/s
Summary                   210,629 rx/s                  0 err,drop/s

xdp-bench redirect eth0 eth0
eth0->eth0                320,351 rx/s                  0 err,drop/s      320,348 xmit/s
eth0->eth0                318,988 rx/s                  0 err,drop/s      318,988 xmit/s
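
As a rough cross-check of the numbers above, the relative gain per test
can be computed by averaging the two reported samples. This is only a
sketch: the dictionary layout and the function name are illustrative and
not part of the patch.

```python
# Samples (rx/s) copied from the xdp-bench output above.
before = {
    "tx": (396_868, 396_024),
    "drop": (684_781, 675_746),
    "pass": (208_552, 208_654),
    "redirect": (311_210, 310_808),
}
after = {
    "tx": (409_975, 411_073),
    "drop": (700_681, 698_102),
    "pass": (211_356, 210_629),
    "redirect": (320_351, 318_988),
}


def gain_percent(old, new):
    """Relative change of the mean rate, in percent."""
    old_mean = sum(old) / len(old)
    new_mean = sum(new) / len(new)
    return 100.0 * (new_mean - old_mean) / old_mean


gains = {test: gain_percent(before[test], after[test]) for test in before}
for test, pct in gains.items():
    print(f"{test:9s} {pct:+.1f}%")
```

With these samples the gain works out to roughly 3.6% for tx, 2.8% for
drop, 1.1% for pass and 2.8% for redirect.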

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260205085742.2685134-7-wei.fang@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks ago net: fec: improve fec_enet_rx_queue()
Wei Fang [Thu, 5 Feb 2026 08:57:32 +0000 (16:57 +0800)]
net: fec: improve fec_enet_rx_queue()

This patch makes the following adjustments to fec_enet_rx_queue():

1. The function parameters are modified to maintain the same style as
the XDP-related interfaces added in subsequent patches.

2. Some variables are initialized at the time of declaration, and the
order of local variables is updated to follow the reverse xmas tree
style.

3. The variable cbd_bufaddr is replaced with dma.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260205085742.2685134-6-wei.fang@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks ago net: fec: add fec_build_skb() to build a skb
Wei Fang [Thu, 5 Feb 2026 08:57:31 +0000 (16:57 +0800)]
net: fec: add fec_build_skb() to build a skb

Extract the helper fec_build_skb() from fec_enet_rx_queue(), so that the
code for building a skb is centralized in fec_build_skb(), which makes
the code of fec_enet_rx_queue() more concise and readable.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260205085742.2685134-5-wei.fang@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks ago net: fec: add rx_shift to indicate the extra bytes padded in front of RX frame
Wei Fang [Thu, 5 Feb 2026 08:57:30 +0000 (16:57 +0800)]
net: fec: add rx_shift to indicate the extra bytes padded in front of RX frame

The FEC of some platforms supports RX FIFO shift-16, which means the
actual frame data starts at bit 16 of the first word read from the RX
FIFO, aligning the Ethernet payload on a 32-bit boundary. The MAC writes
two additional bytes in front of each frame received into the RX FIFO.
Currently, fec_enet_rx_queue() updates data_start, sub_len and the
rx_bytes statistics by checking whether FEC_QUIRK_HAS_RACC is set. This
makes the code less concise, so rx_shift is added to represent the number
of extra bytes padded in front of the RX frame. Furthermore, when separate
RX handling functions are added for XDP copy mode and zero copy mode in
the future, it will no longer be necessary to check FEC_QUIRK_HAS_RACC to
update the corresponding variables.
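
The effect of rx_shift on the frame bounds can be modelled as below. This
is a simplified sketch, not the driver code: rx_frame_bounds() and the
headroom parameter are hypothetical, and it assumes the length reported by
the hardware includes both the 4-byte FCS and the padding bytes.

```python
ETH_FCS_LEN = 4  # frame check sequence appended to every Ethernet frame


def rx_frame_bounds(pkt_len, rx_shift, headroom=0):
    """Return (data_start, data_len) for a received buffer.

    pkt_len  -- length reported by the hardware (assumed to include the
                FCS and the rx_shift padding bytes)
    rx_shift -- extra bytes the MAC pads in front of the frame:
                2 with shift-16 enabled, 0 otherwise
    """
    data_start = headroom + rx_shift   # skip the padding bytes
    sub_len = ETH_FCS_LEN + rx_shift   # bytes that are not frame data
    return data_start, pkt_len - sub_len


# A 60-byte frame plus FCS, with and without shift-16 padding.
print(rx_frame_bounds(66, rx_shift=2))
print(rx_frame_bounds(64, rx_shift=0))
```

Both calls describe the same 60 bytes of frame data. Only the start offset
differs, so the receive path no longer needs a quirk check of its own.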

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260205085742.2685134-4-wei.fang@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks ago net: fec: add fec_rx_error_check() to check RX errors
Wei Fang [Thu, 5 Feb 2026 08:57:29 +0000 (16:57 +0800)]
net: fec: add fec_rx_error_check() to check RX errors

Extract the helper fec_rx_error_check() from fec_enet_rx_queue() to check
RX errors. It will also be used in the XDP and XDP zero copy paths in
subsequent patches.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260205085742.2685134-3-wei.fang@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks ago net: fec: add fec_txq_trigger_xmit() helper
Wei Fang [Thu, 5 Feb 2026 08:57:28 +0000 (16:57 +0800)]
net: fec: add fec_txq_trigger_xmit() helper

Currently, the workaround for FEC_QUIRK_ERR007885 has three call sites,
so add the helper fec_txq_trigger_xmit() to make the code more concise
and reusable.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260205085742.2685134-2-wei.fang@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks ago net: arcnet: com20020-pci: use module_pci_driver
Ethan Nelson-Moore [Thu, 5 Feb 2026 07:06:31 +0000 (23:06 -0800)]
net: arcnet: com20020-pci: use module_pci_driver

The only thing this driver's init/exit functions do is call
pci_register/unregister_driver, and in the case of the init function,
print an unnecessary message. Replace them with module_pci_driver to
simplify the code.

Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260205070632.37516-1-enelsonmoore@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks ago Merge branch 'net-dsa-mxl-gsw1xx-setup-polarities-and-validate-chip'
Paolo Abeni [Tue, 10 Feb 2026 08:09:31 +0000 (09:09 +0100)]
Merge branch 'net-dsa-mxl-gsw1xx-setup-polarities-and-validate-chip'

Daniel Golle says:

====================
net: dsa: mxl-gsw1xx: setup polarities and validate chip

Now that common PHY properties make it easy to configure the SerDes RX
and TX polarities, use that for the SGMII/1000Base-X/2500Base-X port of
the MaxLinear GSW1xx switches.

Also, validate hardware in probe() function to make sure the switch is
actually present and MDIO communication works properly.
====================

Link: https://patch.msgid.link/cover.1769916962.git.daniel@makrotopia.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks ago net: dsa: mxl-gsw1xx: validate chip ID
Daniel Golle [Sun, 1 Feb 2026 03:42:18 +0000 (03:42 +0000)]
net: dsa: mxl-gsw1xx: validate chip ID

The probe function of the mxl-gsw1xx switch driver does not check whether
the hardware is actually present. So even if the switch isn't present at
the configured MDIO bus address, the driver wrongly tells the user that a
"GSWIP version 0 mod 0" was found and outputs errors about PHY
capabilities not matching.

Read and validate the chip MANU_ID and PNUM_ID registers and output
information while probing, but return an error and abort probing in case
the hardware is not actually present.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/3194d3d3bb0b51f08755d392e1fdf7bb6dc49608.1769916962.git.daniel@makrotopia.org
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks ago net: dsa: mxl-gsw1xx: configure SerDes port polarities
Daniel Golle [Sun, 1 Feb 2026 03:42:00 +0000 (03:42 +0000)]
net: dsa: mxl-gsw1xx: configure SerDes port polarities

Configure SerDes (port 4) RX and TX polarities using the newly
introduced generic properties. The polarities are described at the port
level which equals the polarities of the external pins of the chip.

Note that the RX lane is inverted internally and the vendor driver
unconditionally sets bit GSW1XX_SGMII_PHY_RX0_CFG2_INVERT to end up with
the correct (i.e. as documented in datasheets) polarity at the external
pins.

In this sense, PHY_POLARITY_NORMAL denotes normal polarity for pins as
documented for the MRQFN 105-pin package (GSW120, GSW125, GSW140, GSW141
and GSW145 all use the same package and have identical pin layouts
except for TP port 2 and 3 being N/C on GSW12x):
pin B18 (TX0_P) positive signal of the differential SGMII data output pair
pin B19 (TX0_M) negative signal of the differential SGMII data output pair
pin B20 (RX0_P) positive signal of the differential SGMII data input pair
pin B21 (RX0_M) negative signal of the differential SGMII data input pair

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/8bf79b3476e23673fceffbe2bc9d6abc13d132e5.1769916962.git.daniel@makrotopia.org
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks ago dt-bindings: net: dsa: lantiq,gswip: reference common PHY properties
Daniel Golle [Sun, 1 Feb 2026 03:41:53 +0000 (03:41 +0000)]
dt-bindings: net: dsa: lantiq,gswip: reference common PHY properties

Reference the common PHY properties so RX and TX SerDes lane polarity
of the SGMII/1000Base-X/2500Base-X port can be configured.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/f556ef8be75e37a2f864b9d905a78962bbe76d18.1769916962.git.daniel@makrotopia.org
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks ago Merge branch 'net-stats-tools-driver-tests-for-hw-gro'
Jakub Kicinski [Tue, 10 Feb 2026 05:08:39 +0000 (21:08 -0800)]
Merge branch 'net-stats-tools-driver-tests-for-hw-gro'

Jakub Kicinski says:

====================
net: stats, tools, driver tests for HW GRO [part]

Add miscellaneous pieces related to production use of HW-GRO:
 - report standard stats from drivers (bnxt included here,
   Gal recently posted patches for mlx5 which is great)
 - CLI tool for calculating HW GRO savings / effectiveness
====================

Link: https://patch.msgid.link/20260207003509.3927744-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago tools: ynltool: add qstats analysis for HW-GRO efficiency / savings
Jakub Kicinski [Sat, 7 Feb 2026 00:35:03 +0000 (16:35 -0800)]
tools: ynltool: add qstats analysis for HW-GRO efficiency / savings

Extend ynltool to compute a HW GRO savings metric - how many
packets HW GRO has been able to save the kernel from seeing.

Note that this definition does not actually take into account
whether the segments were or weren't eligible for HW GRO.
If a machine is receiving all-UDP traffic, the new metric will show
HW-GRO savings of 0%. Conversely, since the super-packet still
counts as a received packet, savings of 100% are not achievable.
Perfect HW-GRO on a machine with 4k MTU and 64kB super-frames
would show ~93.75% savings. With 1.5k MTU we may see up to
~97.8% savings (if my math is right).
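
The upper bounds quoted above follow from the fact that one packet per
super-frame still reaches the kernel. A minimal sketch of that arithmetic
follows; the function name is made up, and the figure of 45 segments per
64kB super-frame at 1.5k MTU is an assumption based on a 1448-byte MSS.

```python
def max_savings_percent(segments_per_super_frame):
    """Best-case savings when every super-frame is completely full.

    Of N wire segments merged into one super-frame, the kernel still
    sees 1 packet, so at most (N - 1) / N of the packets are saved.
    """
    n = segments_per_super_frame
    return 100.0 * (n - 1) / n


# 64kB super-frames with 4k MTU: 64 / 4 = 16 segments each.
print(max_savings_percent(16))

# 1.5k MTU: assuming 65536 // 1448 = 45 segments per super-frame.
print(round(max_savings_percent(65536 // 1448), 1))
```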

Example after 10 sec of iperf on a freshly booted machine
with 1.5k MTU:

  $ ynltool qstats show
  eth0     rx-packets:  40681280               rx-bytes:   61575208437
        rx-alloc-fail:         0      rx-hw-gro-packets:       1225133
                                 rx-hw-gro-wire-packets:      40656633
  $ ynltool qstats hw-gro
  eth0: 96.9% savings
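
One formula that reproduces the 96.9% figure from the counters above is
sketched below. It assumes that rx-packets counts wire packets, and the
function name is made up for illustration. The actual ynltool computation
may differ.

```python
def hw_gro_savings(rx_packets, gro_packets, gro_wire_packets):
    """Percentage of wire packets the stack did not have to process."""
    # Each super-packet still reaches the kernel, so only the difference
    # between wire packets and super-packets counts as saved.
    saved = gro_wire_packets - gro_packets
    return 100.0 * saved / rx_packets


# Counters taken from the 'ynltool qstats show' output above.
savings = hw_gro_savings(
    rx_packets=40681280,
    gro_packets=1225133,
    gro_wire_packets=40656633,
)
print(f"eth0: {savings:.1f}% savings")
```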

None of the NICs I have access to can report "missed" HW-GRO
opportunities, so computing a true "effectiveness" metric
is not possible. One could also argue that an effectiveness metric
is inferior in environments where we control both senders and
receivers: the savings metric will capture both regressions
in the receiver's HW GRO effectiveness and regressions in senders
sending smaller TSO trains. And we care about both. The main
downside is that it's hard to tell at a glance how well the NIC
is doing because the savings will be dependent on traffic patterns.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20260207003509.3927744-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago tools: ynltool: factor out qstat dumping
Jakub Kicinski [Sat, 7 Feb 2026 00:35:02 +0000 (16:35 -0800)]
tools: ynltool: factor out qstat dumping

The logic to open a socket and dump the queues is the same
across sub-commands. Factor it out, we'll need it again.

No functional changes intended.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20260207003509.3927744-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago eth: bnxt: gather and report HW-GRO stats
Jakub Kicinski [Sat, 7 Feb 2026 00:35:01 +0000 (16:35 -0800)]
eth: bnxt: gather and report HW-GRO stats

Count and report HW-GRO stats as seen by the kernel.
The device stats for GRO seem to not reflect the reality,
perhaps they count sessions which did not actually result
in any aggregation. Also they count wire packets, so we
have to count super-frames, anyway.

Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20260207003509.3927744-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago nfc: nxp-nci: remove interrupt trigger type
Carl Lee [Thu, 5 Feb 2026 11:11:39 +0000 (19:11 +0800)]
nfc: nxp-nci: remove interrupt trigger type

For NXP NCI devices (e.g. PN7150), the interrupt is level-triggered and
active high, not edge-triggered.

Using IRQF_TRIGGER_RISING in the driver can cause interrupts to fail
to trigger correctly.

Remove IRQF_TRIGGER_RISING and rely on the IRQ trigger type configured
via Device Tree.

Signed-off-by: Carl Lee <carl.lee@amd.com>
Link: https://patch.msgid.link/20260205-fc-nxp-nci-remove-interrupt-trigger-type-v2-1-79d2ed4a7e42@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago net: hns3: fix double free issue for tx spare buffer
Jian Shen [Thu, 5 Feb 2026 12:17:19 +0000 (20:17 +0800)]
net: hns3: fix double free issue for tx spare buffer

In hns3_set_ringparam(), a temporary copy (tmp_rings) of the ring structure
is created for rollback. However, the tx_spare pointer in the original
ring handle is incorrectly left pointing to the old backup memory.

Later, if memory allocation fails in hns3_init_all_ring() during the setup,
the error path attempts to free all newly allocated rings. Since tx_spare
contains a stale (non-NULL) pointer from the backup, it is mistaken for
a newly allocated buffer and is erroneously freed, leading to a double-free
of the backup memory.

The root cause is that the tx_spare field was not cleared after its value
was saved in tmp_rings, leaving a dangling pointer.

Fix this by setting tx_spare to NULL in the original ring structure
when the creation of the new `tx_spare` fails. This ensures the
error cleanup path only frees genuinely newly allocated buffers.

Fixes: 907676b130711 ("net: hns3: use tx bounce buffer for small packets")
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260205121719.3285730-1-shaojijie@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago Merge branch 'big-tcp-without-hbh-in-ipv6'
Jakub Kicinski [Sat, 7 Feb 2026 04:48:35 +0000 (20:48 -0800)]
Merge branch 'big-tcp-without-hbh-in-ipv6'

Alice Mikityanska says:

====================
BIG TCP without HBH in IPv6

Resubmitting after the grace period.

This series is part 1 of "BIG TCP for UDP tunnels". Due to the number of
patches, I'm splitting it into two logical parts:

* Remove hop-by-hop header for BIG TCP IPv6 to align with BIG TCP IPv4.
* Fix up things that prevent BIG TCP from working with UDP tunnels.

The current BIG TCP IPv6 code inserts a hop-by-hop extension header with
32-bit length of the packet. When the packet is encapsulated, and either
the outer or the inner protocol is IPv6, or both are IPv6, there will be
1 or 2 HBH headers that need to be dealt with. The issues that arise:

1. The drivers don't strip it, and they'd all need to know the structure
of each tunnel protocol in order to strip it correctly, also taking into
account all combinations of IPv4/IPv6 inner/outer protocols.

2. Even if (1) is implemented, it would be an additional performance
penalty per aggregated packet.

3. The skb_gso_validate_network_len check is skipped in
ip6_finish_output_gso when IP6SKB_FAKEJUMBO is set, but it seems that it
would make sense to do the actual validation, just taking into account
the length of the HBH header. When the support for tunnels is added, it
becomes trickier, because there may be one or two HBH headers, depending
on whether it's IPv6 in IPv6 or not.

At the same time, having an HBH header to store the 32-bit length is not
strictly necessary, as BIG TCP IPv4 doesn't do anything like this and
just restores the length from skb->len. The same thing can be done for
BIG TCP IPv6. Removing HBH from BIG TCP would significantly simplify the
implementation and align it with BIG TCP IPv4, which has been a
long-standing goal.
====================
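
The idea of restoring the length from skb->len can be modelled roughly as
follows. This is a hypothetical sketch, not the kernel implementation: the
function names are invented, and it assumes the 16-bit payload length
field is left at 0 whenever the real length does not fit.

```python
IPV6_HDR_LEN = 40   # fixed IPv6 header size in bytes
MAX_U16 = 0xFFFF    # largest value the 16-bit length field can hold


def encode_payload_len(payload_len):
    """Value stored in the 16-bit field; 0 marks an oversized packet."""
    return payload_len if payload_len <= MAX_U16 else 0


def decode_payload_len(field, skb_len, network_offset):
    """Recover the payload length without a hop-by-hop header.

    field          -- value read from the 16-bit payload length field
    skb_len        -- total buffer length (stands in for skb->len)
    network_offset -- offset of the IPv6 header inside the buffer
    """
    if field != 0:
        return field
    # Oversized packet: everything after the fixed header is payload.
    return skb_len - network_offset - IPV6_HDR_LEN


# A 100000-byte payload behind a 14-byte Ethernet header.
payload = 100_000
skb_len = 14 + IPV6_HDR_LEN + payload
field = encode_payload_len(payload)
print(field, decode_payload_len(field, skb_len, network_offset=14))
```

Because the buffer length already carries the information, no extension
header has to be inserted on transmit or stripped again by drivers.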

Link: https://patch.msgid.link/20260205133925.526371-1-alice.kernel@fastmail.im
Signed-off-by: Jakub Kicinski <kuba@kernel.org>