Eric Dumazet [Tue, 15 Nov 2022 09:18:51 +0000 (09:18 +0000)]
tcp: annotate data-race around queue->synflood_warned
Annotate the lockless read of queue->synflood_warned.
Following xchg() has the needed data-race resolution.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Li zeming [Tue, 15 Nov 2022 02:14:24 +0000 (10:14 +0800)]
ax25: af_ax25: Remove unnecessary (void*) conversions
The valptr pointer is of (void *) type, so other pointers need not be
forced to assign values to it.
Signed-off-by: Li zeming <zeming@nfschina.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 16 Nov 2022 12:48:44 +0000 (12:48 +0000)]
Merge branch 'net-atomic-dev-stats'
Eric Dumazet says:
====================
net: add atomic dev->stats infra
Long standing KCSAN issues are caused by data-race around
some dev->stats changes.
Most performance critical paths already use per-cpu
variables, or per-queue ones.
It is reasonable (and more correct) to use atomic operations
for the slow paths.
First patch adds the infrastructure, then three patches address
the most common paths that syzbot is playing with.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 15 Nov 2022 08:53:58 +0000 (08:53 +0000)]
ipv4: tunnels: use DEV_STATS_INC()
Most of code paths in tunnels are lockless (eg NETIF_F_LLTX in tx).
Adopt SMP safe DEV_STATS_INC() to update dev->stats fields.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 15 Nov 2022 08:53:57 +0000 (08:53 +0000)]
ipv6: tunnels: use DEV_STATS_INC()
Most of code paths in tunnels are lockless (eg NETIF_F_LLTX in tx).
Adopt SMP safe DEV_STATS_{INC|ADD}() to update dev->stats fields.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 15 Nov 2022 08:53:56 +0000 (08:53 +0000)]
ipv6/sit: use DEV_STATS_INC() to avoid data-races
syzbot/KCSAN reported that multiple cpus are updating dev->stats.tx_error
concurrently.
This is because sit tunnels are NETIF_F_LLTX, meaning their ndo_start_xmit()
is not protected by a spinlock.
While original KCSAN report was about tx path, rx path has the same issue.
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 15 Nov 2022 08:53:55 +0000 (08:53 +0000)]
net: add atomic_long_t to net_device_stats fields
Long standing KCSAN issues are caused by data-race around
some dev->stats changes.
Most performance critical paths already use per-cpu
variables, or per-queue ones.
It is reasonable (and more correct) to use atomic operations
for the slow paths.
This patch adds an union for each field of net_device_stats,
so that we can convert paths that are not yet protected
by a spinlock or a mutex.
netdev_stats_to_stats64() no longer has an #if BITS_PER_LONG==64
Note that the memcpy() we were using on 64bit arches
had no provision to avoid load-tearing,
while atomic_long_read() is providing the needed protection
at no cost.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 16 Nov 2022 12:42:01 +0000 (12:42 +0000)]
Merge branch 'net-try_cmpxchg-conversions'
Eric Dumazet says:
====================
net: more try_cmpxchg() conversions
Adopt try_cmpxchg() and friends in more places, as this
is preferred nowadays.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 15 Nov 2022 09:11:01 +0000 (09:11 +0000)]
net: __sock_gen_cookie() cleanup
Adopt atomic64_try_cmpxchg() and remove the loop,
to make the intent more obvious.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 15 Nov 2022 09:11:00 +0000 (09:11 +0000)]
net: adopt try_cmpxchg() in napi_{enable|disable}()
This makes code a bit cleaner.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 15 Nov 2022 09:10:59 +0000 (09:10 +0000)]
net: adopt try_cmpxchg() in napi_schedule_prep() and napi_complete_done()
This makes the code slightly more efficient.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 15 Nov 2022 09:10:58 +0000 (09:10 +0000)]
net: net_{enable|disable}_timestamp() optimizations
Adopting atomic_try_cmpxchg() makes the code cleaner.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 15 Nov 2022 09:10:57 +0000 (09:10 +0000)]
ipv6: fib6_new_sernum() optimization
Adopt atomic_try_cmpxchg() which is slightly more efficient.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 15 Nov 2022 09:10:56 +0000 (09:10 +0000)]
net: mm_account_pinned_pages() optimization
Adopt atomic_long_try_cmpxchg() in mm_account_pinned_pages()
as it is slightly more efficient.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Mon, 14 Nov 2022 14:42:56 +0000 (16:42 +0200)]
net: linkwatch: only report IF_OPER_LOWERLAYERDOWN if iflink is actually down
RFC 2863 says:
The lowerLayerDown state is also a refinement on the down state.
This new state indicates that this interface runs "on top of" one or
more other interfaces (see ifStackTable) and that this interface is
down specifically because one or more of these lower-layer interfaces
are down.
DSA interfaces are virtual network devices, stacked on top of the DSA
master, but they have a physical MAC, with a PHY that reports a real
link status.
But since DSA (perhaps improperly) uses an iflink to describe the
relationship to its master since commit
c084080151e1 ("dsa: set ->iflink
on slave interfaces to the ifindex of the parent"), default_operstate()
will misinterpret this to mean that every time the carrier of a DSA
interface is not ok, it is because of the master being not ok.
In fact, since commit
c0a8a9c27493 ("net: dsa: automatically bring user
ports down when master goes down"), DSA cannot even in theory be in the
lowerLayerDown state, because it just calls dev_close_many(), thereby
going down, when the master goes down.
We could revert the commit that creates an iflink between a DSA user
port and its master, especially since now we have an alternative
IFLA_DSA_MASTER which has less side effects. But there may be tooling in
use which relies on the iflink, which has existed since 2009.
We could also probably do something local within DSA to overwrite what
rfc2863_policy() did, in a way similar to hsr_set_operstate(), but this
seems like a hack.
What seems appropriate is to follow the iflink, and check the carrier
status of that interface as well. If that's down too, yes, keep
reporting lowerLayerDown, otherwise just down.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 16 Nov 2022 09:43:36 +0000 (09:43 +0000)]
Merge branch 'udp-pernetns-hash'
Kuniyuki Iwashima says:
====================
udp: Introduce optional per-netns hash table.
This series is the UDP version of the per-netns ehash series [0],
which were initially in the same patch set. [1]
The notable difference with TCP is the max table size is 64K and the min
size is 128. This is because the possible hash range by udp_hashfn()
always fits in 64K within the same netns and because we want to keep a
bitmap in udp_lib_get_port() on the stack. Also, the UDP per-netns table
isolates both 1-tuple and 2-tuple tables.
For details, please see the last patch.
patch 1 - 4: prep for per-netns hash table
patch 5: add per-netns hash table
[0]: https://lore.kernel.org/netdev/
20220908011022.45342-1-kuniyu@amazon.com/
[1]: https://lore.kernel.org/netdev/
20220826000445.46552-1-kuniyu@amazon.com/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Kuniyuki Iwashima [Mon, 14 Nov 2022 21:57:57 +0000 (13:57 -0800)]
udp: Introduce optional per-netns hash table.
The maximum hash table size is 64K due to the nature of the protocol. [0]
It's smaller than TCP, and fewer sockets can cause a performance drop.
On an EC2 c5.24xlarge instance (192 GiB memory), after running iperf3 in
different netns, creating 32Mi sockets without data transfer in the root
netns causes regression for the iperf3's connection.
uhash_entries sockets length Gbps
64K 1 1 5.69
1Mi 16 5.27
2Mi 32 4.90
4Mi 64 4.09
8Mi 128 2.96
16Mi 256 2.06
32Mi 512 1.12
The per-netns hash table breaks the lengthy lists into shorter ones. It is
useful on a multi-tenant system with thousands of netns. With smaller hash
tables, we can look up sockets faster, isolate noisy neighbours, and reduce
lock contention.
The max size of the per-netns table is 64K as well. This is because the
possible hash range by udp_hashfn() always fits in 64K within the same
netns and we cannot make full use of the whole buckets larger than 64K.
/* 0 < num < 64K -> X < hash < X + 64K */
(num + net_hash_mix(net)) & mask;
Also, the min size is 128. We use a bitmap to search for an available
port in udp_lib_get_port(). To keep the bitmap on the stack and not
fire the CONFIG_FRAME_WARN error at build time, we round up the table
size to 128.
The sysctl usage is the same with TCP:
$ dmesg | cut -d ' ' -f 6- | grep "UDP hash"
UDP hash table entries: 65536 (order: 9,
2097152 bytes, vmalloc)
# sysctl net.ipv4.udp_hash_entries
net.ipv4.udp_hash_entries = 65536 # can be changed by uhash_entries
# sysctl net.ipv4.udp_child_hash_entries
net.ipv4.udp_child_hash_entries = 0 # disabled by default
# ip netns add test1
# ip netns exec test1 sysctl net.ipv4.udp_hash_entries
net.ipv4.udp_hash_entries = -65536 # share the global table
# sysctl -w net.ipv4.udp_child_hash_entries=100
net.ipv4.udp_child_hash_entries = 100
# ip netns add test2
# ip netns exec test2 sysctl net.ipv4.udp_hash_entries
net.ipv4.udp_hash_entries = 128 # own a per-netns table with 2^n buckets
We could optimise the hash table lookup/iteration further by removing
the netns comparison for the per-netns one in the future. Also, we
could optimise the sparse udp_hslot layout by putting it in udp_table.
[0]: https://lore.kernel.org/netdev/
4ACC2815.
7010101@gmail.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kuniyuki Iwashima [Mon, 14 Nov 2022 21:57:56 +0000 (13:57 -0800)]
udp: Access &udp_table via net.
We will soon introduce an optional per-netns hash table
for UDP.
This means we cannot use udp_table directly in most places.
Instead, access it via net->ipv4.udp_table.
The access will be valid only while initialising udp_table
itself and creating/destroying each netns.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kuniyuki Iwashima [Mon, 14 Nov 2022 21:57:55 +0000 (13:57 -0800)]
udp: Set NULL to udp_seq_afinfo.udp_table.
We will soon introduce an optional per-netns hash table
for UDP.
This means we cannot use the global udp_seq_afinfo.udp_table
to fetch a UDP hash table.
Instead, set NULL to udp_seq_afinfo.udp_table for UDP and get
a proper table from net->ipv4.udp_table.
Note that we still need udp_seq_afinfo.udp_table for UDP LITE.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kuniyuki Iwashima [Mon, 14 Nov 2022 21:57:54 +0000 (13:57 -0800)]
udp: Set NULL to sk->sk_prot->h.udp_table.
We will soon introduce an optional per-netns hash table
for UDP.
This means we cannot use the global sk->sk_prot->h.udp_table
to fetch a UDP hash table.
Instead, set NULL to sk->sk_prot->h.udp_table for UDP and get
a proper table from net->ipv4.udp_table.
Note that we still need sk->sk_prot->h.udp_table for UDP LITE.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kuniyuki Iwashima [Mon, 14 Nov 2022 21:57:53 +0000 (13:57 -0800)]
udp: Clean up some functions.
This patch adds no functional change and cleans up some functions
that the following patches touch around so that we make them tidy
and easy to review/revert. The change is mainly to keep reverse
christmas tree order.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 16 Nov 2022 09:07:03 +0000 (09:07 +0000)]
Merge branch 'sfc-TC-offload-counters'
Edward Cree says:
====================
sfc: TC offload counters
EF100 hardware supports attaching counters to action-sets in the MAE.
Use these counters to implement stats for TC flower offload.
The counters are delivered to the host over a special hardware RX queue
which should only ever receive counter update messages, not 'real'
network packets.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Mon, 14 Nov 2022 13:16:01 +0000 (13:16 +0000)]
sfc: implement counters readout to TC stats
On FLOW_CLS_STATS, look up the MAE counter by TC cookie, and report the
change in packet and byte count since the last time FLOW_CLS_STATS read
them.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Mon, 14 Nov 2022 13:16:00 +0000 (13:16 +0000)]
sfc: validate MAE action order
Currently the only actions supported are COUNT and DELIVER, which can only
happen in the right order; but when more actions are added, it will be
necessary to check that they are only used in the same order in which the
hardware performs them (since the hardware API takes an action *set* in
which the order is implicit). For instance, a VLAN pop must not follow a
VLAN push. Most practical use-cases should be unaffected by these
restrictions.
Add a function efx_tc_flower_action_order_ok() that checks whether it is
appropriate to add a specified action to the existing action-set.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Mon, 14 Nov 2022 13:15:59 +0000 (13:15 +0000)]
sfc: attach an MAE counter to TC actions that need it
The only actions that expect stats (that sfc HW supports) are gact shot
(drop), mirred redirect and mirred mirror. Since these are 'deliverish'
actions that end an action-set, we only require at most one counter per
action-set.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Mon, 14 Nov 2022 13:15:58 +0000 (13:15 +0000)]
sfc: accumulate MAE counter values from update packets
Add the packet and byte counts to the software running total, and store
the latest jiffies every time the counter is bumped.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Mon, 14 Nov 2022 13:15:57 +0000 (13:15 +0000)]
sfc: add functions to allocate/free MAE counters
efx_tc_flower_get_counter_index() will create an MAE counter mapped to
the passed (TC filter) cookie, or increment the reference if one already
exists for that cookie.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Mon, 14 Nov 2022 13:15:56 +0000 (13:15 +0000)]
sfc: add hashtables for MAE counters and counter ID mappings
Nothing populates them yet.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Mon, 14 Nov 2022 13:15:55 +0000 (13:15 +0000)]
sfc: add extra RX channel to receive MAE counter updates on ef100
Currently there is no counter-allocating machinery to connect the
resulting counter update values to; that will be added in a
subsequent patch.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Mon, 14 Nov 2022 13:15:54 +0000 (13:15 +0000)]
sfc: add ef100 MAE counter support functions
Start and stop MAE counter streaming, and grant credits.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Mon, 14 Nov 2022 13:15:53 +0000 (13:15 +0000)]
sfc: add ability for extra channels to receive raw RX buffers
The TC extra channel will need its own special RX handling, which must
operate before any code that expects the RX buffer to contain a network
packet; buffers on this RX queue contain MAE counter packets in a
special format that does not resemble an Ethernet frame, and many fields
of the RX packet prefix are not populated.
The USER_MARK field, however, is populated with the generation count from
the counter subsystem, which needs to be passed on to the RX handler.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Mon, 14 Nov 2022 13:15:52 +0000 (13:15 +0000)]
sfc: add start and stop methods to channels
The TC extra channel needs to do extra work in efx_{start,stop}_channels()
to start/stop MAE counter streaming from the hardware. Add callbacks for
it to implement.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Mon, 14 Nov 2022 13:15:51 +0000 (13:15 +0000)]
sfc: add ability for an RXQ to grant credits on refill
EF100 hardware streams MAE counter updates to the driver over a dedicated
RX queue; however, the MCPU is not able to detect when RX buffers have
been posted to the ring. Thus, the driver must call
MC_CMD_MAE_COUNTERS_STREAM_GIVE_CREDITS; this patch adds the
infrastructure to support that to the core RXQ handling code.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Mon, 14 Nov 2022 13:15:50 +0000 (13:15 +0000)]
sfc: fix ef100 RX prefix macro
Macro PREFIX_WIDTH_MASK uses unsigned long arithmetic for a shift of up
to 32 bits, which breaks on 32-bit systems. This did not previously
show up as we weren't using any fields of width 32, but we now need to
access ESF_GZ_RX_PREFIX_USER_MARK.
Change it to unsigned long long.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 16 Nov 2022 04:34:30 +0000 (20:34 -0800)]
Merge branch 'remove-phylink_validate-from-felix-dsa-driver'
Vladimir Oltean says:
====================
Remove phylink_validate() from Felix DSA driver
The Felix DSA driver still uses its own phylink_validate() procedure
rather than the (relatively newly introduced) phylink_generic_validate()
because the latter did not cater for the case where a PHY provides rate
matching between the Ethernet cable side speed and the SERDES side
speed (and does not advertise other speeds except for the SERDES speed).
This changed with Sean Anderson's generic support for rate matching PHYs
in phylib and phylink:
https://patchwork.kernel.org/project/netdevbpf/cover/
20220920221235.
1487501-1-sean.anderson@seco.com/
Building upon that support, this patch set makes Linux understand that
the PHYs used in combination with the Felix DSA driver (SCH-30841 riser
card with AQR412 PHY, used with SERDES protocol 0x7777 - 4x2500base-x,
plugged into LS1028A-QDS) do support PAUSE rate matching. This requires
Aquantia PHY driver support for new PHY IDs.
To activate the rate matching support in phylink, config->mac_capabilities
must be populated. Coincidentally, this also opts the Felix driver into
the generic phylink validation.
Next, code that is no longer necessary is eliminated. This includes the
Felix driver validation procedures for VSC9959 and VSC9953, the
workaround in the Ocelot switch library to leave RX flow control always
enabled, as well as DSA plumbing necessary for a custom phylink
validation procedure to be propagated to the hardware driver level.
Many thanks go to Sean Anderson for providing generic support for rate
matching.
====================
Link: https://lore.kernel.org/r/20221114170730.2189282-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Mon, 14 Nov 2022 17:07:30 +0000 (19:07 +0200)]
net: dsa: remove phylink_validate() method
As of now, no DSA driver uses a custom link mode validation procedure
anymore. So remove this DSA operation and let phylink determine what is
supported based on config->mac_capabilities (if provided by the driver).
Leave a comment why we left the code that we did, and that there is more
work to do.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Mon, 14 Nov 2022 17:07:29 +0000 (19:07 +0200)]
net: mscc: ocelot: drop workaround for forcing RX flow control
As phylink gained generic support for PHYs with rate matching via PAUSE
frames, the phylink_mac_link_up() method will be called with the maximum
speed and with rx_pause=true if rate matching is in use. This means that
setups with 2500base-x as the SERDES protocol between the MAC/PCS and
the PHY now work with no need for the driver to do anything special.
Tested with fsl-ls1028a-qds-7777.dts.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Mon, 14 Nov 2022 17:07:28 +0000 (19:07 +0200)]
net: dsa: felix: use phylink_generic_validate()
Drop the custom implementation of phylink_validate() in favor of the
generic one, which requires config->mac_capabilities to be set.
This was used up until now because of the possibility of being paired
with Aquantia PHYs with support for rate matching. The phylink framework
gained generic support for these, and knows to advertise all 10/100/1000
lower speed link modes when our SERDES protocol is 2500base-x
(fixed speed).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Mon, 14 Nov 2022 17:07:27 +0000 (19:07 +0200)]
net: phy: aquantia: add AQR112 and AQR412 PHY IDs
These are Gen3 Aquantia N-BASET PHYs which support 5GBASE-T,
2.5GBASE-T, 1000BASE-T and 100BASE-TX (not 10G); also EEE, Sync-E,
PTP, PoE.
The 112 is a single PHY package, the 412 is a quad PHY package.
The system-side SERDES interface of these PHYs selects its protocol
depending on the negotiated media side link speed. That protocol can be
1000BASE-X, 2500BASE-X, 10GBASE-R, SGMII, USXGMII.
The configuration of which SERDES protocol to use for which link speed
is made by firmware; even though it could be overwritten over MDIO by
Linux, we assume that the firmware provisioning is ok for the board on
which the driver probes.
For cases when the system side runs at a fixed rate, we want phylib/phylink
to detect the PAUSE rate matching ability of these PHYs, so we need to
use the Aquantia rather than the generic C45 driver. This needs
aqr107_read_status() -> aqr107_read_rate() to set phydev->rate_matching,
as well as the aqr107_get_rate_matching() method.
I am a bit unsure about the naming convention in the driver. Since
AQR107 is a Gen2 PHY, I assume all functions prefixed with "aqr107_"
rather than "aqr_" mean Gen2+ features. So I've reused this naming
convention.
I've tested PHY "SGMII" statistics as well as the .link_change_notify
method, which prints:
Aquantia AQR412 mdio_mux-0.4:00: Link partner is Aquantia PHY, FW 4.3, fast-retrain downshift advertised, fast reframe advertised
Tested SERDES protocols are usxgmii and 2500base-x (the latter with
PAUSE rate matching). Tested link modes are 100/1000/2500 Base-T
(with Aquantia link partner and with other link partners). No notable
events observed.
The placement of these PHY IDs in the driver is right before AQR113C,
a Gen4 PHY.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 16 Nov 2022 04:33:19 +0000 (20:33 -0800)]
Merge git://git./linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
1) Fix sparse warning in the new nft_inner expression, reported
by Jakub Kicinski.
2) Incorrect vlan header check in nft_inner, from Peng Wu.
3) Two patches to pass reset boolean to expression dump operation,
in preparation for allowing to reset stateful expressions in rules.
This adds a new NFT_MSG_GETRULE_RESET command. From Phil Sutter.
4) Inconsistent indentation in nft_fib, from Jiapeng Chong.
5) Speed up siphash calculation in conntrack, from Florian Westphal.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: conntrack: use siphash_4u64
netfilter: rpfilter/fib: clean up some inconsistent indenting
netfilter: nf_tables: Introduce NFT_MSG_GETRULE_RESET
netfilter: nf_tables: Extend nft_expr_ops::dump callback parameters
netfilter: nft_inner: fix return value check in nft_inner_parse_l2l3()
netfilter: nft_payload: use __be16 to store gre version
====================
Link: https://lore.kernel.org/r/20221115095922.139954-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Walter Heymans [Tue, 15 Nov 2022 09:08:34 +0000 (10:08 +0100)]
Documentation: nfp: update documentation
The NFP documentation is updated to include information about Corigine,
and the new NFP3800 chips. The 'Acquiring Firmware' section is updated
with new information about where to find firmware.
Two new sections are added to expand the coverage of the documentation.
The new sections include:
- Devlink Info
- Configure Device
Signed-off-by: Walter Heymans <walter.heymans@corigine.com>
Reviewed-by: Niklas Söderlund <niklas.soderlund@corigine.com>
Reviewed-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20221115090834.738645-1-simon.horman@corigine.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 16 Nov 2022 04:23:18 +0000 (20:23 -0800)]
Merge branch 'mtk_eth_soc-rx-vlan-offload-improvement-dsa-hardware-untag-support'
Felix Fietkau says:
====================
mtk_eth_soc rx vlan offload improvement + dsa hardware untag support
This series improves rx vlan offloading on mtk_eth_soc and extends it to
support hardware DSA untagging where possible.
This improves performance by avoiding calls into the DSA tag driver receive
function, including mangling of skb->data.
This is split out of a previous series, which added other fixes and
multiqueue support
====================
Link: https://lore.kernel.org/r/20221114124214.58199-1-nbd@nbd.name
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Felix Fietkau [Mon, 14 Nov 2022 12:42:14 +0000 (13:42 +0100)]
net: ethernet: mtk_eth_soc: enable hardware DSA untagging
- pass the tag to DSA via metadata dst
- disabled on 7986 for now, since it's not working yet
- disabled if a MAC is enabled that does not use DSA
This improves performance by bypassing the DSA tag driver and avoiding extra
skb data mangling
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Felix Fietkau [Mon, 14 Nov 2022 12:42:13 +0000 (13:42 +0100)]
net: ethernet: mtk_eth_soc: add support for configuring vlan rx offload
Keep the vlan rx offload feature in sync across all netdevs belonging to the
device, since the feature is global and can't be turned off per MAC
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Felix Fietkau [Mon, 14 Nov 2022 12:42:12 +0000 (13:42 +0100)]
net: ethernet: mtk_eth_soc: pass correct VLAN protocol ID to the network stack
Use the id from the DMA descriptor instead of hardcoding 802.1q
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Felix Fietkau [Mon, 14 Nov 2022 12:42:11 +0000 (13:42 +0100)]
net: dsa: add support for DSA rx offloading via metadata dst
If a metadata dst is present with the type METADATA_HW_PORT_MUX on a dsa cpu
port netdev, assume that it carries the port number and that there is no DSA
tag present in the skb data.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Machon [Mon, 14 Nov 2022 09:29:50 +0000 (10:29 +0100)]
net: dcb: move getapptrust to separate function
This patch fixes a frame size warning, reported by kernel test robot.
>> net/dcb/dcbnl.c:1230:1: warning: the frame size of 1244 bytes is
>> larger than 1024 bytes [-Wframe-larger-than=]
The getapptrust part of dcbnl_ieee_fill is moved to a separate function,
and the selector array is now dynamically allocated, instead of stack
allocated.
Tested on microchip sparx5 driver.
Fixes:
6182d5875c33 ("net: dcb: add new apptrust attribute")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://lore.kernel.org/r/20221114092950.2490451-1-daniel.machon@microchip.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Florian Westphal [Wed, 2 Nov 2022 12:46:33 +0000 (13:46 +0100)]
netfilter: conntrack: use siphash_4u64
This function is used for every packet, siphash_4u64 is noticeably faster
than using local buffer + siphash:
Before:
1.23% kpktgend_0 [kernel.vmlinux] [k] __siphash_unaligned
0.14% kpktgend_0 [nf_conntrack] [k] hash_conntrack_raw
After:
0.79% kpktgend_0 [kernel.vmlinux] [k] siphash_4u64
0.15% kpktgend_0 [nf_conntrack] [k] hash_conntrack_raw
In the pktgen test this gives about ~2.4% performance improvement.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Jiapeng Chong [Fri, 4 Nov 2022 03:55:04 +0000 (11:55 +0800)]
netfilter: rpfilter/fib: clean up some inconsistent indenting
No functional modification involved.
net/ipv4/netfilter/nft_fib_ipv4.c:141 nft_fib4_eval() warn: inconsistent indenting.
Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2733
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Phil Sutter [Fri, 14 Oct 2022 21:45:59 +0000 (23:45 +0200)]
netfilter: nf_tables: Introduce NFT_MSG_GETRULE_RESET
Analogous to NFT_MSG_GETOBJ_RESET, but for rules: Reset stateful
expressions like counters or quotas. The latter two are the only
consumers, adjust their 'dump' callbacks to respect the parameter
introduced earlier.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Phil Sutter [Fri, 14 Oct 2022 21:45:58 +0000 (23:45 +0200)]
netfilter: nf_tables: Extend nft_expr_ops::dump callback parameters
Add a 'reset' flag just like with nft_object_ops::dump. This will be
useful to reset "anonymous stateful objects", e.g. simple rule counters.
No functional change intended.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Jamie Bainbridge [Mon, 14 Nov 2022 01:00:08 +0000 (12:00 +1100)]
tcp: Add listening address to SYN flood message
The SYN flood message prints the listening port number, but with many
processes bound to the same port on different IPs, it's impossible to
tell which socket is the problem.
Add the listen IP address to the SYN flood message.
For IPv6 use "[IP]:port" as per RFC-5952 and to provide ease of
copy-paste to "ss" filters. For IPv4 use "IP:port" to match.
Each protcol's "any" address and a host address now look like:
Possible SYN flooding on port 0.0.0.0:9001.
Possible SYN flooding on port 127.0.0.1:9001.
Possible SYN flooding on port [::]:9001.
Possible SYN flooding on port [fc00::1]:9001.
Signed-off-by: Jamie Bainbridge <jamie.bainbridge@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Link: https://lore.kernel.org/r/4fedab7ce54a389aeadbdc639f6b4f4988e9d2d7.1668386107.git.jamie.bainbridge@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 15 Nov 2022 02:43:24 +0000 (18:43 -0800)]
Merge branch 'genirq-msi-treewide-cleanup-of-pointless-linux-msi-h-includes'
Thomas Gleixner says:
====================
genirq/msi: Treewide cleanup of pointless linux/msi.h includes
While working on per device MSI domains I noticed that quite some files
include linux/msi.h just because.
The top level comment in the header file clearly says:
Regular device drivers have no business with any of these functions....
and actually none of the drivers needs anything from msi.h.
====================
Link: https://lore.kernel.org/r/20221113201935.776707081@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Thomas Gleixner [Sun, 13 Nov 2022 20:34:05 +0000 (21:34 +0100)]
net: nfp: Remove linux/msi.h includes
Nothing in these files needs anything from linux/msi.h
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: oss-drivers@corigine.com
Acked-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Thomas Gleixner [Sun, 13 Nov 2022 20:34:04 +0000 (21:34 +0100)]
net: dpaa2: Remove linux/msi.h includes
Nothing in these file needs anything from linux/msi.h
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Yoshihiro Shimoda [Thu, 10 Nov 2022 01:27:20 +0000 (10:27 +0900)]
net: ethernet: renesas: rswitch: Fix build error about ptp
If CONFIG_PTP_1588_CLOCK_OPTIONAL=m and CONFIG_RENESAS_ETHER_SWITCH=y,
the following build error happened:
aarch64-linux-ld: DWARF error: could not find abbrev number 60
drivers/net/ethernet/renesas/rswitch.o: in function `rswitch_get_ts_info':
rswitch.c:(.text+0x408): undefined reference to `ptp_clock_index'
aarch64-linux-ld: DWARF error: could not find abbrev number
1190123
drivers/net/ethernet/renesas/rcar_gen4_ptp.o: in function `rcar_gen4_ptp_register':
rcar_gen4_ptp.c:(.text+0x4dc): undefined reference to `ptp_clock_register'
aarch64-linux-ld: drivers/net/ethernet/renesas/rcar_gen4_ptp.o: in function `rcar_gen4_ptp_unregister':
rcar_gen4_ptp.c:(.text+0x584): undefined reference to `ptp_clock_unregister'
To fix the issue, add "depends on PTP_1588_CLOCK_OPTIONAL" into the
Kconfig.
Reported-by: kernel test robot <lkp@intel.com>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Fixes:
6c6fa1a00ad3 ("net: ethernet: renesas: rswitch: Add R-Car Gen4 gPTP support")
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Link: https://lore.kernel.org/r/20221110012720.3552060-1-yoshihiro.shimoda.uh@renesas.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David S. Miller [Mon, 14 Nov 2022 11:35:28 +0000 (11:35 +0000)]
Merge tag 'mlx5-updates-2022-11-12' of git://git./linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2022-11-12
Misc updates to mlx5 driver
1) Support enhanced CQE compression, on ConnectX6-Dx
Reduce irq rate, cpu utilization and latency.
2) Connection tracking: Optimize the pre_ct table lookup for rules
installed on chain 0.
3) implement ethtool get_link_ext_stats for PHY down events
4) Expose device vhca_id to debugfs
5) misc cleanups and trivial changes
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Shenwei Wang [Fri, 11 Nov 2022 15:35:05 +0000 (09:35 -0600)]
net: fec: add xdp and page pool statistics
Added xdp and page pool statistics.
In order to make the implementation simple and compatible, the patch
uses the 32bit integer to record the XDP statistics.
Signed-off-by: Shenwei Wang <shenwei.wang@nxp.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 14 Nov 2022 11:24:17 +0000 (11:24 +0000)]
Merge branch 'sparx5-sorted-VCAP-rules'
Steen Hegelund says:
====================
net: Add support for sorted VCAP rules in Sparx5
This provides support for adding Sparx5 VCAP rules in sorted order, VCAP
rule counters and TC filter matching on ARP frames.
It builds on top of the initial IS2 VCAP support found in these series:
https://lore.kernel.org/all/
20221020130904.
1215072-1-steen.hegelund@microchip.com/
https://lore.kernel.org/all/
20221109114116.
3612477-1-steen.hegelund@microchip.com/
Functionality
=============
When a new VCAP rule is added the driver will now ensure that the rule is
inserted in sorted order, and when a rule is removed, the remaining rules
will be moved to keep the sorted order and remove any gaps in the VCAP
address space.
A VCAP rule is ordered using these 3 values:
- Rule size: the count of VCAP addresses used by the rule. The largest
rule have highest priority
- Rule User: The rules are ordered by the user enumeration
- Priority: The priority provided in the flower filter. The lowest value
has the highest priority.
A VCAP instance may contain the counter as part of the VCAP cache area, and
this counter may be one or more bits in width. This type of counter
automatically increments its value when the rule is hit.
Other VCAP instances have a dedicated counter area outside of the VCAP and
in this case the rule must contain the counter id to be able to locate the
counter value and cause the counter to be incremented. In this case there
must also be a VCAP rule action that sets the counter id.
The Sparx5 IS2 VCAP uses a dedicated counter area with 32bit counters.
This series adds support for getting VCAP rule counters and provide these
via the TC statistic interface.
This only support packet counters, not byte counters.
Finally the series adds support for the ARP frame dissector and configures
the Sparx5 IS2 VCAP to generate the ARP keyset when ARP traffic is
received.
Delivery:
=========
This is current plan for delivering the full VCAP feature set of Sparx5:
- DebugFS support for inspecting rules
- TC protocol all support
- Sparx5 IS0 VCAP support
- TC policer and drop action support (depends on the Sparx5 QoS support
upstreamed separately)
- Sparx5 ES0 VCAP support
- TC flower template support
- TC matchall filter support for mirroring and policing ports
- TC flower filter mirror action support
- Sparx5 ES2 VCAP support
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Steen Hegelund [Fri, 11 Nov 2022 13:05:19 +0000 (14:05 +0100)]
net: microchip: sparx5: Add KUNIT test of counters and sorted rules
This tests the insert, move and deleting of rules and checks that the
unused VCAP addresses are initialized correctly.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steen Hegelund [Fri, 11 Nov 2022 13:05:18 +0000 (14:05 +0100)]
net: microchip: sparx5: Add support for TC flower filter statistics
This provides flower filter packet statistics (bytes are not supported) via
the dedicated IS2 counter feature.
All rules having the same TC cookie will contribute to the packet
statistics for the filter as they are considered to be part of the same TC
flower filter.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steen Hegelund [Fri, 11 Nov 2022 13:05:17 +0000 (14:05 +0100)]
net: microchip: sparx5: Add support for IS2 VCAP rule counters
This adds API methods to set and get a rule counter.
A VCAP instance may contain the counter as part of the VCAP cache area, and
this counter may be one or more bits in width. This type of counter
automatically increments it value when the rule is hit.
Other VCAP instances have a dedicated counter area outside of the VCAP and
in this case the rule must contain the counter id to be able to locate the
counter value. In this case there must also be a rule action that updates
the counter using the rule id when the rule is hit.
The Sparx5 IS2 VCAP uses a dedicated counter area.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steen Hegelund [Fri, 11 Nov 2022 13:05:16 +0000 (14:05 +0100)]
net: microchip: sparx5: Add/delete rules in sorted order
This adds a sorting criteria to rule insertion and deletion.
The criteria is (in the listed order):
- Rule size (largest size first)
- User (based on an enumerated user value)
- Priority (highest priority first, aka lowest value)
When a rule is deleted the other rules may need to be moved to fill the gap
to use the available VCAP address space in the best possible way.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steen Hegelund [Fri, 11 Nov 2022 13:05:15 +0000 (14:05 +0100)]
net: microchip: sparx5: Add support for TC flower ARP dissector
This add support for Sparx5 for dissecting TC ARP flower filter keys and
sets up the Sparx5 IS2 VCAP to generate the ARP keyset for ARP frames.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steen Hegelund [Fri, 11 Nov 2022 13:05:14 +0000 (14:05 +0100)]
net: flow_offload: add support for ARP frame matching
This adds a new flow_rule_match_arp function that allows drivers
to be able to dissect ARP frames.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
xu xin [Fri, 11 Nov 2022 09:04:20 +0000 (09:04 +0000)]
ipasdv4/tcp_ipv4: remove redundant assignment
The value of 'st->state' has been verified as "TCP_SEQ_STATE_LISTENING",
it's unnecessary to assign TCP_SEQ_STATE_LISTENING to it, so we can remove it.
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 14 Nov 2022 10:47:07 +0000 (10:47 +0000)]
Merge branch 'ibmvnic-affinity-hints'
Nick Child says:
====================
ibmvnic: Introduce affinity hint support
This is a patchset to do 3 things to improve ibmvnic performance:
1. Assign affinity hints to ibmvnic queue irq's
2. Update affinity hints on cpu hotplug events
3. Introduce transmit packet steering (XPS)
NOTE: If irqbalance is running, you need to stop it from overriding
our affinity hints. To do this you can do one of:
- systemctl stop irqbalance
- ban the ibmvnic module irqs
- you must have the latest irqbalance v9.2, the banmod argument was broken before this
- in /etc/sysconfig/irqbalance -> IRQBALANCE_ARGS="--banmod=ibmvnic"
- systemctl restart irqbalance
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Nick Child [Thu, 10 Nov 2022 21:32:18 +0000 (15:32 -0600)]
ibmvnic: Update XPS assignments during affinity binding
Transmit Packet Steering (XPS) maps cpu numbers to transmit
queues. By running the same connection on the same set of cpu's,
contention for the queue and cache miss rate can be minimized.
When assigning a cpu mask for a tranmit queues irq number, assign
the same cpu mask as the set of cpu's that XPS should use for that
queue.
Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Signed-off-by: Dany Madden <drt@linux.ibm.com>
Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com>
Reviewed-by: Haren Myneni <haren@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nick Child [Thu, 10 Nov 2022 21:32:17 +0000 (15:32 -0600)]
ibmvnic: Add hotpluggable CPU callbacks to reassign affinity hints
When CPU's are added and removed, ibmvnic devices will reassign
hint values. Introduce a new cpu hotplug state CPUHP_IBMVNIC_DEAD
to signal to ibmvnic devices that the CPU has been removed and it
is time to reset affinity hint assignments. On the other hand,
when CPU's are being added, add a state instance to
CPUHP_AP_ONLINE_DYN which will trigger a reassignment of affinity
hints once the new CPU's are online. This implementation is based
on the virtio_net driver.
Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Signed-off-by: Dany Madden <drt@linux.ibm.com>
Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com>
Reviewed-by: Haren Myneni <haren@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nick Child [Thu, 10 Nov 2022 21:32:16 +0000 (15:32 -0600)]
ibmvnic: Assign IRQ affinity hints to device queues
Assign affinity hints to ibmvnic device queue interrupts.
Affinity hints are assigned and removed during sub-crq init and
teardown, respectively. This update should improve latency if
utilized as interrupt lines and processing are more equally
distributed among CPU's. This implementation is based on the
virtio_net driver.
Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Signed-off-by: Dany Madden <drt@linux.ibm.com>
Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com>
Reviewed-by: Haren Myneni <haren@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Md Fahad Iqbal Polash [Thu, 10 Nov 2022 13:03:53 +0000 (14:03 +0100)]
ice: virtchnl rss hena support
Add support for 2 virtchnl msgs:
VIRTCHNL_OP_SET_RSS_HENA
VIRTCHNL_OP_GET_RSS_HENA_CAPS
The first one allows VFs to clear all previously programmed
RSS configuration and customize it. The second one returns
the RSS HENA bits allowed by the hardware.
Introduce ice_err_to_virt_err which converts kernel
specific errors to virtchnl errors.
Signed-off-by: Md Fahad Iqbal Polash <md.fahad.iqbal.polash@intel.com>
Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Chuang Wang [Thu, 10 Nov 2022 07:31:25 +0000 (15:31 +0800)]
net: tun: rebuild error handling in tun_get_user
The error handling in tun_get_user is very scattered.
This patch unifies error handling, reduces duplication of code, and
makes the logic clearer.
Signed-off-by: Chuang Wang <nashuiliang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Saeed Mahameed [Wed, 26 Oct 2022 10:26:31 +0000 (11:26 +0100)]
net/mlx5e: ethtool: get_link_ext_stats for PHY down events
Implement ethtool_op get_link_ext_stats for PHY down events
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Oz Shlomo [Mon, 31 Oct 2022 09:00:30 +0000 (09:00 +0000)]
net/mlx5e: CT, optimize pre_ct table lookup
The pre_ct table realizes in hardware the act_ct cache logic, bypassing
the CT table if the ct state was already set by a previous ct lookup.
As such, the pre_ct table will always miss for chain 0 filters.
Optimize the pre_ct table lookup for rules installed on chain 0.
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Tariq Toukan [Wed, 14 Sep 2022 07:35:04 +0000 (10:35 +0300)]
net/mlx5e: kTLS, Use a single async context object per a callback bulk
A single async context object is sufficient to wait for the completions
of many callbacks. Switch to using one instance per a bulk of commands.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Tariq Toukan [Mon, 12 Sep 2022 18:43:18 +0000 (21:43 +0300)]
net/mlx5e: kTLS, Remove unnecessary per-callback completion
Waiting on a completion object for each callback before cleaning up their
async contexts is not necessary, as this is already implied in the
mlx5_cmd_cleanup_async_ctx() API.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Tariq Toukan [Wed, 14 Sep 2022 08:02:34 +0000 (11:02 +0300)]
net/mlx5e: kTLS, Remove unused work field
Work field in struct mlx5e_async_ctx is not used. Remove it.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Roi Dayan [Wed, 21 Sep 2022 06:17:15 +0000 (09:17 +0300)]
net/mlx5e: TC, Remove redundant WARN_ON()
The case where the packet is not offloaded and needs to be restored
to slow path and couldn't find expected tunnel information should not
dump a call trace to the user. there is a debug call.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Guy Truzman [Thu, 29 Sep 2022 12:12:51 +0000 (15:12 +0300)]
net/mlx5e: Add error flow when failing update_rx
Up until now, return value of update_rx was ignored. Therefore, flow
continues even if it fails. Add error flow in case of update_rx fails in
mlx5e_open_locked, mlx5i_open and mlx5i_pkey_open.
Signed-off-by: Guy Truzman <gtruzman@nvidia.com>
Reviewed-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Tariq Toukan [Wed, 18 May 2022 08:46:35 +0000 (11:46 +0300)]
net/mlx5e: Move params kernel log print to probe function
Params info print was meant to be printed on load.
With time, new calls to mlx5e_init_rq_type_params and
mlx5e_build_rq_params were added, mistakenly printing
the params once again.
Move the print to were it belongs, in mlx5e_probe.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Ofer Levi [Tue, 9 Feb 2021 15:48:11 +0000 (17:48 +0200)]
net/mlx5e: Support enhanced CQE compression
CQE compression feature improves performance by reducing PCI bandwidth
bottleneck on CQEs write.
Enhanced CQE compression introduced in ConnectX-6 and it aims to reduce
CPU utilization of SW side packets decompression by eliminating the
need to rewrite ownership bit, which is likely to cost a cache-miss, is
replaced by validity byte handled solely by HW.
Another advantage of the enhanced feature is that session packets are
available to SW as soon as a single CQE slot is filled, instead of
waiting for session to close, this improves packet latency from NIC to
host.
Performance:
Following are tested scenarios and reults comparing basic and enahnced
CQE compression.
setup: IXIA 100GbE connected directly to port 0 and port 1 of
ConnectX-6 Dx 100GbE dual port.
Case #1 RX only, single flow goes to single queue:
IRQ rate reduced by ~ 30%, CPU utilization improved by 2%.
Case #2 IP forwarding from port 1 to port 0 single flow goes to
single queue:
Avg latency improved from 60us to 21us, frame loss improved from 0.5% to 0.0%.
Case #3 IP forwarding from port 1 to port 0 Max Throughput IXIA sends
100%, 8192 UDP flows, goes to 24 queues:
Enhanced is equal or slightly better than basic.
Testing the basic compression feature with this patch shows there is
no perfrormance degradation of the basic compression feature.
Signed-off-by: Ofer Levi <oferle@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Gal Pressman [Sun, 4 Sep 2022 10:29:26 +0000 (13:29 +0300)]
net/mlx5e: Use clamp operation instead of open coding it
Replace the min/max operations with a single clamp.
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Anisse Astier [Mon, 31 Oct 2022 16:56:04 +0000 (17:56 +0100)]
net/mlx5e: remove unused list in arfs
This is never used, and probably something that was intended to be used
before per-protocol hash tables were chosen instead.
Signed-off-by: Anisse Astier <anisse@astier.eu>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Eli Cohen [Wed, 21 Sep 2022 10:33:29 +0000 (13:33 +0300)]
net/mlx5: Expose vhca_id to debugfs
hca_id is an identifier of an mlx5_core instance within the hardware.
This identifier may be required for troubleshooting.
Expose it to debugfs.
Example:
$ cat /sys/kernel/debug/mlx5/mlx5_core.sf.2/vhca_id
0x12
Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Moshe Shemesh [Mon, 8 Aug 2022 17:02:59 +0000 (20:02 +0300)]
net/mlx5: Unregister traps on driver unload flow
Before this patch, devlink traps are registered only on full driver
probe and unregistered on driver removal. As devlink traps are not
usable once driver functionality is unloaded, it should be unrgeistered
also on flows that unload the driver and then registered when loaded
back, e.g. devlink reload flow.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Aya Levin <ayal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Colin Ian King [Mon, 31 Oct 2022 08:01:04 +0000 (08:01 +0000)]
net/mlx5: Fix spelling mistake "destoy" -> "destroy"
There is a spelling mistake in an error message. Fix it.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Roi Dayan [Thu, 27 Oct 2022 08:35:12 +0000 (11:35 +0300)]
net/mlx5: Bridge, Use debug instead of warn if entry doesn't exists
There is no need for the warn if entry already removed.
Use debug print like in the update flow.
Also update the messages so user can identify if the it's
from the update flow or remove flow.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Eric Dumazet [Thu, 10 Nov 2022 19:02:39 +0000 (19:02 +0000)]
tcp: tcp_wfree() refactoring
Use try_cmpxchg() (instead of cmpxchg()) in a more readable way.
oval = smp_load_acquire(&sk->sk_tsq_flags);
do {
...
} while (!try_cmpxchg(&sk->sk_tsq_flags, &oval, nval));
Reduce indentation level.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20221110190239.3531280-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Thu, 10 Nov 2022 17:48:29 +0000 (17:48 +0000)]
tcp: adopt try_cmpxchg() in tcp_release_cb()
try_cmpxchg() is slighly more efficient (at least on x86),
and smp_load_acquire(&sk->sk_tsq_flags) could avoid a KCSAN report.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20221110174829.3403442-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Thu, 10 Nov 2022 08:54:22 +0000 (10:54 +0200)]
bridge: Add missing parentheses
No changes in generated code.
Reported-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20221110085422.521059-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sat, 12 Nov 2022 05:23:23 +0000 (21:23 -0800)]
Merge branch 'dt-bindings-net-qcom-ipa-relax-some-restrictions'
Alex Elder says:
====================
dt-bindings: net: qcom,ipa: relax some restrictions
The first patch in this series simply removes an unnecessary
requirement in the IPA binding. Previously, if the modem was doing
GSI firmware loading, the firmware name property was required to
*not* be present. There is no harm in having the firmware name be
specified, so this restriction isn't needed.
The second patch restates a requirement on the "memory-region"
property more accurately.
These binding changes have no impact on existing code or DTS files.
These aren't really bug fixes, so no need to back-port.
====================
Link: https://lore.kernel.org/r/20221110195619.1276302-1-elder@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alex Elder [Thu, 10 Nov 2022 19:56:18 +0000 (13:56 -0600)]
dt-bindings: net: qcom,ipa: restate a requirement
Either the AP or modem loads GSI firmware. If the modem-init
property is present, the modem loads it. Otherwise, the AP loads
it, and in that case the memory-region property must be defined.
Currently this requirement is expressed as one or the other of the
modem-init or the memory-region property being required. But it's
harmless for the memory-region to be present if the modem is loading
firmware (it'll just be ignored).
Restate the requirement so that the memory-region property is
required only if modem-init is not present.
Signed-off-by: Alex Elder <elder@linaro.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alex Elder [Thu, 10 Nov 2022 19:56:17 +0000 (13:56 -0600)]
dt-bindings: net: qcom,ipa: remove an unnecessary restriction
Commit
d8604b209e9b3 ("dt-bindings: net: qcom,ipa: add firmware-name
property") added a requirement for a "firmware-name" property that
is more restrictive than necessary.
If the AP loads GSI firmware, the name of the firmware file to use
may optionally be provided via a "firmware-name" property. If the
*modem* loads GSI firmware, "firmware-name" doesn't need to be
supplied--but it's harmless to do so (it will simply be ignored).
Remove the unnecessary restriction, and allow "firware-name" to be
supplied even if it's not needed.
Signed-off-by: Alex Elder <elder@linaro.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Angelo Dureghello [Thu, 10 Nov 2022 09:10:27 +0000 (10:10 +0100)]
net: dsa: mv88e6xxx: enable set_policy
Enabling set_policy capability for mv88e6321.
Signed-off-by: Angelo Dureghello <angelo.dureghello@timesys.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20221110091027.998073-1-angelo.dureghello@timesys.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sat, 12 Nov 2022 05:19:50 +0000 (21:19 -0800)]
Merge branch 'mptcp-miscellaneous-refactoring-and-small-fixes'
Mat Martineau says:
====================
mptcp: Miscellaneous refactoring and small fixes
Patches 1-3 do some refactoring to more consistently handle sock casts,
and to remove some duplicate code. No functional changes.
Patch 4 corrects a variable name in a self test, but does not change
functionality since the same value gets used due to bash's
scoping rules.
Patch 5 rewords a comment.
====================
Link: https://lore.kernel.org/r/20221110232322.125068-1-mathew.j.martineau@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Mat Martineau [Thu, 10 Nov 2022 23:23:22 +0000 (15:23 -0800)]
mptcp: Fix grammar in a comment
We kept getting initial patches from new contributors to remove a
duplicate 'the' (since grammar checking scripts flag it), but submitters
never followed up after code review.
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Geliang Tang [Thu, 10 Nov 2022 23:23:21 +0000 (15:23 -0800)]
selftests: mptcp: use max_time instead of time
'time' is the local variable of run_test() function, while 'max_time' is
the local variable of do_transfer() function. So in do_transfer(),
$max_time should be used, not $time.
Please note that here $time == $max_time so the behaviour is not changed
but the right variable is used.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Geliang Tang [Thu, 10 Nov 2022 23:23:20 +0000 (15:23 -0800)]
mptcp: get sk from msk directly
Use '(struct sock *)msk' to get 'sk' from 'msk' in a more direct way
instead of using '&msk->sk.icsk_inet.sk'.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Geliang Tang [Thu, 10 Nov 2022 23:23:19 +0000 (15:23 -0800)]
mptcp: change 'first' as a parameter
The function mptcp_subflow_process_delegated() uses the input ssk first,
while __mptcp_check_push() invokes the packet scheduler first.
So this patch adds a new parameter named 'first' for the function
__mptcp_subflow_push_pending() to deal with these two cases separately.
With this change, the code that invokes the packet scheduler in the
function __mptcp_check_push() can be removed, and replaced by invoking
__mptcp_subflow_push_pending() directly.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Geliang Tang [Thu, 10 Nov 2022 23:23:18 +0000 (15:23 -0800)]
mptcp: use msk instead of mptcp_sk
Use msk instead of mptcp_sk(sk) in the functions where the variable
"msk = mptcp_sk(sk)" has been defined.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>