Krzysztof Kozlowski [Wed, 2 Mar 2022 19:25:19 +0000 (20:25 +0100)]
nfc: llcp: simplify llcp_sock_connect() error paths
The llcp_sock_connect() error paths were using a mixed way of central
exit (goto) and cleanup.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
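To illustrate the central-exit (goto) style that the commit above refers to, here is a minimal userspace sketch. It is not the llcp_sock_connect() code: struct ex_sock, ex_connect() and the two failure flags are hypothetical names made up for this example.

```c
#include <errno.h>

/* Hypothetical stand-in for a socket that acquires two resources. */
struct ex_sock {
	int dev_refs;   /* references taken on the device */
	int local_refs; /* references taken on the local endpoint */
};

/*
 * Every failure jumps to a label. Each label undoes exactly what was
 * acquired before the failing step, in reverse order of acquisition.
 */
int ex_connect(struct ex_sock *sk, int fail_local, int fail_send)
{
	int ret;

	sk->dev_refs++;			/* step 1: take the device */

	if (fail_local) {		/* step 2: find the local endpoint */
		ret = -ENODEV;
		goto put_dev;
	}
	sk->local_refs++;

	if (fail_send) {		/* step 3: send the connect request */
		ret = -ENOMEM;
		goto put_local;
	}

	return 0;

put_local:
	sk->local_refs--;
put_dev:
	sk->dev_refs--;
	return ret;
}
```

With a single chain of labels, adding a new step only requires one new label instead of repeating the cleanup at every early return.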
Krzysztof Kozlowski [Wed, 2 Mar 2022 19:25:18 +0000 (20:25 +0100)]
nfc: llcp: nullify llcp_sock->dev on connect() error paths
Nullify the llcp_sock->dev on llcp_sock_connect() error paths,
symmetrically to the code llcp_sock_bind(). The non-NULL value of
llcp_sock->dev is used in a few places to check whether the socket is
still valid.
There was no particular issue observed with the missing NULL assignment in
the connect() error path; however, a similar case, in the bind() error path,
was triggerable. That one was fixed in commit 4ac06a1e013c ("nfc: fix NULL
ptr dereference in llcp_sock_getname() after failed connect"), so the
change here seems logical as well.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
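The reasoning above (a non-NULL dev pointer doubles as the validity check) can be sketched in a few lines of userspace C. All names here (struct ex_llcp_sock, ex_sock_connect(), ex_sock_getname()) and the chosen error codes are hypothetical and only mirror the idea, not the NFC code itself.

```c
#include <errno.h>
#include <stddef.h>

struct ex_dev {
	int id;
};

struct ex_llcp_sock {
	struct ex_dev *dev; /* non-NULL is read as "socket is still valid" */
};

/* Fails when 'fail' is set; on failure the dev pointer is cleared again. */
int ex_sock_connect(struct ex_llcp_sock *sk, struct ex_dev *dev, int fail)
{
	sk->dev = dev;
	if (fail) {
		sk->dev = NULL; /* do not leave a stale pointer behind */
		return -ENOMEM;
	}
	return 0;
}

/* A later call that relies on dev != NULL before dereferencing it. */
int ex_sock_getname(const struct ex_llcp_sock *sk, int *id_out)
{
	if (sk->dev == NULL)
		return -ENOTCONN;
	*id_out = sk->dev->id;
	return 0;
}
```

If the failing path skipped the NULL assignment, ex_sock_getname() would pass its check and read through a pointer that the failed connect no longer owns.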
David S. Miller [Thu, 3 Mar 2022 10:37:23 +0000 (10:37 +0000)]
Merge branch 'net-hw-counters-for-soft-devices'
Ido Schimmel says:
====================
HW counters for soft devices
Petr says:
Offloading switch device drivers may be able to collect statistics of the
traffic taking place in the HW datapath that pertains to a certain soft
netdevice, such as a VLAN. In this patch set, add the necessary
infrastructure to allow exposing these statistics to the offloaded
netdevice in question, and add mlxsw offload.
Across HW platforms, the counter itself very likely constitutes a limited
resource, and the act of counting may have a performance impact. Therefore
this patch set makes the HW statistics collection opt-in and togglable from
userspace on a per-netdevice basis.
Additionally, HW devices may have various limiting conditions under which
they can realize the counter. Therefore it is also possible to query
whether the requested counter is realized by any driver. In TC parlance,
which is to a degree reused in this patch set, two values are recognized:
"request" tracks whether the user enabled collecting HW statistics, and
"used" tracks whether any HW statistics are actually collected.
In the past, this author has expressed the opinion that `a typical user
doing "ip -s l sh", including various scripts, wants to see the full
picture and not worry what's going on where'. While that would be nice,
unfortunately it cannot work:
- Packets that trap from the HW datapath to the SW datapath would be
double counted.
For a given netdevice, some traffic can be purely a SW artifact, and some
may flow through the HW object corresponding to the netdevice. But some
traffic can also get trapped to the SW datapath after bumping the HW
counter. It is not clear how to make sure double-counting does not occur
in the SW datapath in that case, while still making sure that possibly
divergent SW forwarding path gets bumped as appropriate.
So simply adding HW and SW stats may work roughly, most of the time, but
there are scenarios where the result is nonsensical.
- HW devices will have limitations as to what type of traffic they can
count.
In case of mlxsw, which is part of this patch set, there is no reasonable
way to count all traffic going through a certain netdevice, such as a
VLAN netdevice enslaved to a bridge. It is however very simple to count
traffic flowing through an L3 object, such as a VLAN netdevice with an IP
address.
Similarly for physical netdevices, the L3 object at which the counter is
installed is the subport carrying untagged traffic.
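The double-counting concern from the first point above can be made concrete with a small arithmetic sketch. The struct and function names are hypothetical; the model simply assumes that a trapped packet bumps the HW counter first and is then seen again by the SW datapath.

```c
#include <stdint.h>

/* Hypothetical per-netdevice packet counts for one interval. */
struct ex_counts {
	uint64_t sw_only; /* packets seen only by the SW datapath */
	uint64_t hw_only; /* packets forwarded entirely in HW */
	uint64_t trapped; /* packets counted in HW, then trapped to SW */
};

/* What the HW counter reports: everything that passed the HW object. */
uint64_t ex_hw_counter(const struct ex_counts *c)
{
	return c->hw_only + c->trapped;
}

/* What the SW counter reports: everything the SW datapath processed. */
uint64_t ex_sw_counter(const struct ex_counts *c)
{
	return c->sw_only + c->trapped;
}

/* Naive sum of both counters: trapped packets are included twice. */
uint64_t ex_naive_total(const struct ex_counts *c)
{
	return ex_hw_counter(c) + ex_sw_counter(c);
}

/* Packets that actually crossed the netdevice. */
uint64_t ex_true_total(const struct ex_counts *c)
{
	return c->sw_only + c->hw_only + c->trapped;
}
```

For example, with sw_only = 10, hw_only = 100 and trapped = 5, the naive sum is 120 while only 115 packets crossed the netdevice.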
These are not "just counters". It is important that the user understands
what is being counted. It would be incorrect to conflate these statistics
with another existing statistics suite.
To that end, this patch set introduces a statistics suite called "L3
stats". This label should make it easy to understand what is being counted,
and to decide whether a given device can or cannot implement this suite for
some type of netdevice. At the same time, the code is written to make
future extensions easy, should a device pop up that can implement a
different flavor of statistics suite (say L2, or an address-family-specific
suite).
For example, using a work-in-progress iproute2[1], to turn on and then list
the counters on a VLAN netdevice:
# ip stats set dev swp1.200 l3_stats on
# ip stats show dev swp1.200 group offload subgroup l3_stats
56: swp1.200: group offload subgroup l3_stats on used on
RX: bytes packets errors dropped missed mcast
0 0 0 0 0 0
TX: bytes packets errors dropped carrier collsns
0 0 0 0 0 0
The patchset progresses as follows:
- Patch #1 is a cleanup.
- In patch #2, remove the assumption that all LINK_OFFLOAD_XSTATS are
dev-backed.
The only attribute defined under the nest is currently
IFLA_OFFLOAD_XSTATS_CPU_HIT. L3_STATS differs from CPU_HIT in that the
driver that supplies the statistics is not the same as the driver that
implements the netdevice. Make the code compatible with this in patch #2.
- In patch #3, add the possibility to filter inside nests.
The filter_mask field of RTM_GETSTATS header determines which
top-level attributes should be included in the netlink response. This
saves processing time by only including the bits that the user cares
about instead of always dumping everything. This is doubly important
for HW-backed statistics that would typically require a trip to the
device to fetch the stats. In this patch, the UAPI is extended to
allow filtering inside IFLA_STATS_LINK_OFFLOAD_XSTATS in particular,
but the scheme is easily extensible to other nests as well.
- In patch #4, propagate extack where we need it.
In patch #5, make it possible to propagate errors from drivers to the
user.
- In patch #6, add the in-kernel APIs for keeping track of the new stats
suite, and the notifiers that the core uses to communicate with the
drivers.
- In patch #7, add UAPI for obtaining the new stats suite.
- In patch #8, add a new UAPI message, RTM_SETSTATS, which will carry
the message to toggle the newly-added stats suite.
In patch #9, add the toggle itself.
At this point the core is ready for drivers to add support for the new
stats suite.
- In patches #10, #11 and #12, apply small tweaks to mlxsw code.
- In patch #13, add support for L3 stats, which are realized as RIF
counters.
- Finally in patch #14, a selftest is added to the net/forwarding
directory. Technically this is a HW-specific test, in that without a HW
implementing the counters, it just will not pass. But devices that
support L3 statistics at all are likely to be able to reuse this
selftest, so it seems appropriate to put it in the general forwarding
directory.
We also have a netdevsim implementation, and a corresponding selftest that
verifies specifically some of the core code. We intend to contribute these
later. Interested parties can take a look at the raw code at [2].
[1] https://github.com/pmachata/iproute2/commits/soft_counters
[2] https://github.com/pmachata/linux_mlxsw/commits/petrm_soft_counters_2
v2:
- Patch #3:
- Do not declare strict_start_type at the new policies, since they are
used with nla_parse_nested() (sans _deprecated).
- Use NLA_POLICY_NESTED to declare what the nest contents should be
- Use NLA_POLICY_MASK instead of BITFIELD32 for the filtering
attribute.
- Patch #6:
- s/monotonous/monotonic/ in commit message
- Use a newly-added struct rtnl_hw_stats64 for stats transfer
- Patch #7:
- Use a newly-added struct rtnl_hw_stats64 for stats transfer
- Patch #8:
- Do not declare strict_start_type at the new policies, since they are
used with nla_parse_nested() (sans _deprecated).
- Patch #13:
- Use a newly-added struct rtnl_hw_stats64 for stats transfer
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 2 Mar 2022 16:31:28 +0000 (18:31 +0200)]
selftests: forwarding: hw_stats_l3: Add a new test
Add a test that verifies operation of L3 HW statistics.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 2 Mar 2022 16:31:27 +0000 (18:31 +0200)]
mlxsw: Add support for IFLA_OFFLOAD_XSTATS_L3_STATS
Spectrum machines support L3 stats by binding a counter to a RIF, a
hardware object representing a router interface. Recognize the netdevice
notifier events, NETDEV_OFFLOAD_XSTATS_*, to support enablement,
disablement, and reporting back to core.
As a netdevice gains a RIF, if L3 stats are enabled, install the counters,
and ping the core so that a userspace notification can be emitted.
Similarly, as a netdevice loses a RIF, push the as-yet-unreported
statistics to the core, so that they are not lost, and ping the core to
emit userspace notification.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 2 Mar 2022 16:31:26 +0000 (18:31 +0200)]
mlxsw: Extract classification of router-related events to a helper
Several more events are coming in the following patches, and extending the
if statement is getting awkward. Instead, convert it to a switch.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 2 Mar 2022 16:31:25 +0000 (18:31 +0200)]
mlxsw: spectrum_router: Drop mlxsw_sp arg from counter alloc/free functions
The mlxsw_sp reference is carried by the mlxsw_sp_rif object that is passed
to these functions as well. Just deduce the former from the latter,
and drop the explicit mlxsw_sp parameter. Adapt callers.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 2 Mar 2022 16:31:24 +0000 (18:31 +0200)]
mlxsw: reg: Fix packing of router interface counters
The function mlxsw_reg_ritr_counter_pack() formats a register to configure
a router interface (RIF) counter. The parameter `egress' determines whether
an ingress or egress counter is to be configured. RITR, the register in
question, has two sets of counter-related fields: one for ingress, one for
egress. When setting values of the fields, the function sets the proper
counter index field, but when setting the counter type, it always sets the
egress field. Thus configuration of ingress counters is broken, and in fact
an attempt to configure an ingress counter mangles a previously configured
egress counter.
This was never discovered, because there is currently no way to enable
ingress counters on a router interface, only the egress one.
Fix in an obvious way.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
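The packing bug described above can be reproduced with a toy register image. This is only a sketch: struct ex_ritr and both helpers are hypothetical simplifications and do not reflect the real RITR layout or the mlxsw register macros.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register image with separate ingress and egress fields. */
struct ex_ritr {
	uint32_t ingress_index;
	uint8_t ingress_type;
	uint32_t egress_index;
	uint8_t egress_type;
};

/* Buggy variant: the index goes to the right side, the type never does. */
void ex_counter_pack_buggy(struct ex_ritr *r, uint32_t index,
			   uint8_t type, bool egress)
{
	if (egress)
		r->egress_index = index;
	else
		r->ingress_index = index;
	r->egress_type = type; /* wrong when egress == false */
}

/* Fixed variant: index and type are both set on the selected side. */
void ex_counter_pack_fixed(struct ex_ritr *r, uint32_t index,
			   uint8_t type, bool egress)
{
	if (egress) {
		r->egress_index = index;
		r->egress_type = type;
	} else {
		r->ingress_index = index;
		r->ingress_type = type;
	}
}
```

Configuring an egress counter and then an ingress counter with the buggy helper overwrites the egress type and leaves the ingress type unset, which matches the mangling described in the commit message.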
Petr Machata [Wed, 2 Mar 2022 16:31:23 +0000 (18:31 +0200)]
net: rtnetlink: Add UAPI toggle for IFLA_OFFLOAD_XSTATS_L3_STATS
The offloaded HW stats are designed to allow per-netdevice enablement and
disablement. Add an attribute, IFLA_STATS_SET_OFFLOAD_XSTATS_L3_STATS,
which should be carried by the RTM_SETSTATS message, and expresses a desire
to toggle L3 offload xstats on or off.
As part of the above, add an exported function rtnl_offload_xstats_notify()
that drivers can use when they have installed or deinstalled the counters
backing the HW stats.
At this point, it is possible to enable, disable and query L3 offload
xstats on netdevices. (However there is no driver actually implementing
these.)
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 2 Mar 2022 16:31:22 +0000 (18:31 +0200)]
net: rtnetlink: Add RTM_SETSTATS
The offloaded HW stats are designed to allow per-netdevice enablement and
disablement. These stats are only accessible through RTM_GETSTATS, and
therefore should be toggled by a RTM_SETSTATS message. Add it, and the
necessary skeleton handler.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 2 Mar 2022 16:31:21 +0000 (18:31 +0200)]
net: rtnetlink: Add UAPI for obtaining L3 offload xstats
Add a new IFLA_STATS_LINK_OFFLOAD_XSTATS child attribute,
IFLA_OFFLOAD_XSTATS_L3_STATS, to carry statistics for traffic that takes
place in a HW router.
The offloaded HW stats are designed to allow per-netdevice enablement and
disablement. Additionally, as a netdevice is configured, it may become or
cease being suitable for binding of a HW counter. Both of these aspects
need to be communicated to the userspace. To that end, add another child
attribute, IFLA_OFFLOAD_XSTATS_HW_S_INFO:
- attr nest IFLA_OFFLOAD_XSTATS_HW_S_INFO
- attr nest IFLA_OFFLOAD_XSTATS_L3_STATS
- attr IFLA_OFFLOAD_XSTATS_HW_S_INFO_REQUEST
- {0,1} as u8
- attr IFLA_OFFLOAD_XSTATS_HW_S_INFO_USED
- {0,1} as u8
Thus this one attribute is a nest that can be used to carry information
about various types of HW statistics, and indexing is very simply done by
wrapping the information for a given statistics suite into the attribute
that carries the suite in the RTM_GETSTATS query. At the same time, because
_HW_S_INFO is nested directly below IFLA_STATS_LINK_OFFLOAD_XSTATS, it is
possible through filtering to request only the metadata about individual
statistics suites, without having to hit the HW to get the actual counters.
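As a rough model of the two u8 values carried per suite, the sketch below derives "request" and "used" from two inputs. It is an assumption of this example, not something stated above, that "used" is only reported while the suite is requested; all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the two {0,1} values carried for one suite. */
struct ex_hw_s_info {
	uint8_t request; /* user asked for HW statistics */
	uint8_t used;    /* some driver actually backs them */
};

/*
 * Assumption for this sketch: a suite can only be "used" while it is
 * requested, and only if at least one driver reports a counter for it.
 */
struct ex_hw_s_info ex_hw_s_info_query(bool user_enabled, int drivers_backing)
{
	struct ex_hw_s_info info;

	info.request = user_enabled ? 1 : 0;
	info.used = (user_enabled && drivers_backing > 0) ? 1 : 0;
	return info;
}
```

A result of request = 1, used = 0 then tells userspace that the suite is enabled but no device currently realizes the counter.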
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 2 Mar 2022 16:31:20 +0000 (18:31 +0200)]
net: dev: Add hardware stats support
Offloading switch device drivers may be able to collect statistics of the
traffic taking place in the HW datapath that pertains to a certain soft
netdevice, such as VLAN. Add the necessary infrastructure to allow exposing
these statistics to the offloaded netdevice in question. The API was shaped
by the following considerations:
- Collection of HW statistics is not free: there may be a finite number of
counters, and the act of counting may have a performance impact. It is
therefore necessary to allow toggling whether HW counting should be done
for any particular SW netdevice.
- As the drivers are loaded and removed, a particular device may get
offloaded and unoffloaded again. At the same time, the statistics values
need to stay monotonic (modulo the eventual 64-bit wraparound),
increasing only to reflect traffic measured in the device.
To that end, the netdevice keeps around a lazily-allocated copy of struct
rtnl_link_stats64. Device drivers then contribute to the values kept
therein at various points. Even as the driver goes away, the struct stays
around to maintain the statistics values.
- Different HW devices may be able to count different things. The
motivation behind this patch in particular is exposure of HW counters on
Nvidia Spectrum switches, where the only practical approach to counting
traffic on offloaded soft netdevices currently is to use router interface
counters, and count L3 traffic. Correspondingly that is the statistics
suite added in this patch.
Other devices may be able to measure different kinds of traffic, and for
that reason, the APIs are built to allow uniform access to different
statistics suites.
- Because soft netdevices and offloading drivers are only loosely bound, a
netdevice uses a notifier chain to communicate with the drivers. Several
new notifiers, NETDEV_OFFLOAD_XSTATS_*, have been added to carry messages
to the offloading drivers.
- Devices can have various conditions for when a particular counter is
available. As the device is configured and reconfigured, the device
offload may become or cease being suitable for counter binding. A
netdevice can use a notifier type NETDEV_OFFLOAD_XSTATS_REPORT_USED to
ping offloading drivers and determine whether anyone currently implements
a given statistics suite. This information can then be propagated to user
space.
When the driver decides to unoffload a netdevice, it can use a
newly-added function, netdev_offload_xstats_report_delta(), to record
outstanding collected statistics, before destroying the HW counter.
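The monotonicity requirement and the role of reporting a delta before the HW counter is destroyed can be sketched as follows. This is a userspace model under simplifying assumptions: struct ex_netdev, ex_report_delta() and ex_unoffload() are hypothetical and only loosely follow the idea behind netdev_offload_xstats_report_delta().

```c
#include <stdint.h>

struct ex_hw_stats {
	uint64_t rx_packets;
	uint64_t tx_packets;
};

/* The netdevice keeps this copy across offload and unoffload cycles. */
struct ex_netdev {
	struct ex_hw_stats kept;
};

/* Fold outstanding HW counts into the copy kept by the netdevice. */
void ex_report_delta(struct ex_netdev *dev, const struct ex_hw_stats *delta)
{
	dev->kept.rx_packets += delta->rx_packets;
	dev->kept.tx_packets += delta->tx_packets;
}

/* Driver side: flush what the counter holds, then "destroy" it. */
void ex_unoffload(struct ex_netdev *dev, struct ex_hw_stats *hw)
{
	ex_report_delta(dev, hw);
	hw->rx_packets = 0;
	hw->tx_packets = 0;
}

/* Value visible to the user: kept history plus the live counter. */
uint64_t ex_rx_total(const struct ex_netdev *dev, const struct ex_hw_stats *hw)
{
	return dev->kept.rx_packets + hw->rx_packets;
}
```

Because the delta is folded in before the counter is reset, the reported total never decreases when a device is unoffloaded and later offloaded again.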
This patch adds a helper, call_netdevice_notifiers_info_robust(), for
dispatching a notifier with the possibility of unwind when one of the
consumers bails. Given the wish to eventually get rid of the global
notifier block altogether, this helper only invokes the per-netns notifier
block.
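The "dispatch with unwind" behaviour described above can be sketched in plain C. This is a simplified model, not the kernel helper: the chain is a plain array, and ex_call_notifiers_robust() together with the two sample consumers are hypothetical names.

```c
#include <stddef.h>

typedef int (*ex_notifier_fn)(int event, void *ctx);

enum { EX_ENABLE = 1, EX_DISABLE = 2 };

/*
 * Deliver event_up to each consumer in order. If one of them fails,
 * deliver event_down to the consumers that already succeeded, in
 * reverse order, and return the error.
 */
int ex_call_notifiers_robust(ex_notifier_fn *chain, size_t n,
			     int event_up, int event_down, void *ctx)
{
	size_t i;
	int err = 0;

	for (i = 0; i < n; i++) {
		err = chain[i](event_up, ctx);
		if (err)
			goto unwind;
	}
	return 0;

unwind:
	while (i-- > 0)
		chain[i](event_down, ctx);
	return err;
}

/* Sample consumer: tracks how many enables are currently in effect. */
int ex_counting_consumer(int event, void *ctx)
{
	int *enabled = ctx;

	*enabled += (event == EX_ENABLE) ? 1 : -1;
	return 0;
}

/* Sample consumer that refuses to enable. */
int ex_failing_consumer(int event, void *ctx)
{
	(void)ctx;
	return event == EX_ENABLE ? -1 : 0;
}
```

When the last consumer bails, the earlier ones see a matching disable event, so no consumer is left believing the feature is on.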
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 2 Mar 2022 16:31:19 +0000 (18:31 +0200)]
net: rtnetlink: rtnl_fill_statsinfo(): Permit non-EMSGSIZE error returns
Obtaining stats for the IFLA_STATS_LINK_OFFLOAD_XSTATS nest involves a HW
access, and can fail for more reasons than just netlink message size
exhaustion. Therefore do not always return -EMSGSIZE on the failure path,
but respect the error code provided by the callee. Set the error explicitly
where it is reasonable to assume -EMSGSIZE as the failure reason.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
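A minimal sketch of the error-handling change described above: the caller sets -EMSGSIZE itself only where running out of message space is the plausible cause, and otherwise passes the callee's error through. The functions and their parameters are hypothetical stand-ins, not the rtnetlink code.

```c
#include <errno.h>

/* Hypothetical callee: may fail on a HW access or for lack of room. */
int ex_fill_offload_xstats(int hw_err, int room)
{
	if (hw_err)
		return hw_err;    /* e.g. -EIO from the device */
	if (room <= 0)
		return -EMSGSIZE; /* message buffer exhausted */
	return 0;
}

/* Caller: respects the callee's error instead of forcing -EMSGSIZE. */
int ex_fill_statsinfo(int hdr_room, int hw_err, int room)
{
	int err;

	if (hdr_room <= 0) {
		err = -EMSGSIZE;  /* only a size problem is possible here */
		goto cancel;
	}

	err = ex_fill_offload_xstats(hw_err, room);
	if (err)
		goto cancel;      /* keep whatever the callee reported */

	return 0;

cancel:
	return err;
}
```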
Petr Machata [Wed, 2 Mar 2022 16:31:18 +0000 (18:31 +0200)]
net: rtnetlink: Propagate extack to rtnl_offload_xstats_fill()
Later patches add handlers for more HW-backed statistics. An extack will be
useful when communicating HW / driver errors to the client. Add the
arguments as appropriate.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 2 Mar 2022 16:31:17 +0000 (18:31 +0200)]
net: rtnetlink: RTM_GETSTATS: Allow filtering inside nests
The filter_mask field of RTM_GETSTATS header determines which top-level
attributes should be included in the netlink response. This saves
processing time by only including the bits that the user cares about
instead of always dumping everything. This is doubly important for
HW-backed statistics that would typically require a trip to the device to
fetch the stats.
So far there was only one HW-backed stat suite per attribute. However,
IFLA_STATS_LINK_OFFLOAD_XSTATS is a nest, and will gain a new stat suite in
the following patches. It would therefore be advantageous to be able to
filter within that nest, and select just one or the other HW-backed
statistics suite.
Extend rtnetlink so that RTM_GETSTATS permits attributes in the payload.
The scheme is as follows:
- RTM_GETSTATS
- struct if_stats_msg
- attr nest IFLA_STATS_GET_FILTERS
- attr IFLA_STATS_LINK_OFFLOAD_XSTATS
- u32 filter_mask
This scheme reuses the existing enumerators by nesting them in a dedicated
context attribute. This is covered by policies as usual, therefore a
gradual opt-in is possible. Currently only IFLA_STATS_LINK_OFFLOAD_XSTATS
nest has filtering enabled, because for the SW counters the issue does not
seem to be that important.
rtnl_offload_xstats_get_size() and _fill() are extended to observe the
requested filters.
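The per-nest filter described above is a plain bitmask with one bit per child attribute. The sketch below uses stand-in attribute IDs and a stand-in macro (EX_* names are hypothetical) and assumes the "attribute N maps to bit N-1" convention; the real identifiers live in the UAPI headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in child attribute IDs for the offload xstats nest. */
enum {
	EX_XSTATS_UNSPEC,
	EX_XSTATS_CPU_HIT,
	EX_XSTATS_HW_S_INFO,
	EX_XSTATS_L3_STATS,
};

/* One bit per attribute: attribute 1 maps to bit 0, and so on. */
#define EX_FILTER_BIT(attr) (1u << ((attr) - 1))

/* Does the u32 filter mask select this child attribute? */
bool ex_filter_wants(uint32_t mask, int attr)
{
	return (mask & EX_FILTER_BIT(attr)) != 0;
}

/* Count how many child attributes a response would carry. */
int ex_count_selected(uint32_t mask)
{
	int attr, n = 0;

	for (attr = EX_XSTATS_CPU_HIT; attr <= EX_XSTATS_L3_STATS; attr++)
		if (ex_filter_wants(mask, attr))
			n++;
	return n;
}
```

A client that only wants the metadata would pass a mask with just the HW_S_INFO bit set, so the fill path can skip the attributes that require a trip to the device.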
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 2 Mar 2022 16:31:16 +0000 (18:31 +0200)]
net: rtnetlink: Stop assuming that IFLA_OFFLOAD_XSTATS_* are dev-backed
The IFLA_STATS_LINK_OFFLOAD_XSTATS attribute is a nest whose child
attributes carry various special hardware statistics. The code that handles
this nest was written with the idea that all these statistics would be
exposed by the device driver of a physical netdevice.
In the following patches, a new attribute is added to the abovementioned
nest, which however can be defined for some soft netdevices. The NDO-based
approach to querying these does not work, because it is not the soft
netdevice driver that exposes these statistics, but an offloading NIC
driver that does so.
The current code does not scale well to this usage. Simply rewrite it back
to the pattern seen in other fill-like and get_size-like functions
elsewhere.
Extract to helpers the code that is concerned with handling specifically
NDO-backed statistics so that it can be easily reused should more such
statistics be added.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 2 Mar 2022 16:31:15 +0000 (18:31 +0200)]
net: rtnetlink: Namespace functions related to IFLA_OFFLOAD_XSTATS_*
The currently used names rtnl_get_offload_stats() and
rtnl_get_offload_stats_size() do not clearly show the namespace. The former
function additionally seems to have been named this way in accordance with
the NDO name, as opposed to the naming used in the rtnetlink.c file (and
indeed elsewhere in the netlink handling code). As more and
differently-flavored attributes are introduced, a common clear prefix is
needed for all related functions.
Rename the functions to follow the rtnl_offload_xstats_* naming scheme.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Manish Chopra [Wed, 2 Mar 2022 10:52:22 +0000 (02:52 -0800)]
qed: validate and restrict untrusted VFs vlan promisc mode
Today, when VFs are put in promiscuous mode, they can request the PF
to configure the device so that they receive traffic for all VLANs, regardless
of which VLAN the PF has configured (via ip link), and the PF allows this
config request regardless of whether the VF is trusted.
From a security point of view, when a VLAN is configured for a VF through the
PF (via ip link), honour such config requests from the VF only when it is
configured to be trusted; otherwise restrict the VF's VLAN promisc mode config.
Cc: stable@vger.kernel.org
Fixes: f990c82c385b ("qed*: Add support for ndo_set_vf_trust")
Signed-off-by: Manish Chopra <manishc@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
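The policy described above reduces to a small predicate. The sketch below is a hypothetical model (struct ex_vf and ex_vf_vlan_promisc_allowed() are not qed names) of when a VF's request to receive all VLANs would be honoured.

```c
#include <stdbool.h>

/* Hypothetical view of the VF state the PF consults. */
struct ex_vf {
	bool trusted;     /* VF marked as trusted by the PF administrator */
	bool pf_vlan_set; /* PF configured a VLAN for this VF */
};

/* May the VF's request to accept all VLANs be honoured? */
bool ex_vf_vlan_promisc_allowed(const struct ex_vf *vf)
{
	if (!vf->pf_vlan_set)
		return true;  /* no PF-imposed VLAN that could be bypassed */
	return vf->trusted;   /* PF VLAN present: only trusted VFs */
}
```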
Manish Chopra [Wed, 2 Mar 2022 10:52:21 +0000 (02:52 -0800)]
qed: display VF trust config
The driver supports SR-IOV VF trust configuration, but
does not display it when queried via the ip link utility.
Cc: stable@vger.kernel.org
Fixes: f990c82c385b ("qed*: Add support for ndo_set_vf_trust")
Signed-off-by: Manish Chopra <manishc@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 3 Mar 2022 10:14:06 +0000 (10:14 +0000)]
Merge branch 'stmmac-SA8155p-ADP'
Bhupesh Sharma says:
====================
net: stmmac: Enable support for Qualcomm SA8155p-ADP board
Changes since v1:
-----------------
- v1 can be seen here: https://lore.kernel.org/netdev/20220126221725.710167-1-bhupesh.sharma@linaro.org/t/
- Fixed review comments from Bjorn - broke the v1 series into two
separate series - one each for 'net' tree and 'arm clock/dts' tree
- so as to ease review of the same from the respective maintainers.
- This series is intended for the 'net' tree.
The SA8155p-ADP board supports on-board ethernet (Gigabit Interface),
with support for both RGMII and RMII buses.
This patchset adds the support for the same.
Note that this patchset is based on an earlier sent patchset
for adding PDC controller support on SM8150 (see [1]).
[1]. https://lore.kernel.org/linux-arm-msm/20220226184028.111566-1-bhupesh.sharma@linaro.org/T/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Bjorn Andersson [Wed, 2 Mar 2022 10:39:50 +0000 (16:09 +0530)]
net: stmmac: dwmac-qcom-ethqos: Adjust rgmii loopback_en per platform
Not all platforms should have RGMII_CONFIG_LOOPBACK_EN set, and the result is
about 50% packet loss on incoming messages. So make it possible to
configure this per compatible and enable it for QCS404.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
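Making a setting configurable "per compatible" usually means attaching it to per-platform driver data. The sketch below shows that pattern in userspace C. The struct, the lookup helper, the bit position EX_LOOPBACK_EN and the compatible strings in the table are illustrative assumptions, not taken from the driver.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define EX_LOOPBACK_EN (1u << 2) /* hypothetical bit position */

/* Hypothetical per-platform data, selected by compatible string. */
struct ex_ethqos_data {
	const char *compatible;
	bool loopback_en;
};

static const struct ex_ethqos_data ex_ethqos_table[] = {
	{ "qcom,qcs404-ethqos", true },
	{ "qcom,sm8150-ethqos", false },
};

/* Find the data entry for a compatible string, or NULL if unknown. */
const struct ex_ethqos_data *ex_ethqos_match(const char *compatible)
{
	size_t i;

	for (i = 0; i < sizeof(ex_ethqos_table) / sizeof(ex_ethqos_table[0]); i++)
		if (strcmp(ex_ethqos_table[i].compatible, compatible) == 0)
			return &ex_ethqos_table[i];
	return NULL;
}

/* Set or clear the loopback bit according to the platform data. */
unsigned int ex_apply_loopback(unsigned int reg, const struct ex_ethqos_data *d)
{
	if (d->loopback_en)
		reg |= EX_LOOPBACK_EN;
	else
		reg &= ~EX_LOOPBACK_EN;
	return reg;
}
```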
Vinod Koul [Wed, 2 Mar 2022 10:39:49 +0000 (16:09 +0530)]
net: stmmac: Add support for SM8150
This adds compatible, POR config & driver data for ethernet controller
found in SM8150 SoC.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Vinod Koul <vkoul@kernel.org>
[bhsharma: Massage the commit log and other cosmetic changes]
Signed-off-by: Bhupesh Sharma <bhupesh.sharma@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 3 Mar 2022 09:55:28 +0000 (09:55 +0000)]
Merge branch 'page_pool-stats'
Joe Damato says:
====================
page_pool: Add stats counters
Greetings:
Welcome to v9.
This revision adds a commit which updates the page_pool documentation to
describe the stats API, structures, and fields.
Additionally, this revision contains a minor cosmetic change suggested by
Saeed in page_pool_recycle_in_ring in commit 2: "page_pool: Add recycle
stats", which removes an unnecessary #ifdef.
There are no functional changes in this revision.
Benchmark output from the v7 cover [1] is pasted below, as it is still
relevant since no functional changes have been made in this revision:
Benchmarks have been re-run. As always, results between runs are highly
variable; you'll find results showing that stats disabled are both faster
and slower than stats enabled in back to back benchmark runs.
Raw benchmark output with stats off [2] and stats on [3] are available for
examination.
Test system:
- 2x Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
- 2 NUMA zones, with 18 cores per zone and 2 threads per core
bench_page_pool_simple results, loops=200000000
test name                       stats enabled        stats disabled
                                cycles    nanosec    cycles    nanosec
for_loop                             0      0.335         0      0.336
atomic_inc                          14      6.106        13      6.022
lock                                30     13.365        32     13.968
no-softirq-page_pool01              75     32.884        74     32.308
no-softirq-page_pool02              79     34.696        74     32.302
no-softirq-page_pool03             110     48.005       105     46.073
tasklet_page_pool01_fast_path       14      6.156        14      6.211
tasklet_page_pool02_ptr_ring        41     18.028        39     17.391
tasklet_page_pool03_slow           107     46.646       105     46.123
bench_page_pool_cross_cpu results, loops=20000000 returning_cpus=4:
test name                       stats enabled        stats disabled
                                cycles    nanosec    cycles    nanosec
page_pool_cross_cpu CPU(0)        3973   1731.596      4015   1750.015
page_pool_cross_cpu CPU(1)        3976   1733.217      4022   1752.864
page_pool_cross_cpu CPU(2)        3973   1731.615      4016   1750.433
page_pool_cross_cpu CPU(3)        3976   1733.218      4021   1752.806
page_pool_cross_cpu CPU(4)         994    433.305      1005    438.217
page_pool_cross_cpu average       3378          -      3415          -
bench_page_pool_cross_cpu results, loops=20000000 returning_cpus=8:
test name                       stats enabled        stats disabled
                                cycles    nanosec    cycles    nanosec
page_pool_cross_cpu CPU(0)        6969   3037.488      6909   3011.463
page_pool_cross_cpu CPU(1)        6974   3039.469      6913   3012.961
page_pool_cross_cpu CPU(2)        6969   3037.575      6910   3011.585
page_pool_cross_cpu CPU(3)        6974   3039.415      6913   3012.961
page_pool_cross_cpu CPU(4)        6969   3037.288      6909   3011.368
page_pool_cross_cpu CPU(5)        6972   3038.732      6913   3012.920
page_pool_cross_cpu CPU(6)        6969   3037.350      6909   3011.386
page_pool_cross_cpu CPU(7)        6973   3039.356      6913   3012.921
page_pool_cross_cpu CPU(8)         871    379.934       864    376.620
page_pool_cross_cpu average       6293          -      6239          -
Thanks.
[1]: https://lore.kernel.org/all/1645810914-35485-1-git-send-email-jdamato@fastly.com/
[2]: https://gist.githubusercontent.com/jdamato-fsly/d7c34b9fa7be1ce132a266b0f2b92aea/raw/327dcd71d11ece10238fbf19e0472afbcbf22fd4/v7_stats_disabled
[3]: https://gist.githubusercontent.com/jdamato-fsly/d7c34b9fa7be1ce132a266b0f2b92aea/raw/327dcd71d11ece10238fbf19e0472afbcbf22fd4/v7_stats_enabled
v8 -> v9:
- Add documentation about the page_pool_get_stats API, stats
structures, and fields to Documentation/networking/page_pool.rst.
- Remove unnecessary #ifdef in page_pool_recycle_in_ring.
v7 -> v8:
- Rename mlx5 ethtool stats so that users have a better idea of
their meaning.
v6 -> v7:
- stats split out into two structs: a single per-page-pool struct
for allocation path stats, and a per-cpu pointer for recycle
path stats.
- page_pool_get_stats updated to use a wrapper struct to gather
stats for allocation and recycle stats with a single argument.
- placement of structs adjusted
- mlx5 driver modified to use page_pool_get_stats API
v5 -> v6:
- Per cpu page_pool_stats struct pointer is now marked as
____cacheline_aligned_in_smp. Placement of the field in the
struct is unchanged; it is the last field.
v4 -> v5:
- Fixed the description of the kernel option in Kconfig.
- Squashed commits 1-10 from v4 into a single commit for easier
review.
- Changed the comment style of the comment for
the this_cpu_inc_alloc_stat macro.
- Changed the return type of page_pool_get_stats from struct
page_pool_stat * to bool.
v3 -> v4:
- Restructured stats to be per-cpu per-pool.
- Global stats and proc file were removed.
- Exposed an API (page_pool_get_stats) for batching the pool stats.
v2 -> v3:
- patch 8/10 ("Add stat tracking cache refill") fixed placement of
counter increment.
- patch 10/10 ("net-procfs: Show page pool stats in proc") updated:
- fix unused label warning from kernel test robot,
- fixed page_pool_seq_show to only display the refill stat
once,
- added a remove_proc_entry for page_pool_stat to
dev_proc_net_exit.
v1 -> v2:
- A new kernel config option has been added, which defaults to N,
preventing this code from being compiled in by default
- The stats structure has been converted to a per-cpu structure
- The stats are now exported via proc (/proc/net/page_pool_stat)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Damato [Wed, 2 Mar 2022 07:55:51 +0000 (23:55 -0800)]
mlx5: add support for page_pool_get_stats
This change adds support for the page_pool_get_stats API to mlx5. If the
user has enabled CONFIG_PAGE_POOL_STATS in their kernel, ethtool will
output page pool stats.
Signed-off-by: Joe Damato <jdamato@fastly.com>
Acked-by: Saeed Mahameed <saeed@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Damato [Wed, 2 Mar 2022 07:55:50 +0000 (23:55 -0800)]
Documentation: update networking/page_pool.rst
Add the new stats API, kernel config parameter, and stats structure
information to the page_pool documentation.
Signed-off-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Damato [Wed, 2 Mar 2022 07:55:49 +0000 (23:55 -0800)]
page_pool: Add function to batch and return stats
Adds a function page_pool_get_stats which can be used by drivers to obtain
stats for a specified page_pool.
Signed-off-by: Joe Damato <jdamato@fastly.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Damato [Wed, 2 Mar 2022 07:55:48 +0000 (23:55 -0800)]
page_pool: Add recycle stats
Add per-cpu stats tracking page pool recycling events:
- cached: recycling placed page in the page pool cache
- cache_full: page pool cache was full
- ring: page placed into the ptr ring
- ring_full: page released from page pool because the ptr ring was full
- released_refcnt: page released (and not recycled) because refcnt > 1
Signed-off-by: Joe Damato <jdamato@fastly.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
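Per-cpu recycle counters only become useful to a driver once they are summed into one set of totals, which is the batching role described for page_pool_get_stats. The sketch below models that step in userspace. The fixed CPU count, struct ex_pool and ex_pool_get_recycle_stats() are hypothetical; only the counter names are taken from the list above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EX_NR_CPUS 4 /* fixed CPU count, for the sketch only */

/* Counter names follow the recycle events listed above. */
struct ex_recycle_stats {
	uint64_t cached;
	uint64_t cache_full;
	uint64_t ring;
	uint64_t ring_full;
	uint64_t released_refcnt;
};

/* Hypothetical pool with one counter set per CPU. */
struct ex_pool {
	struct ex_recycle_stats percpu[EX_NR_CPUS];
};

/* Sum the per-cpu counters into *out; returns false on bad arguments. */
bool ex_pool_get_recycle_stats(const struct ex_pool *pool,
			       struct ex_recycle_stats *out)
{
	struct ex_recycle_stats sum = { 0, 0, 0, 0, 0 };
	size_t cpu;

	if (pool == NULL || out == NULL)
		return false;

	for (cpu = 0; cpu < EX_NR_CPUS; cpu++) {
		const struct ex_recycle_stats *s = &pool->percpu[cpu];

		sum.cached += s->cached;
		sum.cache_full += s->cache_full;
		sum.ring += s->ring;
		sum.ring_full += s->ring_full;
		sum.released_refcnt += s->released_refcnt;
	}

	*out = sum;
	return true;
}
```

Keeping the counters per CPU means the hot path only touches CPU-local memory, and the cost of summing is paid only when someone asks for the stats.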
Joe Damato [Wed, 2 Mar 2022 07:55:47 +0000 (23:55 -0800)]
page_pool: Add allocation stats
Add per-pool statistics counters for the allocation path of a page pool.
These stats are incremented in softirq context, so no locking or per-cpu
variables are needed.
This code is disabled by default and a kernel config option is provided for
users who wish to enable it.
The statistics added are:
- fast: successful fast path allocations
- slow: slow path order-0 allocations
- slow_high_order: slow path high order allocations
- empty: ptr ring is empty, so a slow path allocation was forced.
- refill: an allocation which triggered a refill of the cache
- waive: pages obtained from the ptr ring that cannot be added to
the cache due to a NUMA mismatch.
Signed-off-by: Joe Damato <jdamato@fastly.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tao Chen [Tue, 1 Mar 2022 14:35:42 +0000 (06:35 -0800)]
tcp: Remove the unused api
Last tcp_write_queue_head() use was removed in commit
114f39feab36 ("tcp: restore autocorking"), so remove it.
Signed-off-by: Tao Chen <chentao3@hotmail.com>
Link: https://lore.kernel.org/r/SYZP282MB33317DEE1253B37C0F57231E86029@SYZP282MB3331.AUSP282.PROD.OUTLOOK.COM
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kurt Kanzenbach [Mon, 28 Feb 2022 19:58:56 +0000 (20:58 +0100)]
flow_dissector: Add support for HSR
Network drivers such as igb or igc call eth_get_headlen() to determine the
header length for the skbs they construct in the receive path.
When running HSR on top of these drivers, it results in triggering BUG_ON() in
skb_pull(). The reason is the skb headlen is not sufficient for HSR to work
correctly. skb_pull() notices that.
For instance, eth_get_headlen() returns 14 bytes for TCP traffic over HSR which
is not correct. The problem is, the flow dissection code does not take HSR into
account. Therefore, add support for it.
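For illustration only: a small Python model of the header-length computation described above, not the kernel's flow dissector. It assumes the HSR EtherType 0x892F and a 6-byte HSR tag whose last two bytes carry the encapsulated protocol; the function names and the synthetic frames are hypothetical.

```python
import struct

ETH_HLEN = 14        # dst MAC + src MAC + EtherType
ETH_P_IP = 0x0800
ETH_P_HSR = 0x892F   # assumed HSR EtherType
HSR_TAG_LEN = 6      # path/LSDU size, sequence number, encapsulated proto
IPPROTO_TCP = 6


def header_length(frame, hsr_aware=True):
    """Return the number of header bytes (L2 + IPv4 + TCP) in frame."""
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    offset = ETH_HLEN
    if hsr_aware and ethertype == ETH_P_HSR:
        # The encapsulated protocol sits in the last 2 bytes of the tag.
        (ethertype,) = struct.unpack_from("!H", frame, offset + 4)
        offset += HSR_TAG_LEN
    if ethertype != ETH_P_IP:
        return offset            # unknown L3: only the L2 header is counted
    ihl = (frame[offset] & 0x0F) * 4
    proto = frame[offset + 9]
    offset += ihl
    if proto != IPPROTO_TCP:
        return offset
    data_offset = (frame[offset + 12] >> 4) * 4
    return offset + data_offset


def build_frame(hsr):
    """Build a zero-filled TCP/IPv4 frame, optionally HSR-tagged."""
    macs = bytes(12)
    ip = bytes([0x45]) + bytes(8) + bytes([IPPROTO_TCP]) + bytes(10)
    tcp = bytes(12) + bytes([0x50]) + bytes(7)
    if hsr:
        l2 = macs + struct.pack("!H", ETH_P_HSR) + bytes(4) + struct.pack("!H", ETH_P_IP)
    else:
        l2 = macs + struct.pack("!H", ETH_P_IP)
    return l2 + ip + tcp


plain_len = header_length(build_frame(hsr=False))
hsr_len = header_length(build_frame(hsr=True))
unaware_len = header_length(build_frame(hsr=True), hsr_aware=False)
print(plain_len, hsr_len, unaware_len)
```

In this model a dissector that does not know HSR stops at the 14-byte Ethernet header, which matches the 14 bytes reported above for TCP over HSR.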
Reported-by: Anthony Harivel <anthony.harivel@linutronix.de>
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Link: https://lore.kernel.org/r/20220228195856.88187-1-kurt@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Baruch Siach [Mon, 28 Feb 2022 12:10:03 +0000 (14:10 +0200)]
net: dsa: mv88e6xxx: support RMII cmode
Add support for direct RMII MAC mode. This allows hardware with CPU port
connected in direct 100M fixed link to work properly.
Signed-off-by: Baruch Siach <baruch.siach@siklu.com>
Link: https://lore.kernel.org/r/a962d1ccbeec42daa10dd8aff0e66e31f0faf1eb.1646050203.git.baruch@tkos.co.il
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Baruch Siach [Mon, 28 Feb 2022 12:10:02 +0000 (14:10 +0200)]
net: dsa: mv88e6xxx: don't error out cmode set on missing lane
When the given cmode has no serdes, mv88e6xxx_serdes_get_lane() returns
-ENODEV. Earlier in the same function the code skips serdes handling in
this case. Do the same after cmode set.
Signed-off-by: Baruch Siach <baruch.siach@siklu.com>
Link: https://lore.kernel.org/r/cd95cf3422ae8daf297a01fa9ec3931b203cdf45.1646050203.git.baruch@tkos.co.il
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Yang Li [Sun, 27 Feb 2022 13:22:08 +0000 (21:22 +0800)]
net: openvswitch: remove unneeded semicolon
Eliminate the following coccicheck warning:
./net/openvswitch/flow.c:379:2-3: Unneeded semicolon
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220227132208.24658-1-yang.lee@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Baowen Zheng [Wed, 2 Mar 2022 03:29:29 +0000 (11:29 +0800)]
flow_offload: improve extack msg for user when adding invalid filter
Add an extack message so that the user gets an exact error message when
adding an invalid filter with conflicting flags for a TC action.
The previous implementation just returned EINVAL, which is confusing for the user.
Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Link: https://lore.kernel.org/r/1646191769-17761-1-git-send-email-baowen.zheng@corigine.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 3 Mar 2022 06:13:06 +0000 (22:13 -0800)]
Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
40GbE Intel Wired LAN Driver Updates 2022-03-01
This series contains updates to iavf driver only.
Mateusz adds support for interrupt moderation for 50G and 100G speeds
as well as support for the driver to specify a request as its primary
MAC address. He also refactors VLAN V2 capability exchange into more
generic extended capabilities to ease the addition of future
capabilities. Finally, he corrects the incorrect return of iavf_status
values and removes non-inclusive language.
Minghao Chi removes unneeded variables, instead returning values
directly.
* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
iavf: Remove non-inclusive language
iavf: Fix incorrect use of assigning iavf_status to int
iavf: stop leaking iavf_status as "errno" values
iavf: remove redundant ret variable
iavf: Add usage of new virtchnl format to set default MAC
iavf: refactor processing of VLAN V2 capability message
iavf: Add support for 50G/100G in AIM algorithm
====================
Link: https://lore.kernel.org/r/20220301185939.3005116-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Christophe JAILLET [Tue, 1 Mar 2022 13:12:12 +0000 (14:12 +0100)]
nfp: flower: Remove usage of the deprecated ida_simple_xxx API
Use ida_alloc_xxx()/ida_free() instead of
ida_simple_get()/ida_simple_remove().
The latter is deprecated and more verbose.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20220301131212.26348-1-simon.horman@corigine.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Tue, 1 Mar 2022 08:51:39 +0000 (08:51 +0000)]
net: sfp: use %pe for printing errors
Convert sfp to use %pe for printing error codes, which can print them
as errno symbols rather than numbers.
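For illustration only: a Python sketch of the effect described above, printing an error code as its errno symbol and falling back to the number when no symbol is known. The helper name and the log message are hypothetical; this models the output style, not the kernel's printk code.

```python
import errno


def format_err(err):
    """Render a negative errno value as a symbol, e.g. -5 as "-EIO"."""
    name = errno.errorcode.get(-err)
    return f"-{name}" if name else str(err)


# Hypothetical log line using the symbolic form.
msg = "failed to read EEPROM: " + format_err(-errno.EIO)
print(msg)
```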
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/E1nOyEN-00BuuE-OB@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Tue, 1 Mar 2022 08:51:34 +0000 (08:51 +0000)]
net: phylink: use %pe for printing errors
Convert phylink to use %pe for printing error codes, which can print
them as errno symbols rather than numbers.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/E1nOyEI-00Buu8-K9@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Harold Huang [Thu, 3 Mar 2022 02:24:40 +0000 (10:24 +0800)]
tuntap: add sanity checks about msg_controllen in sendmsg
In patch [1], tun_msg_ctl was added to allow passing batched XDP buffers to
tun_sendmsg. Although we do not use msg_controllen in this path, we should
check msg_controllen to make sure the caller passes a valid msg_ctl.
[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fe8dd45bb7556246c6b76277b1ba4296c91c2505
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Suggested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Harold Huang <baymaxhuang@gmail.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20220303022441.383865-1-baymaxhuang@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 3 Mar 2022 05:58:02 +0000 (21:58 -0800)]
Merge tag 'batadv-next-pullrequest-20220302' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
This cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
- Remove redundant 'flush_workqueue()' calls, by Christophe JAILLET
- Migrate to linux/container_of.h, by Sven Eckelmann
- Demote batadv-on-batadv skip error message, by Sven Eckelmann
* tag 'batadv-next-pullrequest-20220302' of git://git.open-mesh.org/linux-merge:
batman-adv: Demote batadv-on-batadv skip error message
batman-adv: Migrate to linux/container_of.h
batman-adv: Remove redundant 'flush_workqueue()' calls
batman-adv: Start new development cycle
====================
Link: https://lore.kernel.org/r/20220302163522.102842-1-sw@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Wang Qing [Wed, 2 Mar 2022 06:41:14 +0000 (22:41 -0800)]
net: hamradio: fix compilation error
Add the missing ")" that the previous commit left out.
Fixes: 61c4fb9c4d09 ("net: hamradio: use time_is_after_jiffies() instead of open coding it")
Link: https://lore.kernel.org/all/1646018012-61129-1-git-send-email-wangqing@vivo.com/
Signed-off-by: Wang Qing <wangqing@vivo.com>
Link: https://lore.kernel.org/r/1646203277-83159-1-git-send-email-wangqing@vivo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sven Eckelmann [Sun, 27 Feb 2022 22:40:40 +0000 (23:40 +0100)]
batman-adv: Demote batadv-on-batadv skip error message
The error message "Cannot find parent device" was shown for users of
macvtap (on batadv devices) whenever the macvtap was moved to a different
netns. This happens because macvtap doesn't provide an implementation for
rtnl_link_ops->get_link_net.
The situation for which this message is printed is actually not an error
but just a warning that the optional sanity check was skipped. So demote
the message from error to warning and adjust the text to better explain
what happened.
Reported-by: Leonardo Mörlein <freifunk@irrelefant.net>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Sven Eckelmann [Fri, 21 Jan 2022 16:14:44 +0000 (17:14 +0100)]
batman-adv: Migrate to linux/container_of.h
Commit d2a8ebbf8192 ("kernel.h: split out container_of() and
typeof_member() macros") introduced a new header for the container_of
related macros, which were previously in linux/kernel.h.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Jakub Kicinski [Wed, 2 Mar 2022 02:29:35 +0000 (18:29 -0800)]
Merge branch 'if_ether-h-add-industrial-fieldbus-ethertypes'
Daniel Braunwarth says:
====================
if_ether.h: add industrial fieldbus Ethertypes
This set of patches adds the Ethertypes for PROFINET and EtherCAT.
The defines should be used by iproute2 to extend the list of available link
layer protocols.
====================
Link: https://lore.kernel.org/r/20220228133029.100913-1-daniel@braunwarth.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Braunwarth [Mon, 28 Feb 2022 13:30:29 +0000 (14:30 +0100)]
if_ether.h: add EtherCAT Ethertype
Add the Ethertype for EtherCAT protocol.
Signed-off-by: Daniel Braunwarth <daniel@braunwarth.dev>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Braunwarth [Mon, 28 Feb 2022 13:30:28 +0000 (14:30 +0100)]
if_ether.h: add PROFINET Ethertype
Add the Ethertype for PROFINET protocol.
Signed-off-by: Daniel Braunwarth <daniel@braunwarth.dev>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sven Eckelmann [Mon, 28 Feb 2022 00:32:40 +0000 (01:32 +0100)]
macvtap: advertise link netns via netlink
Assign rtnl_link_ops->get_link_net() callback so that IFLA_LINK_NETNSID is
added to rtnetlink messages. This fixes iproute2 which otherwise resolved
the link interface to an interface in the wrong namespace.
Test commands:
ip netns add nst
ip link add dummy0 type dummy
ip link add link dummy0 name macvtap0 type macvtap
ip link set macvtap0 netns nst
ip -netns nst link show macvtap0
Before:
10: macvtap0@gre0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 500
link/ether 5e:8f:ae:1d:60:50 brd ff:ff:ff:ff:ff:ff
After:
10: macvtap0@if2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 500
link/ether 5e:8f:ae:1d:60:50 brd ff:ff:ff:ff:ff:ff link-netnsid 0
Reported-by: Leonardo Mörlein <freifunk@irrelefant.net>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Link: https://lore.kernel.org/r/20220228003240.1337426-1-sven@narfation.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Wan Jiabing [Tue, 1 Mar 2022 11:23:54 +0000 (19:23 +0800)]
nfp: avoid newline at end of message in NL_SET_ERR_MSG_MOD
Fix the following coccicheck warning:
./drivers/net/ethernet/netronome/nfp/flower/qos_conf.c:750:7-55: WARNING
avoid newline at end of message in NL_SET_ERR_MSG_MOD
Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20220301112356.1820985-1-wanjiabing@vivo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Harold Huang [Mon, 28 Feb 2022 03:38:05 +0000 (11:38 +0800)]
tun: support NAPI for packets received from batched XDP buffs
In tun, NAPI is supported and we can also use NAPI in the path of
batched XDP buffs to accelerate packet processing. What is more, after
we use NAPI, GRO is also supported. iperf shows that the throughput of a
single stream could be improved from 4.5Gbps to 9.2Gbps. Additionally, 9.2
Gbps nearly reaches the line speed of the phy nic and there is still about
15% idle cpu core remaining on the vhost thread.
Test topology:
[iperf server]<--->tap<--->dpdk testpmd<--->phy nic<--->[iperf client]
Iperf stream:
iperf3 -c 10.0.0.2 -i 1 -t 10
Before:
...
[ 5] 5.00-6.00 sec 558 MBytes 4.68 Gbits/sec 0 1.50 MBytes
[ 5] 6.00-7.00 sec 556 MBytes 4.67 Gbits/sec 1 1.35 MBytes
[ 5] 7.00-8.00 sec 556 MBytes 4.67 Gbits/sec 2 1.18 MBytes
[ 5] 8.00-9.00 sec 559 MBytes 4.69 Gbits/sec 0 1.48 MBytes
[ 5] 9.00-10.00 sec 556 MBytes 4.67 Gbits/sec 1 1.33 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 5.39 GBytes 4.63 Gbits/sec 72 sender
[ 5] 0.00-10.04 sec 5.39 GBytes 4.61 Gbits/sec receiver
After:
...
[ 5] 5.00-6.00 sec 1.07 GBytes 9.19 Gbits/sec 0 1.55 MBytes
[ 5] 6.00-7.00 sec 1.08 GBytes 9.30 Gbits/sec 0 1.63 MBytes
[ 5] 7.00-8.00 sec 1.08 GBytes 9.25 Gbits/sec 0 1.72 MBytes
[ 5] 8.00-9.00 sec 1.08 GBytes 9.25 Gbits/sec 77 1.31 MBytes
[ 5] 9.00-10.00 sec 1.08 GBytes 9.24 Gbits/sec 0 1.48 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 10.8 GBytes 9.28 Gbits/sec 166 sender
[ 5] 0.00-10.04 sec 10.8 GBytes 9.24 Gbits/sec receiver
Reported-at: https://lore.kernel.org/all/CACGkMEvTLG0Ayg+TtbN4q4pPW-ycgCCs3sC3-TF8cuRTf7Pp1A@mail.gmail.com
Signed-off-by: Harold Huang <baymaxhuang@gmail.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20220228033805.1579435-1-baymaxhuang@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 2 Mar 2022 01:12:46 +0000 (17:12 -0800)]
Merge branch 'sfc-optimize-rxqs-count-and-affinities'
Íñigo Huguet says:
====================
sfc: optimize RXQs count and affinities
In sfc driver one RX queue per physical core was allocated by default.
Later on, IRQ affinities were set spreading the IRQs in all NUMA local
CPUs.
However, that default configuration is far from optimal on many modern
systems. Specifically, in systems with hyper threading and 2 NUMA nodes,
affinities are set in a way that IRQs are handled by all logical cores of
the same NUMA node. Handling IRQs from both hyper threading siblings has
no benefit, and setting affinities to one queue per physical core is not
a good idea either, because there is a performance penalty for moving
data across nodes (I was able to check it with some XDP tests using
pktgen).
These patches reduce the default number of channels to one per physical
core in the local NUMA node. Then, they set IRQ affinities to CPUs in
the local NUMA node only. This way we save hardware resources since
channels are limited resources. We also leave more room for XDP_TX
channels without hitting the driver's limit of 32 channels per interface.
Running performance tests using iperf with a SFC9140 device showed no
performance penalty for reducing the number of channels.
RX XDP tests showed that performance can go down to less than half if
the IRQ is handled by a CPU in a different NUMA node, which doesn't
happen with the new defaults from these patches.
====================
Link: https://lore.kernel.org/r/20220228132254.25787-1-ihuguet@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Íñigo Huguet [Mon, 28 Feb 2022 13:22:54 +0000 (14:22 +0100)]
sfc: set affinity hints in local NUMA node only
Affinity hints were being set to CPUs in local NUMA node first, and then
in other CPUs. This was creating 2 unintended issues:
1. Channels created to be assigned each to a different physical core
were assigned to hyperthreading siblings because of being in same
NUMA node.
Since the patch previous to this one, this no longer happens
with the default rss_cpus modparam because fewer channels are created.
2. XDP channels could be assigned to CPUs in different NUMA nodes,
decreasing performance too much (to less than half in some of my
tests).
This patch sets the affinity hints spreading the channels only in local
NUMA node's CPUs. A fallback for the case that no CPU in local NUMA node
is online has been added too.
Example of CPUs being assigned in a non optimal way before this and the
previous patch (note: in this system, xdp-8 to xdp-15 are created
because num_possible_cpus == 64, but num_present_cpus == 32 so they're
never used):
$ lscpu | grep -i numa
NUMA node(s): 2
NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31
$ grep -H . /proc/irq/*/0000:07:00.0*/../smp_affinity_list
/proc/irq/141/0000:07:00.0-0/../smp_affinity_list:0
/proc/irq/142/0000:07:00.0-1/../smp_affinity_list:1
/proc/irq/143/0000:07:00.0-2/../smp_affinity_list:2
/proc/irq/144/0000:07:00.0-3/../smp_affinity_list:3
/proc/irq/145/0000:07:00.0-4/../smp_affinity_list:4
/proc/irq/146/0000:07:00.0-5/../smp_affinity_list:5
/proc/irq/147/0000:07:00.0-6/../smp_affinity_list:6
/proc/irq/148/0000:07:00.0-7/../smp_affinity_list:7
/proc/irq/149/0000:07:00.0-8/../smp_affinity_list:16
/proc/irq/150/0000:07:00.0-9/../smp_affinity_list:17
/proc/irq/151/0000:07:00.0-10/../smp_affinity_list:18
/proc/irq/152/0000:07:00.0-11/../smp_affinity_list:19
/proc/irq/153/0000:07:00.0-12/../smp_affinity_list:20
/proc/irq/154/0000:07:00.0-13/../smp_affinity_list:21
/proc/irq/155/0000:07:00.0-14/../smp_affinity_list:22
/proc/irq/156/0000:07:00.0-15/../smp_affinity_list:23
/proc/irq/157/0000:07:00.0-xdp-0/../smp_affinity_list:8
/proc/irq/158/0000:07:00.0-xdp-1/../smp_affinity_list:9
/proc/irq/159/0000:07:00.0-xdp-2/../smp_affinity_list:10
/proc/irq/160/0000:07:00.0-xdp-3/../smp_affinity_list:11
/proc/irq/161/0000:07:00.0-xdp-4/../smp_affinity_list:12
/proc/irq/162/0000:07:00.0-xdp-5/../smp_affinity_list:13
/proc/irq/163/0000:07:00.0-xdp-6/../smp_affinity_list:14
/proc/irq/164/0000:07:00.0-xdp-7/../smp_affinity_list:15
/proc/irq/165/0000:07:00.0-xdp-8/../smp_affinity_list:24
/proc/irq/166/0000:07:00.0-xdp-9/../smp_affinity_list:25
/proc/irq/167/0000:07:00.0-xdp-10/../smp_affinity_list:26
/proc/irq/168/0000:07:00.0-xdp-11/../smp_affinity_list:27
/proc/irq/169/0000:07:00.0-xdp-12/../smp_affinity_list:28
/proc/irq/170/0000:07:00.0-xdp-13/../smp_affinity_list:29
/proc/irq/171/0000:07:00.0-xdp-14/../smp_affinity_list:30
/proc/irq/172/0000:07:00.0-xdp-15/../smp_affinity_list:31
CPU assignments after this and the previous patch: normal channels are
created only one per core in the local NUMA node, and affinities are set
only to the local NUMA node:
$ grep -H . /proc/irq/*/0000:07:00.0*/../smp_affinity_list
/proc/irq/116/0000:07:00.0-0/../smp_affinity_list:0
/proc/irq/117/0000:07:00.0-1/../smp_affinity_list:1
/proc/irq/118/0000:07:00.0-2/../smp_affinity_list:2
/proc/irq/119/0000:07:00.0-3/../smp_affinity_list:3
/proc/irq/120/0000:07:00.0-4/../smp_affinity_list:4
/proc/irq/121/0000:07:00.0-5/../smp_affinity_list:5
/proc/irq/122/0000:07:00.0-6/../smp_affinity_list:6
/proc/irq/123/0000:07:00.0-7/../smp_affinity_list:7
/proc/irq/124/0000:07:00.0-xdp-0/../smp_affinity_list:16
/proc/irq/125/0000:07:00.0-xdp-1/../smp_affinity_list:17
/proc/irq/126/0000:07:00.0-xdp-2/../smp_affinity_list:18
/proc/irq/127/0000:07:00.0-xdp-3/../smp_affinity_list:19
/proc/irq/128/0000:07:00.0-xdp-4/../smp_affinity_list:20
/proc/irq/129/0000:07:00.0-xdp-5/../smp_affinity_list:21
/proc/irq/130/0000:07:00.0-xdp-6/../smp_affinity_list:22
/proc/irq/131/0000:07:00.0-xdp-7/../smp_affinity_list:23
/proc/irq/132/0000:07:00.0-xdp-8/../smp_affinity_list:0
/proc/irq/133/0000:07:00.0-xdp-9/../smp_affinity_list:1
/proc/irq/134/0000:07:00.0-xdp-10/../smp_affinity_list:2
/proc/irq/135/0000:07:00.0-xdp-11/../smp_affinity_list:3
/proc/irq/136/0000:07:00.0-xdp-12/../smp_affinity_list:4
/proc/irq/137/0000:07:00.0-xdp-13/../smp_affinity_list:5
/proc/irq/138/0000:07:00.0-xdp-14/../smp_affinity_list:6
/proc/irq/139/0000:07:00.0-xdp-15/../smp_affinity_list:7
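For illustration only: the "after" listing above is consistent with a simple round-robin spread of channels over the local NUMA node's CPUs. A minimal Python sketch under that assumption follows; the function names are hypothetical and this is not the driver code. The fallback to all online CPUs models the case where no local CPU is online.

```python
def parse_cpu_list(text):
    """Parse a CPU list such as "0-7,16-23" into a list of CPU numbers."""
    cpus = []
    for part in text.split(","):
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def spread_affinity(n_channels, local_cpus, online_cpus):
    """Assign each channel a CPU, cycling through local online CPUs."""
    candidates = [c for c in local_cpus if c in online_cpus]
    if not candidates:
        # Fallback: no local CPU is online, so use every online CPU.
        candidates = sorted(online_cpus)
    return [candidates[i % len(candidates)] for i in range(n_channels)]


node0 = parse_cpu_list("0-7,16-23")   # from the lscpu output above
online = set(range(32))
hints = spread_affinity(24, node0, online)  # 8 normal + 16 XDP channels
print(hints)
```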
Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Íñigo Huguet [Mon, 28 Feb 2022 13:22:53 +0000 (14:22 +0100)]
sfc: default config to 1 channel/core in local NUMA node only
Handling channels from CPUs in a different NUMA node can penalize
performance, so it is better to configure only one channel per core in the
same NUMA node as the NIC, rather than one per core in the whole system.
Fall back to all other online cores if there are no online CPUs in the local
NUMA node.
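For illustration only: a Python sketch of counting one channel per physical core in the NIC's NUMA node, with the fallback described above. The function name is hypothetical, and the sibling layout (CPU i paired with CPU i+16) is an assumption made for the example, not something stated in this log.

```python
def default_channel_count(local_cpus, online_cpus, siblings):
    """Count physical cores among the local online CPUs.

    siblings maps each CPU to the set of its hyperthread siblings
    (including itself); the lowest sibling stands for the core.
    """
    candidates = [c for c in local_cpus if c in online_cpus]
    if not candidates:
        # Fallback: no local CPU online, count cores across all online CPUs.
        candidates = sorted(online_cpus)
    cores = {min(siblings[c]) for c in candidates}
    return len(cores)


# Assumed topology: 32 logical CPUs, CPU i and CPU i+16 share a core.
siblings = {c: frozenset({c % 16, c % 16 + 16}) for c in range(32)}
node0 = list(range(0, 8)) + list(range(16, 24))
online = set(range(32))

local_count = default_channel_count(node0, online, siblings)
fallback_count = default_channel_count([], online, siblings)
print(local_count, fallback_count)
```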
Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 1 Mar 2022 22:24:46 +0000 (14:24 -0800)]
net: smc: fix different types in min()
Fix build:
include/linux/minmax.h:45:25: note: in expansion of macro ‘__careful_cmp’
45 | #define min(x, y) __careful_cmp(x, y, <)
| ^~~~~~~~~~~~~
net/smc/smc_tx.c:150:24: note: in expansion of macro ‘min’
150 | corking_size = min(sock_net(&smc->sk)->smc.sysctl_autocorking_size,
| ^~~
Fixes: 12bbb0d163a9 ("net/smc: add sysctl for autocorking")
Link: https://lore.kernel.org/r/20220301222446.1271127-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Mateusz Palczewski [Thu, 3 Feb 2022 10:25:18 +0000 (11:25 +0100)]
iavf: Remove non-inclusive language
Remove non-inclusive language from the iavf driver.
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Mateusz Palczewski [Thu, 27 Jan 2022 14:16:40 +0000 (15:16 +0100)]
iavf: Fix incorrect use of assigning iavf_status to int
Currently there are functions in iavf_virtchnl.c for polling specific
virtchnl receive events. These are all assigning iavf_status values to
int values. Fix this and explicitly assign int values if iavf_status
is not IAVF_SUCCESS.
Also, refactor a small amount of duplicated code that can be reused by
all of the previously mentioned functions.
Finally, fix some spacing errors for variable assignment and get rid of
all the goto statements in the refactored functions for clarity.
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Mateusz Palczewski [Thu, 27 Jan 2022 14:16:29 +0000 (15:16 +0100)]
iavf: stop leaking iavf_status as "errno" values
Several functions in the iAVF core files take status values of the enum
iavf_status and convert them into integer values. This leads to
confusion as functions return both Linux errno values and status codes
intermixed. Reporting status codes as if they were "errno" values can
lead to confusion when reviewing error logs. Additionally, it can lead
to unexpected behavior if a return value is not interpreted properly.
Fix this by introducing iavf_status_to_errno, a switch that explicitly
converts from the status codes into an appropriate error value. Also
introduce a virtchnl_status_to_errno function for the one case where we
were returning both virtchnl status codes and iavf_status codes in the
same function.
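For illustration only: a Python sketch of the idea of an explicit status-to-errno conversion. The enum members and the chosen errno values are hypothetical stand-ins; they are not the driver's actual enum iavf_status or its real mapping.

```python
import errno
from enum import Enum


class Status(Enum):
    """Hypothetical device status codes (not the real iavf_status values)."""
    SUCCESS = 0
    ERR_PARAM = 1
    ERR_NO_MEMORY = 2
    ERR_NOT_SUPPORTED = 3
    ERR_UNKNOWN = 99


# Explicit table: every status is translated on purpose, never passed through.
_STATUS_TO_ERRNO = {
    Status.SUCCESS: 0,
    Status.ERR_PARAM: -errno.EINVAL,
    Status.ERR_NO_MEMORY: -errno.ENOMEM,
    Status.ERR_NOT_SUPPORTED: -errno.EOPNOTSUPP,
}


def status_to_errno(status):
    """Convert a Status into a negative errno, defaulting to -EIO."""
    return _STATUS_TO_ERRNO.get(status, -errno.EIO)


print(status_to_errno(Status.ERR_PARAM), status_to_errno(Status.ERR_UNKNOWN))
```

A default branch keeps callers from ever seeing a raw status code where an errno is expected.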
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Minghao Chi [Mon, 10 Jan 2022 10:46:56 +0000 (10:46 +0000)]
iavf: remove redundant ret variable
Return value directly instead of taking this in another redundant
variable.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Minghao Chi <chi.minghao@zte.com.cn>
Signed-off-by: CGEL ZTE <cgel.zte@gmail.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Mateusz Palczewski [Wed, 19 Jan 2022 10:15:21 +0000 (11:15 +0100)]
iavf: Add usage of new virtchnl format to set default MAC
Use new type field of VIRTCHNL_OP_ADD_ETH_ADDR and
VIRTCHNL_OP_DEL_ETH_ADDR requests to indicate that
VF wants to change its default MAC address.
Signed-off-by: Sylwester Dziedziuch <sylwesterx.dziedziuch@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Mateusz Palczewski [Fri, 14 Jan 2022 09:36:36 +0000 (10:36 +0100)]
iavf: refactor processing of VLAN V2 capability message
In order to handle the capability exchange necessary for
VIRTCHNL_VF_OFFLOAD_VLAN_V2, the driver must send
a VIRTCHNL_OP_GET_OFFLOAD_VLAN_V2_CAPS message. This must occur prior to
__IAVF_CONFIG_ADAPTER, and the driver must wait for the response from
the PF.
To handle this, the __IAVF_INIT_GET_OFFLOAD_VLAN_V2_CAPS state was
introduced. This state is intended to process the response from the VLAN
V2 caps message. This works ok, but is difficult to extend to adding
more extended capability exchange.
Existing (and future) AVF features are relying more and more on these
sort of extended ops for processing additional capabilities. Just like
VLAN V2, this exchange must happen prior to __IAVF_CONFIG_ADAPTER.
Since we only send one outstanding AQ message at a time during init, it
is not clear where to place this state. Adding more capability specific
states becomes a mess. Instead of having the "previous" state send
a message and then transition into a capability-specific state,
introduce __IAVF_EXTENDED_CAPS state. This state will use a list of
extended_caps that determines what messages to send and receive. As long
as there are extended_caps bits still set, the driver will remain in
this state performing one send or one receive per state machine loop.
Refactor the VLAN V2 negotiation to use this new state, and remove the
capability-specific state. This makes it significantly easier to add
a new similar capability exchange going forward.
Extended capabilities are processed by having an associated SEND and
RECV extended capability bit. During __IAVF_EXTENDED_CAPS, the
driver checks these bits in order by feature, first the send bit for
a feature, then the recv bit for a feature. Each send flag will call
a function that sends the necessary response, while each receive flag
will wait for the response from the PF. If a given feature can't be
negotiated with the PF, the associated flags will be cleared in
order to skip processing of that feature.
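For illustration only: a Python model of the send/recv bit processing described above, doing one send or one receive per loop iteration and clearing both bits of a feature that cannot be negotiated. The bit names, the second feature, and the callbacks are hypothetical; this is not the driver's state machine.

```python
SEND_VLAN_V2 = 1 << 0
RECV_VLAN_V2 = 1 << 1
SEND_FEATURE_X = 1 << 2   # hypothetical second capability
RECV_FEATURE_X = 1 << 3

# Features are checked in order: send bit first, then recv bit.
FEATURES = [
    (SEND_VLAN_V2, RECV_VLAN_V2, "vlan_v2"),
    (SEND_FEATURE_X, RECV_FEATURE_X, "feature_x"),
]


def extended_caps_step(pending, send, recv, log):
    """Perform at most one send or one receive; return the remaining bits."""
    for send_bit, recv_bit, name in FEATURES:
        if pending & send_bit:
            if send(name):
                log.append(f"send {name}")
                return pending & ~send_bit
            # Cannot negotiate: clear both bits so the feature is skipped.
            log.append(f"skip {name}")
            return pending & ~(send_bit | recv_bit)
        if pending & recv_bit:
            recv(name)
            log.append(f"recv {name}")
            return pending & ~recv_bit
    return pending


def run_extended_caps(pending, send, recv):
    """Stay in the extended-caps state while any bit is still set."""
    log = []
    while pending:
        pending = extended_caps_step(pending, send, recv, log)
    return log


all_bits = SEND_VLAN_V2 | RECV_VLAN_V2 | SEND_FEATURE_X | RECV_FEATURE_X
log = run_extended_caps(
    all_bits,
    send=lambda name: name == "vlan_v2",   # pretend feature_x is unsupported
    recv=lambda name: None,
)
print(log)
```

Adding another capability in this model only means appending one tuple to the list, which is the extensibility argument made above.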
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Mateusz Palczewski [Mon, 10 Jan 2022 15:06:38 +0000 (16:06 +0100)]
iavf: Add support for 50G/100G in AIM algorithm
Advanced link speed support was added long ago, but adding AIM support was
missed. This patch adds AIM support for advanced link speed support, which
allows the algorithm to take into account 50G/100G link speeds. Also, other
previous speeds are taken into consideration when advanced link speeds are
supported.
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
David S. Miller [Tue, 1 Mar 2022 14:25:12 +0000 (14:25 +0000)]
Merge branch 'smc-datapath-opts'
Dust Li says:
====================
net/smc: some datapath performance optimizations
This series tries to improve the performance of SMC in datapath.
- patch #1, add sysctl interface to support tuning the behaviour of
SMC in container environment.
- patch #2/#3, add autocorking support which is very efficient for small
messages without trade-off for latency.
- patch #4, send directly on setting TCP_NODELAY, without waking up the
TX worker; this makes it consistent with clearing TCP_CORK.
- patch #5, correct the setting of the RMB window update limit, so
we don't send CDC messages to update peer's RMB window too frequently
in some cases.
- patch #6, implement something like NAPI in SMC, decreasing the number
of hardirqs when busy.
- patch #7, move TX work done in the BH to the user context when
sock_lock is held by the user.
With this patchset applied, we can get a good performance gain:
- qperf tcp_bw test has shown a great improvement. Other benchmarks like
'netperf TCP_STREAM' or 'sockperf throughput' have similar results.
- In my testing environment, running qperf tcp_bw and tcp_lat, SMC behaves
better than TCP for almost all message sizes.
Here are some test results with the following testing command:
client: smc_run taskset -c 1 qperf smc-server -oo msg_size:1:64K:*2 \
-t 30 -vu tcp_{bw|lat}
server: smc_run taskset -c 1 qperf
==== Bandwidth ====
MsgSize  Origin SMC      TCP                      SMC with patches
1           0.578 MB/s      2.392 MB/s(313.57%)      2.561 MB/s(342.83%)
2           1.159 MB/s      4.780 MB/s(312.53%)      5.162 MB/s(345.46%)
4           2.283 MB/s     10.266 MB/s(349.77%)     10.122 MB/s(343.46%)
8           4.668 MB/s     19.040 MB/s(307.86%)     20.521 MB/s(339.59%)
16          9.147 MB/s     38.904 MB/s(325.31%)     40.823 MB/s(346.29%)
32         18.369 MB/s     79.587 MB/s(333.25%)     80.535 MB/s(338.42%)
64         36.562 MB/s    148.668 MB/s(306.61%)    158.170 MB/s(332.60%)
128        72.961 MB/s    274.913 MB/s(276.80%)    316.217 MB/s(333.41%)
256       144.705 MB/s    512.059 MB/s(253.86%)    626.019 MB/s(332.62%)
512       288.873 MB/s    884.977 MB/s(206.35%)   1221.596 MB/s(322.88%)
1024      574.180 MB/s   1337.736 MB/s(132.98%)   2203.156 MB/s(283.70%)
2048     1095.192 MB/s   1865.952 MB/s( 70.38%)   3036.448 MB/s(177.25%)
4096     2066.157 MB/s   2380.337 MB/s( 15.21%)   3834.271 MB/s( 85.58%)
8192     3717.198 MB/s   2733.073 MB/s(-26.47%)   4904.910 MB/s( 31.95%)
16384    4742.221 MB/s   2958.693 MB/s(-37.61%)   5220.272 MB/s( 10.08%)
32768    5349.550 MB/s   3061.285 MB/s(-42.77%)   5321.865 MB/s( -0.52%)
65536    5162.919 MB/s   3731.408 MB/s(-27.73%)   5245.021 MB/s(  1.59%)
==== Latency ====
MsgSize  Origin SMC   TCP                  SMC with patches
1        10.540 us    11.938 us( 13.26%)   10.356 us( -1.75%)
2        10.996 us    11.992 us(  9.06%)   10.073 us( -8.39%)
4        10.229 us    11.687 us( 14.25%)    9.996 us( -2.28%)
8        10.203 us    11.653 us( 14.21%)   10.063 us( -1.37%)
16       10.530 us    11.313 us(  7.44%)   10.013 us( -4.91%)
32       10.241 us    11.586 us( 13.13%)   10.081 us( -1.56%)
64       10.693 us    11.652 us(  8.97%)    9.986 us( -6.61%)
128      10.597 us    11.579 us(  9.27%)   10.262 us( -3.16%)
256      10.409 us    11.957 us( 14.87%)   10.148 us( -2.51%)
512      11.088 us    12.505 us( 12.78%)   10.206 us( -7.95%)
1024     11.240 us    12.255 us(  9.03%)   10.631 us( -5.42%)
2048     11.485 us    16.970 us( 47.76%)   10.981 us( -4.39%)
4096     12.077 us    13.948 us( 15.49%)   11.847 us( -1.90%)
8192     13.683 us    16.693 us( 22.00%)   13.336 us( -2.54%)
16384    16.470 us    23.615 us( 43.38%)   16.519 us(  0.30%)
32768    22.540 us    40.966 us( 81.75%)   22.452 us( -0.39%)
65536    34.192 us    73.003 us(113.51%)   33.916 us( -0.81%)
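For illustration only: the percentages in parentheses appear to be the change relative to the "Origin SMC" column. A minimal Python check of that reading against the 65536-byte rows follows; the function name is hypothetical, and rows with small values may differ in the last digits because the printed figures are rounded.

```python
def relative_change(baseline, value):
    """Percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0


# 65536-byte rows: "SMC with patches" versus "Origin SMC".
bw_change = relative_change(5162.919, 5245.021)    # bandwidth, MB/s
lat_change = relative_change(34.192, 33.916)       # latency, us
print(f"{bw_change:.2f}% {lat_change:.2f}%")
```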
------------
Test environment notes:
1. Testing is run on 2 VMs within the same physical host.
2. The NIC is a ConnectX-4Lx using SR-IOV, with 2 VFs passed through to the
2 VMs respectively.
3. To decrease jitter, each vCPU of the VMs is bound to its own physical CPU,
and those physical CPUs are all isolated using the boot parameter `isolcpus=xxx`.
4. The number of queues is set to 1, and the interrupt from the queue is bound
to CPU0 in the guest.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Dust Li [Tue, 1 Mar 2022 09:44:02 +0000 (17:44 +0800)]
net/smc: don't send in the BH context if sock_owned_by_user
Sending data all the way down to the RDMA device is a
time-consuming operation (get a new slot, maybe do an RDMA Write
and send a CDC, etc.). Moving those operations from BH
to user context is good for performance.
If the sock_lock is held by the user, we don't try to send
data out in the BH context, but just mark that we should
send. Since the user will release the sock_lock soon, we
can do the sending there.
Add smc_release_cb(), which will be called in release_sock()
and tries to send in the callback if needed.
This patch moves the sending part out of BH if the sock lock
is held by the user. In my testing environment, this saves about
20% softirq in the qperf 4K tcp_bw test on the sender side
with no noticeable throughput drop.
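The deferral described above can be sketched as a small userspace model. This is only an illustration under stated assumptions: struct model_sock and the model_* functions are hypothetical names invented here, not the kernel's data structures, and the counters merely record in which context a send would have happened.

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct model_sock {
	bool owned_by_user;	/* stands in for sock_owned_by_user() */
	bool tx_deferred;	/* the "we should send" mark set from BH */
	int sends_in_bh;	/* sends performed in BH context */
	int sends_in_user;	/* sends performed when the lock is released */
};

/* BH side: send only if the user does not hold the sock lock. */
void model_tx_from_bh(struct model_sock *sk)
{
	if (sk->owned_by_user) {
		sk->tx_deferred = true;	/* just mark it, do not send */
		return;
	}
	sk->sends_in_bh++;
}

/* User side: models release_sock() invoking the release callback. */
void model_release_sock(struct model_sock *sk)
{
	sk->owned_by_user = false;
	if (sk->tx_deferred) {
		sk->tx_deferred = false;
		sk->sends_in_user++;	/* the deferred send happens here */
	}
}
```

In this model, a BH event that arrives while the lock is held costs only a flag write; the actual send work is paid for once, in user context.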
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dust Li [Tue, 1 Mar 2022 09:44:01 +0000 (17:44 +0800)]
net/smc: don't req_notify until all CQEs drained
When we are handling softirq workload, enabling hardirq may
interrupt the current softirq routine again, and then
try to raise softirq again. This only wastes CPU cycles
and brings no real gain.
IB_CQ_REPORT_MISSED_EVENTS already guarantees that if
ib_req_notify_cq() returns 0, it is safe to wait for the
next event, with no need to poll the CQ again in this case.
This patch disables hardirq during the processing of softirq,
and re-arms the CQ after softirq is done, somewhat like NAPI.
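The drain-then-re-arm pattern can be sketched with a simulated completion queue. This is a userspace model only: struct model_cq and the model_* functions are hypothetical, and model_req_notify() merely imitates the documented contract that a return value of 0 means it is safe to wait for the next event, while a positive value means completions may have been missed.

```c
#include <stdbool.h>

/* Hypothetical model of a completion queue; not the verbs API. */
struct model_cq {
	int pending;		/* completions waiting to be polled */
	int late_arrivals;	/* completions racing with the re-arm */
	bool armed;		/* whether the next completion raises an event */
};

/* Poll up to budget completions; returns how many were taken. */
int model_poll_cq(struct model_cq *cq, int budget)
{
	int n = cq->pending < budget ? cq->pending : budget;

	cq->pending -= n;
	return n;
}

/* Re-arm the CQ; a positive return means events may have been missed. */
int model_req_notify(struct model_cq *cq)
{
	cq->armed = true;
	cq->pending += cq->late_arrivals;
	cq->late_arrivals = 0;
	return cq->pending > 0 ? 1 : 0;
}

/* Drain everything first, re-arm last, and poll again only if needed. */
int model_drain(struct model_cq *cq, int budget)
{
	int handled = 0;
	int n;

	for (;;) {
		while ((n = model_poll_cq(cq, budget)) > 0)
			handled += n;
		if (model_req_notify(cq) == 0)
			break;	/* safe to wait for the next event */
	}
	return handled;
}
```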
Co-developed-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dust Li [Tue, 1 Mar 2022 09:44:00 +0000 (17:44 +0800)]
net/smc: correct settings of RMB window update limit
rmbe_update_limit is used to keep receive window updates from
being announced too frequently. RFC 7609 requests a minimal
increase in the window size of 10% of the receive buffer
space. But the current implementation used:
min_t(int, rmbe_size / 10, SOCK_MIN_SNDBUF / 2)
and SOCK_MIN_SNDBUF / 2 == 2304 bytes, which is almost
always less than 10% of the receive buffer space.
This causes the receiver to always send a CDC message to
update its consumer cursor when it consumes more than 2K
of data. As a result, we may encounter something like
"TCP silly window syndrome" when sending 2.5~8K messages.
This patch fixes this by using max(rmbe_size / 10, SOCK_MIN_SNDBUF / 2).
With this patch and SMC autocorking enabled, the qperf 2K/4K/8K
tcp_bw tests show a 45%/75%/40% increase in throughput respectively.
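The arithmetic behind the fix can be checked with a short standalone sketch. The function names and the MODEL_HALF_MIN_SNDBUF constant are made up for illustration; the value 2304 is the one quoted in the message above, and the buffer sizes are arbitrary examples.

```c
/* SOCK_MIN_SNDBUF / 2 as quoted in the commit message (assumed here). */
#define MODEL_HALF_MIN_SNDBUF 2304

/* Old behaviour: min(), so the limit is capped at 2304 bytes. */
int model_update_limit_old(int rmbe_size)
{
	int tenth = rmbe_size / 10;

	return tenth < MODEL_HALF_MIN_SNDBUF ? tenth : MODEL_HALF_MIN_SNDBUF;
}

/* New behaviour: max(), so the limit is at least 10% of the buffer. */
int model_update_limit_new(int rmbe_size)
{
	int tenth = rmbe_size / 10;

	return tenth > MODEL_HALF_MIN_SNDBUF ? tenth : MODEL_HALF_MIN_SNDBUF;
}
```

For a 64 KiB receive buffer, the old expression yields 2304 bytes while the new one yields 6553 bytes, which is the 10% that the RFC asks for.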
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dust Li [Tue, 1 Mar 2022 09:43:59 +0000 (17:43 +0800)]
net/smc: send directly on setting TCP_NODELAY
In commit ea785a1a573b ("net/smc: Send directly when
TCP_CORK is cleared"), we stopped using delayed work
to implement cork.
This patch uses the same approach: it removes the
delayed work when setting TCP_NODELAY and sends
directly in setsockopt(). This also makes
TCP_NODELAY behave the same as in TCP.
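From userspace, the option is set with the usual TCP-level setsockopt() call. The helper below is a minimal sketch; set_tcp_nodelay() is a hypothetical wrapper name, and whether a given socket is backed by SMC or plain TCP depends on how it was created, which is outside this example.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: enable or disable TCP_NODELAY on a stream socket. */
int set_tcp_nodelay(int fd, int on)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}
```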
Cc: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dust Li [Tue, 1 Mar 2022 09:43:58 +0000 (17:43 +0800)]
net/smc: add sysctl for autocorking
This adds a new sysctl: net.smc.autocorking_size
We can dynamically change the behaviour of autocorking
by changing the value of autocorking_size.
Setting it to 0 disables autocorking in SMC.
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dust Li [Tue, 1 Mar 2022 09:43:57 +0000 (17:43 +0800)]
net/smc: add autocorking support
This patch adds autocorking support for SMC, which can improve
throughput for small messages by more than 3x.
The main idea is borrowed from TCP autocorking, with some
RDMA-specific modifications:
1. The first message should never be corked, to make sure we won't
introduce extra latency.
2. If we have posted any Tx WRs to the NIC that have not
completed, cork the new messages until:
a) We receive the CQE for the last Tx WR, or
b) We have corked enough messages on the connection.
3. Try to push the corked data out when we receive the CQE of
the last Tx WR, to prevent the corked messages from hanging in
the send queue.
Both SMC autocorking and TCP autocorking check the Tx completion
to decide whether we should cork or not. The difference is that
when we get an SMC Tx WR completion, the data has been confirmed
by the RNIC, while a TCP Tx completion just tells us the data
has been sent out by the local NIC.
Add an atomic variable tx_pushing in smc_connection to make
sure only one context sends at a time, which lets it cork more
and saves CDC slots.
SMC autocorking should not introduce extra latency, since the first
message will always be sent out immediately.
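The corking decision in the rules above can be sketched as a pure function. This is a simplified model, not the patch's code: struct model_conn and its fields are hypothetical, and the threshold is assumed to be just the autocorking size, with 0 meaning disabled.

```c
#include <stdbool.h>

/* Hypothetical per-connection state for the sketch. */
struct model_conn {
	int pending_tx_wr;	/* Tx WRs posted to the NIC, not yet completed */
	int queued_bytes;	/* bytes waiting to be sent, incl. the new message */
	int autocorking_size;	/* corking threshold in bytes; 0 disables */
};

/* Returns true if the new message should be corked instead of sent now. */
bool model_should_autocork(const struct model_conn *c)
{
	if (c->autocorking_size == 0)
		return false;	/* autocorking disabled */
	if (c->pending_tx_wr == 0)
		return false;	/* rule 1: nothing in flight, send at once */
	if (c->queued_bytes > c->autocorking_size)
		return false;	/* rule 2b: enough has been corked */
	return true;		/* rule 2: wait for the last Tx WR's CQE */
}
```

Rule 3 is the other half of the design: the completion handler for the last Tx WR has to push whatever was corked, otherwise data held back by this function would sit in the send queue.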
The qperf tcp_bw test shows a more than 4x increase for small
message sizes with a Mellanox ConnectX-4 Lx, with the same result in other
throughput benchmarks like sockperf/netperf.
The qperf tcp_lat test shows that SMC autocorking does not increase
ping-pong latency.
Test command:
client: smc_run taskset -c 1 qperf smc-server -oo msg_size:1:64K:*2 \
-t 30 -vu tcp_{bw|lat}
server: smc_run taskset -c 1 qperf
=== Bandwidth ====
MsgSize(Bytes)  SMC-NoCork     TCP                     SMC-AutoCorking
             1     0.578 MB/s     2.392 MB/s(313.57%)     2.647 MB/s(357.72%)
             2     1.159 MB/s     4.780 MB/s(312.53%)     5.153 MB/s(344.71%)
             4     2.283 MB/s    10.266 MB/s(349.77%)    10.363 MB/s(354.02%)
             8     4.668 MB/s    19.040 MB/s(307.86%)    21.215 MB/s(354.45%)
            16     9.147 MB/s    38.904 MB/s(325.31%)    41.740 MB/s(356.32%)
            32    18.369 MB/s    79.587 MB/s(333.25%)    82.392 MB/s(348.52%)
            64    36.562 MB/s   148.668 MB/s(306.61%)   161.564 MB/s(341.89%)
           128    72.961 MB/s   274.913 MB/s(276.80%)   325.363 MB/s(345.94%)
           256   144.705 MB/s   512.059 MB/s(253.86%)   633.743 MB/s(337.96%)
           512   288.873 MB/s   884.977 MB/s(206.35%)  1250.681 MB/s(332.95%)
          1024   574.180 MB/s  1337.736 MB/s(132.98%)  2246.121 MB/s(291.19%)
          2048  1095.192 MB/s  1865.952 MB/s( 70.38%)  2057.767 MB/s( 87.89%)
          4096  2066.157 MB/s  2380.337 MB/s( 15.21%)  2173.983 MB/s(  5.22%)
          8192  3717.198 MB/s  2733.073 MB/s(-26.47%)  3491.223 MB/s( -6.08%)
         16384  4742.221 MB/s  2958.693 MB/s(-37.61%)  4637.692 MB/s( -2.20%)
         32768  5349.550 MB/s  3061.285 MB/s(-42.77%)  5385.796 MB/s(  0.68%)
         65536  5162.919 MB/s  3731.408 MB/s(-27.73%)  5223.890 MB/s(  1.18%)
==== Latency ====
MsgSize(Bytes)  SMC-NoCork  TCP                 SMC-AutoCorking
             1   10.540 us  11.938 us( 13.26%)  10.573 us(  0.31%)
             2   10.996 us  11.992 us(  9.06%)  10.269 us( -6.61%)
             4   10.229 us  11.687 us( 14.25%)  10.240 us(  0.11%)
             8   10.203 us  11.653 us( 14.21%)  10.402 us(  1.95%)
            16   10.530 us  11.313 us(  7.44%)  10.599 us(  0.66%)
            32   10.241 us  11.586 us( 13.13%)  10.223 us( -0.18%)
            64   10.693 us  11.652 us(  8.97%)  10.251 us( -4.13%)
           128   10.597 us  11.579 us(  9.27%)  10.494 us( -0.97%)
           256   10.409 us  11.957 us( 14.87%)  10.710 us(  2.89%)
           512   11.088 us  12.505 us( 12.78%)  10.547 us( -4.88%)
          1024   11.240 us  12.255 us(  9.03%)  10.787 us( -4.03%)
          2048   11.485 us  16.970 us( 47.76%)  11.256 us( -1.99%)
          4096   12.077 us  13.948 us( 15.49%)  12.230 us(  1.27%)
          8192   13.683 us  16.693 us( 22.00%)  13.786 us(  0.75%)
         16384   16.470 us  23.615 us( 43.38%)  16.459 us( -0.07%)
         32768   22.540 us  40.966 us( 81.75%)  23.284 us(  3.30%)
         65536   34.192 us  73.003 us(113.51%)  34.233 us(  0.12%)
With SMC autocorking support, we can achieve better throughput
than TCP for most message sizes without any latency trade-off.
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dust Li [Tue, 1 Mar 2022 09:43:56 +0000 (17:43 +0800)]
net/smc: add sysctl interface for SMC
This patch adds a sysctl interface for SMC to support container
environments, as discussed on the mailing list.
Link: https://lore.kernel.org/netdev/20220224020253.GF5443@linux.alibaba.com
Co-developed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 1 Mar 2022 08:38:02 +0000 (08:38 +0000)]
Merge branch 'vxlan-vnifiltering'
Roopa Prabhu says:
====================
vxlan metadata device vnifiltering support
This series adds vnifiltering support to vxlan collect metadata device.
Motivation:
You can only use a single vxlan collect metadata device for a given
vxlan udp port in the system today. The vxlan collect metadata device
terminates all received vxlan packets. As shown in the below diagram,
there are use-cases where you need to support multiple such vxlan devices in
independent bridge domains. Each vxlan device must terminate the vni's
it is configured for.
Example usecase: In a service provider network a service provider
typically supports multiple bridge domains with overlapping vlans.
One bridge domain per customer. Vlans in each bridge domain are
mapped to globally unique vxlan ranges assigned to each customer.
This series adds vnifiltering support to collect metadata devices to
terminate only configured vnis. This is similar to vlan filtering in
bridge driver. The vni filtering capability is provided by a new flag on
collect metadata device.
In the below pic:
- customer1 is mapped to br1 bridge domain
- customer2 is mapped to br2 bridge domain
- customer1 vlan 10-11 is mapped to vni 1001-1002
- customer2 vlan 10-11 is mapped to vni 2001-2002
- br1 and br2 are vlan filtering bridges
- vxlan1 and vxlan2 are collect metadata devices with
vnifiltering enabled
┌──────────────────────────────────────────────────────────────────┐
│ switch │
│ │
│ ┌───────────┐ ┌───────────┐ │
│ │ │ │ │ │
│ │ br1 │ │ br2 │ │
│ └┬─────────┬┘ └──┬───────┬┘ │
│ vlans│ │ vlans │ │ │
│ 10,11│ │ 10,11│ │ │
│ │ vlanvnimap: │ vlanvnimap: │
│ │ 10-1001,11-1002 │ 10-2001,11-2002 │
│ │ │ │ │ │
│ ┌──────┴┐ ┌──┴─────────┐ ┌───┴────┐ │ │
│ │ swp1 │ │vxlan1 │ │ swp2 │ ┌┴─────────────┐ │
│ │ │ │ vnifilter:│ │ │ │vxlan2 │ │
│ └───┬───┘ │ 1001,1002│ └───┬────┘ │ vnifilter: │ │
│ │ └────────────┘ │ │ 2001,2002 │ │
│ │ │ └──────────────┘ │
│ │ │ │
└───────┼──────────────────────────────────┼───────────────────────┘
│ │
│ │
┌─────┴───────┐ │
│ customer1 │ ┌─────┴──────┐
│ host/VM │ │customer2 │
└─────────────┘ │ host/VM │
└────────────┘
v2:
- remove stale xstats declarations pointed out by Nikolay Aleksandrov
- squash selinux patch with the tunnel api patch as pointed out by
Benjamin Poirier
- Fix various build issues:
Reported-by: kernel test robot <lkp@intel.com>
v3:
- incorporate review feedback from Jakub
- move rhashtable declarations to c file
- define and use netlink policy for top level vxlan filter api
- fix unused stats function warning
- pass vninode from vnifilter lookup into stats count function
to avoid another lookup (only applicable to vxlan_rcv)
- fix missing vxlan vni delete notifications in vnifilter uninit
function
- misc cleanups
- remote dev check for multicast groups added via vnifiltering api
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Tue, 1 Mar 2022 05:04:39 +0000 (05:04 +0000)]
drivers: vxlan: vnifilter: add support for stats dumping
Add support for VXLAN vni filter entries' stats dumping
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Tue, 1 Mar 2022 05:04:38 +0000 (05:04 +0000)]
drivers: vxlan: vnifilter: per vni stats
Add per-vni statistics for vni filter mode. Count Rx/Tx
bytes/packets/drops/errors at the appropriate places.
This patch changes vxlan_vs_find_vni to also return the
vxlan_vni_node in cases where the vni belongs to a vni
filtering vxlan device.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roopa Prabhu [Tue, 1 Mar 2022 05:04:37 +0000 (05:04 +0000)]
selftests: add new tests for vxlan vnifiltering
This patch adds a new test script, test_vxlan_vnifiltering.sh,
with tests for the vni filtering api and various datapath tests.
It also has a test with a mix of traditional, metadata and vni
filtering devices in use at the same time.
Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roopa Prabhu [Tue, 1 Mar 2022 05:04:36 +0000 (05:04 +0000)]
vxlan: vni filtering support on collect metadata device
This patch adds vnifiltering support to collect metadata device.
Motivation:
You can only use a single vxlan collect metadata device for a given
vxlan udp port in the system today. The vxlan collect metadata device
terminates all received vxlan packets. As shown in the below diagram,
there are use-cases where you need to support multiple such vxlan devices in
independent bridge domains. Each vxlan device must terminate the vni's
it is configured for.
Example usecase: In a service provider network a service provider
typically supports multiple bridge domains with overlapping vlans.
One bridge domain per customer. Vlans in each bridge domain are
mapped to globally unique vxlan ranges assigned to each customer.
vnifiltering support in collect metadata devices terminates only configured
vnis. This is similar to vlan filtering in bridge driver. The vni filtering
capability is provided by a new flag on collect metadata device.
In the below pic:
- customer1 is mapped to br1 bridge domain
- customer2 is mapped to br2 bridge domain
- customer1 vlan 10-11 is mapped to vni 1001-1002
- customer2 vlan 10-11 is mapped to vni 2001-2002
- br1 and br2 are vlan filtering bridges
- vxlan1 and vxlan2 are collect metadata devices with
vnifiltering enabled
┌──────────────────────────────────────────────────────────────────┐
│ switch │
│ │
│ ┌───────────┐ ┌───────────┐ │
│ │ │ │ │ │
│ │ br1 │ │ br2 │ │
│ └┬─────────┬┘ └──┬───────┬┘ │
│ vlans│ │ vlans │ │ │
│ 10,11│ │ 10,11│ │ │
│ │ vlanvnimap: │ vlanvnimap: │
│ │ 10-1001,11-1002 │ 10-2001,11-2002 │
│ │ │ │ │ │
│ ┌──────┴┐ ┌──┴─────────┐ ┌───┴────┐ │ │
│ │ swp1 │ │vxlan1 │ │ swp2 │ ┌┴─────────────┐ │
│ │ │ │ vnifilter:│ │ │ │vxlan2 │ │
│ └───┬───┘ │ 1001,1002│ └───┬────┘ │ vnifilter: │ │
│ │ └────────────┘ │ │ 2001,2002 │ │
│ │ │ └──────────────┘ │
│ │ │ │
└───────┼──────────────────────────────────┼───────────────────────┘
│ │
│ │
┌─────┴───────┐ │
│ customer1 │ ┌─────┴──────┐
│ host/VM │ │customer2 │
└─────────────┘ │ host/VM │
└────────────┘
With this implementation, a vxlan dst metadata device can
be associated with a range of vnis.
struct vxlan_vni_node is introduced to represent
a configured vni. We start with vni and its
associated remote_ip in this structure. This
structure can be extended to bring in other
per vni attributes if there are usecases for it.
A vni inherits an attribute from the base vxlan device
if no per-vni attribute is defined.
struct vxlan_dev gets a new rhashtable for
vnis called vxlan_vni_group. vxlan_vnifilter.c
implements the necessary netlink api, notifications
and helper functions to process and manage lifecycle
of vxlan_vni_node.
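The per-vni lookup and the attribute inheritance can be sketched in plain C. Everything here is a hypothetical model: the model_* types are invented for illustration, a linear array stands in for the rhashtable, and a remote_ip of 0 is assumed to mean "not set for this vni".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a configured vni and its attributes. */
struct model_vni_node {
	uint32_t vni;
	uint32_t remote_ip;	/* 0 means no per-vni value is set */
};

/* Hypothetical stand-in for the base device with its vni filter. */
struct model_vxlan_dev {
	uint32_t default_remote_ip;
	const struct model_vni_node *vnis;
	size_t nr_vnis;
};

/* Returns the node for a configured vni, or NULL if it is filtered out. */
const struct model_vni_node *
model_vni_lookup(const struct model_vxlan_dev *dev, uint32_t vni)
{
	for (size_t i = 0; i < dev->nr_vnis; i++)
		if (dev->vnis[i].vni == vni)
			return &dev->vnis[i];
	return NULL;
}

/* Per-vni remote_ip if defined, otherwise inherit from the base device. */
uint32_t model_vni_remote_ip(const struct model_vxlan_dev *dev, uint32_t vni)
{
	const struct model_vni_node *node = model_vni_lookup(dev, vni);

	if (node && node->remote_ip)
		return node->remote_ip;
	return dev->default_remote_ip;
}
```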
This patch also adds new helper functions in vxlan_multicast.c
to handle per vni remote_ip multicast groups which are part
of vxlan_vni_group.
Fix build problems:
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roopa Prabhu [Tue, 1 Mar 2022 05:04:35 +0000 (05:04 +0000)]
vxlan_multicast: Move multicast helpers to a separate file
Subsequent patches will add more helpers.
Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roopa Prabhu [Tue, 1 Mar 2022 05:04:34 +0000 (05:04 +0000)]
rtnetlink: add new rtm tunnel api for tunnel id filtering
This patch adds new rtm tunnel msg and api for tunnel id
filtering in dst_metadata devices. First dst_metadata
device to use the api is vxlan driver with AF_BRIDGE
family.
This and later changes add ability in vxlan driver to do
tunnel id filtering (or vni filtering) on dst_metadata
devices. This is similar to vlan api in the vlan filtering bridge.
This patch includes selinux nlmsg_route_perms support for RTM_*TUNNEL
api from Benjamin Poirier.
Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roopa Prabhu [Tue, 1 Mar 2022 05:04:33 +0000 (05:04 +0000)]
vxlan_core: add helper vxlan_vni_in_use
More users will follow in later patches.
Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roopa Prabhu [Tue, 1 Mar 2022 05:04:32 +0000 (05:04 +0000)]
vxlan_core: make multicast helper take rip and ifindex explicitly
This patch changes multicast helpers to take rip and ifindex as input.
This is needed in future patches where rip can come from a per-vni
structure while the ifindex can come from the vxlan device.
Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roopa Prabhu [Tue, 1 Mar 2022 05:04:31 +0000 (05:04 +0000)]
vxlan_core: move some fdb helpers to non-static
This patch moves some fdb helpers to non-static
for use in later patches. Ideally, all fdb code
could move into its own file vxlan_fdb.c.
This can be done as a subsequent patch and is out
of scope of this series.
Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roopa Prabhu [Tue, 1 Mar 2022 05:04:30 +0000 (05:04 +0000)]
vxlan_core: move common declarations to private header file
This patch moves common structures and global declarations
to a shared private header file, vxlan_private.h. Subsequent
patches use this header file as a common header file for
additional shared declarations.
Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roopa Prabhu [Tue, 1 Mar 2022 05:04:29 +0000 (05:04 +0000)]
vxlan_core: fix build warnings in vxlan_xmit_one
Fix the below build warnings reported by kernel test robot:
- initialize vni in vxlan_xmit_one
- wrap label in ipv6 enabled checks in vxlan_xmit_one
warnings:
static
drivers/net/vxlan/vxlan_core.c:2437:14: warning: variable 'label' set
but not used [-Wunused-but-set-variable]
__be32 vni, label;
^
>> drivers/net/vxlan/vxlan_core.c:2483:7: warning: variable 'vni' is
used uninitialized whenever 'if' condition is true
[-Wsometimes-uninitialized]
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roopa Prabhu [Tue, 1 Mar 2022 05:04:28 +0000 (05:04 +0000)]
vxlan: move to its own directory
vxlan.c has grown too long. This patch moves
it to its own directory. Subsequent patches add new
functionality in new files.
Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Tue, 1 Mar 2022 00:23:58 +0000 (16:23 -0800)]
Merge branch 'mlx5-next' of git://git./linux/kernel/git/mellanox/linux
Saeed Mahameed says:
====================
mlx5-next 2022-22-02
The following PR includes updates to mlx5-next branch:
Headlines:
==========
1) Jakub cleans up unused static inline functions
2) I did some low level firmware command interface return status changes to
provide the caller with full visibility on the error/status returned by
the Firmware.
3) Use the new command interface in RDMA DEVX usecases to avoid flooding
dmesg with some "expected" user error prone use cases.
4) Moshe also uses the new command interface to grab the specific error
code from the MFRL register command to provide the exact reason
why SW reset couldn't be performed internally in FW.
5) From Mark Bloch: Lag, drop packets in hardware when possible
In active-backup mode, the inactive interface's packets are dropped by the
bond device. In switchdev mode, where TC rules are offloaded to the FDB,
this can lead to packets hitting rules in the FDB which, without offload,
would have been dropped before reaching the TC rules in the kernel.
Create a drop rule to make sure packets on inactive ports are dropped
before reaching the FDB.
Listen on NETDEV_CHANGEUPPER / NETDEV_CHANGEINFODATA events and record
the inactive state and offload accordingly.
* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Add clarification on sync reset failure
net/mlx5: Add reset_state field to MFRL register
RDMA/mlx5: Use new command interface API
net/mlx5: cmdif, Refactor error handling and reporting of async commands
net/mlx5: Use mlx5_cmd_do() in core create_{cq,dct}
net/mlx5: cmdif, Add new api for command execution
net/mlx5: cmdif, cmd_check refactoring
net/mlx5: cmdif, Return value improvements
net/mlx5: Lag, offload active-backup drops to hardware
net/mlx5: Lag, record inactive state of bond device
net/mlx5: Lag, don't use magic numbers for ports
net/mlx5: Lag, use local variable already defined to access E-Switch
net/mlx5: E-switch, add drop rule support to ingress ACL
net/mlx5: E-switch, remove special uplink ingress ACL handling
net/mlx5: E-Switch, reserve and use same uplink metadata across ports
net/mlx5: Add ability to insert to specific flow group
mlx5: remove unused static inlines
====================
Link: https://lore.kernel.org/r/20220223233930.319301-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stephen Rothwell [Mon, 28 Feb 2022 17:39:57 +0000 (17:39 +0000)]
net: dm9051: Make remove() callback a void function
Changes introduced since the merge window in the spi subsystem and
available at:
https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git tags/spi-remove-void
make the remove() callback for spi return void rather than int, causing
the newly added dm9051 driver to fail to build. This patch fixes this
issue by converting the remove() function provided by the driver to return
void.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
[Rewrote commit message -- broonie]
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20220228173957.1262628-2-broonie@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Mon, 28 Feb 2022 18:41:31 +0000 (10:41 -0800)]
Merge tag 'spi-remove-void' of https://git./linux/kernel/git/broonie/spi
Mark Brown says:
====================
spi: Make remove() return void
This series from Uwe Kleine-König converts the spi remove function to
return void since there is nothing useful that we can do with a failure,
and as more buses are converted it'll enable further work on the
driver core.
====================
Link: https://lore.kernel.org/r/20220228173957.1262628-2-broonie@kernel.org/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Wang Qing [Mon, 28 Feb 2022 03:15:55 +0000 (19:15 -0800)]
net: decnet: use time_is_before_jiffies() instead of open coding it
Use the helper function time_is_{before,after}_jiffies() to improve
code readability.
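As an illustration of what the helpers replace, here is a userspace model of the wraparound-safe comparison. The model_* names are hypothetical and the current tick count is passed in explicitly instead of being read from a global; the signed-difference trick is the commonly used idiom, shown here as a sketch rather than as the kernel's exact definition.

```c
#include <stdbool.h>

/* True if tick a is later than tick b, even across counter wraparound. */
bool model_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Readable helper: has timestamp t already passed at time now? */
bool model_time_is_before_now(unsigned long now, unsigned long t)
{
	return model_time_after(now, t);
}
```

An open-coded `now > t` gives the wrong answer when the counter has wrapped between the two readings; the signed difference does not, as long as the two values are less than half the counter range apart.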
Signed-off-by: Wang Qing <wangqing@vivo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wang Qing [Mon, 28 Feb 2022 03:13:48 +0000 (19:13 -0800)]
net: wan: lmc: use time_is_before_jiffies() instead of open coding it
Use the helper function time_is_{before,after}_jiffies() to improve
code readability.
Signed-off-by: Wang Qing <wangqing@vivo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wang Qing [Mon, 28 Feb 2022 03:13:31 +0000 (19:13 -0800)]
net: hamradio: use time_is_after_jiffies() instead of open coding it
Use the helper function time_is_{before,after}_jiffies() to improve
code readability.
Signed-off-by: Wang Qing <wangqing@vivo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wang Qing [Mon, 28 Feb 2022 03:13:15 +0000 (19:13 -0800)]
net: ethernet: sun: use time_is_before_jiffies() instead of open coding it
Use the helper function time_is_{before,after}_jiffies() to improve
code readability.
Signed-off-by: Wang Qing <wangqing@vivo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wang Qing [Mon, 28 Feb 2022 03:13:00 +0000 (19:13 -0800)]
net: qlcnic: use time_is_before_jiffies() instead of open coding it
Use the helper function time_is_{before,after}_jiffies() to improve
code readability.
Signed-off-by: Wang Qing <wangqing@vivo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wang Qing [Mon, 28 Feb 2022 03:12:22 +0000 (19:12 -0800)]
net: ethernet: use time_is_before_eq_jiffies() instead of open coding it
Use the helper function time_is_{before,after}_jiffies() to improve
code readability.
Signed-off-by: Wang Qing <wangqing@vivo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Russell King (Oracle) [Sat, 26 Feb 2022 14:56:22 +0000 (14:56 +0000)]
net: phylink: remove phylink_set_pcs()
As all users of phylink_set_pcs() have now been updated to use the
mac_select_pcs() method, it can be removed from the phylink kernel
API and its functionality moved into phylink_major_config().
Removing phylink_set_pcs() gives us a single approach for attaching
a PCS within phylink.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Colin Foster [Sat, 26 Feb 2022 22:36:50 +0000 (14:36 -0800)]
net: dsa: felix: remove prevalidate_phy_mode interface
All users of the felix driver were creating their own prevalidate_phy_mode
function. The same logic can be performed in a more general way by using a
simple array of bit fields.
Signed-off-by: Colin Foster <colin.foster@in-advantage.com>
Suggested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Fri, 25 Feb 2022 17:16:18 +0000 (09:16 -0800)]
ionic: no transition while stopping
Make sure we don't try to transition the fw_status_ready
while we're still in the FW_STOPPING state, else we can
get stuck in limbo waiting on a transition that already
happened.
While we're here we can remove a superfluous check on
the lif pointer.
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 25 Feb 2022 16:18:55 +0000 (08:18 -0800)]
net/sysctl: avoid two synchronize_rcu() calls
Both rps_sock_flow_sysctl() and flow_limit_cpu_sysctl()
use synchronize_rcu() right before freeing memory,
either by vfree() or kfree().
They can switch to kvfree_rcu(ptr) and kfree_rcu(ptr) to benefit
from asynchronous mode, instead of blocking the current thread.
Note that kvfree_rcu(ptr) and kfree_rcu(ptr) may still
have to use synchronize_rcu() in some memory pressure cases.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lorenzo Bianconi [Fri, 25 Feb 2022 15:29:51 +0000 (16:29 +0100)]
net: netsec: enable pp skb recycling
Similar to mvneta or mvpp2, enable page_pool skb recycling for the netsec
driver.
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tony Lu [Fri, 25 Feb 2022 07:34:21 +0000 (15:34 +0800)]
net/smc: Call trace_smc_tx_sendmsg when data corked
This also calls trace_smc_tx_sendmsg() even if data is corked. For ease
of understanding, if statements are not expanded here.
Link: https://lore.kernel.org/all/f4166712-9a1e-51a0-409d-b7df25a66c52@linux.ibm.com/
Fixes: 139653bc6635 ("net/smc: Remove corked dealyed work")
Suggested-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 28 Feb 2022 11:12:39 +0000 (11:12 +0000)]
Merge branch 'flow_offload-tc-police-parameters'
Jianbo Liu says:
====================
flow_offload: add tc police parameters
As a preparation for more advanced police offload in mlx5 (e.g.,
jumping to another chain when bandwidth is not exceeded), extend the
flow offload API with more tc-police parameters. Adjust existing
drivers to reject unsupported configurations.
Changes since v2:
* Rename index to extval in exceed and notexceed acts.
* Add policer validate functions for all drivers.
Changes since v1:
* Add one more strict validation for the control of drop/ok.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jianbo Liu [Thu, 24 Feb 2022 10:29:08 +0000 (10:29 +0000)]
flow_offload: reject offload for all drivers with invalid police parameters
As more police parameters are passed to flow_offload, drivers can check
them to make sure hardware handles packets in the way indicated by tc.
The conform-exceed control should be drop/pipe or drop/ok. Besides,
for drop/ok, the police should be the last action. As hardware can't
configure peakrate/avrate/overhead, offload should not be supported if
any of them is configured.
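The checks listed above can be sketched as one validation function. This is a model under stated assumptions: struct model_police and enum model_act_id are hypothetical types made up for this example, not the flow_offload structures, and -EOPNOTSUPP is assumed as the rejection code.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical action ids for the sketch. */
enum model_act_id {
	MODEL_ACT_DROP,
	MODEL_ACT_PIPE,
	MODEL_ACT_OK,
	MODEL_ACT_OTHER,
};

/* Hypothetical view of the police parameters a driver would inspect. */
struct model_police {
	enum model_act_id exceed;	/* action when the rate is exceeded */
	enum model_act_id notexceed;	/* action when it is not exceeded */
	bool is_last_action;		/* police is the last action in the list */
	unsigned long long peakrate;
	unsigned long long avrate;
	unsigned int overhead;
};

/* Returns 0 if the configuration can be offloaded, -EOPNOTSUPP otherwise. */
int model_police_validate(const struct model_police *p)
{
	if (p->exceed != MODEL_ACT_DROP)
		return -EOPNOTSUPP;	/* only drop/... is accepted */
	if (p->notexceed != MODEL_ACT_PIPE && p->notexceed != MODEL_ACT_OK)
		return -EOPNOTSUPP;	/* only .../pipe or .../ok */
	if (p->notexceed == MODEL_ACT_OK && !p->is_last_action)
		return -EOPNOTSUPP;	/* drop/ok must be the last action */
	if (p->peakrate || p->avrate || p->overhead)
		return -EOPNOTSUPP;	/* hardware cannot configure these */
	return 0;
}
```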
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jianbo Liu [Thu, 24 Feb 2022 10:29:07 +0000 (10:29 +0000)]
net: flow_offload: add tc police action parameters
The current police offload action entry is missing exceed/notexceed
actions and parameters that can be configured by tc police action.
Add the missing parameters as a pre-step for offloading police actions
to hardware.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 27 Feb 2022 11:06:14 +0000 (11:06 +0000)]
Merge branch 'dsa-fdb-isolation'
Vladimir Oltean says:
====================
DSA FDB isolation
There are use cases which need FDB isolation between standalone ports
and bridged ports, as well as isolation between ports of different
bridges. Most of these use cases are a result of the fact that packets
can now be partially forwarded by the software bridge, so one port might
need to send a packet to the CPU but its FDB lookup will see that it can
forward it directly to a bridge port where that packet was autonomously
learned. So the source port will attempt to shortcircuit the CPU and
forward autonomously, which it can't due to the forwarding isolation we
have in place. So we will have packet drops instead of proper operation.
Additionally, before DSA can implement IFF_UNICAST_FLT for standalone
ports, we must have control over which database we install FDB entries
corresponding to port MAC addresses in. We don't want to hinder the
operation of the bridging layer.
DSA does not have a driver API that encourages FDB isolation, so this
needs to be created. The basis for this is a new struct dsa_db which
annotates each FDB and MDB entry with the database it belongs to.
The sja1105 and felix drivers are modified to observe the dsa_db
argument, and therefore, enforce the FDB isolation.
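The idea of annotating each FDB entry with its database can be sketched as follows. This is a hypothetical model only: the model_* types do not reproduce the layout of struct dsa_db, and a flat array stands in for whatever table a switch driver actually uses.

```c
#include <string.h>

/* Hypothetical database identifier: a standalone port or a bridge. */
enum model_db_type { MODEL_DB_PORT, MODEL_DB_BRIDGE };

struct model_db {
	enum model_db_type type;
	int id;		/* port index or bridge number */
};

/* An FDB entry is keyed by (database, MAC), not by MAC alone. */
struct model_fdb_entry {
	struct model_db db;
	unsigned char mac[6];
	int dest_port;
};

/* Returns the destination port, or -1 if this database has no match. */
int model_fdb_lookup(const struct model_fdb_entry *tbl, int n,
		     struct model_db db, const unsigned char *mac)
{
	for (int i = 0; i < n; i++) {
		if (tbl[i].db.type == db.type && tbl[i].db.id == db.id &&
		    memcmp(tbl[i].mac, mac, 6) == 0)
			return tbl[i].dest_port;
	}
	return -1;
}
```

With the database as part of the key, the same MAC address learned in two bridges resolves independently, and a standalone port never sees entries learned by a bridge.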
Compared to the previous RFC patch series from August:
https://patchwork.kernel.org/project/netdevbpf/cover/20210818120150.892647-1-vladimir.oltean@nxp.com/
what is different is that I stopped trying to make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE
blocking, instead I'm making use of the fact that DSA waits for switchdev FDB work
items to finish before a port leaves the bridge. This is possible since:
https://patchwork.kernel.org/project/netdevbpf/patch/20211024171757.3753288-7-vladimir.oltean@nxp.com/
Additionally, v2 is also rebased over the DSA LAG FDB work.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>