linux-2.6-microblaze.git
David S. Miller [Thu, 31 Oct 2019 21:26:38 +0000 (14:26 -0700)]
Merge branch 'net-dsa-replace-routing-tables-with-a-list'

Vivien Didelot says:

====================
net: dsa: replace routing tables with a list

This branch gets rid of the ds->rtable static arrays in favor of
a single dst->rtable list. This allows us to move away from the
DSA_MAX_SWITCHES limitation and simplify the switch fabric setup.

Changes in v2:
  - fix the reverse christmas tree ordering for David
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Thu, 31 Oct 2019 02:09:19 +0000 (22:09 -0400)]
net: dsa: tag_8021q: clarify index limitation

Now that the DSA core no longer restricts switch IDs and port numbers,
only tag_8021q, which currently reserves 3 bits for the switch ID and
4 bits for the port number, limits these values. Update their
descriptions to reflect that.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
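The limits described in the tag_8021q commit above follow directly from
the field widths. Below is a minimal sketch, assuming the 4-bit port
number sits in the low bits with the 3-bit switch ID directly above it;
that bit layout and all sketch_* names are assumptions for illustration,
not the layout used by tag_8021q itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Field widths come from the commit text; the bit positions are an
 * assumption made for this sketch only. */
#define SKETCH_PORT_BITS   4
#define SKETCH_SWITCH_BITS 3
#define SKETCH_PORT_MAX    ((1u << SKETCH_PORT_BITS) - 1)   /* 15 */
#define SKETCH_SWITCH_MAX  ((1u << SKETCH_SWITCH_BITS) - 1) /* 7 */

/* Pack a switch ID and a port number; reject values that do not fit. */
static bool sketch_encode(unsigned int switch_id, unsigned int port,
			  uint16_t *out)
{
	if (switch_id > SKETCH_SWITCH_MAX || port > SKETCH_PORT_MAX)
		return false;

	*out = (uint16_t)((switch_id << SKETCH_PORT_BITS) | port);
	return true;
}

/* Recover the switch ID from a packed value. */
static unsigned int sketch_switch_id(uint16_t v)
{
	return (v >> SKETCH_PORT_BITS) & SKETCH_SWITCH_MAX;
}

/* Recover the port number from a packed value. */
static unsigned int sketch_port(uint16_t v)
{
	return v & SKETCH_PORT_MAX;
}
```

With 3 and 4 bits, the largest encodable switch ID is 7 and the largest
port number is 15; anything above that is rejected by sketch_encode().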
Vivien Didelot [Thu, 31 Oct 2019 02:09:18 +0000 (22:09 -0400)]
net: dsa: remove limitation of switch index value

Because there is no static array describing the links between switches
anymore, we have no reason to force a limitation of the index value
set by the device tree.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Thu, 31 Oct 2019 02:09:17 +0000 (22:09 -0400)]
net: dsa: remove tree functions related to switches

The DSA fabric setup code has been simplified a lot so get rid of
the dsa_tree_remove_switch, dsa_tree_add_switch and dsa_switch_add
helpers, and keep the code simple with only the dsa_switch_probe and
dsa_switch_remove functions.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Thu, 31 Oct 2019 02:09:16 +0000 (22:09 -0400)]
net: dsa: remove the dst->ds array

Now that the DSA ports are listed in the switch fabric, there is
no need to store the dsa_switch structures from the drivers in the
fabric anymore. So get rid of the dst->ds static array.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Thu, 31 Oct 2019 02:09:15 +0000 (22:09 -0400)]
net: dsa: remove switch routing table setup code

The dsa_switch structure has no routing table specific data to setup,
so the switch fabric can directly walk its ports and initialize its
routing table from them.

This allows us to remove the dsa_switch_setup_routing_table function.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Thu, 31 Oct 2019 02:09:14 +0000 (22:09 -0400)]
net: dsa: remove ds->rtable

Drivers no longer use the ds->rtable static arrays, so get rid of them.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Thu, 31 Oct 2019 02:09:13 +0000 (22:09 -0400)]
net: dsa: list DSA links in the fabric

Implement a new list of DSA links in the switch fabric itself, to
provide an alternative to the ds->rtable static arrays.

At the same time, provide a new dsa_routing_port() helper to abstract
the usage of ds->rtable in drivers. If there's no port to reach a
given device, return the first invalid port, ds->num_ports. This avoids
potential signedness errors or the need to define special values.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
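The lookup convention described in the commit above (walk a list of
links, and return the first invalid port number when no route exists)
can be sketched in plain C. The struct and function names below are
hypothetical stand-ins and do not reproduce the kernel's dsa_link or
dsa_routing_port() definitions.

```c
#include <stddef.h>

/* Hypothetical stand-in for a switch: only the fields the sketch needs. */
struct sketch_switch {
	unsigned int index;
	unsigned int num_ports;
};

/* Hypothetical stand-in for one link in the fabric-wide list. */
struct sketch_link {
	const struct sketch_switch *src_ds; /* switch owning the local port */
	unsigned int src_port;              /* local port used for this link */
	unsigned int dst_index;             /* index of the switch it reaches */
	struct sketch_link *next;
};

/* Return the local port of ds that reaches switch "device", or
 * ds->num_ports (the first invalid port) when there is no such link.
 * No negative sentinel is needed, so the result can stay unsigned. */
static unsigned int sketch_routing_port(const struct sketch_switch *ds,
					const struct sketch_link *rtable,
					unsigned int device)
{
	for (const struct sketch_link *dl = rtable; dl; dl = dl->next)
		if (dl->src_ds == ds && dl->dst_index == device)
			return dl->src_port;

	return ds->num_ports;
}
```

A caller can then test the result with a plain range check
(port < ds->num_ports) instead of comparing against a special value.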
David S. Miller [Thu, 31 Oct 2019 21:19:45 +0000 (14:19 -0700)]
Merge branch 'dpaa2-eth-add-MAC-PHY-support-through-phylink'

Ioana Ciornei says:

====================
dpaa2-eth: add MAC/PHY support through phylink

The dpaa2-eth driver now has support for connecting to its associated PHY
device found through standard OF bindings. The PHY interaction is handled
by PHYLINK and even though, at the moment, only RGMII_* phy modes are
supported by the driver, this is just the first step into adding the
necessary changes to support the entire spectrum of capabilities.

This comes after feedback on the initial DPAA2 MAC RFC submitted here:
https://lwn.net/Articles/791182/

The notable change is that now, the DPMAC is not a separate driver, and
communication between the DPMAC and DPNI no longer happens through
firmware. Rather, the DPMAC is now a set of API functions that other
net_device drivers (DPNI, DPSW, etc) can use for PHY management.

The change is incremental, because the DPAA2 architecture has many modes of
connecting net devices in hardware loopback (for example DPNI to DPNI).
Those operating modes do not have a DPMAC and phylink instance.

The documentation patch provides a more complete view of the software
architecture and the current implementation.

Changes in v2:
 - added patch 1/5 in order to fix module build
 - use -ENOTCONN as a proper return error of dprc_get_connection()
 - move the locks to rtnl outside of dpaa2_eth_[dis]connect_mac functions
 - remove setting supported/advertised from .validate()

Changes in v3:
 - remove an unused variable

Changes in v4:
 - use ERR_PTR instead of plain NULL
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
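The changelog above mentions returning -ENOTCONN and using ERR_PTR
instead of plain NULL. The kernel provides ERR_PTR(), PTR_ERR() and
IS_ERR() in linux/err.h for this; the following is only a userspace
imitation of the idea, with hypothetical sketch_* names, showing how an
errno value can travel inside a pointer so that the caller learns why a
lookup failed.

```c
#include <errno.h>
#include <stdint.h>

/* Error codes are encoded in the last 4095 values of the address space,
 * mirroring the convention used by the kernel helpers. */
#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *sketch_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Decode the errno value carried by an error pointer. */
static inline long sketch_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* True when the pointer carries an error code instead of an address. */
static inline int sketch_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)(-SKETCH_MAX_ERRNO);
}
```

Unlike a plain NULL, a value such as sketch_err_ptr(-ENOTCONN) lets the
caller distinguish "not connected" from other failure reasons.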
Ioana Ciornei [Wed, 30 Oct 2019 23:18:32 +0000 (01:18 +0200)]
net: documentation: add docs for MAC/PHY support in DPAA2

Add documentation file for the MAC/PHY support in the DPAA2
architecture. This describes the architecture and implementation of the
interface between phylink and a DPAA2 network driver.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ioana Ciornei [Wed, 30 Oct 2019 23:18:31 +0000 (01:18 +0200)]
dpaa2-eth: add MAC/PHY support through phylink

The dpaa2-eth driver now has support for connecting to its associated
PHY device found through standard OF bindings.

This happens when the DPNI object (that the driver probes on) gets
connected to a DPMAC. When that happens, the device tree is looked up by
the DPMAC ID, and the associated PHY bindings are found.

The old logic of handling the net device's link state by hand still
needs to be kept, as the DPNI can be connected to other devices on the
bus than a DPMAC: other DPNI, DPSW ports, etc. This logic is only
engaged when there is no DPMAC (and therefore no phylink instance)
attached.

The MC firmware supports multiple types of DPMAC links: TYPE_FIXED,
TYPE_PHY. The TYPE_FIXED mode does not require any DPMAC management from
the Linux side, and as such, the driver will not handle such a DPMAC.

Although PHYLINK typically handles SFP cages and in-band AN modes, for
the moment the driver only supports the RGMII interfaces found on the
LX2160A. Support for other modes will come later.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ioana Ciornei [Wed, 30 Oct 2019 23:18:30 +0000 (01:18 +0200)]
dpaa2-eth: update the TX frame queues on DPNI_IRQ_EVENT_ENDPOINT_CHANGED

Currently update_tx_fqids() is called at every link up event, although
the FQID values will only change when the DPNI is disconnected from the
current object and reconnected to a different one.

The patch also avoids the forward declaration of update_tx_fqids.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ioana Ciornei [Wed, 30 Oct 2019 23:18:29 +0000 (01:18 +0200)]
bus: fsl-mc: add the fsl_mc_get_endpoint function

Using the newly added fsl_mc_get_endpoint function, an fsl-mc driver can
find its associated endpoint (the object at the other end of an MC
firmware link).

The API will be used in the following patch in order to discover the
connected DPMAC object of a DPNI.

Also, the fsl_mc_device_lookup function is made available to the entire
fsl-mc bus driver and not just for the dprc driver.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ioana Ciornei [Wed, 30 Oct 2019 23:18:28 +0000 (01:18 +0200)]
bus: fsl-mc: export device types present on the bus

Export all device types present on the fsl-mc bus in order to be able to
actually use the is_fsl_mc_bus_*() functions from drivers on the bus.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 31 Oct 2019 21:14:53 +0000 (14:14 -0700)]
Merge branch 'sfc-Add-XDP-support'

Charles McLachlan says:

====================
sfc: Add XDP support

Supply the XDP callbacks in netdevice ops that enable lower level processing
of XDP frames.

Changes in v4:
- Handle the failure to send some frames in efx_xdp_tx_buffers() properly.

Changes in v3:
- Fix a BUG_ON when trying to allocate piobufs to xdp queues.
- Add a missed trace_xdp_exception.

Changes in v2:
- Use of xdp_return_frame_rx_napi() in tx.c
- Addition of xdp_rxq_info_valid and xdp_rxq_info_failed to track when
  xdp_rxq_info failures occur.
- Renaming of rc to err and more use of unlikely().
- Cut some duplicated code and fix an array overrun.
- Actually increment n_rx_xdp_tx when packets are transmitted.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Charles McLachlan [Thu, 31 Oct 2019 10:24:23 +0000 (10:24 +0000)]
sfc: add XDP counters to ethtool stats

Count XDP packet drops, error drops, transmissions and redirects and
expose these counters via the ethtool stats command.

Signed-off-by: Charles McLachlan <cmclachlan@solarflare.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Charles McLachlan [Thu, 31 Oct 2019 10:24:12 +0000 (10:24 +0000)]
sfc: handle XDP_TX outcomes of XDP eBPF programs

Provide an ndo_xdp_xmit function that uses the XDP tx queue for this
CPU to send the packet.

Signed-off-by: Charles McLachlan <cmclachlan@solarflare.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Charles McLachlan [Thu, 31 Oct 2019 10:23:49 +0000 (10:23 +0000)]
sfc: allocate channels for XDP tx queues

Each CPU needs access to its own queue to allow uncontested
transmission of XDP_TX packets. This means we need to allocate (up
front) enough channels ("xdp transmit channels") to provide at least
one extra tx queue per CPU. These tx queues should not do TSO.

Signed-off-by: Charles McLachlan <cmclachlan@solarflare.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
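The sizing rule in the commit above (at least one extra tx queue per
CPU) reduces to a round-up division once the number of tx queues each
channel provides is known. The sketch below assumes that number is
passed in as a parameter; the function name and the parameter are
hypothetical and are not taken from the sfc driver.

```c
/* Number of extra channels needed so that every CPU gets its own XDP
 * tx queue, given how many tx queues one channel provides.
 * txq_per_channel must be non-zero. */
static unsigned int sketch_xdp_channels(unsigned int ncpus,
					unsigned int txq_per_channel)
{
	/* Round up: a partially used channel still has to be allocated. */
	return (ncpus + txq_per_channel - 1) / txq_per_channel;
}
```

For example, 9 CPUs with 4 tx queues per channel need 3 channels, of
which the last one is only partially used.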
Charles McLachlan [Thu, 31 Oct 2019 10:23:37 +0000 (10:23 +0000)]
sfc: Enable setting of xdp_prog

Provide an ndo_bpf function to efx_netdev_ops that allows setting and
querying of xdp programs on an interface.

Also check that the MTU size isn't too big when setting a program or
when the MTU is explicitly set.

Signed-off-by: Charles McLachlan <cmclachlan@solarflare.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Charles McLachlan [Thu, 31 Oct 2019 10:23:23 +0000 (10:23 +0000)]
sfc: perform XDP processing on received packets

Adds a field to hold an attached xdp_prog, but never populates it (see
following patch).  Also, XDP_TX support is deferred to a later patch
in the series.

Track failures of xdp_rxq_info_reg() via per-queue xdp_rxq_info_valid
flags and a per-nic xdp_rxq_info_failed flag. The per-queue flags are
needed to prevent attempts to xdp_rxq_info_unreg() structs that failed
to register.  Possibly the API could be changed in the future to avoid
the need for these flags.

Signed-off-by: Charles McLachlan <cmclachlan@solarflare.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Charles McLachlan [Thu, 31 Oct 2019 10:23:10 +0000 (10:23 +0000)]
sfc: support encapsulation of xdp_frames in efx_tx_buffer

Add a field to efx_tx_buffer so that we can track xdp_frames. Add a
flag so that buffers that contain xdp_frames can be identified and
passed to xdp_return_frame.

Signed-off-by: Charles McLachlan <cmclachlan@solarflare.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nathan Chancellor [Wed, 30 Oct 2019 16:01:52 +0000 (09:01 -0700)]
mlxsw: Fix 64-bit division in mlxsw_sp_sb_prs_init

When building for 32-bit ARM, there is a link time error because of a
64-bit division:

ld.lld: error: undefined symbol: __aeabi_uldivmod
>>> referenced by spectrum_buffers.c
>>>               net/ethernet/mellanox/mlxsw/spectrum_buffers.o:(mlxsw_sp_buffers_init) in archive drivers/built-in.a
>>> did you mean: __aeabi_uidivmod
>>> defined in: arch/arm/lib/lib.a(lib1funcs.o

Avoid this by using div_u64, which is designed to avoid this problem.

Fixes: bc9f6e94bcb5 ("mlxsw: spectrum_buffers: Calculate the size of the main pool")
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
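The link error in the commit above appears because a plain '/' on a
64-bit value makes the compiler call a runtime helper on 32-bit ARM,
and the kernel does not provide that helper. The sketch below is not
how div_u64 is implemented; it is a hypothetical shift-and-subtract
routine that only illustrates that a 64-bit by 32-bit division can be
done without emitting a 64-bit '/' at all.

```c
#include <stdint.h>

/* Divide a 64-bit dividend by a non-zero 32-bit divisor using only
 * shifts, comparisons and subtractions (binary long division). */
static uint64_t sketch_div_u64(uint64_t dividend, uint32_t divisor)
{
	uint64_t quotient = 0;
	uint64_t rem = 0; /* always stays below divisor, so it never overflows */

	for (int i = 63; i >= 0; i--) {
		/* Bring down the next bit of the dividend. */
		rem = (rem << 1) | ((dividend >> i) & 1);
		if (rem >= divisor) {
			rem -= divisor;
			quotient |= 1ULL << i;
		}
	}

	return quotient;
}
```

In kernel code the practical fix is simply to call div_u64(), as the
commit does; the routine above is for illustration only.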
David S. Miller [Thu, 31 Oct 2019 19:32:59 +0000 (12:32 -0700)]
Merge branch 's390-next'

Julian Wiedmann says:

====================
s390/qeth: updates 2019-10-31

please apply the following series of spooky qeth updates for net-next.

The first two patches add support for an enhanced TX doorbell, which
enables us to do more xmit_more-based bulking.
Note that this requires one patch for the s390/qdio base layer, which
has been graciously acked by Heiko to go through your tree.

The remaining patches are just the usual minor cleanups/improvements.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Thu, 31 Oct 2019 12:42:21 +0000 (13:42 +0100)]
s390/qeth: don't cache MAC addresses for multicast IPs

Instead of storing the multicast-mapped MAC address in an IP address
object, just calculate the MAC address when actually building a cmd
for the IP address.

While at it, also clean up some rather verbose copying of IP addresses.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Thu, 31 Oct 2019 12:42:20 +0000 (13:42 +0100)]
s390/qeth: use helpers for IP address hashing

Replace our custom implementations with the stack's version of IP address
hashing.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Thu, 31 Oct 2019 12:42:19 +0000 (13:42 +0100)]
s390/qeth: don't set card state in qeth_qdio_clear_card()

Any change to the card state should only be driven by
qeth_l?_set_online() and qeth_l?_stop_card().

qeth_qdio_clear_card() currently also gets called from
(a) qeth_core_shutdown(), where we haven't walked through the whole
    teardown sequence. So changing the state to DOWN is not accurate.
(b) qeth_core_hardsetup_card(), which is only called while the card is
    still in DOWN state. No change in behaviour here.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Thu, 31 Oct 2019 12:42:18 +0000 (13:42 +0100)]
s390/qeth: consolidate some duplicated HW cmd code

When setting a device online, both subdrivers have the same code to
program the HW trap and Isolation mode. Move that code into a single
place.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Thu, 31 Oct 2019 12:42:17 +0000 (13:42 +0100)]
s390/qeth: keep IRQ disabled until NAPI is really done

When napi_complete_done() returns false, the NAPI instance is still
active and we can keep the IRQ disabled a little longer.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Thu, 31 Oct 2019 12:42:16 +0000 (13:42 +0100)]
s390/qeth: use QDIO_BUFNR()

qdio.h recently gained a new helper macro that handles wrap-around on a
QDIO queue. Use it consistently across all of qeth.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
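A wrap-around helper of the kind mentioned in the commit above is
typically a mask over a power-of-two queue size. The sketch below
assumes a queue of 128 buffers; the size and the SKETCH_* names are
assumptions for illustration and are not copied from qdio.h.

```c
/* Assumed queue size; it must be a power of two for the mask to work. */
#define SKETCH_Q_SIZE 128u

/* Map any buffer number onto a valid index of the circular queue. */
#define SKETCH_BUFNR(n) ((n) & (SKETCH_Q_SIZE - 1))

/* Advance an index by count buffers, wrapping at the end of the queue. */
static unsigned int sketch_next(unsigned int idx, unsigned int count)
{
	return SKETCH_BUFNR(idx + count);
}
```

Using one macro everywhere avoids open-coded modulo or compare-and-reset
logic at each call site.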
Julian Wiedmann [Thu, 31 Oct 2019 12:42:15 +0000 (13:42 +0100)]
s390/qeth: use IQD Multi-Write

For IQD devices with Multi-Write support, we can defer the queue-flush
further and transmit multiple IO buffers with a single TX doorbell.
The same-target restriction still applies.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Thu, 31 Oct 2019 12:42:14 +0000 (13:42 +0100)]
s390/qdio: implement IQD Multi-Write

This allows IQD drivers to send out multiple SBALs with a single SIGA
instruction.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 31 Oct 2019 19:13:34 +0000 (12:13 -0700)]
Merge branch 'DPAA-Ethernet-changes'

Madalin Bucur says:

====================
DPAA Ethernet changes

v2: remove excess braces

Here are some more changes for the DPAA 1.x area.
In summary, these changes use pages for the receive buffers and
for the scatter-gather table fed to the HW on the Tx path, perform
a bit of cleanup in some convoluted parts of the code, add some
minor fixes related to DMA (un)mapping sequencing for a not so
common scenario, and add a device link that removes the interfaces
when the QMan portal in use by them is removed.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Madalin Bucur [Thu, 31 Oct 2019 14:37:59 +0000 (16:37 +0200)]
dpaa_eth: register a device link for the qman portal used

Before this change, unbinding the QMan portals did not trigger a
corresponding unbinding of the dpaa_eth making use of it; the first
QMan portal related operation issued afterwards crashed the kernel.
The device link ensures the dpaa_eth dependency upon the qman portal
used is honoured at the QMan portal removal.

Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Madalin Bucur [Thu, 31 Oct 2019 14:37:58 +0000 (16:37 +0200)]
soc: fsl: qbman: allow registering a device link for the portal user

Introduce the API required to make sure that the devices that use
the QMan portal are unbound when the portal is unbound.

Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Madalin Bucur [Thu, 31 Oct 2019 14:37:57 +0000 (16:37 +0200)]
dpaa_eth: extend delays in ndo_stop

Make sure all the frames that are in flight have time to be processed
before the interface is completely brought down. Add a missing delay
for the Rx path.

Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Madalin Bucur [Thu, 31 Oct 2019 14:37:56 +0000 (16:37 +0200)]
dpaa_eth: remove netdev_err() for user errors

User reports that an application making an (incorrect) call to
restart AN on a fixed link DPAA interface triggers an error in
the kernel log while the returned EINVAL should be enough.

Reported-by: Joakim Tjernlund <Joakim.Tjernlund@infinera.com>
Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Madalin Bucur [Thu, 31 Oct 2019 14:37:55 +0000 (16:37 +0200)]
dpaa_eth: add dropped frames to percpu ethtool stats

Prior to this change, the frames dropped on receive or transmit
were not displayed in the ethtool statistics, leaving the dropped
frames unaccounted for.

Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Madalin Bucur [Thu, 31 Oct 2019 14:37:54 +0000 (16:37 +0200)]
dpaa_eth: use a page to store the SGT

Use a page to store the scatter gather table on the transmit path.

Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Madalin Bucur [Thu, 31 Oct 2019 14:37:53 +0000 (16:37 +0200)]
dpaa_eth: cleanup skb_to_contig_fd()

Remove cast, align variable name, simplify DMA map size computation.

Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Madalin Bucur [Thu, 31 Oct 2019 14:37:52 +0000 (16:37 +0200)]
dpaa_eth: use fd information in dpaa_cleanup_tx_fd()

Instead of reading skb fields, use information from the DPAA frame
descriptor.

Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Madalin Bucur [Thu, 31 Oct 2019 14:37:51 +0000 (16:37 +0200)]
dpaa_eth: simplify variables used in dpaa_cleanup_tx_fd()

Avoid casts and repeated conversions.

Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Madalin Bucur [Thu, 31 Oct 2019 14:37:50 +0000 (16:37 +0200)]
dpaa_eth: avoid timestamp read on error paths

The dpaa_cleanup_tx_fd() function is called by the frame transmit
confirmation callback but also on several error paths. This function
is reading the transmit timestamp value. Avoid reading an invalid
timestamp value on the error paths.

Fixes: 4664856e9ca2 ("dpaa_eth: add support for hardware timestamping")
Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Madalin Bucur [Thu, 31 Oct 2019 14:37:49 +0000 (16:37 +0200)]
dpaa_eth: perform DMA unmapping before read

DMA unmapping is required before accessing the HW provided timestamping
information.

Fixes: 4664856e9ca2 ("dpaa_eth: add support for hardware timestamping")
Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Madalin Bucur [Thu, 31 Oct 2019 14:37:48 +0000 (16:37 +0200)]
dpaa_eth: use page backed rx buffers

Change the buffers used for reception from netdev_frags to pages.

Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Madalin Bucur [Thu, 31 Oct 2019 14:37:47 +0000 (16:37 +0200)]
dpaa_eth: use only one buffer pool per interface

Currently the DPAA Ethernet driver is using three buffer pools
for each interface, with three different sizes for the buffers
provided for the FMan reception path. This patch reduces the
number of buffer pools to one per interface. This change is in
preparation for another one that will switch from netdev_frags
to page backed buffers for the receive path.

Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 31 Oct 2019 19:03:29 +0000 (12:03 -0700)]
Merge branch 'net-hns3-add-some-optimizations-and-cleanups'

Huazhong Tan says:

====================
net: hns3: add some optimizations and cleanups

This series adds some code optimizations and cleanups for
the HNS3 ethernet driver.

[patch 1/9] dumps some debug information when a reset fails.

[patch 2/9] dumps some struct netdev_queue information on a
TX timeout.

[patch 3/9] cleans up some magic numbers.

[patch 4/9] cleans up some coding style issues.

[patch 5/9] fixes a compiler warning.

[patch 6/9] optimizes some local variable initialization.

[patch 7/9] modifies some comments.

[patch 8/9] cleans up some print format warnings.

[patch 9/9] cleans up byte order issues.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Guojia Liao [Thu, 31 Oct 2019 11:23:24 +0000 (19:23 +0800)]
net: hns3: cleanup byte order issues when printed

Though the hip08 and the IMP (Intelligent Management Processor)
have the same byte order right now, it is better to convert
__be or __le variables into the CPU's byte order before printing.

Signed-off-by: Guojia Liao <liaoguojia@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
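The kernel offers le16_to_cpu(), le32_to_cpu() and their big-endian
counterparts for the conversion described in the commit above. As a
userspace illustration only, the hypothetical helpers below decode
little-endian bytes into a native integer in a way that gives the same
value on any host, which is the property the commit wants for printed
values.

```c
#include <stdint.h>

/* Decode two little-endian bytes into a native 16-bit value. */
static uint16_t sketch_le16_to_cpu(const uint8_t raw[2])
{
	return (uint16_t)(raw[0] | (raw[1] << 8));
}

/* Decode four little-endian bytes into a native 32-bit value. */
static uint32_t sketch_le32_to_cpu(const uint8_t raw[4])
{
	return (uint32_t)raw[0] |
	       ((uint32_t)raw[1] << 8) |
	       ((uint32_t)raw[2] << 16) |
	       ((uint32_t)raw[3] << 24);
}
```

Printing the converted value, rather than the raw field, keeps the log
output correct even if the device and the CPU disagree on byte order.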
Guojia Liao [Thu, 31 Oct 2019 11:23:23 +0000 (19:23 +0800)]
net: hns3: cleanup some print format warning

Using '%d' to print an unsigned int, or '%u' to print an int,
causes static tools to give false warnings, so this patch cleans
up these warnings by using the format specifier that matches the
type of each variable.

Also modify the types of some variables and macros to match
their usage.

Signed-off-by: Guojia Liao <liaoguojia@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
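The rule applied in the commit above is that the conversion specifier
should match the argument's type: '%u' for unsigned int and '%d' for
int. A minimal sketch with a hypothetical helper and message text:

```c
#include <stdio.h>

/* Format a message with specifiers that match the argument types:
 * queue_id is unsigned int, so it uses %u; status is int, so it uses %d. */
static int sketch_format(char *buf, size_t len,
			 unsigned int queue_id, int status)
{
	return snprintf(buf, len, "queue %u status %d", queue_id, status);
}
```

With matching specifiers, a large unsigned value such as 4000000000 is
printed as written instead of being interpreted as a negative number.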
Guangbin Huang [Thu, 31 Oct 2019 11:23:22 +0000 (19:23 +0800)]
net: hns3: add or modify some comments

This patch corrects the comment for the macro
HCLGE_MBX_GET_VF_FLR_STATUS and adds comments in some places to make
the code more readable.

Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Guangbin Huang [Thu, 31 Oct 2019 11:23:21 +0000 (19:23 +0800)]
net: hns3: optimize local variable initialization

The variable tx_ring does not need to be initialized, as it is set
before it is used, and the variable rst_cnt is better initialized at
its declaration for simplicity.

Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Guojia Liao [Thu, 31 Oct 2019 11:23:20 +0000 (19:23 +0800)]
net: hns3: cleanup a format-truncation warning

In hns3_nic_init_irq(), if '*_int_idx' had more than 9 digits
and the netdev's name were IFNAMSIZ long, the final name would
be longer than HNAE3_INT_NAME_LEN - 1. '*_int_idx' will never
have such a large value, but the compiler still gives a
format-truncation warning for this case.

So this patch just enlarges the length to avoid this warning.

Signed-off-by: Guojia Liao <liaoguojia@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
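The warning discussed in the commit above is about the worst-case
length of a formatted name exceeding its buffer. The sketch below uses
a hypothetical name format and helper (not the hns3 code) to show how
snprintf() reports that the output would not fit, and why a larger
buffer removes the problem.

```c
#include <stdbool.h>
#include <stdio.h>

/* Build an interrupt name from a device name and an index.
 * Returns true when the whole name fit into buf, false when
 * snprintf() had to truncate it (or failed). */
static bool sketch_irq_name(char *buf, size_t len,
			    const char *netdev, int idx)
{
	/* snprintf() returns the length the full string would need,
	 * not counting the terminating NUL. */
	int n = snprintf(buf, len, "%s-TxRx-%d", netdev, idx);

	return n >= 0 && (size_t)n < len;
}
```

A 16-byte buffer cannot hold a 13-character device name plus the suffix
and a 9-digit index, while a buffer sized for the worst case can.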
Guangbin Huang [Thu, 31 Oct 2019 11:23:19 +0000 (19:23 +0800)]
net: hns3: cleanup some coding style issues

To unify code style and make code simpler, this patch modifies
some code, deletes unnecessary blank lines and {}, changes
location of code, and so on.

No functional change.

Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Guojia Liao [Thu, 31 Oct 2019 11:23:18 +0000 (19:23 +0800)]
net: hns3: cleanup some magic numbers

To make the code more readable, this patch replaces
some magic numbers with macros or sizeof operations.

It also uses the macros lower_32_bits and upper_32_bits
to get bits 0-31 and 32-63 of a number, instead of
using type conversion and the '>>' operation.

No functional change.

Signed-off-by: Guojia Liao <liaoguojia@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
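The kernel defines lower_32_bits() and upper_32_bits() for the split
described in the commit above. The macros below are a userspace
imitation with hypothetical sketch_* names, showing what the two halves
are and why a named helper reads better than an open-coded cast and
shift.

```c
#include <stdint.h>

/* Bits 0-31 of a value. */
#define sketch_lower_32_bits(n) ((uint32_t)((n) & 0xffffffffu))

/* Bits 32-63 of a value; widening first keeps the shift well defined
 * even when n is a 32-bit type. */
#define sketch_upper_32_bits(n) ((uint32_t)(((uint64_t)(n)) >> 32))
```

For 0x1122334455667788, the lower half is 0x55667788 and the upper half
is 0x11223344.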
Yunsheng Lin [Thu, 31 Oct 2019 11:23:17 +0000 (19:23 +0800)]
net: hns3: add struct netdev_queue debug info for TX timeout

When there is a TX timeout, we can tell whether the driver or the
stack has stopped the queue by looking at the state field, and when
the last packet was transmitted by looking at the trans_start field.

So this patch prints these two fields in
hns3_get_tx_timeo_queue_info().

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Huazhong Tan [Thu, 31 Oct 2019 11:23:16 +0000 (19:23 +0800)]
net: hns3: dump some debug information when reset fail

When a reset fails, some information helps in finding out why it
failed, so dump it. Also remove the unused core_rst_cnt field in
struct hclge_rst_stats.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 31 Oct 2019 18:00:45 +0000 (11:00 -0700)]
Merge branch 'bnxt_en-Add-OP-TEE-based-bnxt-f-w-manager'

Sheetal Tigadoli says:

====================
bnxt_en: Add OP-TEE based bnxt f/w manager

This patch series adds support for TEE based BNXT firmware
management module and the driver changes to invoke OP-TEE
APIs to fastboot firmware and to collect crash dump.

Changes from v4:
 - update Kconfig to reflect dependency on TEE driver
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Vasundhara Volam [Thu, 31 Oct 2019 10:08:52 +0000 (15:38 +0530)]
bnxt_en: Add support to collect crash dump via ethtool

The driver supports 2 types of core dumps.

1. Live dump - Firmware dump when system is up and running.
2. Crash dump - Dump which is collected during firmware crash
                that can be retrieved after recovery.

Crash dump is currently supported only on specific 58800 chips,
where it can be retrieved using the OP-TEE API only, as firmware
cannot access this region directly.

The user needs to set the dump flag using the following command before
initiating the dump collection:

    $ ethtool -W|--set-dump eth0 N

Where N is "0" for live dump and "1" for crash dump.

Command to collect the dump after setting the flag:

    $ ethtool -w eth0 data Filename
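
As a rough illustration of the two-step flow above, the following Python
sketch only builds the argument vectors for the two ethtool invocations
quoted in this message. The helper names set_dump_cmd() and get_dump_cmd()
are hypothetical and are not part of ethtool or the driver; nothing is
executed here.

```python
# Sketch only: builds the ethtool command lines quoted in the commit message.
# Helper names are hypothetical; the flag values 0/1 come from the text above.

LIVE_DUMP = 0   # firmware dump while the system is up and running
CRASH_DUMP = 1  # dump collected during a firmware crash


def set_dump_cmd(dev, flag):
    """Return argv for `ethtool -W DEV N` (select live or crash dump)."""
    if flag not in (LIVE_DUMP, CRASH_DUMP):
        raise ValueError("dump flag must be 0 (live) or 1 (crash)")
    return ["ethtool", "-W", dev, str(flag)]


def get_dump_cmd(dev, filename):
    """Return argv for `ethtool -w DEV data FILE` (collect the dump)."""
    return ["ethtool", "-w", dev, "data", filename]


if __name__ == "__main__":
    # The flag must be set before the dump is collected.
    print(" ".join(set_dump_cmd("eth0", CRASH_DUMP)))
    print(" ".join(get_dump_cmd("eth0", "crash.bin")))
```

The lists could be passed to subprocess.run() on a system with a supported
device; that step is left out because it needs real hardware.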

v3: Modify set_dump to support even when CONFIG_TEE_BNXT_FW=n.
Also change log message to netdev_info().

Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Sheetal Tigadoli <sheetal.tigadoli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agobnxt_en: Add support to invoke OP-TEE API to reset firmware
Vasundhara Volam [Thu, 31 Oct 2019 10:08:51 +0000 (15:38 +0530)]
bnxt_en: Add support to invoke OP-TEE API to reset firmware

In the error recovery process, when firmware indicates that it is
completely down, initiate a firmware reset by calling the OP-TEE API.

Cc: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Sheetal Tigadoli <sheetal.tigadoli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agofirmware: broadcom: add OP-TEE based BNXT f/w manager
Vikas Gupta [Thu, 31 Oct 2019 10:08:50 +0000 (15:38 +0530)]
firmware: broadcom: add OP-TEE based BNXT f/w manager

This driver registers on the TEE bus to interact with OP-TEE based
BNXT firmware management modules.

Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Sheetal Tigadoli <sheetal.tigadoli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge branch 'mlxsw-Make-port-split-code-more-generic'
David S. Miller [Thu, 31 Oct 2019 17:54:47 +0000 (10:54 -0700)]
Merge branch 'mlxsw-Make-port-split-code-more-generic'

Ido Schimmel says:

====================
mlxsw: Make port split code more generic

Jiri says:

Currently, we assume some limitations and constant values which are not
applicable to Spectrum-3, which has 8-lane ports (instead of the previous
4 lanes).

This patch does 2 things:

1) Generalizes the code to not use constants so it can work for 4, 8 and
   possibly 16 lanes.

2) Enforces some assumptions we had in the code but did not check.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agomlxsw: spectrum: Generalize split count check
Jiri Pirko [Thu, 31 Oct 2019 09:42:21 +0000 (11:42 +0200)]
mlxsw: spectrum: Generalize split count check

Make the check generic for any possible value, not only 2 and 4.
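
A minimal sketch of what a generic split count check could look like, written
in Python for brevity. The rule used here (count is a power of two, greater
than 1, and no larger than the maximal module width) is an assumption made for
illustration; the commit message does not spell out the exact condition, and
split_count_valid() is a hypothetical name.

```python
# Sketch only: one plausible "generic" split count check.
# The exact rule is an assumption, not taken from the driver source.


def is_power_of_two(n):
    """True for 1, 2, 4, 8, ... (positive integers with a single bit set)."""
    return n > 0 and (n & (n - 1)) == 0


def split_count_valid(count, max_width):
    """Accept any power-of-two count in (1, max_width], not only 2 and 4."""
    return count > 1 and is_power_of_two(count) and count <= max_width


if __name__ == "__main__":
    for width in (4, 8):
        ok = [c for c in range(1, 17) if split_count_valid(c, width)]
        print("max_width", width, "valid counts", ok)
```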

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agomlxsw: spectrum: Iterate over all ports in gap during unsplit create
Jiri Pirko [Thu, 31 Oct 2019 09:42:20 +0000 (11:42 +0200)]
mlxsw: spectrum: Iterate over all ports in gap during unsplit create

During recreation of original unsplit ports, just simply iterate over
the whole gap and recreate whatever originally existed.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agomlxsw: spectrum: Fix base port get for split count 4 and 8
Jiri Pirko [Thu, 31 Oct 2019 09:42:19 +0000 (11:42 +0200)]
mlxsw: spectrum: Fix base port get for split count 4 and 8

The current code considers only split by 2 or 4. Make the base port
lookup generic and allow split by 8 to be handled correctly. Generalize
the used port checks as well.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agomlxsw: spectrum: Use port_module_max_width to compute base port index
Jiri Pirko [Thu, 31 Oct 2019 09:42:18 +0000 (11:42 +0200)]
mlxsw: spectrum: Use port_module_max_width to compute base port index

Instead of using constant value, use port_module_max_width which is
aligned with the cluster size.
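
To make the idea concrete, here is a small Python sketch of computing a
cluster base port from the maximal module width instead of a constant. It
assumes local ports are numbered from 1 and that clusters are aligned to
max_width; both points, and the name cluster_base_port(), are assumptions for
illustration rather than the driver's actual code.

```python
# Sketch only: base port of the cluster a local port belongs to.
# Assumes 1-based local port numbers and clusters aligned to max_width.


def cluster_base_port(local_port, max_width):
    """Round local_port down to the first port of its max_width-sized cluster."""
    offset = (local_port - 1) % max_width
    return local_port - offset


if __name__ == "__main__":
    for width in (4, 8):
        bases = [cluster_base_port(p, width) for p in range(1, 17)]
        print("max_width", width, bases)
```

With a width of 4 the clusters start at ports 1, 5, 9, ...; with a width of 8
they start at 1, 9, 17, ..., which is why a hard-coded constant cannot serve
both cases.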

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agomlxsw: spectrum: Remember split base local port and use it in unsplit
Jiri Pirko [Thu, 31 Oct 2019 09:42:17 +0000 (11:42 +0200)]
mlxsw: spectrum: Remember split base local port and use it in unsplit

Don't compute the original base local port during unsplit, rather
remember it in mlxsw_sp_port structure during split port creation.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agomlxsw: spectrum: Introduce resource for getting offset of 4 lanes split port
Jiri Pirko [Thu, 31 Oct 2019 09:42:16 +0000 (11:42 +0200)]
mlxsw: spectrum: Introduce resource for getting offset of 4 lanes split port

In Spectrum-3 the modules have 8 lanes, so split by count 2 results in
two split ports each of 4 lanes. Add a resource that can be used to
obtain local port offset in that case.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agomlxsw: spectrum: Push getting offsets of split ports into a helper
Jiri Pirko [Thu, 31 Oct 2019 09:42:15 +0000 (11:42 +0200)]
mlxsw: spectrum: Push getting offsets of split ports into a helper

Get local port offsets of split port in a separate helper function and
use it in both split and unsplit functions.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agomlxsw: spectrum: Add sanity checks into module info get
Jiri Pirko [Thu, 31 Oct 2019 09:42:14 +0000 (11:42 +0200)]
mlxsw: spectrum: Add sanity checks into module info get

Driver assumes certain values in the PMLP register. Add checks that
verify that PMLP register provides fitting values.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agomlxsw: spectrum: Pass mapping values in port mapping structure
Jiri Pirko [Thu, 31 Oct 2019 09:42:13 +0000 (11:42 +0200)]
mlxsw: spectrum: Pass mapping values in port mapping structure

Pass the port mapping structure down to create, module_map and other
functions instead of individual values.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agomlxsw: spectrum: Use mapping of port being split for creating split ports
Jiri Pirko [Thu, 31 Oct 2019 09:42:12 +0000 (11:42 +0200)]
mlxsw: spectrum: Use mapping of port being split for creating split ports

Don't use constant max width value and instead of that, use the actual
width of the port. Also don't pass module value and use the value
stored in the same structure.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agomlxsw: spectrum: Replace port_to_module array with array of structs
Jiri Pirko [Thu, 31 Oct 2019 09:42:11 +0000 (11:42 +0200)]
mlxsw: spectrum: Replace port_to_module array with array of structs

Store the initial PMLP register configuration into array of structures
instead of just simple array of module numbers.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agomlxsw: spectrum: Distinguish between unsplittable and split port
Jiri Pirko [Thu, 31 Oct 2019 09:42:10 +0000 (11:42 +0200)]
mlxsw: spectrum: Distinguish between unsplittable and split port

Currently, when a user attempts a split, they cannot tell whether the
port cannot be split because it is already split, or because it cannot
be split at all. Add another check for the split flag to distinguish
these cases. Also add a check forbidding split when the maximal width is 1.
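
The distinction described above can be sketched as two separate checks that
report different reasons. This Python sketch is illustrative only: the
function name, the order of the checks and the message strings are all
hypothetical and are not taken from the driver.

```python
# Sketch only: report *why* a port cannot be split.
# Names, ordering and messages are hypothetical.
from typing import Optional


def split_precheck(already_split, max_width):
    # type: (bool, int) -> Optional[str]
    """Return an error string, or None if the port may be split."""
    if already_split:
        # The port is itself the result of an earlier split.
        return "Port is already split"
    if max_width == 1:
        # A single-lane module has nothing to split.
        return "Port cannot be split"
    return None


if __name__ == "__main__":
    print(split_precheck(True, 4))
    print(split_precheck(False, 1))
    print(split_precheck(False, 8))
```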

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agomlxsw: spectrum: Move max_width check up before count check
Jiri Pirko [Thu, 31 Oct 2019 09:42:09 +0000 (11:42 +0200)]
mlxsw: spectrum: Move max_width check up before count check

The fact that the port cannot be split further should be checked before
checking the count, so move it.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agomlxsw: spectrum: Use PMTM register to get max module width
Jiri Pirko [Thu, 31 Oct 2019 09:42:08 +0000 (11:42 +0200)]
mlxsw: spectrum: Use PMTM register to get max module width

Currently the max module width is hard-coded according to ASIC type.
That is not entirely correct, as the max module width might differ
per-board. Use PMTM register to query FW for maximal width of a module.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agomlxsw: reg: Add Port Module Type Mapping Register
Jiri Pirko [Thu, 31 Oct 2019 09:42:07 +0000 (11:42 +0200)]
mlxsw: reg: Add Port Module Type Mapping Register

The PMTM allows query or configuration of module types.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agomlxsw: reg: Extend PMLP tx/rx lane value size to 4 bits
Jiri Pirko [Thu, 31 Oct 2019 09:42:06 +0000 (11:42 +0200)]
mlxsw: reg: Extend PMLP tx/rx lane value size to 4 bits

The tx/rx lane fields got extended to 4 bits, update the reg field
description accordingly.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agocxgb4/l2t: Simplify 't4_l2e_free()' and '_t4_l2e_free()'
Christophe JAILLET [Thu, 31 Oct 2019 05:53:45 +0000 (06:53 +0100)]
cxgb4/l2t: Simplify 't4_l2e_free()' and '_t4_l2e_free()'

Use '__skb_queue_purge()' instead of re-implementing it.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge branch 'Control-action-percpu-counters-allocation-by-netlink-flag'
David S. Miller [Thu, 31 Oct 2019 01:07:51 +0000 (18:07 -0700)]
Merge branch 'Control-action-percpu-counters-allocation-by-netlink-flag'

Vlad Buslov says:

====================
Control action percpu counters allocation by netlink flag

Currently, a significant fraction of CPU time during TC filter allocation
is spent in the percpu allocator. Moreover, the percpu allocator is
protected by a single global mutex, which negates any potential to improve
its performance by means of recent developments in the TC filter update API
that removed the rtnl lock for some Qdiscs and classifiers. In order to
significantly improve the filter update rate and reduce memory usage, we
would like to allow users to skip percpu counters allocation for a
specific action if they don't expect a high traffic rate hitting the
action, which is a reasonable expectation for a hardware-offloaded setup.
In that case, any gains in software fast-path performance from using
percpu-allocated counters rather than regular integer counters protected
by a spinlock are not important, but the additional CPU time and memory
consumed by them are significant.

In order to allow configuring action counters allocation type at
runtime, implement the following changes:

- Implement helper functions to update the action counters and use them
  in affected actions instead of updating counters directly. This step
  abstracts the action implementations from the counter types that are
  being used for a particular action instance at runtime.

- Modify the new helpers to use percpu counters if they were allocated
  during action initialization and use regular counters otherwise.

- Extend action UAPI TCA_ACT space with TCA_ACT_FLAGS field. Add
  TCA_ACT_FLAGS_NO_PERCPU_STATS action flag and update
  hardware-offloaded actions to not allocate percpu counters when the
  flag is set.

With these changes, users who prefer action update slow-path speed over
software fast-path speed can dynamically request actions to skip percpu
counters allocation without affecting other users.

Now, let's look at the actual performance gains provided by this change.
A simple test is used to measure the insertion rate: iproute2 TC is
executed in parallel by xargs in batch mode, and its total execution time
is measured by the shell builtin "time" command. The command runs 20
concurrent tc instances, each with its own batch file with 100k rules:

$ time ls add* | xargs -n 1 -P 20 sudo tc -b

Two main rule profiles are tested. The first is a simple L2 flower
classifier with a single gact drop action. This configuration is chosen as
a worst case scenario because with single-action rules pressure on the
percpu allocator is minimized. Example rule:

filter add dev ens1f0 protocol ip ingress prio 1 handle 1 flower skip_hw
    src_mac e4:11:0:0:0:0 dst_mac e4:12:0:0:0:0 action drop

The second profile is a typical real-world scenario that uses the flower
classifier with some L2-4 fields and two actions (tunnel_key+mirred).
Example rule:

filter add dev ens1f0_0 protocol ip ingress prio 1 handle 1 flower
    skip_hw src_mac e4:11:0:0:0:0 dst_mac e4:12:0:0:0:0 src_ip
    192.168.111.1 dst_ip 192.168.111.2 ip_proto udp dst_port 1 src_port
    1 action tunnel_key set id 1 src_ip 2.2.2.2 dst_ip 2.2.2.3 dst_port
    4789 action mirred egress redirect dev vxlan1

 Profile           |        percpu |     no_percpu | X improvement
                   | (k rules/sec) | (k rules/sec) |
-------------------+---------------+---------------+---------------
 Gact drop         |           203 |           259 |          1.28
 tunnel_key+mirred |            92 |           204 |          2.22
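
The "X improvement" column can be recomputed from the two rate columns. The
short Python sketch below does only that arithmetic on the numbers from the
table; the function name improvement() is made up for this example.

```python
# Recompute the "X improvement" column from the measured insertion rates.
# Rates are in k rules/sec, copied from the table above.

RATES = {
    "Gact drop": (203, 259),
    "tunnel_key+mirred": (92, 204),
}


def improvement(percpu_rate, no_percpu_rate):
    """Speed-up factor of the no_percpu run over the percpu run."""
    return no_percpu_rate / percpu_rate


if __name__ == "__main__":
    for profile, (percpu, no_percpu) in RATES.items():
        factor = improvement(percpu, no_percpu)
        print("%-18s %.2fx (+%.0f%%)" % (profile, factor, (factor - 1) * 100))
```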

For the simple drop action, removing percpu allocation leads to a ~28%
insertion rate improvement. Perf profiles highlight the bottlenecks.

Perf profile of run with percpu allocation (gact drop):

+ 89.11% 0.48% tc [kernel.vmlinux] [k] entry_SYSCALL_64
+ 88.58% 0.04% tc [kernel.vmlinux] [k] do_syscall_64
+ 87.50% 0.04% tc libc-2.29.so [.] __libc_sendmsg
+ 86.96% 0.04% tc [kernel.vmlinux] [k] __sys_sendmsg
+ 86.85% 0.01% tc [kernel.vmlinux] [k] ___sys_sendmsg
+ 86.60% 0.05% tc [kernel.vmlinux] [k] sock_sendmsg
+ 86.55% 0.12% tc [kernel.vmlinux] [k] netlink_sendmsg
+ 86.04% 0.13% tc [kernel.vmlinux] [k] netlink_unicast
+ 85.42% 0.03% tc [kernel.vmlinux] [k] netlink_rcv_skb
+ 84.68% 0.04% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg
+ 84.56% 0.24% tc [kernel.vmlinux] [k] tc_new_tfilter
+ 75.73% 0.65% tc [cls_flower] [k] fl_change
+ 71.30% 0.03% tc [kernel.vmlinux] [k] tcf_exts_validate
+ 71.27% 0.13% tc [kernel.vmlinux] [k] tcf_action_init
+ 71.06% 0.01% tc [kernel.vmlinux] [k] tcf_action_init_1
+ 70.41% 0.04% tc [act_gact] [k] tcf_gact_init
+ 53.59% 1.21% tc [kernel.vmlinux] [k] __mutex_lock.isra.0
+ 52.34% 0.34% tc [kernel.vmlinux] [k] tcf_idr_create
- 51.23% 2.17% tc [kernel.vmlinux] [k] pcpu_alloc
  - 49.05% pcpu_alloc
    + 39.35% __mutex_lock.isra.0
      4.99% memset_erms
    + 2.16% pcpu_alloc_area
  + 2.17% __libc_sendmsg
+ 45.89% 44.33% tc [kernel.vmlinux] [k] osq_lock
+ 9.94% 0.04% tc [kernel.vmlinux] [k] tcf_idr_check_alloc
+ 7.76% 0.00% tc [kernel.vmlinux] [k] tcf_idr_insert
+ 6.50% 0.03% tc [kernel.vmlinux] [k] tfilter_notify
+ 6.24% 6.11% tc [kernel.vmlinux] [k] mutex_spin_on_owner
+ 5.73% 5.32% tc [kernel.vmlinux] [k] memset_erms
+ 5.31% 0.18% tc [kernel.vmlinux] [k] tcf_fill_node

Here the bottleneck is clearly in the pcpu_alloc() function, which takes
more than half of the CPU time, mostly wasted busy-waiting for the percpu
allocator's internal global lock.

With percpu allocation removed (gact drop):

+ 87.50% 0.51% tc [kernel.vmlinux] [k] entry_SYSCALL_64
+ 86.94% 0.07% tc [kernel.vmlinux] [k] do_syscall_64
+ 85.75% 0.04% tc libc-2.29.so [.] __libc_sendmsg
+ 85.00% 0.07% tc [kernel.vmlinux] [k] __sys_sendmsg
+ 84.84% 0.07% tc [kernel.vmlinux] [k] ___sys_sendmsg
+ 84.59% 0.01% tc [kernel.vmlinux] [k] sock_sendmsg
+ 84.58% 0.14% tc [kernel.vmlinux] [k] netlink_sendmsg
+ 83.95% 0.12% tc [kernel.vmlinux] [k] netlink_unicast
+ 83.34% 0.01% tc [kernel.vmlinux] [k] netlink_rcv_skb
+ 82.39% 0.12% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg
+ 82.16% 0.25% tc [kernel.vmlinux] [k] tc_new_tfilter
+ 75.13% 0.84% tc [cls_flower] [k] fl_change
+ 69.92% 0.05% tc [kernel.vmlinux] [k] tcf_exts_validate
+ 69.87% 0.11% tc [kernel.vmlinux] [k] tcf_action_init
+ 69.61% 0.02% tc [kernel.vmlinux] [k] tcf_action_init_1
- 68.80% 0.10% tc [act_gact] [k] tcf_gact_init
  - 68.70% tcf_gact_init
    + 36.08% tcf_idr_check_alloc
    + 31.88% tcf_idr_insert
+ 63.72% 0.58% tc [kernel.vmlinux] [k] __mutex_lock.isra.0
+ 58.80% 56.68% tc [kernel.vmlinux] [k] osq_lock
+ 36.08% 0.04% tc [kernel.vmlinux] [k] tcf_idr_check_alloc
+ 31.88% 0.01% tc [kernel.vmlinux] [k] tcf_idr_insert

The gact actions (like all other action types) are inserted in a single
idr instance protected by a global (per namespace) lock, which becomes the
new bottleneck with such a simple rule profile and prevents achieving the
2x+ performance increase that could be expected from the profiling data
for action insertion with percpu counters.

Perf profile of run with percpu allocation (tunnel_key+mirred):

+ 91.95% 0.21% tc [kernel.vmlinux] [k] entry_SYSCALL_64
+ 91.74% 0.06% tc [kernel.vmlinux] [k] do_syscall_64
+ 90.74% 0.01% tc libc-2.29.so [.] __libc_sendmsg
+ 90.52% 0.01% tc [kernel.vmlinux] [k] __sys_sendmsg
+ 90.50% 0.04% tc [kernel.vmlinux] [k] ___sys_sendmsg
+ 90.41% 0.02% tc [kernel.vmlinux] [k] sock_sendmsg
+ 90.38% 0.04% tc [kernel.vmlinux] [k] netlink_sendmsg
+ 90.10% 0.06% tc [kernel.vmlinux] [k] netlink_unicast
+ 89.76% 0.01% tc [kernel.vmlinux] [k] netlink_rcv_skb
+ 89.28% 0.04% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg
+ 89.15% 0.03% tc [kernel.vmlinux] [k] tc_new_tfilter
+ 83.41% 0.33% tc [cls_flower] [k] fl_change
+ 81.17% 0.04% tc [kernel.vmlinux] [k] tcf_exts_validate
+ 81.13% 0.06% tc [kernel.vmlinux] [k] tcf_action_init
+ 81.04% 0.04% tc [kernel.vmlinux] [k] tcf_action_init_1
- 73.59% 2.16% tc [kernel.vmlinux] [k] pcpu_alloc
  - 71.42% pcpu_alloc
    + 61.41% __mutex_lock.isra.0
      5.02% memset_erms
    + 2.93% pcpu_alloc_area
  + 2.16% __libc_sendmsg
+ 63.58% 0.17% tc [kernel.vmlinux] [k] tcf_idr_create
+ 63.40% 0.60% tc [kernel.vmlinux] [k] __mutex_lock.isra.0
+ 57.85% 56.38% tc [kernel.vmlinux] [k] osq_lock
+ 46.27% 0.13% tc [act_tunnel_key] [k] tunnel_key_init
+ 34.26% 0.02% tc [act_mirred] [k] tcf_mirred_init
+ 10.99% 0.00% tc [kernel.vmlinux] [k] dst_cache_init
+ 5.32% 5.11% tc [kernel.vmlinux] [k] memset_erms

With twice as many actions, pressure on the percpu allocator doubles, so
now it takes ~74% of CPU execution time.

With percpu allocation removed (tunnel_key+mirred):

+ 86.02% 0.50% tc [kernel.vmlinux] [k] entry_SYSCALL_64
+ 85.51% 0.12% tc [kernel.vmlinux] [k] do_syscall_64
+ 84.40% 0.03% tc libc-2.29.so [.] __libc_sendmsg
+ 83.84% 0.03% tc [kernel.vmlinux] [k] __sys_sendmsg
+ 83.72% 0.01% tc [kernel.vmlinux] [k] ___sys_sendmsg
+ 83.56% 0.01% tc [kernel.vmlinux] [k] sock_sendmsg
+ 83.50% 0.08% tc [kernel.vmlinux] [k] netlink_sendmsg
+ 83.02% 0.17% tc [kernel.vmlinux] [k] netlink_unicast
+ 82.48% 0.00% tc [kernel.vmlinux] [k] netlink_rcv_skb
+ 81.89% 0.11% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg
+ 81.71% 0.25% tc [kernel.vmlinux] [k] tc_new_tfilter
+ 73.99% 0.63% tc [cls_flower] [k] fl_change
+ 69.72% 0.00% tc [kernel.vmlinux] [k] tcf_exts_validate
+ 69.72% 0.09% tc [kernel.vmlinux] [k] tcf_action_init
+ 69.53% 0.05% tc [kernel.vmlinux] [k] tcf_action_init_1
+ 53.08% 0.91% tc [kernel.vmlinux] [k] __mutex_lock.isra.0
+ 45.52% 43.99% tc [kernel.vmlinux] [k] osq_lock
- 36.02% 0.21% tc [act_tunnel_key] [k] tunnel_key_init
  - 35.81% tunnel_key_init
    + 15.95% tcf_idr_check_alloc
    + 13.91% tcf_idr_insert
    - 4.70% dst_cache_init
      + 4.68% pcpu_alloc
+ 33.22% 0.04% tc [kernel.vmlinux] [k] tcf_idr_check_alloc
+ 32.34% 0.05% tc [act_mirred] [k] tcf_mirred_init
+ 28.24% 0.01% tc [kernel.vmlinux] [k] tcf_idr_insert
+ 7.79% 0.05% tc [kernel.vmlinux] [k] idr_alloc_u32
+ 7.67% 7.35% tc [kernel.vmlinux] [k] idr_get_free
+ 6.46% 6.22% tc [kernel.vmlinux] [k] mutex_spin_on_owner
+ 5.11% 0.05% tc [kernel.vmlinux] [k] tfilter_notify

With percpu allocation removed, the insertion rate increases by ~120%.
This rule profile scales much better than the simple single-action one
because both types of actions were competing for a single lock in the
percpu allocator, but not for the action idr lock, which is per-action.
Note that the percpu allocator is still used by dst_cache in tunnel_key
actions and consumes 4.68% of CPU time. Dst_cache seems like a good
opportunity for further insertion rate optimization but is not addressed
by this change.

Another improvement provided by this change is significantly reduced
memory usage. The test is implemented by sampling the "used memory" value
from "vmstat -s" command output. The following table includes memory usage
measurements for the same two configurations that were used for measuring
the insertion rate:

 Profile           | Mem per rule | Mem per rule no_percpu | Less memory used
                   |         (KB) |                   (KB) |             (KB)
-------------------+--------------+------------------------+------------------
 Gact drop         |         3.91 |                   2.51 |              1.4
 tunnel_key+mirred |         6.73 |                   3.91 |              2.8

Results indicate that the memory usage of the percpu allocator per action
is ~1.4 KB. Note that any measurement of percpu allocator memory usage is
inherently tied to a particular setup, since memory usage is linear in the
number of cores in the system. It is to be expected that on current
top-of-the-line servers percpu allocator memory usage will be 2-5x higher
than on the 24-CPU setup that was used for testing.
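
The per-action figure and the linear scaling claim can be checked with a few
lines of arithmetic. In the Python sketch below, the table values and the 24
CPU count come from this message; the action counts per rule (1 and 2) follow
from the rule profiles, and scaled_kb() is a plain linear extrapolation added
for illustration, not a measurement.

```python
# Arithmetic on the memory table above; values are KB per rule.
# (percpu KB, no_percpu KB, number of actions in the rule)
MEM_KB = {
    "Gact drop": (3.91, 2.51, 1),
    "tunnel_key+mirred": (6.73, 3.91, 2),
}

REF_CPUS = 24  # CPU count of the test setup


def saved_per_action(percpu_kb, no_percpu_kb, n_actions):
    """Memory saved per action when percpu counters are skipped."""
    return (percpu_kb - no_percpu_kb) / n_actions


def scaled_kb(per_action_kb, cpus, ref_cpus=REF_CPUS):
    """Linear extrapolation of percpu memory to another CPU count."""
    return per_action_kb * cpus / ref_cpus


if __name__ == "__main__":
    for profile, row in MEM_KB.items():
        print("%-18s %.2f KB per action" % (profile, saved_per_action(*row)))
    for cpus in (48, 96, 128):
        print("%3d CPUs: ~%.1f KB per action" % (cpus, scaled_kb(1.4, cpus)))
```

Both profiles land at roughly 1.4 KB per action, which is consistent with the
cost being tied to the number of actions rather than to the rule as a whole.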

Setup details: 2x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 32GB memory

Patches applied on top of net-next branch:

commit 2203cbf2c8b58a1e3bef98c47531d431d11639a0 (net-next)
Author: Russell King <rmk+kernel@armlinux.org.uk>
Date:   Tue Oct 15 11:38:39 2019 +0100

    net: sfp: move fwnode parsing into sfp-bus layer

Changes V1 -> V2:

- Include memory measurements.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agotc-testing: implement tests for new fast_init action flag
Vlad Buslov [Wed, 30 Oct 2019 14:09:07 +0000 (16:09 +0200)]
tc-testing: implement tests for new fast_init action flag

Add basic tests to verify action creation with new fast_init flag for all
actions that support the flag.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: sched: update action implementations to support flags
Vlad Buslov [Wed, 30 Oct 2019 14:09:06 +0000 (16:09 +0200)]
net: sched: update action implementations to support flags

Extend struct tc_action with new "tcfa_flags" field. Set the field in
tcf_idr_create() function and provide new helper
tcf_idr_create_from_flags() that derives 'cpustats' boolean from flags
value. Update individual hardware-offloaded actions init() to pass their
"flags" argument to new helper in order to skip percpu stats allocation
when user requested it through flags.
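
Deriving the 'cpustats' boolean from the flags value amounts to testing one
bit. The Python sketch below models only that step. The numeric value given
to TCA_ACT_FLAGS_NO_PERCPU_STATS is an assumption for illustration, and
cpustats_from_flags() is a hypothetical stand-in, not the kernel helper
itself.

```python
# Sketch only: derive "allocate percpu stats?" from the action flags.
# The numeric flag value is assumed for illustration.

TCA_ACT_FLAGS_NO_PERCPU_STATS = 1


def cpustats_from_flags(flags):
    """Percpu stats are allocated unless the user set the NO_PERCPU flag."""
    return not (flags & TCA_ACT_FLAGS_NO_PERCPU_STATS)


if __name__ == "__main__":
    print(cpustats_from_flags(0))
    print(cpustats_from_flags(TCA_ACT_FLAGS_NO_PERCPU_STATS))
```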

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: sched: extend TCA_ACT space with TCA_ACT_FLAGS
Vlad Buslov [Wed, 30 Oct 2019 14:09:05 +0000 (16:09 +0200)]
net: sched: extend TCA_ACT space with TCA_ACT_FLAGS

Extend TCA_ACT space with nla_bitfield32 flags. Add
TCA_ACT_FLAGS_NO_PERCPU_STATS as the only allowed flag. Parse the flags in
tcf_action_init_1() and pass resulting value as additional argument to
a_o->init().
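
A bitfield32 attribute carries a value and a selector, and only bits chosen
by the selector take effect. The Python sketch below shows one way such an
attribute could be validated against a mask of allowed flags. The validation
rules, the Bitfield32 type and parse_act_flags() are assumptions written for
this example and should not be read as the kernel's netlink policy code.

```python
# Sketch only: validate a (value, selector) pair against allowed flags.
# Types, names and rules here are illustrative assumptions.
from typing import NamedTuple

TCA_ACT_FLAGS_NO_PERCPU_STATS = 1  # assumed numeric value


class Bitfield32(NamedTuple):
    value: int     # requested bit states
    selector: int  # which bits the sender actually cares about


def parse_act_flags(bf, allowed=TCA_ACT_FLAGS_NO_PERCPU_STATS):
    """Return the effective flags, rejecting anything outside `allowed`."""
    if bf.selector & ~allowed:
        raise ValueError("selector contains unsupported flag bits")
    if bf.value & ~allowed:
        raise ValueError("value contains unsupported flag bits")
    if bf.value & ~bf.selector:
        raise ValueError("value sets bits that are not selected")
    return bf.value & bf.selector


if __name__ == "__main__":
    print(parse_act_flags(Bitfield32(value=1, selector=1)))
    print(parse_act_flags(Bitfield32(value=0, selector=1)))
```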

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: sched: modify stats helper functions to support regular stats
Vlad Buslov [Wed, 30 Oct 2019 14:09:04 +0000 (16:09 +0200)]
net: sched: modify stats helper functions to support regular stats

Modify the stats update helper functions introduced in previous patches in
this series to fall back to regular tc_action->tcfa_{b|q}stats if cpu stats
are not allocated for the action argument. If regular non-percpu allocated
counters are in use, then obtain the action tcfa_lock while modifying them.
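
The fallback can be modelled in user space as follows. This Python sketch is
an analogy only: a list with one slot per CPU stands in for percpu counters,
threading.Lock stands in for tcfa_lock, and the Action class and function
names are hypothetical.

```python
# Sketch only: update percpu counters when present, else locked regular ones.
import threading


class Action:
    def __init__(self, n_cpus, percpu_stats=True):
        self.lock = threading.Lock()  # stands in for tcfa_lock
        self.bytes = 0                # regular counters
        self.packets = 0
        # One [bytes, packets] slot per CPU, or None if not allocated.
        self.cpu_bstats = (
            [[0, 0] for _ in range(n_cpus)] if percpu_stats else None
        )


def action_bstats_update(act, cpu, nbytes, npackets=1):
    """Fast path: touch only this CPU's slot. Fallback: take the lock."""
    if act.cpu_bstats is not None:
        slot = act.cpu_bstats[cpu]
        slot[0] += nbytes
        slot[1] += npackets
        return
    with act.lock:
        act.bytes += nbytes
        act.packets += npackets


def action_bstats_read(act):
    """Sum percpu slots if present, otherwise read the regular counters."""
    if act.cpu_bstats is not None:
        return (sum(s[0] for s in act.cpu_bstats),
                sum(s[1] for s in act.cpu_bstats))
    with act.lock:
        return (act.bytes, act.packets)


if __name__ == "__main__":
    for percpu in (True, False):
        act = Action(n_cpus=4, percpu_stats=percpu)
        action_bstats_update(act, cpu=0, nbytes=64)
        action_bstats_update(act, cpu=3, nbytes=128)
        print(percpu, action_bstats_read(act))
```

Callers use the same helper in both modes, which is the point of hiding the
counter type behind the update function.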

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: sched: don't expose action qstats to skb_tc_reinsert()
Vlad Buslov [Wed, 30 Oct 2019 14:09:03 +0000 (16:09 +0200)]
net: sched: don't expose action qstats to skb_tc_reinsert()

The previous commit introduced a helper function for updating qstats and
refactored a set of actions to use the helpers instead of modifying qstats
directly. However, one of the affected actions exposes its qstats to
skb_tc_reinsert(), which then modifies it.

Refactor skb_tc_reinsert() to return an integer error code and not increment
overlimit qstats in case of error, and use the returned error code in
tcf_mirred_act() to manually increment the overlimit counter with the new
helper function.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: sched: extract qstats update code into functions
Vlad Buslov [Wed, 30 Oct 2019 14:09:02 +0000 (16:09 +0200)]
net: sched: extract qstats update code into functions

Extract common code that increments cpu_qstats counters into standalone act
API functions. Change hardware offloaded actions that use percpu counter
allocation to use the new functions instead of accessing cpu_qstats
directly.

This commit doesn't change functionality.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: sched: extract bstats update code into function
Vlad Buslov [Wed, 30 Oct 2019 14:09:01 +0000 (16:09 +0200)]
net: sched: extract bstats update code into function

Extract common code that increments cpu_bstats counter into standalone act
API function. Change hardware offloaded actions that use percpu counter
allocation to use the new function instead of incrementing cpu_bstats
directly.

This commit doesn't change functionality.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: sched: extract common action counters update code into function
Vlad Buslov [Wed, 30 Oct 2019 14:09:00 +0000 (16:09 +0200)]
net: sched: extract common action counters update code into function

Currently, all implementations of tc_action_ops->stats_update() callback
have almost exactly the same implementation of counters update
code (besides gact which also updates drop counter). In order to simplify
support for using both percpu-allocated and regular action counters
depending on run-time flag in following patches, extract action counters
update code into standalone function in act API.

This commit doesn't change functionality.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: qrtr: Simplify 'qrtr_tun_release()'
Christophe JAILLET [Wed, 30 Oct 2019 06:36:40 +0000 (07:36 +0100)]
net: qrtr: Simplify 'qrtr_tun_release()'

Use 'skb_queue_purge()' instead of re-implementing it.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next...
David S. Miller [Thu, 31 Oct 2019 00:51:25 +0000 (17:51 -0700)]
Merge branch '1GbE' of git://git./linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
1GbE Intel Wired LAN Driver Updates 2019-10-29

This series contains updates to e1000e, igb, ixgbe and i40e drivers.

Sasha adds support for Intel client platforms Comet Lake and Tiger Lake
to the e1000e driver.  Also fixes a recently introduced compiler warning
that appears when CONFIG_PM_SLEEP is not defined, by wrapping the code
that requires this kernel configuration.

Alex fixes a potential race condition between network configuration and
power management for e1000e, which is similar to a past issue in the igb
driver.  Also provided a bit of code cleanup since the driver no longer
checks for __E1000_DOWN.

Josh Hunt adds UDP segmentation offload support for igb, ixgbe and i40e.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agowimax: use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs fops
zhong jiang [Wed, 30 Oct 2019 02:55:34 +0000 (10:55 +0800)]
wimax: use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs fops

It is clearer to use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs file
operations rather than DEFINE_SIMPLE_ATTRIBUTE.

It is detected with the help of coccinelle.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: dsa: add ethtool pause configuration support
Heiner Kallweit [Tue, 29 Oct 2019 21:32:48 +0000 (22:32 +0100)]
net: dsa: add ethtool pause configuration support

This patch adds glue logic to make pause settings per port
configurable via ethtool.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agovxlan: drop "vxlan" parameter in vxlan_fdb_alloc()
Guillaume Nault [Tue, 29 Oct 2019 20:57:10 +0000 (21:57 +0100)]
vxlan: drop "vxlan" parameter in vxlan_fdb_alloc()

This parameter has never been used.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: phy: marvell: add downshift support for 88E1145
Heiner Kallweit [Tue, 29 Oct 2019 19:25:26 +0000 (20:25 +0100)]
net: phy: marvell: add downshift support for 88E1145

Add downshift support for 88E1145; it uses the same downshift
configuration registers as 88E1111.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge branch 'ICMP-flow-improvements'
David S. Miller [Thu, 31 Oct 2019 00:21:35 +0000 (17:21 -0700)]
Merge branch 'ICMP-flow-improvements'

Matteo Croce says:

====================
ICMP flow improvements

This series improves the flow inspector handling of ICMP packets:
The first two patches just add some comments in the code which would have saved
me a few minutes of time, and refactor a piece of code.
The third one adds to the flow inspector the capability to extract the
Identifier field, if present, so echo requests and replies are classified
as part of the same flow.
The fourth patch uses the function introduced earlier in the bonding driver,
so echo replies can be balanced across bonding slaves.

v1 -> v2:
 - remove unused struct members
 - add a helper to check for the Id field
 - use a local flow_dissector_key in the bonding to avoid
   changing behaviour of the flow dissector
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agobonding: balance ICMP echoes in layer3+4 mode
Matteo Croce [Tue, 29 Oct 2019 13:50:53 +0000 (14:50 +0100)]
bonding: balance ICMP echoes in layer3+4 mode

The bonding driver uses the L4 ports to balance flows between slaves. As the
ICMP protocol has no ports, those packets are all sent to the same device:

    # tcpdump -qltnni veth0 ip |sed 's/^/0: /' &
    # tcpdump -qltnni veth1 ip |sed 's/^/1: /' &
    # ping -qc1 192.168.0.2
    1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 315, seq 1, length 64
    1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 315, seq 1, length 64
    # ping -qc1 192.168.0.2
    1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 316, seq 1, length 64
    1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 316, seq 1, length 64
    # ping -qc1 192.168.0.2
    1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 317, seq 1, length 64
    1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 317, seq 1, length 64

But some ICMP packets have an Identifier field which is
used to match packets within sessions. Let's use this value in the hash
function to balance these packets between bond slaves:

    # ping -qc1 192.168.0.2
    0: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 303, seq 1, length 64
    0: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 303, seq 1, length 64
    # ping -qc1 192.168.0.2
    1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 304, seq 1, length 64
    1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 304, seq 1, length 64

Also, let's use a flow_dissector_key which defines FLOW_DISSECTOR_KEY_ICMP,
so we can balance pings encapsulated in a tunnel when using mode encap3+4:

    # ping -q 192.168.1.2 -c1
    0: IP 192.168.0.1 > 192.168.0.2: GREv0, length 102: IP 192.168.1.1 > 192.168.1.2: ICMP echo request, id 585, seq 1, length 64
    0: IP 192.168.0.2 > 192.168.0.1: GREv0, length 102: IP 192.168.1.2 > 192.168.1.1: ICMP echo reply, id 585, seq 1, length 64
    # ping -q 192.168.1.2 -c1
    1: IP 192.168.0.1 > 192.168.0.2: GREv0, length 102: IP 192.168.1.1 > 192.168.1.2: ICMP echo request, id 586, seq 1, length 64
    1: IP 192.168.0.2 > 192.168.0.1: GREv0, length 102: IP 192.168.1.2 > 192.168.1.1: ICMP echo reply, id 586, seq 1, length 64
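
The effect shown in the captures above can be illustrated with a small
userspace sketch. This is not the bonding driver's hash: the function name
toy_pick_slave() and the XOR-and-fold mixing are assumptions chosen only to
show why adding the ICMP Identifier to the hash input spreads pings across
slaves, while a hash over addresses alone always picks the same one.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy slave selection (hypothetical, not the kernel's algorithm):
 * XOR the addresses with the ICMP Identifier, fold the upper bits
 * down, and reduce modulo the number of slaves.
 * Pass icmp_id = 0 to model a hash that ignores the Identifier.
 */
unsigned int toy_pick_slave(uint32_t saddr, uint32_t daddr,
			    uint16_t icmp_id, unsigned int n_slaves)
{
	uint32_t hash;

	assert(n_slaves > 0);

	hash = saddr ^ daddr ^ icmp_id;
	hash ^= hash >> 16;	/* fold high half into low half */
	hash ^= hash >> 8;	/* fold again so low bits depend on more input */

	return hash % n_slaves;
}
```

Because XOR is symmetric in saddr and daddr, this toy hash gives the same
result when the two addresses are swapped, so a request and its reply map to
the same slave index for a given Identifier.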

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoflow_dissector: extract more ICMP information
Matteo Croce [Tue, 29 Oct 2019 13:50:52 +0000 (14:50 +0100)]
flow_dissector: extract more ICMP information

The ICMP flow dissector currently parses only the Type and Code fields.
Some ICMP packets (echo, timestamp) have a 16 bit Identifier field which
is used to correlate packets.
Add such a field to flow_dissector_key_icmp and replace skb_flow_get_be16()
with a more complex function which populates this field.
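
As a rough userspace illustration of the idea (not the kernel's code), the
sketch below reads Type and Code from an ICMP header and, for the echo and
timestamp types named above, also reads the 16 bit Identifier from bytes 4-5.
The names icmp_type_has_id() and icmp_extract() and the SK_ICMP_* constants
are assumptions made for this example; the type values themselves come from
RFC 792.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ICMPv4 type values (RFC 792); the names are local to this sketch. */
enum {
	SK_ICMP_ECHOREPLY	= 0,
	SK_ICMP_ECHO		= 8,
	SK_ICMP_TIMESTAMP	= 13,
	SK_ICMP_TIMESTAMPREPLY	= 14,
};

/* True for the echo and timestamp types, which carry an Identifier. */
bool icmp_type_has_id(uint8_t type)
{
	switch (type) {
	case SK_ICMP_ECHOREPLY:
	case SK_ICMP_ECHO:
	case SK_ICMP_TIMESTAMP:
	case SK_ICMP_TIMESTAMPREPLY:
		return true;
	default:
		return false;
	}
}

/*
 * Byte 0 is Type, byte 1 is Code, bytes 4-5 are the Identifier in
 * network byte order. Returns false if the buffer is too short.
 * *id is set to 0 when the type has no Identifier.
 */
bool icmp_extract(const uint8_t *hdr, size_t len,
		  uint8_t *type, uint8_t *code, uint16_t *id)
{
	assert(hdr && type && code && id);

	if (len < 2)
		return false;

	*type = hdr[0];
	*code = hdr[1];
	*id = 0;

	if (icmp_type_has_id(*type)) {
		if (len < 6)
			return false;
		*id = (uint16_t)((hdr[4] << 8) | hdr[5]);
	}
	return true;
}
```

With the Identifier available next to Type and Code, an echo request and its
reply can be keyed on the same value and treated as one flow.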

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoflow_dissector: skip the ICMP dissector for non ICMP packets
Matteo Croce [Tue, 29 Oct 2019 13:50:51 +0000 (14:50 +0100)]
flow_dissector: skip the ICMP dissector for non ICMP packets

FLOW_DISSECTOR_KEY_ICMP is checked for every packet, not only ICMP ones.
Even if the test overhead is probably negligible, move the
ICMP dissector code under the big 'switch(ip_proto)' so it gets called
only for ICMP packets.

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoflow_dissector: add meaningful comments
Matteo Croce [Tue, 29 Oct 2019 13:50:50 +0000 (14:50 +0100)]
flow_dissector: add meaningful comments

Document two pieces of code which can't be understood at a glance.

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agotc-testing: fixed two failing pedit tests
Roman Mashak [Wed, 30 Oct 2019 19:08:43 +0000 (15:08 -0400)]
tc-testing: fixed two failing pedit tests

Two pedit tests were failing due to an incorrect operation value in
matchPattern: it should be 'add', not 'val'. Fix it.

Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agotipc: add smart nagle feature
Jon Maloy [Wed, 30 Oct 2019 13:00:41 +0000 (14:00 +0100)]
tipc: add smart nagle feature

We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.

- The nagle property as such is determined by manipulating a new
  'maxnagle' field in struct tipc_sock. If certain conditions are met,
  'maxnagle' will define max size of the messages which can be bundled.
  If it is set to zero no messages are ever bundled, implying that the
  nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
  than 4 messages have been sent out without receiving any data message
  from the peer.
- A socket leaves nagle mode whenever it receives a data message from
  the peer.

In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.

The accumulated contents of the write queue are transmitted when one of
the following events or conditions occurs:

- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
  no outstanding message where the 'ack_required' bit was set. As a
  consequence, the first message added after we enter nagle mode
  is always sent directly with this bit set.

This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
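
The rules above can be read as a small state machine. The following
userspace sketch is only one interpretation of this description, not the
TIPC implementation: struct nagle_sock, the nagle_*() functions and the
tx_action values are hypothetical names, only 'maxnagle' comes from the text,
and the sketch assumes that nagle mode starts with the first send after the
count of unanswered messages exceeds 4.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NAGLE_START	4		/* "more than 4 messages" */
#define NAGLE_MAX_QUEUE	(64 * 1024)	/* "more than 64 1k blocks" */

enum tx_action { TX_DIRECT, TX_QUEUED, TX_FLUSHED };

struct nagle_sock {
	size_t maxnagle;	/* 0 disables bundling */
	bool nagle_mode;
	bool ack_pending;	/* a message with ack_required is outstanding */
	unsigned int unanswered;/* sends since the last data from the peer */
	size_t queued;		/* bytes accumulated in the write queue */
};

/* Transmit the queue; its last buffer would carry ack_required. */
void nagle_flush(struct nagle_sock *s)
{
	if (s->queued == 0)
		return;
	s->queued = 0;
	s->ack_pending = true;
}

enum tx_action nagle_send(struct nagle_sock *s, size_t len)
{
	assert(s != NULL);

	if (s->maxnagle && s->unanswered > NAGLE_START)
		s->nagle_mode = true;
	s->unanswered++;

	/* Outside nagle mode, or for large messages: no bundling. */
	if (!s->nagle_mode || len >= s->maxnagle) {
		nagle_flush(s);
		return TX_DIRECT;
	}
	/* No CONN_ACK to wait for: send now and request one. */
	if (!s->ack_pending) {
		s->ack_pending = true;
		return TX_DIRECT;
	}
	s->queued += len;
	if (s->queued > NAGLE_MAX_QUEUE) {
		nagle_flush(s);
		return TX_FLUSHED;
	}
	return TX_QUEUED;
}

/* CONN_ACK from the peer: the outstanding request is satisfied. */
void nagle_conn_ack(struct nagle_sock *s)
{
	s->ack_pending = false;
	nagle_flush(s);
}

/* Data from the peer: leave nagle mode and push out the queue. */
void nagle_data(struct nagle_sock *s)
{
	s->nagle_mode = false;
	s->unanswered = 0;
	nagle_flush(s);
}
```

In this sketch a SOCK_WAKEUP or a connection shutdown would simply call
nagle_flush() as well; they are left out to keep the example short.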

Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge branch 'mlxsw-Update-firmware-version'
David S. Miller [Wed, 30 Oct 2019 19:07:05 +0000 (12:07 -0700)]
Merge branch 'mlxsw-Update-firmware-version'

Ido Schimmel says:

====================
mlxsw: Update firmware version

This patch set updates the firmware version for Spectrum-1 and enforces
a firmware version for Spectrum-2.

The version adds support for querying port module type. It will be used
by a followup patch set from Jiri to make port split code more generic.

Patch #1 increases the size of an existing register in order to be
compatible with the new firmware version. In the future the firmware
will assign default values to fields not specified by the driver.

Patch #2 temporarily increases the PCI reset timeout for SN3800 systems.
Note that in normal cases the driver will need to wait no longer than 5
seconds for the device to become ready following reset command.

Patch #3 bumps the firmware version for Spectrum-1.

Patch #4 enforces a minimum firmware version for Spectrum-2.

v2:
* Added patch #2
====================

Signed-off-by: David S. Miller <davem@davemloft.net>