linux-2.6-microblaze.git
6 years ago  r8169: improve a check in rtl_init_one
Heiner Kallweit [Tue, 25 Sep 2018 05:59:36 +0000 (07:59 +0200)]
r8169: improve a check in rtl_init_one

The check for pci_is_pcie() is redundant here because all
chip versions >=18 are PCIe-only anyway. In addition, use
dma_set_mask_and_coherent() instead of separate calls to
pci_set_dma_mask() and pci_set_consistent_dma_mask().
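A minimal sketch of the consolidation (generic pci_dev handling; the
fallback/error path is illustrative, not the actual r8169 code):

/* One call now sets both the streaming and the coherent DMA mask. */
rc = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
if (rc)
        rc = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
if (rc)
        return rc;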

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  r8169: improve rtl8169_irq_mask_and_ack
Heiner Kallweit [Tue, 25 Sep 2018 05:58:00 +0000 (07:58 +0200)]
r8169: improve rtl8169_irq_mask_and_ack

The code can be slightly simplified by acking even events we're not
interested in. In addition, add a comment making clear that the
read has no functional purpose and only serves as a PCI commit.
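A hedged sketch of that pattern (generic MMIO accessors and a made-up
register name, not the actual r8169 code):

/* Ack all events, including the ones we're not interested in. */
writew(0xffff, ioaddr + INTR_STATUS);
/* Read back purely to post (commit) the PCI write; the value is unused. */
readw(ioaddr + INTR_STATUS);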

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  r8169: use default watchdog timeout
Heiner Kallweit [Tue, 25 Sep 2018 05:56:53 +0000 (07:56 +0200)]
r8169: use default watchdog timeout

The networking core has a default watchdog timeout of 5s. I see no
need to define our own timeout of 6s, which is basically the same.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  dpaa2-eth: Make Rx flow hash key configurable
Ioana Ciocoi Radulescu [Mon, 24 Sep 2018 15:36:21 +0000 (15:36 +0000)]
dpaa2-eth: Make Rx flow hash key configurable

Until now, the Rx flow hash key was a fixed 5-tuple (IP src, IP dst,
IP nextproto, L4 src port, L4 dst port) value that we configured
at probe time.

Add support for configuring this hash key at runtime.
We support all standard header fields configurable through ethtool,
but cannot differentiate between flow types, so the same hash key
is applied regardless of protocol.

We also don't support the discard option.

Signed-off-by: Ioana Radulescu <ruxandra.radulescu@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: qca_spi: Introduce write register verification
Stefan Wahren [Mon, 24 Sep 2018 11:20:10 +0000 (13:20 +0200)]
net: qca_spi: Introduce write register verification

The SPI protocol for the QCA7000 doesn't have any fault detection.
In order to increase the driver's reliability in noisy environments,
we implement a write verification inspired by the enc28j60 driver.
This should avoid situations where the driver wrongly assumes the
receive interrupt is enabled and misses all incoming packets.

This functionality is disabled by default and can be controlled via the
module parameter wr_verify.
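A rough sketch of the verified-write idea (the helper names and retry
count are assumptions for illustration, not the exact driver code):

/* Write a register, read it back, and retry if the value didn't stick. */
static int qcaspi_write_register_verified(struct qcaspi *qca, u16 reg,
                                          u16 value)
{
        u16 readback;
        int retries = 3;

        do {
                qcaspi_write_register(qca, reg, value);
                qcaspi_read_register(qca, reg, &readback);
                if (readback == value)
                        return 0;
        } while (retries--);

        return -EIO;
}

The wr_verify module parameter would then simply gate whether the
read-back path runs at all.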

Signed-off-by: Michael Heimpold <michael.heimpold@i2se.com>
Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  tls: Fixed uninitialised vars warning
Vakul Garg [Mon, 24 Sep 2018 10:39:49 +0000 (16:09 +0530)]
tls: Fixed uninitialised vars warning

In tls_sw_sendmsg() and tls_sw_sendpage(), it is possible that the
uninitialised variable 'ret' gets passed to sk_stream_error(). So
initialise the local variable 'ret' to '0'. The warnings were detected
by the 'smatch' tool.

Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption")
Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net/tls: Fixed race condition in async encryption
Vakul Garg [Mon, 24 Sep 2018 10:05:56 +0000 (15:35 +0530)]
net/tls: Fixed race condition in async encryption

On processors with multi-engine crypto accelerators, it is possible that
multiple records get encrypted in parallel and their encryption
completion is notified on different CPUs of a multicore processor. This
leads to a situation where tls_encrypt_done() starts executing in
parallel on different cores. In the current implementation, encrypted
records are queued to tx_ready_list in tls_encrypt_done(). This requires
additions to the linked list 'tx_ready_list' to be protected. As
tls_encrypt_done() could be executing in irq context, it is not possible
to protect the linked list addition operation using a lock.

To fix the problem, we remove the linked list addition operation from
the irq context. We do tx_ready_list addition/removal operations from
application context only and get rid of possible concurrent access to
the linked list. Before starting encryption on a record, we add it to
the tail of tx_ready_list. To prevent tls_tx_records() from transmitting
it, we mark the record with a new flag 'tx_ready' in 'struct tls_rec'.
When record encryption completes, tls_encrypt_done() only has to set
the 'tx_ready' flag to true; no linked list add operation is required.

The changed logic brings some other side benefits. Since the records
are always submitted for encryption in TLS sequence number order, the
tx_ready_list always remains sorted and adding new records to it does
not require traversing the linked list.

Lastly, we renamed tx_ready_list in 'struct tls_sw_context_tx' to
'tx_list'. This is because now some of the records at the tail may not
be ready to transmit.
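A rough sketch of the resulting split of responsibilities (field and
helper names follow the description above, not necessarily the exact
kernel code):

/* Application context: queue the record before encryption starts. */
rec->tx_ready = false;
list_add_tail(&rec->list, &ctx->tx_list);
rc = tls_do_encryption(sk, rec);   /* may complete asynchronously */

/* Completion callback, possibly in irq context on any CPU: no list
 * manipulation, only flip the flag that tls_tx_records() checks.
 */
rec->tx_ready = true;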

Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption")
Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  Merge branch 'few-NTF_ROUTER-related-updates'
David S. Miller [Mon, 24 Sep 2018 19:21:33 +0000 (12:21 -0700)]
Merge branch 'few-NTF_ROUTER-related-updates'

Roopa Prabhu says:

====================
few NTF_ROUTER related updates

This series allows setting of NTF_ROUTER by an external
entity (e.g. a BGP EVPN control plane). It also fixes a missing
netlink notification on neighbour NTF_ROUTER flag changes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  neighbour: send netlink notification if NTF_ROUTER changes
Roopa Prabhu [Sun, 23 Sep 2018 04:26:20 +0000 (21:26 -0700)]
neighbour: send netlink notification if NTF_ROUTER changes

Send a netlink notification if neigh_update results in an NTF_ROUTER
change and NEIGH_UPDATE_F_ISROUTER is set. Also move the
NTF_ROUTER change handling into a helper.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  neighbour: allow admin to set NTF_ROUTER
Roopa Prabhu [Sun, 23 Sep 2018 04:26:19 +0000 (21:26 -0700)]
neighbour: allow admin to set NTF_ROUTER

This patch allows an admin to set the NTF_ROUTER flag
on a neighbour entry. This enables an external control
plane (like BGP EVPN) to manage neigh entries with the
NTF_ROUTER flag.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  Merge branch 'net-sched-Add-hardware-specific-counters-to-TC-actions'
David S. Miller [Mon, 24 Sep 2018 19:18:43 +0000 (12:18 -0700)]
Merge branch 'net-sched-Add-hardware-specific-counters-to-TC-actions'

Eelco Chaudron says:

====================
net/sched: Add hardware specific counters to TC actions

Add hardware specific counters to TC actions which will be exported
through the netlink API. This makes troubleshooting TC flower offload
easier, as it becomes possible to differentiate the packets being offloaded.

v2 - Rebased on latest net-next
====================

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net/sched: Add hardware specific counters to TC actions
Eelco Chaudron [Fri, 21 Sep 2018 11:14:02 +0000 (07:14 -0400)]
net/sched: Add hardware specific counters to TC actions

Add additional counters that will store the bytes/packets processed by
hardware. These will be exported through the netlink interface for
display by the iproute2 tc tool.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net/core: Add new basic hardware counter
Eelco Chaudron [Fri, 21 Sep 2018 11:13:54 +0000 (07:13 -0400)]
net/core: Add new basic hardware counter

Add a new hardware specific basic counter, TCA_STATS_BASIC_HW. This can
be used to count packets/bytes processed by hardware offload.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  Merge branch 'mvpp2-Add-txq-to-CPU-mapping'
David S. Miller [Mon, 24 Sep 2018 17:01:10 +0000 (10:01 -0700)]
Merge branch 'mvpp2-Add-txq-to-CPU-mapping'

Maxime Chevallier says:

====================
net: mvpp2: Add txq to CPU mapping

This short series adds XPS support to the mvpp2 driver, by mapping
txqs and CPUs. This comes with a patch using round-robin scheduling
for the HW to pick the next txq to transmit from, instead of the default
fixed-priority scheduling.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: mvpp2: use round-robin scheduling for TX queues on the same CPU
Maxime Chevallier [Mon, 24 Sep 2018 09:11:06 +0000 (11:11 +0200)]
net: mvpp2: use round-robin scheduling for TX queues on the same CPU

This commit allows each TXQ to be picked in a round-robin fashion by
the PPv2 transmit scheduling mechanism. This is opposed to the default
behaviour that prioritizes the highest numbered queues.

Suggested-by: Yan Markman <ymarkman@marvell.com>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: mvpp2: support XPS by mapping TX queues to CPUs
Maxime Chevallier [Mon, 24 Sep 2018 09:11:05 +0000 (11:11 +0200)]
net: mvpp2: support XPS by mapping TX queues to CPUs

Since the PPv2 controller has multiple TX queues, we can spread traffic
by assigning TX queues to CPUs, allowing the use of XPS to balance egress
traffic between CPUs.
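A minimal sketch of such a mapping using the generic kernel helper (the
function below and its naming are illustrative, not the mvpp2 code):

/* Map each TX queue to one CPU so XPS can steer egress traffic. */
static void map_txqs_to_cpus(struct net_device *dev, int nr_txqs)
{
        int queue;

        for (queue = 0; queue < nr_txqs; queue++) {
                int cpu = queue % num_possible_cpus();

                netif_set_xps_queue(dev, cpumask_of(cpu), queue);
        }
}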

Suggested-by: Yan Markman <ymarkman@marvell.com>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  mlxsw: Make MLXSW_SP1_FWREV_MINOR a hard requirement
Petr Machata [Sun, 23 Sep 2018 14:48:55 +0000 (17:48 +0300)]
mlxsw: Make MLXSW_SP1_FWREV_MINOR a hard requirement

Up until now, mlxsw tolerated firmware versions that didn't exactly
match the required version, as long as the branch number matched. That
allowed users to test various firmware versions as long as they were
on the right branch.

On the other hand, it made it impossible for mlxsw to put a hard lower
bound on a version that fixes all problems known to date. If a user had
a somewhat older FW version installed, mlxsw would start up just fine,
possibly performing non-optimally as it would use features that trigger
problematic behavior.

Therefore tweak the check to accept any FW version that is:

- on the same branch as the preferred version, and
- the same as or newer than the preferred version.
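A minimal illustration of that acceptance rule (hypothetical version
fields, not the actual mlxsw structures):

/* Accept a firmware revision on the preferred branch that is the same
 * as or newer than the preferred version.
 */
static bool fw_rev_acceptable(u16 branch, u16 ver,
                              u16 req_branch, u16 req_ver)
{
        return branch == req_branch && ver >= req_ver;
}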

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  Merge branch 'hv_netvsc-Support-LRO-RSC-in-the-vSwitch'
David S. Miller [Sun, 23 Sep 2018 00:23:16 +0000 (17:23 -0700)]
Merge branch 'hv_netvsc-Support-LRO-RSC-in-the-vSwitch'

Haiyang Zhang says:

====================
hv_netvsc: Support LRO/RSC in the vSwitch

The patch set adds support for the LRO/RSC in the vSwitch feature. It
reduces the per-packet processing overhead by coalescing multiple TCP
segments when possible. The feature is enabled by default on VMs running
on Windows Server 2019 and later.

The patch set also adds an ethtool command handler and documentation.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  hv_netvsc: Update document for LRO/RSC support
Haiyang Zhang [Fri, 21 Sep 2018 18:20:37 +0000 (18:20 +0000)]
hv_netvsc: Update document for LRO/RSC support

Update the documentation for LRO/RSC support, and add the command line
info for changing the setting.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  hv_netvsc: Add handler for LRO setting change
Haiyang Zhang [Fri, 21 Sep 2018 18:20:36 +0000 (18:20 +0000)]
hv_netvsc: Add handler for LRO setting change

This patch adds the handler for LRO setting changes, so that a user
can use the ethtool command to enable / disable the LRO feature.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  hv_netvsc: Add support for LRO/RSC in the vSwitch
Haiyang Zhang [Fri, 21 Sep 2018 18:20:35 +0000 (18:20 +0000)]
hv_netvsc: Add support for LRO/RSC in the vSwitch

LRO/RSC in the vSwitch is a feature available on Windows Server 2019
hosts and later. It reduces the per-packet processing overhead by
coalescing multiple TCP segments when possible. This patch adds netvsc
driver support for this feature.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  Merge branch 'net-dsa-b53-SGMII-modes-fixes'
David S. Miller [Sat, 22 Sep 2018 03:01:20 +0000 (20:01 -0700)]
Merge branch 'net-dsa-b53-SGMII-modes-fixes'

Florian Fainelli says:

====================
net: dsa: b53: SGMII modes fixes

Here are two additional fixes that are required in order for SGMII to
work correctly. This was discovered while using a copper SFP, which would
make us use SGMII mode; we would actually leave the HW configured in its
default mode: Fiber.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: dsa: b53: Also include SGMII for mac_config and mac_link_state
Florian Fainelli [Fri, 21 Sep 2018 23:43:59 +0000 (16:43 -0700)]
net: dsa: b53: Also include SGMII for mac_config and mac_link_state

In both 802.3z and SGMII modes we need to configure the MAC accordingly
to flip between Fiber and SGMII modes, and we need to read the MAC
status from the SGMII in-band control word.

Fixes: 0e01491de646 ("net: dsa: b53: Add SerDes support")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: dsa: b53: Fix B53_SERDES_DIGITAL_CONTROL offset
Florian Fainelli [Fri, 21 Sep 2018 23:43:58 +0000 (16:43 -0700)]
net: dsa: b53: Fix B53_SERDES_DIGITAL_CONTROL offset

The maths went wrong: to get 0x20, we need to do 0x1e + (x) * 2, not
0x18 + (x) * 2, so fix that offset so we access the correct registers.
With the wrong offset we would not access the correct SerDes Digital
control words (the status offsets were fine), and so we would not
correctly flip between Fiber and SGMII modes, resulting in incorrect
status words being pulled into the SerDes digital status register.
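In concrete terms the fix is in the offset arithmetic; a sketch using the
macro named in the subject (constants taken from the text above):

/* Wrong: control word 1 (x = 1) maps to 0x18 + 2 = 0x1a. */
#define B53_SERDES_DIGITAL_CONTROL_OLD(x)  (0x18 + (x) * 2)

/* Right: control word 1 maps to 0x1e + 2 = 0x20, as intended. */
#define B53_SERDES_DIGITAL_CONTROL(x)      (0x1e + (x) * 2)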

Fixes: 0e01491de646 ("net: dsa: b53: Add SerDes support")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: dsa: b53: Don't assign autonegotiation enabled
Florian Fainelli [Fri, 21 Sep 2018 22:30:05 +0000 (15:30 -0700)]
net: dsa: b53: Don't assign autonegotiation enabled

PHYLINK takes care of filling in the right information in
state->an_enabled; get rid of the read from the SerDes's BMCR register.

Fixes: 0e01491de646 ("net: dsa: b53: Add SerDes support")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  decnet: Remove unnecessary check for dev->name
Nathan Chancellor [Fri, 21 Sep 2018 19:30:34 +0000 (12:30 -0700)]
decnet: Remove unnecessary check for dev->name

Clang warns that the address of an array will always evaluate to true
in a boolean context.

net/decnet/dn_dev.c:1366:10: warning: address of array 'dev->name' will
always evaluate to 'true' [-Wpointer-bool-conversion]
                                dev->name ? dev->name : "???",
                                ~~~~~^~~~ ~
1 warning generated.

Link: https://github.com/ClangBuiltLinux/linux/issues/116
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  selftests/net: add ipv6 tests to ip_defrag selftest
Peter Oskolkov [Fri, 21 Sep 2018 18:17:17 +0000 (11:17 -0700)]
selftests/net: add ipv6 tests to ip_defrag selftest

This patch adds ipv6 defragmentation tests to ip_defrag selftest,
to complement existing ipv4 tests.

Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net/ipfrag: let ip[6]frag_high_thresh in ns be higher than in init_net
Peter Oskolkov [Fri, 21 Sep 2018 18:17:16 +0000 (11:17 -0700)]
net/ipfrag: let ip[6]frag_high_thresh in ns be higher than in init_net

Currently, ip[6]frag_high_thresh sysctl values in new namespaces are
hard-limited to those of the root/init ns.

There are at least two use cases when it would be desirable to
set the high_thresh values higher in a child namespace vs the global hard
limit:

- a security/ddos protection policy may lower the thresholds in the
  root/init ns but allow for a special exception in a child namespace
- testing: a test running in a namespace may want to set these
  thresholds higher in its namespace than what is in the root/init ns

The new behavior:

 # ip netns add testns
 # ip netns exec testns bash

 # sysctl -w net.ipv4.ipfrag_high_thresh=9000000
 net.ipv4.ipfrag_high_thresh = 9000000

 # sysctl net.ipv4.ipfrag_high_thresh
 net.ipv4.ipfrag_high_thresh = 9000000

 # sysctl -w net.ipv6.ip6frag_high_thresh=9000000
 net.ipv6.ip6frag_high_thresh = 9000000

 # sysctl net.ipv6.ip6frag_high_thresh
 net.ipv6.ip6frag_high_thresh = 9000000

The old behavior:

 # ip netns add testns
 # ip netns exec testns bash

 # sysctl -w net.ipv4.ipfrag_high_thresh=9000000
 net.ipv4.ipfrag_high_thresh = 9000000

 # sysctl net.ipv4.ipfrag_high_thresh
 net.ipv4.ipfrag_high_thresh = 4194304

 # sysctl -w net.ipv6.ip6frag_high_thresh=9000000
 net.ipv6.ip6frag_high_thresh = 9000000

 # sysctl net.ipv6.ip6frag_high_thresh
 net.ipv6.ip6frag_high_thresh = 4194304

Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  ipv6: discard IP frag queue on more errors
Peter Oskolkov [Fri, 21 Sep 2018 18:17:15 +0000 (11:17 -0700)]
ipv6: discard IP frag queue on more errors

This is similar to how ipv4 now behaves:
commit 0ff89efb5246 ("ip: fail fast on IP defrag errors").

Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net/ipv4: avoid compile error in fib_info_nh_uses_dev
Eric Dumazet [Fri, 21 Sep 2018 17:58:07 +0000 (10:58 -0700)]
net/ipv4: avoid compile error in fib_info_nh_uses_dev

net/ipv4/fib_frontend.c: In function 'fib_info_nh_uses_dev':
net/ipv4/fib_frontend.c:322:6: error: unused variable 'ret' [-Werror=unused-variable]
cc1: all warnings being treated as errors

Fixes: 78f2756c5fc0 ("net/ipv4: Move device validation to helper")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Ahern <dsahern@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  Merge branch 'tcp-switch-to-Early-Departure-Time-model'
David S. Miller [Sat, 22 Sep 2018 02:38:00 +0000 (19:38 -0700)]
Merge branch 'tcp-switch-to-Early-Departure-Time-model'

Eric Dumazet says:

====================
tcp: switch to Early Departure Time model

In the early days, pacing was implemented in sch_fq (FQ)
in a generic way:

- SO_MAX_PACING_RATE could be used by any sockets.

- TCP would vary effective pacing rate based on CWND*MSS/SRTT

- FQ would ensure delays between packets based on current
  sk->sk_pacing_rate, but with some quantum based artifacts.
  (inflating RPC tail latencies)

- BBR then tweaked the pacing rate in its various phases
  (PROBE, DRAIN, ...)

This worked reasonably well, but had the side effect that TCP RTT
samples would be inflated by the sojourn time of the packets in FQ.

Also note that when FQ is not used and TCP wants pacing, the
internal pacing fallback has very different behavior, since TCP
emits packets at the time they should be sent (with unreasonable
assumptions about scheduling costs).

Van Jacobson gave a talk at Netdev 0x12 in Montreal about letting
TCP (or applications, for UDP messages) decide on the Earliest
Departure Time, instead of letting packet schedulers derive it
from the pacing rate.

https://www.netdevconf.org/0x12/session.html?evolving-from-afap-teaching-nics-about-time
https://www.files.netdevconf.org/d/46def75c2ef345809bbe/files/?p=/Evolving%20from%20AFAP%20%E2%80%93%20Teaching%20NICs%20about%20time.pdf

Recent additions in Linux provided SO_TXTIME and a new ETF qdisc
supporting the new skb->tstamp role.

This patch series converts TCP and FQ to the same model.

This might in the future allow us to relax tight TSQ limits
(if FQ is present in the output path), and thus lower the
number of callbacks to tcp_write_xmit(), thanks to batching.

This will be followed by an FQ change allowing SO_TXTIME support
so that QUIC servers can let the pacing be done in FQ (or
offloaded if the network device permits it).

For example, a TCP flow rated at 24Mbps now shows a more meaningful RTT

Before :

ESTAB  0  211408 10.246.7.151:41558   10.246.7.152:33723
 cubic wscale:8,8 rto:203 rtt:2.195/0.084 mss:1448 rcvmss:536
  advmss:1448 cwnd:20 ssthresh:20 bytes_acked:36897937
  segs_out:25488 segs_in:12454 data_segs_out:25486
  send 105.5Mbps lastsnd:1 lastrcv:12851 lastack:1
  pacing_rate 24.0Mbps/24.0Mbps delivery_rate 22.9Mbps
  busy:12851ms unacked:4 rcv_space:29200 notsent:205616 minrtt:0.026

After :

ESTAB  0  192584 10.246.7.151:61612   10.246.7.152:34375
 cubic wscale:8,8 rto:201 rtt:0.165/0.129 mss:1448 rcvmss:536
  advmss:1448 cwnd:20 ssthresh:20 bytes_acked:170755401
  segs_out:117931 segs_in:57651 data_segs_out:117929
  send 1404.1Mbps lastsnd:1 lastrcv:56915 lastack:1
  pacing_rate 24.0Mbps/24.0Mbps delivery_rate 24.2Mbps
  busy:56915ms unacked:4 rcv_space:29200 notsent:186792 minrtt:0.054

A nice side effect of this patch series is a reduction of max/p99
latencies of RPC workloads, since the FQ quantum no longer adds
artifacts.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net_sched: sch_fq: remove dead code dealing with retransmits
Eric Dumazet [Fri, 21 Sep 2018 15:51:54 +0000 (08:51 -0700)]
net_sched: sch_fq: remove dead code dealing with retransmits

With the earliest departure time model, we no longer plan to
special-case TCP retransmits. We therefore remove the dead
code (since most compilers understood that skb_is_retransmit()
was false).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  tcp: switch tcp_internal_pacing() to tcp_wstamp_ns
Eric Dumazet [Fri, 21 Sep 2018 15:51:53 +0000 (08:51 -0700)]
tcp: switch tcp_internal_pacing() to tcp_wstamp_ns

Now that TCP keeps track of tcp_wstamp_ns, recording the earliest
departure time of the next packet, we can remove duplicate code
from tcp_internal_pacing().

This removes one ktime_get_tai_ns() call, and a divide.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  tcp: switch tcp and sch_fq to new earliest departure time model
Eric Dumazet [Fri, 21 Sep 2018 15:51:52 +0000 (08:51 -0700)]
tcp: switch tcp and sch_fq to new earliest departure time model

TCP keeps track of tcp_wstamp_ns by itself, meaning sch_fq
no longer has to do it.

Thanks to this model, TCP can get more accurate RTT samples,
since pacing no longer inflates them.

This has the nice effect of removing some delays caused by the FQ
quantum mechanism, which inflated max/P99 latencies.

Also we might relax the TCP Small Queue tight limits in the future,
since this new model allows TCP to build bigger batches, as
sch_fq (or a device with earliest departure time offload) ensures
these packets will be delivered on time.

Note that other protocols are not converted (they will probably
never be), so sch_fq still has support for SO_MAX_PACING_RATE.

Tested:

Test showing FQ pacing quantum artifact for low-rate flows,
adding unexpected throttles for RPC flows, inflating max and P99 latencies.

The parameters chosen here are to show what happens typically when
a TCP flow has a reduced pacing rate (this can be caused by a reduced
cwnd after a few losses, and/or an rtt above a few ms).

MIBS="MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY,P99_LATENCY,STDDEV_LATENCY"
Before :
$ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind
 Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds
19,82.78,5279,3825,482.02

After :
$ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind
Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds
20,49.94,128,63,3.18

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  tcp: switch internal pacing timer to CLOCK_TAI
Eric Dumazet [Fri, 21 Sep 2018 15:51:51 +0000 (08:51 -0700)]
tcp: switch internal pacing timer to CLOCK_TAI

The next patch will use tcp_wstamp_ns to feed the internal
TCP pacing timer, so switch to CLOCK_TAI to share the same base.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  tcp: provide earliest departure time in skb->tstamp
Eric Dumazet [Fri, 21 Sep 2018 15:51:50 +0000 (08:51 -0700)]
tcp: provide earliest departure time in skb->tstamp

Switch internal TCP skb->skb_mstamp to skb->skb_mstamp_ns,
from usec units to nsec units.

Do not clear skb->tstamp before entering the IP stack on TX,
so that qdiscs or devices can implement pacing based on the
earliest departure time instead of the socket's sk->sk_pacing_rate.

Packets are fed with tcp_wstamp_ns, and a following patch
will update tcp_wstamp_ns when both TCP and sch_fq switch to
the earliest departure time mechanism.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  tcp: add tcp_wstamp_ns socket field
Eric Dumazet [Fri, 21 Sep 2018 15:51:49 +0000 (08:51 -0700)]
tcp: add tcp_wstamp_ns socket field

TCP will soon provide the earliest departure time on TX skbs.
It needs to track this in a new variable.

tcp_mstamp_refresh() needs to update this variable, and has
become too big to stay an inline.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net_sched: sch_fq: switch to CLOCK_TAI
Eric Dumazet [Fri, 21 Sep 2018 15:51:48 +0000 (08:51 -0700)]
net_sched: sch_fq: switch to CLOCK_TAI

TCP will soon provide a per-skb skb->tstamp with the earliest departure
time, so that sch_fq does not have to determine the departure time by
looking at the socket's sk_pacing_rate.

In linux-4.19 we chose CLOCK_TAI as the clock base for transports,
qdiscs, and NIC offloads.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  tcp: introduce tcp_skb_timestamp_us() helper
Eric Dumazet [Fri, 21 Sep 2018 15:51:47 +0000 (08:51 -0700)]
tcp: introduce tcp_skb_timestamp_us() helper

There are a few places where TCP reads skb->skb_mstamp expecting
a value in usec units.

skb->tstamp (aka skb->skb_mstamp) will soon store a CLOCK_TAI nsec value.

Add tcp_skb_timestamp_us() to provide the proper conversion when needed.
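A sketch of what such a helper can look like (the exact kernel
definition may differ slightly):

/* Return the skb timestamp converted from nsec to the usec callers expect. */
static inline u32 tcp_skb_timestamp_us(const struct sk_buff *skb)
{
        return div_u64(skb->skb_mstamp_ns, NSEC_PER_USEC);
}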

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  tcp: switch tcp_clock_ns() to CLOCK_TAI base
Eric Dumazet [Fri, 21 Sep 2018 15:51:46 +0000 (08:51 -0700)]
tcp: switch tcp_clock_ns() to CLOCK_TAI base

TCP pacing is either implemented in sch_fq or internally.
We have the goal of being able to offload pacing to the NICs.

TCP will soon provide a per-skb skb->tstamp as the early departure time.

Like ETF in commit 25db26a91364 ("net/sched: Introduce the ETF Qdisc"),
we chose CLOCK_TAI as the clock base, so that TCP and pacers can share
a common clock, to get better RTT samples (without pacing artificially
inflating these samples).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  Merge branch 'hns3-next'
David S. Miller [Sat, 22 Sep 2018 02:29:33 +0000 (19:29 -0700)]
Merge branch 'hns3-next'

Salil Mehta says:

====================
Bug fixes, small modifications & cleanup for HNS3 driver

This patch set presents some bug fixes, small modifications and cleanups
to the HNS3 VF and PF driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: hns3: Remove redundant hclge_get_port_type()
Peng Li [Fri, 21 Sep 2018 15:41:48 +0000 (16:41 +0100)]
net: hns3: Remove redundant hclge_get_port_type()

This patch removes hclge_get_port_type which is redundant.

Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: hns3: Fix speed/duplex information loss problem when executing ethtool ethx...
Fuyun Liang [Fri, 21 Sep 2018 15:41:47 +0000 (16:41 +0100)]
net: hns3: Fix speed/duplex information loss problem when executing ethtool ethx cmd of VF

Our VF has not implemented the ops for get_port_type. So when executing
an ethtool ethx cmd on a VF, hns3_get_link_ksettings will return directly
and we can not query anything.

To support get_link_ksettings for the VF, this patch replaces get_port_type
with get_media_type. If the media type is HNAE3_MEDIA_TYPE_NONE,
hns3_get_link_ksettings will return the link information of the VF.

Fixes: 12f46bc1d447 ("net: hns3: Refine hns3_get_link_ksettings()")
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: hns3: Add get_media_type ops support for VF
Peng Li [Fri, 21 Sep 2018 15:41:46 +0000 (16:41 +0100)]
net: hns3: Add get_media_type ops support for VF

This patch adds the ops of get_media_type support for VF.

Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: hns3: Remove print messages for error packet
Jian Shen [Fri, 21 Sep 2018 15:41:45 +0000 (16:41 +0100)]
net: hns3: Remove print messages for error packet

There are already multiple statistics counters for error packets, so
it's unnecessary to print them as well; printing too many messages may
affect rx performance.

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: hns3: Add unlikely for dma_mapping_error check
Jian Shen [Fri, 21 Sep 2018 15:41:44 +0000 (16:41 +0100)]
net: hns3: Add unlikely for dma_mapping_error check

Since dma_mapping_error() is unlikely to fail, this patch adds
unlikely() to the dma_mapping_error check.
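The pattern in question, sketched generically (the error handling shown
is illustrative, not the exact driver code):

dma = dma_map_single(dev, skb->data, size, DMA_TO_DEVICE);
if (unlikely(dma_mapping_error(dev, dma)))
        return -ENOMEM;   /* rare failure path, kept off the hot path */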

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: hns3: Add nic state check before calling netif_tx_wake_queue
Jian Shen [Fri, 21 Sep 2018 15:41:43 +0000 (16:41 +0100)]
net: hns3: Add nic state check before calling netif_tx_wake_queue

When the nic goes down, it first calls netif_tx_stop_all_queues() and
then calls napi_disable(). But napi_disable() will wait for the current
napi_poll to finish, which may call netif_tx_wake_queue(). This patch
fixes it by adding a nic state check.
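A hedged sketch of the added check (the state flag and surrounding names
are assumptions for illustration, not the exact driver code):

/* Only wake the queue if the interface is not being brought down. */
if (!test_bit(HNS3_NIC_STATE_DOWN, &priv->state) &&
    netif_tx_queue_stopped(dev_queue))
        netif_tx_wake_queue(dev_queue);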

Fixes: 424eb834a9be ("net: hns3: Unified HNS3 {VF|PF} Ethernet Driver for hip08 SoC")
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: hns3: Add handle for default case
Jian Shen [Fri, 21 Sep 2018 15:41:42 +0000 (16:41 +0100)]
net: hns3: Add handle for default case

A few "switch-case" statements are missing handling for the default
case. For some abnormal cases, they should return an error code instead
of returning 0.

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: hns3: Unify the prefix of vf functions
Jian Shen [Fri, 21 Sep 2018 15:41:41 +0000 (16:41 +0100)]
net: hns3: Unify the prefix of vf functions

The prefix of most functions for the VF is hclgevf. This patch renames
the functions with an inconsistent prefix.

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: hns3: Fix tqp array traversal condition for vf
Jian Shen [Fri, 21 Sep 2018 15:41:40 +0000 (16:41 +0100)]
net: hns3: Fix tqp array traversal condition for vf

There are two tqp_num variables "hdev->tqp_num" and "kinfo->tqp_num"
used in VF. "hdev->tqp_num" is the total tqp number allocated to the
VF, and "kinfo->tqp_num" indicates the tqp number being used by the
VF. Usually the two variables are equal. But in the case where
hdev->tqp_num is larger than rss_size_max and num_tc is 1,
"kinfo->tqp_num" will be less than "hdev->tqp_num".

In the original code, "hdev->tqp_num" is always used to traverse the
tqp array of kinfo. This may cause a null pointer error when
"hdev->tqp_num" is larger than "kinfo->tqp_num".

Fixes: e2cb1dec9779 ("net: hns3: Add HNS3 VF HCL(Hardware Compatibility Layer) Support")
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: hns3: Adjust prefix of tx/rx statistic names
Jian Shen [Fri, 21 Sep 2018 15:41:39 +0000 (16:41 +0100)]
net: hns3: Adjust prefix of tx/rx statistic names

Some prefixes of the tx/rx statistic names are redundant; this patch
modifies these names.

The new prefix looks like below:
rxq#1_ -> rxq1_
txq#1_ -> txq1_
tx_dropped -> dropped
tx_wake -> wake
tx_busy -> busy
rx_dropped -> dropped

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: hns3: Unify the type convert for desc.data
Jian Shen [Fri, 21 Sep 2018 15:41:38 +0000 (16:41 +0100)]
net: hns3: Unify the type convert for desc.data

Since desc.data already points to the address of the struct member
"data[6]", it's unnecessary to use '&' to get its address. This patch
unifies all the type conversions for desc.data, using
"req = (struct name *)desc.data".

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: hns3: Fix ets validate issue
Jian Shen [Fri, 21 Sep 2018 15:41:37 +0000 (16:41 +0100)]
net: hns3: Fix ets validate issue

There is a defect in hclge_ets_validate(): if no member of tc_tsa is
IEEE_8021QAZ_TSA_ETS, the variable total_ets_bw won't be updated.
In this case, the check of the value of total_ets_bw will fail. This
patch fixes it by checking total_ets_bw only after it has been updated.
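A simplified sketch of the corrected validation (declarations and the
exact bandwidth constant are assumptions, not the driver code):

/* Only validate the ETS bandwidth sum if at least one TC uses ETS. */
has_ets_tc = false;
total_ets_bw = 0;
for (i = 0; i < num_tc; i++) {
        if (ets->tc_tsa[i] != IEEE_8021QAZ_TSA_ETS)
                continue;
        has_ets_tc = true;
        total_ets_bw += ets->tc_tx_bw[i];
}

if (has_ets_tc && total_ets_bw != 100)
        return -EINVAL;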

Fixes: cacde272dd00 ("net: hns3: Add hclge_dcb module for the support of DCB feature")
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  dt-bindings: net: ravb: Add support for r8a7744 SoC
Biju Das [Fri, 21 Sep 2018 14:25:43 +0000 (15:25 +0100)]
dt-bindings: net: ravb: Add support for r8a7744 SoC

Document RZ/G1N (R8A7744) SoC bindings.

Signed-off-by: Biju Das <biju.das@bp.renesas.com>
Reviewed-by: Fabrizio Castro <fabrizio.castro@bp.renesas.com>
Reviewed-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  ravb: Disable Pause Advertisement
Andrew Lunn [Fri, 21 Sep 2018 13:52:26 +0000 (15:52 +0200)]
ravb: Disable Pause Advertisement

The previous commit to ravb had the side effect of making the PHY
advertise Pause and Asym Pause, which previously did not happen. By
default, phydev->supported has both forms of pause enabled, but
phydev->advertising does not. The new phy_remove_link_mode() copies
phydev->supported to phydev->advertising after removing the requested
link mode. It appears these Pause configuration bits stop the PHY
from completing auto-negotiation, and the link remains down. Be explicit
and remove the Pause and Asym Pause modes, thus restoring the old behavior.

Fixes: 41124fa64d4b ("net: ethernet: Add helper to remove a supported link mode")
Reported-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  Merge branch 'net-if_arp-use-define-instead-of-hard-coded-value'
David S. Miller [Sat, 22 Sep 2018 02:22:32 +0000 (19:22 -0700)]
Merge branch 'net-if_arp-use-define-instead-of-hard-coded-value'

Håkon Bugge says:

====================
net: if_arp: use define instead of hard-coded value

Struct arpreq contains the name of the device. Everywhere else in the
kernel, the define IFNAMSIZ is used to designate its size. But in
if_arp.h, a literal constant is used.

As there could be good reasons to use constants instead of defines in
include files under uapi, it seems to be OK to use the define here,
without opening a can of worms in user-land.

This is because if_arp.h includes netdevice.h, which also uses
IFNAMSIZ. For the distros I have checked, this also holds true for the
user-land side.

The series also fixes some incorrect indents.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: if_arp: use define instead of hard-coded value
Håkon Bugge [Fri, 21 Sep 2018 10:39:30 +0000 (12:39 +0200)]
net: if_arp: use define instead of hard-coded value

uapi/linux/if_arp.h includes linux/netdevice.h, which uses
IFNAMSIZ. Hence, use it instead of hard-coded value.

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: if_arp: Fix incorrect indents
Håkon Bugge [Fri, 21 Sep 2018 10:39:29 +0000 (12:39 +0200)]
net: if_arp: Fix incorrect indents

Fix incorrect indents and align comments.

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net/tls: Add support for async encryption of records for performance
Vakul Garg [Fri, 21 Sep 2018 04:16:13 +0000 (09:46 +0530)]
net/tls: Add support for async encryption of records for performance

In the current implementation, tls records are encrypted & transmitted
serially. Until the previously submitted user data is encrypted, the
implementation waits, and only then starts transmitting the record.
This approach of encrypting one record at a time is inefficient when
asynchronous crypto accelerators are used. For each record, there are
overheads of interrupts, driver softIRQ scheduling etc. Also the crypto
accelerator sits idle most of the time while an encrypted record's pages
are handed over to the tcp stack for transmission.

This patch enables encryption of multiple records in parallel when an
async capable crypto accelerator is present in system. This is achieved
by allowing the user space application to send more data using sendmsg()
even while previously issued data is being processed by crypto
accelerator. This requires returning the control back to user space
application after submitting encryption request to accelerator. This
also means that zero-copy mode of encryption cannot be used with async
accelerator as we must be done with user space application buffer before
returning from sendmsg().

There can be multiple records in flight to/from the accelerator. Each
record is represented by 'struct tls_rec'. This is used to store the
memory pages for the record.

After the records are encrypted, they are added in a linked list called
tx_ready_list which contains encrypted tls records sorted as per tls
sequence number. The records from tx_ready_list are transmitted using a
newly introduced function called tls_tx_records(). The tx_ready_list is
polled for any record ready to be transmitted in sendmsg(), sendpage()
after initiating encryption of new tls records. This achieves parallel
encryption and transmission of records when async accelerator is
present.

There could be a situation where the crypto accelerator completes
encryption later than the polling of tx_ready_list by
sendmsg()/sendpage(). Therefore we need a deferred work context to be
able to transmit records from tx_ready_list. The deferred work context
gets scheduled if the application is not sending much data through the
socket. If the application issues sendmsg()/sendpage() in quick
succession, then the scheduling of tx_work_handler gets cancelled, as
the tx_ready_list would be polled from the application's context itself.
This saves the scheduling overhead of deferred work.
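A rough sketch of that deferred-transmit arrangement (the context/field
layout below is an assumption based on the description, not the exact
kernel code):

static void tx_work_handler(struct work_struct *work)
{
        struct tls_sw_context_tx *ctx =
                container_of(work, struct tls_sw_context_tx, tx_work.work);

        /* Transmit any records whose encryption has already completed. */
        tls_tx_records(ctx->sk, 0);
}

/* Scheduled from the encryption-completion path ... */
schedule_delayed_work(&ctx->tx_work, 1);

/* ... and cancelled when sendmsg()/sendpage() polls tx_ready_list itself. */
cancel_delayed_work(&ctx->tx_work);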

The patch also brings some side benefits. We are able to get rid of the
concept of a CLOSED record. This is because records, once closed, are
either encrypted and then placed into tx_ready_list, or, if encryption
fails, the socket error is set. This simplifies the kernel tls
send path. However, since tls_device.c is still using the macros,
accessory functions for CLOSED records have been retained.

Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: freescale: fix return type of ndo_start_xmit function
YueHaibing [Fri, 21 Sep 2018 02:50:32 +0000 (10:50 +0800)]
net: freescale: fix return type of ndo_start_xmit function

The method ndo_start_xmit() is defined as returning a 'netdev_tx_t',
which is a typedef for an enum type, so make sure the implementation in
this driver returns a 'netdev_tx_t' value, and change the function
return type to netdev_tx_t.

Found by coccinelle.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: micrel: fix return type of ndo_start_xmit function
YueHaibing [Fri, 21 Sep 2018 02:42:15 +0000 (10:42 +0800)]
net: micrel: fix return type of ndo_start_xmit function

The method ndo_start_xmit() is defined as returning a 'netdev_tx_t',
which is a typedef for an enum type, so make sure the implementation in
this driver returns a 'netdev_tx_t' value, and change the function
return type to netdev_tx_t.

Found by coccinelle.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: phy: mdio-bcm-unimac: Allow configuring MDIO clock divider
Florian Fainelli [Fri, 21 Sep 2018 00:05:40 +0000 (17:05 -0700)]
net: phy: mdio-bcm-unimac: Allow configuring MDIO clock divider

Allow the configuration of the MDIO clock divider when the Device Tree
contains 'clock-frequency' property (similar to I2C and SPI buses).
Because the hardware may have lost its state during suspend/resume,
re-apply the MDIO clock divider upon resumption.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: lan78xx: Avoid unnecessary self assignment
Nathan Chancellor [Thu, 20 Sep 2018 22:48:30 +0000 (15:48 -0700)]
net: lan78xx: Avoid unnecessary self assignment

Clang warns when a variable is assigned to itself.

drivers/net/usb/lan78xx.c:940:11: warning: explicitly assigning value of
variable of type 'u32' (aka 'unsigned int') to itself [-Wself-assign]
                        offset = offset;
                        ~~~~~~ ^ ~~~~~~
1 warning generated.

Reorder the if statement to achieve the same result and avoid a
self-assignment warning.

Link: https://github.com/ClangBuiltLinux/linux/issues/129
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: fddi: skfp: Remove unused function
Nathan Chancellor [Thu, 20 Sep 2018 22:36:33 +0000 (15:36 -0700)]
net: fddi: skfp: Remove unused function

Clang warns when a variable is assigned to itself.

drivers/net/fddi/skfp/pcmplc.c:1257:6: warning: explicitly assigning
value of variable of type 'int' to itself [-Wself-assign]
        phy = phy ; on_off = on_off ;
        ~~~ ^ ~~~
drivers/net/fddi/skfp/pcmplc.c:1257:21: warning: explicitly assigning
value of variable of type 'int' to itself [-Wself-assign]
        phy = phy ; on_off = on_off ;
                    ~~~~~~ ^ ~~~~~~
2 warnings generated.

Turns out this entire function doesn't actually do anything since
SK_UNUSED is just casting the pointer to void. Remove it to silence
this Clang warning.

Link: https://github.com/ClangBuiltLinux/linux/issues/128
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  bna: Remove unnecessary self assignment
Nathan Chancellor [Thu, 20 Sep 2018 22:24:59 +0000 (15:24 -0700)]
bna: Remove unnecessary self assignment

Clang warns when a variable is assigned to itself.

drivers/net/ethernet/brocade/bna/bna_enet.c:1800:9: warning: explicitly
assigning value of variable of type 'int' to itself [-Wself-assign]
        for (i = i; i < (bna->ioceth.attr.num_ucmac * 2); i++)
             ~ ^ ~
drivers/net/ethernet/brocade/bna/bna_enet.c:1835:9: warning: explicitly
assigning value of variable of type 'int' to itself [-Wself-assign]
        for (i = i; i < (bna->ioceth.attr.num_mcmac * 2); i++)
             ~ ^ ~
2 warnings generated.

Link: https://github.com/ClangBuiltLinux/linux/issues/110
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: neterion: vxge: Remove unnecessary parentheses
Nathan Chancellor [Thu, 20 Sep 2018 20:37:33 +0000 (13:37 -0700)]
net: neterion: vxge: Remove unnecessary parentheses

Clang warns when multiple pairs of parentheses are used for a single
conditional statement.

drivers/net/ethernet/neterion/vxge/vxge-traffic.c:2265:31: warning:
equality comparison with extraneous parentheses [-Wparentheses-equality]
        if ((hldev->config.intr_mode ==
VXGE_HW_INTR_MODE_MSIX_ONE_SHOT))
             ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/neterion/vxge/vxge-traffic.c:2265:31: note: remove
extraneous parentheses around the comparison to silence this warning
        if ((hldev->config.intr_mode ==
VXGE_HW_INTR_MODE_MSIX_ONE_SHOT))
            ~                        ^                                 ~
drivers/net/ethernet/neterion/vxge/vxge-traffic.c:2265:31: note: use '='
to turn this equality comparison into an assignment
        if ((hldev->config.intr_mode ==
VXGE_HW_INTR_MODE_MSIX_ONE_SHOT))
                                     ^~
                                     =
1 warning generated.

Link: https://github.com/ClangBuiltLinux/linux/issues/124
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: phy: don't reschedule state machine when PHY is halted
Heiner Kallweit [Thu, 20 Sep 2018 20:34:25 +0000 (22:34 +0200)]
net: phy: don't reschedule state machine when PHY is halted

When in state PHY_HALTED we don't have to reschedule the
state machine; phy_start() will start it again.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  DRIVERS: net: macsec: Fix multiple coding style issues
Romain Aviolat [Thu, 20 Sep 2018 19:33:08 +0000 (21:33 +0200)]
DRIVERS: net: macsec: Fix multiple coding style issues

This patch fixes a couple of issues highlighted by checkpatch.pl:

    * Missing a blank line after declarations
    * Alignment should match open parenthesis

Signed-off-by: Romain Aviolat <r.aviolat@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  Merge branch 'bnx2x-enhancements'
David S. Miller [Fri, 21 Sep 2018 16:14:21 +0000 (09:14 -0700)]
Merge branch 'bnx2x-enhancements'

Shahed Shaikh says:

====================
bnx2x: enhancements

This series adds the below changes -
- support for VF spoof-check configuration through .ndo_set_vf_spoofchk.
- a workaround for an MFW bug regarding an unexpected bandwidth
  notification in single function mode.
- supplying the VF link status as part of get VF config handling.
====================

Signed-off-by: Shahed Shaikh <shahed.shaikh@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
6 years ago  bnx2x: Provide VF link status in ndo_get_vf_config
Shahed Shaikh [Thu, 20 Sep 2018 18:22:52 +0000 (11:22 -0700)]
bnx2x: Provide VF link status in ndo_get_vf_config

Provide current link status of VF in ndo_get_vf_config
handler.

Signed-off-by: Shahed Shaikh <Shahed.Shaikh@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  bnx2x: Ignore bandwidth attention in single function mode
Shahed Shaikh [Thu, 20 Sep 2018 18:22:51 +0000 (11:22 -0700)]
bnx2x: Ignore bandwidth attention in single function mode

This is a workaround for an MFW bug -
MFW generates a bandwidth attention in single function mode, which
is only expected to be generated in multi function mode.
This undesired attention in SF mode results in an incorrect HW
configuration, which in turn results in a Tx timeout.

Signed-off-by: Shahed Shaikh <Shahed.Shaikh@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  bnx2x: Add VF spoof-checking configuration
Shahed Shaikh [Thu, 20 Sep 2018 18:22:50 +0000 (11:22 -0700)]
bnx2x: Add VF spoof-checking configuration

Add support for `ndo_set_vf_spoofchk' to allow PF control over
its VF spoof-checking configuration.

Signed-off-by: Shahed Shaikh <shahed.shaikh@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  mISDN: remove redundant null pointer check before kfree_skb
zhong jiang [Thu, 20 Sep 2018 14:27:28 +0000 (22:27 +0800)]
mISDN: remove redundant null pointer check before kfree_skb

kfree_skb() has taken the null pointer into account. Hence it is safe
to remove the redundant null pointer check before kfree_skb().

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  vhost_net: add a missing error return
Dan Carpenter [Thu, 20 Sep 2018 10:01:59 +0000 (13:01 +0300)]
vhost_net: add a missing error return

We accidentally left out this error return so it leads to some use after
free bugs later on.

Fixes: 0a0be13b8fe2 ("vhost_net: batch submitting XDP buffers to underlayer sockets")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  Merge branch 'kfree_skb-NULL'
David S. Miller [Fri, 21 Sep 2018 16:04:38 +0000 (09:04 -0700)]
Merge branch 'kfree_skb-NULL'

zhong jiang says:

====================
net: remove redundant null pointer check before kfree_skb

The issues were detected with the help of Coccinelle.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  ipv6: remove redundant null pointer check before kfree_skb
zhong jiang [Thu, 20 Sep 2018 09:37:46 +0000 (17:37 +0800)]
ipv6: remove redundant null pointer check before kfree_skb

kfree_skb() has taken the null pointer into account. Hence it is safe
to remove the redundant null pointer check before kfree_skb().

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: cxgb3_main: remove redundant null pointer check before kfree_skb
zhong jiang [Thu, 20 Sep 2018 09:37:45 +0000 (17:37 +0800)]
net: cxgb3_main: remove redundant null pointer check before kfree_skb

kfree_skb() has taken the null pointer into account. Hence it is safe
to remove the redundant null pointer check before kfree_skb().

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: nci: remove redundant null pointer check before kfree_skb
zhong jiang [Thu, 20 Sep 2018 09:37:44 +0000 (17:37 +0800)]
net: nci: remove redundant null pointer check before kfree_skb

kfree_skb() has taken the null pointer into account. Hence it is safe
to remove the redundant null pointer check before kfree_skb().

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  ipv4: remove redundant null pointer check before kfree_skb
zhong jiang [Thu, 20 Sep 2018 09:37:43 +0000 (17:37 +0800)]
ipv4: remove redundant null pointer check before kfree_skb

kfree_skb() has taken the null pointer into account. Hence it is safe
to remove the redundant null pointer check before kfree_skb().

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: cxgb3: remove redundant null pointer check before kfree_skb
zhong jiang [Thu, 20 Sep 2018 09:37:42 +0000 (17:37 +0800)]
net: cxgb3: remove redundant null pointer check before kfree_skb

kfree_skb() has taken the null pointer into account. Hence it is safe
to remove the redundant null pointer check before kfree_skb().

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: tap: remove redundant null pointer check before kfree_skb
zhong jiang [Thu, 20 Sep 2018 09:37:41 +0000 (17:37 +0800)]
net: tap: remove redundant null pointer check before kfree_skb

kfree_skb() has taken the null pointer into account. Hence it is safe
to remove the redundant null pointer check before kfree_skb().

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: neterion: remove redundant continue
zhong jiang [Thu, 20 Sep 2018 08:13:47 +0000 (16:13 +0800)]
net: neterion: remove redundant continue

The continue will not truly skip any code. Hence it is safe to
remove it.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net: amd: remove redundant continue
zhong jiang [Thu, 20 Sep 2018 08:07:47 +0000 (16:07 +0800)]
net: amd: remove redundant continue

The continue will not truly skip any code. Hence it is safe to
remove it.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  net_sched: change tcf_del_walker() to take idrinfo->lock
Vlad Buslov [Wed, 19 Sep 2018 23:37:29 +0000 (16:37 -0700)]
net_sched: change tcf_del_walker() to take idrinfo->lock

The Action API was changed to work with actions and action_idr in a
concurrency-safe manner; however, tcf_del_walker() still uses actions
without taking a reference or idrinfo->lock first, and deletes them
directly, disregarding possible concurrent deletes.

Change tcf_del_walker() to take idrinfo->lock while iterating over actions
and use new tcf_idr_release_unsafe() to release them while holding the
lock.

Also, the blocking function fl_hw_destroy_tmplt() could be called when
we put a filter chain, so defer it to a work queue.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
[xiyou.wangcong@gmail.com: heavily modify the code and changelog]
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  Merge branch 'net-wean-netfilter-from-fib_nh'
David S. Miller [Fri, 21 Sep 2018 03:01:53 +0000 (20:01 -0700)]
Merge branch 'net-wean-netfilter-from-fib_nh'

David Ahern says:

====================
net: wean netfilter from fib_nh

Two netfilter modules reference fib_nh. In both cases the code is
only checking if a nexthop in a fib_info uses a specific device.
Both instances essentially duplicate code from __fib_validate_source,
so move that code into a helper and flip the netfilter modules to
use it.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago  netfilter: nft_fib: Convert nft_fib4_eval to new dev helper
David Ahern [Thu, 20 Sep 2018 20:50:49 +0000 (13:50 -0700)]
netfilter: nft_fib: Convert nft_fib4_eval to new dev helper

Convert nft_fib4_eval to the new device checking helper and
remove the duplicate code.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonetfilter: rpfilter: Convert rpfilter_lookup_reverse to new dev helper
David Ahern [Thu, 20 Sep 2018 20:50:48 +0000 (13:50 -0700)]
netfilter: rpfilter: Convert rpfilter_lookup_reverse to new dev helper

Convert rpfilter_lookup_reverse to the new device checking helper
and remove the duplicate code.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet/ipv4: Move device validation to helper
David Ahern [Thu, 20 Sep 2018 20:50:47 +0000 (13:50 -0700)]
net/ipv4: Move device validation to helper

Move the device matching check in __fib_validate_source to a helper and
export it for use by netfilter modules. Code move only; no functional
change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet-next: mscc: remove unused ocelot_dev_gmii.h
Corentin Labbe [Wed, 19 Sep 2018 19:06:53 +0000 (19:06 +0000)]
net-next: mscc: remove unused ocelot_dev_gmii.h

The header ocelot_dev_gmii.h has been unused since the driver was first
included. Nothing needs it, so just remove it.

Signed-off-by: Corentin Labbe <clabbe@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge branch 'mlxsw-Further-MC-awareness-configuration'
David S. Miller [Thu, 20 Sep 2018 14:46:02 +0000 (07:46 -0700)]
Merge branch 'mlxsw-Further-MC-awareness-configuration'

Ido Schimmel says:

====================
mlxsw: Further MC-awareness configuration

Petr says:

Due to an issue in Spectrum chips, when unicast traffic shares the same
queue as BUM traffic, and there is congestion, the BUM traffic is
admitted to the queue anyway, thus pushing out all UC traffic. In order
to give unicast traffic precedence over BUM traffic, multicast-aware
mode is now configured on all ports. Under MC-aware mode, egress TCs
8..15 are used for BUM traffic, which has its own dedicated pool.

This patch set improves the way that the MC pool and the higher-order
TCs are integrated into the system.

In patch #1, the shaper at the higher TCs is configured to the same
value that it has by default; it is better to have this setting
expressed explicitly in the code.

The following 8 patches gradually extend the devlink handling in mlxsw
to support the extra TCs and the new MC pool.

Patch #2 changes the way that pools are indexed in mlxsw. Instead of
using (FW index, direction) tuple to identify the pool and the
associated cache, mlxsw now uses devlink index. This change is necessary
because the new pool 15 is not contiguously adjacent to the
currently-used pools 0..3, and because it's only relevant on egress.
Using devlink index relaxes the requirement for symmetry and adjacency
imposed by using FW indexing.

In patch #3, the assumption that the number of ingress TCs matches that
of egress TCs is relaxed, allowing egress TCs 8..15 to be exposed.

In patches #4, #5 and #6, support for infinite quotas is introduced.
Infinite quotas are reported as taking all the memory in the system, but
actually use a mechanism where the infinity is configured explicitly.

In patches #7 and #8, support for configuring static pool sizes is
introduced. Statically-sized pools have been supported for a while now,
but during initialization all pools have a dynamic size. These patches
allow a mix of static and dynamic pools by default.

In patches #9 and #10, pool 15 and the per-priority MC quotas,
respectively, are explicitly configured in line with the current
recommendation for handling BUM traffic in Spectrum chips.

In the following 3 patches, an mlxsw-specific selftest is added to test
the MC-awareness configuration.

First in patches #11 and #12, lib.sh is extended with functions to
collect ethtool stats, and to manage port MTU.

Then in patch #13 the selftest itself is added.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
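
As a rough illustration of the indexing change described in patch #2, a
devlink pool index can be mapped to a (FW pool, direction) pair through
a small descriptor table; the sketch below uses made-up names and an
illustrative layout, not the driver's actual tables.

  #include <stdio.h>

  enum demo_dir { DEMO_DIR_INGRESS, DEMO_DIR_EGRESS };

  /* One descriptor per devlink pool index: FW pool number plus direction. */
  struct demo_pool_des {
          int           fw_pool;
          enum demo_dir dir;
  };

  /*
   * Illustrative layout: devlink indices 0..3 are the ingress FW pools
   * 0..3, indices 4..7 the egress FW pools 0..3, and the last index the
   * egress MC pool 15, which is neither contiguous with the others nor
   * present on ingress.
   */
  static const struct demo_pool_des demo_pool_dess[] = {
          { 0, DEMO_DIR_INGRESS }, { 1, DEMO_DIR_INGRESS },
          { 2, DEMO_DIR_INGRESS }, { 3, DEMO_DIR_INGRESS },
          { 0, DEMO_DIR_EGRESS },  { 1, DEMO_DIR_EGRESS },
          { 2, DEMO_DIR_EGRESS },  { 3, DEMO_DIR_EGRESS },
          { 15, DEMO_DIR_EGRESS },                 /* MC pool, egress only */
  };

  int main(void)
  {
          unsigned int i;

          for (i = 0; i < sizeof(demo_pool_dess) / sizeof(demo_pool_dess[0]); i++)
                  printf("devlink pool %u -> FW pool %d (%s)\n", i,
                         demo_pool_dess[i].fw_pool,
                         demo_pool_dess[i].dir == DEMO_DIR_EGRESS ?
                         "egress" : "ingress");
          return 0;
  }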
6 years agoselftests: mlxsw: Add a test for UC behavior under MC flood
Petr Machata [Thu, 20 Sep 2018 06:21:36 +0000 (09:21 +0300)]
selftests: mlxsw: Add a test for UC behavior under MC flood

A so-called "MC-aware" mode has recently been enabled in mlxsw. In
MC-aware mode, BUM traffic is handled in a special way so that when a
switch is flooded with BUM, UC performance isn't unduly impacted.
Without this mode enabled, a stream of BUM traffic can cause a
sustained UC throughput drop in excess of 99%.

Add a test for this behavior. Compare how much UC throughput degrades
while a stream of broadcast frames floods the switch. A minimal
degradation is tolerated to account for glitches in traffic injection
performance.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
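
The pass/fail criterion boils down to simple arithmetic; the sketch
below is only illustrative, and the tolerance value is made up rather
than taken from the selftest itself.

  #include <stdbool.h>
  #include <stdio.h>

  /* Illustrative tolerance; the actual selftest defines its own threshold. */
  #define DEMO_MAX_DEGRADATION_PCT 10.0

  /* Return true if UC throughput under BUM flood stayed within tolerance. */
  static bool demo_uc_ok(double uc_baseline, double uc_under_flood)
  {
          double drop_pct = 100.0 * (uc_baseline - uc_under_flood) / uc_baseline;

          printf("UC degradation: %.1f%%\n", drop_pct);
          return drop_pct <= DEMO_MAX_DEGRADATION_PCT;
  }

  int main(void)
  {
          return demo_uc_ok(1000.0, 950.0) ? 0 : 1;
  }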
6 years agoselftests: forwarding: lib: Add mtu_set(), mtu_restore()
Petr Machata [Thu, 20 Sep 2018 06:21:35 +0000 (09:21 +0300)]
selftests: forwarding: lib: Add mtu_set(), mtu_restore()

Some selftests need to tweak the MTU of an interface and should, at
teardown, restore the MTU back to its original value. Add two functions
to facilitate this MTU handling: mtu_set() to change the MTU value and
mtu_restore() to change it back to what it was before.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoselftests: forwarding: lib: Add ethtool_stats_get()
Petr Machata [Thu, 20 Sep 2018 06:21:34 +0000 (09:21 +0300)]
selftests: forwarding: lib: Add ethtool_stats_get()

Add a new service function to obtain ethtool counters.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agomlxsw: spectrum_buffers: Tweak SBMM configuration
Petr Machata [Thu, 20 Sep 2018 06:21:33 +0000 (09:21 +0300)]
mlxsw: spectrum_buffers: Tweak SBMM configuration

The SBMM register configures shared buffer allocation and settings for
MC packets according to switch priority. The recommended values are no
reserved buffer and alpha of 1/4, which corresponds to buf_max of 6.
Update mlxsw_sp_sb_mms accordingly.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agomlxsw: spectrum_buffers: Configure MC pool
Petr Machata [Thu, 20 Sep 2018 06:21:32 +0000 (09:21 +0300)]
mlxsw: spectrum_buffers: Configure MC pool

Pool 15 (indexed as 8) is dedicated to MC traffic. Its configuration
has been kept at the default, because the table-based configuration
wasn't expressive enough to describe it explicitly.

Now that the configuration of pool 15 can be described, do so. The MC
pool should have infinite size, infinite per-TC quota, and per-port
limit of 90K.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agomlxsw: spectrum_buffers: Allow configuration of static pools
Petr Machata [Thu, 20 Sep 2018 06:21:31 +0000 (09:21 +0300)]
mlxsw: spectrum_buffers: Allow configuration of static pools

Some pools configured through the sb_pm entries may by default have a
static size. The MC pool is currently not configured explicitly, but it
implicitly ends up static because sb->prs is zero-initialized; a
follow-up patch adds an explicit configuration to the same effect.

To support this, pass max_buff taken from sb_pm and sb_cm entries
through cell conversion before handing it to mlxsw_sp_sb_pm_write(), if
the pool that the sb_pm entry configures is statically-sized.

To keep current behavior, update mlxsw_sp_sb_cms_egress[] to denote
buffer sizes in bytes (assuming Spectrum 1 cell sizes, which the
original code assumed as well) instead of cells. Note that a follow-up
patch changes this to infinite size.

Also tweak a comment at SBMM configuration to remain true now that
statically-sized pools exist.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
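
A minimal sketch of the conversion rule follows; the cell size and the
helper names are illustrative stand-ins, since the driver takes the
real cell size from the hardware.

  #include <stdbool.h>
  #include <stdio.h>

  #define DEMO_CELL_SIZE 96U                       /* illustrative cell size in bytes */

  static unsigned int demo_bytes_cells(unsigned int bytes)
  {
          return (bytes + DEMO_CELL_SIZE - 1) / DEMO_CELL_SIZE;   /* round up */
  }

  /*
   * Dynamic pools express their quota as an alpha code and must be
   * passed through unchanged; static pools express it in bytes and are
   * handed to the hardware in cells.
   */
  static unsigned int demo_max_buff_to_hw(unsigned int max_buff, bool pool_is_static)
  {
          return pool_is_static ? demo_bytes_cells(max_buff) : max_buff;
  }

  int main(void)
  {
          printf("static 90000 bytes -> %u cells\n", demo_max_buff_to_hw(90000, true));
          printf("dynamic alpha code 7 -> %u (unchanged)\n", demo_max_buff_to_hw(7, false));
          return 0;
  }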
6 years agomlxsw: spectrum_buffers: Pass SBPM min_size in cells
Petr Machata [Thu, 20 Sep 2018 06:21:30 +0000 (09:21 +0300)]
mlxsw: spectrum_buffers: Pass SBPM min_size in cells

The SBPM register configures the shared buffer allocation and
configuration per port and pool. The min_buff value is the buffer size
dedicated to this single function, and is configured in cells.
Currently, all sb_pm entries have 0 for min_buff, and therefore the
actual unit is immaterial. However, in a follow-up patch we want to add
entries with non-zero minimum.

Therefore pass the min_buff from the sb_pm table through the cell
conversion before handing it over to mlxsw_sp_sb_pm_write().

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agomlxsw: spectrum_buffers: Allow an infinite maximum for per-TC pool limit
Petr Machata [Thu, 20 Sep 2018 06:21:29 +0000 (09:21 +0300)]
mlxsw: spectrum_buffers: Allow an infinite maximum for per-TC pool limit

The SBCM register configures the shared buffer configuration according
to port and TC. So far all pools have had a dynamic size, where the
infinite size is easy to express by using max_buff of 0xff. However the
MC pool should be configured with static size, and the infinite size
thus needs to be set using the field SBCM.infi_max.

Therefore add the field infi_max to the SBCM register and to
mlxsw_reg_sbcm_pack(). Extend mlxsw_sp_sb_cm_write() to handle infinite
sizes as well. Report infinite pool limits as if the limit actually were
the total shared buffer size.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
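
A stand-in for the packing logic is sketched below; the struct and the
names are invented for illustration, whereas the real
mlxsw_reg_sbcm_pack() writes register fields instead.

  #include <stdbool.h>
  #include <stdio.h>
  #include <limits.h>

  #define DEMO_SB_INFI UINT_MAX            /* "no quota" sentinel, akin to MLXSW_SP_SB_INFI */

  struct demo_sbcm {
          unsigned int max_buff;
          bool         infi_max;
  };

  /* Encode a per-TC maximum either as an explicit value or as "infinite". */
  static void demo_sbcm_pack(struct demo_sbcm *reg, unsigned int max_buff)
  {
          if (max_buff == DEMO_SB_INFI) {
                  reg->infi_max = true;
                  reg->max_buff = 0;
          } else {
                  reg->infi_max = false;
                  reg->max_buff = max_buff;
          }
  }

  int main(void)
  {
          struct demo_sbcm reg;

          demo_sbcm_pack(&reg, DEMO_SB_INFI);
          printf("infi_max=%d max_buff=%u\n", reg.infi_max, reg.max_buff);
          return 0;
  }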
6 years agomlxsw: spectrum_buffers: Allow pools of infinite size
Petr Machata [Thu, 20 Sep 2018 06:21:28 +0000 (09:21 +0300)]
mlxsw: spectrum_buffers: Allow pools of infinite size

The MC pool should have an infinite size (i.e. no quota).

To that end, add infi_size to the SBPR register and extend
mlxsw_reg_sbpr_pack(). Also add MLXSW_SP_SB_INFI to denote
buffers that should have an infinite size.

Change mlxsw_sp_sb_pr_write() to take as parameter byte size,
instead of cell size, and add the special handling of infinite
buffers. Report pools with infinite size as if they actually
take the full shared buffer size.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agomlxsw: spectrum_buffers: Keep shared buffer size in mlxsw_sp_sb
Petr Machata [Thu, 20 Sep 2018 06:21:27 +0000 (09:21 +0300)]
mlxsw: spectrum_buffers: Keep shared buffer size in mlxsw_sp_sb

Entities of infinite size will be reported as if they had the maximum
size allowed by the chip. To that end, keep track of maximum shared
buffer size in mlxsw_sp->sb.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
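
A short sketch of the reporting rule described above; the names and the
buffer size value are illustrative, not the driver's.

  #include <stdio.h>

  #define DEMO_SB_INFI 0xffffffffu         /* sentinel for "infinite size" */

  struct demo_sb {
          unsigned int sb_size;            /* maximum shared buffer size, cached at init */
  };

  /* Entities configured as infinite are reported as the chip's full buffer size. */
  static unsigned int demo_reported_size(const struct demo_sb *sb, unsigned int size)
  {
          return size == DEMO_SB_INFI ? sb->sb_size : size;
  }

  int main(void)
  {
          struct demo_sb sb = { .sb_size = 16u * 1024 * 1024 };    /* illustrative */

          printf("%u\n", demo_reported_size(&sb, DEMO_SB_INFI));
          return 0;
  }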