git.monstr.eu Git - linux-2.6-microblaze.git/log

Merge branch 'mvneta-access-skb_shared_info-only-on-last-frag'

Lorenzo Bianconi says:

====================
mvneta: access skb_shared_info only on last frag

Build skb_shared_info on mvneta_rx_swbm stack and sync it to xdp_buff
skb_shared_info area only on the last fragment.
Avoid avoid unnecessary xdp_buff initialization in mvneta_rx_swbm routine.
This a preliminary series to complete xdp multi-buff in mvneta driver.
====================

Link: https://lore.kernel.org/r/cover.1605889258.git.lorenzo@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mvneta: alloc skb_shared_info on the mvneta_rx_swbm stack

Build skb_shared_info on mvneta_rx_swbm stack and sync it to xdp_buff
skb_shared_info area only on the last fragment. Leftover cache miss in
mvneta_swbm_rx_frame will be addressed introducing mb bit in
xdp_buff/xdp_frame struct

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mvneta: move skb_shared_info in mvneta_xdp_put_buff caller

Pass skb_shared_info pointer from mvneta_xdp_put_buff caller. This is a
preliminary patch to reduce accesses to skb_shared_info area and reduce
cache misses.
Remove napi parameter in mvneta_xdp_put_buff signature since it is always
run in NAPI context

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mvneta: avoid unnecessary xdp_buff initialization

Explicitly initialize mandatory fields defining xdp_buff struct
in mvneta_rx_swbm routine

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mvpp2: divide fifo for dts-active ports only

Tx/Rx FIFO is a HW resource limited by total size, but shared
by all ports of same CP110 and impacting port-performance.
Do not divide the FIFO for ports which are not enabled in DTS,
so active ports could have more FIFO.
No change in FIFO allocation if all 3 ports on the communication
processor enabled in DTS.

The active port mapping should be done in probe before FIFO-init.

Signed-off-by: Stefan Chulski <stefanc@marvell.com>
Reviewed-by: Russell King <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/1606154073-28267-1-git-send-email-stefanc@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: page_pool: Add page_pool_put_page_bulk() to page_pool.rst

Introduce page_pool_put_page_bulk() entry into the API section of
page_pool.rst

Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/r/a6a5141b4d7b7b71fa7778b16b48f80095dd3233.1606146163.git.lorenzo@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: warn if gso_type isn't set for a GSO SKB

In bug report [0] a warning in r8169 driver was reported that was
caused by an invalid GSO SKB (gso_type was 0). See [1] for a discussion
about this issue. Still the origin of the invalid GSO SKB isn't clear.

It shouldn't be a network drivers task to check for invalid GSO SKB's.
Also, even if issue [0] can be fixed, we can't be sure that a
similar issue doesn't pop up again at another place.
Therefore let gso_features_check() check for such invalid GSO SKB's.

[0] https://bugzilla.kernel.org/show_bug.cgi?id=209423
[1] https://www.spinics.net/lists/netdev/msg690794.html

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/97c78d21-7f0b-d843-df17-3589f224d2cf@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: hns3: fix spelling mistake "memroy" -> "memory"

There are spelling mistakes in two dev_err messages. Fix them.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20201123103452.197708-1-colin.king@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'mlxsw-add-support-for-blackhole-nexthops'

Ido Schimmel says:

====================
mlxsw: Add support for blackhole nexthops

This patch set adds support for blackhole nexthops in mlxsw. These
nexthops are exactly the same as other nexthops, but instead of
forwarding packets to an egress router interface (RIF), they are
programmed to silently drop them.

Patches #1-#4 are preparations.

Patch #5 adds support for blackhole nexthops and removes the check that
prevented them from being programmed.

Patch #6 adds a selftests over mlxsw which tests that blackhole nexthops
can be programmed and are marked as offloaded.

Patch #7 extends the existing nexthop forwarding test to also test
blackhole functionality.

Patches #8-#10 add support for a new packet trap ('blackhole_nexthop')
which should be triggered whenever packets are dropped by a blackhole
nexthop. Obviously, by default, the trap action is set to 'drop' so that
dropped packets will not be reported.
====================

Link: https://lore.kernel.org/r/20201123071230.676469-1-idosch@idosch.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mlxsw: Add blackhole_nexthop trap test

Test that packets hitting a blackhole nexthop are trapped to the CPU
when the trap is enabled. Test that packets are not reported when the
trap is disabled.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: spectrum_trap: Add blackhole_nexthop trap

Register with devlink the blackhole_nexthop trap so that mlxsw will be
able to report packets dropped due to a blackhole nexthop.

The internal trap identifier is "DISCARD_ROUTER3", which traps packets
dropped in the adjacency table.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: Add blackhole_nexthop trap

Add a packet trap to report packets that were dropped due to a
blackhole nexthop.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: forwarding: Add blackhole nexthops tests

Test that IPv4 and IPv6 ping fail when the route is using a blackhole
nexthop or a group with a blackhole nexthop. Test that ping passes when
the route starts using a valid nexthop.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mlxsw: Add blackhole nexthop configuration tests

Test the mlxsw allows blackhole nexthops to be installed and that the
nexthops are marked as offloaded.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: spectrum_router: Add support for blackhole nexthops

Add support for blackhole nexthops by programming them to the adjacency
table with a discard action.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: spectrum_router: Resolve RIF from nexthop struct instead of neighbour

The two are the same, but for blackhole nexthops we will not have an
associated neighbour struct, so resolve the RIF from the nexthop struct
itself instead.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: spectrum_router: Use loopback RIF for unresolved nexthops

Now that the driver creates a loopback RIF during its initialization, it
can be used to program the adjacency entries for unresolved nexthops
instead of other RIFs. The loopback RIF is guaranteed to exist for the
entire life time of the driver, unlike other RIFs that come and go.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: spectrum_router: Use different trap identifier for unresolved nexthops

Unresolved nexthops are currently written to the adjacency table with a
discard action. Packets hitting such entries are trapped to the CPU via
the 'DISCARD_ROUTER3' trap which can be enabled or disabled on demand,
but is always enabled in order to ensure the kernel can resolve the
unresolved neighbours.

This trap will be needed for blackhole nexthops support. Therefore, move
unresolved nexthops to explicitly program the adjacency entries with a
trap action and a different trap identifier, 'RTR_EGRESS0'.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: spectrum_router: Create loopback RIF during initialization

Up until now RIFs (router interfaces) were created on demand (e.g.,
when an IP address was added to a netdev). However, sometimes the device
needs to be provided with a RIF when one might not be available.

For example, adjacency entries that drop packets need to be programmed
with an egress RIF despite the RIF not being used to forward packets.

Create such a RIF during initialization so that it could be used later
on to support blackhole nexthops.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'rxrpc-next-20201123' of git://git./linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Prelude to gssapi support

Here are some patches that do some reorganisation of the security class
handling in rxrpc to allow implementation of the RxGK security class that
will allow AF_RXRPC to use GSSAPI-negotiated tokens and better crypto.  The
RxGK security class is not included in this patchset.

It does the following things:

(1) Add a keyrings patch to provide the original key description, as
     provided to add_key(), to the payload preparser so that it can
     interpret the content on that basis.  Unfortunately, the rxrpc_s key
     type wasn't written to interpret its payload as anything other than a
     string of bytes comprising a key, but for RxGK, more information is
     required as multiple Kerberos enctypes are supported.

(2) Remove the rxk5 security class key parsing.  The rxk5 class never got
     rolled out in OpenAFS and got replaced with rxgk.

(3) Support the creation of rxrpc keys with multiple tokens of different
     types.  If some types are not supported, the ENOPKG error is
     suppressed if at least one other token's type is supported.

(4) Punt the handling of server keys (rxrpc_s type) to the appropriate
     security class.

(5) Organise the security bits in the rxrpc_connection struct into a
     union to make it easier to override for other classes.

(6) Move some bits from core code into rxkad that won't be appropriate to
     rxgk.

* tag 'rxrpc-next-20201123' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  rxrpc: Ask the security class how much space to allow in a packet
  rxrpc: rxkad: Don't use pskb_pull() to advance through the response packet
  rxrpc: Organise connection security to use a union
  rxrpc: Don't reserve security header in Tx DATA skbuff
  rxrpc: Merge prime_packet_security into init_connection_security
  rxrpc: Fix example key name in a comment
  rxrpc: Ignore unknown tokens in key payload unless no known tokens
  rxrpc: Make the parsing of xdr payloads more coherent
  rxrpc: Allow security classes to give more info on server keys
  rxrpc: Don't leak the service-side session key to userspace
  rxrpc: Hand server key parsing off to the security class
  rxrpc: Split the server key type (rxrpc_s) into its own file
  rxrpc: Don't retain the server key in the connection
  rxrpc: Support keys with multiple authentication tokens
  rxrpc: List the held token types in the key description in /proc/keys
  rxrpc: Remove the rxk5 security class as it's now defunct
  keys: Provide the original description to the key preparser
====================

Link: https://lore.kernel.org/r/160616220405.830164.2239716599743995145.stgit@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: Use hrtimer for TX coalescing

This driver uses a normal timer for TX coalescing, which means that the
with the default tx-usecs of 1000 microseconds the cleanups actually
happen 10 ms or more later with HZ=100. This leads to very low
througput with TCP when bridged to a slow link such as a 4G modem. Fix
this by using an hrtimer instead.

On my ARM platform with HZ=100 and the default TX coalescing settings
(tx-frames 25 tx-usecs 1000), with "tc qdisc add dev eth0 root netem
delay 60ms 40ms rate 50Mbit" run on the server, netperf's TCP_STREAM
improves from ~5.5 Mbps to ~100 Mbps.

Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Link: https://lore.kernel.org/r/20201120150208.6838-1-vincent.whitchurch@axis.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sctp: Fix some typo

s/tranport/transport/

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/20201122180704.1366636-1-christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: don't include ethtool.h from netdevice.h

linux/netdevice.h is included in very many places, touching any
of its dependecies causes large incremental builds.

Drop the linux/ethtool.h include, linux/netdevice.h just needs
a forward declaration of struct ethtool_ops.

Fix all the places which made use of this implicit include.

Acked-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Shannon Nelson <snelson@pensando.io>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Link: https://lore.kernel.org/r/20201120225052.1427503-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: pch_gbe: Use 'dma_free_coherent()' to undo 'dma_alloc_coherent()'

Memory allocation are done with 'dma_alloc_coherent()'. Be consistent
and use 'dma_free_coherent()' to free the corresponding memory.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/20201121090330.1332543-1-christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: pch_gbe: Use dma_set_mask_and_coherent to simplify code

'pci_set_dma_mask()' + 'pci_set_consistent_dma_mask()' can be replaced by
an equivalent 'dma_set_mask_and_coherent()' which is much less verbose.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/20201121090302.1332491-1-christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-dsa-hellcreek-minor-cleanups'

Kurt Kanzenbach says:

====================
net: dsa: hellcreek: Minor cleanups

fix two minor issues in the hellcreek driver.
====================

Link: https://lore.kernel.org/r/20201121114455.22422-1-kurt@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: hellcreek: Don't print error message on defer

When DSA is not loaded when the driver is probed an error message is
printed. But, that's not really an error, just a defer. Use dev_err_probe()
instead.

Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: tag_hellcreek: Cleanup includes

Remove unused and add needed includes. No functional change.

Suggested-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-ptp-introduce-common-defines-for-ptp-message-types'

Christian Eggers says:

====================
net: ptp: introduce common defines for PTP message types

This series introduces commen defines for PTP event messages. Driver
internal defines are removed and some uses of magic numbers are replaced
by the new defines.
====================

Link: https://lore.kernel.org/r/20201120084106.10046-1-ceggers@arri.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ptp: ptp_ines: use new PTP_MSGTYPE_* define(s)

Remove driver internal defines for this. Masking msgtype with 0xf is
already done within ptp_get_msgtype().

Signed-off-by: Christian Eggers <ceggers@arri.de>
Cc: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dpaa2-eth: use new PTP_MSGTYPE_* define(s)

Remove usage of magic numbers.

Signed-off-by: Christian Eggers <ceggers@arri.de>
Cc: Ioana Ciornei <ioana.ciornei@nxp.com>
Cc: Ioana Radulescu <ruxandra.radulescu@nxp.com>
Cc: Yangbo Lu <yangbo.lu@nxp.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ptp: introduce common defines for PTP message types

Using PTP wide defines will obsolete different driver internal defines
and uses of magic numbers.

Signed-off-by: Christian Eggers <ceggers@arri.de>
Cc: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

compat: always include linux/compat.h from net/compat.h

We're about to do reshuffling in networking headers and
eliminate some implicit includes. This results in:

In file included from ../net/ipv4/netfilter/arp_tables.c:26:
include/net/compat.h:60:40: error: unknown type name ‘compat_uptr_t’; did you mean ‘compat_ptr_ioctl’?
    struct sockaddr __user **save_addr, compat_uptr_t *ptr,
                                        ^~~~~~~~~~~~~
                                        compat_ptr_ioctl
include/net/compat.h:61:4: error: unknown type name ‘compat_size_t’; did you mean ‘compat_sigset_t’?
    compat_size_t *len);
    ^~~~~~~~~~~~~
    compat_sigset_t

Currently net/compat.h depends on linux/compat.h being included
first. After the upcoming changes this would break the 32bit build.

Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20201121214844.1488283-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

rxrpc: Ask the security class how much space to allow in a packet

Ask the security class how much header and trailer space to allow for when
allocating a packet, given how much data is remaining.

This will allow the rxgk security class to stick both a trailer in as well
as a header as appropriate in the future.

Signed-off-by: David Howells <dhowells@redhat.com>

net/tun: Call type change netdev notifiers

Call netdev notifiers before and after changing the device type.

Signed-off-by: Martin Schiller <ms@dev.tdt.de>
Link: https://lore.kernel.org/r/20201118063919.29485-1-ms@dev.tdt.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

rxrpc: rxkad: Don't use pskb_pull() to advance through the response packet

In the rxkad security class, don't use pskb_pull() to advance through the
contents of the response packet. There's no point, especially as the next
and last access to the skbuff still has to allow for the wire header in the
offset (which we didn't advance over).

Better to just add the displacement to the next offset.

Signed-off-by: David Howells <dhowells@redhat.com>

rxrpc: Organise connection security to use a union

Organise the security information in the rxrpc_connection struct to use a
union to allow for different data for different security classes.

Signed-off-by: David Howells <dhowells@redhat.com>

rxrpc: Don't reserve security header in Tx DATA skbuff

Insert the security header into the skbuff representing a DATA packet to be
transmitted rather than using skb_reserve() when the packet is allocated.
This makes it easier to apply crypto that spans the security header and the
data, particularly in the upcoming RxGK class where we have a common
encrypt-and-checksum function that is used in a number of circumstances.

Signed-off-by: David Howells <dhowells@redhat.com>

rxrpc: Merge prime_packet_security into init_connection_security

Merge the ->prime_packet_security() into the ->init_connection_security()
hook as they're always called together.

Signed-off-by: David Howells <dhowells@redhat.com>

rxrpc: Fix example key name in a comment

Fix an example of an rxrpc key name in a comment.

Signed-off-by: David Howells <dhowells@redhat.com>

rxrpc: Ignore unknown tokens in key payload unless no known tokens

When parsing a payload for an rxrpc-type key, ignore any tokens that are
not of a known type and don't give an error for them - unless there are no
tokens of a known type.

Signed-off-by: David Howells <dhowells@redhat.com>

rxrpc: Make the parsing of xdr payloads more coherent

Make the parsing of xdr-encoded payloads, as passed to add_key, more
coherent. Shuttling back and forth between various variables was a bit
hard to follow.

Signed-off-by: David Howells <dhowells@redhat.com>

rxrpc: Allow security classes to give more info on server keys

Allow a security class to give more information on an rxrpc_s-type key when
it is viewed in /proc/keys. This will allow the upcoming RxGK security
class to show the enctype name here.

Signed-off-by: David Howells <dhowells@redhat.com>

rxrpc: Don't leak the service-side session key to userspace

Don't let someone reading a service-side rxrpc-type key get access to the
session key that was exchanged with the client. The server application
will, at some point, need to be able to read the information in the ticket,
but this probably shouldn't include the key material.

Signed-off-by: David Howells <dhowells@redhat.com>

rxrpc: Hand server key parsing off to the security class

Hand responsibility for parsing a server key off to the security class. We
can determine which class from the description. This is necessary as rxgk
server keys have different lookup requirements and different content
requirements (dependent on crypto type) to those of rxkad server keys.

Signed-off-by: David Howells <dhowells@redhat.com>

rxrpc: Split the server key type (rxrpc_s) into its own file

Split the server private key type (rxrpc_s) out into its own file rather
than mingling it with the authentication/client key type (rxrpc) since they
don't really bear any relation.

Signed-off-by: David Howells <dhowells@redhat.com>

rxrpc: Don't retain the server key in the connection

Don't retain a pointer to the server key in the connection, but rather get
it on demand when the server has to deal with a response packet.

This is necessary to implement RxGK (GSSAPI-mediated transport class),
where we can't know which key we'll need until we've challenged the client
and got back the response.

This also means that we don't need to do a key search in the accept path in
softirq mode.

Also, whilst we're at it, allow the security class to ask for a kvno and
encoding-type variant of a server key as RxGK needs different keys for
different encoding types. Keys of this type have an extra bit in the
description:

"<service-id>:<security-index>:<kvno>:<enctype>"

Signed-off-by: David Howells <dhowells@redhat.com>

rxrpc: Support keys with multiple authentication tokens

rxrpc-type keys can have multiple tokens attached for different security
classes. Currently, rxrpc always picks the first one, whether or not the
security class it indicates is supported.

Add preliminary support for choosing which security class will be used
(this will need to be directed from a higher layer) and go through the
tokens to find one that's supported.

Signed-off-by: David Howells <dhowells@redhat.com>

rxrpc: List the held token types in the key description in /proc/keys

When viewing an rxrpc-type key through /proc/keys, display a list of held
token types.

Signed-off-by: David Howells <dhowells@redhat.com>

rxrpc: Remove the rxk5 security class as it's now defunct

Remove the rxrpc rxk5 security class as it's now defunct and nothing uses
it anymore.

Signed-off-by: David Howells <dhowells@redhat.com>

keys: Provide the original description to the key preparser

Provide the proposed description (add key) or the original description
(update/instantiate key) when preparsing a key so that the key type can
validate it against the data.

This is important for rxrpc server keys as we need to check that they have
the right amount of key material present - and it's better to do that when
the key is loaded rather than deep in trying to process a response packet.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
cc: keyrings@vger.kernel.org

octeontx2-af: Add support for RSS hashing based on Transport protocol field

Add support to choose RSS flow key algorithm with IPv4 transport protocol
field included in hashing input data. This will be enabled by default.
There-by enabling 3/5 tuple hash

Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: George Cherian <george.cherian@marvell.com>
Link: https://lore.kernel.org/r/20201120093906.2873616-1-george.cherian@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'linux-can-next-for-5.11-20201120' of git://git./linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
pull-request: can-next 2020-11-20

The first patch is by Yegor Yefremov and he improves the j1939 documentaton by
adding tables for the CAN identifier and its fields.

Then there are 8 patches by Oliver Hartkopp targeting the CAN driver
infrastructure and drivers. These add support for optional DLC element to the
Classical CAN frame structure. See patch ea7800565a12 ("can: add optional DLC
element to Classical CAN frame structure") for details. Oliver's last patch
adds len8_dlc support to several drivers. Stefan Mätje provides a patch to add
len8_dlc support to the esd_usb2 driver.

The next patch is by Oliver Hartkopp, too and adds support for modification of
Classical CAN DLCs to CAN GW sockets.

The next 3 patches target the nxp,flexcan DT bindings. One patch by my adds the
missing uint32 reference to the clock-frequency property. Joakim Zhang's
patches fix the fsl,clk-source property and add the IMX_SC_R_CAN() macro to the
imx firmware header file, which will be used in the flexcan driver later.

Another patch by Joakim Zhang prepares the flexcan driver for SCU based
stop-mode, by giving the existing, GPR based stop-mode, a _GPR postfix.

The next 5 patches are by me, target the flexcan driver, and clean up the
.ndo_open and .ndo_stop callbacks. These patches try to fix a sporadically
hanging flexcan_close() during simultanious ifdown, sending of CAN messages and
probably open CAN bus. I was never able to reproduce, but these seem to fix the
problem at the reporting user. As these changes are rather big, I'd like to
mainline them via net-next/master.

The next patches are by Jimmy Assarsson and Christer Beskow, they add support
for new USB devices to the existing kvaser_usb driver.

The last patch is by Kaixu Xia and simplifies the return in the
mcp251xfd_chip_softreset() function in the mcp251xfd driver.

* tag 'linux-can-next-for-5.11-20201120' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next: (25 commits)
  can: mcp251xfd: remove useless code in mcp251xfd_chip_softreset
  can: kvaser_usb: Add new Kvaser hydra devices
  can: kvaser_usb: kvaser_usb_hydra: Add support for new device variant
  can: kvaser_usb: Add new Kvaser Leaf v2 devices
  can: kvaser_usb: Add USB_{LEAF,HYDRA}_PRODUCT_ID_END defines
  can: flexcan: flexcan_close(): change order if commands to properly shut down the controller
  can: flexcan: flexcan_open(): completely initialize controller before requesting IRQ
  can: flexcan: flexcan_rx_offload_setup(): factor out mailbox and rx-offload setup into separate function
  can: flexcan: move enabling/disabling of interrupts from flexcan_chip_{start,stop}() to callers
  can: flexcan: factor out enabling and disabling of interrupts into separate function
  can: flexcan: rename macro FLEXCAN_QUIRK_SETUP_STOP_MODE -> FLEXCAN_QUIRK_SETUP_STOP_MODE_GPR
  dt-bindings: firmware: add IMX_SC_R_CAN(x) macro for CAN
  dt-bindings: can: fsl,flexcan: fix fsl,clk-source property
  dt-bindings: can: fsl,flexcan: add uint32 reference to clock-frequency property
  can: gw: support modification of Classical CAN DLCs
  can: drivers: add len8_dlc support for esd_usb2 CAN adapter
  can: drivers: add len8_dlc support for various CAN adapters
  can: drivers: introduce helpers to access Classical CAN DLC values
  can: update documentation for DLC usage in Classical CAN
  can: rename CAN FD related can_len2dlc and can_dlc2len helpers
  ...
====================

Link: https://lore.kernel.org/r/20201120133318.3428231-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: bridge: switch to net core statistics counters handling

Use netdev->tstats instead of a member of net_bridge for storing
a pointer to the per-cpu counters. This allows us to use core
functionality for statistics handling.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/9bad2be2-fd84-7c6e-912f-cee433787018@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-hns3-misc-updates-for-next'

Huazhong Tan says:

====================
net: hns3: misc updates for -next

This series includes some misc updates for the HNS3 ethernet driver.

#1 adds support for 1280 queues
#2 adds mapping for BAR45 which is needed by RoCE client.
#3 extend the interrupt resources.
#4&#5 add support to query firmware's calculated shaping parameters.
====================

Link: https://lore.kernel.org/r/1605863783-36995-1-git-send-email-tanhuazhong@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: hns3: adds debugfs to dump more info of shaping parameters

Adds debugfs to dump new shaping parameters: rate and flag.

Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: hns3: add support to utilize the firmware calculated shaping parameters

Since the calculation of the driver is fixed, if the number of
queue or clock changed, the calculated result may be inaccurate.

So for compatible and maintainable, add a new flag to tell the
firmware to calculate the shaping parameters with the specified
rate.

Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: hns3: add support for pf querying new interrupt resources

For HNAE3_DEVICE_VERSION_V3, a maximum of 1281 interrupt
resources are supported. To utilize these new resources,
extend the corresponding field or variable to 16bit type,
and remove the restriction of NIC client that only use a
maximum of 65 interrupt vectors. In addition, the I/O address
of the extended interrupt resources are different, so an extra
handler is needed.

Currently, the total number of interrupts is the sum of RoCE's
number and RoCE's offset (RoCE is in front of NIC), since
the number of both NIC and RoCE are same. For readability,
rewrite the corresponding field of the command, rename the
RoCE's offset field as the number of NIC interrupts, then
the total number of interrupts is sum of the number of RoCE
and NIC, and replace vport->back with hdev in
hclge_init_roce_base_info() for simplifying the code.

Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: hns3: add support for mapping device memory

For device who has device memory accessed through the PCI BAR4,
IO descriptor push of NIC and direct WQE(Work Queue Element) of
RoCE will use this device memory, so add support for mapping
this device memory, and add this info to the RoCE client whose
new feature needs.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: hns3: add support for 1280 queues

For DEVICE_VERSION_V1/2, there are total 1024 queues and
queue sets. For DEVICE_VERSION_V3, it increases to 1280,
and can be assigned to one pf， so remove the limitation
of 1024.

To keep compatible with DEVICE_VERSION_V1/2 and old driver
version, the queue number is split into two part:
tqp_num(range 0~1023) and ext_tqp_num(range 1024~1279).

Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'ibmvnic-performance-improvements-and-other-updates'

Thomas Falcon says:

====================
ibmvnic: Performance improvements and other updates

The first three patches utilize a hypervisor call allowing multiple
TX and RX buffer replenishment descriptors to be sent in one operation,
which significantly reduces hypervisor call overhead. The xmit_more
and Byte Queue Limit API's are leveraged to provide this support
for TX descriptors.

The subsequent two patches remove superfluous code and members in
TX completion handling function and TX buffer structure, respectively,
and remove unused routines.

Finally, four patches which ensure that device queue memory is
cache-line aligned, resolving slowdowns observed in PCI traces,
as well as optimize the driver's NAPI polling function and
to RX buffer replenishment are provided by Dwip Banerjee.

This series provides significant performance improvements, allowing
the driver to fully utilize 100Gb NIC's.
====================

Link: https://lore.kernel.org/r/1605748345-32062-1-git-send-email-tlfalcon@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ibmvnic: Do not replenish RX buffers after every polling loop

Reduce the amount of time spent replenishing RX buffers by only doing
so once available buffers has fallen under a certain threshold, in this
case half of the total number of buffers, or if the polling loop exits
before the packets processed is less than its budget. Non-exhaustion of
NAPI budget implies lower incoming packet pressure, allowing the leeway
to refill the buffers in preparation for any impending burst.

Signed-off-by: Dwip N. Banerjee <dnbanerg@us.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ibmvnic: Use netdev_alloc_skb instead of alloc_skb to replenish RX buffers

Take advantage of the additional optimizations in netdev_alloc_skb when
allocating socket buffers to be used for packet reception.

Signed-off-by: Dwip N. Banerjee <dnbanerg@us.ibm.com>
Acked-by: Lijun Pan <ljp@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ibmvnic: Correctly re-enable interrupts in NAPI polling routine

If the current NAPI polling loop exits without completing it's
budget, only re-enable interrupts if there are no entries remaining
in the queue and napi_complete_done is successful. If there are entries
remaining on the queue that were missed, restart the polling loop.

Signed-off-by: Dwip N. Banerjee <dnbanerg@us.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ibmvnic: Ensure that device queue memory is cache-line aligned

PCI bus slowdowns were observed on IBM VNIC devices as a result
of partial cache line writes and non-cache aligned full cache line writes.
Ensure that packet data buffers are cache-line aligned to avoid these
slowdowns.

Signed-off-by: Dwip N. Banerjee <dnbanerg@us.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ibmvnic: Remove send_subcrq function

It is not longer used, so remove it.

Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Acked-by: Lijun Pan <ljp@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ibmvnic: Clean up TX code and TX buffer data structure

Remove unused and superfluous code and members in
existing TX implementation and data structures.

Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ibmvnic: Introduce xmit_more support using batched subCRQ hcalls

Include support for the xmit_more feature utilizing the
H_SEND_SUB_CRQ_INDIRECT hypervisor call which allows the sending
of multiple subordinate Command Response Queue descriptors in one
hypervisor call via a DMA-mapped buffer. This update reduces hypervisor
calls and thus hypervisor call overhead per TX descriptor.

Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ibmvnic: Introduce batched RX buffer descriptor transmission

Utilize the H_SEND_SUB_CRQ_INDIRECT hypervisor call to send
multiple RX buffer descriptors to the device in one hypervisor
call operation. This change will reduce the number of hypervisor
calls and thus hypervisor call overhead needed to transmit
RX buffer descriptors to the device.

Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ibmvnic: Introduce indirect subordinate Command Response Queue buffer

This patch introduces the infrastructure to send batched subordinate
Command Response Queue descriptors, which are used by the ibmvnic
driver to send TX frame and RX buffer descriptors.

Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Acked-by: Lijun Pan <ljp@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-ipa-add-a-driver-shutdown-callback'

Alex Elder says:

====================
net: ipa: add a driver shutdown callback

The final patch in this series adds a driver shutdown callback for
the IPA driver.  The patches leading up to that address some issues
encountered while ensuring that callback worked as expected:
  - The first just reports a little more information when channels
    or event rings are in unexpected states
  - The second patch recognizes a condition where an as-yet-unused
    channel does not require a reset during teardown
  - The third patch explicitly ignores a certain error condition,
    because it can't be avoided, and is harmless if it occurs
  - The fourth properly handles requests to retry a channel HALT
    request
  - The fifth makes a second attempt to stop modem activity during
    shutdown if it's busy

The shutdown callback is implemented by calling the existing remove
callback function (reporting if that returns an error).
====================

Link: https://lore.kernel.org/r/20201119224929.23819-1-elder@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: add driver shutdown callback

A system shutdown can happen at essentially any time, and it's
possible that the IPA driver is busy when a shutdown is underway.
IPA hardware accesses IMEM and SMEM memory regions using an IOMMU,
and at some point during shutdown, needed I/O mappings could become
invalid. This could be disastrous for any "in flight" IPA activity.

Avoid this by defining a new driver shutdown callback that stops all
IPA activity and cleanly shuts down the driver. It merely calls the
driver's existing remove callback, reporting the error if it returns
one.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: retry modem stop if busy

The IPA driver remove callback, ipa_remove(), calls ipa_modem_stop()
if the setup stage of initialization is complete.  If a concurrent
call to ipa_modem_start() or ipa_modem_stop() has begin but not
completed, ipa_modem_stop() can return an error (-EBUSY).

The next patch adds a driver shutdown callback, which will simply
call ipa_remove().  We really want our shutdown callback to clean
things up.  So add a single retry to the ipa_modem_stop() call in
ipa_remove() after a short (millisecond) delay.  This offers no
guarantee the shutdown will complete successfully, but we'll at
least try a little harder before giving up.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: support retries on generic GSI commands

When stopping an AP RX channel, there can be a transient period
while the channel enters STOP_IN_PROC state before reaching the
final STOPPED state.  In that case we make another attempt to stop
the channel.

Similarly, when stopping a modem channel (using a GSI generic
command issued from the AP), it's possible that multiple attempts
will be required before the channel reaches STOPPED state.

Add a field to the GSI structure to record an errno representing the
result code provided when a generic command completes.  If the
result learned in gsi_isr_gp_int1() is RETRY, record -EAGAIN in the
result code, otherwise record 0 for success, or -EIO for any other
result.

If we time out nf gsi_generic_command() waiting for the command to
complete, return -ETIMEDOUT (as before).  Otherwise return the
result stashed by gsi_isr_gp_int1().

Add a loop in gsi_modem_channel_halt() to reissue the HALT command
if the result code indicates -EAGAIN.  Limit this to 10 retries
(after the initial attempt).

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: ignore CHANNEL_NOT_RUNNING errors

IPA v4.2 has a hardware quirk that requires the AP to allocate GSI
channels for the modem to use. It is recommended that these modem
channels get stopped (with a HALT generic command) by the AP when
its IPA driver gets removed.

The AP has no way of knowing the current state of a modem channel.
So when the IPA driver issues a HALT command it's possible the
channel is not running, and in that case we get an error indication.
This error simply means we didn't need to stop the channel, so we
can ignore it.

This patch adds an explanation for this situation, and arranges for
this condition to *not* report an error message.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: don't reset an ALLOCATED channel

If the rmnet_ipa0 network device has not been opened at the time
we remove or shut down the IPA driver, its underlying TX and RX
GSI channels will not have been started, and they will still be
in ALLOCATED state.

The RESET command on a channel is meant to return a channel to
ALLOCATED state after it's been stopped. But if it was never
started, its state will still be ALLOCATED, the RESET command
is not required.

Quietly skip doing the reset without printing an error message if a
channel is already in ALLOCATED state when we request it be reset.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: print channel/event ring number on error

When a GSI command is used to change the state of a channel or event
ring we check the state before and after the command to ensure it is
as expected. If not, we print an error message, but it does not
include the channel or event ring id, and it easily can. Add the
channel or event ring id to these error messages.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-ipa-platform-specific-clock-and-interconnect-rates'

Alex Elder says:

====================
net: ipa: platform-specific clock and interconnect rates

This series changes the way the IPA core clock rate and the
bandwidth parameters for interconnects are specified.  Previously
these were specified with hard-wired constants, with the same values
used for the SDM845 and SC7180 platforms.  Now these parameters are
recorded in platform-specific configuration data.

For the SC7180 this means we use an all-new core clock rate and
interconnect parameters.

Additionally, while developing this I learned that the average
bandwidth setting for two of the interconnects is ignored (on both
platforms).  Zero is now used explicitly as that unused bandwidth
value.  This means the SDM845 bandwidth settings are also changed
by this series.
====================

Link: https://lore.kernel.org/r/20201119224041.16066-1-elder@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: use config data for clocking

Stop assuming a fixed IPA core clock rate and interconnect
bandwidths. Use the configuration data defined for these
things instead. Get rid of the previously-used constants.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: populate clock and interconnect data

Populate the core clock rate and interconnect average and peak
bandwidth data for SDM845 and SC7180 in their configuration data
files. At this point we still don't *use* this data.

Note that SC7180 actually defines a new core clock rate (100 MHz
instead of 75 MHz) and new interconnect bandwidth values. They
will be activated in the next commit, which uses the configured
values rather than the fixed constants.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: define clock and interconnect data

Define a new type of configuration data, used to initialize the
IPA core clock and interconnects. This is the first of three
patches, and defines the data types and interface but doesn't
yet use them.

Switch the return value if there is no matching configuration data
to ENODEV instead of ENOTSUPP (to avoid using the nonstandard errno).

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mdio_bus: suppress err message for reset gpio EPROBE_DEFER

The mdio_bus may have dependencies from GPIO controller and so got
deferred. Now it will print error message every time -EPROBE_DEFER is
returned which from:
__mdiobus_register()
|-devm_gpiod_get_optional()
without actually identifying error code.

"mdio_bus 4a101000.mdio: mii_bus 4a101000.mdio couldn't get reset GPIO"

Hence, suppress error message for devm_gpiod_get_optional() returning
-EPROBE_DEFER case by using dev_err_probe().

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Link: https://lore.kernel.org/r/20201119203446.20857-1-grygorii.strashko@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

r8169: use dev_err_probe in rtl_get_ether_clk

Tiny improvement, let dev_err_probe() deal with EPROBE_DEFER.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/b0c4ebcf-2047-e933-b890-8a20e4bdb19f@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

r8169: reduce number of workaround doorbell rings

Some chip versions have a hw bug resulting in lost door bell rings.
To work around this the doorbell is also rung whenever we still have
tx descriptors in flight after having cleaned up tx descriptors.
These PCI(e) writes come at a cost, therefore let's reduce the number
of extra doorbell rings.
If skb is NULL then this means:
- last cleaned-up descriptor belongs to a skb with at least one fragment
  and last fragment isn't marked as sent yet
- hw is in progress sending the skb, therefore no extra doorbell ring
  is needed for this skb
- once last fragment is marked as transmitted hw will trigger
  a tx done interrupt and we come here again (with skb != NULL)
  and ring the doorbell if needed
Therefore skip the workaround doorbell ring if skb is NULL.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/0a15a83c-aecf-ab51-8071-b29d9dcd529a@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'mptcp-more-miscellaneous-mptcp-fixes'

Mat Martineau says:

====================
mptcp: More miscellaneous MPTCP fixes

Here's another batch of fixup and enhancement patches that we have
collected in the MPTCP tree.

Patch 1 removes an unnecessary flag and related code.

Patch 2 fixes a bug encountered when closing fallback sockets.

Patches 3 and 4 choose a better transmit subflow, with a self test.

Patch 5 adjusts tracking of unaccepted subflows

Patches 6-8 improve handling of long ADD_ADDR options, with a test.

Patch 9 more reliably tracks the MPTCP-level window shared with peers.

Patch 10 sends MPTCP-level acknowledgements more aggressively, so the
peer can send more data without extra delay.
====================

Link: https://lore.kernel.org/r/20201119194603.103158-1-mathew.j.martineau@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: refine MPTCP-level ack scheduling

Send timely MPTCP-level ack is somewhat difficult when
the insertion into the msk receive level is performed
by the worker.

It needs TCP-level dup-ack to notify the MPTCP-level
ack_seq increase, as both the TCP-level ack seq and the
rcv window are unchanged.

We can actually avoid processing incoming data with the
worker, and let the subflow or recevmsg() send ack as needed.

When recvmsg() moves the skbs inside the msk receive queue,
the msk space is still unchanged, so tcp_cleanup_rbuf() could
end-up skipping TCP-level ack generation. Anyway, when
__mptcp_move_skbs() is invoked, a known amount of bytes is
going to be consumed soon: we update rcv wnd computation taking
them in account.

Additionally we need to explicitly trigger tcp_cleanup_rbuf()
when recvmsg() consumes a significant amount of the receive buffer.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: track window announced to peer

OoO handling attempts to detect when packet is out-of-window by testing
current ack sequence and remaining space vs. sequence number.

This doesn't work reliably. Store the highest allowed sequence number
that we've announced and use it to detect oow packets.

Do this when mptcp options get written to the packet (wire format).
For this to work we need to move the write_options call until after
stack selected a new tcp window.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: add ADD_ADDR IPv6 test cases

This patch added IPv6 support for do_transfer, and the test cases for
ADD_ADDR IPv6.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: send out dedicated ADD_ADDR packet

When ADD_ADDR suboption includes an IPv6 address, the size is 28 octets.
It will not fit when other MPTCP suboptions are included in this packet,
e.g. DSS. So here we send out an ADD_ADDR dedicated packet to carry only
ADD_ADDR suboption, no other MPTCP suboptions.

In mptcp_pm_announce_addr, we check whether this is an IPv6 ADD_ADDR.
If it is, we set the flag MPTCP_ADD_ADDR_IPV6 to true. Then we call
mptcp_pm_nl_add_addr_send_ack to sent out a new pure ACK packet.

In mptcp_established_options_add_addr, we check whether this is a pure
ACK packet for ADD_ADDR. If it is, we drop all other MPTCP suboptions
in this packet, only put ADD_ADDR suboption in it.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: change add_addr_signal type

This patch changed the 'add_addr_signal' type from bool to char, so that
we could encode the addr type there.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: keep unaccepted MPC subflow into join list

This will simplify all operation dealing with subflows
before accept time (e.g. data fin processing, add_addr).

The join list is already flushed by mptcp_stream_accept()
before returning the newly created msk to the user space.

This also fixes an potential bug present into the old code:
conn_list was manipulated without helding the msk lock
in mptcp_stream_accept().

Tested-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: add link failure test case

Add a test case where a link fails with multiple subflows.
The expectation is that MPTCP will transmit any data that
could not be delivered via the failed link on another subflow.

Co-developed-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: skip to next candidate if subflow has unacked data

In case a subflow path is blocked, MPTCP-level retransmit may not take
place anymore because such subflow is likely to have unacked data left
in its write queue.

Ignore subflows that have experienced loss and test next candidate.

Fixes: 3b1d6210a95773691 ("mptcp: implement and use MPTCP-level retransmission")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: fix state tracking for fallback socket

We need to cope with some more state transition for
fallback sockets, or could still end-up moving to TCP_CLOSE
too early and avoid spooling some pending data

Fixes: e16163b6e2b7 ("mptcp: refactor shutdown and close")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: drop WORKER_RUNNING status bit

Only mptcp_close() can actually cancel the workqueue,
no need to add and use this flag.

Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'mlxsw-add-support-for-nexthop-objects'

Ido Schimmel says:

====================
mlxsw: Add support for nexthop objects

This patch set adds support for nexthop objects in mlxsw. Nexthop
objects are treated as another front-end for programming nexthops, in
addition to the existing IPv4 and IPv6 front-ends.

Patch #1 registers a listener to the nexthop notification chain and
parses the nexthop information into the existing mlxsw data structures
that are already used by the IPv4 and IPv6 front-ends. Blackhole
nexthops are currently rejected. Support will be added in a follow-up
patch set.

Patch #2 extends mlxsw to resolve its internal nexthop objects from the
nexthop identifier encoded in the FIB info of the notified routes.

Patch #3 finally removes the limitation of rejecting routes that use
nexthop objects.

Patch #4 adds a selftest.

Patches #5-#8 add generic forwarding selftests that can be used with
veth pairs or physical loopbacks.
====================

Link: https://lore.kernel.org/r/20201119130848.407918-1-idosch@idosch.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: forwarding: Add multipath tunneling nexthop test

Add a nexthop objects version of gre_multipath.sh. Unlike the original
test, it also tests IPv6 overlay which is not possible with the legacy
nexthop implementation. See commit 9a2ad3623868 ("selftests: forwarding:
gre_multipath: Drop IPv6 tests") for more info.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: forwarding: Add device-only nexthop test

In a similar fashion to router_multipath.sh and its nexthop objects
version router_mpath_nh.sh, create a nexthop objects version of
router.sh.

It reuses the same topology, but uses device-only nexthop objects
instead of legacy ones.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: forwarding: Test IPv4 routes with IPv6 link-local nexthops

In addition to IPv4 multipath tests with IPv4 nexthops, also test IPv4
multipath with nexthops that use IPv6 link-local addresses.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: forwarding: Do not configure nexthop objects twice

routing_nh_obj() is used to configure the nexthop objects employed by
the test, but it is called twice resulting in "RTNETLINK answers: File
exists" messages.

Remove the first call, so that the function is only called after
setup_wait(), when all the interfaces are up and ready.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>