git.monstr.eu Git - linux-2.6-microblaze.git/log

libbpf: Bump libpf current version to v0.0.8

New development cycles starts, bump to v0.0.8.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/158220518424.127661.8278643006567775528.stgit@xdp-tutorial

libbpf: Relax check whether BTF is mandatory

If BPF program is using BTF-defined maps, BTF is required only for
libbpf itself to process map definitions. If after that BTF fails to
be loaded into kernel (e.g., if it doesn't support BTF at all), this
shouldn't prevent valid BPF program from loading. Existing
retry-without-BTF logic for creating maps will succeed to create such
maps without any problems. So, presence of .maps section shouldn't make
BTF required for kernel. Update the check accordingly.

Validated by ensuring simple BPF program with BTF-defined maps is still
loaded on old kernel without BTF support and map is correctly parsed and
created.

Fixes: abd29c931459 ("libbpf: allow specifying map definitions using BTF")
Reported-by: Julia Kartseva <hex@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200220062635.1497872-1-andriin@fb.com

selftests/bpf: Fix build of sockmap_ktls.c

The selftests fails to build with:
tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c: In function ‘test_sockmap_ktls_disconnect_after_delete’:
tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c:72:37: error: ‘TCP_ULP’ undeclared (first use in this function)
72 | err = setsockopt(cli, IPPROTO_TCP, TCP_ULP, "tls", strlen("tls"));
| ^~~~~~~

Similar to commit that fixes build of sockmap_basic.c on systems with old
/usr/include fix the build of sockmap_ktls.c

Fixes: d1ba1204f2ee ("selftests/bpf: Test unhashing kTLS socket after removing from map")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200219205514.3353788-1-ast@kernel.org

selftests/bpf: Change llvm flag -mcpu=probe to -mcpu=v3

The latest llvm supports cpu version v3, which is cpu version v1
plus some additional 64bit jmp insns and 32bit jmp insn support.

In selftests/bpf Makefile, the llvm flag -mcpu=probe did runtime
probe into the host system. Depending on compilation environments,
it is possible that runtime probe may fail, e.g., due to
memlock issue. This will cause generated code with cpu version v1.
This may cause confusion as the same compiler and the same C code
generates different byte codes in different environment.

Let us change the llvm flag -mcpu=probe to -mcpu=v3 so the
generated code will be the same regardless of the compilation
environment.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200219004236.2291125-1-yhs@fb.com

Merge branch 'bpf_read_branch_records'

Daniel Xu says:

====================
Branch records are a CPU feature that can be configured to record
certain branches that are taken during code execution. This data is
particularly interesting for profile guided optimizations. perf has had
branch record support for a while but the data collection can be a bit
coarse grained.

We (Facebook) have seen in experiments that associating metadata with
branch records can improve results (after postprocessing). We generally
use bpf_probe_read_*() to get metadata out of userspace. That's why bpf
support for branch records is useful.

Aside from this particular use case, having branch data available to bpf
progs can be useful to get stack traces out of userspace applications
that omit frame pointers.

Changes in v8:
- Use globals instead of perf buffer
- Call test_perf_branches__detach() before destroying skeleton
- Fix typo in docs

Changes in v7:
- Const-ify and static-ify local var
- Documentation formatting

Changes in v6:
- Move #ifdef a little to avoid unused variable warnings on !x86
- Test negative condition in selftest (-EINVAL on improperly configured
perf event)
- Skip positive condition selftest on setups that don't support branch
records

Changes in v5:
- Rename bpf_perf_prog_read_branches() -> bpf_read_branch_records()
- Rename BPF_F_GET_BR_SIZE -> BPF_F_GET_BRANCH_RECORDS_SIZE
- Squash tools/ bpf.h sync into selftest commit

Changes in v4:
- Add BPF_F_GET_BR_SIZE flag
- Return -ENOENT on unsupported architectures
- Only accept initialized memory in helper
- Check buffer size is multiple of sizeof(struct perf_branch_entry)
- Use bpf skeleton in selftest
- Add commit messages
- Spelling and formatting

Changes in v3:
- Document filling unused buffer with zero
- Formatting fixes
- Rebase

Changes in v2:
- Change to a bpf helper instead of context access
- Avoid mentioning Intel specific things
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add bpf_read_branch_records() selftest

Add a selftest to test:

* default bpf_read_branch_records() behavior
* BPF_F_GET_BRANCH_RECORDS_SIZE flag behavior
* error path on non branch record perf events
* using helper to write to stack
* using helper to write to global

On host with hardware counter support:

    # ./test_progs -t perf_branches
    #27/1 perf_branches_hw:OK
    #27/2 perf_branches_no_hw:OK
    #27 perf_branches:OK
    Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED

On host without hardware counter support (VM):

    # ./test_progs -t perf_branches
    #27/1 perf_branches_hw:OK
    #27/2 perf_branches_no_hw:OK
    #27 perf_branches:OK
    Summary: 1/2 PASSED, 1 SKIPPED, 0 FAILED

Also sync tools/include/uapi/linux/bpf.h.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200218030432.4600-3-dxu@dxuuu.xyz

bpf: Add bpf_read_branch_records() helper

Branch records are a CPU feature that can be configured to record
certain branches that are taken during code execution. This data is
particularly interesting for profile guided optimizations. perf has had
branch record support for a while but the data collection can be a bit
coarse grained.

We (Facebook) have seen in experiments that associating metadata with
branch records can improve results (after postprocessing). We generally
use bpf_probe_read_*() to get metadata out of userspace. That's why bpf
support for branch records is useful.

Aside from this particular use case, having branch data available to bpf
progs can be useful to get stack traces out of userspace applications
that omit frame pointers.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200218030432.4600-2-dxu@dxuuu.xyz

Merge branch 'bpf-skmsg-simplify-restore'

Jakub Sitnicki says:

====================
This series has been split out from "Extend SOCKMAP to store listening
sockets" [0]. I think it stands on its own, and makes the latter series
smaller, which will make the review easier, hopefully.

The essence is that we don't need to do a complicated dance in
sk_psock_restore_proto, if we agree that the contract with tcp_update_ulp
is to restore callbacks even when the socket doesn't use ULP. This is what
tcp_update_ulp currently does, and we just make use of it.

Series is accompanied by a test for a particularly tricky case of restoring
callbacks when we have both sockmap and tls callbacks configured in
sk->sk_prot.

[0] https://lore.kernel.org/bpf/20200127131057.150941-1-jakub@cloudflare.com/
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

selftests/bpf: Test unhashing kTLS socket after removing from map

When a TCP socket gets inserted into a sockmap, its sk_prot callbacks get
replaced with tcp_bpf callbacks built from regular tcp callbacks. If TLS
gets enabled on the same socket, sk_prot callbacks get replaced once again,
this time with kTLS callbacks built from tcp_bpf callbacks.

Now, we allow removing a socket from a sockmap that has kTLS enabled. After
removal, socket remains with kTLS configured. This is where things things
get tricky.

Since the socket has a set of sk_prot callbacks that are a mix of kTLS and
tcp_bpf callbacks, we need to restore just the tcp_bpf callbacks to the
original ones. At the moment, it comes down to the the unhash operation.

We had a regression recently because tcp_bpf callbacks were not cleared in
this particular scenario of removing a kTLS socket from a sockmap. It got
fixed in commit 4da6a196f93b ("bpf: Sockmap/tls, during free we may call
tcp_bpf_unhash() in loop").

Add a test that triggers the regression so that we don't reintroduce it in
the future.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200217121530.754315-4-jakub@cloudflare.com

bpf, sk_msg: Don't clear saved sock proto on restore

There is no need to clear psock->sk_proto when restoring socket protocol
callbacks in sk->sk_prot. The psock is about to get detached from the sock
and eventually destroyed. At worst we will restore the protocol callbacks
and the write callback twice.

This makes reasoning about psock state easier. Once psock is initialized,
we can count on psock->sk_proto always being set.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200217121530.754315-3-jakub@cloudflare.com

bpf, sk_msg: Let ULP restore sk_proto and write_space callback

We don't need a fallback for when the socket is not using ULP.
tcp_update_ulp handles this case exactly the same as we do in
sk_psock_restore_proto. Get rid of the duplicated code.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200217121530.754315-2-jakub@cloudflare.com

bpf: Allow bpf_perf_event_read_value in all BPF programs

bpf_perf_event_read_value() is NMI safe. Enable it for all BPF programs.
This can be used in fentry/fexit to profile BPF program and individual
kernel function with hardware counters.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200214234146.2910011-1-songliubraving@fb.com

net: ena: remove set but not used variable 'hash_key'

drivers/net/ethernet/amazon/ena/ena_com.c: In function ena_com_hash_key_allocate:
drivers/net/ethernet/amazon/ena/ena_com.c:1070:50:
warning: variable hash_key set but not used [-Wunused-but-set-variable]

commit 6a4f7dc82d1e ("net: ena: rss: do not allocate key when not supported")
introduced this, but not used, so remove it.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: netlink: Replace zero-length array with flexible-array member

The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
int stuff;
struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: switchdev: Replace zero-length array with flexible-array member

The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
int stuff;
struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf, sockmap: Replace zero-length array with flexible-array member

The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
int stuff;
struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

NFC: digital: Replace zero-length array with flexible-array member

The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
int stuff;
struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: usb: cdc-phonet: Replace zero-length array with flexible-array member

The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
int stuff;
struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: allow bcm84881 to be a module

Now that the phylib module loading issue has been resolved, we can
allow this PHY driver to be built as a module.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-smc-next'

Ursula Braun says:

====================
net/smc: patches 2020-02-17

here are patches for SMC making termination tasks more perfect.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: reduce port_event scheduling

IB event handlers schedule the port event worker for further
processing of port state changes. This patch reduces the number of
schedules to avoid duplicate processing of the same port change.

Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: simplify normal link termination

smc_lgr_terminate() and smc_lgr_terminate_sched() both result in soft
link termination, smc_lgr_terminate_sched() is scheduling a worker for
this task. Take out complexity by always using the termination worker
and getting rid of smc_lgr_terminate() completely.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: remove unused parameter of smc_lgr_terminate()

The soft parameter of smc_lgr_terminate() is not used and obsolete.
Remove it.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: do not delete lgr from list twice

When 2 callers call smc_lgr_terminate() at the same time
for the same lgr, one gets the lgr_lock and deletes the lgr from the
list and releases the lock. Then the second caller gets the lock and
tries to delete it again.
In smc_lgr_terminate() add a check if the link group lgr is already
deleted from the link group list and prevent to try to delete it a
second time.
And add a check if the lgr is marked as freeing, which means that a
termination is already pending.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: use termination worker under send_lock

smc_tx_rdma_write() is called under the send_lock and should not call
smc_lgr_terminate() directly. Call smc_lgr_terminate_sched() instead
which schedules a worker.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: improve smc_lgr_cleanup()

smc_lgr_cleanup() is called during termination processing, there is no
need to send a DELETE_LINK at that time. A DELETE_LINK should have been
sent before the termination is initiated, if needed.
And remove the extra call to wake_up(&lnk->wr_reg_wait) because
smc_llc_link_inactive() already calls the related helper function
smc_wr_wakeup_reg_wait().

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlxsw-Reduce-dependency-between-bridge-and-router-code'

Ido Schimmel says:

====================
mlxsw: Reduce dependency between bridge and router code

This patch set reduces the dependency between the bridge and the router
code in preparation for RTNL removal from the route insertion path in
mlxsw.

The motivation and solution are explained in detail in patch #3. The
main idea is that we need to stop special-casing the VXLAN devices with
regards to the reference counting of the FIDs. Otherwise, we can bump
into the situation described in patch #3, where the routing code calls
into the bridge code which calls back into the routing code. After
adding a mutex to protect router data structures to remove RTNL
dependency, this can result in an AA deadlock.

Patches #1 and #2 are preparations. They convert the FIDs to use
'refcount_t' for reference counting in order to catch over/under flows
and add extack to the bridge creation function.

Patches #3-#5 reduce the dependency between the bridge and the router
code. First, by having the VXLAN device take a reference on the FID in
patch #3 and then by removing unnecessary code following the change in
patch #3.

Patches #6-#10 adjust existing selftests and add new ones to exercise
the new code paths.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mlxsw: vxlan: Add test for error path

Test that when two VXLAN tunnels with conflicting configurations (i.e.,
different TTL) are enslaved to the same VLAN-aware bridge, then the
enslavement of a port to the bridge is denied.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mlxsw: vxlan: Adjust test to recent changes

After recent changes, the VXLAN tunnel will be offloaded regardless if
any local ports are member in the FID or not. Adjust the test to make
sure the tunnel is offloaded in this case.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mlxsw: extack: Test creation of multiple VLAN-aware bridges

The driver supports a single VLAN-aware bridge. Test that the
enslavement of a port to the second VLAN-aware bridge fails with an
extack.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mlxsw: extack: Test bridge creation with VXLAN

Test that creation of a bridge (both VLAN-aware and VLAN-unaware) fails
with an extack when a VXLAN device with an unsupported configuration is
already enslaved to it.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mlxsw: Remove deprecated test

The addition of a VLAN on a bridge slave prompts the driver to have the
local port in question join the FID corresponding to this VLAN.

Before recent changes, the operation of joining the FID would also mean
that the driver would enable VXLAN tunneling if a VXLAN device was also
member in the VLAN. In case the configuration of the VXLAN tunnel was
not supported, an extack error would be returned.

Since the operation of joining the FID no longer means that VXLAN
tunneling is potentially enabled, the test is no longer relevant. Remove
it.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Reduce dependency between bridge and router code

Commit f40be47a3e40 ("mlxsw: spectrum_router: Do not force specific
configuration order") added a call from the routing code to the bridge
code in order to handle the case where VNI should be set on a FID
following the joining of the router port to the FID.

This is no longer required, as previous patches made VXLAN devices
explicitly take a reference on the FID and set VNI on it.

Therefore, remove the unnecessary call and simply have the RIF take a
reference on the FID without checking if VNI should also be set on it.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_switchdev: Remove VXLAN checks during FID membership

As explained in previous patch, VXLAN devices now take a reference on
the FID and not only local ports. Therefore, there is no need for local
ports to check if they need to set a VNI on the FID when they join the
FID.

Remove these unnecessary checks.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_switchdev: Have VXLAN device take reference on FID

Up until now only local ports and the router port (which is also a local
port) took a reference on the corresponding FID (Filtering Identifier)
when joining a bridge. For example:

        192.0.2.1/24
            br0
             |
      +------+------+
      |             |
     swp1        vxlan0

In this case the reference count of the FID will be '2'. Since the VXLAN
device does not take a reference on the FID, whenever a local port joins
the bridge it needs to check if a VXLAN device is already enslaved. If
the VXLAN device should be mapped to the FID in question, then the VXLAN
device's VNI is set on the FID.

Beside the fact that this scheme special-cases the VXLAN device, it also
creates an unnecessary dependency between the routing and bridge code:

1. [R] IP address is added on 'br0', which prompts the creation of a RIF
   and a backing FID
2. [B] VNI is enabled on backing FID
3. [R] Host route corresponding to VXLAN device's source address is
   promoted to perform NVE decapsulation

[R] - Routing code
[B] - Bridge code

This back and forth dependency will become problematic when a lock is
added in the routing code instead of relying on RTNL, as it will result
in an AA deadlock.

Instead, have the VXLAN device take a reference on the FID just like all
the other netdev members of the bridge. In order to correctly handle the
case where VXLAN devices are already enslaved to the bridge when it is
offloaded, walk the bridge's slaves and replay the configuration.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_switchdev: Propagate extack to bridge creation function

Propagate extack to bridge creation function so that error messages
could be passed to user space via netlink instead of printing them to
kernel log.

A subsequent patch will pass the new extack argument to more functions.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_fid: Use 'refcount_t' for FID reference counting

'refcount_t' is very useful for catching over/under flows. Convert the
FID (Filtering Identifier) objects to use it instead of 'unsigned int'
for their reference count.

A subsequent patch in the series will change the way VXLAN devices hold
/ release the FID reference, which is why the conversion is made now.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: teach ndo_dflt_bridge_getlink() more brport flags

This enables ndo_dflt_bridge_getlink() to report a bridge port's
offload settings for multicast and broadcast flooding.

CC: Roopa Prabhu <roopa@cumulusnetworks.com>
CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'sfc-couple-more-ARFS-tidy-ups'

Edward Cree says:

====================
couple more ARFS tidy-ups

Tie up some loose ends from the recent ARFS work.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

sfc: move some ARFS code out of headers

efx_filter_rfs_expire() is a work-function, so it being inline makes no
sense. It's only ever used in efx_channels.c, so move it there.
While we're at it, clean out some related unused cruft.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sfc: only schedule asynchronous filter work if needed

Prevent excessive CPU time spent running a workitem with nothing to do.

We avoid any races by keeping the same check in efx_filter_rfs_expire().

Suggested-by: Martin Habets <mhabets@solarflare.com>
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: vlan: suppress "failed to kill vid" warnings

When a real dev unregisters, vlan_device_event() also unregisters all
of its vlan interfaces. For each VID this ends up in __vlan_vid_del(),
which attempts to remove the VID from the real dev's VLAN filter.

But the unregistering real dev might no longer be able to issue the
required IOs, and return an error. Subsequently we raise a noisy warning
msg that is not appropriate for this situation: the real dev is being
torn down anyway, there shouldn't be any worry about cleanly releasing
all of its HW-internal resources.

So to avoid scaring innocent users, suppress this warning when the
failed deletion happens on an unregistering device.
While at it also convert the raw pr_warn() to a more fitting
netdev_warn().

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: Get rid of custom STMMAC_DEVICE() macro

Since PCI core provides a generic PCI_DEVICE_DATA() macro,
replace STMMAC_DEVICE() with former one.

No functional change intended.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'Remove-rtnl-lock-dependency-from-flow_action-infra'

Vlad Buslov says:

====================
Remove rtnl lock dependency from flow_action infra

Currently, TC flow_action infrastructure code obtain rtnl lock before
accessing action state in tc_setup_flow_action() function and releases
it afterwards. This behavior is not supposed to impact TC filter
insertion rate because filling flow_action representation is only a
small part of creating new filter and expensive operations (hardware
offload callbacks, classifiers, cls API code that creates chains and
classifiers instances) already support unlocked execution. However,
typical vswitch implementation might need to also dump TC filters
concurrently, for example to age out unused flows or update flow
counters. TC dump is fully serialized and holds rtnl lock during its
whole execution in kernel space. As such, it can significantly impact
concurrent tasks that try to intermittently obtain rtnl lock when
filling intermediate representation for new filter offload (performance
evaluation at the end of this mail).

Refactor flow_action cls API infrastructure and its dependencies to not
rely on rtnl lock for synchronization. Patch set overview:

- Refactor tc_setup_flow_action() to obtain action tcf_lock when
  accessing action state. Fix its dependencies to not obtain tcf_lock
  themselves and assume that caller already holds it (needs to be done
  in same patch to prevent deadlock) and not to call sleeping functions
  (needs to be done in same patch to prevent "sleeping while atomic"
  dmesg warnings).

- Refactor action helper functions to require tcf_lock instead of rtnl.
  Internally, all of the actions already use tcf_lock for
  synchronization to accommodate unlocked classifier API, so this change
  relies on already existing functionality.

- Remove rtnl lock and "rtnl_held" argument from tc_setup_flow_action()
  function.

To test the change, multiple concurrent TC instances are invoked with
following command:

time ls add* | xargs -n 1 -P 100 sudo tc -b

Ten batch files with following typical rules (100k each) are used:

filter add dev ens1f0_0 protocol ip ingress prio 1 handle 1 flower
src_mac e4:11:0:0:0:0 dst_mac e4:12:0:0:0:0 src_ip 192.168.111.1
dst_ip 192.168.111.2 ip_proto udp dst_port 1 src_port 1 action
tunnel_key set id 1 src_ip 2.2.2.2 dst_ip 2.2.2.3 dst_port 4789
no_percpu action mirred egress redirect dev vxlan1 no_percpu

TC dump of same device is called in infinite loop from five concurrent
instances:

while true do tc -s filter show dev $NIC ingress >/dev/null done

Results obtained on current net-next commit 9f68e3655aae ("Merge tag
'drm-next-2020-01-30' of git://anongit.freedesktop.org/drm/drm"):

               | net-next | this change
---------------+----------+-------------
TC add        | 6.3s     | 6.3s
TC add + dump | 29.3s    | 6.8s

Test results confirm significant impact of concurrent TC dump. The
impact is almost fully mitigated by proposed change (differences can be
attributed to contention for chain and tp locks between add and dump TC
instances).
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: don't take rtnl lock during flow_action setup

Refactor tc_setup_flow_action() function not to use rtnl lock and remove
'rtnl_held' argument that is no longer needed.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: refactor ct action helpers to require tcf_lock

In order to remove rtnl lock dependency from flow_action representation
translator, change rtnl_dereference() to rcu_dereference_protected() in ct
action helpers that provide external access to zone and action values. This
is safe to do because the functions are not called from anywhere else
outside flow_action infrastructure which was modified to obtain tcf_lock
when accessing action data in one of previous patches in the series.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: refactor police action helpers to require tcf_lock

In order to remove rtnl lock dependency from flow_action representation
translator, change rcu_dereference_bh_rtnl() to rcu_dereference_protected()
in police action helpers that provide external access to rate and burst
values. This is safe to do because the functions are not called from
anywhere else outside flow_action infrastructure which was modified to
obtain tcf_lock when accessing action data in one of previous patches in
the series.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: lock action when translating it to flow_action infra

In order to remove dependency on rtnl lock, take action's tcfa_lock when
constructing its representation as flow_action_entry structure.

Refactor tcf_sample_get_group() to assume that caller holds tcf_lock and
don't take it manually. This callback is only called from flow_action infra
representation translator which now calls it with tcf_lock held, so this
refactoring is necessary to prevent deadlock.

Allocate memory with GFP_ATOMIC flag for ip_tunnel_info copy because
tcf_tunnel_info_copy() is only called from flow_action representation infra
code with tcf_lock spinlock taken.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mvneta-xdp-ethtool-stats'

Lorenzo Bianconi says:

====================
add xdp ethtool stats to mvneta driver

Rework mvneta stats accounting in order to introduce xdp ethtool
statistics in the mvneta driver.
Introduce xdp_redirect, xdp_pass, xdp_drop and xdp_tx counters to
ethtool statistics.
Fix skb_alloc_error and refill_error ethtool accounting
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: mvneta: get rid of xdp_ret in mvneta_swbm_rx_frame

Get rid of xdp_ret in mvneta_swbm_rx_frame routine since now
we can rely on xdp_stats to flush in case of xdp_redirect

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mvneta: introduce xdp counters to ethtool

Add xdp_redirect, xdp_pass, xdp_drop and xdp_tx counters
to ethtool statistics

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mvneta: rely on struct mvneta_stats in mvneta_update_stats routine

Introduce mvneta_stats structure in mvneta_update_stats routine signature
in order to collect all the rx stats and update them at the end at the
napi loop. mvneta_stats will be reused adding xdp statistics support to
ethtool.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mvneta: rely on open-coding updating stats for non-xdp and tx path

In oreder to avoid unnecessary instructions rely on open-coding updating
per-cpu stats in mvneta_tx/mvneta_xdp_submit_frame and mvneta_rx_hwbm
routines. This patch will be used to add xdp support to ethtool for the
mvneta driver

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mvneta: move refill_err and skb_alloc_err in per-cpu stats

mvneta_ethtool_update_stats routine is currently reporting
skb_alloc_error and refill_error only for the first rx queue.
Fix the issue moving skb_alloc_err and refill_err in
mvneta_pcpu_stats structure.
Moreover this patch will be used to introduce xdp statistics
to ethtool for the mvneta driver

Fixes: 17a96da62716 ("net: mvneta: discriminate error cause for missed packet")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mv88e6xxx-Add-SERDES-PCS-registers-to-ethtool-dump'

Andrew Lunn says:

====================
mv88e6xxx: Add SERDES/PCS registers to ethtool -d

ethtool -d will dump the registers of an interface. For mv88e6xxx
switch ports, this dump covers the port specific registers. Extend
this with the SERDES/PCS registers, if a port has a SERDES.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Add 6390 family PCS registers to ethtool -d

The mv88e6390 has upto 8 sets of PCS registers, depending on how ports
9 and 10 are configured. The can be spread over 8 ports. If a port has
a PCS register set, return it along with the port registers. The
register space is sparse, so hard code a list of registers which will
be returned. It can later be extended, if needed, by append to the end
of the list.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Add 6352 family PCS registers to ethtool -d

The mv88e6352 has one PCS which can be used for 1000BaseX or
SGMII. Add the registers to the dump for the port which the PCS is
associated to.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Allow PCS registers to be retrieved via ethtool

ethtool provides a generic mechanism for a driver to return the
registers of an ethernet device. DSA uses this to give the port
registers associated with an interfaces. Extend this to allow PCS
registers to also be returned, if the port has a PCS associated to it.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch '100GbE' of git://git./linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
100GbE Intel Wired LAN Driver Updates 2020-02-15

This series contains updates to ice driver only.

Brett adds support for "Queue in Queue" (QinQ) support, by supporting
S-tag & C-tag VLAN traffic by disabling pruning when there are no 0x8100
VLAN interfaces currently on top of the PF.  Also refactored the port
VLAN configuration to re-use the common code for enabling and disabling
a port VLAN in single function.  Added a helper function to determine if
the VF link is up.  Fixed how the port VLAN configures the priority bits
for a VF interface.  Fixed the port VLAN to only see its own broadcast
and multicast traffic.  Added support to enable and disable all receive
queues, by refactoring adding a new function to do the necessary steps
to enable/disable a queue with the necessary read flush.  Fixed how we
set the mapping mode for transmit and receive queues.  Added support for
VF queues to handle LAN overflow events.  Fixed and refactored how
receive queues get disabled for VFs, which was being handled one queue
at at time, so improve it to handle when the VF is requesting more than
one queue to be disabled.  Fixed how the virtchnl_queue_select bitmap is
validated.

Finally a patch not authored by Brett, Bruce cleans up "fallthrough"
comments which are unnecessary.  Also replaces the "fallthough" comments
with the GCC reserved word fallthrough, along with other GCC compiler
fixes.  Add missing function header comment regarding a function
argument that was missing.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'sonic-next'

Finn Thain says:

====================
Improvements for SONIC ethernet drivers

Now that the necessary sonic driver fixes have been merged, and the merge
window has closed again, I'm sending the remainder of my sonic driver
patch queue.

A couple of these patches will have to be applied in sequence to avoid
'git am' rejects. The others are independent and could have been submitted
individually. Please let me know if I should do that.

The complete sonic driver patch queue was tested on National Semiconductor
hardware (macsonic), qemu-system-m68k (macsonic) and qemu-system-mips64el
(jazzsonic).
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net/macsonic: Remove interrupt handler wrapper

On m68k, local irqs remain enabled while interrupt handlers execute.
Therefore the macsonic driver has had to disable interrupts to avoid
re-entering sonic_interrupt().

As of commit 865ad2f2201d ("net/sonic: Add mutual exclusion for accessing
shared state"), sonic_interrupt() became re-entrant, and its wrapper
became redundant.

Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sonic: Start packet transmission immediately

Give the transmit command as soon as the transmit descriptor is ready.

Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sonic: Remove explicit memory barriers

The explicit memory barriers are redundant now that proper locking and
MMIO accessors have been employed.

Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sonic: Remove redundant netif_start_queue() call

The transmit queue must be running already otherwise sonic_send_packet()
would not have been called. If the queue was stopped by the interrupt
handler, the interrupt handler will restart it again.

Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sonic: Remove redundant next_tx variable

The eol_tx variable is the one that matters to the tx algorithm because
packets are always placed at the end of the list. The next_tx variable
just confuses things so remove it.

Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sonic: Refactor duplicated code

No functional change.

Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sonic: Remove obsolete comment

The comment is meaningless since mark_bh() was removed a long time ago.

Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'sh_eth-get-rid-of-the-dedicated-regiseter-mapping-for-RZ-A1-R7S72100'

Sergei Shtylyov says:

====================
sh_eth: get rid of the dedicated regiseter mapping for RZ/A1 (R7S72100)

Here's a set of 5 patches against DaveM's 'net-next.git' repo.

I changed my mind about the RZ/A1 SoC needing its own register
map -- now that we don't depend on the register map array in order
to determine whether a given register exists any more, we can add
a new flag to determine if the GECMR exists (this register is
present only on true GEther chips, not RZ/A1). We also need to
add the sh_eth_cpu_data::* flag checks where they were missing
so far: in the ethtool API for the register dump.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

sh_eth: use Gigabit register map for R7S72100

The register maps for the Gigabit controllers and the Ether one used on
RZ/A1  (AKA R7S72100) are identical except for GECMR which is only present
on the true GEther controllers.  We no longer use the register map arrays
to determine if a given register exists,  and have added the GECMR flag to
the 'struct sh_eth_cpu_data' in the previous patch, so we're ready to drop
the R7S72100 specific register map -- this saves 216 bytes of object code
(ARM gcc 4.8.5).

Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Tested-by: Chris Brandt <chris.brandt@renesas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sh_eth: add sh_eth_cpu_data::gecmr flag

Not all Ether controllers having the Gigabit register layout have GECMR --
RZ/A1 (AKA R7S72100) actually has the same layout but no Gigabit speed
support and hence no GECMR. In the past, the new register map table was
added for this SoC, now I think we should have used the existing Gigabit
table with the differences (such as GECMR) covered by the mere flags in
the 'struct sh_eth_cpu_data'. Add such flag for GECMR -- and then we can
get rid of the R7S72100 specific layout in the next patch...

Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Tested-by: Chris Brandt <chris.brandt@renesas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sh_eth: check sh_eth_cpu_data::no_xdfar when dumping registers

When adding the sh_eth_cpu_data::no_xdfar flag I forgot to add the flag
check to __sh_eth_get_regs(), causing the non-existing RDFAR/TDFAR to be
considered for dumping on the R-Car gen1/2 SoCs (the register offset check
has the final say here)...

Fixes: 4c1d45850d5 ("sh_eth: add sh_eth_cpu_data::cexcr flag")
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Tested-by: Chris Brandt <chris.brandt@renesas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sh_eth: check sh_eth_cpu_data::cexcr when dumping registers

When adding the sh_eth_cpu_data::cexcr flag I forgot to add the flag
check to __sh_eth_get_regs(), causing the non-existing RX packet counter
registers to be considered for dumping on the R7S72100 SoC (the register
offset sanity check has the final say here)...

Fixes: 4c1d45850d5 ("sh_eth: add sh_eth_cpu_data::cexcr flag")
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Tested-by: Chris Brandt <chris.brandt@renesas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sh_eth: check sh_eth_cpu_data::no_tx_cntrs when dumping registers

When adding the sh_eth_cpu_data::no_tx_cntrs flag I forgot to add the
flag check to __sh_eth_get_regs(), causing the non-existing TX counter
registers to be considered for dumping on the R7S72100 SoC (the register
offset sanity check has the final say here)...

Fixes: ce9134dff6d9 ("sh_eth: add sh_eth_cpu_data::no_tx_cntrs flag")
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Tested-by: Chris Brandt <chris.brandt@renesas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'Pause-updates-for-phylib-and-phylink'

Russell King says:

====================
Pause updates for phylib and phylink

Currently, phylib resolves the speed and duplex settings, which MAC
drivers use directly. phylib also extracts the "Pause" and "AsymPause"
bits from the link partner's advertisement, and stores them in struct
phy_device's pause and asym_pause members with no further processing.
It is left up to each MAC driver to implement decoding for this
information.

phylink converted drivers are able to take advantage of code therein
which resolves the pause advertisements for the MAC driver, but this
does nothing for unconverted drivers. It also does not allow us to
make use of hardware-resolved pause states offered by several PHYs.

This series aims to address this by:

1. Providing a generic implementation, linkmode_resolve_pause(), that
   takes the ethtool linkmode arrays for the link partner and local
   advertisements, decoding the result to whether pause frames should
   be allowed to be transmitted or received and acted upon.  I call
   this the pause enablement state.

2. Providing a phylib implementation, phy_get_pause(), which allows
   MAC drivers to request the pause enablement state from phylib.

3. Providing a generic linkmode_set_pause() for setting the pause
   advertisement according to the ethtool tx/rx flags - note that this
   design has some shortcomings, see the comments in the kerneldoc for
   this function.

4. Remove the ability in phylink to set the pause states for fixed
   links, which brings them into line with how we deal with the speed
   and duplex parameters; we can reintroduce this later if anyone
   requires it.  This could be a userspace-visible change.

5. Split application of manual pause enablement state from phylink's
   resolution of the same to allow use of phylib's new phy_get_pause()
   interface by phylink, and make that switch.

6. Resolve the fixed-link pause enablement state using the generic
   linkmode_resolve_pause() helper introduced earlier. This, in
   connection with the previous commits, allows us to kill the
   MLO_PAUSE_SYM and MLO_PAUSE_ASYM flags.

7. make phylink's ethtool pause setting implementation update the
   pause advertisement in the same way that phylib does, with the
   same caveats that are present there (as mentioned above w.r.t
   linkmode_set_pause()).

8. create a more accurate initial configuration for MACs, used when
   phy_start() is called or a SFP is detected. In particular, this
   ensures that the pause bits seen by MAC drivers in state->pause
   are accurate for SGMII.

9. finally, update the kerneldoc descriptions for mac_config() for
   the above changes.

This series has been build-tested against net-next; the boot tested
patches are in my "phy" branch against v5.5 plus the queued phylink
changes that were merged for 5.6.

The next series will introduce the ability for phylib drivers to
provide hardware resolved pause enablement state.  These patches can
be found in my "phy" branch.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: phylink: clarify flow control settings in documentation

Clarify the expected flow control settings operation in the phylink
documentation for each negotiation mode.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phylink: improve initial mac configuration

Improve the initial MAC configuration so we get a configuration which
more represents the final operating mode, in particular with respect
to the flow control settings.

We do this by:
1) more fully initialising our phy state, so we can use this as the
   initial state for PHY based connections.
2) reading the fixed link state.
3) ensuring that in-band mode has sane pause settings for SGMII vs
   802.3z negotiation modes.

In all three cases, we ensure that state->link is false, just in case
any MAC drivers have other ideas by mis-using this member, and we also
take account of manual pause mode configuration at this point.

This avoids MLO_PAUSE_AN being seen in mac_config() when operating in
PHY, fixed mode or inband SGMII mode, thereby giving cleaner semantics
to the pause flags.  As a result of this, the pause flags now indicate
in a mode-independent way what is required from a mac_config()
implementation.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phylink: allow ethtool -A to change flow control advertisement

When ethtool -A is used to change the pause modes, the pause
advertisement is not being changed, but the documentation in
uapi/linux/ethtool.h says we should be. Add that capability to
phylink.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phylink: resolve fixed link flow control

Resolve the fixed link flow control using the recently introduced
linkmode_resolve_pause() helper, which we use in
phylink_get_fixed_state() only when operating in full duplex mode.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phylink: use phylib resolved flow control modes

Use the new phy_get_pause() helper to get the resolved pause modes for
a PHY rather than resolving the pause modes ourselves. We temporarily
retain our pause mode resolution for causes where there is no PHY
attached, e.g. for fixed-link modes.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phylink: ensure manual flow control is selected appropriately

Split the application of manually controlled flow control modes from
phylink_resolve_flow(), so that we can use alternative providers of
flow control resolution.

We also want to clear the MLO_PAUSE_AN flag when autoneg is disabled,
since flow control can't be negotiated in this circumstance.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phylink: remove pause mode ethtool setting for fixed links

Remove the ability for ethtool -A to change the pause settings for
fixed links; if this is really required, we can reinstate it later.

Andrew Lunn agrees: "So I think it is safe to not implement ethtool
-A, at least until somebody has a real use case for it."

Lets avoid making things too complex for use cases that aren't being
used.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: add linkmode helper for setting flow control advertisement

Add a linkmode helper to set the flow control advertisement in an
ethtool linkmode mask according to the tx/rx capabilities. This
implementation is moved from phylib, and documented with an
analysis of its shortcomings.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: add helpers to resolve negotiated flow control

Add a couple of helpers to resolve negotiated flow control. Two helpers
are provided:

- linkmode_resolve_pause() which takes the link partner and local
  advertisements, and decodes whether we should enable TX or RX pause
  at the MAC. This is useful outside of phylib, e.g. in phylink.
- phy_get_pause(), which returns the TX/RX enablement status for the
  current negotiation results of the PHY.

This allows us to centralise the flow control resolution, rather than
spreading it around.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: linkmode: make linkmode_test_bit() take const pointer

linkmode_test_bit() does not modify the address; test_bit() is also
declared const volatile for the same reason. There's no need for
linkmode_test_bit() to be any different, and allows implementation of
helpers that take a const linkmode pointer.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'r8169-series-with-further-smaller-improvements'

Heiner Kallweit says:

====================
r8169: series with further smaller improvements

Nothing too exciting. This series includes further smaller
improvements.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: improve statistics of missed rx packets

Register RxMissed exists on few early chip versions only, however all
chip versions have the number of missed RX packets in the hardware
counters. Therefore remove using RxMissed and get the number of missed
RX packets from the hardware stats.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: improve rtl_jumbo_config

Merge enabling and disabling jumbo packets to one function to make
the code a little simpler.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: improve rtl8169_get_mac_version

Currently code snippet (RTL_R32(tp, TxConfig) >> 20) & 0xfcf is used
in few places to extract the chip XID. Change the code to do the XID
extraction only once.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: add helper rtl_pci_commit

In few places we do a PCI commit by reading an arbitrary chip register.
It's not always obvious that the read is meant to be a PCI commit,
therefore add a helper for it.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: simplify setting netdev features

Setting dev->features a few lines later allows to simplify the code.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: remove setting PCI_CACHE_LINE_SIZE in rtl_hw_start_8169

This is done for all RTL8169 chip versions in rtl8169_init_phy already.
Therefore we can remove it here.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: remove unneeded check from rtl_link_chg_patch

rtl_link_chg_patch() can be called from rtl_open() to rtl8169_close()
only. And in rtl8169_close() phy_stop() ensures that this function
isn't called afterwards. So we don't need this check.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

openvswitch: add TTL decrement action

New action to decrement TTL instead of setting it to a fixed value.
This action will decrement the TTL and, in case of expired TTL, drop it
or execute an action passed via a nested attribute.
The default TTL expired action is to drop the packet.

Supports both IPv4 and IPv6 via the ttl and hop_limit fields, respectively.

Tested with a corresponding change in the userspace:

    # ovs-dpctl dump-flows
    in_port(2),eth(),eth_type(0x0800), packets:0, bytes:0, used:never, actions:dec_ttl{ttl<=1 action:(drop)},1
    in_port(1),eth(),eth_type(0x0800), packets:0, bytes:0, used:never, actions:dec_ttl{ttl<=1 action:(drop)},2
    in_port(1),eth(),eth_type(0x0806), packets:0, bytes:0, used:never, actions:2
    in_port(2),eth(),eth_type(0x0806), packets:0, bytes:0, used:never, actions:1

    # ping -c1 192.168.0.2 -t 42
    IP (tos 0x0, ttl 41, id 61647, offset 0, flags [DF], proto ICMP (1), length 84)
        192.168.0.1 > 192.168.0.2: ICMP echo request, id 386, seq 1, length 64
    # ping -c1 192.168.0.2 -t 120
    IP (tos 0x0, ttl 119, id 62070, offset 0, flags [DF], proto ICMP (1), length 84)
        192.168.0.1 > 192.168.0.2: ICMP echo request, id 388, seq 1, length 64
    # ping -c1 192.168.0.2 -t 1
    #

Co-developed-by: Bindiya Kurle <bindiyakurle@gmail.com>
Signed-off-by: Bindiya Kurle <bindiyakurle@gmail.com>
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: bcm_sf2: Also configure Port 5 for 2Gb/sec on 7278

Either port 5 or port 8 can be used on a 7278 device, make sure that
port 5 also gets configured properly for 2Gb/sec in that case.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp-zerocopy: Return sk_err (if set) along with tcp receive zerocopy.

This patchset is intended to reduce the number of extra system calls
imposed by TCP receive zerocopy. For ping-pong RPC style workloads,
this patchset has demonstrated a system call reduction of about 30%
when coupled with userspace changes.

For applications using epoll, returning sk_err along with the result
of tcp receive zerocopy could remove the need to call
recvmsg()=-EAGAIN after a spurious wakeup.

Consider a multi-threaded application using epoll. A thread may awaken
with EPOLLIN but another thread may already be reading. The
spuriously-awoken thread does not necessarily know that another thread
'won'; rather, it may be possible that it was woken up due to the
presence of an error if there is no data. A zerocopy read receiving 0
bytes thus would need to be followed up by recvmsg to be sure.

Instead, we return sk_err directly with zerocopy, so the application
can avoid this extra system call.

Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp-zerocopy: Return inq along with tcp receive zerocopy.

This patchset is intended to reduce the number of extra system calls
imposed by TCP receive zerocopy. For ping-pong RPC style workloads,
this patchset has demonstrated a system call reduction of about 30%
when coupled with userspace changes.

For applications using edge-triggered epoll, returning inq along with
the result of tcp receive zerocopy could remove the need to call
recvmsg()=-EAGAIN after a successful zerocopy. Generally speaking,
since normally we would need to perform a recvmsg() call for every
successful small RPC read via TCP receive zerocopy, returning inq can
reduce the number of system calls performed by approximately half.

Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'Enhance-virtio-vsock-connection-semantics'

Sebastien Boeuf says:

====================
Enhance virtio-vsock connection semantics

This series improves the semantics behind the way virtio-vsock server
accepts connections coming from the client. Whenever the server
receives a connection request from the client, if it is bound to the
socket but not yet listening, it will answer with a RST packet. The
point is to ensure each request from the client is quickly processed
so that the client can decide about the strategy of retrying or not.

The series includes along with the improvement patch a new test to
ensure the behavior is consistent across all hypervisors drivers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

tools: testing: vsock: Test when server is bound but not listening

Whenever the server side of vsock is binding to the socket, but not
listening yet, we expect the behavior from the client to be identical to
what happens when the server is not even started.

This new test runs the server side so that it binds to the socket
without ever listening to it. The client side will try to connect and
should receive an ECONNRESET error.

This new test provides a way to validate the previously introduced patch
for making sure the server side will always answer with a RST packet in
case the client requested a new connection.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: virtio_vsock: Enhance connection semantics

Whenever the vsock backend on the host sends a packet through the RX
queue, it expects an answer on the TX queue. Unfortunately, there is one
case where the host side will hang waiting for the answer and might
effectively never recover if no timeout mechanism was implemented.

This issue happens when the guest side starts binding to the socket,
which insert a new bound socket into the list of already bound sockets.
At this time, we expect the guest to also start listening, which will
trigger the sk_state to move from TCP_CLOSE to TCP_LISTEN. The problem
occurs if the host side queued a RX packet and triggered an interrupt
right between the end of the binding process and the beginning of the
listening process. In this specific case, the function processing the
packet virtio_transport_recv_pkt() will find a bound socket, which means
it will hit the switch statement checking for the sk_state, but the
state won't be changed into TCP_LISTEN yet, which leads the code to pick
the default statement. This default statement will only free the buffer,
while it should also respond to the host side, by sending a packet on
its TX queue.

In order to simply fix this unfortunate chain of events, it is important
that in case the default statement is entered, and because at this stage
we know the host side is waiting for an answer, we must send back a
packet containing the operation VIRTIO_VSOCK_OP_RST.

One could say that a proper timeout mechanism on the host side will be
enough to avoid the backend to hang. But the point of this patch is to
ensure the normal use case will be provided with proper responsiveness
when it comes to establishing the connection.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'mac80211-next-for-net-next-2020-02-14' of git://git./linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
A few big new things:
* 802.11 frame encapsulation offload support
* more HE (802.11ax) support, including some for 6 GHz band
* powersave in hwsim, for better testing

Of course as usual there are various cleanups and small fixes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>