linux-2.6-microblaze.git
Lukas Wunner [Wed, 11 Mar 2020 11:59:02 +0000 (12:59 +0100)]
netfilter: Generalize ingress hook

Prepare for addition of a netfilter egress hook by generalizing the
ingress hook introduced by commit e687ad60af09 ("netfilter: add
netfilter ingress hook after handle_ing() under unique static key").

In particular, rename and refactor the ingress hook's static inlines
such that they can be reused for an egress hook.

No functional change intended.

Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Lukas Wunner [Wed, 11 Mar 2020 11:59:01 +0000 (12:59 +0100)]
netfilter: Rename ingress hook include file

Prepare for addition of a netfilter egress hook by renaming
<linux/netfilter_ingress.h> to <linux/netfilter_netdev.h>.

The egress hook also necessitates a refactoring of the include file,
but that is done in a separate commit to ease reviewing.

No functional change intended.

Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Wed, 11 Mar 2020 19:52:01 +0000 (20:52 +0100)]
netfilter: conntrack: re-visit sysctls in unprivileged namespaces

Since commit b884fa46177659 ("netfilter: conntrack: unify sysctl handling")
conntrack no longer exposes most of its sysctls (e.g. tcp timeouts
settings) to network namespaces that are not owned by the initial user
namespace.

This patch exposes all sysctls even if the namespace is unprivileged.

Compared to a 4.19 kernel, the newly visible and writeable sysctls are:
  net.netfilter.nf_conntrack_acct
  net.netfilter.nf_conntrack_timestamp
  .. to allow enabling the accounting and timestamp extensions.

  net.netfilter.nf_conntrack_events
  .. to turn off conntrack event notifications.

  net.netfilter.nf_conntrack_checksum
  .. to disable checksum validation.

  net.netfilter.nf_conntrack_log_invalid
  .. to enable logging of packets deemed invalid by conntrack.

Newly visible sysctls that are only exported as read-only:

  net.netfilter.nf_conntrack_count
  .. current number of conntrack entries living in this netns.

  net.netfilter.nf_conntrack_max
  .. global upper limit (maximum size of the table).

  net.netfilter.nf_conntrack_buckets
  .. size of the conntrack table (hash buckets).

  net.netfilter.nf_conntrack_expect_max
  .. maximum number of permitted expectations in this netns.

  net.netfilter.nf_conntrack_helper
  .. conntrack helper auto assignment.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Wed, 11 Mar 2020 14:30:16 +0000 (15:30 +0100)]
netfilter: nft_lookup: update element stateful expression

If the set element comes with a stateful expression, update it.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Wed, 11 Mar 2020 14:30:15 +0000 (15:30 +0100)]
netfilter: nf_tables: add nft_set_elem_update_expr() helper function

This helper function runs the eval path of the stateful expression
of an existing set element.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Wed, 11 Mar 2020 14:30:14 +0000 (15:30 +0100)]
netfilter: nf_tables: add elements with stateful expressions

Update nft_add_set_elem() to handle the NFTA_SET_ELEM_EXPR netlink
attribute. This patch allows users to add elements with stateful
expressions.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Wed, 11 Mar 2020 14:30:13 +0000 (15:30 +0100)]
netfilter: nf_tables: statify nft_expr_init()

Not exposed anymore to modules, statify this function.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Wed, 11 Mar 2020 14:30:12 +0000 (15:30 +0100)]
netfilter: nf_tables: add nft_set_elem_expr_alloc()

Add helper function to create stateful expression.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Stefano Brivio [Sat, 7 Mar 2020 16:52:37 +0000 (17:52 +0100)]
nft_set_pipapo: Prepare for single ranged field usage

A few adjustments in nft_pipapo_init() are needed to allow usage of
this set back-end for a single, ranged field.

Provide a convenient NFT_PIPAPO_MIN_FIELDS definition that currently
makes sure that the rbtree back-end is selected instead, for sets
with a single field.

This finally allows a fair comparison with rbtree sets, by defining
NFT_PIPAPO_MIN_FIELDS as 0 and skipping rbtree back-end initialisation:

 ---------------.--------------------------.-------------------------.
 AMD Epyc 7402  |      baselines, Mpps     |   Mpps, % over rbtree   |
  1 thread      |__________________________|_________________________|
  3.35GHz       |        |        |        |            |            |
  768KiB L1D$   | netdev |  hash  | rbtree |            |   pipapo   |
 ---------------|  hook  |   no   | single |   pipapo   |single field|
 type   entries |  drop  | ranges | field  |single field|    AVX2    |
 ---------------|--------|--------|--------|------------|------------|
 net,port       |        |        |        |            |            |
          1000  |   19.0 |   10.4 |    3.8 | 6.0   +58% | 9.6  +153% |
 ---------------|--------|--------|--------|------------|------------|
 port,net       |        |        |        |            |            |
           100  |   18.8 |   10.3 |    5.8 | 9.1   +57% |11.6  +100% |
 ---------------|--------|--------|--------|------------|------------|
 net6,port      |        |        |        |            |            |
          1000  |   16.4 |    7.6 |    1.8 | 2.8   +55% | 6.5  +261% |
 ---------------|--------|--------|--------|------------|------------|
 port,proto     |        |        |        |     [1]    |    [1]     |
         30000  |   19.6 |   11.6 |    3.9 | 0.9   -77% | 2.7   -31% |
 ---------------|--------|--------|--------|------------|------------|
 port,proto     |        |        |        |            |            |
         10000  |   19.6 |   11.6 |    4.4 | 2.1   -52% | 5.6   +27% |
 ---------------|--------|--------|--------|------------|------------|
 port,proto     |        |        |        |            |            |
 4 threads 10000|   77.9 |   45.1 |   17.4 | 8.3   -52% |22.4   +29% |
 ---------------|--------|--------|--------|------------|------------|
 net6,port,mac  |        |        |        |            |            |
            10  |   16.5 |    5.4 |    4.3 | 4.5    +5% | 8.2   +91% |
 ---------------|--------|--------|--------|------------|------------|
 net6,port,mac, |        |        |        |            |            |
 proto    1000  |   16.5 |    5.7 |    1.9 | 2.8   +47% | 6.6  +247% |
 ---------------|--------|--------|--------|------------|------------|
 net,mac        |        |        |        |            |            |
          1000  |   19.0 |    8.4 |    3.9 | 6.0   +54% | 9.9  +154% |
 ---------------'--------'--------'--------'------------'------------'
 [1] Causes switch of lookup table buckets for 'port' to 4-bit groups

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Stefano Brivio [Sat, 7 Mar 2020 16:52:36 +0000 (17:52 +0100)]
nft_set_pipapo: Introduce AVX2-based lookup implementation

If the AVX2 set is available, we can exploit the repetitive
characteristic of this algorithm to provide a fast, vectorised
version by using 256-bit wide AVX2 operations for bucket loads and
bitwise intersections.
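
For orientation, the repetitive step being vectorised can be sketched in
portable scalar C. This is a simplified illustration, not the code from
this patch: it assumes a single field with 8-bit groups and at most 64
rules, it omits the mapping between fields, and the names sketch_lookup
and SKETCH_BUCKETS are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BUCKETS 256	/* one bucket per value of an 8-bit group */

/*
 * lt is a flat table: lt[g * SKETCH_BUCKETS + v] is a bitmap of the
 * rules that accept value v in group g (hypothetical layout).
 * Returns the bitmap of rules matching all n_groups bytes of key.
 */
static uint64_t sketch_lookup(const uint64_t *lt, const uint8_t *key,
			      size_t n_groups)
{
	uint64_t map = ~UINT64_C(0);	/* every rule starts as a candidate */
	size_t g;

	for (g = 0; g < n_groups; g++)
		map &= lt[g * SKETCH_BUCKETS + key[g]];	/* load + intersect */

	return map;
}
```

A vectorised version would perform the same load and AND on 256 bits of
bitmap at a time instead of 64.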

In most cases, this implementation consistently outperforms rbtree
set instances despite the fact they are configured to use a given,
single, ranged data type out of the ones used for performance
measurements by the nft_concat_range.sh kselftest.

That script, injecting packets directly on the ingoing device path
with pktgen, reports, averaged over five runs on a single AMD Epyc
7402 thread (3.35GHz, 768 KiB L1D$, 12 MiB L2$), the figures below.
CONFIG_RETPOLINE was not set here.

Note that this is not a fair comparison over hash and rbtree set
types: non-ranged entries (used to have a reference for hash types)
would be matched faster than this, and matching on a single field
only (which is the case for rbtree) is also significantly faster.

However, it's not possible at the moment to choose this set type
for non-ranged entries, and the current implementation also needs
a few minor adjustments in order to match on less than two fields.

 ---------------.-----------------------------------.------------.
 AMD Epyc 7402  |          baselines, Mpps          | this patch |
  1 thread      |___________________________________|____________|
  3.35GHz       |        |        |        |        |            |
  768KiB L1D$   | netdev |  hash  | rbtree |        |            |
 ---------------|  hook  |   no   | single |        |   pipapo   |
 type   entries |  drop  | ranges | field  | pipapo |    AVX2    |
 ---------------|--------|--------|--------|--------|------------|
 net,port       |        |        |        |        |            |
          1000  |   19.0 |   10.4 |    3.8 |    4.0 | 7.5   +87% |
 ---------------|--------|--------|--------|--------|------------|
 port,net       |        |        |        |        |            |
           100  |   18.8 |   10.3 |    5.8 |    6.3 | 8.1   +29% |
 ---------------|--------|--------|--------|--------|------------|
 net6,port      |        |        |        |        |            |
          1000  |   16.4 |    7.6 |    1.8 |    2.1 | 4.8  +128% |
 ---------------|--------|--------|--------|--------|------------|
 port,proto     |        |        |        |        |            |
         30000  |   19.6 |   11.6 |    3.9 |    0.5 | 2.6  +420% |
 ---------------|--------|--------|--------|--------|------------|
 net6,port,mac  |        |        |        |        |            |
            10  |   16.5 |    5.4 |    4.3 |    3.4 | 4.7   +38% |
 ---------------|--------|--------|--------|--------|------------|
 net6,port,mac, |        |        |        |        |            |
 proto    1000  |   16.5 |    5.7 |    1.9 |    1.4 | 3.6   +26% |
 ---------------|--------|--------|--------|--------|------------|
 net,mac        |        |        |        |        |            |
          1000  |   19.0 |    8.4 |    3.9 |    2.5 | 6.4  +156% |
 ---------------'--------'--------'--------'--------'------------'

A similar strategy could be easily reused to implement specialised
versions for other SIMD sets, and I plan to post at least a NEON
version at a later time.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Stefano Brivio [Sat, 7 Mar 2020 16:52:35 +0000 (17:52 +0100)]
nft_set_pipapo: Prepare for vectorised implementation: helpers

Move most macros and helpers to a header file, so that they can be
conveniently used by related implementations.

No functional changes are intended here.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Stefano Brivio [Sat, 7 Mar 2020 16:52:34 +0000 (17:52 +0100)]
nft_set_pipapo: Prepare for vectorised implementation: alignment

SIMD vector extension sets require stricter alignment than native
instruction sets to operate efficiently (AVX, NEON) or for some
instructions to work at all (AltiVec).

Provide facilities to define arbitrary alignment for lookup tables
and scratch maps. By defining byte alignment with NFT_PIPAPO_ALIGN,
lt_aligned and scratch_aligned pointers become available.

Additional headroom is allocated, and pointers to the possibly
unaligned, originally allocated areas are kept so that they can
be freed.
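
The headroom approach can be sketched in userspace C as follows. This
is an illustration under assumptions, not the kernel code: SKETCH_ALIGN,
struct aligned_buf and both helpers are hypothetical names, and malloc()
stands in for the kernel allocators.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical alignment in bytes; 32 matches a 256-bit vector. */
#define SKETCH_ALIGN 32

struct aligned_buf {
	void *raw;	/* pointer returned by malloc(), kept for free() */
	void *aligned;	/* first SKETCH_ALIGN-aligned address inside raw */
};

/* Allocate size bytes plus headroom and derive an aligned pointer. */
static int aligned_buf_alloc(struct aligned_buf *b, size_t size)
{
	uintptr_t addr;

	b->raw = malloc(size + SKETCH_ALIGN - 1);
	if (!b->raw)
		return -1;

	/* Round up to the next multiple of SKETCH_ALIGN. */
	addr = ((uintptr_t)b->raw + SKETCH_ALIGN - 1) &
	       ~(uintptr_t)(SKETCH_ALIGN - 1);
	b->aligned = (void *)addr;
	return 0;
}

/* Free through the original pointer, never through the aligned one. */
static void aligned_buf_free(struct aligned_buf *b)
{
	free(b->raw);
	b->raw = NULL;
	b->aligned = NULL;
}
```

Keeping both pointers costs one extra word per area but avoids having
to recover the original address from the aligned one.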

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Stefano Brivio [Sat, 7 Mar 2020 16:52:33 +0000 (17:52 +0100)]
nft_set_pipapo: Add support for 8-bit lookup groups and dynamic switch

While grouping matching bits in groups of four saves memory compared
to the more natural choice of 8-bit words (lookup table size is one
eighth), it comes at a performance cost, as the number of lookup
comparisons is doubled, and those also need bitshifts and masking.

Introduce support for 8-bit lookup groups, together with a mapping
mechanism to dynamically switch, based on defined per-table size
thresholds and hysteresis, between 8-bit and 4-bit groups, as tables
grow and shrink. Empty sets start with 8-bit groups, and per-field
tables are converted to 4-bit groups if they get too big.

An alternative approach would have been to swap per-set lookup
operation functions as needed, but this doesn't allow for different
group sizes in the same set, which looks desirable if some fields
need significantly more matching data compared to others due to
heavier impact of ranges (e.g. a big number of subnets with
relatively simple port specifications).

Allowing different group sizes for the same lookup functions implies
the need for further conditional clauses, whose cost, however,
appears to be negligible in tests.
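
The threshold-with-hysteresis decision can be sketched as below. This
is a simplified model, not the patch's implementation: the threshold
values and all names are hypothetical, and for simplicity the decision
is always expressed in terms of the size the table has, or would have,
with 8-bit groups.

```c
#include <stddef.h>

/* Hypothetical thresholds in bytes; the patch defines its own values. */
#define SKETCH_SIZE_HIGH ((size_t)2 * 1024 * 1024)
#define SKETCH_SIZE_LOW  ((size_t)1 * 1024 * 1024)

/*
 * bb is the current group size in bits (4 or 8); size_8bit is the size
 * the field's lookup table has, or would have, with 8-bit groups.
 * Returns the group size to use from now on.
 */
static unsigned int sketch_adjust_group_bits(unsigned int bb,
					     size_t size_8bit)
{
	if (bb == 8 && size_8bit > SKETCH_SIZE_HIGH)
		return 4;	/* table got too big: trade speed for memory */
	if (bb == 4 && size_8bit < SKETCH_SIZE_LOW)
		return 8;	/* table shrank enough: switch back */
	return bb;		/* otherwise keep the current group size */
}
```

The gap between the two thresholds keeps a table whose size hovers
around a single limit from being converted back and forth.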

The matching rate figures below were obtained for x86_64 running
the nft_concat_range.sh "performance" cases, averaged over five
runs, on a single thread of an AMD Epyc 7402 CPU, and for aarch64
on a single thread of a BCM2711 (Raspberry Pi 4 Model B 4GB),
clocked at a stable 2147MHz frequency:

---------------.-----------------------------------.------------.
AMD Epyc 7402  |          baselines, Mpps          | this patch |
 1 thread      |___________________________________|____________|
 3.35GHz       |        |        |        |        |            |
 768KiB L1D$   | netdev |  hash  | rbtree |        |            |
---------------|  hook  |   no   | single | pipapo |   pipapo   |
type   entries |  drop  | ranges | field  | 4 bits | bit switch |
---------------|--------|--------|--------|--------|------------|
net,port       |        |        |        |        |            |
         1000  |   19.0 |   10.4 |    3.8 |    2.8 | 4.0   +43% |
---------------|--------|--------|--------|--------|------------|
port,net       |        |        |        |        |            |
          100  |   18.8 |   10.3 |    5.8 |    5.5 | 6.3   +14% |
---------------|--------|--------|--------|--------|------------|
net6,port      |        |        |        |        |            |
         1000  |   16.4 |    7.6 |    1.8 |    1.3 | 2.1   +61% |
---------------|--------|--------|--------|--------|------------|
port,proto     |        |        |        |        |     [1]    |
        30000  |   19.6 |   11.6 |    3.9 |    0.3 | 0.5   +66% |
---------------|--------|--------|--------|--------|------------|
net6,port,mac  |        |        |        |        |            |
           10  |   16.5 |    5.4 |    4.3 |    2.6 | 3.4   +31% |
---------------|--------|--------|--------|--------|------------|
net6,port,mac, |        |        |        |        |            |
proto    1000  |   16.5 |    5.7 |    1.9 |    1.0 | 1.4   +40% |
---------------|--------|--------|--------|--------|------------|
net,mac        |        |        |        |        |            |
         1000  |   19.0 |    8.4 |    3.9 |    1.7 | 2.5   +47% |
---------------'--------'--------'--------'--------'------------'
[1] Causes switch of lookup table buckets for 'port', not 'proto',
    to 4-bit groups

 ---------------.-----------------------------------.------------.
 BCM2711        |          baselines, Mpps          | this patch |
  1 thread      |___________________________________|____________|
  2147MHz       |        |        |        |        |            |
  32KiB L1D$    | netdev |  hash  | rbtree |        |            |
 ---------------|  hook  |   no   | single | pipapo |   pipapo   |
 type   entries |  drop  | ranges | field  | 4 bits | bit switch |
 ---------------|--------|--------|--------|--------|------------|
 net,port       |        |        |        |        |            |
          1000  |   1.63 |   1.37 |   0.87 |   0.61 | 0.70  +17% |
 ---------------|--------|--------|--------|--------|------------|
 port,net       |        |        |        |        |            |
           100  |   1.64 |   1.36 |   1.02 |   0.78 | 0.81   +4% |
 ---------------|--------|--------|--------|--------|------------|
 net6,port      |        |        |        |        |            |
          1000  |   1.56 |   1.27 |   0.65 |   0.34 | 0.50  +47% |
 ---------------|--------|--------|--------|--------|------------|
 port,proto [2] |        |        |        |        |            |
         10000  |   1.68 |   1.43 |   0.84 |   0.30 | 0.40  +13% |
 ---------------|--------|--------|--------|--------|------------|
 net6,port,mac  |        |        |        |        |            |
            10  |   1.56 |   1.14 |   1.02 |   0.62 | 0.66   +6% |
 ---------------|--------|--------|--------|--------|------------|
 net6,port,mac, |        |        |        |        |            |
 proto    1000  |   1.56 |   1.12 |   0.64 |   0.27 | 0.40  +48% |
 ---------------|--------|--------|--------|--------|------------|
 net,mac        |        |        |        |        |            |
          1000  |   1.63 |   1.26 |   0.87 |   0.41 | 0.53  +29% |
 ---------------'--------'--------'--------'--------'------------'
[2] Using 10000 entries instead of 30000 as it would take way too
    long for the test script to generate all of them

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Stefano Brivio [Sat, 7 Mar 2020 16:52:32 +0000 (17:52 +0100)]
nft_set_pipapo: Generalise group size for buckets

Get rid of all hardcoded assumptions that buckets in lookup tables
correspond to four-bit groups, and replace them with appropriate
calculations based on a variable group size, now stored in struct
field.

The group size could now be in principle any divisor of eight. Note,
though, that lookup and get functions need an implementation
intimately depending on the group size, and the only supported size
there, currently, is four bits, which is also the initial and only
used size at the moment.
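
The kind of calculation that replaces the hardcoded assumptions can be
sketched like this. The helper names are hypothetical and only model
the arithmetic: a field of a given width is split into groups of bb
bits, and each group needs one bucket per possible group value.

```c
/* Return 1 if bb is a valid group size: a divisor of eight bits. */
static int sketch_group_bits_valid(unsigned int bb)
{
	return bb != 0 && bb <= 8 && 8 % bb == 0;
}

/* Number of groups needed to cover a field of field_bits bits. */
static unsigned int sketch_groups(unsigned int field_bits, unsigned int bb)
{
	return field_bits / bb;
}

/* Number of buckets in each group: one per possible group value. */
static unsigned int sketch_buckets_per_group(unsigned int bb)
{
	return 1u << bb;
}

/* Total number of buckets in the lookup table of one field. */
static unsigned int sketch_total_buckets(unsigned int field_bits,
					 unsigned int bb)
{
	return sketch_groups(field_bits, bb) * sketch_buckets_per_group(bb);
}
```

For a 16-bit field, four-bit groups give 4 groups of 16 buckets (64 in
total), while eight-bit groups would give 2 groups of 256 buckets (512
in total).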

While at it, drop 'groups' from struct nft_pipapo: it was never used.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
wenxu [Mon, 24 Feb 2020 05:22:55 +0000 (13:22 +0800)]
netfilter: flowtable: add tunnel encap/decap action offload support

This patch adds tunnel encap/decap action offload to the flowtable
offload.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
wenxu [Mon, 24 Feb 2020 04:22:54 +0000 (05:22 +0100)]
netfilter: flowtable: add tunnel match offload support

This patch supports matching on tunnel_id, tunnel_src and tunnel_dst
for both IPv4 and IPv6 in the flowtable offload.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
wenxu [Mon, 24 Feb 2020 05:22:53 +0000 (13:22 +0800)]
netfilter: flowtable: add indr block setup support

Add indr-block setup support to the netfilter flowtable. This lets the
flowtable offload vlan and tunnel devices.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
wenxu [Mon, 24 Feb 2020 05:22:52 +0000 (13:22 +0800)]
netfilter: flowtable: add nf_flow_table_block_offload_init()

Add nf_flow_table_block_offload_init() to prepare for the indr block
offload patch.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Dan Carpenter [Tue, 25 Feb 2020 06:42:22 +0000 (09:42 +0300)]
netfilter: xt_IDLETIMER: clean up some indenting

These lines were indented wrong so Smatch complained.
net/netfilter/xt_IDLETIMER.c:81 idletimer_tg_show() warn: inconsistent indenting

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Jeremy Sowden [Mon, 24 Feb 2020 12:49:30 +0000 (12:49 +0000)]
netfilter: bitwise: use more descriptive variable-names.

Name the mask and xor data variables, "mask" and "xor," instead of "d1"
and "d2."
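
As context for the two names, a boolean bitwise operation of this kind
combines its source data with a mask and an xor value. The sketch below
is a userspace illustration with hypothetical names (sketch_bitwise,
xor_val), not the module's code.

```c
#include <stddef.h>
#include <stdint.h>

/* For each 32-bit word: dst = (src & mask) ^ xor_val. */
static void sketch_bitwise(uint32_t *dst, const uint32_t *src,
			   const uint32_t *mask, const uint32_t *xor_val,
			   size_t n_words)
{
	size_t i;

	for (i = 0; i < n_words; i++)
		dst[i] = (src[i] & mask[i]) ^ xor_val[i];
}
```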

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Gustavo A. R. Silva [Thu, 20 Feb 2020 13:59:14 +0000 (07:59 -0600)]
netfilter: Replace zero-length array with flexible-array member

The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
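
A minimal userspace sketch of such an allocation follows. The contents
of struct boo and the foo_alloc() helper are assumptions made for
illustration; calloc() stands in for the kernel allocators.

```c
#include <stddef.h>
#include <stdlib.h>

struct boo {
	int value;	/* hypothetical payload */
};

struct foo {
	int stuff;
	struct boo array[];	/* flexible array member, must be last */
};

/* Allocate a struct foo with room for n trailing struct boo entries. */
static struct foo *foo_alloc(size_t n)
{
	/* sizeof(*f) covers the fixed part only; add the array by hand. */
	struct foo *f = calloc(1, sizeof(*f) + n * sizeof(f->array[0]));

	if (f)
		f->stuff = (int)n;
	return f;
}
```

The size computation is the same one used with a zero-length array,
which is why existing allocations keep working after the conversion.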

Lastly, fix checkpatch.pl warning
WARNING: __aligned(size) is preferred over __attribute__((aligned(size)))
in net/bridge/netfilter/ebtables.c

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Chen Wandun [Mon, 10 Feb 2020 08:51:09 +0000 (16:51 +0800)]
netfilter: nft_set_pipapo: make the symbol 'nft_pipapo_get' static

Fix the following sparse warning:

net/netfilter/nft_set_pipapo.c:739:6: warning: symbol 'nft_pipapo_get' was not declared. Should it be static?

Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Acked-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Li RongQing [Thu, 20 Feb 2020 07:20:18 +0000 (15:20 +0800)]
netfilter: cleanup unused macro

TEMPLATE_NULLS_VAL is not used after commit 0838aa7fcfcd
("netfilter: fix netns dependencies with conntrack templates")

PFX is not used after commit 8bee4bad03c5b ("netfilter: xt
extensions: use pr_<level>")

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Tue, 18 Feb 2020 10:59:27 +0000 (11:59 +0100)]
netfilter: nf_tables: make all set structs const

They do not need to be writeable anymore.

v2: remove left-over __read_mostly annotation in set_pipapo.c (Stefano)

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Tue, 18 Feb 2020 10:59:26 +0000 (11:59 +0100)]
netfilter: nf_tables: make sets built-in

Placing nftables set support in an extra module is pointless:

1. nf_tables needs a dynamic registration interface for the sake of one module
2. nft heavily relies on sets, e.g. even simple rule like
   "nft ... tcp dport { 80, 443 }" will not work with _SETS=n.

IOW, either nftables isn't used or both nf_tables and nf_tables_set
modules are needed anyway.

With extra module:
 307K net/netfilter/nf_tables.ko
  79K net/netfilter/nf_tables_set.ko

   text  data  bss     dec filename
 146416  3072  545  150033 nf_tables.ko
  35496  1817    0   37313 nf_tables_set.ko

This patch:
 373K net/netfilter/nf_tables.ko

 178563  4049  545  183157 nf_tables.ko

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Xin Long [Mon, 10 Feb 2020 05:41:22 +0000 (13:41 +0800)]
netfilter: nft_tunnel: add support for geneve opts

Like vxlan and erspan opts, geneve opts should also be supported in
nft_tunnel. The difference is geneve RFC (draft-ietf-nvo3-geneve-14)
allows a geneve packet to carry multiple geneve opts. So with this
patch, nftables/libnftnl would do:

  # nft add table ip filter
  # nft add chain ip filter input { type filter hook input priority 0 \; }
  # nft add tunnel filter geneve_02 { type geneve\; id 2\; \
    ip saddr 192.168.1.1\; ip daddr 192.168.1.2\; \
    sport 9000\; dport 9001\; dscp 1234\; ttl 64\; flags 1\; \
    opts \"1:1:34567890,2:2:12121212,3:3:1212121234567890\"\; }
  # nft list tunnels table filter
    table ip filter {
     tunnel geneve_02 {
     id 2
     ip saddr 192.168.1.1
     ip daddr 192.168.1.2
     sport 9000
     dport 9001
     tos 18
     ttl 64
     flags 1
     geneve opts 1:1:34567890,2:2:12121212,3:3:1212121234567890
     }
    }

v1->v2:
  - no changes, just post it separately.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Manoj Basapathi [Thu, 6 Feb 2020 11:07:29 +0000 (16:37 +0530)]
netfilter: xtables: Add snapshot of hardidletimer target

This is a snapshot of hardidletimer netfilter target.

This patch implements a hardidletimer Xtables target that can be
used to identify when interfaces have been idle for a certain period
of time.

Timers are identified by labels and are created when a rule is set
with a new label. The rules also take a timeout value (in seconds) as
an option. If more than one rule uses the same timer label, the timer
will be restarted whenever any of the rules get a hit.

One entry for each timer is created in sysfs. This attribute contains
the time remaining for the timer to expire. The attributes are
located under the xt_idletimer class:

/sys/class/xt_idletimer/timers/<label>

When the timer expires, the target module sends a sysfs notification
to userspace, which can then decide what to do (e.g. disconnect to
save power).

Compared to IDLETIMER, HARDIDLETIMER can also send timer expiry
notifications while the CPU is in suspend.

v1->v2: Moved all functionality into IDLETIMER module to avoid
code duplication per comment from Florian.

Signed-off-by: Manoj Basapathi <manojbm@codeaurora.org>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Paul Blakey [Thu, 30 Jan 2020 16:15:18 +0000 (18:15 +0200)]
netfilter: flowtable: Use nf_flow_offload_tuple for stats as well

This patch doesn't change any functionality.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Alexander Bersenev [Sat, 14 Mar 2020 05:33:24 +0000 (10:33 +0500)]
cdc_ncm: Fix the build warning

The ndp32->wLength is two bytes long, so replace cpu_to_le32 with cpu_to_le16.
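
Why the conversion width matters can be shown with a small model. On a
little-endian CPU both conversions leave the value unchanged; the
sketch below instead spells out the byte swaps a big-endian CPU would
perform. All function names are hypothetical stand-ins, not the kernel
helpers.

```c
#include <stdint.h>

/* Byte swap a big-endian CPU needs for a 16-bit little-endian field. */
static uint16_t sketch_swab16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}

/* Byte swap a big-endian CPU needs for a 32-bit little-endian field. */
static uint32_t sketch_swab32(uint32_t v)
{
	return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
	       ((v & 0x00ff0000u) >> 8)  | ((v & 0xff000000u) >> 24);
}

/* 32-bit swap stored in a 16-bit field: only the low half survives. */
static uint16_t sketch_store_wrong(uint16_t len)
{
	return (uint16_t)sketch_swab32(len);
}

/* 16-bit swap stored in a 16-bit field: both bytes are preserved. */
static uint16_t sketch_store_right(uint16_t len)
{
	return sketch_swab16(len);
}
```

In this model a length of 0x0010 is stored as 0x1000 with the 16-bit
swap, but as 0 after the 32-bit swap is truncated to two bytes.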

Fixes: 0fa81b304a79 ("cdc_ncm: Implement the 32-bit version of NCM Transfer Block")
Signed-off-by: Alexander Bersenev <bay@hackerdom.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 15 Mar 2020 07:19:03 +0000 (00:19 -0700)]
Merge branch 'mptcp-simplify-mptcp_accept'

Paolo Abeni says:

====================
mptcp: simplify mptcp_accept()

Currently we allocate the MPTCP master socket at accept time.

The above makes mptcp_accept() quite complex, and requires checks in several
places for a NULL MPTCP master socket.

This series simplifies the MPTCP accept implementation by moving the master
socket allocation to syn-ack time, so that we can drop unneeded checks with the
follow-up patch.

v1 -> v2:
- rebased on top of 2398e3991bda7caa6b112a6f650fbab92f732b91
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Fri, 13 Mar 2020 15:52:42 +0000 (16:52 +0100)]
mptcp: drop unneeded checks

After the previous patch subflow->conn is always != NULL and
is never changed. We can drop a bunch of now unneeded checks.

v1 -> v2:
 - rebased on top of commit 2398e3991bda ("mptcp: always
   include dack if possible.")

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Fri, 13 Mar 2020 15:52:41 +0000 (16:52 +0100)]
mptcp: create msk early

This change moves the mptcp socket allocation from mptcp_accept() to
subflow_syn_recv_sock(), so that subflow->conn is now always set
for the non fallback scenario.

It allows cleaning up mptcp_accept() a bit, reducing the additional
locking, and will allow further cleanup in the next patch.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dejin Zheng [Fri, 13 Mar 2020 14:42:57 +0000 (22:42 +0800)]
net: stmmac: platform: convert to devm_platform_ioremap_resource

Use devm_platform_ioremap_resource() to simplify the code: it combines
platform_get_resource() and devm_ioremap_resource().

Signed-off-by: Dejin Zheng <zhengdejin5@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Fri, 13 Mar 2020 13:46:51 +0000 (15:46 +0200)]
net: mscc: ocelot: adjust maxlen on NPI port, not CPU

Being a non-physical port, the CPU port does not have an ocelot_port
structure, so the ocelot_port_writel call inside the
ocelot_port_set_maxlen() function would access data behind a NULL
pointer.

This is a patch for net-next only: the net tree boots fine, and the bug was
introduced during the net -> net-next merge.

Fixes: 1d3435793123 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
Fixes: a8015ded89ad ("net: mscc: ocelot: properly account for VLAN header length when setting MRU")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hoang Le [Fri, 13 Mar 2020 03:18:03 +0000 (10:18 +0700)]
tipc: add NULL pointer check to prevent kernel oops

Calling:
tipc_node_link_down()->
   - tipc_node_write_unlock()->tipc_mon_peer_down()
   - tipc_mon_peer_down()
  just after disabling the bearer could cause a kernel oops.

Fix this by adding a sanity check to ensure that the memory access is
valid.

Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agotipc: simplify trivial boolean return
Hoang Le [Fri, 13 Mar 2020 03:18:02 +0000 (10:18 +0700)]
tipc: simplify trivial boolean return

Checking for and returning the boolean 'true' is redundant, as it is
returned at the end of the function anyway.

Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge branch 'ethtool-consolidate-irq-coalescing-part-5'
David S. Miller [Sun, 15 Mar 2020 04:13:55 +0000 (21:13 -0700)]
Merge branch 'ethtool-consolidate-irq-coalescing-part-5'

Jakub Kicinski says:

====================
ethtool: consolidate irq coalescing - part 5

Convert more drivers following the groundwork laid in a recent
patch set [1] and continued in [2], [3], [4]. The aim of the effort
is to consolidate irq coalescing parameter validation in the core.

This set converts a further 15 drivers in drivers/net/ethernet.
One more conversion set to come.

[1] https://lore.kernel.org/netdev/20200305051542.991898-1-kuba@kernel.org/
[2] https://lore.kernel.org/netdev/20200306010602.1620354-1-kuba@kernel.org/
[3] https://lore.kernel.org/netdev/20200310021512.1861626-1-kuba@kernel.org/
[4] https://lore.kernel.org/netdev/20200311223302.2171564-1-kuba@kernel.org/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: via: reject unsupported coalescing params
Jakub Kicinski [Fri, 13 Mar 2020 04:08:03 +0000 (21:08 -0700)]
net: via: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: sxgbe: reject unsupported coalescing params
Jakub Kicinski [Fri, 13 Mar 2020 04:08:02 +0000 (21:08 -0700)]
net: sxgbe: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: r8169: reject unsupported coalescing params
Jakub Kicinski [Fri, 13 Mar 2020 04:08:01 +0000 (21:08 -0700)]
net: r8169: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: qlnic: let core reject the unsupported coalescing parameters
Jakub Kicinski [Fri, 13 Mar 2020 04:08:00 +0000 (21:08 -0700)]
net: qlnic: let core reject the unsupported coalescing parameters

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver already correctly rejected almost all
unsupported parameters (missing rate_sample_interval).

As a side effect of these changes the error code for
unsupported params changes from EINVAL to EOPNOTSUPP.
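
For illustration only, a userspace sketch of the idea behind this series,
not the kernel's implementation: the driver declares a bitmask of the
coalescing parameters it supports, and a core-side check returns
-EOPNOTSUPP when a request sets anything outside that mask. The
COALESCE_* bits, struct coalesce_req and check_coalesce() are
hypothetical names invented for this sketch.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical parameter bits; the kernel defines its own set. */
#define COALESCE_RX_USECS             (1u << 0)
#define COALESCE_TX_USECS             (1u << 1)
#define COALESCE_RATE_SAMPLE_INTERVAL (1u << 2)

/* Hypothetical request: a zero field means "parameter not requested". */
struct coalesce_req {
	uint32_t rx_usecs;
	uint32_t tx_usecs;
	uint32_t rate_sample_interval;
};

/* Collect one bit for every parameter the caller set to non-zero. */
static uint32_t coalesce_used(const struct coalesce_req *req)
{
	uint32_t used = 0;

	if (req->rx_usecs)
		used |= COALESCE_RX_USECS;
	if (req->tx_usecs)
		used |= COALESCE_TX_USECS;
	if (req->rate_sample_interval)
		used |= COALESCE_RATE_SAMPLE_INTERVAL;
	return used;
}

/* Core-side check: reject anything outside the driver's declared mask. */
static int check_coalesce(uint32_t supported, const struct coalesce_req *req)
{
	if (coalesce_used(req) & ~supported)
		return -EOPNOTSUPP;
	return 0;
}
```

With a mask of COALESCE_RX_USECS | COALESCE_TX_USECS, a request that
also sets rate_sample_interval is refused before the driver sees it,
which is the kind of check that was missing here.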

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: qede: reject unsupported coalescing params
Jakub Kicinski [Fri, 13 Mar 2020 04:07:59 +0000 (21:07 -0700)]
net: qede: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: netxen: let core reject the unsupported coalescing parameters
Jakub Kicinski [Fri, 13 Mar 2020 04:07:58 +0000 (21:07 -0700)]
net: netxen: let core reject the unsupported coalescing parameters

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

As a side effect of these changes the error code for
unsupported params changes from EINVAL to EOPNOTSUPP.

The driver was missing a check for rate_sample_interval.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: nixge: let core reject the unsupported coalescing parameters
Jakub Kicinski [Fri, 13 Mar 2020 04:07:57 +0000 (21:07 -0700)]
net: nixge: let core reject the unsupported coalescing parameters

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver correctly rejects all unsupported
parameters, no functional changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: myri10ge: reject unsupported coalescing params
Jakub Kicinski [Fri, 13 Mar 2020 04:07:56 +0000 (21:07 -0700)]
net: myri10ge: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: sky2: reject unsupported coalescing params
Jakub Kicinski [Fri, 13 Mar 2020 04:07:55 +0000 (21:07 -0700)]
net: sky2: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: skge: reject unsupported coalescing params
Jakub Kicinski [Fri, 13 Mar 2020 04:07:54 +0000 (21:07 -0700)]
net: skge: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: octeontx2-pf: let core reject the unsupported coalescing parameters
Jakub Kicinski [Fri, 13 Mar 2020 04:07:53 +0000 (21:07 -0700)]
net: octeontx2-pf: let core reject the unsupported coalescing parameters

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver correctly rejects all unsupported
parameters, no functional changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: mvpp2: reject unsupported coalescing params
Jakub Kicinski [Fri, 13 Mar 2020 04:07:52 +0000 (21:07 -0700)]
net: mvpp2: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: mvneta: reject unsupported coalescing params
Jakub Kicinski [Fri, 13 Mar 2020 04:07:51 +0000 (21:07 -0700)]
net: mvneta: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: mv643xx_eth: reject unsupported coalescing params
Jakub Kicinski [Fri, 13 Mar 2020 04:07:50 +0000 (21:07 -0700)]
net: mv643xx_eth: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: jme: reject unsupported coalescing params
Jakub Kicinski [Fri, 13 Mar 2020 04:07:49 +0000 (21:07 -0700)]
net: jme: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge branch 'net-phy-split-the-mscc-driver'
David S. Miller [Sun, 15 Mar 2020 04:06:45 +0000 (21:06 -0700)]
Merge branch 'net-phy-split-the-mscc-driver'

Antoine Tenart says:

====================
net: phy: split the mscc driver

This is a proposal to split the MSCC PHY driver, as its code base grew a
lot lately (it's already 3800+ lines). It also supports features
requiring a lot of code (MACsec), which would gain in being split from
the driver core, for readability and maintenance. This is also done as
other features should be coming later, which will also need lots of code
addition.

This series shouldn't change the way the driver works.

I checked, and there was no patch pending on this driver. This change
was done on top of all the modifications done on this driver in net-next.

Since v2:
  - Defined inline functions as static inline.
  - Fixed a locking issue reported by Kbuild.

Since v1:
  - Moved more definitions into the mscc_macsec.h header.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: phy: mscc: fix header defines and descriptions
Antoine Tenart [Fri, 13 Mar 2020 09:48:02 +0000 (10:48 +0100)]
net: phy: mscc: fix header defines and descriptions

Cosmetic commit fixing the MSCC PHY header defines and descriptions,
which were referring to the MSCC Ocelot MAC driver (see
drivers/net/ethernet/mscc/).

Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: phy: mscc: split the driver into separate files
Antoine Tenart [Fri, 13 Mar 2020 09:48:01 +0000 (10:48 +0100)]
net: phy: mscc: split the driver into separate files

This patch splits the MSCC driver into separate files, per
functionality, to improve readability and maintenance as the codebase
grew a lot. The MACsec code is moved to a dedicated mscc_macsec.c file,
the mscc.c file is renamed to mscc_main.c so that the driver binary is
still named mscc, and common definitions are put into a new mscc.h header.

Most of the code was just moved around, except for a few exceptions:
- Header inclusions were reworked to only keep what's needed.
- Three helpers were created in the MACsec code, to avoid #ifdef's in
  the main C file: vsc8584_macsec_init, vsc8584_handle_macsec_interrupt
  and vsc8584_config_macsec_intr.
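
The helper pattern mentioned above can be sketched as follows. This is a
generic illustration, not the code from this patch: HYPOTHETICAL_FEATURE
stands in for a Kconfig symbol, and struct dev_state, feature_init() and
dev_probe() are made-up names.

```c
/* HYPOTHETICAL_FEATURE is a placeholder, not a real config option. */
struct dev_state {
	int feature_ready;
};

#ifdef HYPOTHETICAL_FEATURE
/* Real implementation, built only when the feature is enabled. */
static inline int feature_init(struct dev_state *st)
{
	st->feature_ready = 1;
	return 0;
}
#else
/* Stub with the same signature, so callers need no #ifdef. */
static inline int feature_init(struct dev_state *st)
{
	(void)st;
	return 0;
}
#endif

/* The main file calls the helper unconditionally. */
static int dev_probe(struct dev_state *st)
{
	st->feature_ready = 0;
	return feature_init(st);
}
```

The conditional compilation lives in one place (normally the header),
and the stub is static inline so that it costs nothing when the feature
is compiled out.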

The patch should not introduce any functional modification.

Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: phy: move the mscc driver to its own directory
Antoine Tenart [Fri, 13 Mar 2020 09:48:00 +0000 (10:48 +0100)]
net: phy: move the mscc driver to its own directory

The MSCC PHY driver is growing, with lots of space consuming features
(firmware support, full initialization, MACsec...). It's becoming hard
to read and navigate in its source code. This patch moves the MSCC
driver to its own directory, without modifying anything, as a
preparation for splitting up its features into dedicated files.

Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge branch 'RED-Introduce-an-ECN-tail-dropping-mode'
David S. Miller [Sun, 15 Mar 2020 04:03:47 +0000 (21:03 -0700)]
Merge branch 'RED-Introduce-an-ECN-tail-dropping-mode'

Petr Machata says:

====================
RED: Introduce an ECN tail-dropping mode

When the RED qdisc is currently configured to enable ECN, the RED algorithm
is used to decide whether a certain SKB should be marked. If that SKB is
not ECN-capable, it is early-dropped.

It is also possible to keep all traffic in the queue, and just mark the
ECN-capable subset of it, as appropriate under the RED algorithm. Some
switches support this mode, and some installations make use of it.
There is currently no way to put the RED qdiscs into this mode.

Therefore this patchset adds a new RED flag, TC_RED_TAILDROP. When the
qdisc is configured with this flag, non-ECT traffic is enqueued (and
tail-dropped when the queue size is exhausted) instead of being
early-dropped.

Unfortunately, adding a new RED flag is not as simple as it sounds. RED
flags are passed in tc_red_qopt.flags. However RED neglects to validate the
flag field, and just copies it over wholesale to its internal structure,
and later dumps it back.

A broken userspace can therefore configure a RED qdisc with arbitrary
unsupported flags, and later expect to see the flags on qdisc dump. The
current ABI thus allows storage of 5 bits of custom data along with the
qdisc instance.

GRED, SFQ and CHOKE qdiscs are in the same situation. (GRED validates VQ
flags, but not the flags for the main queue.) E.g. if SFQ ever needs to
support TC_RED_ADAPTATIVE, it needs another way of doing it, and at the
same time it needs to retain the possibility to store 6 bits of
uninterpreted data.

For RED, this problem is resolved in patch #2, which adds a new attribute,
and a way to separate flags from userbits that can be reused by other
qdiscs. The flag itself and related behavioral changes are added in patch

To test the new feature, patch #1 first introduces a TDC testsuite that
covers the existing RED flags. Patch #5 later extends it with taildrop
coverage. Patch #6 contains a forwarding selftest for the offloaded
datapath.

To test the SW datapath, I took the mlxsw selftest and adapted it in mostly
obvious ways. The test is stable enough to verify that RED, ECN and ECN
taildrop actually work. However, I have no confidence in its portability to
other people's machines or mildly different configurations. I therefore do
not find it suitable for upstreaming.

GRED and CHOKE can use the same method as RED if they ever need to support
extra flags. SFQ uses the length of TCA_OPTIONS to dispatch on binary
control structure version, and would therefore need a different approach.

v2:
- Patch #1
    - Require nsPlugin in each RED test
    - Match end-of-line to catch cases of more flags reported than
      requested
- Patch #2:
    - Replaced with another patch.
- Patch #3:
    - Fix red_use_taildrop() condition in red_enqueue switch for
      probabilistic case.
- Patch #5:
    - Require nsPlugin in each RED test
    - Match end-of-line to catch cases of more flags reported than
      requested
    - Add a test for creation of non-ECN taildrop, which should fail
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoselftests: mlxsw: RED: Test RED ECN nodrop offload
Petr Machata [Thu, 12 Mar 2020 23:11:00 +0000 (01:11 +0200)]
selftests: mlxsw: RED: Test RED ECN nodrop offload

Extend RED testsuite to cover the new nodrop mode of RED-ECN. This test is
really similar to ECN test, diverging only in the last step, where UDP
traffic should go to backlog instead of being dropped. Thus extract a
common helper, ecn_test_common(), make do_ecn_test() into a relatively
simple wrapper, and add another one, do_ecn_nodrop_test().

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoselftests: qdiscs: RED: Add nodrop tests
Petr Machata [Thu, 12 Mar 2020 23:10:59 +0000 (01:10 +0200)]
selftests: qdiscs: RED: Add nodrop tests

Add tests for the new "nodrop" flag.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agomlxsw: spectrum_qdisc: Offload RED ECN nodrop mode
Petr Machata [Thu, 12 Mar 2020 23:10:58 +0000 (01:10 +0200)]
mlxsw: spectrum_qdisc: Offload RED ECN nodrop mode

RED ECN nodrop mode means that non-ECT traffic should not be early-dropped,
but enqueued normally instead. In Spectrum systems, this is achieved by
disabling CWTPM.ew (enable WRED) for a given traffic class.

So far CWTPM.ew was unconditionally enabled. Instead disable it when the
RED qdisc is in nodrop mode.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: sched: RED: Introduce an ECN nodrop mode
Petr Machata [Thu, 12 Mar 2020 23:10:57 +0000 (01:10 +0200)]
net: sched: RED: Introduce an ECN nodrop mode

When the RED Qdisc is currently configured to enable ECN, the RED algorithm
is used to decide whether a certain SKB should be marked. If that SKB is
not ECN-capable, it is early-dropped.

It is also possible to keep all traffic in the queue, and just mark the
ECN-capable subset of it, as appropriate under the RED algorithm. Some
switches support this mode, and some installations make use of it.

To that end, add a new RED flag, TC_RED_NODROP. When the Qdisc is
configured with this flag, non-ECT traffic is enqueued instead of being
early-dropped.
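
A minimal sketch of the resulting decision, assuming the RED algorithm
has already selected the packet for congestion signalling. The enum and
red_signal() are hypothetical names for this illustration and do not
mirror the actual red_enqueue() code. Tail-dropping when the queue is
full is a separate check and is not modelled here.

```c
#include <stdbool.h>

enum red_verdict {
	RED_ENQUEUE,          /* keep the packet unmarked */
	RED_MARK_AND_ENQUEUE, /* set the ECN mark and keep the packet */
	RED_EARLY_DROP,       /* drop before the queue is full */
};

/*
 * Verdict for a packet that RED picked for signalling.
 * nodrop is only considered when ECN is enabled (an assumption
 * of this sketch).
 */
static enum red_verdict red_signal(bool ecn_enabled, bool nodrop,
				   bool pkt_is_ect)
{
	if (ecn_enabled && pkt_is_ect)
		return RED_MARK_AND_ENQUEUE;
	if (ecn_enabled && nodrop)
		return RED_ENQUEUE;
	return RED_EARLY_DROP;
}
```

Only the non-ECT case changes: with nodrop set, such a packet is
enqueued instead of early-dropped.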

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: sched: Allow extending set of supported RED flags
Petr Machata [Thu, 12 Mar 2020 23:10:56 +0000 (01:10 +0200)]
net: sched: Allow extending set of supported RED flags

The qdiscs RED, GRED, SFQ and CHOKE use different subsets of the same pool
of global RED flags. These are passed in tc_red_qopt.flags. However none of
these qdiscs validate the flag field, and just copy it over wholesale to
internal structures, and later dump it back. (An exception is GRED, which
does validate for VQs -- however not for the main setup.)

A broken userspace can therefore configure a qdisc with arbitrary
unsupported flags, and later expect to see the flags on qdisc dump. The
current ABI therefore allows storage of several bits of custom data to
qdisc instances of the types mentioned above. How many bits, depends on
which flags are meaningful for the qdisc in question. E.g. SFQ recognizes
flags ECN and HARDDROP, and the rest is not interpreted.

If SFQ ever needs to support ADAPTATIVE, it needs another way of doing it,
and at the same time it needs to retain the possibility to store 6 bits of
uninterpreted data. Likewise RED, which adds a new flag later in this
patchset.

To that end, this patch adds a new function, red_get_flags(), to split the
passed flags of RED-like qdiscs to flags and user bits, and
red_validate_flags() to validate the resulting configuration. It further
adds a new attribute, TCA_RED_FLAGS, to pass arbitrary flags.
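
The split can be pictured with the userspace sketch below. It is a
simplified illustration under assumed names, not the signature or body
of red_get_flags(): the bit values, HISTORIC_MASK, SUPPORTED_MASK,
struct split_flags and split_legacy_flags() are all hypothetical, and
the choice of -EINVAL for unknown flags is an assumption.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical bit layout for this sketch. */
#define FLAG_ECN       0x01u
#define FLAG_HARDDROP  0x02u
#define FLAG_NEW       0x10u /* only settable through the new attribute */

#define HISTORIC_MASK  (FLAG_ECN | FLAG_HARDDROP)
#define SUPPORTED_MASK (HISTORIC_MASK | FLAG_NEW)

struct split_flags {
	uint8_t flags;    /* interpreted by the qdisc */
	uint8_t userbits; /* stored and dumped back untouched */
};

/*
 * legacy:     the old 8-bit flags field.
 * attr_flags: value of the new attribute, or NULL when it is absent.
 */
static int split_legacy_flags(uint8_t legacy, const uint32_t *attr_flags,
			      struct split_flags *out)
{
	/* Bits that were always meaningful stay flags ... */
	out->flags = legacy & HISTORIC_MASK;
	/* ... everything else in the old field is opaque user data. */
	out->userbits = legacy & (uint8_t)~HISTORIC_MASK;

	if (attr_flags) {
		/* New-style flags are validated instead of stored blindly. */
		if (*attr_flags & ~(uint32_t)SUPPORTED_MASK)
			return -EINVAL;
		out->flags |= (uint8_t)*attr_flags;
	}
	return 0;
}
```

Old userspace keeps its uninterpreted bits, while new flags arrive
through a channel that can reject unknown values.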

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoselftests: qdiscs: Add TDC test for RED
Petr Machata [Thu, 12 Mar 2020 23:10:55 +0000 (01:10 +0200)]
selftests: qdiscs: Add TDC test for RED

Add a handful of tests for creating RED with different flags.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agosfc: support configuring vf spoofchk on EF10 VFs
Edward Cree [Thu, 12 Mar 2020 19:21:39 +0000 (19:21 +0000)]
sfc: support configuring vf spoofchk on EF10 VFs

Corresponds to the MAC_SPOOFING_TX privilege in the hardware.
Some firmware versions on some cards don't support the feature, so check
 the TX_MAC_SECURITY capability and fail EOPNOTSUPP if trying to enable
 spoofchk on a NIC that doesn't support it.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge branch 'net-phy-XLGMII-define-and-usage-in-PHYLINK'
David S. Miller [Sun, 15 Mar 2020 03:55:12 +0000 (20:55 -0700)]
Merge branch 'net-phy-XLGMII-define-and-usage-in-PHYLINK'

Jose Abreu says:

====================
net: phy: XLGMII define and usage in PHYLINK

Adds XLGMII defines and usage in PHYLINK.

Patch 1/2, adds the define for it, whilst 2/2 adds the usage of it in
PHYLINK.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: phylink: Add XLGMII support
Jose Abreu [Thu, 12 Mar 2020 17:10:10 +0000 (18:10 +0100)]
net: phylink: Add XLGMII support

Add XLGMII interface and the list of XLGMII speeds to PHYLINK.

Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: phy: Add XLGMII interface define
Jose Abreu [Thu, 12 Mar 2020 17:10:09 +0000 (18:10 +0100)]
net: phy: Add XLGMII interface define

Add a define for XLGMII interface.

Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: ena: ethtool: clean up minor indentation issue
Colin Ian King [Thu, 12 Mar 2020 14:05:22 +0000 (14:05 +0000)]
net: ena: ethtool: clean up minor indentation issue

There is a statement that is indented incorrectly, remove a space.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: dsa: sja1105: move MAC configuration to .phylink_mac_link_up
Vladimir Oltean [Thu, 12 Mar 2020 12:19:51 +0000 (12:19 +0000)]
net: dsa: sja1105: move MAC configuration to .phylink_mac_link_up

The switches supported so far by the driver only have non-SerDes ports,
so they should be configured in the PHYLINK callback that provides the
resolved PHY link parameters.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agocxgb4: update T5/T6 adapter register ranges
Shahjada Abul Husain [Thu, 12 Mar 2020 11:42:40 +0000 (17:12 +0530)]
cxgb4: update T5/T6 adapter register ranges

Add more T5/T6 registers to be collected in register dump:

1. MPS register range 0x9810 to 0x9864 and 0xd000 to 0xd004.
2. NCSI register range 0x1a114 to 0x1a130 and 0x1a138 to 0x1a1c4.

Signed-off-by: Shahjada Abul Husain <shahjada@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge tag 'mlx5-updates-2020-03-13' of git://git.kernel.org/pub/scm/linux/kernel...
David S. Miller [Sat, 14 Mar 2020 04:04:03 +0000 (21:04 -0700)]
Merge tag 'mlx5-updates-2020-03-13' of git://git./linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2020-03-13

Misc update to mlx5 core and E-Switch driver:

1) Blue-Field, Update VF vports config when num of VFs changed

From Bodong, Various misc cleanups and refactoring
for vport enabling/disabling routines to allow them to be called
dynamically and not only on E-Switch load.

This will allow ECPF (ConnectX BlueField Smartnic) support for dynamic
num vf changes and dynamic vport creation and configuration as introduced
in "Update VF vports config when num of VFs changed" patch.

2) From Parav and Mark, trivial clean-ups.

3) Software steering support for flow table id as destination
and a clean-up patch to remove unnecessary function stubs, from Alex.
====================

Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
David S. Miller [Sat, 14 Mar 2020 03:52:03 +0000 (20:52 -0700)]
Merge git://git./linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2020-03-13

The following pull-request contains BPF updates for your *net-next* tree.

We've added 86 non-merge commits during the last 12 day(s) which contain
a total of 107 files changed, 5771 insertions(+), 1700 deletions(-).

The main changes are:

1) Add modify_return attach type which allows to attach to a function via
   BPF trampoline and is run after the fentry and before the fexit programs
   and can pass a return code to the original caller, from KP Singh.

2) Generalize BPF's kallsyms handling and add BPF trampoline and dispatcher
   objects to be visible in /proc/kallsyms so they can be annotated in
   stack traces, from Jiri Olsa.

3) Extend BPF sockmap to allow for UDP next to existing TCP support in order
   to enable this for BPF based socket dispatch, from Lorenz Bauer.

4) Introduce a new bpftool 'prog profile' command which attaches to existing
   BPF programs via fentry and fexit hooks and reads out hardware counters
   during that period, from Song Liu. Example usage:

   bpftool prog profile id 337 duration 3 cycles instructions llc_misses

        4228 run_cnt
     3403698 cycles                                              (84.08%)
     3525294 instructions   #  1.04 insn per cycle               (84.05%)
          13 llc_misses     #  3.69 LLC misses per million insns (83.50%)

5) Batch of improvements to libbpf, bpftool and BPF selftests. Also addition
   of a new bpf_link abstraction to keep in particular BPF tracing programs
   attached even when the application owning them exits, from Andrii Nakryiko.

6) New bpf_get_current_pid_tgid() helper for tracing to perform PID filtering
   and which returns the PID as seen by the init namespace, from Carlos Neira.

7) Refactor of RISC-V JIT code to move out common pieces and addition of a
   new RV32G BPF JIT compiler, from Luke Nelson.

8) Add gso_size context member to __sk_buff in order to be able to know whether
   a given skb is GSO or not, from Willem de Bruijn.

9) Add a new bpf_xdp_output() helper which reuses XDP's existing perf RB output
   implementation but can be called from tracepoint programs, from Eelco Chaudron.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet/mlx5: DR, Remove unneeded functions deceleration
Alex Vesker [Sun, 8 Mar 2020 11:21:41 +0000 (13:21 +0200)]
net/mlx5: DR, Remove unneeded functions deceleration

Remove the dummy function declarations. They are not needed, since
fs_dr is the only caller of mlx5dr and both the fs_dr and dr files
depend on the same config flag (MLX5_SW_STEERING).

Fixes: 70605ea545e8 ("net/mlx5: DR, Expose APIs for direct rule managing")
Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years agonet/mlx5: DR, Add support for flow table id destination action
Alex Vesker [Wed, 26 Feb 2020 09:39:45 +0000 (11:39 +0200)]
net/mlx5: DR, Add support for flow table id destination action

This action allows going to a flow table based on the table id.
Goto flow table id is required for supporting user space SW.

Signed-off-by: Alex Vesker <valex@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years agonet/mlx5: Avoid deriving mlx5_core_dev second time
Parav Pandit [Wed, 18 Dec 2019 05:16:11 +0000 (23:16 -0600)]
net/mlx5: Avoid deriving mlx5_core_dev second time

All callers need to work on mlx5_core_dev and it is already derived
before calling mlx5_devlink_eswitch_check().
Hence, accept mlx5_core_dev in mlx5_devlink_eswitch_check().

Given that it works on mlx5_core_dev, change the helper function name to
drop the devlink prefix.

Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years agonet/mlx5: E-switch, Annotate esw state_lock mutex destroy
Parav Pandit [Wed, 18 Dec 2019 04:51:24 +0000 (22:51 -0600)]
net/mlx5: E-switch, Annotate esw state_lock mutex destroy

Invoke mutex_destroy() to catch any esw state_lock errors.

Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years agonet/mlx5: E-switch, Annotate termtbl_mutex mutex destroy
Parav Pandit [Sat, 14 Dec 2019 09:24:25 +0000 (03:24 -0600)]
net/mlx5: E-switch, Annotate termtbl_mutex mutex destroy

Annotate mutex destroy to keep it symmetric to the init sequence.
It should be destroyed after its users (representor netdevices) are
destroyed in the flow below.

esw_offloads_disable()
  esw_offloads_unload_rep()

Hence, initialize the mutex before creating the representors which use
it.

Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years agonet/mlx5: Accept flow rules without match
Mark Bloch [Fri, 17 Jan 2020 18:30:32 +0000 (18:30 +0000)]
net/mlx5: Accept flow rules without match

Allow passing NULL spec when creating a flow rule. Such rules will act
as "catch all" flow rules.
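
As a generic illustration of the "catch all" semantics (not mlx5 code),
a matcher can treat a missing spec as matching everything. struct
match_spec and rule_matches() are hypothetical names for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical spec: a packet matches when (field & mask) == value. */
struct match_spec {
	uint32_t mask;
	uint32_t value;
};

static bool rule_matches(const struct match_spec *spec, uint32_t field)
{
	if (!spec)
		return true; /* no match criteria: catch-all rule */
	return (field & spec->mask) == spec->value;
}
```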

Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years agonet/mlx5: E-Switch, Refactor unload all reps per rep type
Bodong Wang [Tue, 12 Nov 2019 17:56:12 +0000 (11:56 -0600)]
net/mlx5: E-Switch, Refactor unload all reps per rep type

Following introduction of per vport configuration of vport and rep,
unload all reps per rep type is still needed as IB reps can be
unloaded individually. However, a few internal functions exist purely
for this purpose; merge them into a single function.

This patch doesn't change any existing functionality.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years agonet/mlx5: E-Switch, Update VF vports config when num of VFs changed
Bodong Wang [Tue, 12 Nov 2019 17:30:10 +0000 (11:30 -0600)]
net/mlx5: E-Switch, Update VF vports config when num of VFs changed

Currently, the ECPF eswitch manager does a one-time-only configuration
of VF vports when the device switches to offloads mode. However, when
the number of VFs is changed from the host side, the driver doesn't
update the VF vport configuration.

Hence, perform a VF vport configuration update whenever a num_vfs change
event occurs.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years agonet/mlx5: E-Switch, Introduce per vport configuration for eswitch modes
Bodong Wang [Mon, 11 Nov 2019 22:40:35 +0000 (16:40 -0600)]
net/mlx5: E-Switch, Introduce per vport configuration for eswitch modes

Both legacy and offload modes require vport setup; only offload mode
requires rep setup. Before this patch, vport and rep operations are
separately applied to all relevant vports in different stages.

Change to use per vport configuration, so that vport and rep operations
are modularized per vport.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years agonet/mlx5: E-switch, Make vport setup/cleanup sequence symmetric
Bodong Wang [Wed, 16 Oct 2019 16:19:34 +0000 (11:19 -0500)]
net/mlx5: E-switch, Make vport setup/cleanup sequence symmetric

Vport enable and disable sequence is incorrect. It should be:
  enable()
  esw_vport_setup_acl,
  esw_vport_setup,
  esw_vport_enable_qos.

  disable()
  esw_vport_disable_qos,
  esw_vport_cleanup,
  esw_vport_cleanup_acl.

Instead of having two setup functions for port and acl, merge
acl setup to port setup function.
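
The ordering rule can be sketched in userspace as below: disable undoes
the steps of enable in exact reverse order. The step names follow the
sequence listed above, but the log structure and the vport_enable() /
vport_disable() helpers are hypothetical and only record the order
instead of touching hardware.

```c
#include <stddef.h>

enum step {
	SETUP_ACL, SETUP, ENABLE_QOS,     /* enable path */
	DISABLE_QOS, CLEANUP, CLEANUP_ACL /* disable path */
};

/* Hypothetical log of executed steps. */
struct step_log {
	enum step steps[8];
	size_t n;
};

static void record(struct step_log *log, enum step s)
{
	if (log->n < sizeof(log->steps) / sizeof(log->steps[0]))
		log->steps[log->n++] = s;
}

static void vport_enable(struct step_log *log)
{
	record(log, SETUP_ACL);
	record(log, SETUP);
	record(log, ENABLE_QOS);
}

/* Mirror image of vport_enable(): last set up, first torn down. */
static void vport_disable(struct step_log *log)
{
	record(log, DISABLE_QOS);
	record(log, CLEANUP);
	record(log, CLEANUP_ACL);
}
```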

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years ago net/mlx5: E-Switch, Prepare for vport enable/disable refactor
Bodong Wang [Fri, 30 Aug 2019 20:41:09 +0000 (15:41 -0500)]
net/mlx5: E-Switch, Prepare for vport enable/disable refactor

Rename esw_apply_vport_config() to esw_vport_setup(), and add new
helper function esw_vport_cleanup() to make them symmetric.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years ago net/mlx5: E-Switch, Remove redundant warning when QoS enable failed
Bodong Wang [Thu, 17 Oct 2019 19:55:52 +0000 (14:55 -0500)]
net/mlx5: E-Switch, Remove redundant warning when QoS enable failed

esw_vport_enable_qos can return an error in the cases below:
1. QoS is already enabled. A warning is useless in this case.
2. The create scheduling element cmd failed. There is already a warning.

Remove the redundant warning if esw_vport_enable_qos returns err.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years ago net/mlx5: E-Switch, Hold mutex when querying drop counter in legacy mode
Bodong Wang [Fri, 13 Sep 2019 21:24:19 +0000 (16:24 -0500)]
net/mlx5: E-Switch, Hold mutex when querying drop counter in legacy mode

Consider the scenario below: CPU 1 risks querying drop counters that
CPU 0 has already destroyed. The query path needs to take the same state
mutex that is held when disabling the vport.

+-------------------------------+-------------------------------------+
| CPU 0                         | CPU 1                               |
+-------------------------------+-------------------------------------+
| mlx5_device_disable_sriov     | mlx5e_get_vf_stats                  |
| mlx5_eswitch_disable          | mlx5_eswitch_get_vport_stats        |
| esw_disable_vport             | mlx5_eswitch_query_vport_drop_stats |
| mlx5_fc_destroy(drop_counter) | mlx5_fc_query(drop_counter)         |
+-------------------------------+-------------------------------------+
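
A minimal user-space sketch of the locking pattern, assuming a pthread
mutex in place of the kernel mutex. The struct and function names
(demo_vport, demo_vport_query_drops, and so on) are hypothetical and do
not come from the driver.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a vport that owns a drop counter. */
struct demo_vport {
	pthread_mutex_t state_lock;
	bool enabled;
	uint64_t *drop_counter;	/* allocated on enable, freed on disable */
};

static int demo_vport_enable(struct demo_vport *v)
{
	int err = 0;

	pthread_mutex_lock(&v->state_lock);
	if (!v->enabled) {
		v->drop_counter = calloc(1, sizeof(*v->drop_counter));
		if (v->drop_counter)
			v->enabled = true;
		else
			err = -1;
	}
	pthread_mutex_unlock(&v->state_lock);
	return err;
}

static void demo_vport_disable(struct demo_vport *v)
{
	pthread_mutex_lock(&v->state_lock);
	free(v->drop_counter);
	v->drop_counter = NULL;
	v->enabled = false;
	pthread_mutex_unlock(&v->state_lock);
}

/* The query takes the same lock, so it never sees a freed counter. */
static int demo_vport_query_drops(struct demo_vport *v, uint64_t *out)
{
	int err = -1;

	assert(out != NULL);
	pthread_mutex_lock(&v->state_lock);
	if (v->enabled) {
		*out = *v->drop_counter;
		err = 0;
	}
	pthread_mutex_unlock(&v->state_lock);
	return err;
}
```

Because both paths serialize on state_lock, a query that runs after
disable returns an error instead of touching freed memory.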

Fixes: b8a0dbe3a90b ("net/mlx5e: E-switch, Add steering drop counters")
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years ago net/mlx5: E-Switch, Remove redundant check of eswitch manager cap
Bodong Wang [Thu, 2 Jan 2020 21:30:52 +0000 (15:30 -0600)]
net/mlx5: E-Switch, Remove redundant check of eswitch manager cap

esw_vport_create_legacy_acl_tables bails out immediately for the eswitch
manager, hence remove all later checks of the esw manager cap.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years ago Merge branch 'bpf-core-fixes'
Daniel Borkmann [Fri, 13 Mar 2020 22:30:53 +0000 (23:30 +0100)]
Merge branch 'bpf-core-fixes'

Andrii Nakryiko says:

====================
This patch set fixes a bug in the CO-RE relocation candidate finding logic,
which currently allows matching against forward declarations, functions, and
other named types, even though it makes no sense to even attempt. As part of
verifying the fix, add a test using vmlinux.h with the preserve_access_index
attribute and utilizing struct pt_regs heavily to trace the nanosleep syscall
using 5 different types of tracing BPF programs.

This test also demonstrated problems using struct pt_regs in syscall
tracepoints and required a new set of macros, which were added in patch #3
into bpf_tracing.h.

Patch #1 fixes an annoying issue with selftest failure messages being out of
sync.

v1->v2:
  - drop unused handle__probed() function (Martin).
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
4 years ago selftests/bpf: Add vmlinux.h selftest exercising tracing of syscalls
Andrii Nakryiko [Fri, 13 Mar 2020 17:23:36 +0000 (10:23 -0700)]
selftests/bpf: Add vmlinux.h selftest exercising tracing of syscalls

Add vmlinux.h generation to selftest/bpf's Makefile. Use it from newly added
test_vmlinux to trace nanosleep syscall using 5 different types of programs:
  - tracepoint;
  - raw tracepoint;
  - raw tracepoint w/ direct memory reads (tp_btf);
  - kprobe;
  - fentry.

These programs are realistic variants of real-life tracing programs,
exercising vmlinux.h's usage with tracing applications.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200313172336.1879637-5-andriin@fb.com
4 years ago libbpf: Provide CO-RE variants of PT_REGS macros
Andrii Nakryiko [Fri, 13 Mar 2020 17:23:35 +0000 (10:23 -0700)]
libbpf: Provide CO-RE variants of PT_REGS macros

Syscall raw tracepoints have struct pt_regs pointer as tracepoint's first
argument. After that, reading any of pt_regs fields requires bpf_probe_read(),
even for tp_btf programs. Due to that, PT_REGS_PARMx macros are not usable as
is. This patch adds CO-RE variants of those macros that use BPF_CORE_READ() to
read necessary fields. This provides relocatable architecture-agnostic pt_regs
field accesses.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200313172336.1879637-4-andriin@fb.com
4 years ago libbpf: Ignore incompatible types with matching name during CO-RE relocation
Andrii Nakryiko [Fri, 13 Mar 2020 17:23:34 +0000 (10:23 -0700)]
libbpf: Ignore incompatible types with matching name during CO-RE relocation

When finding target type candidates, ignore forward declarations, functions,
and other named types of incompatible kind. Not doing this can cause false
errors.  See [0] for one such case (due to struct pt_regs forward
declaration).

  [0] https://github.com/iovisor/bcc/pull/2806#issuecomment-598543645
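
The filtering idea can be sketched in plain C as follows. The type
representation here (demo_kind, demo_type) is a deliberately simplified,
hypothetical model and not libbpf's BTF API; it only shows that a name
match alone is not enough and that the kind must match as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Simplified, hypothetical type kinds; real BTF defines many more. */
enum demo_kind { DEMO_KIND_STRUCT, DEMO_KIND_FWD, DEMO_KIND_FUNC };

struct demo_type {
	const char *name;
	enum demo_kind kind;
};

/* A target type is a candidate only if both name and kind match. */
static bool is_candidate(const struct demo_type *local,
			 const struct demo_type *targ)
{
	return targ->kind == local->kind &&
	       strcmp(targ->name, local->name) == 0;
}

/* Count candidates among n target types. */
static size_t count_candidates(const struct demo_type *local,
			       const struct demo_type *targs, size_t n)
{
	size_t i, cnt = 0;

	assert(local != NULL && targs != NULL);
	for (i = 0; i < n; i++)
		if (is_candidate(local, &targs[i]))
			cnt++;
	return cnt;
}
```

With a local struct pt_regs, a forward declaration or a function that
happens to share the name is skipped, so only the real struct definition
remains as a candidate.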

Fixes: ddc7c3042614 ("libbpf: implement BPF CO-RE offset relocation algorithm")
Reported-by: Wenbo Zhang <ethercflow@gmail.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200313172336.1879637-3-andriin@fb.com
4 years ago selftests/bpf: Ensure consistent test failure output
Andrii Nakryiko [Fri, 13 Mar 2020 17:23:33 +0000 (10:23 -0700)]
selftests/bpf: Ensure consistent test failure output

printf() doesn't seem to honor using overwritten stdout/stderr (as part of
stdio hijacking), so ensure all "standard" invocations of printf() do
fprintf(stdout, ...) instead.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200313172336.1879637-2-andriin@fb.com
4 years ago selftests/bpf: Fix spurious failures in accept due to EAGAIN
Jakub Sitnicki [Fri, 13 Mar 2020 16:10:49 +0000 (17:10 +0100)]
selftests/bpf: Fix spurious failures in accept due to EAGAIN

Andrii Nakryiko reports that sockmap_listen test suite is frequently
failing due to accept() calls erroring out with EAGAIN:

  ./test_progs:connect_accept_thread:733: accept: Resource temporarily unavailable
  connect_accept_thread:FAIL:733

This is because we are using a non-blocking listening TCP socket to
accept() connections without polling on the socket.

While at first switching to blocking mode seems like the right thing to do,
this could lead to the test process blocking indefinitely in the face of a
network issue, like the loopback interface being down, as Andrii pointed out.

Hence, stick to non-blocking mode for TCP listening sockets but poll
for an incoming connection for a limited time before giving up.

Apply this approach to all socket I/O calls in the test suite that we
expect to block indefinitely, that is accept() for TCP and recv() for UDP.

Fixes: 44d28be2b8d4 ("selftests/bpf: Tests for sockmap/sockhash holding listening sockets")
Reported-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200313161049.677700-1-jakub@cloudflare.com
4 years ago tools/bpf: Move linux/types.h for selftests and bpftool
Tobias Klauser [Fri, 13 Mar 2020 11:31:05 +0000 (12:31 +0100)]
tools/bpf: Move linux/types.h for selftests and bpftool

Commit fe4eb069edb7 ("bpftool: Use linux/types.h from source tree for
profiler build") added a build dependency on tools/testing/selftests/bpf
to tools/bpf/bpftool. This is suboptimal with respect to a possible
stand-alone build of bpftool.

Fix this by moving tools/testing/selftests/bpf/include/uapi/linux/types.h
to tools/include/uapi/linux/types.h.

This requires an adjustment in the include search path order for the
tests in tools/testing/selftests/bpf so that tools/include/linux/types.h
is selected when building host binaries and
tools/include/uapi/linux/types.h is selected when building bpf binaries.

Verified by compiling bpftool and the bpf selftests on x86_64 with this
change.

Fixes: fe4eb069edb7 ("bpftool: Use linux/types.h from source tree for profiler build")
Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20200313113105.6918-1-tklauser@distanz.ch
4 years ago bpf: Add missing annotations for __bpf_prog_enter() and __bpf_prog_exit()
Jules Irenge [Wed, 11 Mar 2020 01:09:01 +0000 (01:09 +0000)]
bpf: Add missing annotations for __bpf_prog_enter() and __bpf_prog_exit()

Sparse reports a warning at __bpf_prog_enter() and __bpf_prog_exit()

warning: context imbalance in __bpf_prog_enter() - wrong count at exit
warning: context imbalance in __bpf_prog_exit() - unexpected unlock

The root cause is the missing annotation at __bpf_prog_enter()
and __bpf_prog_exit()

Add the missing __acquires(RCU) annotation
Add the missing __releases(RCU) annotation
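
As a rough illustration of how such annotations attach to a lock/unlock
pair, here is a user-space sketch. The macro definitions follow the usual
pattern of expanding to a sparse context attribute under __CHECKER__ and
to nothing otherwise; the demo_* functions and the depth counter are
hypothetical stand-ins, not the kernel functions.

```c
#include <assert.h>

/* Outside sparse (__CHECKER__ unset) the annotations expand to nothing. */
#ifdef __CHECKER__
# define __acquires(x)	__attribute__((context(x, 0, 1)))
# define __releases(x)	__attribute__((context(x, 1, 0)))
#else
# define __acquires(x)
# define __releases(x)
#endif

/* Stands in for the read-side lock nesting depth. */
static int demo_depth;

/* Returns with the lock held, so sparse must expect one more context. */
static void demo_prog_enter(void)
	__acquires(RCU)
{
	demo_depth++;
}

/* Drops the lock taken by demo_prog_enter(). */
static void demo_prog_exit(void)
	__releases(RCU)
{
	demo_depth--;
}
```

The annotations change nothing at run time; they only tell sparse that the
imbalance inside each function is intentional and paired across the two.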

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200311010908.42366-2-jbi.octave@gmail.com
4 years ago bpf_helpers_doc.py: Fix warning when compiling bpftool
Carlos Neira [Fri, 13 Mar 2020 15:46:50 +0000 (12:46 -0300)]
bpf_helpers_doc.py: Fix warning when compiling bpftool

When compiling bpftool the following warning is found: "declaration of
'struct bpf_pidns_info' will not be visible outside of this function."
This patch adds struct bpf_pidns_info to type_fwds array to fix this.

Fixes: b4490c5c4e02 ("bpf: Added new helper bpf_get_ns_current_pid_tgid")
Signed-off-by: Carlos Neira <cneirabustos@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200313154650.13366-1-cneirabustos@gmail.com
4 years ago selftests/bpf: Fix usleep() implementation
Andrii Nakryiko [Fri, 13 Mar 2020 06:18:37 +0000 (23:18 -0700)]
selftests/bpf: Fix usleep() implementation

nanosleep syscall expects pointer to struct timespec, not nanoseconds
directly. Current implementation fulfills its purpose of invoking nanosleep
syscall, but doesn't really provide sleeping capabilities, which can cause
flakiness for tests relying on usleep() to wait for something.
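
A sketch of the corrected calling convention: convert microseconds into a
struct timespec and pass a pointer to it to the nanosleep syscall. The
helper names usec_to_timespec and sleep_usec are hypothetical and chosen
for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Convert microseconds to the struct timespec that nanosleep expects. */
static struct timespec usec_to_timespec(unsigned long usec)
{
	struct timespec ts = {
		.tv_sec = usec / 1000000,
		.tv_nsec = (usec % 1000000) * 1000,
	};

	return ts;
}

/* Sleep by invoking the nanosleep syscall directly with a timespec. */
static int sleep_usec(unsigned long usec)
{
	struct timespec ts = usec_to_timespec(usec);

	return (int)syscall(__NR_nanosleep, &ts, NULL);
}
```

Passing the raw number instead of &ts still enters the syscall, which is
why tracing tests kept working, but the kernel then treats the number as a
user pointer and no real sleep takes place.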

Fixes: ec12a57b822c ("selftests/bpf: Guarantee that useep() calls nanosleep() syscall")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200313061837.3685572-1-andriin@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4 years ago Merge branch 'generalize-bpf-ksym'
Alexei Starovoitov [Fri, 13 Mar 2020 02:23:12 +0000 (19:23 -0700)]
Merge branch 'generalize-bpf-ksym'

Jiri Olsa says:

====================
This patchset makes trampoline and dispatcher objects
visible in /proc/kallsyms.

  $ sudo cat /proc/kallsyms | tail -20
  ...
  ffffffffa050f000 t bpf_prog_5a2b06eab81b8f51    [bpf]
  ffffffffa0511000 t bpf_prog_6deef7357e7b4530    [bpf]
  ffffffffa0542000 t bpf_trampoline_13832 [bpf]
  ffffffffa0548000 t bpf_prog_96f1b5bf4e4cc6dc_mutex_lock [bpf]
  ffffffffa0572000 t bpf_prog_d1c63e29ad82c4ab_bpf_prog1  [bpf]
  ffffffffa0585000 t bpf_prog_e314084d332a5338__dissect   [bpf]
  ffffffffa0587000 t bpf_prog_59785a79eac7e5d2_mutex_unlock       [bpf]
  ffffffffa0589000 t bpf_prog_d0db6e0cac050163_mutex_lock [bpf]
  ffffffffa058d000 t bpf_prog_d8f047721e4d8321_bpf_prog2  [bpf]
  ffffffffa05df000 t bpf_trampoline_25637 [bpf]
  ffffffffa05e3000 t bpf_prog_d8f047721e4d8321_bpf_prog2  [bpf]
  ffffffffa05e5000 t bpf_prog_3b185187f1855c4c    [bpf]
  ffffffffa05e7000 t bpf_prog_d8f047721e4d8321_bpf_prog2  [bpf]
  ffffffffa05eb000 t bpf_prog_93cebb259dd5c4b2_do_sys_open        [bpf]
  ffffffffa0677000 t bpf_dispatcher_xdp   [bpf]

v5 changes:
  - keeping just 1 bpf_tree for all the objects and adding flag
    to recognize bpf_objects when searching for exception tables [Alexei]
  - no need for is_bpf_image_address call in kernel_text_address [Alexei]
  - removed the bpf_image tree, because it's no longer needed

v4 changes:
  - add trampoline and dispatcher to kallsyms once it's allocated [Alexei]
  - omit the symbols sorting for kallsyms [Alexei]
  - small title change in one patch [Song]
  - some function renames:
     bpf_get_prog_name to bpf_prog_ksym_set_name
     bpf_get_prog_addr_region to bpf_prog_ksym_set_addr
  - added acks to changelogs
  - I checked and there'll be conflict on perftool side with
    upcoming changes from Adrian Hunter (text poke events),
    so I think it's better if Arnaldo takes the perf changes
    via perf tree and we will solve all conflicts there

v3 changes:
  - use container_of directly in bpf_get_ksym_start  [Daniel]
  - add more changelog explanations for ksym addresses [Daniel]

v2 changes:
  - omit extra condition in __bpf_ksym_add for sorting code (Andrii)
  - rename bpf_kallsyms_tree_ops to bpf_ksym_tree (Andrii)
  - expose only executable code in kallsyms (Andrii)
  - use full trampoline key as its kallsyms id (Andrii)
  - explained the BPF_TRAMP_REPLACE case (Andrii)
  - small format changes in bpf_trampoline_link_prog/bpf_trampoline_unlink_prog (Andrii)
  - propagate error value in bpf_dispatcher_update and update kallsym if it's successful (Andrii)
  - get rid of __always_inline for bpf_ksym_tree callbacks (Andrii)
  - added KSYMBOL notification for bpf_image add/removal
  - added perf tools changes to properly display trampoline/dispatcher
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4 years ago bpf: Remove bpf_image tree
Jiri Olsa [Thu, 12 Mar 2020 19:56:07 +0000 (20:56 +0100)]
bpf: Remove bpf_image tree

Now that we have all the objects (bpf_prog, bpf_trampoline,
bpf_dispatcher) linked in bpf_tree, there's no need to have
separate bpf_image tree for images.

Reverting the bpf_image tree together with struct bpf_image,
because it's no longer needed.

Also removing bpf_image_alloc function and adding the original
bpf_jit_alloc_exec_page interface instead.

The kernel_text_address function can now rely only on is_bpf_text_address,
because it checks the bpf_tree that contains all the objects.

Keeping bpf_image_ksym_add and bpf_image_ksym_del because they are
useful wrappers with perf's ksymbol interface calls.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200312195610.346362-13-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4 years ago bpf: Add dispatchers to kallsyms
Jiri Olsa [Thu, 12 Mar 2020 19:56:06 +0000 (20:56 +0100)]
bpf: Add dispatchers to kallsyms

Adding dispatchers to kallsyms. Each one is displayed as
  bpf_dispatcher_<NAME>

where NAME is the name of the dispatcher.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200312195610.346362-12-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4 years ago bpf: Add trampolines to kallsyms
Jiri Olsa [Thu, 12 Mar 2020 19:56:05 +0000 (20:56 +0100)]
bpf: Add trampolines to kallsyms

Adding trampolines to kallsyms. It's displayed as
  bpf_trampoline_<ID> [bpf]

where ID is the BTF id of the trampoline function.
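
The naming scheme can be sketched with a plain snprintf() call. The helper
name demo_trampoline_name and the buffer size constant are assumptions
made for this illustration; only the bpf_trampoline_<ID> format comes from
the text above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_KSYM_NAME_LEN 128	/* assumed buffer size for this sketch */

/* Format the kallsyms name for a trampoline with the given id. */
static void demo_trampoline_name(char *buf, size_t len,
				 unsigned long long id)
{
	assert(len > 0);
	snprintf(buf, len, "bpf_trampoline_%llu", id);
}
```

For id 13832 this yields bpf_trampoline_13832, matching the form of the
entries shown in the /proc/kallsyms listing of the cover letter.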

Adding bpf_image_ksym_add/del functions that set up
the start/end values and call the KSYMBOL perf event
handlers.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200312195610.346362-11-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>