linux-2.6-microblaze.git
3 years agobpf: Add size arg to build_id_parse function
Jiri Olsa [Thu, 14 Jan 2021 13:40:43 +0000 (14:40 +0100)]
bpf: Add size arg to build_id_parse function

It's possible to have other build id types (other than default SHA1).
Currently there's also ld support for MD5 build id.

Adding size argument to build_id_parse function, that returns (if defined)
size of the parsed build id, so we can recognize the build id type.
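
As a rough userspace illustration of why the size matters, the sketch below
walks a buffer of ELF notes and reports both the build id bytes and their
length (20 for SHA1, 16 for MD5). It is not the kernel function: the names
ex_build_id_parse, ex_nhdr and the EX_* constants are made up for this
example, and the buffer is assumed to hold 4-byte-aligned notes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EX_NT_GNU_BUILD_ID   3   /* ELF note type of a GNU build id */
#define EX_BUILD_ID_SIZE_MAX 20  /* large enough for a SHA1 build id */

/* Same three fields as Elf32_Nhdr/Elf64_Nhdr; hypothetical local copy. */
struct ex_nhdr {
	uint32_t n_namesz;
	uint32_t n_descsz;
	uint32_t n_type;
};

/* Name and descriptor are each padded to a 4-byte boundary. */
static size_t ex_align4(size_t n)
{
	return (n + 3) & ~(size_t)3;
}

/*
 * Copy the first GNU build id found in 'notes' into 'build_id' (which must
 * hold EX_BUILD_ID_SIZE_MAX bytes) and, if 'size' is not NULL, store the
 * number of valid bytes there. Returns 0 on success, -1 if none is found.
 */
static int ex_build_id_parse(const unsigned char *notes, size_t len,
			     unsigned char *build_id, uint32_t *size)
{
	size_t pos = 0;

	while (pos + sizeof(struct ex_nhdr) <= len) {
		struct ex_nhdr nh;
		size_t name_off, desc_off, next;

		memcpy(&nh, notes + pos, sizeof(nh));
		name_off = pos + sizeof(nh);
		desc_off = name_off + ex_align4(nh.n_namesz);
		next = desc_off + ex_align4(nh.n_descsz);
		if (next > len)
			break;	/* truncated note */

		if (nh.n_type == EX_NT_GNU_BUILD_ID &&
		    nh.n_namesz == 4 &&
		    memcmp(notes + name_off, "GNU", 4) == 0 &&
		    nh.n_descsz > 0 && nh.n_descsz <= EX_BUILD_ID_SIZE_MAX) {
			memcpy(build_id, notes + desc_off, nh.n_descsz);
			/* Zero the tail so shorter ids compare cleanly. */
			memset(build_id + nh.n_descsz, 0,
			       EX_BUILD_ID_SIZE_MAX - nh.n_descsz);
			if (size)
				*size = nh.n_descsz;
			return 0;
		}
		pos = next;
	}
	return -1;
}
```

A caller that only wants the bytes can pass NULL for size, which mirrors the
"if defined" wording above.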

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210114134044.1418404-3-jolsa@kernel.org
3 years agobpf: Move stack_map_get_build_id into lib
Jiri Olsa [Thu, 14 Jan 2021 13:40:42 +0000 (14:40 +0100)]
bpf: Move stack_map_get_build_id into lib

Moving stack_map_get_build_id into lib with
declaration in linux/buildid.h header:

  int build_id_parse(struct vm_area_struct *vma, unsigned char *build_id);

This function returns build id for given struct vm_area_struct.
There is no functional change to stack_map_get_build_id function.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210114134044.1418404-2-jolsa@kernel.org
3 years agoMerge branch 'Atomics for eBPF'
Alexei Starovoitov [Fri, 15 Jan 2021 02:34:30 +0000 (18:34 -0800)]
Merge branch 'Atomics for eBPF'

Brendan Jackman says:

====================

There's still one unresolved review comment from John[3] which I
will resolve with a followup patch.

Differences from v6->v7 [1]:

* Fixed riscv build error detected by 0-day robot.

Differences from v5->v6 [1]:

* Carried Björn Töpel's ack for RISC-V code, plus a couple more acks from
  Yonghong.

* Doc fixups.

* Trivial cleanups.

Differences from v4->v5 [1]:

* Fixed bogus type casts in interpreter that led to warnings from
  the 0day robot.

* Dropped feature-detection for Clang per Andrii's suggestion in [4].
  The selftests will now fail to build unless you have llvm-project
  commit 286daafd6512. The ENABLE_ATOMICS_TESTS macro is still needed
  to support the no_alu32 tests.

* Carried some Acks from John and Yonghong.

* Dropped confusing usage of __atomic_exchange from prog_test in
  favour of __sync_lock_test_and_set.

* [Really] got rid of all the forest of instruction macros
  (BPF_ATOMIC_FETCH_ADD and friends); now there's just BPF_ATOMIC_OP
  to define all the instructions as we use them in the verifier
  tests. This makes the atomic ops less special in that API, and I
  don't think the resulting usage is actually any harder to read.

Differences from v3->v4 [1]:

* Added one Ack from Yonghong. He acked some other patches but those
  have now changed non-trivially so I didn't add those acks.

* Fixups to commit messages.

* Fixed disassembly and comments: first arg to atomic_fetch_* is a
  pointer.

* Improved prog_test efficiency. BPF progs are now all loaded in a
  single call, then the skeleton is re-used for each subtest.

* Dropped use of tools/build/feature in favour of a one-liner in the
  Makefile.

* Dropped the commit that created an emit_neg helper in the x86
  JIT. It's not used any more (it wasn't used in v3 either).

* Combined all the different filter.h macros (used to be
  BPF_ATOMIC_ADD, BPF_ATOMIC_FETCH_ADD, BPF_ATOMIC_AND, etc) into
  just BPF_ATOMIC32 and BPF_ATOMIC64.

* Removed some references to BPF_STX_XADD from tools/, samples/ and
  lib/ that I missed before.

Differences from v2->v3 [1]:

* More minor fixes and naming/comment changes

* Dropped atomic subtract: compilers can implement this by preceding
  an atomic add with a NEG instruction (which is what the x86 JIT did
  under the hood anyway).

* Dropped the use of -mcpu=v4 in the Clang BPF command-line; there is
  no longer an architecture version bump. Instead a feature test is
  added to Kbuild - it builds a source file to check if Clang
  supports BPF atomics.

* Fixed the prog_test so it no longer breaks
  test_progs-no_alu32. This requires some ifdef acrobatics to avoid
  complicating the prog_tests model where the same userspace code
  exercises both the normal and no_alu32 BPF test objects, using the
  same skeleton header.

Differences from v1->v2 [1]:

* Fixed mistakes in the netronome driver

* Added sub, add, or, xor operations

* The above led to some refactors to keep things readable. (Maybe I
  should have just waited until I'd implemented these before starting
  the review...)

* Replaced BPF_[CMP]SET | BPF_FETCH with just BPF_[CMP]XCHG, which
  include the BPF_FETCH flag

* Added a bit of documentation. Suggestions welcome for more places
  to dump this info...

The prog_test that's added depends on Clang/LLVM features added by
Yonghong in commit 286daafd6512 (was
https://reviews.llvm.org/D72184).

This only includes a JIT implementation for x86_64 - I don't plan to
implement JIT support myself for other architectures.

Operations
==========

This patchset adds atomic operations to the eBPF instruction set. The
use-case that motivated this work was a trivial and efficient way to
generate globally-unique cookies in BPF progs, but I think it's
obvious that these features are pretty widely applicable.  The
instructions that are added here can be summarised with this list of
kernel operations:

* atomic[64]_[fetch_]add
* atomic[64]_[fetch_]and
* atomic[64]_[fetch_]or
* atomic[64]_[fetch_]xor
* atomic[64]_xchg
* atomic[64]_cmpxchg

The following are left out of scope for this effort:

* 16 and 8 bit operations
* Explicit memory barriers

Encoding
========

I originally planned to add new values for bpf_insn.opcode. This was
rather unpleasant: the opcode space has holes in it but no entire
instruction classes[2]. Yonghong Song had a better idea: use the
immediate field of the existing STX XADD instruction to encode the
operation. This works nicely, without breaking existing programs,
because the immediate field is currently reserved-must-be-zero, and
extra-nicely because BPF_ADD happens to be zero.

Note that this of course makes immediate-source atomic operations
impossible. It's hard to imagine a measurable speedup from such
instructions, and if it existed it would certainly not benefit x86,
which has no support for them.

The BPF_OP opcode fields are re-used in the immediate, and an
additional flag BPF_FETCH is used to mark instructions that should
fetch a pre-modification value from memory.

So, BPF_XADD is now called BPF_ATOMIC (the old name is kept to avoid
breaking userspace builds), and where we previously had .imm = 0, we
now have .imm = BPF_ADD (which is 0).
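
To make the encoding concrete, here is a small standalone sketch that builds
such instructions. The struct and the EX_* constants are local copies written
for this example; the numeric values are what I believe the uapi header uses,
so treat them as assumptions and check include/uapi/linux/bpf.h before relying
on them.

```c
#include <stdint.h>

/* Assumed values, copied in so the example has no kernel header dependency. */
#define EX_BPF_STX	0x03	/* instruction class */
#define EX_BPF_DW	0x18	/* 64-bit size */
#define EX_BPF_ATOMIC	0xc0	/* mode; same value as the old BPF_XADD */

#define EX_BPF_ADD	0x00	/* BPF_OP values re-used in the immediate */
#define EX_BPF_OR	0x40
#define EX_BPF_AND	0x50
#define EX_BPF_XOR	0xa0
#define EX_BPF_FETCH	0x01	/* flag: also fetch the old value */
#define EX_BPF_XCHG	(0xe0 | EX_BPF_FETCH)
#define EX_BPF_CMPXCHG	(0xf0 | EX_BPF_FETCH)

/* Hypothetical mirror of struct bpf_insn. */
struct ex_bpf_insn {
	uint8_t code;
	uint8_t dst_reg:4;
	uint8_t src_reg:4;
	int16_t off;
	int32_t imm;
};

/* Build a 64-bit atomic instruction; 'op' selects the operation via .imm. */
static struct ex_bpf_insn ex_atomic64(uint8_t dst, uint8_t src,
				      int16_t off, int32_t op)
{
	struct ex_bpf_insn insn = {
		.code = EX_BPF_STX | EX_BPF_DW | EX_BPF_ATOMIC,
		.dst_reg = dst,
		.src_reg = src,
		.off = off,
		.imm = op,
	};

	return insn;
}
```

With these values, ex_atomic64(1, 2, 0, EX_BPF_ADD) is bit-for-bit the old
STX XADD instruction (.imm stays 0), which is why existing programs keep
working.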

Operands
========

Reg-source eBPF instructions only have two operands, while these
atomic operations have up to four. To avoid needing to encode
additional operands, then:

- One of the input registers is re-used as an output register
  (e.g. atomic_fetch_add both reads from and writes to the source
  register).

- Where necessary (i.e. for cmpxchg), R0 is "hard-coded" as one of
  the operands.

This approach also allows the new eBPF instructions to map directly
to single x86 instructions.
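
The register conventions above can be modelled in plain C. This is only a
model, not interpreter code: the register file is an array, the function
names are invented for the example, and the GCC/Clang __atomic builtins stand
in for the kernel's atomic helpers.

```c
#include <stdint.h>

/*
 * Model of a 64-bit atomic fetch-add: regs[dst] + off is the address,
 * regs[src] supplies the addend and receives the old memory value.
 */
static void ex_atomic_fetch_add64(uint64_t *regs, int dst, int src,
				  int16_t off)
{
	uint64_t *addr = (uint64_t *)(uintptr_t)(regs[dst] + off);

	regs[src] = __atomic_fetch_add(addr, regs[src], __ATOMIC_SEQ_CST);
}

/*
 * Model of a 64-bit cmpxchg: R0 (regs[0]) is the implicit "old" operand
 * and also receives the value that was found in memory.
 */
static void ex_atomic_cmpxchg64(uint64_t *regs, int dst, int src,
				int16_t off)
{
	uint64_t *addr = (uint64_t *)(uintptr_t)(regs[dst] + off);
	uint64_t expected = regs[0];

	/* On failure the builtin writes the current value to 'expected';
	 * on success 'expected' already equals it. */
	__atomic_compare_exchange_n(addr, &expected, regs[src], 0,
				    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
	regs[0] = expected;
}
```

Both functions need only the two register numbers and the offset that an
instruction can encode, which is the point of re-using a register as output.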

[1] Previous iterations:
    v1: https://lore.kernel.org/bpf/20201123173202.1335708-1-jackmanb@google.com/
    v2: https://lore.kernel.org/bpf/20201127175738.1085417-1-jackmanb@google.com/
    v3: https://lore.kernel.org/bpf/X8kN7NA7bJC7aLQI@google.com/
    v4: https://lore.kernel.org/bpf/20201207160734.2345502-1-jackmanb@google.com/
    v5: https://lore.kernel.org/bpf/20201215121816.1048557-1-jackmanb@google.com/
    v6: https://lore.kernel.org/bpf/20210112154235.2192781-1-jackmanb@google.com/

[2] Visualisation of eBPF opcode space:
    https://gist.github.com/bjackman/00fdad2d5dfff601c1918bc29b16e778

[3] Comment from John about propagating bounds in verifier:
    https://lore.kernel.org/bpf/5fcf0fbcc8aa8_9ab320853@john-XPS-13-9370.notmuch/

[4] Mail from Andrii about not supporting old Clang in selftests:
    https://lore.kernel.org/bpf/CAEf4BzYBddPaEzRUs=jaWSo5kbf=LZdb7geAUVj85GxLQztuAQ@mail.gmail.com/
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
3 years agobpf: Document new atomic instructions
Brendan Jackman [Thu, 14 Jan 2021 18:17:51 +0000 (18:17 +0000)]
bpf: Document new atomic instructions

Document new atomic instructions.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-12-jackmanb@google.com
3 years agobpf: Add tests for new BPF atomic operations
Brendan Jackman [Thu, 14 Jan 2021 18:17:50 +0000 (18:17 +0000)]
bpf: Add tests for new BPF atomic operations

The prog_test that's added depends on Clang/LLVM features added by
Yonghong in commit 286daafd6512 (was https://reviews.llvm.org/D72184).

Note the use of a define called ENABLE_ATOMICS_TESTS: this is used
to:

 - Avoid breaking the build for people on old versions of Clang
 - Avoid needing separate lists of test objects for no_alu32, where
   atomics are not supported even if Clang has the feature.

The atomics_test.o BPF object is built unconditionally both for
test_progs and test_progs-no_alu32. For test_progs, if Clang supports
atomics, ENABLE_ATOMICS_TESTS is defined, so it includes the proper
test code. Otherwise, progs and global vars are defined anyway, as
stubs; this means that the skeleton user code still builds.

The atomics_test.o userspace object is built once and used for both
test_progs and test_progs-no_alu32. A variable called skip_tests is
defined in the BPF object's data section, which tells the userspace
object whether to skip the atomics test.
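
The pattern can be sketched in ordinary C as below. This is a simplified
stand-in, not the selftest source: ex_add_prog plays the role of a BPF
program, ex_run_add_subtest plays the role of the userspace test, and the
variable names other than skip_tests are invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * The build is assumed to define ENABLE_ATOMICS_TESTS only when the
 * compiler supports the atomics and the object is not a no_alu32 build.
 */
#ifdef ENABLE_ATOMICS_TESTS
bool skip_tests = false;
#else
bool skip_tests = true;
#endif

/* Globals exist in both configurations so user code always links. */
uint64_t add64_value = 1;
uint64_t add64_result = 0;

/* Stand-in for the BPF program: a stub unless the macro is defined. */
int ex_add_prog(void)
{
#ifdef ENABLE_ATOMICS_TESTS
	add64_result = __sync_fetch_and_add(&add64_value, 2);
#endif
	return 0;
}

/* Stand-in for the userspace side: 1 = skipped, 0 = pass, -1 = fail. */
int ex_run_add_subtest(void)
{
	if (skip_tests)
		return 1;

	ex_add_prog();
	return (add64_value == 3 && add64_result == 1) ? 0 : -1;
}
```

Because the symbols are present either way, one userspace object can serve
both builds and simply consult skip_tests at run time.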

Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-11-jackmanb@google.com
3 years agobpf: Add bitwise atomic instructions
Brendan Jackman [Thu, 14 Jan 2021 18:17:49 +0000 (18:17 +0000)]
bpf: Add bitwise atomic instructions

This adds instructions for

atomic[64]_[fetch_]and
atomic[64]_[fetch_]or
atomic[64]_[fetch_]xor

All these operations are isomorphic enough to implement with the same
verifier, interpreter, and x86 JIT code, hence being a single commit.

The main interesting thing here is that x86 doesn't directly support
the fetch_ versions of these operations, so we need to generate a CMPXCHG
loop in the JIT. This requires the use of two temporary registers;
IIUC it's safe to use BPF_REG_AX and x86's AUX_REG for this purpose.
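
For readers unfamiliar with the technique, a compare-and-exchange loop looks
roughly like this in C. It is a sketch of the general idea using the
GCC/Clang __atomic builtins, not the code the JIT emits; ex_fetch_and64 is a
name made up for the example.

```c
#include <stdint.h>

/* Atomically AND 'val' into *addr and return the previous value. */
static uint64_t ex_fetch_and64(uint64_t *addr, uint64_t val)
{
	/* First temporary: our current belief about the memory contents. */
	uint64_t old = __atomic_load_n(addr, __ATOMIC_RELAXED);

	/*
	 * Second temporary: the new value (old & val), recomputed on every
	 * iteration. If another writer got in between, the exchange fails,
	 * 'old' is refreshed with the current contents, and we retry.
	 */
	while (!__atomic_compare_exchange_n(addr, &old, old & val, 0,
					    __ATOMIC_SEQ_CST,
					    __ATOMIC_RELAXED))
		;

	return old;
}
```

The same loop works for OR and XOR by swapping the operator, which matches
the remark that these operations share one implementation.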

Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-10-jackmanb@google.com
3 years agobpf: Pull out a macro for interpreting atomic ALU operations
Brendan Jackman [Thu, 14 Jan 2021 18:17:48 +0000 (18:17 +0000)]
bpf: Pull out a macro for interpreting atomic ALU operations

Since the atomic operations that are added in subsequent commits are
all isomorphic with BPF_ADD, pull out a macro to avoid the
interpreter becoming dominated by lines of atomic-related code.

Note that this sacrifices interpreter performance (combining
STX_ATOMIC_W and STX_ATOMIC_DW into a single switch case means that we
need an extra conditional branch to differentiate them) in favour of
compact and (relatively!) simple C code.
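
A toy version of such a macro is shown below to illustrate the trade-off. It
is not the interpreter's macro: the EX_* names, ex_interp_atomic and the use
of __atomic builtins are assumptions made for a self-contained example, and
the opcode values are my recollection of the uapi ones.

```c
#include <stdint.h>

#define EX_BPF_ADD	0x00
#define EX_BPF_OR	0x40
#define EX_BPF_AND	0x50
#define EX_BPF_XOR	0xa0
#define EX_BPF_FETCH	0x01

/*
 * Expands to two switch cases (plain and fetch variant) for one operation.
 * Uses 'addr', 'src' and 'is_64bit' from the enclosing function. The
 * is_64bit test is the extra conditional branch mentioned above.
 */
#define EX_ATOMIC_ALU_OP(BOP, BUILTIN)					\
	case BOP:							\
		if (is_64bit)						\
			(void)BUILTIN((uint64_t *)addr, src,		\
				      __ATOMIC_SEQ_CST);		\
		else							\
			(void)BUILTIN((uint32_t *)addr, (uint32_t)src,	\
				      __ATOMIC_SEQ_CST);		\
		break;							\
	case (BOP) | EX_BPF_FETCH:					\
		if (is_64bit)						\
			src = BUILTIN((uint64_t *)addr, src,		\
				      __ATOMIC_SEQ_CST);		\
		else							\
			src = BUILTIN((uint32_t *)addr, (uint32_t)src,	\
				      __ATOMIC_SEQ_CST);		\
		break;

/* Returns 0 on success, -1 for an immediate this sketch does not know. */
static int ex_interp_atomic(void *addr, uint64_t *src_reg, int32_t imm,
			    int is_64bit)
{
	uint64_t src = *src_reg;

	switch (imm) {
	EX_ATOMIC_ALU_OP(EX_BPF_ADD, __atomic_fetch_add)
	EX_ATOMIC_ALU_OP(EX_BPF_AND, __atomic_fetch_and)
	EX_ATOMIC_ALU_OP(EX_BPF_OR, __atomic_fetch_or)
	EX_ATOMIC_ALU_OP(EX_BPF_XOR, __atomic_fetch_xor)
	default:
		return -1;
	}

	*src_reg = src;
	return 0;
}
```

Each new operation costs one line in the switch, at the price of deciding
the operand width at run time instead of via separate cases.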

Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-9-jackmanb@google.com
3 years agobpf: Add instructions for atomic_[cmp]xchg
Brendan Jackman [Thu, 14 Jan 2021 18:17:47 +0000 (18:17 +0000)]
bpf: Add instructions for atomic_[cmp]xchg

This adds two atomic opcodes, both of which include the BPF_FETCH
flag. XCHG without the BPF_FETCH flag would naturally encode
atomic_set. This is not supported because it would be of limited
value to userspace (it doesn't imply any barriers). CMPXCHG without
BPF_FETCH would be an atomic compare-and-write. We don't have such
an operation in the kernel so it isn't provided to BPF either.

There are two significant design decisions made for the CMPXCHG
instruction:

 - This operation fundamentally has 3 operands, but we only have two
   register fields. Therefore the operand we compare against (the
   kernel's API calls it 'old') is hard-coded to be R0. x86 has a
   similar design (and A64 doesn't have this problem).

   A potential alternative might be to encode the other operand's
   register number in the immediate field.

 - The kernel's atomic_cmpxchg returns the old value, while the C11
   userspace APIs return a boolean indicating the comparison
   result. Which should BPF do? A64 returns the old value. x86 returns
   the old value in the hard-coded register (and also sets a
   flag). That means return-old-value is easier to JIT, so that's
   what we use.
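
The two return conventions are easy to convert between, as this sketch
suggests. The helper names are hypothetical and the GCC/Clang builtin stands
in for the kernel's atomic_cmpxchg; it only illustrates that nothing is lost
by returning the old value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Kernel-style convention: return the value that was found in memory. */
static uint64_t ex_cmpxchg_old(uint64_t *addr, uint64_t old,
			       uint64_t new_val)
{
	/* 'old' ends up holding the previous contents in both outcomes. */
	__atomic_compare_exchange_n(addr, &old, new_val, 0,
				    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
	return old;
}

/* C11-style convention, derived from the one above with one compare. */
static bool ex_cmpxchg_bool(uint64_t *addr, uint64_t expected,
			    uint64_t new_val)
{
	return ex_cmpxchg_old(addr, expected, new_val) == expected;
}
```

A BPF program that wants the boolean can therefore compare R0 after the
instruction against the value it placed there beforehand.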

Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-8-jackmanb@google.com
3 years agobpf: Add BPF_FETCH field / create atomic_fetch_add instruction
Brendan Jackman [Thu, 14 Jan 2021 18:17:46 +0000 (18:17 +0000)]
bpf: Add BPF_FETCH field / create atomic_fetch_add instruction

The BPF_FETCH field can be set in bpf_insn.imm, for BPF_ATOMIC
instructions, in order to have the previous value of the
atomically-modified memory location loaded into the src register
after an atomic op is carried out.

Suggested-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-7-jackmanb@google.com
3 years agobpf: Move BPF_STX reserved field check into BPF_STX verifier code
Brendan Jackman [Thu, 14 Jan 2021 18:17:45 +0000 (18:17 +0000)]
bpf: Move BPF_STX reserved field check into BPF_STX verifier code

I can't find a reason why this code is in resolve_pseudo_ldimm64;
since I'll be modifying it in a subsequent commit, tidy it up.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-6-jackmanb@google.com
3 years agobpf: Rename BPF_XADD and prepare to encode other atomics in .imm
Brendan Jackman [Thu, 14 Jan 2021 18:17:44 +0000 (18:17 +0000)]
bpf: Rename BPF_XADD and prepare to encode other atomics in .imm

A subsequent patch will add additional atomic operations. These new
operations will use the same opcode field as the existing XADD, with
the immediate discriminating different operations.

In preparation, rename the instruction mode BPF_ATOMIC and start
calling the zero immediate BPF_ADD.

This is possible (doesn't break existing valid BPF progs) because the
immediate field is currently reserved MBZ and BPF_ADD is zero.

All uses are removed from the tree but the BPF_XADD definition is
kept around to avoid breaking builds for people including kernel
headers.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Björn Töpel <bjorn.topel@gmail.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-5-jackmanb@google.com
3 years agobpf: x86: Factor out a lookup table for some ALU opcodes
Brendan Jackman [Thu, 14 Jan 2021 18:17:43 +0000 (18:17 +0000)]
bpf: x86: Factor out a lookup table for some ALU opcodes

A later commit will need to look up a subset of these opcodes. To
avoid duplicating code, pull out a table.

The shift opcodes won't be needed by that later commit, but they're
already duplicated, so fold them into the table anyway.
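
A table of this kind might look as follows. This is a sketch from memory
rather than a copy of the JIT: the EX_* names and ex_lookup_alu are invented,
and while the x86 bytes for ADD/SUB/AND/OR/XOR (register to r/m form) are the
standard ones, the exact contents of the kernel table should be checked in
arch/x86/net/bpf_jit_comp.c.

```c
#include <stddef.h>
#include <stdint.h>

#define EX_BPF_OP(code)	((code) & 0xf0)
#define EX_BPF_ADD	0x00
#define EX_BPF_SUB	0x10
#define EX_BPF_OR	0x40
#define EX_BPF_AND	0x50
#define EX_BPF_LSH	0x60
#define EX_BPF_RSH	0x70
#define EX_BPF_XOR	0xa0
#define EX_BPF_ARSH	0xc0

/* Indexed by BPF_OP; unset entries stay 0, which no entry here uses. */
static const uint8_t ex_simple_alu_opcodes[] = {
	[EX_BPF_ADD]  = 0x01,	/* add r/m, r */
	[EX_BPF_SUB]  = 0x29,	/* sub r/m, r */
	[EX_BPF_AND]  = 0x21,	/* and r/m, r */
	[EX_BPF_OR]   = 0x09,	/* or  r/m, r */
	[EX_BPF_XOR]  = 0x31,	/* xor r/m, r */
	[EX_BPF_LSH]  = 0xE0,	/* ModR/M byte selecting /4 (shl) */
	[EX_BPF_RSH]  = 0xE8,	/* ModR/M byte selecting /5 (shr) */
	[EX_BPF_ARSH] = 0xF8,	/* ModR/M byte selecting /7 (sar) */
};

/* Look up the byte for a BPF opcode; returns -1 if it is not in the table. */
static int ex_lookup_alu(uint8_t bpf_code, uint8_t *out)
{
	size_t op = EX_BPF_OP(bpf_code);

	if (op >= sizeof(ex_simple_alu_opcodes) ||
	    ex_simple_alu_opcodes[op] == 0)
		return -1;

	*out = ex_simple_alu_opcodes[op];
	return 0;
}
```

The ALU emitter and the atomic emitter can then share the table instead of
each carrying its own switch statement.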

Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-4-jackmanb@google.com
3 years agobpf: x86: Factor out emission of REX byte
Brendan Jackman [Thu, 14 Jan 2021 18:17:42 +0000 (18:17 +0000)]
bpf: x86: Factor out emission of REX byte

The JIT case for encoding atomic ops is about to get more
complicated. In order to make the review & resulting code easier,
let's factor out some shared helpers.
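
For context, computing a REX prefix amounts to the following. This is a
generic sketch of the x86-64 encoding, not the helper added by this commit;
ex_rex and its parameters are illustrative names, and registers are given as
x86 register numbers 0..15.

```c
#include <stdint.h>

/*
 * Build a REX prefix byte: 0100WRXB.
 *   w     - non-zero for a 64-bit operand size
 *   reg   - register in the ModR/M reg field
 *   index - register in the SIB index field (0 if unused)
 *   rm    - register in the ModR/M r/m (or SIB base) field
 * Only bit 3 of each register number goes into the prefix; the low
 * three bits are encoded in ModR/M or SIB.
 */
static uint8_t ex_rex(int w, int reg, int index, int rm)
{
	uint8_t byte = 0x40;

	if (w)
		byte |= 0x08;
	byte |= (uint8_t)(((reg >> 3) & 1) << 2);
	byte |= (uint8_t)(((index >> 3) & 1) << 1);
	byte |= (uint8_t)((rm >> 3) & 1);
	return byte;
}
```

For example, a 64-bit operation on rax and rdi needs 0x48, while one
involving r8 and r9 sets the R and B bits as well.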

Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-3-jackmanb@google.com
3 years agobpf: x86: Factor out emission of ModR/M for *(reg + off)
Brendan Jackman [Thu, 14 Jan 2021 18:17:41 +0000 (18:17 +0000)]
bpf: x86: Factor out emission of ModR/M for *(reg + off)

The case for JITing atomics is about to get more complicated. Let's
factor out some common code to make the review and result more
readable.

NB the atomics code doesn't yet use the new helper - a subsequent
patch will add its use as a side-effect of other changes.
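
As background, encoding a *(reg + off) operand boils down to choosing between
an 8-bit and a 32-bit displacement. The sketch below shows that choice in
isolation; it is not the kernel helper, ex_emit_modrm_disp is a made-up name,
and it assumes the base register is not rsp/r12 (those need a SIB byte).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Write the ModR/M byte and displacement for [base + off] into 'out'
 * (at least 5 bytes). 'reg' and 'base' are x86 register numbers; only
 * their low three bits are used here, bit 3 belongs in the REX prefix.
 * Returns the number of bytes written: 2 for disp8, 5 for disp32.
 */
static size_t ex_emit_modrm_disp(uint8_t *out, int reg, int base,
				 int32_t off)
{
	uint8_t reg_bits = (uint8_t)((reg & 7) << 3);
	uint8_t rm_bits = (uint8_t)(base & 7);
	uint32_t u = (uint32_t)off;

	if (off >= -128 && off <= 127) {
		out[0] = 0x40 | reg_bits | rm_bits;	/* mod = 01 */
		out[1] = (uint8_t)off;
		return 2;
	}

	out[0] = 0x80 | reg_bits | rm_bits;		/* mod = 10 */
	out[1] = (uint8_t)(u & 0xff);			/* little endian */
	out[2] = (uint8_t)((u >> 8) & 0xff);
	out[3] = (uint8_t)((u >> 16) & 0xff);
	out[4] = (uint8_t)((u >> 24) & 0xff);
	return 5;
}
```

With reg = rax (0), base = rdi (7) and off = 8 this yields 47 08, the operand
bytes of "mov rax, [rdi+8]".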

Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-2-jackmanb@google.com
3 years agotools/bpftool: Add -Wall when building BPF programs
Ian Rogers [Wed, 13 Jan 2021 22:36:09 +0000 (14:36 -0800)]
tools/bpftool: Add -Wall when building BPF programs

No additional warnings are generated by enabling this, but having it
enabled will help avoid regressions.

Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210113223609.3358812-2-irogers@google.com
3 years agobpf, libbpf: Avoid unused function warning on bpf_tail_call_static
Ian Rogers [Wed, 13 Jan 2021 22:36:08 +0000 (14:36 -0800)]
bpf, libbpf: Avoid unused function warning on bpf_tail_call_static

Add inline to __always_inline, making it match linux/compiler.h.
Adding this avoids an unused function warning on bpf_tail_call_static
when compiling with -Wall.
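
The effect can be reproduced outside libbpf with a macro of the same shape.
The macro name ex_always_inline and the helper below are hypothetical, chosen
to avoid clashing with definitions that system headers may already provide.

```c
/*
 * With only __attribute__((always_inline)), an unused static function
 * still triggers -Wunused-function. Adding 'inline' makes it a static
 * inline function, which compilers do not warn about when unused.
 */
#define ex_always_inline inline __attribute__((always_inline))

static ex_always_inline int ex_helper(int x)
{
	return x + 1;
}
```

Headers full of static helpers rely on this, since most translation units
use only a few of them.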

Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210113223609.3358812-1-irogers@google.com
3 years agoMerge branch 'selftests/bpf: Some build fixes'
Andrii Nakryiko [Thu, 14 Jan 2021 03:05:40 +0000 (19:05 -0800)]
Merge branch 'selftests/bpf: Some build fixes'

Jean-Philippe Brucker says:

====================

A few fixes for cross-building the selftests out of tree. This will
enable wider automated testing on various Arm hardware.

Changes since v1 [1]:
* Use wildcard in patch 5
* Move the MAKE_DIRS declaration in patch 1

[1] https://lore.kernel.org/bpf/20210112135959.649075-1-jean-philippe@linaro.org/
====================

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
3 years agoselftests/bpf: Install btf_dump test cases
Jean-Philippe Brucker [Wed, 13 Jan 2021 16:33:20 +0000 (17:33 +0100)]
selftests/bpf: Install btf_dump test cases

The btf_dump test cannot access the original source files for comparison
when running the selftests out of tree, causing several failures:

awk: btf_dump_test_case_syntax.c: No such file or directory
...

Add those files to $(TEST_FILES) to have "make install" pick them up.

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210113163319.1516382-6-jean-philippe@linaro.org
3 years agoselftests/bpf: Fix installation of urandom_read
Jean-Philippe Brucker [Wed, 13 Jan 2021 16:33:19 +0000 (17:33 +0100)]
selftests/bpf: Fix installation of urandom_read

For out-of-tree builds, $(TEST_CUSTOM_PROGS) require the $(OUTPUT)
prefix, otherwise the kselftest lib doesn't know how to install them:

rsync: [sender] link_stat "tools/testing/selftests/bpf/urandom_read" failed: No such file or directory (2)

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210113163319.1516382-5-jean-philippe@linaro.org
3 years agoselftests/bpf: Move generated test files to $(TEST_GEN_FILES)
Jean-Philippe Brucker [Wed, 13 Jan 2021 16:33:18 +0000 (17:33 +0100)]
selftests/bpf: Move generated test files to $(TEST_GEN_FILES)

During an out-of-tree build, attempting to install the $(TEST_FILES)
into the $(OUTPUT) directory fails, because the objects were already
generated into $(OUTPUT):

rsync: [sender] link_stat "tools/testing/selftests/bpf/test_lwt_ip_encap.o" failed: No such file or directory (2)
rsync: [sender] link_stat "tools/testing/selftests/bpf/test_tc_edt.o" failed: No such file or directory (2)

Use $(TEST_GEN_FILES) instead.

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210113163319.1516382-4-jean-philippe@linaro.org
3 years agoselftests/bpf: Fix out-of-tree build
Jean-Philippe Brucker [Wed, 13 Jan 2021 16:33:17 +0000 (17:33 +0100)]
selftests/bpf: Fix out-of-tree build

When building out-of-tree, the .skel.h files are generated into the
$(OUTPUT) directory, rather than $(CURDIR). Add $(OUTPUT) to the include
paths.

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210113163319.1516382-3-jean-philippe@linaro.org
3 years agoselftests/bpf: Enable cross-building
Jean-Philippe Brucker [Wed, 13 Jan 2021 16:33:16 +0000 (17:33 +0100)]
selftests/bpf: Enable cross-building

Build bpftool and resolve_btfids using the host toolchain when
cross-compiling, since they are executed during build to generate the
selftests. Add a host build directory in order to build both host and
target version of libbpf. Build host tools using $(HOSTCC) defined in
Makefile.include.

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210113163319.1516382-2-jean-philippe@linaro.org
3 years agoMerge branch 'Support kernel module ksym variables'
Alexei Starovoitov [Tue, 12 Jan 2021 23:00:01 +0000 (15:00 -0800)]
Merge branch 'Support kernel module ksym variables'

Andrii Nakryiko says:

====================

Add support for using kernel module global variables (__ksym externs in BPF
program). BPF verifier will now support ldimm64 with src_reg=BPF_PSEUDO_BTF_ID
and non-zero insn[1].imm field, specifying module BTF's FD. In such case,
module BTF object, similarly to BPF maps referenced from ldimm64 with
src_reg=BPF_PSEUDO_MAP_FD, will be recorded in bpf_program's auxiliary data
and refcnt will be increased for both BTF object itself and its kernel module.
This makes sure kernel module won't be unloaded from under active attached BPF
program. These refcounts will be dropped when BPF program is unloaded.

New selftest validates all this is working as intended. bpf_testmod.ko is
extended with a per-CPU variable. The selftest expects the latest pahole changes
(soon to be released as v1.20) to generate per-CPU variable BTF info for
kernel module.

v2->v3:
  - added comments, addressed feedback (Yonghong, Hao);
v1->v2:
  - fixed a few compiler warnings, posted as separate pre-patches;
rfc->v1:
  - use sys_membarrier(MEMBARRIER_CMD_GLOBAL) (Alexei).

Cc: Hao Luo <haoluo@google.com>
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
3 years agoselftests/bpf: Test kernel module ksym externs
Andrii Nakryiko [Tue, 12 Jan 2021 07:55:20 +0000 (23:55 -0800)]
selftests/bpf: Test kernel module ksym externs

Add a per-CPU variable to bpf_testmod.ko and use it from a new selftest to
validate it works end-to-end.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Hao Luo <haoluo@google.com>
Link: https://lore.kernel.org/bpf/20210112075520.4103414-8-andrii@kernel.org
3 years agolibbpf: Support kernel module ksym externs
Andrii Nakryiko [Tue, 12 Jan 2021 07:55:19 +0000 (23:55 -0800)]
libbpf: Support kernel module ksym externs

Add support for searching for ksym externs not just in vmlinux BTF, but across
all module BTFs, similarly to how it's done for CO-RE relocations. Kernels
that expose module BTFs through sysfs are assumed to support new ldimm64
instruction extension with BTF FD provided in insn[1].imm field, so no extra
feature detection is performed.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Hao Luo <haoluo@google.com>
Link: https://lore.kernel.org/bpf/20210112075520.4103414-7-andrii@kernel.org
3 years agobpf: Support BPF ksym variables in kernel modules
Andrii Nakryiko [Tue, 12 Jan 2021 07:55:18 +0000 (23:55 -0800)]
bpf: Support BPF ksym variables in kernel modules

Add support for directly accessing kernel module variables from BPF programs
using special ldimm64 instructions. This functionality builds upon vmlinux
ksym support, but extends ldimm64 with src_reg=BPF_PSEUDO_BTF_ID to allow
specifying kernel module BTF's FD in insn[1].imm field.
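
The instruction pair can be pictured as below. This is an illustrative
encoding, not libbpf or verifier code: the struct, the EX_* constants and
ex_ld_ksym are local to the example, and the placement of the BTF type ID in
insn[0].imm as well as the value 3 for the pseudo source register are my
understanding of the uapi header rather than something stated above.

```c
#include <stdint.h>

#define EX_BPF_LD		0x00
#define EX_BPF_IMM		0x00
#define EX_BPF_DW		0x18
#define EX_BPF_PSEUDO_BTF_ID	3	/* assumed uapi value */

/* Hypothetical mirror of struct bpf_insn. */
struct ex_bpf_insn {
	uint8_t code;
	uint8_t dst_reg:4;
	uint8_t src_reg:4;
	int16_t off;
	int32_t imm;
};

/*
 * Fill the two slots of an ldimm64 that refers to a ksym variable.
 * module_btf_fd == 0 would mean "look in vmlinux BTF".
 */
static void ex_ld_ksym(struct ex_bpf_insn insn[2], uint8_t dst,
		       int32_t btf_type_id, int32_t module_btf_fd)
{
	insn[0] = (struct ex_bpf_insn){
		.code = EX_BPF_LD | EX_BPF_DW | EX_BPF_IMM,
		.dst_reg = dst,
		.src_reg = EX_BPF_PSEUDO_BTF_ID,
		.imm = btf_type_id,
	};
	/* Second slot carries only the extra immediate. */
	insn[1] = (struct ex_bpf_insn){
		.imm = module_btf_fd,
	};
}
```

Since insn[1].imm was previously zero for this pseudo instruction, a zero FD
keeps the old vmlinux-only meaning.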

During BPF program load time, verifier will resolve FD to BTF object and will
take reference on BTF object itself and, for module BTFs, corresponding module
as well, to make sure it won't be unloaded from under running BPF program. The
mechanism used is similar to how bpf_prog keeps track of used bpf_maps.

One interesting change is also in how per-CPU variable is determined. The
logic is to find .data..percpu data section in provided BTF, but both vmlinux
and module each have their own .data..percpu entries in BTF. So for module's
case, the search for DATASEC record needs to look at only module's added BTF
types. This is implemented with custom search function.
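
The idea of restricting the search can be shown with a much reduced model in
which each BTF type is just a name in one shared ID space. Everything here is
hypothetical (real BTF types carry a kind, and the real search would check
for DATASEC); only the choice of starting index is the point.

```c
#include <stddef.h>
#include <string.h>

/*
 * type_names[i] is the name of type ID i, or NULL if unnamed. IDs below
 * nr_base_types belong to vmlinux BTF; a module's own types follow them.
 * Pass nr_base_types == 0 when searching vmlinux BTF itself.
 * Returns the matching type ID, or -1 if there is none.
 */
static int ex_find_percpu_datasec(const char *const *type_names,
				  int nr_types, int nr_base_types)
{
	/* ID 0 is reserved for void, so never start below 1. */
	int i = nr_base_types > 0 ? nr_base_types : 1;

	for (; i < nr_types; i++) {
		if (type_names[i] &&
		    strcmp(type_names[i], ".data..percpu") == 0)
			return i;
	}
	return -1;
}
```

Starting at the module's first own ID is what keeps the lookup from
returning vmlinux's .data..percpu entry by mistake.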

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Hao Luo <haoluo@google.com>
Link: https://lore.kernel.org/bpf/20210112075520.4103414-6-andrii@kernel.org
3 years agoselftests/bpf: Sync RCU before unloading bpf_testmod
Andrii Nakryiko [Tue, 12 Jan 2021 07:55:17 +0000 (23:55 -0800)]
selftests/bpf: Sync RCU before unloading bpf_testmod

If some of the subtests use module BTFs through ksyms, they will cause
bpf_prog to take a refcount on bpf_testmod module, which will prevent it from
successfully unloading. Module's refcnt is decremented when bpf_prog is freed,
which generally happens in RCU callback. So we need to trigger
synchronize_rcu() in the kernel, which can be achieved nicely with
membarrier(MEMBARRIER_CMD_SHARED) or membarrier(MEMBARRIER_CMD_GLOBAL) syscall.
So do that in kernel_sync_rcu() and make it available to other tests inside the
test_progs. This synchronize_rcu() is called before attempting to unload
bpf_testmod.
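
A helper along those lines could be as small as the sketch below. It is
Linux-only and is an approximation of what the message describes, not a copy
of the selftest; ex_kernel_sync_rcu is a placeholder name, and the older
MEMBARRIER_CMD_SHARED spelling is used on the assumption that it is present
in both old and new uapi headers.

```c
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Ask the kernel for a global memory barrier, relying on the behaviour
 * described above (the kernel waits for an RCU grace period).
 * Returns 0 on success, -1 with errno set if the syscall is unavailable.
 */
static int ex_kernel_sync_rcu(void)
{
	return (int)syscall(__NR_membarrier, MEMBARRIER_CMD_SHARED, 0, 0);
}
```

A test would call this after freeing its BPF programs and before trying to
unload the module, and treat a failure as "could not synchronize".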

Fixes: 9f7fa225894c ("selftests/bpf: Add bpf_testmod kernel module for testing")
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Hao Luo <haoluo@google.com>
Link: https://lore.kernel.org/bpf/20210112075520.4103414-5-andrii@kernel.org
3 years agobpf: Declare __bpf_free_used_maps() unconditionally
Andrii Nakryiko [Tue, 12 Jan 2021 07:55:16 +0000 (23:55 -0800)]
bpf: Declare __bpf_free_used_maps() unconditionally

__bpf_free_used_maps() is always defined in kernel/bpf/core.c, while
include/linux/bpf.h is guarding it behind CONFIG_BPF_SYSCALL. Move it out of
that guard region and fix compiler warning.

Fixes: a2ea07465c8d ("bpf: Fix missing prog untrack in release_maps")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210112075520.4103414-4-andrii@kernel.org
3 years agobpf: Avoid warning when re-casting __bpf_call_base into __bpf_call_base_args
Andrii Nakryiko [Tue, 12 Jan 2021 07:55:15 +0000 (23:55 -0800)]
bpf: Avoid warning when re-casting __bpf_call_base into __bpf_call_base_args

BPF interpreter uses extra input argument, so re-casts __bpf_call_base into
__bpf_call_base_args. Avoid compiler warning about incompatible function
prototypes by casting to void * first.

Fixes: 1ea47e01ad6e ("bpf: add support for bpf_call to interpreter")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210112075520.4103414-3-andrii@kernel.org
3 years agobpf: Add bpf_patch_call_args prototype to include/linux/bpf.h
Andrii Nakryiko [Tue, 12 Jan 2021 07:55:14 +0000 (23:55 -0800)]
bpf: Add bpf_patch_call_args prototype to include/linux/bpf.h

Add bpf_patch_call_args() prototype. This function is called from BPF verifier
and only if CONFIG_BPF_JIT_ALWAYS_ON is not defined. This fixes compiler
warning about missing prototype in some kernel configurations.

Fixes: 1ea47e01ad6e ("bpf: add support for bpf_call to interpreter")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210112075520.4103414-2-andrii@kernel.org
3 years agobpf: Extend bind v4/v6 selftests for mark/prio/bindtoifindex
Daniel Borkmann [Mon, 11 Jan 2021 23:09:40 +0000 (00:09 +0100)]
bpf: Extend bind v4/v6 selftests for mark/prio/bindtoifindex

Extend existing cgroup bind4/bind6 tests to add coverage for setting and
retrieving SO_MARK, SO_PRIORITY and SO_BINDTOIFINDEX at the bind hook.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/384fdc90e5fa83f8335a37aa90fa2f5f3661929c.1610406333.git.daniel@iogearbox.net
3 years agobpf: Allow to retrieve sol_socket opts from sock_addr progs
Daniel Borkmann [Mon, 11 Jan 2021 23:09:39 +0000 (00:09 +0100)]
bpf: Allow to retrieve sol_socket opts from sock_addr progs

The _bpf_setsockopt() is able to set some of the SOL_SOCKET level options,
however, _bpf_getsockopt() has little support to actually retrieve them.
This small patch adds a few misc options such as SO_MARK, SO_PRIORITY and
SO_BINDTOIFINDEX. For the latter, both getter and setter are added. The mark and
priority in particular allow to retrieve the options from BPF cgroup hooks
to then implement custom behavior / settings on the syscall hooks compared
to other sockets that stick to the defaults, for example.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/cba44439b801e5ddc1170e5be787f4dc93a2d7f9.1610406333.git.daniel@iogearbox.net
3 years agobpf: Fix a verifier message for alloc size helper arg
Brendan Jackman [Tue, 12 Jan 2021 12:39:13 +0000 (12:39 +0000)]
bpf: Fix a verifier message for alloc size helper arg

The error message here is misleading: the argument will be rejected unless
it is a known constant.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210112123913.2016804-1-jackmanb@google.com
3 years agobpf: Clarify return value of probe str helpers
Brendan Jackman [Tue, 12 Jan 2021 12:34:22 +0000 (12:34 +0000)]
bpf: Clarify return value of probe str helpers

When the buffer is too small to contain the input string, these helpers
return the length of the buffer, not the length of the original string.
This tries to make the docs totally clear about that, since "the length
of the [copied ]string" could also refer to the length of the input.
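
The return value described above can be sketched with a small userspace
model. This is an illustration only, not the kernel implementation: the
name model_probe_read_str() is hypothetical, and the size == 0 case is
modeled as returning 0 by assumption.

```c
#include <stddef.h>

/*
 * Userspace model of the documented return value (not kernel code).
 * Copies at most size - 1 bytes of src plus a terminating NUL into dst
 * and returns the number of bytes written, including the NUL.
 */
static long model_probe_read_str(char *dst, size_t size, const char *src)
{
	size_t i;

	/* Assumption: nothing is written when there is no room at all. */
	if (size == 0)
		return 0;

	for (i = 0; i < size - 1 && src[i] != '\0'; i++)
		dst[i] = src[i];
	dst[i] = '\0';

	/* When src is truncated this equals size, not strlen(src) + 1. */
	return (long)(i + 1);
}
```

With a 4-byte buffer and the input "hello", the model writes "hel" plus
NUL and returns 4, the buffer length, rather than 6.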

Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: KP Singh <kpsingh@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210112123422.2011234-1-jackmanb@google.com
3 years ago libbpf: Clarify kernel type use with USER variants of CORE reading macros
Andrii Nakryiko [Fri, 8 Jan 2021 19:44:08 +0000 (11:44 -0800)]
libbpf: Clarify kernel type use with USER variants of CORE reading macros

Add comments clarifying that USER variants of CO-RE reading macros are still
only going to work with kernel types, defined in kernel or kernel module BTF.
This should help prevent invalid use of those macros to read user-defined
types (which doesn't work with CO-RE).

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210108194408.3468860-1-andrii@kernel.org
3 years ago selftests/bpf: Remove duplicate include in test_lsm
Menglong Dong [Tue, 5 Jan 2021 15:20:47 +0000 (07:20 -0800)]
selftests/bpf: Remove duplicate include in test_lsm

'unistd.h' included in 'selftests/bpf/prog_tests/test_lsm.c' is
duplicated.

Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210105152047.6070-1-dong.menglong@zte.com.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
3 years ago net, xdp: Introduce xdp_prepare_buff utility routine
Lorenzo Bianconi [Tue, 22 Dec 2020 21:09:29 +0000 (22:09 +0100)]
net, xdp: Introduce xdp_prepare_buff utility routine

Introduce xdp_prepare_buff utility routine to initialize per-descriptor
xdp_buff fields (e.g. xdp_buff pointers). Rely on xdp_prepare_buff() in
all XDP capable drivers.
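
A rough userspace model of the split between per-NAPI and per-descriptor
initialization may help. The struct below is a simplified stand-in for
struct xdp_buff, and the model_* names and parameter lists are
assumptions made for illustration, not copies of the kernel helpers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's struct xdp_buff. */
struct model_xdp_buff {
	unsigned char *data_hard_start;
	unsigned char *data;
	unsigned char *data_end;
	unsigned char *data_meta;
	void *rxq;		/* placeholder for the rxq info pointer */
	uint32_t frame_sz;
};

/* Fields that stay constant over the NAPI iterations. */
static void model_xdp_init_buff(struct model_xdp_buff *xdp,
				uint32_t frame_sz, void *rxq)
{
	xdp->frame_sz = frame_sz;
	xdp->rxq = rxq;
}

/* Fields that change with every received descriptor. */
static void model_xdp_prepare_buff(struct model_xdp_buff *xdp,
				   unsigned char *hard_start,
				   int headroom, int data_len,
				   bool meta_valid)
{
	unsigned char *data = hard_start + headroom;

	xdp->data_hard_start = hard_start;
	xdp->data = data;
	xdp->data_end = data + data_len;
	/* In this model, data_meta > data marks "no metadata". */
	xdp->data_meta = meta_valid ? data : data + 1;
}
```

A driver-like caller would run the init step once before its receive
loop and the prepare step once per descriptor inside it.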

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Shay Agroskin <shayagr@amazon.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Acked-by: Camelia Groza <camelia.groza@nxp.com>
Acked-by: Marcin Wojtas <mw@semihalf.com>
Link: https://lore.kernel.org/bpf/45f46f12295972a97da8ca01990b3e71501e9d89.1608670965.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
3 years ago net, xdp: Introduce xdp_init_buff utility routine
Lorenzo Bianconi [Tue, 22 Dec 2020 21:09:28 +0000 (22:09 +0100)]
net, xdp: Introduce xdp_init_buff utility routine

Introduce xdp_init_buff utility routine to initialize xdp_buff fields
const over NAPI iterations (e.g. frame_sz or rxq pointer). Rely on
xdp_init_buff in all XDP capable drivers.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Shay Agroskin <shayagr@amazon.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Acked-by: Camelia Groza <camelia.groza@nxp.com>
Acked-by: Marcin Wojtas <mw@semihalf.com>
Link: https://lore.kernel.org/bpf/7f8329b6da1434dc2b05a77f2e800b29628a8913.1608670965.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
3 years ago bpf: Replace fput with sockfd_put in sock map
Zheng Yongjun [Tue, 29 Dec 2020 13:48:34 +0000 (21:48 +0800)]
bpf: Replace fput with sockfd_put in sock map

The function sockfd_lookup uses fget on the value that is stored in
the file field of the returned structure, so fput should ultimately
be applied to this value. This can be done directly, but it seems
better to use the specific macro sockfd_put, which does the same
thing.

The cleanup was done using the following semantic patch:
    (http://www.emn.fr/x-info/coccinelle/)

    // <smpl>
    @@
    expression s;
    @@

       s = sockfd_lookup(...)
       ...
    +  sockfd_put(s);
    ?- fput(s->file);
    // </smpl>

Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20201229134834.22962-1-zhengyongjun3@huawei.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
3 years ago bpf: Remove unnecessary <argp.h> include from preload/iterators
Leah Neukirchen [Wed, 16 Dec 2020 10:03:06 +0000 (11:03 +0100)]
bpf: Remove unnecessary <argp.h> include from preload/iterators

This program does not use argp (which is a glibcism). Instead include <errno.h>
directly, which was pulled in by <argp.h>.

Signed-off-by: Leah Neukirchen <leah@vuxu.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201216100306.30942-1-leah@vuxu.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
3 years ago selftests/bpf: Add tests for user- and non-CO-RE BPF_CORE_READ() variants
Andrii Nakryiko [Fri, 18 Dec 2020 23:56:14 +0000 (15:56 -0800)]
selftests/bpf: Add tests for user- and non-CO-RE BPF_CORE_READ() variants

Add selftests validating that newly added variations of BPF_CORE_READ(), for
use with user-space addresses and for non-CO-RE reads, work as expected.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20201218235614.2284956-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
3 years ago libbpf: Add non-CO-RE variants of BPF_CORE_READ() macro family
Andrii Nakryiko [Fri, 18 Dec 2020 23:56:13 +0000 (15:56 -0800)]
libbpf: Add non-CO-RE variants of BPF_CORE_READ() macro family

BPF_CORE_READ(), in addition to handling CO-RE relocations, also allows a much
nicer way to read data structures with nested pointers. Instead of writing
a sequence of bpf_probe_read() calls to follow links, one can just write
BPF_CORE_READ(a, b, c, d) to effectively do a->b->c->d read. This is a welcome
ability when porting BCC code, which (in most cases) allows exactly the
intuitive a->b->c->d variant.

This patch adds non-CO-RE variants of BPF_CORE_READ() family of macros for
cases where CO-RE is not supported (e.g., old kernels). In such cases, the
property of shortening a sequence of bpf_probe_read()s to a simple
BPF_PROBE_READ(a, b, c, d) invocation is still desirable, especially when
porting BCC code to libbpf. Yet, no CO-RE relocation is going to be emitted.
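
To make the shortening concrete, here is a userspace sketch of the
hand-written sequence that a single BPF_PROBE_READ(a, b, c, d) invocation
is meant to replace. The struct types and model_probe_read() are
hypothetical stand-ins (a plain memcpy takes the place of
bpf_probe_read()), so this shows the shape of the chain, not BPF code.

```c
#include <string.h>

/* Hypothetical nested types: a->b->c->d ends in an int. */
struct c_node { int d; };
struct b_node { struct c_node *c; };
struct a_node { struct b_node *b; };

/* Stand-in for bpf_probe_read(): a plain memcpy in this model. */
static int model_probe_read(void *dst, size_t size, const void *src)
{
	memcpy(dst, src, size);
	return 0;
}

/* One read per link, which is what the macro spares the author. */
static int read_d_by_hand(const struct a_node *a)
{
	struct b_node *b;
	struct c_node *c;
	int d;

	model_probe_read(&b, sizeof(b), &a->b);
	model_probe_read(&c, sizeof(c), &b->c);
	model_probe_read(&d, sizeof(d), &c->d);
	return d;
}
```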

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20201218235614.2284956-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
3 years ago libbpf: Add user-space variants of BPF_CORE_READ() family of macros
Andrii Nakryiko [Fri, 18 Dec 2020 23:56:12 +0000 (15:56 -0800)]
libbpf: Add user-space variants of BPF_CORE_READ() family of macros

Add BPF_CORE_READ_USER(), BPF_CORE_READ_USER_STR() and their _INTO()
variations to allow reading CO-RE-relocatable kernel data structures from
user-space. One such case is reading input arguments of syscalls, while
reaping the benefits of CO-RE relocations w.r.t. handling 32/64 bit
conversions and handling missing/new fields in UAPI data structs.

Suggested-by: Gilad Reti <gilad.reti@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20201218235614.2284956-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
3 years ago Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Jakub Kicinski [Fri, 8 Jan 2021 21:28:00 +0000 (13:28 -0800)]
Merge git://git./linux/kernel/git/netdev/net

Trivial conflict in CAN on file rename.

Conflicts:
drivers/net/can/m_can/tcan4x5x-core.c

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago Merge tag 'net-5.11-rc3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Fri, 8 Jan 2021 20:12:30 +0000 (12:12 -0800)]
Merge tag 'net-5.11-rc3-2' of git://git./linux/kernel/git/netdev/net

Pull more networking fixes from Jakub Kicinski:
 "Slightly lighter pull request to get back into the Thursday cadence.

  Current release - always broken:

   - can: mcp251xfd: fix Tx/Rx ring buffer driver race conditions

   - dsa: hellcreek: fix led_classdev build errors

  Previous releases - regressions:

   - ipv6: fib: flush exceptions when purging route to avoid netdev
     reference leak

   - ip_tunnels: fix pmtu check in nopmtudisc mode

   - ip: always refragment ip defragmented packets to avoid MTU issues
     when forwarding through tunnels, correct "packet too big" message
     is prohibitively tricky to generate

   - s390/qeth: fix locking for discipline setup / removal and during
     recovery to prevent both deadlocks and races

   - mlx5: Use port_num 1 instead of 0 when delete a RoCE address

  Previous releases - always broken:

   - cdc_ncm: correct overhead calculation in delayed_ndp_size to
     prevent out of bound accesses with Huawei 909s-120 LTE module

   - fix stmmac dwmac-sun8i suspend/resume:
           - PHY being left powered off
           - MAC syscon configuration being reset
           - reference to the reset controller being improperly dropped

   - qrtr: fix null-ptr-deref in qrtr_ns_remove

   - can: tcan4x5x: fix bittiming const, use common bittiming from m_can
     driver

   - mlx5e: CT: Use per flow counter when CT flow accounting is enabled

   - mlx5e: Fix SWP offsets when vlan inserted by driver

  Misc:

   - bpf: Fix a task_iter bug caused by a bpf -> net merge conflict
     resolution

  And the usual many fixes to various error paths"

* tag 'net-5.11-rc3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (69 commits)
  net: dsa: lantiq_gswip: Exclude RMII from modes that report 1 GbE
  s390/qeth: fix L2 header access in qeth_l3_osa_features_check()
  s390/qeth: fix locking for discipline setup / removal
  s390/qeth: fix deadlock during recovery
  selftests: fib_nexthops: Fix wrong mausezahn invocation
  nexthop: Bounce NHA_GATEWAY in FDB nexthop groups
  nexthop: Unlink nexthop group entry in error path
  nexthop: Fix off-by-one error in error path
  octeontx2-af: fix memory leak of lmac and lmac->name
  chtls: Fix chtls resources release sequence
  chtls: Added a check to avoid NULL pointer dereference
  chtls: Replace skb_dequeue with skb_peek
  chtls: Avoid unnecessary freeing of oreq pointer
  chtls: Fix panic when route to peer not configured
  chtls: Remove invalid set_tcb call
  chtls: Fix hardware tid leak
  net: ip: always refragment ip defragmented packets
  net: fix pmtu check in nopmtudisc mode
  selftests: netfilter: add selftest for ipip pmtu discovery with enabled connection tracking
  docs: octeontx2: tune rst markup
  ...

3 years ago Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Linus Torvalds [Fri, 8 Jan 2021 20:05:11 +0000 (12:05 -0800)]
Merge branch 'linus' of git://git./linux/kernel/git/herbert/crypto-2.6

Pull crypto fixes from Herbert Xu:
 "This fixes a functional bug in arm/chacha-neon as well as a potential
  buffer overflow in ecdh"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: ecdh - avoid buffer overflow in ecdh_set_secret()
  crypto: arm/chacha-neon - add missing counter increment

3 years ago poll: fix performance regression due to out-of-line __put_user()
Linus Torvalds [Thu, 7 Jan 2021 17:43:54 +0000 (09:43 -0800)]
poll: fix performance regression due to out-of-line __put_user()

The kernel test robot reported a -5.8% performance regression on the
"poll2" test of will-it-scale, and bisected it to commit d55564cfc222
("x86: Make __put_user() generate an out-of-line call").

I didn't expect an out-of-line __put_user() to matter, because no normal
core code should use that non-checking legacy version of user access any
more.  But I had overlooked the very odd poll() usage, which does a
__put_user() to update the 'revents' values of the poll array.

Now, Al Viro correctly points out that instead of updating just the
'revents' field, it would be much simpler to just copy the _whole_
pollfd entry, and then we could just use "copy_to_user()" on the whole
array of entries, the same way we use "copy_from_user()" a few lines
earlier to get the original values.

But that is not what we've traditionally done, and I worry that threaded
applications might be concurrently modifying the other fields of the
pollfd array.  So while Al's suggestion is simpler - and perhaps worth
trying in the future - this instead keeps the "just update revents"
model.

To fix the performance regression, use the modern "unsafe_put_user()"
instead of __put_user(), with the proper "user_write_access_begin()"
guarding in place. This improves code generation enormously.
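
The "just update revents" model can be illustrated in userspace. This is
only a sketch under assumptions: write_back_revents() is a hypothetical
name, and plain assignments stand in for the unsafe_put_user() calls made
inside the user access section in the kernel.

```c
#include <poll.h>
#include <stddef.h>

/*
 * Write back only the revents field of each entry. The fd and events
 * fields of the destination array are left untouched, so a concurrent
 * change to them by another thread is not overwritten.
 */
static void write_back_revents(struct pollfd *dst,
			       const struct pollfd *src, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		dst[i].revents = src[i].revents;
}
```

Copying the whole entry instead would also write back the fd and events
values captured earlier, which is the concern raised above for threaded
applications.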

Link: https://lore.kernel.org/lkml/20210107134723.GA28532@xsang-OptiPlex-9020/
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Oliver Sang <oliver.sang@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Laight <David.Laight@aculab.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years ago Revert "init/console: Use ttynull as a fallback when there is no console"
Petr Mladek [Fri, 8 Jan 2021 11:48:47 +0000 (12:48 +0100)]
Revert "init/console: Use ttynull as a fallback when there is no console"

This reverts commit 757055ae8dedf5333af17b3b5b4b70ba9bc9da4e.

The commit caused ttynull to be used as the default console
on several systems[1][2][3]. As a result, the console was
blank even when a better alternative existed.

It happened when there was no console configured
on the command line and ttynull_init() was the first initcall
calling register_console().

Or it happened when /dev/ did not exist when console_on_rootfs()
was called. It was not able to open /dev/console even though
a console driver was registered. It tried to add ttynull console
but it obviously did not help. But ttynull became the preferred
console and was used by /dev/console when it was available later.

The commit tried to fix a historical problem that has been there
for ages. The primary motivation was the commit 3cffa06aeef7ece30f6
("printk/console: Allow to disable console output by using console=""
 or console=null"). It provided a clean solution for a workaround
 that was widely used and worked only by chance.

With this revert, the console="" or console=null command line
options will again work only by chance. These options cause
a particular console to be preferred, so the default (tty) ones
will not get enabled. There will be no console registered at
all. As a result there won't be stdin, stdout, and stderr for
the init process. But it worked exactly this way even before.

The proper solution has to fulfill many conditions:

  + Register ttynull only when explicitly required or as
    the ultimate fallback.

  + ttynull should get associated with /dev/console but it must
    not become preferred console when used as a fallback.
    Especially, it must still be possible to replace it
    by a better console later.

Such a change requires clean up of the register_console() code.
Otherwise, it would be even harder to follow. Especially, the use
of has_preferred_console and CON_CONSDEV flag is tricky. The clean
up is risky. The ordering of consoles is not well defined. And
any changes tend to break existing user settings.

Do the revert as the least risky solution for now.

[1] https://lore.kernel.org/linux-kselftest/20201221144302.GR4077@smile.fi.intel.com/
[2] https://lore.kernel.org/lkml/d2a3b3c0-e548-7dd1-730f-59bc5c04e191@synopsys.com/
[3] https://patchwork.ozlabs.org/project/linux-um/patch/20210105120128.10854-1-thomas@m3y3r.de/

Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reported-by: Vineet Gupta <vgupta@synopsys.com>
Reported-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years ago Merge tag 'mlx5-fixes-2021-01-07' of git://git.kernel.org/pub/scm/linux/kernel/git...
Jakub Kicinski [Fri, 8 Jan 2021 03:13:29 +0000 (19:13 -0800)]
Merge tag 'mlx5-fixes-2021-01-07' of git://git./linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5 fixes 2021-01-07

* tag 'mlx5-fixes-2021-01-07' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  net/mlx5e: Fix memleak in mlx5e_create_l2_table_groups
  net/mlx5e: Fix two double free cases
  net/mlx5: Release devlink object if adev fails
  net/mlx5e: ethtool, Fix restriction of autoneg with 56G
  net/mlx5e: In skb build skip setting mark in switchdev mode
  net/mlx5: E-Switch, fix changing vf VLANID
  net/mlx5e: Fix SWP offsets when vlan inserted by driver
  net/mlx5e: CT: Use per flow counter when CT flow accounting is enabled
  net/mlx5: Use port_num 1 instead of 0 when delete a RoCE address
  net/mlx5e: Add missing capability check for uplink follow
  net/mlx5: Check if lag is supported before creating one
====================

Link: https://lore.kernel.org/r/20210107202845.470205-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago net: dsa: lantiq_gswip: Exclude RMII from modes that report 1 GbE
Aleksander Jan Bajkowski [Thu, 7 Jan 2021 19:58:18 +0000 (20:58 +0100)]
net: dsa: lantiq_gswip: Exclude RMII from modes that report 1 GbE

Exclude RMII from modes that report 1 GbE support. Reduced MII supports
up to 100 MbE.
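
As a small illustration of the rule, a capability check could look like
the sketch below. The enum and function names are hypothetical and are
not taken from the driver; only the speed limits of the interface types
are standard facts.

```c
/* Hypothetical interface modes for illustration only. */
enum model_phy_mode {
	MODEL_MODE_MII,
	MODEL_MODE_RMII,
	MODEL_MODE_GMII,
	MODEL_MODE_RGMII,
};

/* Maximum link speed in Mbit/s that each interface type can carry. */
static int model_max_speed_mbps(enum model_phy_mode mode)
{
	switch (mode) {
	case MODEL_MODE_MII:
	case MODEL_MODE_RMII:
		return 100;
	case MODEL_MODE_GMII:
	case MODEL_MODE_RGMII:
		return 1000;
	}
	return 0;
}

/* Only modes that can actually carry 1000 Mbit/s report 1 GbE. */
static int model_reports_1gbe(enum model_phy_mode mode)
{
	return model_max_speed_mbps(mode) >= 1000;
}
```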

Fixes: 14fceff4771e ("net: dsa: Add Lantiq / Intel DSA driver for vrx200")
Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210107195818.3878-1-olek2@wp.pl
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago Merge branch 's390-qeth-fixes-2021-01-07'
Jakub Kicinski [Fri, 8 Jan 2021 02:54:08 +0000 (18:54 -0800)]
Merge branch 's390-qeth-fixes-2021-01-07'

Julian Wiedmann says:

====================
s390/qeth: fixes 2021-01-07

This brings two locking fixes for the device control path.
Also one fix for a path where our .ndo_features_check() attempts to
access a non-existent L2 header.
====================

Link: https://lore.kernel.org/r/20210107172442.1737-1-jwi@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago s390/qeth: fix L2 header access in qeth_l3_osa_features_check()
Julian Wiedmann [Thu, 7 Jan 2021 17:24:42 +0000 (18:24 +0100)]
s390/qeth: fix L2 header access in qeth_l3_osa_features_check()

ip_finish_output_gso() may call .ndo_features_check() even before the
skb has a L2 header. This conflicts with qeth_get_ip_version()'s attempt
to inspect the L2 header via vlan_eth_hdr().

Switch to vlan_get_protocol(), as already used further down in the
common qeth_features_check() path.

Fixes: f13ade199391 ("s390/qeth: run non-offload L3 traffic over common xmit path")
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago s390/qeth: fix locking for discipline setup / removal
Julian Wiedmann [Thu, 7 Jan 2021 17:24:41 +0000 (18:24 +0100)]
s390/qeth: fix locking for discipline setup / removal

Due to insufficient locking, qeth_core_set_online() and
qeth_dev_layer2_store() can run in parallel, both attempting to load &
setup the discipline (and stepping on each other's toes along the way).
A similar race can also occur between qeth_core_remove_device() and
qeth_dev_layer2_store().

Access to .discipline is meant to be protected by the discipline_mutex,
so add/expand the locking in qeth_core_remove_device() and
qeth_core_set_online().
Adjust the locking in qeth_l*_remove_device() accordingly, as it's now
handled by the callers in a consistent manner.

Based on an initial patch by Ursula Braun.

Fixes: 9dc48ccc68b9 ("qeth: serialize sysfs-triggered device configurations")
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago s390/qeth: fix deadlock during recovery
Julian Wiedmann [Thu, 7 Jan 2021 17:24:40 +0000 (18:24 +0100)]
s390/qeth: fix deadlock during recovery

When qeth_dev_layer2_store() - holding the discipline_mutex - waits
inside qeth_l*_remove_device() for a qeth_do_reset() thread to complete,
we can hit a deadlock if qeth_do_reset() concurrently calls
qeth_set_online() and thus tries to acquire the discipline_mutex.

Move the discipline_mutex locking outside of qeth_set_online() and
qeth_set_offline(), and turn the discipline into a parameter so that
callers understand the dependency.

To fix the deadlock, we can now relax the locking:
As already established, qeth_l*_remove_device() waits for
qeth_do_reset() to complete. So qeth_do_reset() itself is under no risk
of having card->discipline ripped out while it's running, and thus
doesn't need to take the discipline_mutex.

Fixes: 9dc48ccc68b9 ("qeth: serialize sysfs-triggered device configurations")
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago Merge branch 'nexthop-various-fixes'
Jakub Kicinski [Fri, 8 Jan 2021 02:47:21 +0000 (18:47 -0800)]
Merge branch 'nexthop-various-fixes'

Ido Schimmel says:

====================
nexthop: Various fixes

This series contains various fixes for the nexthop code. The bugs were
uncovered during the development of resilient nexthop groups.

Patches #1-#2 fix the error path of nexthop_create_group(). I was not
able to trigger these bugs with current code, but it is possible with
the upcoming resilient nexthop groups code which adds a user
controllable memory allocation further in the function.

Patch #3 fixes wrong validation of netlink attributes.

Patch #4 fixes wrong invocation of mausezahn in a selftest.
====================

Link: https://lore.kernel.org/r/20210107144824.1135691-1-idosch@idosch.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago selftests: fib_nexthops: Fix wrong mausezahn invocation
Ido Schimmel [Thu, 7 Jan 2021 14:48:24 +0000 (16:48 +0200)]
selftests: fib_nexthops: Fix wrong mausezahn invocation

For IPv6 traffic, mausezahn needs to be invoked with '-6'. Otherwise an
error is returned:

 # ip netns exec me mausezahn veth1 -B 2001:db8:101::2 -A 2001:db8:91::1 -c 0 -t tcp "dp=1-1023, flags=syn"
 Failed to set source IPv4 address. Please check if source is set to a valid IPv4 address.
  Invalid command line parameters!

Fixes: 7c741868ceab ("selftests: Add torture tests to nexthop tests")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago nexthop: Bounce NHA_GATEWAY in FDB nexthop groups
Petr Machata [Thu, 7 Jan 2021 14:48:23 +0000 (16:48 +0200)]
nexthop: Bounce NHA_GATEWAY in FDB nexthop groups

The function nh_check_attr_group() is called to validate nexthop groups.
The intention of that code seems to have been to bounce all attributes
above NHA_GROUP_TYPE except for NHA_FDB. However instead it bounces all
these attributes except when NHA_FDB attribute is present--then it accepts
them.

NHA_FDB validation that takes place before, in rtm_to_nh_config(), already
bounces NHA_OIF, NHA_BLACKHOLE, NHA_ENCAP and NHA_ENCAP_TYPE. Yet further
back, NHA_GROUPS and NHA_MASTER are bounced unconditionally.

But that still leaves NHA_GATEWAY as an attribute that would be accepted in
FDB nexthop groups (with no meaning), so long as it keeps the address
family as unspecified:

 # ip nexthop add id 1 fdb via 127.0.0.1
 # ip nexthop add id 10 fdb via default group 1

The nexthop code is still relatively new and likely not used very broadly,
and the FDB bits are newer still. Even though there is a reproducer out
there, it relies on improbable gateway arguments "via default", "via
all" or "via any". Given all this, I believe it is OK to reformulate the
condition to do the right thing and bounce NHA_GATEWAY.
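
The difference between the intended and the described behaviour can be
sketched in a userspace model. The enum below is a shortened,
illustrative list (the real attribute IDs live in the nexthop UAPI
header), and both function names are hypothetical.

```c
#include <stdbool.h>

/* Illustrative attribute IDs, not the real UAPI values. */
enum model_nha {
	MODEL_NHA_ID,
	MODEL_NHA_GROUP,
	MODEL_NHA_GROUP_TYPE,
	MODEL_NHA_GATEWAY,
	MODEL_NHA_FDB,
	MODEL_NHA_MAX,
};

/* Described behaviour: the presence of FDB lets every attribute pass. */
static bool group_attrs_ok_buggy(const bool present[MODEL_NHA_MAX])
{
	int i;

	for (i = MODEL_NHA_GROUP_TYPE + 1; i < MODEL_NHA_MAX; i++) {
		if (!present[i])
			continue;
		if (present[MODEL_NHA_FDB])
			continue;
		return false;
	}
	return true;
}

/* Intended behaviour: above GROUP_TYPE, only FDB itself is accepted. */
static bool group_attrs_ok_fixed(const bool present[MODEL_NHA_MAX])
{
	int i;

	for (i = MODEL_NHA_GROUP_TYPE + 1; i < MODEL_NHA_MAX; i++) {
		if (!present[i])
			continue;
		if (i == MODEL_NHA_FDB)
			continue;
		return false;
	}
	return true;
}
```

In this model a group carrying both the FDB and the gateway attribute
passes the first check and is bounced by the second.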

Fixes: 38428d68719c ("nexthop: support for fdb ecmp nexthops")
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago nexthop: Unlink nexthop group entry in error path
Ido Schimmel [Thu, 7 Jan 2021 14:48:22 +0000 (16:48 +0200)]
nexthop: Unlink nexthop group entry in error path

In case of error, remove the nexthop group entry from the list to which
it was previously added.

Fixes: 430a049190de ("nexthop: Add support for nexthop groups")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago nexthop: Fix off-by-one error in error path
Ido Schimmel [Thu, 7 Jan 2021 14:48:21 +0000 (16:48 +0200)]
nexthop: Fix off-by-one error in error path

A reference was not taken for the current nexthop entry, so do not try
to put it in the error path.
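
The general pattern can be sketched in userspace as follows. The types
and helpers are hypothetical and only model reference counting; they are
not the nexthop code.

```c
#include <stddef.h>

/* Hypothetical refcounted entry; valid == 0 makes acquisition fail. */
struct model_entry {
	int refs;
	int valid;
};

static int model_get(struct model_entry *e)
{
	if (!e->valid)
		return -1;
	e->refs++;
	return 0;
}

static void model_put(struct model_entry *e)
{
	e->refs--;
}

/* Take a reference on every entry, or on none if any step fails. */
static int acquire_all(struct model_entry *entries, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (model_get(&entries[i]))
			goto err_unwind;
	}
	return 0;

err_unwind:
	/* entries[i] was never taken, so unwinding starts at i - 1. */
	while (i-- > 0)
		model_put(&entries[i]);
	return -1;
}
```

Starting the unwind at index i instead would drop a reference that was
never taken, which is the off-by-one described above.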

Fixes: 430a049190de ("nexthop: Add support for nexthop groups")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago octeontx2-af: fix memory leak of lmac and lmac->name
Colin Ian King [Thu, 7 Jan 2021 12:39:16 +0000 (12:39 +0000)]
octeontx2-af: fix memory leak of lmac and lmac->name

Currently the error return paths don't kfree lmac and lmac->name
leading to some memory leaks.  Fix this by adding two error return
paths that kfree these objects.
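
A userspace sketch of the two-label unwinding pattern, with malloc/free
standing in for the kernel allocators. The struct, the function name and
the fail_late switch are hypothetical and only illustrate the shape of
the fix.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical object that owns a separately allocated name. */
struct model_lmac {
	char *name;
};

/* fail_late simulates an error after both allocations succeeded. */
static struct model_lmac *model_lmac_create(const char *name, int fail_late)
{
	struct model_lmac *lmac = malloc(sizeof(*lmac));

	if (!lmac)
		return NULL;

	lmac->name = malloc(strlen(name) + 1);
	if (!lmac->name)
		goto err_free_lmac;
	strcpy(lmac->name, name);

	if (fail_late)
		goto err_free_name;

	return lmac;

err_free_name:
	free(lmac->name);
err_free_lmac:
	free(lmac);
	return NULL;
}
```

Each label releases exactly what was allocated before the failing step,
so no exit path leaves an allocation behind.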

Addresses-Coverity: ("Resource leak")
Fixes: 1463f382f58d ("octeontx2-af: Add support for CGX link management")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20210107123916.189748-1-colin.king@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago Merge branch 'bug-fixes-for-chtls-driver'
Jakub Kicinski [Fri, 8 Jan 2021 01:06:05 +0000 (17:06 -0800)]
Merge branch 'bug-fixes-for-chtls-driver'

Ayush Sawal says:

====================
Bug fixes for chtls driver

patch 1: Fix hardware tid leak.
patch 2: Remove invalid set_tcb call.
patch 3: Fix panic when route to peer not configured.
patch 4: Avoid unnecessary freeing of oreq pointer.
patch 5: Replace skb_dequeue with skb_peek.
patch 6: Added a check to avoid NULL pointer dereference.
patch 7: Fix chtls resources release sequence.
====================

Link: https://lore.kernel.org/r/20210106042912.23512-1-ayush.sawal@chelsio.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago chtls: Fix chtls resources release sequence
Ayush Sawal [Wed, 6 Jan 2021 04:29:12 +0000 (09:59 +0530)]
chtls: Fix chtls resources release sequence

CPL_ABORT_RPL is sent after releasing the resources by calling
chtls_release_resources(sk); and chtls_conn_done(sk);
eventually causing a kernel panic. Fix it by calling release
in the appropriate order.

Fixes: cc35c88ae4db ("crypto : chtls - CPL handler definition")
Signed-off-by: Vinay Kumar Yadav <vinay.yadav@chelsio.com>
Signed-off-by: Ayush Sawal <ayush.sawal@chelsio.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago chtls: Added a check to avoid NULL pointer dereference
Ayush Sawal [Wed, 6 Jan 2021 04:29:11 +0000 (09:59 +0530)]
chtls: Added a check to avoid NULL pointer dereference

In case of server removal lookup_stid() may return NULL pointer, which
is used as listen_ctx. So added a check before accessing this pointer.

Fixes: cc35c88ae4db ("crypto : chtls - CPL handler definition")
Signed-off-by: Vinay Kumar Yadav <vinay.yadav@chelsio.com>
Signed-off-by: Ayush Sawal <ayush.sawal@chelsio.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago chtls: Replace skb_dequeue with skb_peek
Ayush Sawal [Wed, 6 Jan 2021 04:29:10 +0000 (09:59 +0530)]
chtls: Replace skb_dequeue with skb_peek

The skb is unlinked twice, once in __skb_dequeue() in function
chtls_reset_synq() and again in cleanup_syn_rcv_conn().
Use skb_peek() instead of __skb_dequeue() so that the
unlink is handled only in cleanup_syn_rcv_conn().

Fixes: cc35c88ae4db ("crypto : chtls - CPL handler definition")
Signed-off-by: Vinay Kumar Yadav <vinay.yadav@chelsio.com>
Signed-off-by: Ayush Sawal <ayush.sawal@chelsio.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago chtls: Avoid unnecessary freeing of oreq pointer
Ayush Sawal [Wed, 6 Jan 2021 04:29:09 +0000 (09:59 +0530)]
chtls: Avoid unnecessary freeing of oreq pointer

In chtls_pass_accept_request(), remove the chtls_reqsk_free()
call to avoid freeing oreq twice. Here oreq is the pointer to
struct request_sock.

Fixes: cc35c88ae4db ("crypto : chtls - CPL handler definition")
Signed-off-by: Rohit Maheshwari <rohitm@chelsio.com>
Signed-off-by: Ayush Sawal <ayush.sawal@chelsio.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago chtls: Fix panic when route to peer not configured
Ayush Sawal [Wed, 6 Jan 2021 04:29:08 +0000 (09:59 +0530)]
chtls: Fix panic when route to peer not configured

If the route to the peer is not configured, we might get non-TLS
devices from dst_neigh_lookup(), which is invalid. Add a
check to avoid it.

Fixes: cc35c88ae4db ("crypto : chtls - CPL handler definition")
Signed-off-by: Rohit Maheshwari <rohitm@chelsio.com>
Signed-off-by: Ayush Sawal <ayush.sawal@chelsio.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago chtls: Remove invalid set_tcb call
Ayush Sawal [Wed, 6 Jan 2021 04:29:07 +0000 (09:59 +0530)]
chtls: Remove invalid set_tcb call

At the time of SYN_RECV, connection information is not
initialized at FW, and updating the tcb flag over an uninitialized
connection causes an adapter crash. We don't need to
update the flag during SYN_RECV state, so avoid this.

Fixes: cc35c88ae4db ("crypto : chtls - CPL handler definition")
Signed-off-by: Rohit Maheshwari <rohitm@chelsio.com>
Signed-off-by: Ayush Sawal <ayush.sawal@chelsio.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago chtls: Fix hardware tid leak
Ayush Sawal [Wed, 6 Jan 2021 04:29:06 +0000 (09:59 +0530)]
chtls: Fix hardware tid leak

send_abort_rpl() is not calculating cpl_abort_req_rss offset and
ends up sending wrong TID with abort_rpl WR, causing tid leaks.
Replaced send_abort_rpl() with chtls_send_abort_rpl() as it is
redundant.

Fixes: cc35c88ae4db ("crypto : chtls - CPL handler definition")
Signed-off-by: Rohit Maheshwari <rohitm@chelsio.com>
Signed-off-by: Ayush Sawal <ayush.sawal@chelsio.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago Merge branch 'generic-zcopy_-functions'
Jakub Kicinski [Fri, 8 Jan 2021 00:08:38 +0000 (16:08 -0800)]
Merge branch 'generic-zcopy_-functions'

Jonathan Lemon says:

====================
Generic zcopy_* functions

This is a set of cleanup patches for zerocopy which are intended
to allow the introduction of a different zerocopy implementation.

The top level API will use the skb_zcopy_*() functions, while
the current TCP specific zerocopy ends up using msg_zerocopy_*()
calls.

There should be no functional changes from these patches.
====================

Link: https://lore.kernel.org/r/20210106221841.1880536-1-jonathan.lemon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agoskbuff: Rename skb_zcopy_{get|put} to net_zcopy_{get|put}
Jonathan Lemon [Wed, 6 Jan 2021 22:18:41 +0000 (14:18 -0800)]
skbuff: Rename skb_zcopy_{get|put} to net_zcopy_{get|put}

Unlike the rest of the skb_zcopy_ functions, these routines
operate on a 'struct ubuf', not a skb.  Remove the 'skb_'
prefix from the naming to make things clearer.

Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agotap/tun: add skb_zcopy_init() helper for initialization.
Jonathan Lemon [Wed, 6 Jan 2021 22:18:40 +0000 (14:18 -0800)]
tap/tun: add skb_zcopy_init() helper for initialization.

Replace direct assignments with skb_zcopy_init() for zerocopy
cases where a new skb is initialized, without changing the
reference counts.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agoskbuff: add flags to ubuf_info for ubuf setup
Jonathan Lemon [Wed, 6 Jan 2021 22:18:39 +0000 (14:18 -0800)]
skbuff: add flags to ubuf_info for ubuf setup

Currently, when an ubuf is attached to a new skb, the shared
flags word is initialized to a fixed value.  Instead of doing
this, set the default flags in the ubuf, and have new skbs
inherit from this default.

This is needed when setting up different zerocopy types.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agonet: group skb_shinfo zerocopy related bits together.
Jonathan Lemon [Wed, 6 Jan 2021 22:18:38 +0000 (14:18 -0800)]
net: group skb_shinfo zerocopy related bits together.

In preparation for expanded zerocopy (TX and RX), move
the zerocopy related bits out of tx_flags into their own
flag word.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agoskbuff: rename sock_zerocopy_* to msg_zerocopy_*
Jonathan Lemon [Wed, 6 Jan 2021 22:18:37 +0000 (14:18 -0800)]
skbuff: rename sock_zerocopy_* to msg_zerocopy_*

At Willem's suggestion, rename the sock_zerocopy_* functions
so that they match the MSG_ZEROCOPY flag, which makes it clear
they are specific to this zerocopy implementation.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agoskbuff: Call skb_zcopy_clear() before unref'ing fragments
Jonathan Lemon [Wed, 6 Jan 2021 22:18:36 +0000 (14:18 -0800)]
skbuff: Call skb_zcopy_clear() before unref'ing fragments

RX zerocopy fragment pages which are not allocated from the
system page pool require special handling.  Give the callback
in skb_zcopy_clear() a chance to process them first.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agoskbuff: Call sock_zerocopy_put_abort from skb_zcopy_put_abort
Jonathan Lemon [Wed, 6 Jan 2021 22:18:35 +0000 (14:18 -0800)]
skbuff: Call sock_zerocopy_put_abort from skb_zcopy_put_abort

The sock_zerocopy_put_abort function contains logic which is
specific to the current zerocopy implementation.  Add a wrapper
which checks the callback and dispatches appropriately.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agoskbuff: Add skb parameter to the ubuf zerocopy callback
Jonathan Lemon [Wed, 6 Jan 2021 22:18:34 +0000 (14:18 -0800)]
skbuff: Add skb parameter to the ubuf zerocopy callback

Add an optional skb parameter to the zerocopy callback,
which is passed down from skb_zcopy_clear().  This gives access
to the original skb, which is needed for upcoming RX zero-copy
error handling.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agoskbuff: replace sock_zerocopy_get with skb_zcopy_get
Jonathan Lemon [Wed, 6 Jan 2021 22:18:33 +0000 (14:18 -0800)]
skbuff: replace sock_zerocopy_get with skb_zcopy_get

Rename the get routines for consistency.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agoskbuff: replace sock_zerocopy_put() with skb_zcopy_put()
Jonathan Lemon [Wed, 6 Jan 2021 22:18:32 +0000 (14:18 -0800)]
skbuff: replace sock_zerocopy_put() with skb_zcopy_put()

Replace sock_zerocopy_put with the generic skb_zcopy_put()
function.  Pass 'true' as the success argument, as this
is identical to no change.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agoskbuff: Push status and refcounts into sock_zerocopy_callback
Jonathan Lemon [Wed, 6 Jan 2021 22:18:31 +0000 (14:18 -0800)]
skbuff: Push status and refcounts into sock_zerocopy_callback

Before this change, the caller of sock_zerocopy_callback would
need to save the zerocopy status, decrement and check the refcount,
and then call the callback function - the callback was only invoked
when the refcount reached zero.

Now, the caller just passes the status into the callback function,
which saves the status and handles its own refcounts.

This makes the behavior of the sock_zerocopy_callback identical
to the tpacket and vhost callbacks.
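
A minimal sketch of the new calling convention, under stated assumptions:
struct ubuf, its fields and ubuf_callback() are hypothetical stand-ins
for illustration, not the kernel's actual definitions. The caller only
passes the status; the callback records it and drops its own reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's ubuf structure. */
struct ubuf {
	int refcnt;       /* outstanding references */
	bool zerocopy;    /* false once any user reported a fallback to copying */
	int completions;  /* how many times the completion step ran */
};

/* New-style callback: saves the status and handles its own refcount. */
static void ubuf_callback(struct ubuf *uarg, bool success)
{
	assert(uarg != NULL && uarg->refcnt > 0);

	/* Accumulate the status instead of having the caller save it. */
	uarg->zerocopy = uarg->zerocopy && success;

	/* Only the final reference triggers the completion step. */
	if (--uarg->refcnt == 0)
		uarg->completions++;
}
```

With two references held, the first call only records the status and the
second one runs the completion step.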

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agoskbuff: simplify sock_zerocopy_put
Jonathan Lemon [Wed, 6 Jan 2021 22:18:30 +0000 (14:18 -0800)]
skbuff: simplify sock_zerocopy_put

All 'struct ubuf_info' users should have a callback defined
as of commit 0a4a060bb204 ("sock: fix zerocopy_success regression
with msg_zerocopy").

Remove the dead code path to consume_skb(), which makes
assumptions about how the structure was allocated.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agoskbuff: remove unused skb_zcopy_abort function
Jonathan Lemon [Wed, 6 Jan 2021 22:18:29 +0000 (14:18 -0800)]
skbuff: remove unused skb_zcopy_abort function

skb_zcopy_abort() has no in-tree consumers, remove it.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agoMerge tag 'gcc-plugins-v5.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 8 Jan 2021 00:03:19 +0000 (16:03 -0800)]
Merge tag 'gcc-plugins-v5.11-rc3' of git://git./linux/kernel/git/kees/linux

Pull gcc-plugins fix from Kees Cook:
 "Bump c++ standard version for latest GCC versions (Valdis Kletnieks)"

* tag 'gcc-plugins-v5.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  gcc-plugins: fix gcc 11 indigestion with plugins...

3 years agoMerge branch 'dwmac-meson8b-picosecond-precision-rx-delay-support'
Jakub Kicinski [Thu, 7 Jan 2021 23:58:35 +0000 (15:58 -0800)]
Merge branch 'dwmac-meson8b-picosecond-precision-rx-delay-support'

Martin Blumenstingl says:

====================
dwmac-meson8b: picosecond precision RX delay support

With the help of Jianxin Pan (many thanks!) the meaning of the "new"
PRG_ETH1[19:16] register bits on Amlogic Meson G12A, G12B and SM1 SoCs
is finally known. These SoCs allow fine-tuning the RGMII RX delay in
200ps steps (contrary to what I have thought in the past [0] these are
not some "calibration" values).

The vendor u-boot has code to automatically detect the best RX/TX delay
settings. For now we keep it simple and add a device-tree property with
200ps precision to select the "right" RX delay for each board.

While here, deprecate the "amlogic,rx-delay-ns" property as it's not
used on any upstream .dts (yet). The driver is backwards compatible.

I have tested this on an X96 Air 4GB board (not upstream yet). Testing
with iperf3 gives 938 Mbits/sec in both directions (RX and TX). The
following network settings were used in the .dts (2ns TX delay
generated by the PHY, 800ps RX delay generated by the MAC as the PHY
only supports 0ns or 2ns RX delays):
        &ext_mdio {
                external_phy: ethernet-phy@0 {
                        /* Realtek RTL8211F (0x001cc916) */
                        reg = <0>;
                        eee-broken-1000t;

                        reset-assert-us = <10000>;
                        reset-deassert-us = <30000>;
                        reset-gpios = <&gpio GPIOZ_15 (GPIO_ACTIVE_LOW |
                                                GPIO_OPEN_DRAIN)>;

                        interrupt-parent = <&gpio_intc>;
                        /* MAC_INTR on GPIOZ_14 */
                        interrupts = <26 IRQ_TYPE_LEVEL_LOW>;
                };
        };

        &ethmac {
                status = "okay";

                pinctrl-0 = <&eth_pins>, <&eth_rgmii_pins>;
                pinctrl-names = "default";

                phy-mode = "rgmii-txid";
                phy-handle = <&external_phy>;

                amlogic,rgmii-rx-delay-ps = <800>;
        };

To use the same settings from vendor u-boot (which in my case has broken
Ethernet) the following commands can be used:
  mw.l 0xff634540 0x1621
  mw.l 0xff634544 0x30000
  phyreg w 0x0 0x1040
  phyreg w 0x1f 0xd08
  phyreg w 0x11 0x9
  phyreg w 0x15 0x11
  phyreg w 0x1f 0x0
  phyreg w 0x0 0x9200

Also I have tested this on an X96 Max board without any .dts changes
to confirm that other boards with the same IP block still work fine
with these changes.

Changes since v3 at [3]:
- added Florian's Reviewed-by to patch 1 (thank you!)
- rebased on top of net-next

Changes since v2 at [2]:
- use the generic property name "rx-internal-delay-ps" as suggested by
  Rob (thanks!). This affects patches #1 and #3. The biggest change is
  in patch #1 which is why I didn't add Florian's and Andrew's
  Reviewed-by
- added Andrew's and Florian's Reviewed-by to patches 2, 3, 4, 5 (many
  thanks to both!). I decided to do this despite renaming the property
  to the generic name "rx-internal-delay-ps" as it only affects the
  patch description and one line of code
- updated patch description of patch #3 to explain why there's not a
  lot of validation when parsing the old device-tree property (in
  nanosecond precision)
- dropped RFC status

Changes since v1 at [1]:
- updated patch 1 by making it more clear when the RX delay is applied.
  Thanks to Andrew for the suggestion!
- added a fix to enabling the timing-adjustment clock only when really
  needed. Found by Andrew - thanks!
- added testing note about X96 Max
- v1 did not go to the netdev mailing list, v2 fixes this

[0] https://lore.kernel.org/netdev/CAFBinCATt4Hi9rigj52nMf3oygyFbnopZcsakGL=KyWnsjY3JA@mail.gmail.com/
[1] https://patchwork.kernel.org/project/linux-amlogic/list/?series=384279&state=%2A&archive=both
[2] https://patchwork.kernel.org/project/linux-amlogic/list/?series=384491&state=%2A&archive=both
[3] https://patchwork.kernel.org/project/linux-amlogic/list/?series=406005&state=%2A&archive=both
====================

Link: https://lore.kernel.org/r/20210106134251.45264-1-martin.blumenstingl@googlemail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agonet: stmmac: dwmac-meson8b: add support for the RGMII RX delay on G12A
Martin Blumenstingl [Wed, 6 Jan 2021 13:42:51 +0000 (14:42 +0100)]
net: stmmac: dwmac-meson8b: add support for the RGMII RX delay on G12A

Amlogic Meson G12A (and newer: G12B, SM1) SoCs have a more advanced RX
delay logic. Instead of fine-tuning the delay in the nanoseconds range
it now allows tuning in 200 picosecond steps. This support comes with
new bits in the PRG_ETH1[19:16] register.

Add support for validating the RGMII RX delay as well as configuring the
register accordingly on these platforms.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agonet: stmmac: dwmac-meson8b: move RGMII delays into a separate function
Martin Blumenstingl [Wed, 6 Jan 2021 13:42:50 +0000 (14:42 +0100)]
net: stmmac: dwmac-meson8b: move RGMII delays into a separate function

Newer SoCs starting with the Amlogic Meson G12A have a more precise
RGMII RX delay configuration register. This means more complexity in the
code. Extract the existing RGMII delay configuration code into a
separate function to make it easier to read/understand even when adding
more logic in the future.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agonet: stmmac: dwmac-meson8b: use picoseconds for the RGMII RX delay
Martin Blumenstingl [Wed, 6 Jan 2021 13:42:49 +0000 (14:42 +0100)]
net: stmmac: dwmac-meson8b: use picoseconds for the RGMII RX delay

Amlogic Meson G12A, G12B and SM1 SoCs have a more advanced RGMII RX
delay register which allows picoseconds precision. Parse the new
"rx-internal-delay-ps" property or fall back to the value from the old
"amlogic,rx-delay-ns" property.

No upstream DTB uses the old "amlogic,rx-delay-ns" property (yet).
Only include minimalistic logic to fall back to the old property,
without any special validation (for example if the old and new
property are given at the same time).
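
The fallback can be sketched as follows. struct opt_u32 and
rx_delay_ps() are hypothetical names that model an optional device-tree
property; defaulting to 0 when neither property is present is an
assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of an optional u32 device-tree property. */
struct opt_u32 {
	bool present;
	uint32_t value;
};

/*
 * Prefer the new picosecond property; otherwise convert the old
 * nanosecond property. No cross-validation is done when both are given.
 */
static uint32_t rx_delay_ps(struct opt_u32 new_ps, struct opt_u32 old_ns)
{
	if (new_ps.present)
		return new_ps.value;

	if (old_ns.present) {
		/* Sketch-level guard against overflow in the conversion. */
		assert(old_ns.value <= UINT32_MAX / 1000u);
		return old_ns.value * 1000u;
	}

	return 0; /* assumed default: no RX delay */
}
```

When both properties are given, the new one simply wins, which matches
the "minimalistic" fallback described above.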

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agonet: stmmac: dwmac-meson8b: fix enabling the timing-adjustment clock
Martin Blumenstingl [Wed, 6 Jan 2021 13:42:48 +0000 (14:42 +0100)]
net: stmmac: dwmac-meson8b: fix enabling the timing-adjustment clock

The timing-adjustment clock only has to be enabled when a) there is a
2ns RX delay configured using device-tree and b) the phy-mode indicates
that the RX delay should be enabled.

Only enable the RX delay if both are true, instead of (by accident) also
enabling it when there's the 2ns RX delay configured but the phy-mode
indicates that the RX delay is not used.
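
A small sketch of the corrected condition. The enum and function names
are hypothetical, and which phy-modes leave the RX delay to the MAC
("rgmii" and "rgmii-txid" here) is an assumption based on the usual
RGMII phy-mode semantics.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of phy-modes relevant for this sketch. */
enum phy_mode {
	MODE_RGMII,
	MODE_RGMII_ID,
	MODE_RGMII_RXID,
	MODE_RGMII_TXID,
	MODE_RMII,
};

/* Assumption: the MAC adds the RX delay only if the PHY does not. */
static bool mac_adds_rx_delay(enum phy_mode mode)
{
	return mode == MODE_RGMII || mode == MODE_RGMII_TXID;
}

/* The clock is needed only when both conditions a) and b) hold. */
static bool need_timing_adj_clock(enum phy_mode mode, uint32_t rx_delay_ns)
{
	return rx_delay_ns == 2 && mac_adds_rx_delay(mode);
}
```

The previous behaviour corresponds to checking only rx_delay_ns == 2,
which also enabled the clock for modes where the PHY adds the RX delay.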

Fixes: 9308c47640d515 ("net: stmmac: dwmac-meson8b: add support for the RX delay configuration")
Reported-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agodt-bindings: net: dwmac-meson: use picoseconds for the RGMII RX delay
Martin Blumenstingl [Wed, 6 Jan 2021 13:42:47 +0000 (14:42 +0100)]
dt-bindings: net: dwmac-meson: use picoseconds for the RGMII RX delay

Amlogic Meson G12A, G12B and SM1 SoCs have a more advanced RGMII RX
delay register which allows picoseconds precision. Deprecate the old
"amlogic,rx-delay-ns" in favour of the generic "rx-internal-delay-ps"
property.

For older SoCs the only known supported values were 0ns and 2ns. The new
SoCs have support for RGMII RX delays between 0ps and 3000ps in 200ps
steps.

Don't carry over the description for the "rx-internal-delay-ps" property
and inherit that from ethernet-controller.yaml instead.

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agoMerge branch 'reduce-coupling-between-dsa-and-broadcom-systemport-driver'
Jakub Kicinski [Thu, 7 Jan 2021 23:42:09 +0000 (15:42 -0800)]
Merge branch 'reduce-coupling-between-dsa-and-broadcom-systemport-driver'

Vladimir Oltean says:

====================
Reduce coupling between DSA and Broadcom SYSTEMPORT driver

Upon a quick inspection, it seems that there is some code in the generic
DSA layer that is somehow specific to the Broadcom SYSTEMPORT driver.
The challenge there is that the hardware integration is very tight between
the switch and the DSA master interface. However this does not mean that
the drivers must also be as integrated as the hardware is. We can avoid
creating a DSA notifier just for the Broadcom SYSTEMPORT, and we can
move some Broadcom-specific queue mapping helpers outside of the common
include/net/dsa.h.
====================

Link: https://lore.kernel.org/r/20210107012403.1521114-1-olteanv@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agonet: dsa: remove the DSA specific notifiers
Vladimir Oltean [Thu, 7 Jan 2021 01:24:03 +0000 (03:24 +0200)]
net: dsa: remove the DSA specific notifiers

This effectively reverts commit 60724d4bae14 ("net: dsa: Add support for
DSA specific notifiers"). The reason is that since commit 2f1e8ea726e9
("net: dsa: link interfaces with the DSA master to get rid of lockdep
warnings"), it appears that there is a generic way to achieve the same
purpose. The only user thus far, the Broadcom SYSTEMPORT driver, was
converted to use the generic notifiers.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agonet: systemport: use standard netdevice notifier to detect DSA presence
Vladimir Oltean [Thu, 7 Jan 2021 01:24:02 +0000 (03:24 +0200)]
net: systemport: use standard netdevice notifier to detect DSA presence

The SYSTEMPORT driver maps each port of the embedded Broadcom DSA switch
to a certain queue of the master Ethernet controller. For that it
currently uses a dedicated notifier infrastructure which was added in
commit 60724d4bae14 ("net: dsa: Add support for DSA specific notifiers").

However, since commit 2f1e8ea726e9 ("net: dsa: link interfaces with the
DSA master to get rid of lockdep warnings"), DSA is actually an upper of
the Broadcom SYSTEMPORT as far as the netdevice adjacency lists are
concerned. So naturally, the plain NETDEV_CHANGEUPPER net device notifiers
are emitted. It looks like there is enough API exposed by DSA to the
outside world already to make the call_dsa_notifiers API redundant. So
let's convert its only user to plain netdev notifiers.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agonet: dsa: export dsa_slave_dev_check
Vladimir Oltean [Thu, 7 Jan 2021 01:24:01 +0000 (03:24 +0200)]
net: dsa: export dsa_slave_dev_check

Using the NETDEV_CHANGEUPPER notifications, drivers can be aware when
they are enslaved to e.g. a bridge by calling netif_is_bridge_master().

Export this helper from DSA to get the equivalent functionality of
determining whether the upper interface of a CHANGEUPPER notifier is a
DSA switch interface or not.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agonet: dsa: move the Broadcom tag information in a separate header file
Vladimir Oltean [Thu, 7 Jan 2021 01:24:00 +0000 (03:24 +0200)]
net: dsa: move the Broadcom tag information in a separate header file

It is a bit strange to see something as specific as Broadcom SYSTEMPORT
bits in the main DSA include file. Move these away into a separate
header, and have the tagger and the SYSTEMPORT driver include them.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agoMerge branch 'offload-software-learnt-bridge-addresses-to-dsa'
Jakub Kicinski [Thu, 7 Jan 2021 23:34:48 +0000 (15:34 -0800)]
Merge branch 'offload-software-learnt-bridge-addresses-to-dsa'

Vladimir Oltean says:

====================
Offload software learnt bridge addresses to DSA

This series tries to make DSA behave a bit more sanely when bridged with
"foreign" (non-DSA) interfaces and source address learning is not
supported on the hardware CPU port (which would make things work more
seamlessly without software intervention). When a station A connected to
a DSA switch port needs to talk to another station B connected to a
non-DSA port through the Linux bridge, DSA must explicitly add a route
for station B towards its CPU port.

Initial RFC was posted here:
https://patchwork.ozlabs.org/project/netdev/cover/20201108131953.2462644-1-olteanv@gmail.com/

v2 was posted here:
https://patchwork.kernel.org/project/netdevbpf/cover/20201213024018.772586-1-vladimir.oltean@nxp.com/

v3 was posted here:
https://patchwork.kernel.org/project/netdevbpf/cover/20201213140710.1198050-1-vladimir.oltean@nxp.com/

This is a resend of the previous v3 with some added Reviewed-by tags.
====================

Link: https://lore.kernel.org/r/20210106095136.224739-1-olteanv@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agonet: dsa: ocelot: request DSA to fix up lack of address learning on CPU port
Vladimir Oltean [Wed, 6 Jan 2021 09:51:36 +0000 (11:51 +0200)]
net: dsa: ocelot: request DSA to fix up lack of address learning on CPU port

Given the following setup:

ip link add br0 type bridge
ip link set eno0 master br0
ip link set swp0 master br0
ip link set swp1 master br0
ip link set swp2 master br0
ip link set swp3 master br0

Currently, packets received on a DSA slave interface (such as swp0)
which should be routed by the software bridge towards a non-switch port
(such as eno0) are also flooded towards the other switch ports (swp1,
swp2, swp3) because the destination is unknown to the hardware switch.

This patch addresses the issue by monitoring the addresses learnt by the
software bridge on eno0, and adding/deleting them as static FDB entries
on the CPU port accordingly.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agonet: dsa: listen for SWITCHDEV_{FDB,DEL}_ADD_TO_DEVICE on foreign bridge neighbors
Vladimir Oltean [Wed, 6 Jan 2021 09:51:35 +0000 (11:51 +0200)]
net: dsa: listen for SWITCHDEV_{FDB,DEL}_ADD_TO_DEVICE on foreign bridge neighbors

Some DSA switches (and not only DSA ones) cannot learn source MAC addresses from
packets injected from the CPU. They only perform hardware address
learning from inbound traffic.

This can be problematic when we have a bridge spanning some DSA switch
ports and some non-DSA ports (which we'll call "foreign interfaces" from
DSA's perspective).

There are 2 classes of problems created by the lack of learning on
CPU-injected traffic:
- excessive flooding, due to the fact that DSA treats those addresses as
  unknown
- the risk of stale routes, which can lead to temporary packet loss

To illustrate the second class, consider the following situation, which
is common in production equipment (wireless access points, where there
is a WLAN interface and an Ethernet switch, and these form a single
bridging domain).

 AP 1:
 +------------------------------------------------------------------------+
 |                                          br0                           |
 +------------------------------------------------------------------------+
 +------------+ +------------+ +------------+ +------------+ +------------+
 |    swp0    | |    swp1    | |    swp2    | |    swp3    | |    wlan0   |
 +------------+ +------------+ +------------+ +------------+ +------------+
       |                                                       ^        ^
       |                                                       |        |
       |                                                       |        |
       |                                                    Client A  Client B
       |
       |
       |
 +------------+ +------------+ +------------+ +------------+ +------------+
 |    swp0    | |    swp1    | |    swp2    | |    swp3    | |    wlan0   |
 +------------+ +------------+ +------------+ +------------+ +------------+
 +------------------------------------------------------------------------+
 |                                          br0                           |
 +------------------------------------------------------------------------+
 AP 2

- br0 of AP 1 will know that Clients A and B are reachable via wlan0
- the hardware fdb of a DSA switch driver today is not kept in sync with
  the software entries on other bridge ports, so it will not know that
  clients A and B are reachable via the CPU port UNLESS the hardware
  switch itself performs SA learning from traffic injected from the CPU.
  Nonetheless, a substantial number of switches don't.
- the hardware fdb of the DSA switch on AP 2 may autonomously learn that
  Client A and B are reachable through swp0. Therefore, the software br0
  of AP 2 also may or may not learn this. In the example we're
  illustrating, some Ethernet traffic has been going on, and br0 from AP
  2 has indeed learnt that it can reach Client B through swp0.

One of the wireless clients, say Client B, disconnects from AP 1 and
roams to AP 2. The topology now looks like this:

 AP 1:
 +------------------------------------------------------------------------+
 |                                          br0                           |
 +------------------------------------------------------------------------+
 +------------+ +------------+ +------------+ +------------+ +------------+
 |    swp0    | |    swp1    | |    swp2    | |    swp3    | |    wlan0   |
 +------------+ +------------+ +------------+ +------------+ +------------+
       |                                                            ^
       |                                                            |
       |                                                         Client A
       |
       |
       |                                                         Client B
       |                                                            |
       |                                                            v
 +------------+ +------------+ +------------+ +------------+ +------------+
 |    swp0    | |    swp1    | |    swp2    | |    swp3    | |    wlan0   |
 +------------+ +------------+ +------------+ +------------+ +------------+
 +------------------------------------------------------------------------+
 |                                          br0                           |
 +------------------------------------------------------------------------+
 AP 2

- br0 of AP 1 still knows that Client A is reachable via wlan0 (no change)
- br0 of AP 1 will (possibly) know that Client B has left wlan0. There
  are cases where it might never find out though. Either way, DSA today
  does not process that notification in any way.
- the hardware FDB of the DSA switch on AP 1 may learn autonomously that
  Client B can be reached via swp0, if it receives any packet with
  Client B's source MAC address over Ethernet.
- the hardware FDB of the DSA switch on AP 2 still thinks that Client B
  can be reached via swp0. It does not know that it has roamed to wlan0,
  because it doesn't perform SA learning from the CPU port.

Now Client A contacts Client B.
AP 1 routes the packet fine towards swp0 and delivers it on the Ethernet
segment.
AP 2 sees a frame on swp0 and its fdb says that the destination is swp0.
Hairpinning is disabled => drop.

This problem comes from the fact that these switches have a 'blind spot'
for addresses coming from software bridging. The generic solution is not
to assume that hardware learning can be enabled somehow, but to listen
to more bridge learning events. It turns out that the bridge driver does
learn in software from all inbound frames, in __br_handle_local_finish.
A proper SWITCHDEV_FDB_ADD_TO_DEVICE notification is emitted for the
addresses serviced by the bridge on 'foreign' interfaces. The software
bridge also does the right thing on migration, by notifying that the old
entry is deleted, so that does not need to be special-cased in DSA. When
it is deleted, we just need to delete our static FDB entry towards the
CPU too, and wait.

The problem is that DSA currently only cares about SWITCHDEV_FDB_ADD_TO_DEVICE
events received on its own interfaces, such as static FDB entries.

Luckily we can change that, and DSA can listen to all switchdev FDB
add/del events in the system and figure out if those events were emitted
by a bridge that spans at least one of DSA's own ports. In case that is
true, DSA will also offload that address towards its own CPU port, in
the eventuality that there might be bridge clients attached to the DSA
switch who want to talk to the station connected to the foreign
interface.

In terms of implementation, we need to keep the fdb_info->added_by_user
check for the case where the switchdev event was targeted directly at a
DSA switch port. But we don't need to look at that flag for snooped
events. So the check currently happens too late; we need to move it earlier.
This also simplifies the code a bit, since we avoid uselessly allocating
and freeing switchdev_work.

We could probably do some improvements in the future. For example,
multi-bridge support is rudimentary at the moment. If there are two
bridges spanning a DSA switch's ports, and both of them need to service
the same MAC address, then what will happen is that the migration of one
of those stations will trigger the deletion of the FDB entry from the
CPU port while it is still used by the other bridge. That could be improved
with reference counting but is left for another time.

This behavior needs to be enabled at driver level by setting
ds->assisted_learning_on_cpu_port = true. This is because we don't want
to inflict a potential performance penalty (accesses through
MDIO/I2C/SPI are expensive) to hardware that really doesn't need it
because address learning on the CPU port works there.
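
The resulting decision can be sketched like this. The struct, the enum
and classify_fdb_event() are hypothetical and only model the rules
stated above; they are not the code added by this patch.

```c
#include <stdbool.h>

/* Hypothetical summary of what is known about one switchdev FDB event. */
struct fdb_event_ctx {
	bool on_dsa_port;           /* event targeted directly at a DSA port */
	bool added_by_user;         /* models fdb_info->added_by_user */
	bool bridge_spans_dsa_port; /* emitting bridge has a DSA port */
	bool assisted_learning;     /* ds->assisted_learning_on_cpu_port */
};

enum fdb_target { FDB_IGNORE, FDB_USER_PORT, FDB_CPU_PORT };

static enum fdb_target classify_fdb_event(const struct fdb_event_ctx *c)
{
	/* Direct events keep the added_by_user check. */
	if (c->on_dsa_port)
		return c->added_by_user ? FDB_USER_PORT : FDB_IGNORE;

	/* Snooped events ignore that flag but need the driver opt-in. */
	if (c->bridge_spans_dsa_port && c->assisted_learning)
		return FDB_CPU_PORT;

	return FDB_IGNORE;
}
```

Doing this classification before any work item is allocated is what
avoids the useless allocation and free of switchdev_work.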

Reported-by: DENG Qingfang <dqfext@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agonet: dsa: exit early in dsa_slave_switchdev_event if we can't program the FDB
Vladimir Oltean [Wed, 6 Jan 2021 09:51:34 +0000 (11:51 +0200)]
net: dsa: exit early in dsa_slave_switchdev_event if we can't program the FDB

Right now, the following would happen for a switch driver that does not
implement .port_fdb_add or .port_fdb_del.

dsa_slave_switchdev_event returns NOTIFY_OK and schedules:
-> dsa_slave_switchdev_event_work
   -> dsa_port_fdb_add
      -> dsa_port_notify(DSA_NOTIFIER_FDB_ADD)
         -> dsa_switch_fdb_add
            -> if (!ds->ops->port_fdb_add) return -EOPNOTSUPP;
   -> an error is printed with dev_dbg, and
      dsa_fdb_offload_notify(switchdev_work) is not called.

We can avoid scheduling the worker for nothing and say NOTIFY_DONE.
Because we don't call dsa_fdb_offload_notify, the static FDB entry will
remain just in the software bridge.
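
A minimal sketch of the early exit, assuming a simplified ops structure.
struct switch_ops, stub_fdb_op(), fdb_event() and the scheduled_work
counter are hypothetical; only the NOTIFY_* values mirror
linux/notifier.h.

```c
#include <stddef.h>

enum { NOTIFY_DONE = 0, NOTIFY_OK = 1 };

/* Hypothetical, simplified view of the driver's FDB operations. */
struct switch_ops {
	int (*port_fdb_add)(int port, const unsigned char *addr);
	int (*port_fdb_del)(int port, const unsigned char *addr);
};

/* Counts how often the deferred worker would have been scheduled. */
static int scheduled_work;

/* Example driver callback that accepts every request. */
static int stub_fdb_op(int port, const unsigned char *addr)
{
	(void)port;
	(void)addr;
	return 0;
}

static int fdb_event(const struct switch_ops *ops)
{
	/* Nothing can be programmed: do not schedule the worker at all. */
	if (!ops->port_fdb_add || !ops->port_fdb_del)
		return NOTIFY_DONE;

	scheduled_work++;
	return NOTIFY_OK;
}
```

A driver that leaves both callbacks unset never reaches the point where
the worker is scheduled.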

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agonet: dsa: move switchdev event implementation under the same switch/case statement
Vladimir Oltean [Wed, 6 Jan 2021 09:51:33 +0000 (11:51 +0200)]
net: dsa: move switchdev event implementation under the same switch/case statement

We'll need to start listening to SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE
events even for interfaces where dsa_slave_dev_check returns false, so
we need that check inside the switch-case statement for SWITCHDEV_FDB_*.

This movement also avoids a useless allocation / free of switchdev_work
on the untreated "default event" case.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agonet: dsa: don't use switchdev_notifier_fdb_info in dsa_switchdev_event_work
Vladimir Oltean [Wed, 6 Jan 2021 09:51:32 +0000 (11:51 +0200)]
net: dsa: don't use switchdev_notifier_fdb_info in dsa_switchdev_event_work

Currently DSA doesn't add FDB entries on the CPU port, because it only
does so through switchdev, which is associated with a net_device, and
there are none of those for the CPU port.

But actually FDB addresses on the CPU port have some use cases of their
own, if the switchdev operations are initiated from within the DSA
layer. There is just one problem with the existing code: it passes a
structure in dsa_switchdev_event_work which was retrieved directly from
switchdev, so it contains a net_device. We need to generalize the
contents to something that covers the CPU port as well: the "ds, port"
tuple is fine for that.

Note that the new procedure for notifying the successful FDB offload is
inspired from the rocker model.

Also, nothing was being done if added_by_user was false. Let's check for
that a lot earlier, and don't actually bother to schedule the worker
for nothing.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>