linux-2.6-microblaze.git
Alexei Starovoitov [Mon, 19 Apr 2021 22:27:37 +0000 (15:27 -0700)]
Merge branch 'Add a snprintf eBPF helper'

Florent Revest says:

====================

We have a use case where we want to audit symbol names (if available) in
callback registration hooks (e.g. fentry/nf_register_net_hook).

A few months back, I proposed a bpf_kallsyms_lookup series but it was
decided in the reviews that a more generic helper, bpf_snprintf, would
be more useful.

This series implements the helper according to the feedback received in
https://lore.kernel.org/bpf/20201126165748.1748417-1-revest@google.com/T/#u

- A new arg type guarantees the NULL-termination of string arguments and
  lets us pass format strings in only one arg
- A new helper is implemented using that guarantee. Because the format
  string is known at verification time, the format string validation is
  done by the verifier
- To implement a series of tests for bpf_snprintf, the logic for
  marshalling variadic args in a fixed-size array is reworked as per:
https://lore.kernel.org/bpf/20210310015455.1095207-1-revest@chromium.org/T/#u

---
Changes in v5:
- Fixed the bpf_printf_buf_used counter logic in try_get_fmt_tmp_buf
- Added a couple of extra incorrect specifiers tests
- Call test_snprintf_single__destroy unconditionally
- Fixed a C++-style comment

---
Changes in v4:
- Moved bpf_snprintf, bpf_printf_prepare and bpf_printf_cleanup to
  kernel/bpf/helpers.c so that they get built without CONFIG_BPF_EVENTS
- Added negative test cases (various invalid format strings)
- Renamed put_fmt_tmp_buf() as bpf_printf_cleanup()
- Fixed a mistake that caused temporary buffers to be unconditionally
  freed in bpf_printf_prepare
- Fixed a mistake that caused a missing 0 character to be ignored
- Fixed a warning about integer to pointer conversion
- Misc cleanups

---
Changes in v3:
- Simplified temporary buffer acquisition with try_get_fmt_tmp_buf()
- Made zero-termination check more consistent
- Allowed NULL output_buffer
- Simplified the BPF_CAST_FMT_ARG macro
- Three new test cases: number padding, simple string with no arg and
  string length extraction only with a NULL output buffer
- Clarified helper's description for edge cases (eg: str_size == 0)
- Lots of cosmetic changes

---
Changes in v2:
- Extracted the format validation/argument sanitization in a generic way
  for all printf-like helpers.
- bpf_snprintf's str_size can now be 0
- bpf_snprintf is now exposed to all BPF program types
- We now preempt_disable when using a per-cpu temporary buffer
- Addressed a few cosmetic changes
====================

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Florent Revest [Mon, 19 Apr 2021 15:52:43 +0000 (17:52 +0200)]
selftests/bpf: Add a series of tests for bpf_snprintf

The "positive" part tests all format specifiers when things go well.

The "negative" part makes sure that incorrect format strings fail at
load time.

Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210419155243.1632274-7-revest@chromium.org
Florent Revest [Mon, 19 Apr 2021 15:52:42 +0000 (17:52 +0200)]
libbpf: Introduce a BPF_SNPRINTF helper macro

Similarly to BPF_SEQ_PRINTF, this macro turns variadic arguments into an
array of u64, making it more natural to call the bpf_snprintf helper.
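
As a rough illustration of the idea (not the libbpf macro itself), the
userspace sketch below packs variadic arguments into a u64 array, assigning
one slot at a time. All SKETCH_* names are hypothetical, and the sketch only
handles one to three arguments.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical macros for illustration; these are not the libbpf definitions. */

/* Count 1..3 arguments by letting them shift a descending number list. */
#define SKETCH_NARG_(_1, _2, _3, N, ...) N
#define SKETCH_NARG(...) SKETCH_NARG_(__VA_ARGS__, 3, 2, 1, 0)

#define SKETCH_CAT_(a, b) a##b
#define SKETCH_CAT(a, b) SKETCH_CAT_(a, b)

/* Assign each slot separately instead of using one array initializer. */
#define SKETCH_FILL1(arr, a) (arr)[0] = (unsigned long long)(a)
#define SKETCH_FILL2(arr, a, b) \
	SKETCH_FILL1(arr, a); (arr)[1] = (unsigned long long)(b)
#define SKETCH_FILL3(arr, a, b, c) \
	SKETCH_FILL2(arr, a, b); (arr)[2] = (unsigned long long)(c)

/* Dispatch to SKETCH_FILL<n> according to the number of arguments. */
#define SKETCH_PACK(arr, ...) \
	do { \
		SKETCH_CAT(SKETCH_FILL, SKETCH_NARG(__VA_ARGS__))(arr, __VA_ARGS__); \
	} while (0)

/* Example use: pack a string pointer and two integers into out[]. */
static void sketch_pack_example(uint64_t out[3], const char *name)
{
	SKETCH_PACK(out, name, 7, 9);
	assert(out[1] == 7 && out[2] == 9);
}
```

The packed array can then be handed to a printf-like function that takes
its arguments as an array plus a length.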

Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210419155243.1632274-6-revest@chromium.org
Florent Revest [Mon, 19 Apr 2021 15:52:41 +0000 (17:52 +0200)]
libbpf: Initialize the bpf_seq_printf parameters array field by field

When initializing the __param array with a one liner, if all args are
const, the initial array value will be placed in the rodata section but
because libbpf does not support relocation in the rodata section, any
pointer in this array will stay NULL.

Fixes: c09add2fbc5a ("tools/libbpf: Add bpf_iter support")
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210419155243.1632274-5-revest@chromium.org
Florent Revest [Mon, 19 Apr 2021 15:52:40 +0000 (17:52 +0200)]
bpf: Add a bpf_snprintf helper

The implementation takes inspiration from the existing bpf_trace_printk
helper but there are a few differences:

To allow for a large number of format-specifiers, parameters are
provided in an array, like in bpf_seq_printf.

Because the output string takes two arguments and the array of
parameters also takes two arguments, the format string needs to fit in
one argument. Thankfully, ARG_PTR_TO_CONST_STR is guaranteed to point to
a zero-terminated read-only map so we don't need a format string length
arg.

Because the format-string is known at verification time, we also do
a first pass of format string validation in the verifier logic. This
makes debugging easier.
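
The calling convention described above can be modelled in userspace roughly
as follows. This is only a sketch under stated assumptions: sketch_snprintf
is a hypothetical name, it handles just %d, %u, %s and %%, and it is not the
kernel implementation.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical userspace model: format `fmt` into `out`, taking the
 * arguments from a u64 array. Returns the length of the full output
 * including the trailing zero, or -1 on an unsupported specifier or when
 * the array holds too few arguments. `out` may be NULL if out_size is 0.
 */
static int sketch_snprintf(char *out, size_t out_size, const char *fmt,
			   const uint64_t *args, size_t nargs)
{
	size_t len = 0;  /* length of the complete, untruncated output */
	size_t used = 0; /* number of array slots consumed so far */

	assert(out != NULL || out_size == 0);

	for (const char *p = fmt; *p; p++) {
		char tmp[32];
		const char *piece = tmp;
		size_t n;

		if (*p != '%') {
			tmp[0] = *p;
			tmp[1] = '\0';
		} else if (*++p == '%') {
			tmp[0] = '%';
			tmp[1] = '\0';
		} else {
			if (used >= nargs)
				return -1;
			switch (*p) {
			case 'd':
				snprintf(tmp, sizeof(tmp), "%" PRId64,
					 (int64_t)args[used++]);
				break;
			case 'u':
				snprintf(tmp, sizeof(tmp), "%" PRIu64, args[used++]);
				break;
			case 's':
				piece = (const char *)(uintptr_t)args[used++];
				break;
			default: /* includes a '%' at the very end of fmt */
				return -1;
			}
		}

		/* Copy what fits, but keep counting the full length. */
		n = strlen(piece);
		for (size_t i = 0; i < n; i++, len++)
			if (out_size && len + 1 < out_size)
				out[len] = piece[i];
	}
	if (out_size)
		out[len < out_size ? len : out_size - 1] = '\0';
	return (int)(len + 1);
}
```

Calling it with a NULL buffer and a size of 0 only computes the required
length, which mirrors the "string length extraction" test case mentioned in
the cover letter.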

Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210419155243.1632274-4-revest@chromium.org
Florent Revest [Mon, 19 Apr 2021 15:52:39 +0000 (17:52 +0200)]
bpf: Add a ARG_PTR_TO_CONST_STR argument type

This type provides the guarantee that an argument is going to be a const
pointer to somewhere in a read-only map value. It also checks that this
pointer is followed by a zero character before the end of the map value.
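
The termination check can be pictured with a small userspace sketch. The
function name and signature are hypothetical, and the real verifier logic
also has to establish that the map is read-only, which is not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model: `value` is a map value of `value_size` bytes and
 * `off` is the offset the argument points to. The string is acceptable
 * only if a zero byte occurs between `off` and the end of the value.
 */
static bool sketch_const_str_ok(const char *value, size_t value_size,
				size_t off)
{
	assert(value != NULL);

	if (off >= value_size)
		return false;
	return memchr(value + off, '\0', value_size - off) != NULL;
}
```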

Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210419155243.1632274-3-revest@chromium.org
Florent Revest [Mon, 19 Apr 2021 15:52:38 +0000 (17:52 +0200)]
bpf: Factorize bpf_trace_printk and bpf_seq_printf

Two helpers (trace_printk and seq_printf) have very similar
implementations of format string parsing and a third one is coming
(snprintf). To avoid code duplication and make the code easier to
maintain, this moves the operations associated with format string
parsing (validation and argument sanitization) into one generic
function.
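
To give a feel for what such a shared validation pass does, here is a
simplified userspace sketch. The name sketch_check_fmt and the accepted
specifier set (d, i, u, x, X with optional l/ll, plus s and p) are
assumptions made for illustration; the kernel function accepts more
specifiers and also sanitizes the arguments.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical validator: returns 0 if every conversion in `fmt` is
 * supported and the number of conversions equals `num_args`, -1 otherwise.
 */
static int sketch_check_fmt(const char *fmt, size_t num_args)
{
	size_t needed = 0;

	assert(fmt != NULL);

	for (size_t i = 0; fmt[i]; i++) {
		bool is_long = false;

		if (fmt[i] != '%')
			continue;
		i++;
		if (fmt[i] == '%') /* "%%" prints a percent sign, no argument */
			continue;
		while (isdigit((unsigned char)fmt[i])) /* skip a width field */
			i++;
		if (fmt[i] == 'l') { /* optional l or ll length modifier */
			is_long = true;
			i++;
			if (fmt[i] == 'l')
				i++;
		}
		if (fmt[i] == '\0') /* format ends in the middle of a conversion */
			return -1;
		if (!strchr(is_long ? "diuxX" : "diuxXsp", fmt[i]))
			return -1;
		needed++;
	}
	return needed == num_args ? 0 : -1;
}
```

Running a check of this kind once, before the string is ever formatted, is
what allows an invalid format to be rejected early.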

The implementation of the two existing helpers already drifted quite a
bit so unifying them entailed a lot of changes:

- bpf_trace_printk always expected fmt[fmt_size] to be the terminating
  NUL character. This is no longer true: the first 0 is terminating.
- bpf_trace_printk now supports %% (which produces the percent char).
- bpf_trace_printk now skips width formatting fields.
- bpf_trace_printk now supports the X modifier (capital hexadecimal).
- bpf_trace_printk now supports %pK, %px, %pB, %pi4, %pI4, %pi6 and %pI6
- argument casting on 32 bit has been simplified into one macro and
  using an enum instead of obscure int increments.

- bpf_seq_printf now uses bpf_trace_copy_string instead of
  strncpy_from_kernel_nofault and handles the %pks %pus specifiers.
- bpf_seq_printf now prints longs correctly on 32 bit architectures.

- both were changed to use a global per-cpu tmp buffer instead of one
  stack buffer for trace_printk and 6 small buffers for seq_printf.
- to avoid per-cpu buffer usage conflict, these helpers disable
  preemption while the per-cpu buffer is in use.
- both helpers now support the %ps and %pS specifiers to print symbols.

The implementation is also moved from bpf_trace.c to helpers.c because
the upcoming bpf_snprintf helper will be made available to all BPF
programs and will need it.

Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210419155243.1632274-2-revest@chromium.org
Alexei Starovoitov [Thu, 15 Apr 2021 23:50:22 +0000 (16:50 -0700)]
Merge branch 'bpf: tools: support build selftests/bpf with clang'

Yonghong Song says:

====================

To build the kernel with clang, people typically use
  make -j60 LLVM=1 LLVM_IAS=1
LLVM_IAS=1 is not required for a non-LTO build but
is required for an LTO build. In my environment,
I always set LLVM_IAS=1 regardless of
whether LTO is enabled or not.

After the kernel is built with clang, the following command
can be used to build selftests with clang:
  make -j60 -C tools/testing/selftests/bpf LLVM=1 LLVM_IAS=1

I am using the latest bpf-next kernel code base and
the latest clang built from source from
  https://github.com/llvm/llvm-project.git
Using an earlier version of llvm may cause compilation errors, see
  tools/testing/selftests/bpf
due to the continuous development of llvm bpf features and of the
selftests that use these features.

To run the bpf selftests properly, you need certain kernel configs,
such as those at:
  bpf-next:tools/testing/selftests/bpf/config
(note that this is not a complete .config file and some other configs
 might still be needed.)

Currently, using the above command, some compilations
still use gcc and there are also compilation errors and warnings.
This patch set intends to fix these issues.
Patches #1 and #2 fix the issue so that clang/clang++ is
used instead of gcc/g++. Patch #3 fixes a compilation
failure. Patches #4 and #5 fix various compiler warnings.

Changelog:
  v2 -> v3:
    . more test environment description in cover letter. (Sedat)
    . use a different fix, but similar to other use in selftests/bpf
      Makefile, to exclude header files from CXX compilation command
      line. (Andrii)
    . fix codes instead of adding -Wno-format-security. (Andrii)
  v1 -> v2:
    . add -Wno-unused-command-line-argument and -Wno-format-security
      for clang only as (1). gcc does not exhibit those
      warnings, and (2). -Wno-unused-command-line-argument is
      only supported by clang. (Sedat)
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yonghong Song [Tue, 13 Apr 2021 15:34:35 +0000 (08:34 -0700)]
bpftool: Fix a clang compilation warning

With clang compiler:
  make -j60 LLVM=1 LLVM_IAS=1  <=== compile kernel
  # build selftests/bpf or bpftool
  make -j60 -C tools/testing/selftests/bpf LLVM=1 LLVM_IAS=1
  make -j60 -C tools/bpf/bpftool LLVM=1 LLVM_IAS=1
the following compilation warning showed up,
  net.c:160:37: warning: comparison of integers of different signs: '__u32' (aka 'unsigned int') and 'int' [-Wsign-compare]
                for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
                                                  ^~~~~~~~~~~~~~~~~
  .../tools/include/uapi/linux/netlink.h:99:24: note: expanded from macro 'NLMSG_OK'
                           (nlh)->nlmsg_len <= (len))
                           ~~~~~~~~~~~~~~~~ ^   ~~~

In this particular case, "len" is defined as "int" and (nlh)->nlmsg_len is "unsigned int".
The macro NLMSG_OK is defined as below in uapi/linux/netlink.h.
  #define NLMSG_OK(nlh,len) ((len) >= (int)sizeof(struct nlmsghdr) && \
                             (nlh)->nlmsg_len >= sizeof(struct nlmsghdr) && \
                             (nlh)->nlmsg_len <= (len))

The clang compiler complains about the comparison "(nlh)->nlmsg_len <= (len)",
but in bpftool/net.c, it is already ensured that "len > 0" must be true.
So theoretically the compiler could deduce that comparison of
"(nlh)->nlmsg_len" and "len" is okay, but this really depends on compiler
internals. Let us add an explicit type conversion (from "int" to "unsigned int")
for "len" in NLMSG_OK to silence this warning for now.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210413153435.3029635-1-yhs@fb.com
Yonghong Song [Tue, 13 Apr 2021 15:34:29 +0000 (08:34 -0700)]
selftests/bpf: Silence clang compilation warnings

With clang compiler:
  make -j60 LLVM=1 LLVM_IAS=1  <=== compile kernel
  make -j60 -C tools/testing/selftests/bpf LLVM=1 LLVM_IAS=1
Some linker flags are not used/effective for some binaries and
we have warnings like:
  warning: -lelf: 'linker' input unused [-Wunused-command-line-argument]

We also have warnings like:
  .../selftests/bpf/prog_tests/ns_current_pid_tgid.c:74:57: note: treat the string as an argument to avoid this
        if (CHECK(waitpid(cpid, &wstatus, 0) == -1, "waitpid", strerror(errno)))
                                                               ^
                                                               "%s",
  .../selftests/bpf/test_progs.h:129:35: note: expanded from macro 'CHECK'
        _CHECK(condition, tag, duration, format)
                                         ^
  .../selftests/bpf/test_progs.h:108:21: note: expanded from macro '_CHECK'
                fprintf(stdout, ##format);                              \
                                  ^
The first warning can be silenced with the clang option -Wno-unused-command-line-argument.
For the second warning, the source code is modified as suggested by the compiler
to silence the warning. Since gcc does not support the option
-Wno-unused-command-line-argument and the warning only happens with the clang
compiler, the option -Wno-unused-command-line-argument is enabled only when
the clang compiler is used.
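
The fix the compiler suggests for the second warning boils down to never
passing a runtime string as the format itself. A minimal sketch, with a
hypothetical helper name:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: copy `msg` into `out` via a constant "%s" format,
 * so that any '%' inside `msg` is printed literally instead of being
 * interpreted as a conversion. Returns what snprintf() returns.
 */
static int sketch_report(char *out, size_t out_size, const char *msg)
{
	assert(out != NULL && out_size > 0);

	return snprintf(out, out_size, "%s", msg);
}
```

Writing snprintf(out, out_size, msg) instead would let the contents of msg
drive the formatting, which is what -Wformat-security warns about.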

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210413153429.3029377-1-yhs@fb.com
Yonghong Song [Tue, 13 Apr 2021 15:34:24 +0000 (08:34 -0700)]
selftests/bpf: Fix test_cpp compilation failure with clang

With clang compiler:
  make -j60 LLVM=1 LLVM_IAS=1  <=== compile kernel
  make -j60 -C tools/testing/selftests/bpf LLVM=1 LLVM_IAS=1
the test_cpp build failed with:
  warning: treating 'c-header' input as 'c++-header' when in C++ mode, this behavior is deprecated [-Wdeprecated]
  clang-13: error: cannot specify -o when generating multiple output files

The test_cpp compilation command looks like:
  clang++ -g -Og -rdynamic -Wall -I<...> ... \
  -Dbpf_prog_load=bpf_prog_test_load -Dbpf_load_program=bpf_test_load_program \
  test_cpp.cpp <...>/test_core_extern.skel.h <...>/libbpf.a <...>/test_stub.o \
  -lcap -lelf -lz -lrt -lpthread -o <...>/test_cpp

The clang++ compiler complains about the header file in the command line
and also fails the compilation because of it.
Let us remove the header file from the command line, where it was not
intended anyway; this fixes the compilation problem.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210413153424.3028986-1-yhs@fb.com
Yonghong Song [Tue, 13 Apr 2021 15:34:19 +0000 (08:34 -0700)]
tools: Allow proper CC/CXX/... override with LLVM=1 in Makefile.include

selftests/bpf/Makefile includes tools/scripts/Makefile.include.
With the following command
  make -j60 LLVM=1 LLVM_IAS=1  <=== compile kernel
  make -j60 -C tools/testing/selftests/bpf LLVM=1 LLVM_IAS=1 V=1
some files are still compiled with gcc. This patch
fixes that case: if CC/AR/LD/CXX/STRIP is allowed to be
overridden, it will be set to clang/llvm-ar/..., instead of the
gcc binaries. The definition of CC_NO_CLANG is also relocated
to the place after the above CC is defined.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210413153419.3028165-1-yhs@fb.com
Yonghong Song [Tue, 13 Apr 2021 15:34:13 +0000 (08:34 -0700)]
selftests: Set CC to clang in lib.mk if LLVM is set

selftests/bpf/Makefile includes lib.mk. With the following command
  make -j60 LLVM=1 LLVM_IAS=1  <=== compile kernel
  make -j60 -C tools/testing/selftests/bpf LLVM=1 LLVM_IAS=1 V=1
some files are still compiled with gcc. This patch
fixes the lib.mk issue of setting CC to gcc in all cases.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210413153413.3027426-1-yhs@fb.com
Alexei Starovoitov [Thu, 15 Apr 2021 14:18:17 +0000 (07:18 -0700)]
libbpf: Remove unused field.

relo->processed is set, but not used. Remove it.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210415141817.53136-1-alexei.starovoitov@gmail.com
zuoqilin [Wed, 14 Apr 2021 14:16:39 +0000 (22:16 +0800)]
tools/testing: Remove unused variable

Remove unused variable "ret2".

Signed-off-by: zuoqilin <zuoqilin@yulong.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210414141639.1446-1-zuoqilin1@163.com
Florent Revest [Wed, 14 Apr 2021 15:56:32 +0000 (17:56 +0200)]
selftests/bpf: Fix the ASSERT_ERR_PTR macro

It is just missing a ';'. This macro is not used by any test yet.

Fixes: 22ba36351631 ("selftests/bpf: Move and extend ASSERT_xxx() testing macros")
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210414155632.737866-1-revest@chromium.org
Toke Høiland-Jørgensen [Tue, 13 Apr 2021 09:16:07 +0000 (11:16 +0200)]
selftests/bpf: Add tests for target information in bpf_link info queries

Extend the fexit_bpf2bpf test to check that the info for the bpf_link
returned by the kernel matches the expected values.

While we're updating the test, change existing uses of CHECK() to use the
much easier to read ASSERT_*() macros.

v2:
- Convert last CHECK() call and get rid of 'duration' var
- Split ASSERT_OK_PTR() checks to two separate if statements

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210413091607.58945-2-toke@redhat.com
Toke Høiland-Jørgensen [Tue, 13 Apr 2021 09:16:06 +0000 (11:16 +0200)]
bpf: Return target info when a tracing bpf_link is queried

There is currently no way to discover the target of a tracing program
attachment after the fact. Add this information to bpf_link_info and return
it when querying the bpf_link fd.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210413091607.58945-1-toke@redhat.com
Ilya Leoshkevich [Tue, 13 Apr 2021 19:00:43 +0000 (21:00 +0200)]
bpf: Generate BTF_KIND_FLOAT when linking vmlinux

pahole v1.21 supports the --btf_gen_floats flag, which makes it
generate the information about the floating-point types [1].

Adjust link-vmlinux.sh to pass this flag to pahole in case it's
supported, which is determined using a simple version check.

[1] https://lore.kernel.org/dwarves/YHRiXNX1JUF2Az0A@kernel.org/

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210413190043.21918-1-iii@linux.ibm.com
Pedro Tammela [Mon, 12 Apr 2021 19:24:32 +0000 (16:24 -0300)]
libbpf: Clarify flags in ringbuf helpers

In 'bpf_ringbuf_reserve()' we require the flag to be '0' at the moment.

For 'bpf_ringbuf_{discard,submit,output}' a flag of '0' might send a
notification to the process if needed.

Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210412192434.944343-1-pctammela@mojatatu.com
Cong Wang [Thu, 8 Apr 2021 03:05:56 +0000 (20:05 -0700)]
sock_map: Fix a potential use-after-free in sock_map_close()

The last refcnt of the psock can be gone right after
sock_map_remove_links(), so sk_psock_stop() could trigger a UAF.
The reason why I placed sk_psock_stop() there is to avoid RCU read
critical section, and more importantly, some callee of
sock_map_remove_links() is supposed to be called with RCU read lock,
we can not simply get rid of RCU read lock here. Therefore, the only
choice we have is to grab an additional refcnt with sk_psock_get()
and put it back after sk_psock_stop().
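
The ordering of the fix can be modelled with a toy reference count. All
names below are hypothetical and the "free" is only a flag, so this is a
sketch of the get/stop/put pattern rather than of the sock_map code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy object: `freed` is set where the real code would free memory. */
struct sketch_psock {
	int refcnt;
	bool stopped;
	bool freed;
};

/* Take a reference, failing if the object is already on its way out. */
static bool sketch_get(struct sketch_psock *p)
{
	if (p->refcnt == 0)
		return false;
	p->refcnt++;
	return true;
}

/* Drop a reference; the last put "frees" the object. */
static void sketch_put(struct sketch_psock *p)
{
	if (--p->refcnt == 0)
		p->freed = true;
}

/* Stands in for the step that may drop what was the last reference. */
static void sketch_remove_links(struct sketch_psock *p)
{
	sketch_put(p);
}

/* Must only run while the object is still alive. */
static void sketch_stop(struct sketch_psock *p)
{
	assert(!p->freed);
	p->stopped = true;
}

static void sketch_close(struct sketch_psock *p)
{
	if (!sketch_get(p)) /* extra reference keeps `p` alive below */
		return;
	sketch_remove_links(p);
	sketch_stop(p); /* safe: our own reference is still held */
	sketch_put(p);  /* now the object may go away */
}
```

Without the extra sketch_get(), the assertion in sketch_stop() would fire
whenever sketch_remove_links() dropped the final reference.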

Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
Reported-by: syzbot+7b6548ae483d6f4c64ae@syzkaller.appspotmail.com
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210408030556.45134-1-xiyou.wangcong@gmail.com
Cong Wang [Wed, 7 Apr 2021 03:21:11 +0000 (20:21 -0700)]
skmsg: Pass psock pointer to ->psock_update_sk_prot()

Using sk_psock() to retrieve psock pointer from sock requires
RCU read lock, but we already get psock pointer before calling
->psock_update_sk_prot() in both cases, so we can just pass it
without bothering sk_psock().

Fixes: 8a59f9d1e3d4 ("sock: Introduce sk->sk_prot->psock_update_sk_prot()")
Reported-by: syzbot+320a3bc8d80f478c37e4@syzkaller.appspotmail.com
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: syzbot+320a3bc8d80f478c37e4@syzkaller.appspotmail.com
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210407032111.33398-1-xiyou.wangcong@gmail.com
Daniel Borkmann [Mon, 12 Apr 2021 15:19:00 +0000 (17:19 +0200)]
bpf: Sync bpf headers in tooling infrastucture

Synchronize tools/include/uapi/linux/bpf.h which was missing changes
from various commits:

  - f3c45326ee71 ("bpf: Document PROG_TEST_RUN limitations")
  - e5e35e754c28 ("bpf: BPF-helper for MTU checking add length input")

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Joe Stringer [Sat, 10 Apr 2021 17:45:48 +0000 (10:45 -0700)]
bpf: Document PROG_TEST_RUN limitations

Per net/bpf/test_run.c, particular prog types have additional
restrictions around the parameters that can be provided, so document
these in the header.

I didn't bother documenting the limitation on duration for raw
tracepoints since that's an output parameter anyway.

Tested with ./tools/testing/selftests/bpf/test_doc_build.sh.

Suggested-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Lorenz Bauer <lmb@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210410174549.816482-1-joe@cilium.io
Andrii Nakryiko [Fri, 9 Apr 2021 06:47:06 +0000 (23:47 -0700)]
Merge branch 'bpf/selftests: page size fixes'

Yauheni Kaliuta says:

====================

A set of fixes for selftests to make them work on systems with PAGE_SIZE > 4K
+ cleanup (version) and ringbuf_multi extension.
---
v3->v4:
- zero initialize BPF programs' static variables;
- add bpf_map__inner_map to libbpf.map in alphabetical order;
- add bpf_map__set_inner_map_fd test to ringbuf_multi;

v2->v3:
 - reorder: move version removing patch first to keep main patches in
   one group;
 - rename "selftests/bpf: pass page size from userspace in sockopt_sk"
   as suggested;
 - convert sockopt_sk test to use ASSERT macros;
 - set page size from userspace
 - split patches to pairs userspace/bpf. It's easier to check that
   every conversion works as expected;

v1->v2:

- add missed 'selftests/bpf: test_progs/sockopt_sk: Convert to use BPF skeleton'
====================

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Yauheni Kaliuta [Thu, 8 Apr 2021 06:13:10 +0000 (09:13 +0300)]
selftests/bpf: ringbuf_multi: Test bpf_map__set_inner_map_fd

Test map__set_inner_map_fd() interaction with map-in-map
initialization. Use hashmap of maps just to make it different to
existing array of maps.

Signed-off-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210408061310.95877-9-yauheni.kaliuta@redhat.com
Yauheni Kaliuta [Thu, 8 Apr 2021 06:13:09 +0000 (09:13 +0300)]
selftests/bpf: ringbuf_multi: Use runtime page size

Set bpf table sizes dynamically according to the runtime page size
value.

Do not switch to ASSERT macros, keep CHECK, for consistency with the
rest of the test. Can be a separate cleanup patch.

Signed-off-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210408061310.95877-8-yauheni.kaliuta@redhat.com
Andrii Nakryiko [Thu, 8 Apr 2021 06:13:08 +0000 (09:13 +0300)]
libbpf: Add bpf_map__inner_map API

The API gives access to the inner map for map-in-map types (array or
hash of maps). It will be used to dynamically set max_entries in it.

Signed-off-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210408061310.95877-7-yauheni.kaliuta@redhat.com
Yauheni Kaliuta [Thu, 8 Apr 2021 06:13:07 +0000 (09:13 +0300)]
selftests/bpf: ringbuf: Use runtime page size

Replace hardcoded 4096 with runtime value in the userspace part of
the test and set bpf table sizes dynamically according to the value.
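
On the userspace side, replacing the hardcoded value amounts to something
like the sketch below. The helper names are hypothetical, and sysconf() is
used here as one standard way to obtain the page size; this is not
necessarily the call the test itself makes.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Query the page size at run time instead of assuming 4096. */
static size_t sketch_page_size(void)
{
	long sz = sysconf(_SC_PAGESIZE);

	assert(sz > 0);
	return (size_t)sz;
}

/* Size in bytes of a buffer that spans `pages` whole pages. */
static size_t sketch_buf_bytes(size_t pages)
{
	return pages * sketch_page_size();
}
```

A value computed this way can then be used to size the maps before the BPF
object is loaded.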

Do not switch to ASSERT macros, keep CHECK, for consistency with the
rest of the test. Can be a separate cleanup patch.

Signed-off-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210408061310.95877-6-yauheni.kaliuta@redhat.com
Yauheni Kaliuta [Thu, 8 Apr 2021 06:13:06 +0000 (09:13 +0300)]
selftests/bpf: mmap: Use runtime page size

Replace hardcoded 4096 with runtime value in the userspace part of
the test and set bpf table sizes dynamically according to the value.

Do not switch to ASSERT macros, keep CHECK, for consistency with the
rest of the test. Can be a separate cleanup patch.

Signed-off-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210408061310.95877-5-yauheni.kaliuta@redhat.com
Yauheni Kaliuta [Thu, 8 Apr 2021 06:13:05 +0000 (09:13 +0300)]
selftests/bpf: Pass page size from userspace in map_ptr

Use ASSERT to check result but keep CHECK where format was used to
report error.

Use bpf_map__set_max_entries() to set map size dynamically from
userspace according to page size.

Zero-initialize the variable in bpf prog, otherwise it will cause
problems on some versions of Clang.

Signed-off-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210408061310.95877-4-yauheni.kaliuta@redhat.com
Yauheni Kaliuta [Thu, 8 Apr 2021 06:13:04 +0000 (09:13 +0300)]
selftests/bpf: Pass page size from userspace in sockopt_sk

Since there is no convenient way for bpf program to get PAGE_SIZE
from inside of the kernel, pass the value from userspace.

Zero-initialize the variable in bpf prog, otherwise it will cause
problems on some versions of Clang.

Signed-off-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210408061310.95877-3-yauheni.kaliuta@redhat.com
Yauheni Kaliuta [Thu, 8 Apr 2021 06:13:03 +0000 (09:13 +0300)]
selftests/bpf: test_progs/sockopt_sk: Convert to use BPF skeleton

Switch the test to use BPF skeleton to save some boilerplate and
make it easy to access bpf program bss segment.

The latter will be used to pass PAGE_SIZE from userspace since there
is no convenient way for bpf program to get it from inside of the
kernel.

Signed-off-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210408061310.95877-2-yauheni.kaliuta@redhat.com
Yauheni Kaliuta [Thu, 8 Apr 2021 06:13:02 +0000 (09:13 +0300)]
selftests/bpf: test_progs/sockopt_sk: Remove version

As pointed out by Andrii Nakryiko, _version is useless now, remove it.

Signed-off-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210408061310.95877-1-yauheni.kaliuta@redhat.com
Muhammad Usama Anjum [Mon, 5 Apr 2021 19:49:04 +0000 (00:49 +0500)]
bpf, inode: Remove second initialization of the bpf_preload_lock

bpf_preload_lock is already defined with DEFINE_MUTEX(). There is no
need to initialize it again. Remove the extraneous initialization.

Signed-off-by: Muhammad Usama Anjum <musamaanjum@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210405194904.GA148013@LEGION
Cong Wang [Sat, 3 Apr 2021 05:27:15 +0000 (22:27 -0700)]
bpf, udp: Remove some pointless comments

These comments in udp_bpf_update_proto() are copied from the
original TCP code and apparently do not apply to UDP. Just
remove them.

Reported-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210403052715.13854-1-xiyou.wangcong@gmail.com
Hengqi Chen [Mon, 5 Apr 2021 04:01:19 +0000 (12:01 +0800)]
libbpf: Fix KERNEL_VERSION macro

Add missing ')' for KERNEL_VERSION macro.
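
For reference, a version-code macro of this kind with balanced parentheses
could look like the sketch below. SKETCH_KERNEL_VERSION is a hypothetical
name, and both the bit layout and the clamping of the third component to
255 are assumptions based on the usual kernel encoding, not a copy of the
libbpf definition.

```c
#include <assert.h>

/* Hypothetical macro: major in bits 16 and up, minor in bits 8-15,
 * patch level in the low byte, clamped to 255. */
#define SKETCH_KERNEL_VERSION(a, b, c) \
	(((a) << 16) + ((b) << 8) + ((c) > 255 ? 255 : (c)))

/* Function form of the same computation, handy for run-time values. */
static unsigned int sketch_version_code(unsigned int a, unsigned int b,
					unsigned int c)
{
	assert(b <= 255);
	return SKETCH_KERNEL_VERSION(a, b, c);
}
```

With a closing parenthesis missing, any use of the macro inside a larger
expression fails to compile, which is why the bug goes unnoticed until the
macro is first used.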

Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210405040119.802188-1-hengqi.chen@gmail.com
Martin KaFai Lau [Sat, 3 Apr 2021 00:29:21 +0000 (17:29 -0700)]
bpf: selftests: Specify CONFIG_DYNAMIC_FTRACE in the testing config

The tracing test and the recent kfunc call test require
CONFIG_DYNAMIC_FTRACE.  This patch adds it to the config file.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210403002921.3419721-1-kafai@fb.com
3 years agolibbpf: Remove redundant semi-colon
Yang Yingliang [Fri, 2 Apr 2021 01:26:34 +0000 (09:26 +0800)]
libbpf: Remove redundant semi-colon

Remove redundant semi-colon in finalize_btf_ext().

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210402012634.1965453-1-yangyingliang@huawei.com
3 years agobpf: Remove repeated struct btf_type declaration
Wan Jiabing [Thu, 1 Apr 2021 07:20:37 +0000 (15:20 +0800)]
bpf: Remove repeated struct btf_type declaration

struct btf_type is declared twice. The first declaration is at line 35,
so the later one is not needed. Remove the duplicate.

Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210401072037.995849-1-wanjiabing@vivo.com
3 years agobpf, cgroup: Delete repeated struct bpf_prog declaration
Wan Jiabing [Thu, 1 Apr 2021 06:46:37 +0000 (14:46 +0800)]
bpf, cgroup: Delete repeated struct bpf_prog declaration

struct bpf_prog is declared twice. The declaration at line 18 does not
depend on the macro, so the later one is not needed. Remove the
duplicate.

Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210401064637.993327-1-wanjiabing@vivo.com
3 years agobpf: Remove unused parameter from ___bpf_prog_run
He Fengqing [Wed, 31 Mar 2021 07:51:35 +0000 (07:51 +0000)]
bpf: Remove unused parameter from ___bpf_prog_run

The 'stack' parameter has not been used in ___bpf_prog_run() since
f696b8f471ec ("bpf: split bpf core interpreter"), because the base
address has already been set in the FP reg. Remove it.

Signed-off-by: He Fengqing <hefengqing@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210331075135.3850782-1-hefengqing@huawei.com
3 years agobpf, selftests: test_maps generating unrecognized data section
John Fastabend [Thu, 1 Apr 2021 22:25:56 +0000 (15:25 -0700)]
bpf, selftests: test_maps generating unrecognized data section

With a relatively recent clang master branch, test_maps skips a section:

 libbpf: elf: skipping unrecognized data section(5) .rodata.str1.1

The cause is some pointless strings from bpf_printk() calls in the BPF
program loaded during testing. After the prints were removed to fix the
above error, Daniel pointed out that the program is a bit pointless and
could simply be an empty program returning SK_PASS.

Here we do just that and simply return SK_PASS. This program is used by
the test_maps selftests to test insertion and removal of a program in
the sockmap and sockhash maps. It is not testing the actual
functionality of the TCP sockmap programs; those are tested by
test_sockmap. So we should not lose any test coverage, and the above
warning is fixed. The original test was added before test_sockmap
existed and has been copied around ever since. Clean it up now.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/161731595664.74613.1603087410166945302.stgit@john-XPS-13-9370
3 years agotcp: reorder tcp_congestion_ops for better cache locality
Eric Dumazet [Fri, 2 Apr 2021 18:10:37 +0000 (11:10 -0700)]
tcp: reorder tcp_congestion_ops for better cache locality

Group all the often used fields in the first cache line,
to reduce cache line misses.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: reorganize fields in netns_mib
Eric Dumazet [Fri, 2 Apr 2021 18:07:46 +0000 (11:07 -0700)]
net: reorganize fields in netns_mib

Order fields to increase locality for most used protocols.

udplite and icmp are moved to the end.

Same for proc_net_devsnmp6 which is not used in fast path.

This potentially saves one cache line miss for typical TCP/UDP over IPv4/IPv6.
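
As a rough illustration of the idea (not the actual netns_mib layout),
the sketch below places frequently used fields first so that they share
a cache line, and rarely used ones last. The struct, its field names,
and the 64-byte cache line size are all assumptions made for the
example.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHE_LINE 64 /* assumed cache line size in bytes */

/* Hypothetical stats table; field names are illustrative only. */
struct demo_mib {
	/* hot: touched for typical TCP/UDP over IPv4/IPv6 traffic */
	uint64_t ipv4;
	uint64_t ipv6;
	uint64_t tcp;
	uint64_t tcp_ext;
	uint64_t udp4;
	uint64_t udp6;
	uint64_t net;
	uint64_t xfrm;
	/* cold: rarely used, so placed at the end */
	uint64_t udplite4;
	uint64_t udplite6;
	uint64_t icmp;
	uint64_t icmp6;
};

/* Index of the cache line (relative to the struct start) holding a field. */
static size_t demo_cache_line_of(size_t offset)
{
	return offset / DEMO_CACHE_LINE;
}
```

Under these assumptions, demo_cache_line_of(offsetof(struct demo_mib,
udp6)) and the other hot fields land in line 0, while the cold fields
land in line 1.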

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonfc: pn533: prevent potential memory corruption
Dan Carpenter [Fri, 2 Apr 2021 11:44:42 +0000 (14:44 +0300)]
nfc: pn533: prevent potential memory corruption

If the "type_a->nfcid_len" is too large then it would lead to memory
corruption in pn533_target_found_type_a() when we do:

memcpy(nfc_tgt->nfcid1, tgt_type_a->nfcid_data, nfc_tgt->nfcid1_len);
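
A minimal userspace sketch of the kind of guard that prevents this
class of bug: reject a device-reported length that exceeds the
destination array before copying. The struct, the function name and the
size constant are hypothetical and are not taken from the pn533 driver.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NFCID1_MAXSIZE 10 /* hypothetical destination capacity */

struct demo_target {
	uint8_t nfcid1[DEMO_NFCID1_MAXSIZE];
	uint8_t nfcid1_len;
};

/*
 * Copy the ID only if the reported length fits in the destination.
 * Returns false and leaves dst untouched when the length is too large.
 */
static bool demo_copy_nfcid(struct demo_target *dst, const uint8_t *data,
			    uint8_t len)
{
	if (len > sizeof(dst->nfcid1))
		return false;

	dst->nfcid1_len = len;
	memcpy(dst->nfcid1, data, len);
	return true;
}
```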

Fixes: c3b1e1e8a76f ("NFC: Export NFCID1 from pn533")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoMerge branch 'dpaa2-rx-copybreak'
David S. Miller [Fri, 2 Apr 2021 21:25:47 +0000 (14:25 -0700)]
Merge branch 'dpaa2-rx-copybreak'

Ioana Ciornei says:

====================
dpaa2-eth: add rx copybreak support

DMA unmapping, allocating a new buffer and DMA mapping it back on the
refill path is really not that efficient. Proper buffer recycling (page
pool, flipping the page and using the other half) cannot be done for
DPAA2 since it's not a ring based controller but it rather deals with
multiple queues which all get their buffers from the same buffer pool on
Rx.

To circumvent these limitations, add support for Rx copybreak in
dpaa2-eth.

Below you can find a summary of the tests that were run to end up
with the default rx copybreak value of 512.
A bit about the setup - a LS2088A SoC, 8 x Cortex A72 @ 1.8GHz, IPfwd
zero loss test @ 20Gbit/s throughput.  I tested multiple frame sizes to
get an idea of where the break-even point is.

Here are 2 sets of results: (1) is the baseline and (2) is just
allocating a new skb for all frame sizes received (as if the copybreak
was equal to the MTU). All numbers are in Mpps.

Frame size (bytes)     64    128    256    512    640    768    896

(1)                  3.23   3.23   3.24   3.21    3.1   2.76   2.71
(2)                  3.95   3.88   3.79   3.62    3.3   3.02   2.65

It seems that even for 512-byte frames it's comfortably better to
allocate a new skb. After that, we see diminishing returns or even a
regression.

Changes in v2:
 - properly marked dpaa2_eth_copybreak as static
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agodpaa2-eth: export the rx copybreak value as an ethtool tunable
Ioana Ciornei [Fri, 2 Apr 2021 09:55:32 +0000 (12:55 +0300)]
dpaa2-eth: export the rx copybreak value as an ethtool tunable

It's useful, especially for debugging purposes, to have the Rx copybreak
value changeable at runtime. Export it as an ethtool tunable.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agodpaa2-eth: add rx copybreak support
Ioana Ciornei [Fri, 2 Apr 2021 09:55:31 +0000 (12:55 +0300)]
dpaa2-eth: add rx copybreak support

DMA unmapping, allocating a new buffer and DMA mapping it back on the
refill path is really not that efficient. Proper buffer recycling (page
pool, flipping the page and using the other half) cannot be done for
DPAA2 since it's not a ring based controller but it rather deals with
multiple queues which all get their buffers from the same buffer pool on
Rx.

To circumvent these limitations, add support for Rx copybreak. For small
sized packets instead of creating a skb around the buffer in which the
frame was received, allocate a new sk buffer altogether, copy the
contents of the frame and release the initial page back into the buffer
pool.
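
The copybreak decision described above can be sketched in plain
userspace C as follows. This is only an illustration under stated
assumptions: the names are hypothetical, malloc() stands in for skb
allocation, the 512-byte threshold is the default quoted in the cover
letter, and whether the driver compares with "<=" or "<" is not stated
here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_RX_COPYBREAK 512 /* default value quoted in the cover letter */

struct demo_rx_result {
	unsigned char *data; /* buffer handed up the stack */
	bool recycled;       /* true if the hardware buffer can be reused */
};

/*
 * Small frames are copied into a freshly allocated buffer so that the
 * original hardware buffer can go straight back to the buffer pool.
 * Large frames (or allocation failures) keep using the original buffer.
 */
static struct demo_rx_result demo_rx(unsigned char *hw_buf, size_t len)
{
	struct demo_rx_result r = { .data = hw_buf, .recycled = false };
	unsigned char *copy;

	if (len > DEMO_RX_COPYBREAK)
		return r;

	copy = malloc(len ? len : 1);
	if (!copy)
		return r;

	memcpy(copy, hw_buf, len);
	r.data = copy;
	r.recycled = true;
	return r;
}
```

The caller owns the copied buffer when recycled is true and is expected
to free() it.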

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agodpaa2-eth: rename dpaa2_eth_xdp_release_buf into dpaa2_eth_recycle_buf
Ioana Ciornei [Fri, 2 Apr 2021 09:55:30 +0000 (12:55 +0300)]
dpaa2-eth: rename dpaa2_eth_xdp_release_buf into dpaa2_eth_recycle_buf

Rename the dpaa2_eth_xdp_release_buf function to dpaa2_eth_recycle_buf
since in the next patches we'll be using the same recycle mechanism for
the normal stack path in addition to XDP_DROP.

Also, rename the array which holds the buffers to be recycled so that it
does not have any reference to XDP.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoMerge branch 'mptcp-misc'
David S. Miller [Fri, 2 Apr 2021 21:21:51 +0000 (14:21 -0700)]
Merge branch 'mptcp-misc'

Mat Martineau says:

====================
MPTCP: Miscellaneous changes

Here is a collection of patches from the MPTCP tree:

Patches 1 and 2 add some helpful MIB counters for connection
information.

Patch 3 cleans up some unnecessary checks.

Patch 4 is a new feature, support for the MP_TCPRST option. This option
is used when resetting one subflow within a MPTCP connection, and
provides a reason code that the recipient can use when deciding how to
adapt to the lost subflow.

Patches 5-7 update the existing MPTCP selftests to improve timeout
handling and to share better information when tests fail.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoselftests: mptcp: dump more info on mpjoin errors
Matthieu Baerts [Thu, 1 Apr 2021 23:19:47 +0000 (16:19 -0700)]
selftests: mptcp: dump more info on mpjoin errors

Very occasionally, MPTCP selftests fail. Yeah, I saw that at least once!

Here we provide more details in case of errors with mptcp_join.sh script
like it was done with mptcp_connect.sh, see
commit 767389c8dd55 ("selftests: mptcp: dump more info on errors")

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoselftests: mptcp: init nstat history
Matthieu Baerts [Thu, 1 Apr 2021 23:19:46 +0000 (16:19 -0700)]
selftests: mptcp: init nstat history

Initialize the nstat history so that counters are not impacted by
packets sent between sub-tests.

Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoselftests: mptcp: launch mptcp_connect with timeout
Matthieu Baerts [Thu, 1 Apr 2021 23:19:45 +0000 (16:19 -0700)]
selftests: mptcp: launch mptcp_connect with timeout

'mptcp_connect' already has a timeout for poll() but in some cases, it
is not enough.

With the "timeout" tool, we force the command to fail if it doesn't
finish on time. Thanks to that, the script continues and displays
details about the current state before marking the test as failed.
Displaying this state is very important for understanding the issue.
It is better to have our CI report the issue than just "the test
hung".

Note that in mptcp_connect.sh, we were using a long timeout to validate
the fact we cannot create a socket if a sysctl is set. We don't need
this timeout.

In diag.sh, we want to send signals to mptcp_connect instances that have
been started in the netns. But we cannot send this signal to 'timeout'
otherwise that will stop the timeout and messages telling us SIGUSR1 has
been received will be printed. Instead of trying to find the right PID
and storing them in an array, we can simply use the output of
'ip netns pids' which is all the PIDs we want to send signal to.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/160
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agomptcp: add mptcp reset option support
Florian Westphal [Thu, 1 Apr 2021 23:19:44 +0000 (16:19 -0700)]
mptcp: add mptcp reset option support

The MPTCP reset option can carry an MPTCP-specific error code that
provides more information on the nature of a connection reset.

Reset option data received gets stored in the subflow context so it can
be sent to userspace via the 'subflow closed' netlink event.

When a subflow is closed, the desired error code that should be sent to
the peer is also placed in the subflow context structure.

If a reset is sent before subflow establishment could complete, e.g. on
HMAC failure during an MP_JOIN operation, the mptcp skb extension is
used to store the reset information.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agomptcp: remove unneeded check on first subflow
Paolo Abeni [Thu, 1 Apr 2021 23:19:43 +0000 (16:19 -0700)]
mptcp: remove unneeded check on first subflow

Currently we explicitly check for the first subflow being
NULL in a couple of places, even though no special action
is needed in that scenario.

Just drop the unneeded checks, to avoid confusion.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agomptcp: add active MPC mibs
Paolo Abeni [Thu, 1 Apr 2021 23:19:42 +0000 (16:19 -0700)]
mptcp: add active MPC mibs

We are not currently tracking the active MPTCP connection
attempts. Let's add the related counters.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agomptcp: add mib for token creation fallback
Paolo Abeni [Thu, 1 Apr 2021 23:19:41 +0000 (16:19 -0700)]
mptcp: add mib for token creation fallback

If the MPTCP protocol is unable to create a new token,
the socket falls back to plain TCP. Keep track
of such events via a specific MIB.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoMerge branch 'ionic-ptp'
David S. Miller [Fri, 2 Apr 2021 21:18:33 +0000 (14:18 -0700)]
Merge branch 'ionic-ptp'

Shannon Nelson says:

====================
ionic: add PTP and hw clock support

This patchset adds support for accessing the DSC hardware clock and
for offloading PTP timestamping.

Tx packet timestamping happens through a separate Tx queue set up with
expanded completion descriptors that can report the timestamp.

Rx timestamping can happen either on all queues, or on a separate
timestamping queue when specific filtering is requested.  Again, the
timestamps are reported with the expanded completion descriptors.

The timestamping offload ability is advertised but not enabled until an
OS service asks for it.  At that time the driver's queues are reconfigured
to use the different completion descriptors and the private processing
queues as needed.

Reading the raw clock value comes through a new pair of values in the
device info registers in BAR0.  These high and low values are interpreted
with help from new clock mask, mult, and shift values in the device
identity information.
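
A minimal sketch of how high/low register words might be combined and
scaled with mask, mult and shift values, in the style of the usual
cycle-counter arithmetic. The function names are hypothetical, and the
exact formula used by the driver is an assumption, not something stated
in this cover letter.

```c
#include <stdint.h>

/* Combine the two 32-bit register words and apply the clock mask. */
static uint64_t demo_read_cycles(uint32_t hi, uint32_t lo, uint64_t mask)
{
	return ((((uint64_t)hi) << 32) | lo) & mask;
}

/*
 * Scale a cycle count to nanoseconds as (cycles * mult) >> shift.
 * The multiplication can overflow 64 bits for large inputs, so this
 * form is only suitable for small cycle counts or deltas; shift must
 * be less than 64.
 */
static uint64_t demo_cycles_to_ns(uint64_t cycles, uint32_t mult,
				  uint32_t shift)
{
	return (cycles * mult) >> shift;
}
```

Real code would also have to cope with the low word wrapping between
the two register reads, which this sketch ignores.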

First we add the ability to detect new queue features, then the handling
of the new descriptor sizes.  After adding the new interface structures,
we start adding the support code, saving the advertising to the stack
for last.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoionic: advertise support for hardware timestamps
Shannon Nelson [Thu, 1 Apr 2021 17:56:10 +0000 (10:56 -0700)]
ionic: advertise support for hardware timestamps

Let the network stack know we've got support for timestamping
the packets.

Signed-off-by: Allen Hubbe <allenbh@pensando.io>
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoionic: ethtool ptp stats
Shannon Nelson [Thu, 1 Apr 2021 17:56:09 +0000 (10:56 -0700)]
ionic: ethtool ptp stats

Add the new hwstamp stats to our ethtool stats output.

Signed-off-by: Allen Hubbe <allenbh@pensando.io>
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoionic: add ethtool support for PTP
Shannon Nelson [Thu, 1 Apr 2021 17:56:08 +0000 (10:56 -0700)]
ionic: add ethtool support for PTP

Add the get_ts_info() callback for ethtool support of
timestamping information.

Signed-off-by: Allen Hubbe <allenbh@pensando.io>
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoionic: add and enable tx and rx timestamp handling
Shannon Nelson [Thu, 1 Apr 2021 17:56:07 +0000 (10:56 -0700)]
ionic: add and enable tx and rx timestamp handling

The Tx and Rx timestamped packets are handled through separate
queues.  Here we set them up, service them, and tear them down
along with the normal Tx and Rx queues.

Signed-off-by: Allen Hubbe <allenbh@pensando.io>
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoionic: set up hw timestamp queues
Shannon Nelson [Thu, 1 Apr 2021 17:56:06 +0000 (10:56 -0700)]
ionic: set up hw timestamp queues

We do hardware timestamping through a separate Tx queue,
and optionally through a separate Rx queue.  These queues
are allocated, freed, and tracked separately from the basic
queue arrays.

Signed-off-by: Allen Hubbe <allenbh@pensando.io>
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoionic: add rx filtering for hw timestamp steering
Shannon Nelson [Thu, 1 Apr 2021 17:56:05 +0000 (10:56 -0700)]
ionic: add rx filtering for hw timestamp steering

Add handling of the new Rx packet classification filter type.
This simple bit of classification allows for steering packets
to a separate Rx queue for processing.

Signed-off-by: Allen Hubbe <allenbh@pensando.io>
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoionic: link in the new hw timestamp code
Shannon Nelson [Thu, 1 Apr 2021 17:56:04 +0000 (10:56 -0700)]
ionic: link in the new hw timestamp code

These are changes to compile and link the new code, but no
new feature support is available or advertised yet.

Signed-off-by: Allen Hubbe <allenbh@pensando.io>
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoionic: add hw timestamp support files
Shannon Nelson [Thu, 1 Apr 2021 17:56:03 +0000 (10:56 -0700)]
ionic: add hw timestamp support files

This adds the file of code for supporting Tx and Rx hardware
timestamps and the raw clock interface, but does not yet link
it in for compiling or use.

Signed-off-by: Allen Hubbe <allenbh@pensando.io>
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoionic: split adminq post and wait calls
Shannon Nelson [Thu, 1 Apr 2021 17:56:02 +0000 (10:56 -0700)]
ionic: split adminq post and wait calls

Split the wait part out of adminq_post_wait() into a separate
function so that a caller can have finer-grained control over
the sequencing of operations and locking.

Signed-off-by: Allen Hubbe <allenbh@pensando.io>
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoionic: add hw timestamp structs to interface
Shannon Nelson [Thu, 1 Apr 2021 17:56:01 +0000 (10:56 -0700)]
ionic: add hw timestamp structs to interface

The interface for hardware timestamping includes a new FW
request, device identity fields, Tx and Rx queue feature bits, a
new Rx filter type, the beginnings of Rx packet classifications,
and hardware timestamp registers.

If the IONIC_ETH_HW_TIMESTAMP bit is shown in the
ionic_lif_config features bit string, then we have support
for the hw clock registers.  If the IONIC_RXQ_F_HWSTAMP and
IONIC_TXQ_F_HWSTAMP features are shown in the ionic_q_identity
features, then the queues can support HW timestamps on packets.

Signed-off-by: Allen Hubbe <allenbh@pensando.io>
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoionic: add handling of larger descriptors
Shannon Nelson [Thu, 1 Apr 2021 17:56:00 +0000 (10:56 -0700)]
ionic: add handling of larger descriptors

In preparation for hardware timestamping, we need to support
large Tx and Rx completion descriptors.  Here we add the new
queue feature ids and handling for the completion descriptor
sizes.

For now, we only add support for the 2x-sized Rx completion
descriptors in the general Rx queues, as we will be using them
for PTP Rx support. We don't have an immediate use for the
large descriptors in the general Tx queues yet; they will be
used in special Tx queues added in one of the next few
patches.

Signed-off-by: Allen Hubbe <allenbh@pensando.io>
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoionic: add new queue features to interface
Shannon Nelson [Thu, 1 Apr 2021 17:55:59 +0000 (10:55 -0700)]
ionic: add new queue features to interface

Add queue feature extensions to prepare for features that
can be queue specific, in addition to the general queue
features already defined.  While we're here, change the
existing feature ids from #defines to enum.

Signed-off-by: Allen Hubbe <allenbh@pensando.io>
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
David S. Miller [Fri, 2 Apr 2021 18:03:07 +0000 (11:03 -0700)]
Merge git://git./linux/kernel/git/bpf/bpf-next

Alexei Starovoitov says:

====================
pull-request: bpf-next 2021-04-01

The following pull-request contains BPF updates for your *net-next* tree.

We've added 68 non-merge commits during the last 7 day(s) which contain
a total of 70 files changed, 2944 insertions(+), 1139 deletions(-).

The main changes are:

1) UDP support for sockmap, from Cong.

2) Verifier merge conflict resolution fix, from Daniel.

3) xsk selftests enhancements, from Maciej.

4) Unstable helpers aka kernel func calling, from Martin.

5) Batched ops for LPM map, from Pedro.

6) Fix race in bpf_get_local_storage, from Yonghong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: usb: ax88179_178a: initialize local variables before use
Phillip Potter [Thu, 1 Apr 2021 22:36:07 +0000 (23:36 +0100)]
net: usb: ax88179_178a: initialize local variables before use

Use memset to initialize a local array in drivers/net/usb/ax88179_178a.c, and
also set local u16 and u32 variables to 0. This fixes a KMSAN-found uninit-value
bug reported by syzbot at:
https://syzkaller.appspot.com/bug?id=00371c73c72f72487c1d0bfe0cc9d00de339d5aa

Reported-by: syzbot+4993e4a0e237f1b53747@syzkaller.appspotmail.com
Signed-off-by: Phillip Potter <phil@philpotter.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: phy: broadcom: Add statistics for all Gigabit PHYs
Florian Fainelli [Thu, 1 Apr 2021 16:42:33 +0000 (09:42 -0700)]
net: phy: broadcom: Add statistics for all Gigabit PHYs

All Gigabit PHYs use the same register layout as far as fetching
statistics goes. Fast Ethernet PHYs do not all support statistics, and
the BCM54616S would require some switching between the copper and fiber
modes to fetch the appropriate statistics which is not supported yet.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: document a side effect of ip_local_reserved_ports
Otto Hollmann [Thu, 1 Apr 2021 15:57:05 +0000 (17:57 +0200)]
net: document a side effect of ip_local_reserved_ports

If there is overlap between ip_local_port_range and ip_local_reserved_ports with a huge reserved block, it will affect the probability of selecting ephemeral ports; see net/ipv4/inet_hashtables.c:723:

    int __inet_hash_connect(
    ...
            for (i = 0; i < remaining; i += 2, port += 2) {
                    if (unlikely(port >= high))
                            port -= remaining;
                    if (inet_is_local_reserved_port(net, port))
                            continue;

    E.g. if there is a reserved block of 10000 ports, the two ports right after this block will be 5000 times more likely to be selected than the others.
    If this was intended, we can/should add a note to the documentation as proposed in this commit; otherwise we should think about a different solution. One option could be a mapping table of contiguous port ranges. A second option could be letting the user modify the step (port += 2) in the above loop, e.g. via a new sysctl parameter.
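
The skew can be reproduced with a small userspace model of the loop
quoted above. This is a simplified sketch, not the kernel code: it
assumes a single reserved block, treats every non-reserved port as
free, and counts how many starting ports end up selecting a given
target. All names and the port numbers used in the note below are
hypothetical.

```c
#include <stdbool.h>

/* Stand-in for inet_is_local_reserved_port(): one reserved block. */
static bool demo_is_reserved(int port, int res_lo, int res_hi)
{
	return port >= res_lo && port <= res_hi;
}

/*
 * Walk from 'start' in steps of 2, wrapping at 'high', and return the
 * first port that is not reserved (or -1 if there is none).
 */
static int demo_first_candidate(int start, int low, int high,
				int res_lo, int res_hi)
{
	int remaining = high - low;
	int port = start;
	int i;

	for (i = 0; i < remaining; i += 2, port += 2) {
		if (port >= high)
			port -= remaining;
		if (demo_is_reserved(port, res_lo, res_hi))
			continue;
		return port;
	}
	return -1;
}

/* Number of starting ports in [low, high) that end up picking 'target'. */
static int demo_starts_selecting(int target, int low, int high,
				 int res_lo, int res_hi)
{
	int start, hits = 0;

	for (start = low; start < high; start++)
		if (demo_first_candidate(start, low, high,
					 res_lo, res_hi) == target)
			hits++;
	return hits;
}
```

With a range of [20000, 60000) and ports 30000-39999 reserved, ports
40000 and 40001 are each selected from 5001 starting points, while
40002 is selected from only one.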

Signed-off-by: Otto Hollmann <otto.hollmann@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agolan743x: remove redundant semi-colon
Yang Yingliang [Thu, 1 Apr 2021 14:20:15 +0000 (22:20 +0800)]
lan743x: remove redundant semi-colon

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: hns: Fix some typos
Lu Wei [Thu, 1 Apr 2021 09:27:01 +0000 (17:27 +0800)]
net: hns: Fix some typos

Fix some typos.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Lu Wei <luwei32@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: smc: Remove repeated struct declaration
Wan Jiabing [Thu, 1 Apr 2021 08:40:29 +0000 (16:40 +0800)]
net: smc: Remove repeated struct declaration

struct smc_clc_msg_local is declared twice. The first declaration is at
line 301, so the one below is not needed. Remove the duplicate.

Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoinclude: net: Remove repeated struct declaration
Wan Jiabing [Thu, 1 Apr 2021 07:08:22 +0000 (15:08 +0800)]
include: net: Remove repeated struct declaration

struct ctl_table_header is declared twice. The first declaration is
at line 46, so the one below is not needed. Remove the duplicate.

Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: stmmac: remove unnecessary pci_enable_msi() call
Wong Vee Khee [Thu, 1 Apr 2021 06:06:28 +0000 (14:06 +0800)]
net: stmmac: remove unnecessary pci_enable_msi() call

The commit d2a029bde37b ("stmmac: pci: add MSI support for Intel Quark
X1000") introduced a pci_enable_msi() call in stmmac_pci.c.

With the commit 58da0cfa6cf1 ("net: stmmac: create dwmac-intel.c to
contain all Intel platform"), the Intel Quark platform related code
has been moved to the newly created driver.

Remove this unnecessary pci_enable_msi() call, as no other devices
that use stmmac-pci need MSI to be enabled.

Signed-off-by: Wong Vee Khee <vee.khee.wong@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agostmmac: intel: use managed PCI function on probe and resume
Wong Vee Khee [Thu, 1 Apr 2021 06:02:50 +0000 (14:02 +0800)]
stmmac: intel: use managed PCI function on probe and resume

Update dwmac-intel to use managed function, i.e. pcim_enable_device().

This allows the devres framework to call the resource free functions for us.

Signed-off-by: Wong Vee Khee <vee.khee.wong@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: ipv6: Refactor in rt6_age_examine_exception
Xu Jia [Thu, 1 Apr 2021 03:22:23 +0000 (11:22 +0800)]
net: ipv6: Refactor in rt6_age_examine_exception

The logic in rt6_age_examine_exception is confusing. Refactor the
code to make it easier to follow.

Signed-off-by: Xu Jia <xujia39@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agotipc: fix unique bearer names sanity check
Hoang Le [Thu, 1 Apr 2021 02:30:48 +0000 (09:30 +0700)]
tipc: fix unique bearer names sanity check

When enabling a bearer by name, we don't check its name against
bearers in higher slots of the bearer list. As a result, the name
of an already enabled bearer may bypass the check.

To fix this, perform an extra check against all
existing bearers.
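
A minimal sketch of such a uniqueness check, assuming a fixed array of
bearer slots in which unused slots are NULL. The function name and the
representation of a bearer as a plain name string are hypothetical
simplifications.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if 'name' matches any occupied slot. Every slot is
 * examined, so a free slot at a low index cannot hide a bearer with
 * the same name stored at a higher index.
 */
static bool demo_bearer_name_in_use(const char *const slots[], size_t n,
				    const char *name)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (slots[i] && strcmp(slots[i], name) == 0)
			return true;
	return false;
}
```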

Fixes: cb30a63384bc9 ("tipc: refactor function tipc_enable_bearer()")
Cc: stable@vger.kernel.org
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoMerge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next...
David S. Miller [Thu, 1 Apr 2021 22:41:08 +0000 (15:41 -0700)]
Merge branch '100GbE' of git://git./linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
100GbE Intel Wired LAN Driver Updates 2021-03-31

This series contains updates to ice driver only.

Benita adds support for XPS.

Ani moves netdev registration to the end of probe to prevent use before
the interface is ready and moves up an error check to possibly avoid
an unneeded call. He also consolidates the VSI state and flag fields to
a single field.

Dan changes the segment where package information is pulled.

Paul S ensures correct ITR values are set when increasing ring size.

Paul G rewords a link misconfiguration message as this could be
expected.

Bruce removes setting an unnecessary AQ flag and corrects a memory
allocation call. Also fixes checkpatch issues for 'COMPLEX_MACRO'.

Qi aligns PTYPE bitmap naming by adding 'ptype' prefix to the bitmaps
missing it.

Brett removes limiting Rx queue mapping to RSS size as there is not a
dependency on this. He also refactors RSS configuration by introducing
individual functions for LUT and key configuration and by passing a
structure containing pertinent information instead of individual
arguments.

Tony corrects a comment block to follow netdev style.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoMerge branch 'sockmap: introduce BPF_SK_SKB_VERDICT and support UDP'
Alexei Starovoitov [Thu, 1 Apr 2021 17:56:15 +0000 (10:56 -0700)]
Merge branch 'sockmap: introduce BPF_SK_SKB_VERDICT and support UDP'

Cong Wang says:

====================

From: Cong Wang <cong.wang@bytedance.com>

We have thousands of services connected to a daemon on every host
via AF_UNIX dgram sockets. After they are moved into a VM, we have to
add a proxy to forward these communications from the VM to the host,
because rewriting thousands of them is not practical. This proxy uses an
AF_UNIX socket connected to the services and a UDP socket to connect to
the host. It is inefficient because data is copied between kernel
space and user space twice, and we cannot use splice(), which only
supports TCP. Therefore, we want to use sockmap to do the splicing
without going to user space at all (after the initial setup).

Currently sockmap only fully supports TCP. UDP is partially supported,
as UDP sockets can only be added to a sockmap. This patchset, as the second
part of the original large patchset, extends sockmap with:
1) cross-protocol support with BPF_SK_SKB_VERDICT; 2) full UDP support.

At a high level, ->read_sock() is required for each protocol to support
sockmap redirection, and in order to do sock proto update, a new ops
->psock_update_sk_prot() is introduced, which is also required. And the
BPF ->recvmsg() is also needed to replace the original ->recvmsg() to
retrieve skmsg. To make life easier, we have to get rid of lock_sock()
in sk_psock_handle_skb(), otherwise we would have to implement
->sendmsg_locked() on top of ->sendmsg(), which is ugly.

Please see each patch for more details.

To see the big picture, the original patchset is available here:
https://github.com/congwang/linux/tree/sockmap
This patchset is also available here:
https://github.com/congwang/linux/tree/sockmap2
---
v8: get rid of 'offset' in udp_read_sock()
    add checks for skb_verdict/stream_verdict conflict
    add two cleanup patches for sock_map_link()
    add a new test case

v7: use work_mutex to protect psock->work
    return err in udp_read_sock()
    add patch 6/13
    clean up test case

v6: get rid of sk_psock_zap_ingress()
    add rcu work patch

v5: use INDIRECT_CALL_2() for function pointers
    use ingress_lock to fix a race condition found by Jakub
    rename two helper functions

v4: get rid of lock_sock() in sk_psock_handle_skb()
    get rid of udp_sendmsg_locked()
    remove an empty line
    update cover letter

v3: export tcp/udp_update_proto()
    rename sk->sk_prot->psock_update_sk_prot()
    improve changelogs

v2: separate from the original large patchset
    rebase to the latest bpf-next
    split UDP test case
    move inet_csk_has_ulp() check to tcp_bpf.c
    clean up udp_read_sock()
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
3 years agoselftests/bpf: Add a test case for loading BPF_SK_SKB_VERDICT
Cong Wang [Wed, 31 Mar 2021 02:32:37 +0000 (19:32 -0700)]
selftests/bpf: Add a test case for loading BPF_SK_SKB_VERDICT

This adds a test case to ensure BPF_SK_SKB_VERDICT and
BPF_SK_STREAM_VERDICT will never be attached at the same time.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210331023237.41094-17-xiyou.wangcong@gmail.com
3 years agoselftests/bpf: Add a test case for udp sockmap
Cong Wang [Wed, 31 Mar 2021 02:32:36 +0000 (19:32 -0700)]
selftests/bpf: Add a test case for udp sockmap

Add a test case to ensure redirection between two UDP sockets works.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210331023237.41094-16-xiyou.wangcong@gmail.com
3 years agosock_map: Update sock type checks for UDP
Cong Wang [Wed, 31 Mar 2021 02:32:35 +0000 (19:32 -0700)]
sock_map: Update sock type checks for UDP

Now that UDP supports sockmap and redirection, we can safely update
the sock type checks for it accordingly.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-15-xiyou.wangcong@gmail.com
3 years agoudp: Implement udp_bpf_recvmsg() for sockmap
Cong Wang [Wed, 31 Mar 2021 02:32:34 +0000 (19:32 -0700)]
udp: Implement udp_bpf_recvmsg() for sockmap

We have to implement udp_bpf_recvmsg() to replace ->recvmsg()
so that skmsg can be retrieved from ingress_msg.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-14-xiyou.wangcong@gmail.com
3 years agoskmsg: Extract __tcp_bpf_recvmsg() and tcp_bpf_wait_data()
Cong Wang [Wed, 31 Mar 2021 02:32:33 +0000 (19:32 -0700)]
skmsg: Extract __tcp_bpf_recvmsg() and tcp_bpf_wait_data()

Although these two functions are only used by TCP, they are not
specific to TCP at all: both operate on skmsg and ingress_msg,
so they fit in net/core/skmsg.c very well.

We will also need them for non-TCP protocols, so rename and move
them to skmsg.c and export them to modules.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210331023237.41094-13-xiyou.wangcong@gmail.com
3 years agoudp: Implement ->read_sock() for sockmap
Cong Wang [Wed, 31 Mar 2021 02:32:32 +0000 (19:32 -0700)]
udp: Implement ->read_sock() for sockmap

This is similar to tcp_read_sock(), except that we do not need
to worry about connections; we just need to retrieve skbs
from the UDP receive queue.

Note, the return value of ->read_sock() is unused in
sk_psock_verdict_data_ready(), and UDP still does not
support splice() due to the lack of ->splice_read(), so users
cannot reach udp_read_sock() directly.
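
As a rough user-space illustration of the ->read_sock() pattern described
above (not the kernel code), the sketch below drains a receive queue and
hands each packet to a caller-supplied actor. All demo_* names, the
fixed-size queue and the actor signature are assumptions made for this
example only.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_QUEUE_LEN 8

/* Hypothetical stand-in for an skb: a payload pointer and its length. */
struct demo_pkt {
	const char *data;
	int len;
};

/* Hypothetical stand-in for a socket receive queue. */
struct demo_queue {
	struct demo_pkt pkts[DEMO_QUEUE_LEN];
	size_t head; /* next packet to dequeue */
	size_t tail; /* next free slot */
};

/* Actor callback: returns bytes consumed, or <= 0 to stop. */
typedef int (*demo_actor_t)(void *ctx, const struct demo_pkt *pkt);

int demo_enqueue(struct demo_queue *q, const char *data, int len)
{
	if (q->tail == DEMO_QUEUE_LEN)
		return -1;
	q->pkts[q->tail].data = data;
	q->pkts[q->tail].len = len;
	q->tail++;
	return 0;
}

/* Dequeue packets one by one and pass each to the actor. */
int demo_read_sock(struct demo_queue *q, demo_actor_t actor, void *ctx)
{
	int copied = 0;

	assert(q && actor);
	while (q->head < q->tail) {
		const struct demo_pkt *pkt = &q->pkts[q->head++];
		int used = actor(ctx, pkt);

		if (used <= 0) {
			/* Report the actor's error only if nothing was consumed. */
			if (!copied)
				copied = used;
			break;
		}
		copied += used;
	}
	return copied;
}

/* Example actor: counts packets and consumes each one fully. */
int demo_count_actor(void *ctx, const struct demo_pkt *pkt)
{
	int *npkts = ctx;

	(*npkts)++;
	return pkt->len;
}
```

There is no connection state anywhere in the loop, which is the point the
commit message makes about UDP: reading is just dequeue-and-dispatch.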

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-12-xiyou.wangcong@gmail.com
3 years agosock: Introduce sk->sk_prot->psock_update_sk_prot()
Cong Wang [Wed, 31 Mar 2021 02:32:31 +0000 (19:32 -0700)]
sock: Introduce sk->sk_prot->psock_update_sk_prot()

Currently sockmap calls into each protocol to update the struct
proto and replace it. This certainly won't work when the protocol
is implemented as a module, for example, AF_UNIX.

Introduce a new op, sk->sk_prot->psock_update_sk_prot(), so that each
protocol can implement its own way to replace the struct proto.
This also helps get rid of symbol dependencies on CONFIG_INET.
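
The idea of a per-protocol hook can be sketched in user space as follows.
This is a heavily reduced model, not the kernel's struct proto: the demo_*
types, the two-entry proto tables and the hook signature are assumptions
for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_sock;

/* Hypothetical, reduced model of a protocol operations table. */
struct demo_proto {
	const char *name;
	int (*psock_update_sk_prot)(struct demo_sock *sk, bool restore);
};

struct demo_sock {
	const struct demo_proto *sk_prot;    /* table currently in use */
	const struct demo_proto *saved_prot; /* original table, kept for restore */
};

int demo_udp_update_proto(struct demo_sock *sk, bool restore);

const struct demo_proto demo_udp_prot = {
	.name = "UDP",
	.psock_update_sk_prot = demo_udp_update_proto,
};

const struct demo_proto demo_udp_bpf_prot = {
	.name = "UDP-BPF",
	.psock_update_sk_prot = demo_udp_update_proto,
};

/* Protocol-owned hook: swaps in the BPF-aware table or restores the old one. */
int demo_udp_update_proto(struct demo_sock *sk, bool restore)
{
	if (restore) {
		if (!sk->saved_prot)
			return -EINVAL;
		sk->sk_prot = sk->saved_prot;
		return 0;
	}
	sk->saved_prot = sk->sk_prot;
	sk->sk_prot = &demo_udp_bpf_prot;
	return 0;
}

/* Generic code: no direct reference to any protocol symbol, only the hook. */
int demo_sock_map_update_proto(struct demo_sock *sk, bool restore)
{
	assert(sk && sk->sk_prot);
	if (!sk->sk_prot->psock_update_sk_prot)
		return -EINVAL;
	return sk->sk_prot->psock_update_sk_prot(sk, restore);
}
```

Because the generic function only dereferences the hook, it would not
need to link against a protocol that lives in a module.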

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210331023237.41094-11-xiyou.wangcong@gmail.com
3 years agosock_map: Introduce BPF_SK_SKB_VERDICT
Cong Wang [Wed, 31 Mar 2021 02:32:30 +0000 (19:32 -0700)]
sock_map: Introduce BPF_SK_SKB_VERDICT

Reusing BPF_SK_SKB_STREAM_VERDICT is possible, but its name is
confusing and, more importantly, we still want to distinguish the two
from user space. So we can just reuse the stream verdict code but
introduce a new type of eBPF program, skb_verdict. Users are not
allowed to attach stream_verdict and skb_verdict programs to the
same map.
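
The mutual-exclusion rule in the last sentence can be modelled in user
space as below. This is a sketch under assumptions: the demo_* names, the
two-slot structure and the choice of -EBUSY as the error code are
illustrative and not taken from the patch itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two verdict attach types. */
enum demo_attach_type {
	DEMO_STREAM_VERDICT,
	DEMO_SKB_VERDICT,
};

/* Hypothetical per-map program slots; NULL means nothing is attached. */
struct demo_map_progs {
	const void *stream_verdict;
	const void *skb_verdict;
};

/* Attach prog to the map, refusing if the other verdict type is present. */
int demo_map_attach(struct demo_map_progs *progs,
		    enum demo_attach_type type, const void *prog)
{
	assert(progs && prog);

	switch (type) {
	case DEMO_STREAM_VERDICT:
		if (progs->skb_verdict)
			return -EBUSY; /* assumed error code */
		progs->stream_verdict = prog;
		return 0;
	case DEMO_SKB_VERDICT:
		if (progs->stream_verdict)
			return -EBUSY; /* assumed error code */
		progs->skb_verdict = prog;
		return 0;
	}
	return -EINVAL;
}
```

Replacing a program of the same type is still allowed in this model; only
mixing the two types on one map is rejected.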

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-10-xiyou.wangcong@gmail.com
3 years agosock_map: Kill sock_map_link_no_progs()
Cong Wang [Wed, 31 Mar 2021 02:32:29 +0000 (19:32 -0700)]
sock_map: Kill sock_map_link_no_progs()

Now we can fold sock_map_link_no_progs() into sock_map_link()
and get rid of sock_map_link_no_progs().

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210331023237.41094-9-xiyou.wangcong@gmail.com
3 years agosock_map: Simplify sock_map_link() a bit
Cong Wang [Wed, 31 Mar 2021 02:32:28 +0000 (19:32 -0700)]
sock_map: Simplify sock_map_link() a bit

sock_map_link() passes down map progs, but it is confusing
to see both map progs and psock progs. Make the map progs
more obvious by retrieving them directly with sock_map_progs()
inside sock_map_link(). Now it is aligned with
sock_map_link_no_progs() too.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-8-xiyou.wangcong@gmail.com
3 years agoskmsg: Use GFP_KERNEL in sk_psock_create_ingress_msg()
Cong Wang [Wed, 31 Mar 2021 02:32:27 +0000 (19:32 -0700)]
skmsg: Use GFP_KERNEL in sk_psock_create_ingress_msg()

This function is only called in process context.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-7-xiyou.wangcong@gmail.com
3 years agoskmsg: Use rcu work for destroying psock
Cong Wang [Wed, 31 Mar 2021 02:32:26 +0000 (19:32 -0700)]
skmsg: Use rcu work for destroying psock

The RCU callback sk_psock_destroy() only queues work psock->gc,
so we can just switch to rcu work to simplify the code.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-6-xiyou.wangcong@gmail.com
3 years agoskmsg: Avoid lock_sock() in sk_psock_backlog()
Cong Wang [Wed, 31 Mar 2021 02:32:25 +0000 (19:32 -0700)]
skmsg: Avoid lock_sock() in sk_psock_backlog()

We do not have to lock the sock to avoid losing sk_socket;
instead we can purge all the ingress queues when we close
the socket. Sending or receiving packets after orphaning
the socket makes no sense.

We do purge these queues when the psock refcnt reaches zero, but
here we want to purge them explicitly in sock_map_close().
There are also some nasty race conditions between testing the
SK_PSOCK_TX_ENABLED bit and queuing/canceling the psock work;
we can expand psock->ingress_lock a bit to protect them too.

As noticed by John, we still have to lock psock->work,
because the same work item could be running concurrently on
different CPUs.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-5-xiyou.wangcong@gmail.com
3 years agonet: Introduce skb_send_sock() for sock_map
Cong Wang [Wed, 31 Mar 2021 02:32:24 +0000 (19:32 -0700)]
net: Introduce skb_send_sock() for sock_map

We only have skb_send_sock_locked(), which requires callers
to use lock_sock(). Introduce a variant, skb_send_sock(),
which takes the lock on its own so that callers do not need
to lock the sock any more. This will save us from adding a
->sendmsg_locked for each protocol.

To reuse the code, pass function pointers to __skb_send_sock()
and build skb_send_sock() and skb_send_sock_locked() on top.
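
The function-pointer approach can be sketched in user space like this. It
is a simplified model, not the kernel implementation: the demo_* names,
the integer "lock" flag and the fixed 4-byte chunk size are assumptions
chosen so the behaviour is easy to check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical socket model with a trivially observable lock. */
struct demo_sock {
	int locked;     /* 1 while the socket lock is held */
	int lock_count; /* how many times the lock was taken */
	size_t sent;    /* total bytes "transmitted" */
};

typedef int (*demo_sendmsg_t)(struct demo_sock *sk, const char *buf, size_t len);

/* Low-level send that expects the caller to hold the lock. */
int demo_sendmsg_locked(struct demo_sock *sk, const char *buf, size_t len)
{
	(void)buf;
	assert(sk->locked);
	sk->sent += len;
	return (int)len;
}

/* Low-level send that takes and releases the lock itself. */
int demo_sendmsg_unlocked(struct demo_sock *sk, const char *buf, size_t len)
{
	int ret;

	assert(!sk->locked);
	sk->locked = 1;
	sk->lock_count++;
	ret = demo_sendmsg_locked(sk, buf, len);
	sk->locked = 0;
	return ret;
}

/* Shared body: walks the buffer in chunks using whichever send op it is given. */
int demo_send_common(struct demo_sock *sk, const char *buf, size_t len,
		     size_t chunk, demo_sendmsg_t sendmsg)
{
	size_t off = 0;

	assert(chunk > 0);
	while (off < len) {
		size_t n = len - off < chunk ? len - off : chunk;
		int ret = sendmsg(sk, buf + off, n);

		if (ret <= 0)
			return ret;
		off += (size_t)ret;
	}
	return (int)off;
}

/* The two public variants differ only in the function pointer they pass. */
int demo_send_sock_locked(struct demo_sock *sk, const char *buf, size_t len)
{
	return demo_send_common(sk, buf, len, 4, demo_sendmsg_locked);
}

int demo_send_sock(struct demo_sock *sk, const char *buf, size_t len)
{
	return demo_send_common(sk, buf, len, 4, demo_sendmsg_unlocked);
}
```

Keeping the chunking loop in one place means the locked and unlocked
variants cannot drift apart.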

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-4-xiyou.wangcong@gmail.com
3 years agoskmsg: Introduce a spinlock to protect ingress_msg
Cong Wang [Wed, 31 Mar 2021 02:32:23 +0000 (19:32 -0700)]
skmsg: Introduce a spinlock to protect ingress_msg

Currently we rely on lock_sock to protect ingress_msg, but
it is too heavyweight for this. We can just use a spinlock
to protect this list, as is done for other skb queues.

__tcp_bpf_recvmsg() is still special because of peeking, so
it still has to use lock_sock.
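
A minimal user-space sketch of a list guarded by its own small spinlock,
rather than by a larger per-socket lock, might look like this. It uses a
C11 atomic_flag as the spinlock; the demo_* names and the singly linked
list are assumptions for illustration and do not mirror the kernel's
data structures.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical message node. */
struct demo_msg {
	int id;
	struct demo_msg *next;
};

/* Hypothetical psock model: the lock covers only the ingress list. */
struct demo_psock {
	atomic_flag ingress_lock;
	struct demo_msg *head;
	struct demo_msg *tail;
};

static void demo_spin_lock(atomic_flag *lock)
{
	/* Busy-wait until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
		;
}

static void demo_spin_unlock(atomic_flag *lock)
{
	atomic_flag_clear_explicit(lock, memory_order_release);
}

void demo_psock_init(struct demo_psock *psock)
{
	atomic_flag_clear(&psock->ingress_lock);
	psock->head = NULL;
	psock->tail = NULL;
}

/* Append a message to the tail of the ingress list. */
void demo_queue_msg(struct demo_psock *psock, struct demo_msg *msg)
{
	assert(psock && msg);
	msg->next = NULL;
	demo_spin_lock(&psock->ingress_lock);
	if (psock->tail)
		psock->tail->next = msg;
	else
		psock->head = msg;
	psock->tail = msg;
	demo_spin_unlock(&psock->ingress_lock);
}

/* Remove and return the oldest message, or NULL if the list is empty. */
struct demo_msg *demo_dequeue_msg(struct demo_psock *psock)
{
	struct demo_msg *msg;

	demo_spin_lock(&psock->ingress_lock);
	msg = psock->head;
	if (msg) {
		psock->head = msg->next;
		if (!psock->head)
			psock->tail = NULL;
	}
	demo_spin_unlock(&psock->ingress_lock);
	return msg;
}
```

The critical sections only touch list pointers, so the lock is held very
briefly. A peek-style reader that must keep the list stable across a
longer operation would still need a coarser lock, as noted above.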

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-3-xiyou.wangcong@gmail.com