git.monstr.eu Git - linux-2.6-microblaze.git/log

selftests/bpf: Add unit tests for global functions

test_global_func[12] - check 512 stack limit.
test_global_func[34] - check 8 frame call chain limit.
test_global_func5    - check that non-ctx pointer cannot be passed into
                       a function that expects context.
test_global_func6    - check that ctx pointer is unmodified.
test_global_func7    - check that global function returns scalar.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200110064124.1760511-7-ast@kernel.org

selftests/bpf: Modify a test to check global functions

Make two static functions in test_xdp_noinline.c global:
before: processed 2790 insns
after: processed 2598 insns

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200110064124.1760511-6-ast@kernel.org

selftests/bpf: Add a test for a large global function

test results:
pyperf50 with always_inlined the same function five times: processed 46378 insns
pyperf50 with global function: processed 6102 insns

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200110064124.1760511-5-ast@kernel.org

selftests/bpf: Add fexit-to-skb test for global funcs

Add simple fexit prog type to skb prog type test when subprogram is a global
function.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200110064124.1760511-4-ast@kernel.org

bpf: Introduce function-by-function verification

New llvm and old llvm with libbpf help produce BTF that distinguish global and
static functions. Unlike arguments of static function the arguments of global
functions cannot be removed or optimized away by llvm. The compiler has to use
exactly the arguments specified in a function prototype. The argument type
information allows the verifier validate each global function independently.
For now only supported argument types are pointer to context and scalars. In
the future pointers to structures, sizes, pointer to packet data can be
supported as well. Consider the following example:

static int f1(int ...)
{
  ...
}

int f3(int b);

int f2(int a)
{
  f1(a) + f3(a);
}

int f3(int b)
{
  ...
}

int main(...)
{
  f1(...) + f2(...) + f3(...);
}

The verifier will start its safety checks from the first global function f2().
It will recursively descend into f1() because it's static. Then it will check
that arguments match for the f3() invocation inside f2(). It will not descend
into f3(). It will finish f2() that has to be successfully verified for all
possible values of 'a'. Then it will proceed with f3(). That function also has
to be safe for all possible values of 'b'. Then it will start subprog 0 (which
is main() function). It will recursively descend into f1() and will skip full
check of f2() and f3(), since they are global. The order of processing global
functions doesn't affect safety, since all global functions must be proven safe
based on their arguments only.

Such function by function verification can drastically improve speed of the
verification and reduce complexity.

Note that the stack limit of 512 still applies to the call chain regardless whether
functions were static or global. The nested level of 8 also still applies. The
same recursion prevention checks are in place as well.

The type information and static/global kind is preserved after the verification
hence in the above example global function f2() and f3() can be replaced later
by equivalent functions with the same types that are loaded and verified later
without affecting safety of this main() program. Such replacement (re-linking)
of global functions is a subject of future patches.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200110064124.1760511-3-ast@kernel.org

libbpf: Sanitize global functions

In case the kernel doesn't support BTF_FUNC_GLOBAL sanitize BTF produced by the
compiler for global functions.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200110064124.1760511-2-ast@kernel.org

Merge branch 'selftest-makefile-cleanup'

Andrii Nakryiko says:

====================
Fix issues with bpf_helper_defs.h usage in selftests/bpf. As part of that, fix
the way clean up is performed for libbpf and selftests/bpf. Some for Makefile
output clean ups as well.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Further clean up Makefile output

Further clean up Makefile output:
- hide "entering directory" messages;
- silvence sub-Make command echoing;
- succinct MKDIR messages.

Also remove few test binaries that are not produced anymore from .gitignore.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200110051716.1591485-4-andriin@fb.com

selftests/bpf: Ensure bpf_helper_defs.h are taken from selftests dir

Reorder includes search path to ensure $(OUTPUT) and $(CURDIR) go before
libbpf's directory. Also fix bpf_helpers.h to include bpf_helper_defs.h in
such a way as to leverage includes search path. This allows selftests to not
use libbpf's local and potentially stale bpf_helper_defs.h. It's important
because selftests/bpf's Makefile only re-generates bpf_helper_defs.h in
seltests' output directory, not the one in libbpf's directory.

Also force regeneration of bpf_helper_defs.h when libbpf.a is updated to
reduce staleness.

Fixes: fa633a0f8919 ("libbpf: Fix build on read-only filesystems")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200110051716.1591485-3-andriin@fb.com

libbpf,selftests/bpf: Fix clean targets

Libbpf's clean target should clean out generated files in $(OUTPUT) directory
and not make assumption that $(OUTPUT) directory is current working directory.

Selftest's Makefile should delegate cleaning of libbpf-generated files to
libbpf's Makefile. This ensures more robust clean up.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200110051716.1591485-2-andriin@fb.com

libbpf: Make bpf_map order and indices stable

Currently, libbpf re-sorts bpf_map structs after all the maps are added and
initialized, which might change their relative order and invalidate any
bpf_map pointer or index taken before that. This is inconvenient and
error-prone. For instance, it can cause .kconfig map index to point to a wrong
map.

Furthermore, libbpf itself doesn't rely on any specific ordering of bpf_maps,
so it's just an unnecessary complication right now. This patch drops sorting
of maps and makes their relative positions fixed. If efficient index is ever
needed, it's better to have a separate array of pointers as a search index,
instead of reordering bpf_map struct in-place. This will be less error-prone
and will allow multiple independent orderings, if necessary (e.g., either by
section index or by name).

Fixes: 166750bc1dd2 ("libbpf: Support libbpf-provided extern variables")
Reported-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200110034247.1220142-1-andriin@fb.com

bpf: Document BPF_F_QUERY_EFFECTIVE flag

Document BPF_F_QUERY_EFFECTIVE flag, mostly to clarify how it affects
attach_flags what may not be obvious and what may lead to confision.

Specifically attach_flags is returned only for target_fd but if programs
are inherited from an ancestor cgroup then returned attach_flags for
current cgroup may be confusing. For example, two effective programs of
same attach_type can be returned but w/o BPF_F_ALLOW_MULTI in
attach_flags.

Simple repro:
  # bpftool c s /sys/fs/cgroup/path/to/task
  ID       AttachType      AttachFlags     Name
  # bpftool c s /sys/fs/cgroup/path/to/task effective
  ID       AttachType      AttachFlags     Name
  95043    ingress                         tw_ipt_ingress
  95048    ingress                         tw_ingress

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200108014006.938363-1-rdna@fb.com

Merge branch 'tcp-bpf-cc'

Martin Lau says:

====================
This series introduces BPF STRUCT_OPS.  It is an infra to allow
implementing some specific kernel's function pointers in BPF.
The first use case included in this series is to implement
TCP congestion control algorithm in BPF  (i.e. implement
struct tcp_congestion_ops in BPF).

There has been attempt to move the TCP CC to the user space
(e.g. CCP in TCP).   The common arguments are faster turn around,
get away from long-tail kernel versions in production...etc,
which are legit points.

BPF has been the continuous effort to join both kernel and
userspace upsides together (e.g. XDP to gain the performance
advantage without bypassing the kernel).  The recent BPF
advancements (in particular BTF-aware verifier, BPF trampoline,
BPF CO-RE...) made implementing kernel struct ops (e.g. tcp cc)
possible in BPF.

The idea is to allow implementing tcp_congestion_ops in bpf.
It allows a faster turnaround for testing algorithm in the
production while leveraging the existing (and continue growing) BPF
feature/framework instead of building one specifically for
userspace TCP CC.

Please see individual patch for details.

The bpftool support will be posted in follow-up patches.

v4:
- Expose tcp_ca_find() to tcp.h in patch 7.
  It is used to check the same bpf-tcp-cc
  does not exist to guarantee the register()
  will succeed.
- set_memory_ro() and then set_memory_x() only after all
  trampolines are written to the image in patch 6. (Daniel)
  spinlock is replaced by mutex because set_memory_*
  requires sleepable context.

v3:
- Fix kbuild error by considering CONFIG_BPF_SYSCALL (kbuild)
- Support anonymous bitfield in patch 4 (Andrii, Yonghong)
- Push boundary safety check to a specific arch's trampoline function
  (in patch 6) (Yonghong).
  Reuse the WANR_ON_ONCE check in arch_prepare_bpf_trampoline() in x86.
- Check module field is 0 in udata in patch 6 (Yonghong)
- Check zero holes in patch 6 (Andrii)
- s/_btf_vmlinux/btf/ in patch 5 and 7 (Andrii)
- s/check_xxx/is_xxx/ in patch 7 (Andrii)
- Use "struct_ops/" convention in patch 11 (Andrii)
- Use the skel instead of bpf_object in patch 11 (Andrii)
- libbpf: Decide BPF_PROG_TYPE_STRUCT_OPS at open phase by using
          find_sec_def()
- libbpf: Avoid a debug message at open phase (Andrii)
- libbpf: Add bpf_program__(is|set)_struct_ops() for consistency (Andrii)
- libbpf: Add "struct_ops" to section_defs (Andrii)
- libbpf: Some code shuffling in init_kern_struct_ops() (Andrii)
- libbpf: A few safety checks (Andrii)

v2:
- Dropped cubic for now.  They will be reposted
  once there are more clarity in "jiffies" on both
  bpf side (about the helper) and
  tcp_cubic side (some of jiffies usages are being replaced
  by tp->tcp_mstamp)
- Remove unnecssary check on bitfield support from btf_struct_access()
  (Yonghong)
- BTF_TYPE_EMIT macro (Yonghong, Andrii)
- value_name's length check to avoid an unlikely
  type match during truncation case (Yonghong)
- BUILD_BUG_ON to ensure no trampoline-image overrun
  in the future (Yonghong)
- Simplify get_next_key() (Yonghong)
- Added comment to explain how to check mandatory
  func ptr in net/ipv4/bpf_tcp_ca.c (Yonghong)
- Rename "__bpf_" to "bpf_struct_ops_" for value prefix (Andrii)
- Add comment to highlight the bpf_dctcp.c is not necessarily
  the same as tcp_dctcp.c. (Alexei, Eric)
- libbpf: Renmae "struct_ops" to ".struct_ops" for elf sec (Andrii)
- libbpf: Expose struct_ops as a bpf_map (Andrii)
- libbpf: Support multiple struct_ops in SEC(".struct_ops") (Andrii)
- libbpf: Add bpf_map__attach_struct_ops()  (Andrii)
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Add bpf_dctcp example

This patch adds a bpf_dctcp example. It currently does not do
no-ECN fallback but the same could be done through the cgrp2-bpf.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200109003517.3856825-1-kafai@fb.com

bpf: libbpf: Add STRUCT_OPS support

This patch adds BPF STRUCT_OPS support to libbpf.

The only sec_name convention is SEC(".struct_ops") to identify the
struct_ops implemented in BPF,
e.g. To implement a tcp_congestion_ops:

SEC(".struct_ops")
struct tcp_congestion_ops dctcp = {
.init           = (void *)dctcp_init,  /* <-- a bpf_prog */
/* ... some more func prts ... */
.name           = "bpf_dctcp",
};

Each struct_ops is defined as a global variable under SEC(".struct_ops")
as above.  libbpf creates a map for each variable and the variable name
is the map's name.  Multiple struct_ops is supported under
SEC(".struct_ops").

In the bpf_object__open phase, libbpf will look for the SEC(".struct_ops")
section and find out what is the btf-type the struct_ops is
implementing.  Note that the btf-type here is referring to
a type in the bpf_prog.o's btf.  A "struct bpf_map" is added
by bpf_object__add_map() as other maps do.  It will then
collect (through SHT_REL) where are the bpf progs that the
func ptrs are referring to.  No btf_vmlinux is needed in
the open phase.

In the bpf_object__load phase, the map-fields, which depend
on the btf_vmlinux, are initialized (in bpf_map__init_kern_struct_ops()).
It will also set the prog->type, prog->attach_btf_id, and
prog->expected_attach_type.  Thus, the prog's properties do
not rely on its section name.
[ Currently, the bpf_prog's btf-type ==> btf_vmlinux's btf-type matching
  process is as simple as: member-name match + btf-kind match + size match.
  If these matching conditions fail, libbpf will reject.
  The current targeting support is "struct tcp_congestion_ops" which
  most of its members are function pointers.
  The member ordering of the bpf_prog's btf-type can be different from
  the btf_vmlinux's btf-type. ]

Then, all obj->maps are created as usual (in bpf_object__create_maps()).

Once the maps are created and prog's properties are all set,
the libbpf will proceed to load all the progs.

bpf_map__attach_struct_ops() is added to register a struct_ops
map to a kernel subsystem.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200109003514.3856730-1-kafai@fb.com

bpf: Synch uapi bpf.h to tools/

This patch sync uapi bpf.h to tools/

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200109003512.3856559-1-kafai@fb.com

bpf: Add BPF_FUNC_tcp_send_ack helper

Add a helper to send out a tcp-ack. It will be used in the later
bpf_dctcp implementation that requires to send out an ack
when the CE state changed.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200109004551.3900448-1-kafai@fb.com

bpf: tcp: Support tcp_congestion_ops in bpf

This patch makes "struct tcp_congestion_ops" to be the first user
of BPF STRUCT_OPS.  It allows implementing a tcp_congestion_ops
in bpf.

The BPF implemented tcp_congestion_ops can be used like
regular kernel tcp-cc through sysctl and setsockopt.  e.g.
[root@arch-fb-vm1 bpf]# sysctl -a | egrep congestion
net.ipv4.tcp_allowed_congestion_control = reno cubic bpf_cubic
net.ipv4.tcp_available_congestion_control = reno bic cubic bpf_cubic
net.ipv4.tcp_congestion_control = bpf_cubic

There has been attempt to move the TCP CC to the user space
(e.g. CCP in TCP).   The common arguments are faster turn around,
get away from long-tail kernel versions in production...etc,
which are legit points.

BPF has been the continuous effort to join both kernel and
userspace upsides together (e.g. XDP to gain the performance
advantage without bypassing the kernel).  The recent BPF
advancements (in particular BTF-aware verifier, BPF trampoline,
BPF CO-RE...) made implementing kernel struct ops (e.g. tcp cc)
possible in BPF.  It allows a faster turnaround for testing algorithm
in the production while leveraging the existing (and continue growing)
BPF feature/framework instead of building one specifically for
userspace TCP CC.

This patch allows write access to a few fields in tcp-sock
(in bpf_tcp_ca_btf_struct_access()).

The optional "get_info" is unsupported now.  It can be added
later.  One possible way is to output the info with a btf-id
to describe the content.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200109003508.3856115-1-kafai@fb.com

bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS

The patch introduces BPF_MAP_TYPE_STRUCT_OPS.  The map value
is a kernel struct with its func ptr implemented in bpf prog.
This new map is the interface to register/unregister/introspect
a bpf implemented kernel struct.

The kernel struct is actually embedded inside another new struct
(or called the "value" struct in the code).  For example,
"struct tcp_congestion_ops" is embbeded in:
struct bpf_struct_ops_tcp_congestion_ops {
refcount_t refcnt;
enum bpf_struct_ops_state state;
struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
}
The map value is "struct bpf_struct_ops_tcp_congestion_ops".
The "bpftool map dump" will then be able to show the
state ("inuse"/"tobefree") and the number of subsystem's refcnt (e.g.
number of tcp_sock in the tcp_congestion_ops case).  This "value" struct
is created automatically by a macro.  Having a separate "value" struct
will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding
"void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some
initialization works before registering the struct_ops to the kernel
subsystem).  The libbpf will take care of finding and populating the
"struct bpf_struct_ops_XYZ" from "struct XYZ".

Register a struct_ops to a kernel subsystem:
1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s)
2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id
   set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the
   running kernel.
   Instead of reusing the attr->btf_value_type_id,
   btf_vmlinux_value_type_id s added such that attr->btf_fd can still be
   used as the "user" btf which could store other useful sysadmin/debug
   info that may be introduced in the furture,
   e.g. creation-date/compiler-details/map-creator...etc.
3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described
   in the running kernel btf.  Populate the value of this object.
   The function ptr should be populated with the prog fds.
4. Call BPF_MAP_UPDATE with the object created in (3) as
   the map value.  The key is always "0".

During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's
args as an array of u64 is generated.  BPF_MAP_UPDATE also allows
the specific struct_ops to do some final checks in "st_ops->init_member()"
(e.g. ensure all mandatory func ptrs are implemented).
If everything looks good, it will register this kernel struct
to the kernel subsystem.  The map will not allow further update
from this point.

Unregister a struct_ops from the kernel subsystem:
BPF_MAP_DELETE with key "0".

Introspect a struct_ops:
BPF_MAP_LOOKUP_ELEM with key "0".  The map value returned will
have the prog _id_ populated as the func ptr.

The map value state (enum bpf_struct_ops_state) will transit from:
INIT (map created) =>
INUSE (map updated, i.e. reg) =>
TOBEFREE (map value deleted, i.e. unreg)

The kernel subsystem needs to call bpf_struct_ops_get() and
bpf_struct_ops_put() to manage the "refcnt" in the
"struct bpf_struct_ops_XYZ".  This patch uses a separate refcnt
for the purose of tracking the subsystem usage.  Another approach
is to reuse the map->refcnt and then "show" (i.e. during map_lookup)
the subsystem's usage by doing map->refcnt - map->usercnt to filter out
the map-fd/pinned-map usage.  However, that will also tie down the
future semantics of map->refcnt and map->usercnt.

The very first subsystem's refcnt (during reg()) holds one
count to map->refcnt.  When the very last subsystem's refcnt
is gone, it will also release the map->refcnt.  All bpf_prog will be
freed when the map->refcnt reaches 0 (i.e. during map_free()).

Here is how the bpftool map command will look like:
[root@arch-fb-vm1 bpf]# bpftool map show
6: struct_ops  name dctcp  flags 0x0
key 4B  value 256B  max_entries 1  memlock 4096B
btf_id 6
[root@arch-fb-vm1 bpf]# bpftool map dump id 6
[{
        "value": {
            "refcnt": {
                "refs": {
                    "counter": 1
                }
            },
            "state": 1,
            "data": {
                "list": {
                    "next": 0,
                    "prev": 0
                },
                "key": 0,
                "flags": 2,
                "init": 24,
                "release": 0,
                "ssthresh": 25,
                "cong_avoid": 30,
                "set_state": 27,
                "cwnd_event": 28,
                "in_ack_event": 26,
                "undo_cwnd": 29,
                "pkts_acked": 0,
                "min_tso_segs": 0,
                "sndbuf_expand": 0,
                "cong_control": 0,
                "get_info": 0,
                "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0
                ],
                "owner": 0
            }
        }
    }
]

Misc Notes:
* bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup.
  It does an inplace update on "*value" instead returning a pointer
  to syscall.c.  Otherwise, it needs a separate copy of "zero" value
  for the BPF_STRUCT_OPS_STATE_INIT to avoid races.

* The bpf_struct_ops_map_delete_elem() is also called without
  preempt_disable() from map_delete_elem().  It is because
  the "->unreg()" may requires sleepable context, e.g.
  the "tcp_unregister_congestion_control()".

* "const" is added to some of the existing "struct btf_func_model *"
  function arg to avoid a compiler warning caused by this patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200109003505.3855919-1-kafai@fb.com

bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS

This patch allows the kernel's struct ops (i.e. func ptr) to be
implemented in BPF.  The first use case in this series is the
"struct tcp_congestion_ops" which will be introduced in a
latter patch.

This patch introduces a new prog type BPF_PROG_TYPE_STRUCT_OPS.
The BPF_PROG_TYPE_STRUCT_OPS prog is verified against a particular
func ptr of a kernel struct.  The attr->attach_btf_id is the btf id
of a kernel struct.  The attr->expected_attach_type is the member
"index" of that kernel struct.  The first member of a struct starts
with member index 0.  That will avoid ambiguity when a kernel struct
has multiple func ptrs with the same func signature.

For example, a BPF_PROG_TYPE_STRUCT_OPS prog is written
to implement the "init" func ptr of the "struct tcp_congestion_ops".
The attr->attach_btf_id is the btf id of the "struct tcp_congestion_ops"
of the _running_ kernel.  The attr->expected_attach_type is 3.

The ctx of BPF_PROG_TYPE_STRUCT_OPS is an array of u64 args saved
by arch_prepare_bpf_trampoline that will be done in the next
patch when introducing BPF_MAP_TYPE_STRUCT_OPS.

"struct bpf_struct_ops" is introduced as a common interface for the kernel
struct that supports BPF_PROG_TYPE_STRUCT_OPS prog.  The supporting kernel
struct will need to implement an instance of the "struct bpf_struct_ops".

The supporting kernel struct also needs to implement a bpf_verifier_ops.
During BPF_PROG_LOAD, bpf_struct_ops_find() will find the right
bpf_verifier_ops by searching the attr->attach_btf_id.

A new "btf_struct_access" is also added to the bpf_verifier_ops such
that the supporting kernel struct can optionally provide its own specific
check on accessing the func arg (e.g. provide limited write access).

After btf_vmlinux is parsed, the new bpf_struct_ops_init() is called
to initialize some values (e.g. the btf id of the supporting kernel
struct) and it can only be done once the btf_vmlinux is available.

The R0 checks at BPF_EXIT is excluded for the BPF_PROG_TYPE_STRUCT_OPS prog
if the return type of the prog->aux->attach_func_proto is "void".

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200109003503.3855825-1-kafai@fb.com

bpf: Support bitfield read access in btf_struct_access

This patch allows bitfield access as a scalar.

It checks "off + size > t->size" to avoid accessing bitfield
end up accessing beyond the struct. This check is done
outside of the loop since it is applicable to all access.

It also takes this chance to break early on the "off < moff" case.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200109003501.3855427-1-kafai@fb.com

bpf: Add enum support to btf_ctx_access()

It allows bpf prog (e.g. tracing) to attach
to a kernel function that takes enum argument.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200109003459.3855366-1-kafai@fb.com

bpf: Avoid storing modifier to info->btf_id

info->btf_id expects the btf_id of a struct, so it should
store the final result after skipping modifiers (if any).

It also takes this chanace to add a missing newline in one of the
bpf_log() messages.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200109003456.3855176-1-kafai@fb.com

bpf: Save PTR_TO_BTF_ID register state when spilling to stack

This patch makes the verifier save the PTR_TO_BTF_ID register state when
spilling to the stack.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200109003454.3854870-1-kafai@fb.com

selftests/bpf: Restore original comm in test_overhead

test_overhead changes task comm in order to estimate BPF trampoline
overhead but never sets the comm back to the original one.
We have the tests (like core_reloc.c) that have 'test_progs'
as hard-coded expected comm, so let's try to preserve the
original comm.

Currently, everything works because the order of execution is:
first core_recloc, then test_overhead; but let's make it a bit
future-proof.

Other related changes: use 'test_overhead' as new comm instead of
'test' to make it easy to debug and drop '\n' at the end.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Petar Penkov <ppenkov@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200108192132.189221-1-sdf@google.com

bpftool: Add misc section and probe for large INSN limit

Introduce a new probe section (misc) for probes not related to concrete
map types, program types, functions or kernel configuration. Introduce a
probe for large INSN limit as the first one in that section.

Example outputs:

  # bpftool feature probe
  [...]
  Scanning miscellaneous eBPF features...
  Large program size limit is available

  # bpftool feature probe macros
  [...]
  /*** eBPF misc features ***/
  #define HAVE_HAVE_LARGE_INSN_LIMIT

  # bpftool feature probe -j | jq '.["misc"]'
  {
    "have_large_insn_limit": true
  }

Signed-off-by: Michal Rostecki <mrostecki@opensuse.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Link: https://lore.kernel.org/bpf/20200108162428.25014-3-mrostecki@opensuse.org

libbpf: Add probe for large INSN limit

Introduce a new probe which checks whether kernel has large maximum
program size which was increased in the following commit:

c04c0d2b968a ("bpf: increase complexity limit and maximum program size")

Based on the similar check in Cilium[0], authored by Daniel Borkmann.

[0] https://github.com/cilium/cilium/commit/657d0f585afd26232cfa5d4e70b6f64d2ea91596

Co-authored-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Michal Rostecki <mrostecki@opensuse.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Link: https://lore.kernel.org/bpf/20200108162428.25014-2-mrostecki@opensuse.org

ptp: clockmatrix: Rework clockmatrix version information.

Simplify and fix the version information displayed by the driver.
The new info better relects what is needed to support the hardware.

Prev:
Version: 4.8.0, Pipeline 22169 0x4001, Rev 0, Bond 5, CSR 311, IRQ 2

New:
Version: 4.8.0, Id: 0x4001  Hw Rev: 5  OTP Config Select: 15

- Remove pipeline, CSR and IRQ because version x.y.z already incorporates
  this information.
- Remove bond number because it is not used.
- Remove rev number because register was not implemented, always 0
- Add HW Rev ID register to replace rev number
- Add OTP config select to show the user configuration chosen by
  the configurable GPIO pins on start-up

Signed-off-by: Vincent Cheng <vincent.cheng.xh@renesas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

enetc: Fix inconsistent IS_ERR and PTR_ERR

The proper pointer to be passed as argument is hw
Detected using Coccinelle.

Fixes: 6517798dd343 ("enetc: Make MDIO accessors more generic and export to include/linux/fsl")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

enetc: Fix an off by one in enetc_setup_tc_txtime()

The priv->tx_ring[] has 16 elements but only priv->num_tx_rings are
set up, the rest are NULL. This ">" comparison should be ">=" to avoid
a potential crash.

Fixes: 0d08c9ec7d6e ("enetc: add support time specific departure base on the qos etf")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'Documentation-stmmac-documentation-improvements'

Jose Abreu says:

====================
Documentation: stmmac documentation improvements

Converts stmmac documentation to RST format.

1) Adds missing entry of stmmac documentation to MAINTAINERS.

2) Converts stmmac documentation to RST format and adds some new info.

3) Adds the new RST file to the list of files.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Documentation: networking: Add stmmac to device drivers list

Add the stmmac RST file to the index.

Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Documentation: networking: Convert stmmac documentation to RST format

Convert the documentation of the driver to RST format and delete the old
txt and old information that no longer applies.

Also, add some new information.

Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

MAINTAINERS: Add stmmac Ethernet driver documentation entry

Add the missing entry for the file that documents the stmicro Ethernet
driver stmmac.

Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

drivers: net: cisco_hdlc: use __func__ in debug message

Use __func__ to print the function name instead of hard coded string.
BTW, replace printk(KERN_DEBUG, ...) with netdev_dbg.

Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-ch9200-code-cleanup'

Chen Zhou says:

====================
net: ch9200: code cleanup

patch 1 introduce __func__ in debug message.
patch 2 remove unnecessary return.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: ch9200: remove unnecessary return

The return is not needed, remove it.

Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ch9200: use __func__ in debug message

Use __func__ to print the function name instead of hard coded string.

Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ionic-driver-updates'

Shannon Nelson says:

====================
ionic: driver updates

These are a few little updates for the ionic network driver.

v2: dropped IBM msi patch
added fix for a compiler warning
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ionic: clear compiler warning on hb use before set

Build checks have pointed out that 'hb' can theoretically
be used before set, so let's initialize it and get rid
of the compiler complaint.

Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>

ionic: restrict received packets to mtu size

Make sure the NIC drops packets that are larger than the
specified MTU.

The front end of the NIC will accept packets larger than MTU and
will copy all the data it can to fill up the driver's posted
buffers - if the buffers are not long enough the packet will
then get dropped. With the Rx SG buffers allocagted as full
pages, we are currently setting up more space than MTU size
available and end up receiving some packets that are larger
than MTU, up to the size of buffers posted. To be sure the
NIC doesn't waste our time with oversized packets we need to
lie a little in the SG descriptor about how long is the last
SG element.

At dealloc time, we know the allocation was a page, so the
deallocation doesn't care about what length we put in the
descriptor.

Signed-off-by: Shannon Nelson <snelson@pensando.io>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

ionic: add Rx dropped packet counter

Add a counter for packets dropped by the driver, typically
for bad size or a receive error seen by the device.

Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>

ionic: drop use of subdevice tags

The subdevice concept is not being used in the driver, so
drop the references to it.

Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch '1GbE' of git://git./linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
1GbE Intel Wired LAN Driver Updates 2020-01-06

This series contains updates to igc to add basic support for
timestamping.

Vinicius adds basic support for timestamping and enables ptp4l/phc2sys
to work with i225 devices.  Initially, adds the ability to read and
adjust the PHC clock.  Patches 2 & 3 enable and retrieve hardware
timestamps.  Patch 4 implements the ethtool ioctl that ptp4l uses to
check what timestamping methods are supported.  Lastly, added support to
do timestamping using the "Start of Packet" signal from the PHY, which
is now supported in i225 devices.

While i225 does support multiple PTP domains, with multiple timestamping
registers, we currently only support one PTP domain and use only one of
the timestamping registers for implementation purposes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'Unique-mv88e6xxx-IRQ-names'

Andrew Lunn says:

====================
Unique mv88e6xxx IRQ names

There are a few boards which have multiple mv88e6xxx switches. With
such boards, it can be hard to determine which interrupts belong to
which switches. Make the interrupt names unique by including the
device name in the interrupt name. For the SERDES interrupt, also
include the port number. As a result of these patches ZII devel C
looks like:

50:          0  gpio-vf610  27 Level     mv88e6xxx-0.1:00
54:          0  mv88e6xxx-g1   3 Edge      mv88e6xxx-0.1:00-g1-atu-prob
56:          0  mv88e6xxx-g1   5 Edge      mv88e6xxx-0.1:00-g1-vtu-prob
58:          0  mv88e6xxx-g1   7 Edge      mv88e6xxx-0.1:00-g2
61:          0  mv88e6xxx-g2   1 Edge      !mdio-mux!mdio@1!switch@0!mdio:01
62:          0  mv88e6xxx-g2   2 Edge      !mdio-mux!mdio@1!switch@0!mdio:02
63:          0  mv88e6xxx-g2   3 Edge      !mdio-mux!mdio@1!switch@0!mdio:03
64:          0  mv88e6xxx-g2   4 Edge      !mdio-mux!mdio@1!switch@0!mdio:04
70:          0  mv88e6xxx-g2  10 Edge      mv88e6xxx-0.1:00-serdes-10
75:          0  mv88e6xxx-g2  15 Edge      mv88e6xxx-0.1:00-watchdog
76:          5  gpio-vf610  26 Level     mv88e6xxx-0.2:00
80:          0  mv88e6xxx-g1   3 Edge      mv88e6xxx-0.2:00-g1-atu-prob
82:          0  mv88e6xxx-g1   5 Edge      mv88e6xxx-0.2:00-g1-vtu-prob
84:          4  mv88e6xxx-g1   7 Edge      mv88e6xxx-0.2:00-g2
87:          2  mv88e6xxx-g2   1 Edge      !mdio-mux!mdio@2!switch@0!mdio:01
88:          0  mv88e6xxx-g2   2 Edge      !mdio-mux!mdio@2!switch@0!mdio:02
89:          0  mv88e6xxx-g2   3 Edge      !mdio-mux!mdio@2!switch@0!mdio:03
90:          0  mv88e6xxx-g2   4 Edge      !mdio-mux!mdio@2!switch@0!mdio:04
95:          3  mv88e6xxx-g2   9 Edge      mv88e6xxx-0.2:00-serdes-9
96:          0  mv88e6xxx-g2  10 Edge      mv88e6xxx-0.2:00-serdes-10
101:          0  mv88e6xxx-g2  15 Edge      mv88e6xxx-0.2:00-watchdog

Interrupt names like !mdio-mux!mdio@2!switch@0!mdio:01 are created by
phylib for the integrated PHYs. The mv88e6xxx driver does not
determine these names.
====================

Tested-by: Chris Healy <cphealy@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Unique ATU and VTU IRQ names

Dynamically generate a unique interrupt name for the VTU and ATU,
based on the device name.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Unique g2 IRQ name

Dynamically generate a unique g2 interrupt name, based on the
device name.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Unique watchdog IRQ name

Dynamically generate a unique watchdog interrupt name, based on the
device name.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Unique SERDES interrupt names

Dynamically generate a unique SERDES interrupt name, based on the
device name and the port the SERDES is for. For example:

95: 3 mv88e6xxx-g2 9 Edge mv88e6xxx-0.2:00-serdes-9
96: 0 mv88e6xxx-g2 10 Edge mv88e6xxx-0.2:00-serdes-10

The 0.2:00 indicates the switch and -9 indicates port 9.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Unique IRQ name

Dynamically generate a unique switch interrupt name, based on the
device name.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

igc: Use Start of Packet signal from PHY for timestamping

For better accuracy, i225 is able to do timestamping using the Start of
Packet signal from the PHY.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igc: Add support for ethtool GET_TS_INFO command

This command allows igc to report what types of timestamping are
supported. ptp4l uses this to detect if the hardware supports
timestamping.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igc: Add support for TX timestamping

This adds support for timestamping packets being transmitted.

Based on the code from i210. The basic differences is that i225 has 4
registers to store the transmit timestamps (i210 has one). Right now,
we only support retrieving from one register, support for using the
other registers will be added later.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igc: Add support for RX timestamping

This adds support for timestamping received packets.

It is based on the i210, as many features of i225 work the same way.
The main difference from i210 is that i225 has support for choosing
the timer register to use when timestamping packets. Right now, we
only support using timer 0. The other difference is that i225 stores
two timestamps in the receive descriptor, right now, we only retrieve
one.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Merge branch 'ethtool-allow-nesting-of-begin-and-complete-callbacks'

Michal Kubecek says:

====================
ethtool: allow nesting of begin() and complete() callbacks

The ethtool ioctl interface used to guarantee that ethtool_ops callbacks
were always called in a block between calls to ->begin() and ->complete()
(if these are defined) and that this whole block was executed with RTNL
lock held:

rtnl_lock();
ops->begin();
/* other ethtool_ops calls */
ops->complete();
rtnl_unlock();

This prevented any nesting or crossing of the begin-complete blocks.
However, this is no longer guaranteed even for ioctl interface as at least
ethtool_phys_id() releases RTNL lock while waiting for a timer. With the
introduction of netlink ethtool interface, the begin-complete pairs are
naturally nested e.g. when a request triggers a netlink notification.

Fortunately, only minority of networking drivers implements begin() and
complete() callbacks and most of those that do, fall into three groups:

  - wrappers for pm_runtime_get_sync() and pm_runtime_put()
  - wrappers for clk_prepare_enable() and clk_disable_unprepare()
  - begin() checks netif_running() (fails if false), no complete()

First two have their own refcounting, third is safe w.r.t. nesting of the
blocks.

Only three in-tree networking drivers need an update to deal with nesting
of begin() and complete() calls: via-velocity and epic100 perform resume
and suspend on their own and wil6210 completely serializes the calls using
its own mutex (which would lead to a deadlock if a request request
triggered a netlink notification). The series addresses these problems.

changes between v1 and v2:
  - fix inverted condition in epic100 ethtool_begin() (thanks to Andrew
    Lunn)
====================

Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

epic100: allow nesting of ethtool_ops begin() and complete()

Unlike most networking drivers using begin() and complete() ethtool_ops
callbacks to resume a device which is down and suspend it again when done,
epic100 does not use standard refcounted infrastructure but sets device
sleep state directly.

With the introduction of netlink ethtool interface, we may have nested
begin-complete blocks so that inner complete() would put the device back to
sleep for the rest of the outer block.

To avoid rewriting an old and not very actively developed driver, just add
a nesting counter and only perform resume and suspend on the outermost
level.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>

via-velocity: allow nesting of ethtool_ops begin() and complete()

Unlike most networking drivers using begin() and complete() ethtool_ops
callbacks to resume a device which is down and suspend it again when done,
via-velocity does not use standard refcounted infrastructure but sets
device sleep state directly.

With the introduction of netlink ethtool interface, we may have nested
begin-complete blocks so that inner complete() would put the device back to
sleep for the rest of the outer block.

To avoid rewriting an old and not very actively developed driver, just add
a nesting counter and only perform resume and suspend on the outermost
level.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>

wil6210: get rid of begin() and complete() ethtool_ops

The wil6210 driver locks a mutex in begin() ethtool_ops callback and
unlocks it in complete() so that all ethtool requests are serialized. This
is not going to work correctly with netlink interface; e.g. when ioctl
triggers a netlink notification, netlink code would call begin() again
while the mutex taken by ioctl code is still held by the same task.

Let's get rid of the begin() and complete() callbacks and move the mutex
locking into the remaining ethtool_ops handlers except get_drvinfo which
only copies strings that are not changing so that there is no need for
serialization.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fcnal-test: Fix vrf argument in local tcp tests

The recent MD5 tests added duplicate configuration in the default VRF.
This change exposed a bug in existing tests designed to verify no
connection when client and server are not in the same domain. The
server should be running bound to the vrf device with the client run
in the default VRF (the -2 option is meant for validating connection
data). Fix the option for both tests.

While technically this is a bug in previous releases, the tests are
properly failing since the default VRF does not have any routing
configuration so there really is no need to backport to prior releases.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

gtp: simplify error handling code in 'gtp_encap_enable()'

'gtp_encap_disable_sock(sk)' handles the case where sk is NULL, so there
is no need to test it before calling the function.

This saves a few line of code.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlxsw-Disable-checks-in-hardware-pipeline'

Ido Schimmel says:

====================
mlxsw: Disable checks in hardware pipeline

Amit says:

The hardware pipeline contains some checks that, by default, are
configured to drop packets. Since the software data path does not drop
packets due to these reasons and since we are interested in offloading
the software data path to hardware, then these checks should be disabled
in the hardware pipeline as well.

This patch set changes mlxsw to disable four of these checks and adds
corresponding selftests. The tests pass both when the software data path
is exercised (using veth pair) and when the hardware data path is
exercised (using mlxsw ports in loopback).
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: forwarding: router: Add test case for destination IP link-local

Add test case to check that packets are not dropped when they need to be
routed and their destination is link-local, i.e., 169.254.0.0/16.

Signed-off-by: Amit Cohen <amitc@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Disable DIP_LINK_LOCAL check in hardware pipeline

The check drops packets if they need to be routed and their destination
IP is link-local, i.e., belongs to 169.254.0.0/16 address range.

Disable the check since the kernel forwards such packets and does not
drop them.

Signed-off-by: Amit Cohen <amitc@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: forwarding: router: Add test case for source IP equals destination IP

Add test case to check that packets are not dropped when they need to be
routed and their source IP equals to their destination IP.

Signed-off-by: Amit Cohen <amitc@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Disable SIP_DIP check in hardware pipeline

The check drops packets if they need to be routed and their source IP
equals to their destination IP.

Disable the check since the kernel forwards such packets and does not
drop them.

Signed-off-by: Amit Cohen <amitc@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: forwarding: router: Add test case for multicast destination MAC mismatch

Add test case to check that packets are not dropped when they need to be
routed and their multicast MAC mismatched to their multicast destination
IP.

i.e., destination IP is multicast and
* for IPV4: DMAC != {01-00-5E-0 (25 bits), DIP[22:0]}
* for IPV6: DMAC != {33-33-0 (16 bits), DIP[31:0]}

Signed-off-by: Amit Cohen <amitc@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Disable MC_DMAC check in hardware pipeline

The check drops packets if they need to be routed and their multicast
MAC mismatched to their multicast destination IP.

For IPV4:
DMAC is mismatched if it is different from {01-00-5E-0 (25 bits),
DIP[22:0]}

For IPV6:
DMAC is mismatched if it is different from {33-33-0 (16 bits),
DIP[31:0]}

Disable the check since the kernel forwards such packets and does not
drop them.

Signed-off-by: Amit Cohen <amitc@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: forwarding: router: Add test case for source IP in class E

Add test case to check that packets are not dropped when they need to be
routed and their source IP in class E, (i.e., 240.0.0.0 –
255.255.255.254).

Signed-off-by: Amit Cohen <amitc@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Disable SIP_CLASS_E check in hardware pipeline

The check drops packets if they need to be routed and their source IP is
from class E, i.e., belongs to 240.0.0.0/4 address range, but different
from 255.255.255.255.

Disable the check since the kernel forwards such packets and does not
drop them.

Signed-off-by: Amit Cohen <amitc@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

igc: Add basic skeleton for PTP

This allows the creation of the /dev/ptpX device for i225, and reading
and writing the time.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Merge branch 'hns3-next'

Huazhong Tan says:

====================
net: hns3: misc updates for -net-next

This series includes some misc updates for the HNS3 ethernet driver.

[patch 1] adds trace events support.
[patch 2] re-organizes TQP's vector handling.
[patch 3] renames the name of TQP vector.
[patch 4] rewrites a log in the hclge_map_ring_to_vector().
[patch 5] modifies the name of misc IRQ vector.
[patch 6] handles the unexpected speed 0 return from HW.
[patch 7] replaces an unsuitable variable type.
[patch 8] modifies an unsuitable reset level for HW error.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: modify an unsuitable reset level for hardware error

According to hardware user manual, when hardware reports error
'roc_pkt_without_key_port', the driver should assert function
reset to do the recovery.

So this patch uses HNAE3_FUNC_RESET to replace HNAE3_GLOBAL_RESET.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: replace an unsuitable variable type in hclge_inform_reset_assert_to_vf()

In hclge_inform_reset_assert_to_vf(), variable reset_type(enum type)
will be copied into msg_data whose size is 2 bytes. Currently, hip08
is a little-endian machine, so the lower two bytes of reset_type will
be copied to msg_data. But when running on a big-endian machine,
msg_data will have a wrong value(the higher two bytes of reset_type).

So this patch modifies the type of reset_type to u16, and adds a
build check in case enum hnae3_reset_type has value larger than
U16_MAX.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: add protection when get SFP speed as 0

In some case, the MAC speed get from hardware maybe 0, it should
not be set to mac->speed.

Signed-off-by: Guojia Liao <liaoguojia@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: modify the IRQ name of misc vectors

The misc IRQ of all the devices have the same name, so it's
hard to find the right misc IRQ of the device.

This patch modifies the misc IRQ names as "hclge/hclgevf"-misc-
"pci name". And now the IRQ name is not related to net device
name anymore, so change the HNAE3_INT_NAME_LEN to 32 bytes, and
that is enough.

Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: modify an unsuitable log in hclge_map_ring_to_vector()

When the returned vector_id less than 0, the message should print
out the vector who is getting vector index fail.

So this patch replaces vector_id with vector, and re-format the
message.

Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: modify the IRQ name of TQP vector

When rename the net devices, the IRQ number can not be
fetched by the net device name, because the driver request
the IRQ resources only when the vector resource changed, and
the rename operation did not change the vector resources,
so the IRQ name keeps the previous net device name.
So this patch modifies the name of the TQP IRQ as
"pci driver name"-"pci name"-"TxRx"-"index".

Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: re-organize vector handle

To prevent loss user's IRQ affinity configuration when DOWN,
this patch moves out release/request operation of the vector
handle from net DOWN/UP, just do it when vector resource changes.

Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: add trace event support for HNS3 driver

This adds trace support for HNS3 driver. It also declares
some events which could be used to trace the events when a
TX/RX BD is processed, and other events which are related to
the processing of sk_buff, such as TSO, GRO.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'Convert-Felix-DSA-switch-to-PHYLINK'

Vladimir Oltean says:

====================
Convert Felix DSA switch to PHYLINK

Unlike most other conversions, this one is not by far a trivial one, and should
be seen as "Layerscape PCS meets PHYLINK". Actually, the PCS doesn't
need a lot of hand-holding and most of our other devices 'just work'
(this one included) without any sort of operating system awareness, just
an initialization procedure done typically in the bootloader.
Our issues start when the PCS stops from "just working", and that is
where PHYLINK comes in handy.

The PCS is not specific to the Vitesse / Microsemi / Microchip switching core
at all. Variations of this SerDes/PCS design can also be found on DPAA1 and
DPAA2 hardware.

The main idea of the abstraction provided is that the PCS looks so much like a
PHY device, that we model it as an actual PHY device and run the generic PHY
functions on it, where appropriate.

The 4xSGMII, QSGMII and QSXGMII modes are fairly straightforward.

The SerDes protocol which the driver calls 2500Base-X mode (a misnomer) is more
interesting. There is a description of how it works and what can be done with
it in patch 9/9 (in a comment above vsc9959_pcs_init_2500basex).
In short, it is a fixed speed protocol with no auto-negotiation whatsoever.
From my research of the SGMII-2500 patent [1], it has nothing to do with
SGMII-2500. That one:
* does not define any change to the AN base page compared to plain 10/100/1000
  SGMII. This implies that the 2500 speed is not negotiable, but the other
  speeds are. In our case, when the SerDes is configured for this protocol it's
  configured for good, there's no going back to SGMII.
* runs at a higher base frequency than regular SGMII. So SGMII-2500 operating
  at 1000 Mbps wouldn't interoperate with plain SGMII at 1000 Mbps. Strange,
  but ok..
* Emulates lower link speeds than 2500 by duplicating the codewords twice, then
  thrice, then twice again etc (2.5/25/250 times on average). The Layerscape
  PCS doesn't do that (it is fixed at 2500 Mbaud).

But on the other hand it isn't completely compatible with Base-X either,
since it doesn't do 802.3z / clause 37 auto negotiation (flow control,
local/remote fault etc). It is compatible with 2500Base-X without
in-band AN, and that is exactly how we decided to expose it (this is
actually similar to what others do).

For SGMII and USXGMII, the driver is using the PHYLINK 'managed =
"in-band-status"' DTS binding to figure out whether in-band AN is
expected to be enabled in the PCS or not. It is expected that the
attached PHY follows suite, but there is a gap here: the PHY driver does
not react to this setting, so only one of "AN on" and "AN off" works on
any particular PHY, even though that PHY might support bypassing the
SGMII AN process, as is the case on the VSC8514 PHY present on the
LS1028A-RDB board. A separate series will be sent to propose a way to
deal with that.

I dropped the Ocelot PHYLINK conversion because:
* I don't have VSC7514 hardware anyway
* The hardware is so different in this regard that there's almost nothing to
  share anyway.

Changes in v5:

- Added the register write to DEV_CLOCK_CFG back in
  felix_phylink_mac_config in patch 9/9.

Changes in v4:

- This is mostly a resend of v3, with the only notable change that I've
  dropped the PHY core patches for in_band_autoneg and I'll propose them
  independently.

v1 series:
https://www.spinics.net/lists/netdev/msg613869.html

RFC v2 series:
https://www.spinics.net/lists/netdev/msg620128.html

v3 series:
https://www.spinics.net/lists/netdev/msg622060.html

v4 series:
https://www.spinics.net/lists/netdev/msg622606.html

[0]: https://www.spinics.net/lists/netdev/msg613869.html
[1]: https://patents.google.com/patent/US7356047B1/en
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: felix: Add PCS operations for PHYLINK

Layerscape SoCs traditionally expose the SerDes configuration/status for
Ethernet protocols (PCS for SGMII/USXGMII/10GBase-R etc etc) in a register
format that is compatible with clause 22 or clause 45 (depending on
SerDes protocol). Each MAC has its own internal MDIO bus on which there
is one or more of these PCS's, responding to commands at a configurable
PHY address. The per-port internal MDIO bus (which is just for PCSs) is
totally separate and has nothing to do with the dedicated external MDIO
controller (which is just for PHYs), but the register map for the MDIO
controller is the same.

The VSC9959 (Felix) switch instantiated in the LS1028A is integrated
in hardware with the ENETC PCS of its DSA master, and reuses its MDIO
controller driver, so Felix has been made to depend on it in Kconfig.

+------------------------------------------------------------------------+
|                   +--------+ GMII (typically disabled via RCW)         |
| ENETC PCI         |  ENETC |--------------------------+                |
| Root Complex      | port 3 |-----------------------+  |                |
| Integrated        +--------+                       |  |                |
| Endpoint                                           |  |                |
|                   +--------+ 2.5G GMII             |  |                |
|                   |  ENETC |--------------+        |  |                |
|                   | port 2 |-----------+  |        |  |                |
|                   +--------+           |  |        |  |                |
|                                     +--------+  +--------+             |
|                                     |  Felix |  |  Felix |             |
|                                     | port 4 |  | port 5 |             |
|                                     +--------+  +--------+             |
|                                                                        |
| +--------+  +--------+  +--------+  +--------+  +--------+  +--------+ |
| |  ENETC |  |  ENETC |  |  Felix |  |  Felix |  |  Felix |  |  Felix | |
| | port 0 |  | port 1 |  | port 0 |  | port 1 |  | port 2 |  | port 3 | |
+------------------------------------------------------------------------+
|    ||||  SerDes |          ||||        ||||        ||||        ||||    |
| +--------+block |       +--------------------------------------------+ |
| |  ENETC |      |       |       ENETC port 2 internal MDIO bus       | |
| | port 0 |      |       |  PCS         PCS          PCS        PCS   | |
| |   PCS  |      |       |   0           1            2          3    | |
+-----------------|------------------------------------------------------+
        v          v           v           v            v          v
     SGMII/      RGMII    QSGMII/QSXGMII/4xSGMII/4x1000Base-X/4x2500Base-X
    USXGMII/   (bypasses
  1000Base-X/   SerDes)
  2500Base-X

In the LS1028A SoC described above, the VSC9959 Felix switch is PF5 of
the ENETC root complex, and has 2 BARs:
- BAR 4: the switch's effective registers
- BAR 0: the MDIO controller register map lended from ENETC port 2
         (PF2), for accessing its associated PCS's.

This explanation is necessary because the patch does some renaming
"pci_bar" -> "switch_pci_bar" for clarity, which would otherwise appear
a bit obtuse.

The fact that the internal MDIO bus is "borrowed" is relevant because
the register map is found in PF5 (the switch) but it triggers an access
fault if PF2 (the ENETC DSA master) is not enabled. This is not treated
in any way (and I don't think it can be treated).

All of this is so SoC-specific, that it was contained as much as
possible in the platform-integration file felix_vsc9959.c.

We need to parse and pre-validate the device tree because of 2 reasons:
- The PHY mode (SerDes protocol) cannot change at runtime due to SoC
  design.
- There is a circular dependency in that we need to know what clause the
  PCS speaks in order to find it on the internal MDIO bus. But the
  clause of the PCS depends on what phy-mode it is configured for.

The goal of this patch is to make steps towards removing the bootloader
dependency for SGMII PCS pre-configuration, as well as to add support
for monitoring the in-band SGMII AN between the PCS and the system-side
link partner (PHY or other MAC).

In practice the bootloader dependency is not completely removed. U-Boot
pre-programs the PHY address at which each PCS can be found on the
internal MDIO bus (MDEV_PORT). This is needed because the PCS of each
port has the same out-of-reset PHY address of zero. The SerDes register
for changing MDEV_PORT is pretty deep in the SoC (outside the addresses
of the ENETC PCI BARs) and therefore inaccessible to us from here.

Felix VSC9959 and Ocelot VSC7514 are integrated very differently in
their respective SoCs, and for that reason Felix does not use the Ocelot
core library for PHYLINK. On one hand we don't want to impose the
fixed phy-mode limitation to Ocelot, and on the other hand Felix doesn't
need to force the MAC link speed the way Ocelot does, since the MAC is
connected to the PCS through a fixed GMII, and the PCS is the one who
does the rate adaptation at lower link speeds, which the MAC does not
even need to know about. In fact changing the GMII speed for Felix
irrecoverably breaks transmission through that port until a reset.

The pair with ENETC port 3 and Felix port 5 is optional and doesn't
support tagging. When we enable it, swp5 is a regular slave port, albeit
an internal one. The trouble is that it doesn't work, and that is
because the DSA PHYLIB adaptation layer doesn't treat fixed-link slave
ports. So that is yet another reason for wanting to convert Felix to the
native PHYLINK API.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mscc: ocelot: export ANA, DEV and QSYS registers to include/soc/mscc

Since the Felix DSA driver is implementing its own PHYLINK instance due
to SoC differences, it needs access to the few registers that are
common, mainly for flow control.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mscc: ocelot: make phy_mode a member of the common struct ocelot_port

The Ocelot switchdev driver and the Felix DSA one need it for different
reasons. Felix (or at least the VSC9959 instantiation in NXP LS1028A) is
integrated with the traditional NXP Layerscape PCS design which does not
support runtime configuration of SerDes protocol. So it needs to
pre-validate the phy-mode from the device tree and prevent PHYLINK from
attempting to change it. For this, it needs to cache it in a private
variable.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

enetc: Set MDIO_CFG_HOLD to the recommended value of 2

This increases the MDIO hold time to 5 enet_clk cycles from the previous
value of 0. This is actually the out-of-reset value, that the driver was
previously overwriting with 0. Zero worked for the external MDIO, but
breaks communication with the internal MDIO buses on which the PCS of
ENETC SI's and Felix switch are found.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

enetc: Make MDIO accessors more generic and export to include/linux/fsl

Within the LS1028A SoC, the register map for the ENETC MDIO controller
is instantiated a few times: for the central (external) MDIO controller,
for the internal bus of each standalone ENETC port, and for the internal
bus of the Felix switch.

Refactoring is needed to support multiple MDIO buses from multiple
drivers. The enetc_hw structure is made an opaque type and a smaller
enetc_mdio_priv is created.

'mdio_base' - MDIO registers base address - is being parameterized, to
be able to work with different MDIO register bases.

The ENETC MDIO bus operations are exported from the fsl-enetc-mdio
kernel object, the same that registers the central MDIO controller (the
dedicated PF). The ENETC main driver has been changed to select it, and
use its exported helpers to further register its private MDIO bus. The
DSA Felix driver will do the same.

Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: Pass pcs_poll flag from driver to PHYLINK

The DSA drivers that implement .phylink_mac_link_state should normally
register an interrupt for the PCS, from which they should call
phylink_mac_change(). However not all switches implement this, and those
who don't should set this flag in dsa_switch in the .setup callback, so
that PHYLINK will poll for a few ms until the in-band AN link timer
expires and the PCS state settles.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phylink: add support for polling MAC PCS

Some MAC PCS blocks are unable to provide interrupts when their status
changes. As we already have support in phylink for polling status, use
this to provide a hook for MACs to enable polling mode.

The patch idea was picked up from Russell King's suggestion on the macb
phylink patch thread here [0] but the implementation was changed.
Instead of introducing a new phylink_start_poll() function, which would
make the implementation cumbersome for common PHYLINK implementations
for multiple types of devices, like DSA, just add a boolean property to
the phylink_config structure, which is just as backwards-compatible.

https://lkml.org/lkml/2019/12/16/603

Suggested-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phylink: make QSGMII a valid PHY mode for in-band AN

QSGMII is a SerDes protocol clocked at 5 Gbaud (4 times higher than
SGMII which is clocked at 1.25 Gbaud), with the same 8b/10b encoding and
some extra symbols for synchronization. Logically it offers 4 SGMII
interfaces multiplexed onto the same physical lanes. Each MAC PCS has
its own in-band AN process with the system side of the QSGMII PHY, which
is identical to the regular SGMII AN process.

So allow QSGMII as a valid in-band AN mode, since it is no different
from software perspective from regular SGMII.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mii: Add helpers for parsing SGMII auto-negotiation

Typically a MAC PCS auto-configures itself after it receives the
negotiated copper-side link settings from the PHY, but some MAC devices
are more special and need manual interpretation of the SGMII AN result.

In other cases, the PCS exposes the entire tx_config_reg base page as it
is transmitted on the wire during auto-negotiation, so it makes sense to
be able to decode the equivalent lp_advertised bit mask from the raw u16
(of course, "lp" considering the PCS to be the local PHY).

Therefore, add the bit definitions for the SGMII registers 4 and 5
(local device ability, link partner ability), as well as a link_mode
conversion helper that can be used to feed the AN results into
phy_resolve_aneg_linkmode.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'dsa-deferred-xmit'

Vladimir Oltean says:

====================
Improvements to the DSA deferred xmit

After the feedback received on v1:
https://www.spinics.net/lists/netdev/msg622617.html

I've decided to move the deferred xmit implementation completely within
the sja1105 driver.

The executive summary for this series is the same as it was for v1
(better for everybody):

- For those who don't use it, thanks to one less assignment in the
  hotpath (and now also thanks to less code in the DSA core)
- For those who do, by making its scheduling more amenable and moving it
  outside the generic workqueue (since it still deals with packet
  hotpath, after all)

There are some simplification (1/3) and cosmetic (3/3) patches in the
areas next to the code touched by the main patch (2/3).
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: tag_sja1105: Slightly improve the Xmas tree in sja1105_xmit

This is a cosmetic patch that makes the dp, tx_vid, queue_mapping and
pcp local variable definitions a bit closer in length, so they don't
look like an eyesore as much.

The 'ds' variable is not used otherwise, except for ds->dp.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: Make deferred_xmit private to sja1105

There are 3 things that are wrong with the DSA deferred xmit mechanism:

1. Its introduction has made the DSA hotpath ever so slightly more
   inefficient for everybody, since DSA_SKB_CB(skb)->deferred_xmit needs
   to be initialized to false for every transmitted frame, in order to
   figure out whether the driver requested deferral or not (a very rare
   occasion, rare even for the only driver that does use this mechanism:
   sja1105). That was necessary to avoid kfree_skb from freeing the skb.

2. Because L2 PTP is a link-local protocol like STP, it requires
   management routes and deferred xmit with this switch. But as opposed
   to STP, the deferred work mechanism needs to schedule the packet
   rather quickly for the TX timstamp to be collected in time and sent
   to user space. But there is no provision for controlling the
   scheduling priority of this deferred xmit workqueue. Too bad this is
   a rather specific requirement for a feature that nobody else uses
   (more below).

3. Perhaps most importantly, it makes the DSA core adhere a bit too
   much to the NXP company-wide policy "Innovate Where It Doesn't
   Matter". The sja1105 is probably the only DSA switch that requires
   some frames sent from the CPU to be routed to the slave port via an
   out-of-band configuration (register write) rather than in-band (DSA
   tag). And there are indeed very good reasons to not want to do that:
   if that out-of-band register is at the other end of a slow bus such
   as SPI, then you limit that Ethernet flow's throughput to effectively
   the throughput of the SPI bus. So hardware vendors should definitely
   not be encouraged to design this way. We do _not_ want more
   widespread use of this mechanism.

Luckily we have a solution for each of the 3 issues:

For 1, we can just remove that variable in the skb->cb and counteract
the effect of kfree_skb with skb_get, much to the same effect. The
advantage, of course, being that anybody who doesn't use deferred xmit
doesn't need to do any extra operation in the hotpath.

For 2, we can create a kernel thread for each port's deferred xmit work.
If the user switch ports are named swp0, swp1, swp2, the kernel threads
will be named swp0_xmit, swp1_xmit, swp2_xmit (there appears to be a 15
character length limit on kernel thread names). With this, the user can
change the scheduling priority with chrt $(pidof swp2_xmit).

For 3, we can actually move the entire implementation to the sja1105
driver.

So this patch deletes the generic implementation from the DSA core and
adds a new one, more adequate to the requirements of PTP TX
timestamping, in sja1105_main.c.

Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: sja1105: Always send through management routes in slot 0

I finally found out how the 4 management route slots are supposed to
be used, but.. it's not worth it.

The description from the comment I've just deleted in this commit is
still true: when more than 1 management slot is active at the same time,
the switch will match frames incoming [from the CPU port] on the lowest
numbered management slot that matches the frame's DMAC.

My issue was that one was not supposed to statically assign each port a
slot. Yes, there are 4 slots and also 4 non-CPU ports, but that is a
mere coincidence.

Instead, the switch can be used like this: every management frame gets a
slot at the right of the most recently assigned slot:

Send mgmt frame 1 through S0:    S0 x  x  x
Send mgmt frame 2 through S1:    S0 S1 x  x
Send mgmt frame 3 through S2:    S0 S1 S2 x
Send mgmt frame 4 through S3:    S0 S1 S2 S3

The difference compared to the old usage is that the transmission of
frames 1-4 doesn't need to wait until the completion of the management
route. It is safe to use a slot to the right of the most recently used
one, because by protocol nobody will program a slot to your left and
"steal" your route towards the correct egress port.

So there is a potential throughput benefit here.

But mgmt frame 5 has no more free slot to use, so it has to wait until
_all_ of S0, S1, S2, S3 are full, in order to use S0 again.

And that's actually exactly the problem: I was looking for something
that would bring more predictable transmission latency, but this is
exactly the opposite: 3 out of 4 frames would be transmitted quicker,
but the 4th would draw the short straw and have a worse worst-case
latency than before.

Useless.

Things are made even worse by PTP TX timestamping, which is something I
won't go deeply into here. Suffice to say that the fact there is a
driver-level lock on the SPI bus offsets any potential throughput gains
that parallelism might bring.

So there's no going back to the multi-slot scheme, remove the
"mgmt_slot" variable from sja1105_port and the dummy static assignment
made at probe time.

While passing by, also remove the assignment to casc_port altogether.
Don't pretend that we support cascaded setups.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'Fix-10G-PHY-interface-types'

Russell King says:

====================
Fix 10G PHY interface types

Recent discussion has revealed that our current usage of the 10GKR
phy_interface_t is not correct. This is based on a misunderstanding
caused in part by the various specifications being difficult to
obtain. Now that a better understanding has been reached, we ought
to correct this.

This series introduce PHY_INTERFACE_MODE_10GBASER to replace the
existing usage of 10GKR mode, and document their differences in the
phylib documentation. Then switch PHY, SFP/phylink, the Marvell
PP2 network driver, and its associated comphy driver over to use
the correct interface mode. None of the existing platform usage
was actually using 10GBASE-KR.

In order to maintain compatibility with existing DT files, arrange
for the Marvell PP2 driver to rewrite the phy interface mode; this
allows other drivers to adopt correct behaviour w.r.t whether the
10G connection conforms to the backplane 10GBASE-KR protocol vs
normal 10GBASE-R protocol.

After applying these locally to net-next I've validated that the
only places which mention the old PHY_INTERFACE_MODE_10GKR
definition are:

Documentation/networking/phy.rst:``PHY_INTERFACE_MODE_10GKR``
drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c:        if (phy_mode == PHY_INTERFACE_MODE_10GKR)
drivers/net/phy/aquantia_main.c:                phydev->interface = PHY_INTERFACE_MODE_10GKR;
drivers/net/phy/aquantia_main.c:            phydev->interface != PHY_INTERFACE_MODE_10GKR &&
include/linux/phy.h:    PHY_INTERFACE_MODE_10GKR,
include/linux/phy.h:    case PHY_INTERFACE_MODE_10GKR:

which is as expected.  The only users of "10gbase-kr" in DT are:

arch/arm64/boot/dts/marvell/armada-7040-db.dts: phy-mode = "10gbase-kr";
arch/arm64/boot/dts/marvell/armada-8040-clearfog-gt-8k.dts:     phy-mode = "10gbase-kr";
arch/arm64/boot/dts/marvell/armada-8040-db.dts: phy-mode = "10gbase-kr";
arch/arm64/boot/dts/marvell/armada-8040-db.dts: phy-mode = "10gbase-kr";
arch/arm64/boot/dts/marvell/armada-8040-mcbin-singleshot.dts:   phy-mode = "10gbase-kr";
arch/arm64/boot/dts/marvell/armada-8040-mcbin-singleshot.dts:   phy-mode = "10gbase-kr";
arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts:      phy-mode = "10gbase-kr";arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts:      phy-mode = "10gbase-kr";arch/arm64/boot/dts/marvell/cn9130-db.dts:      phy-mode = "10gbase-kr";
arch/arm64/boot/dts/marvell/cn9131-db.dts:      phy-mode = "10gbase-kr";
arch/arm64/boot/dts/marvell/cn9132-db.dts:      phy-mode = "10gbase-kr";

which all use the mvpp2 driver, and these will be updated in a
separate patch to be submitted in the following kernel cycle.

v2: add comment to mvpp2 driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: switch to using PHY_INTERFACE_MODE_10GBASER rather than 10GKR

Switch network drivers, phy drivers, and SFP/phylink over to use the
more correct 10GBASE-R, rather than 10GBASE-KR. 10GBASE-KR is backplane
ethernet, which is 10GBASE-R with autonegotiation on top, which our
current usage on the affected platforms does not have.

The only remaining user of PHY_INTERFACE_MODE_10GKR is the Aquantia
PHY, which has a separate mode for 10GBASE-KR.

For Marvell mvpp2, we detect 10GBASE-KR, and rewrite it to 10GBASE-R
for compatibility with existing DT - this is the only network driver
at present that makes use of PHY_INTERFACE_MODE_10GKR.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: add PHY_INTERFACE_MODE_10GBASER

Recent discussion has revealed that the use of PHY_INTERFACE_MODE_10GKR
is incorrect. Add a 10GBASE-R definition, document both the -R and -KR
versions, and the fact that 10GKR was used incorrectly.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ionic-add-sriov-support'

Shannon Nelson says:

====================
ionic: add sriov support

Set up the basic support for enabling SR-IOV devices in the
ionic driver.  Since most of the management work happens in
the NIC firmware, the driver becomes mostly a pass-through
for the network stack commands that want to control and
configure the VFs.

v4: changed "vf too big" checks to use pci_num_vf()
changed from vf[] array of pointers of individually allocated
  vf structs to single allocated vfs[] array of vf structs
added clean up of vfs[] on probe fail
added setup for vf stats dma

v3: added check in probe for pre-existing VFs
split out the alloc and dealloc of vf structs to better deal
  with pre-existing VFs (left enabled on remove)
restored the checks for vf too big because of a potential
  case where VFs are already enabled but driver failed to
  alloc the vf structs

v2: use pci_num_vf() and kcalloc()
remove checks for vf too big
add locking for the VF operations
disable VFs in ionic_remove() if they are still running
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ionic: support sr-iov operations

Add the netdev ops for managing VFs. Since most of the
management work happens in the NIC firmware, the driver becomes
mostly a pass-through for the network stack commands that want
to control and configure the VFs.

We also tweak ionic_station_set() a little to allow for
the VFs that start off with a zero'd mac address.

Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>

ionic: ionic_if bits for sr-iov support

Adds new AdminQ calls and their related structs for
supporting PF controls on VFs:
CMD_OPCODE_VF_GETATTR
CMD_OPCODE_VF_SETATTR

Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: sxgbe: Rename Samsung to lowercase

Fix up inconsistent usage of upper and lowercase letters in "Samsung"
name.

"SAMSUNG" is not an abbreviation but a regular trademarked name.
Therefore it should be written with lowercase letters starting with
capital letter.

Although advertisement materials usually use uppercase "SAMSUNG", the
lowercase version is used in all legal aspects (e.g. on Wikipedia and in
privacy/legal statements on
https://www.samsung.com/semiconductor/privacy-global/).

Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>