git.monstr.eu Git - linux-2.6-microblaze.git/log

Merge tag 'for-netdev' of https://git./linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-10-16

We've added 90 non-merge commits during the last 25 day(s) which contain
a total of 120 files changed, 3519 insertions(+), 895 deletions(-).

The main changes are:

1) Add missed stats for kprobes to retrieve the number of missed kprobe
   executions and subsequent executions of BPF programs, from Jiri Olsa.

2) Add cgroup BPF sockaddr hooks for unix sockets. The use case is
   for systemd to reimplement the LogNamespace feature which allows
   running multiple instances of systemd-journald to process the logs
   of different services, from Daan De Meyer.

3) Implement BPF CPUv4 support for s390x BPF JIT, from Ilya Leoshkevich.

4) Improve BPF verifier log output for scalar registers to better
   disambiguate their internal state wrt defaults vs min/max values
   matching, from Andrii Nakryiko.

5) Extend the BPF fib lookup helpers for IPv4/IPv6 to support retrieving
   the source IP address with a new BPF_FIB_LOOKUP_SRC flag,
   from Martynas Pumputis.

6) Add support for open-coded task_vma iterator to help with symbolization
   for BPF-collected user stacks, from Dave Marchevsky.

7) Add libbpf getters for accessing individual BPF ring buffers which
   is useful for polling them individually, for example, from Martin Kelly.

8) Extend AF_XDP selftests to validate the SHARED_UMEM feature,
   from Tushar Vyavahare.

9) Improve BPF selftests cross-building support for riscv arch,
   from Björn Töpel.

10) Add the ability to pin a BPF timer to the same calling CPU,
   from David Vernet.

11) Fix libbpf's bpf_tracing.h macros for riscv to use the generic
   implementation of PT_REGS_SYSCALL_REGS() to access syscall arguments,
   from Alexandre Ghiti.

12) Extend libbpf to support symbol versioning for uprobes, from Hengqi Chen.

13) Fix bpftool's skeleton code generation to guarantee that ELF data
    is 8 byte aligned, from Ian Rogers.

14) Inherit system-wide cpu_mitigations_off() setting for Spectre v1/v4
    security mitigations in BPF verifier, from Yafang Shao.

15) Annotate struct bpf_stack_map with __counted_by attribute to prepare
    BPF side for upcoming __counted_by compiler support, from Kees Cook.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (90 commits)
  bpf: Ensure proper register state printing for cond jumps
  bpf: Disambiguate SCALAR register state output in verifier logs
  selftests/bpf: Make align selftests more robust
  selftests/bpf: Improve missed_kprobe_recursion test robustness
  selftests/bpf: Improve percpu_alloc test robustness
  selftests/bpf: Add tests for open-coded task_vma iter
  bpf: Introduce task_vma open-coded iterator kfuncs
  selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c
  bpf: Don't explicitly emit BTF for struct btf_iter_num
  bpf: Change syscall_nr type to int in struct syscall_tp_t
  net/bpf: Avoid unused "sin_addr_len" warning when CONFIG_CGROUP_BPF is not set
  bpf: Avoid unnecessary audit log for CPU security mitigations
  selftests/bpf: Add tests for cgroup unix socket address hooks
  selftests/bpf: Make sure mount directory exists
  documentation/bpf: Document cgroup unix socket address hooks
  bpftool: Add support for cgroup unix socket address hooks
  libbpf: Add support for cgroup unix socket address hooks
  bpf: Implement cgroup sockaddr hooks for unix sockets
  bpf: Add bpf_sock_addr_set_sun_path() to allow writing unix sockaddr from bpf
  bpf: Propagate modified uaddrlen from cgroup sockaddr programs
  ...
====================

Link: https://lore.kernel.org/r/20231016204803.30153-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

page_pool: fragment API support for 32-bit arch with 64-bit DMA

Currently page_pool_alloc_frag() is not supported in 32-bit
arch with 64-bit DMA because of the overlap issue between
pp_frag_count and dma_addr_upper in 'struct page' for those
arches, which seems to be quite common, see [1], which means
driver may need to handle it when using fragment API.

It is assumed that the combination of the above arch with an
address space >16TB does not exist, as all those arches have
64b equivalent, it seems logical to use the 64b version for a
system with a large address space. It is also assumed that dma
address is page aligned when we are dma mapping a page aligned
buffer, see [2].

That means we're storing 12 bits of 0 at the lower end for a
dma address, we can reuse those bits for the above arches to
support 32b+12b, which is 16TB of memory.

If we make a wrong assumption, a warning is emitted so that
user can report to us.

1. https://lore.kernel.org/all/20211117075652.58299-1-linyunsheng@huawei.com/
2. https://lore.kernel.org/all/20230818145145.4b357c89@kernel.org/

Tested-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
CC: Lorenzo Bianconi <lorenzo@kernel.org>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Liang Chen <liangchen.linux@gmail.com>
CC: Guillaume Tucker <guillaume.tucker@collabora.com>
CC: Matthew Wilcox <willy@infradead.org>
CC: Linux-MM <linux-mm@kvack.org>
Link: https://lore.kernel.org/r/20231013064827.61135-2-linyunsheng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stub tcp_gro_complete if CONFIG_INET=n

A few networking drivers including bnx2x, bnxt, qede, and idpf call
tcp_gro_complete as part of offloading TCP GRO. The function is only
defined if CONFIG_INET is true, since its TCP specific and is meaningless
if the kernel lacks IP networking support.

The combination of trying to use the complex network drivers with
CONFIG_NET but not CONFIG_INET is rather unlikely in practice: most use
cases are going to need IP networking.

The tcp_gro_complete function just sets some data in the socket buffer for
use in processing the TCP packet in the event that the GRO was offloaded to
the device. If the kernel lacks TCP support, such setup will simply go
unused.

The bnx2x, bnxt, and qede drivers wrap their TCP offload support in
CONFIG_INET checks and skip handling on such kernels.

The idpf driver did not check CONFIG_INET and thus fails to link if the
kernel is configured with CONFIG_NET=y, CONFIG_IDPF=(m|y), and
CONFIG_INET=n.

While checking CONFIG_INET does allow the driver to bypass significantly
more instructions in the event that we know TCP networking isn't supported,
the configuration is unlikely to be used widely.

Rather than require driver authors to care about this, stub the
tcp_gro_complete function when CONFIG_INET=n. This allows drivers to be
left as-is. It does mean the idpf driver will perform slightly more work
than strictly necessary when CONFIG_INET=n, since it will still execute
some of the skb setup in idpf_rx_rsc. However, that work would be performed
in the case where CONFIG_INET=y anyways.

I did not change the existing drivers, since they appear to wrap a
significant portion of code when CONFIG_INET=n. There is little benefit in
trashing these drivers just to unwrap and remove the CONFIG_INET check.

Using a stub for tcp_gro_complete is still beneficial, as it means future
drivers no longer need to worry about this case of CONFIG_NET=y and
CONFIG_INET=n, which should reduce noise from buildbots that check such a
configuration.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Link: https://lore.kernel.org/r/20231013185502.1473541-1-jacob.e.keller@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

drivers: net: wwan: wwan_core.c: resolved spelling mistake

resolved typing mistake from devce to device

Signed-off-by: Muhammad Muzammil <m.muzzammilashraf@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231013042304.7881-1-m.muzzammilashraf@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

cgroup, netclassid: on modifying netclassid in cgroup, only consider the main process.

When modifying netclassid, the command("echo 0x100001 > net_cls.classid")
will take more time on many threads of one process, because the process
create many fds.
for example, one process exists 28000 fds and 60000 threads, echo command
will task 45 seconds.
Now, we only consider the main process when exec "iterate_fd", and the
time is about 52 milliseconds.

Signed-off-by: Liansen Zhai <zhailiansen@kuaishou.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231012090330.29636-1-zhailiansen@kuaishou.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: usb: replace deprecated strncpy with strscpy

strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

Other implementations of .*get_drvinfo use strscpy so this patch brings
sr_get_drvinfo() in line as well:

igb/igb_ethtool.c +851
static void igb_get_drvinfo(struct net_device *netdev,

igbvf/ethtool.c
167:static void igbvf_get_drvinfo(struct net_device *netdev,

i40e/i40e_ethtool.c
1999:static void i40e_get_drvinfo(struct net_device *netdev,

e1000/e1000_ethtool.c
529:static void e1000_get_drvinfo(struct net_device *netdev,

ixgbevf/ethtool.c
211:static void ixgbevf_get_drvinfo(struct net_device *netdev,

...

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html
Link: https://github.com/KSPP/linux/issues/90
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20231012-strncpy-drivers-net-usb-sr9800-c-v1-1-5540832c8ec2@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

lan78xx: replace deprecated strncpy with strscpy

strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

Other implementations of .*get_drvinfo use strscpy so this patch brings
lan78xx_get_drvinfo() in line as well:

igb/igb_ethtool.c +851
static void igb_get_drvinfo(struct net_device *netdev,

igbvf/ethtool.c
167:static void igbvf_get_drvinfo(struct net_device *netdev,

i40e/i40e_ethtool.c
1999:static void i40e_get_drvinfo(struct net_device *netdev,

e1000/e1000_ethtool.c
529:static void e1000_get_drvinfo(struct net_device *netdev,

ixgbevf/ethtool.c
211:static void ixgbevf_get_drvinfo(struct net_device *netdev,

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings
Link: https://github.com/KSPP/linux/issues/90
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20231012-strncpy-drivers-net-usb-lan78xx-c-v1-1-99d513061dfc@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: smsc: replace deprecated strncpy with ethtool_sprintf

strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

ethtool_sprintf() is designed specifically for get_strings() usage.
Let's replace strncpy in favor of this dedicated helper function.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html
Link: https://github.com/KSPP/linux/issues/90
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231012-strncpy-drivers-net-phy-smsc-c-v1-1-00528f7524b3@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: netcp: replace deprecated strncpy with strscpy

strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

Considering the above, a suitable replacement is `strscpy` [2] due to
the fact that it guarantees NUL-termination on the destination buffer
without unnecessarily NUL-padding.

Other implementations of .*get_drvinfo also use strscpy so this patch
brings keystone_get_drvinfo() in line as well:

igb/igb_ethtool.c +851
static void igb_get_drvinfo(struct net_device *netdev,

igbvf/ethtool.c
167:static void igbvf_get_drvinfo(struct net_device *netdev,

i40e/i40e_ethtool.c
1999:static void i40e_get_drvinfo(struct net_device *netdev,

e1000/e1000_ethtool.c
529:static void e1000_get_drvinfo(struct net_device *netdev,

ixgbevf/ethtool.c
211:static void ixgbevf_get_drvinfo(struct net_device *netdev,

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html
Link: https://github.com/KSPP/linux/issues/90
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20231012-strncpy-drivers-net-ethernet-ti-netcp_ethss-c-v1-1-93142e620864@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: Set pingpong threshold via sysctl

TCP pingpong threshold is 1 by default. But some applications, like SQL DB
may prefer a higher pingpong threshold to activate delayed acks in quick
ack mode for better performance.

The pingpong threshold and related code were changed to 3 in the year
2019 in:
commit 4a41f453bedf ("tcp: change pingpong threshold to 3")
And reverted to 1 in the year 2022 in:
commit 4d8f24eeedc5 ("Revert "tcp: change pingpong threshold to 3"")

There is no single value that fits all applications.
Add net.ipv4.tcp_pingpong_thresh sysctl tunable, so it can be tuned for
optimal performance based on the application needs.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/1697056244-21888-1-git-send-email-haiyangz@microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net, sched: Add tcf_set_drop_reason for {__,}tcf_classify

Add an initial user for the newly added tcf_set_drop_reason() helper to set the
drop reason for internal errors leading to TC_ACT_SHOT inside {__,}tcf_classify().

Right now this only adds a very basic SKB_DROP_REASON_TC_ERROR as a generic
fallback indicator to mark drop locations. Where needed, such locations can be
converted to more specific codes, for example, when hitting the reclassification
limit, etc.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Victor Nogueira <victor@mojatatu.com>
Link: https://lore.kernel.org/r/20231009092655.22025-2-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net, sched: Make tc-related drop reason more flexible

Currently, the kfree_skb_reason() in sch_handle_{ingress,egress}() can only
express a basic SKB_DROP_REASON_TC_INGRESS or SKB_DROP_REASON_TC_EGRESS reason.

Victor kicked-off an initial proposal to make this more flexible by disambiguating
verdict from return code by moving the verdict into struct tcf_result and
letting tcf_classify() return a negative error. If hit, then two new drop
reasons were added in the proposal, that is SKB_DROP_REASON_TC_INGRESS_ERROR
as well as SKB_DROP_REASON_TC_EGRESS_ERROR. Further analysis of the actual
error codes would have required to attach to tcf_classify via kprobe/kretprobe
to more deeply debug skb and the returned error.

In order to make the kfree_skb_reason() in sch_handle_{ingress,egress}() more
extensible, it can be addressed in a more straight forward way, that is: Instead
of placing the verdict into struct tcf_result, we can just put the drop reason
in there, which does not require changes throughout various classful schedulers
given the existing verdict logic can stay as is.

Then, SKB_DROP_REASON_TC_ERROR{,_*} can be added to the enum skb_drop_reason
to disambiguate between an error or an intentional drop. New drop reason error
codes can be added successively to the tc code base.

For internal error locations which have not yet been annotated with a
SKB_DROP_REASON_TC_ERROR{,_*}, the fallback is SKB_DROP_REASON_TC_INGRESS and
SKB_DROP_REASON_TC_EGRESS, respectively. Generic errors could be marked with a
SKB_DROP_REASON_TC_ERROR code until they are converted to more specific ones
if it is found that they would be useful for troubleshooting.

While drop reasons have infrastructure for subsystem specific error codes which
are currently used by mac80211 and ovs, Jakub mentioned that it is preferred
for tc to use the enum skb_drop_reason core codes given it is a better fit and
currently the tooling support is better, too.

With regards to the latter:

  [...] I think Alastair (bpftrace) is working on auto-prettifying enums when
  bpftrace outputs maps. So we can do something like:

  $ bpftrace -e 'tracepoint:skb:kfree_skb { @[args->reason] = count(); }'
  Attaching 1 probe...
  ^C

  @[SKB_DROP_REASON_TC_INGRESS]: 2
  @[SKB_CONSUMED]: 34

  ^^^^^^^^^^^^ names!!

  Auto-magically. [...]

Add a small helper tcf_set_drop_reason() which can be used to set the drop reason
into the tcf_result.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Victor Nogueira <victor@mojatatu.com>
Link: https://lore.kernel.org/netdev/20231006063233.74345d36@kernel.org
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20231009092655.22025-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'bpf-log-improvements'

Andrii Nakryiko says:

====================
This patch set fixes ambiguity in BPF verifier log output of SCALAR register
in the parts that emit umin/umax, smin/smax, etc ranges. See patch #4 for
details.

Also, patch #5 fixes an issue with verifier log missing instruction context
(state) output for conditionals that trigger precision marking. See details in
the patch.

First two patches are just improvements to two selftests that are very flaky
locally when run in parallel mode.

Patch #3 changes 'align' selftest to be less strict about exact verifier log
output (which patch #4 changes, breaking lots of align tests as written). Now
test does more of a register substate checks, mostly around expected var_off()
values. This 'align' selftests is one of the more brittle ones and requires
constant adjustment when verifier log output changes, without really catching
any new issues. So hopefully these changes can minimize future support efforts
for this specific set of tests.
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: Ensure proper register state printing for cond jumps

Verifier emits relevant register state involved in any given instruction
next to it after `;` to the right, if possible. Or, worst case, on the
separate line repeating instruction index.

E.g., a nice and simple case would be:

  2: (d5) if r0 s<= 0x0 goto pc+1       ; R0_w=0

But if there is some intervening extra output (e.g., precision
backtracking log) involved, we are supposed to see the state after the
precision backtrack log:

  4: (75) if r0 s>= 0x0 goto pc+1
  mark_precise: frame0: last_idx 4 first_idx 0 subseq_idx -1
  mark_precise: frame0: regs=r0 stack= before 2: (d5) if r0 s<= 0x0 goto pc+1
  mark_precise: frame0: regs=r0 stack= before 1: (b7) r0 = 0
  6: R0_w=0

First off, note that in `6: R0_w=0` instruction index corresponds to the
next instruction, not to the conditional jump instruction itself, which
is wrong and we'll get to that.

But besides that, the above is a happy case that does work today. Yet,
if it so happens that precision backtracking had to traverse some of the
parent states, this `6: R0_w=0` state output would be missing.

This is due to a quirk of print_verifier_state() routine, which performs
mark_verifier_state_clean(env) at the end. This marks all registers as
"non-scratched", which means that subsequent logic to print *relevant*
registers (that is, "scratched ones") fails and doesn't see anything
relevant to print and skips the output altogether.

print_verifier_state() is used both to print instruction context, but
also to print an **entire** verifier state indiscriminately, e.g.,
during precision backtracking (and in a few other situations, like
during entering or exiting subprogram).  Which means if we have to print
entire parent state before getting to printing instruction context
state, instruction context is marked as clean and is omitted.

Long story short, this is definitely not intentional. So we fix this
behavior in this patch by teaching print_verifier_state() to clear
scratch state only if it was used to print instruction state, not the
parent/callback state. This is determined by print_all option, so if
it's not set, we don't clear scratch state. This fixes missing
instruction state for these cases.

As for the mismatched instruction index, we fix that by making sure we
call print_insn_state() early inside check_cond_jmp_op() before we
adjusted insn_idx based on jump branch taken logic. And with that we get
desired correct information:

  9: (16) if w4 == 0x1 goto pc+9
  mark_precise: frame0: last_idx 9 first_idx 9 subseq_idx -1
  mark_precise: frame0: parent state regs=r4 stack=: R2_w=1944 R4_rw=P1 R10=fp0
  mark_precise: frame0: last_idx 8 first_idx 0 subseq_idx 9
  mark_precise: frame0: regs=r4 stack= before 8: (66) if w4 s> 0x3 goto pc+5
  mark_precise: frame0: regs=r4 stack= before 7: (b7) r4 = 1
  9: R4=1

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20231011223728.3188086-6-andrii@kernel.org

bpf: Disambiguate SCALAR register state output in verifier logs

Currently the way that verifier prints SCALAR_VALUE register state (and
PTR_TO_PACKET, which can have var_off and ranges info as well) is very
ambiguous.

In the name of brevity we are trying to eliminate "unnecessary" output
of umin/umax, smin/smax, u32_min/u32_max, and s32_min/s32_max values, if
possible. Current rules are that if any of those have their default
value (which for mins is the minimal value of its respective types: 0,
S32_MIN, or S64_MIN, while for maxs it's U32_MAX, S32_MAX, S64_MAX, or
U64_MAX) *OR* if there is another min/max value that as matching value.
E.g., if smin=100 and umin=100, we'll emit only umin=10, omitting smin
altogether. This approach has a few problems, being both ambiguous and
sort-of incorrect in some cases.

Ambiguity is due to missing value could be either default value or value
of umin/umax or smin/smax. This is especially confusing when we mix
signed and unsigned ranges. Quite often, umin=0 and smin=0, and so we'll
have only `umin=0` leaving anyone reading verifier log to guess whether
smin is actually 0 or it's actually -9223372036854775808 (S64_MIN). And
often times it's important to know, especially when debugging tricky
issues.

"Sort-of incorrectness" comes from mixing negative and positive values.
E.g., if umin is some large positive number, it can be equal to smin
which is, interpreted as signed value, is actually some negative value.
Currently, that smin will be omitted and only umin will be emitted with
a large positive value, giving an impression that smin is also positive.

Anyway, ambiguity is the biggest issue making it impossible to have an
exact understanding of register state, preventing any sort of automated
testing of verifier state based on verifier log. This patch is
attempting to rectify the situation by removing ambiguity, while
minimizing the verboseness of register state output.

The rules are straightforward:
  - if some of the values are missing, then it definitely has a default
  value. I.e., `umin=0` means that umin is zero, but smin is actually
  S64_MIN;
  - all the various boundaries that happen to have the same value are
  emitted in one equality separated sequence. E.g., if umin and smin are
  both 100, we'll emit `smin=umin=100`, making this explicit;
  - we do not mix negative and positive values together, and even if
  they happen to have the same bit-level value, they will be emitted
  separately with proper sign. I.e., if both umax and smax happen to be
  0xffffffffffffffff, we'll emit them both separately as
  `smax=-1,umax=18446744073709551615`;
  - in the name of a bit more uniformity and consistency,
  {u32,s32}_{min,max} are renamed to {s,u}{min,max}32, which seems to
  improve readability.

The above means that in case of all 4 ranges being, say, [50, 100] range,
we'd previously see hugely ambiguous:

    R1=scalar(umin=50,umax=100)

Now, we'll be more explicit:

    R1=scalar(smin=umin=smin32=umin32=50,smax=umax=smax32=umax32=100)

This is slightly more verbose, but distinct from the case when we don't
know anything about signed boundaries and 32-bit boundaries, which under
new rules will match the old case:

    R1=scalar(umin=50,umax=100)

Also, in the name of simplicity of implementation and consistency, order
for {s,u}32_{min,max} are emitted *before* var_off. Previously they were
emitted afterwards, for unclear reasons.

This patch also includes a few fixes to selftests that expect exact
register state to accommodate slight changes to verifier format. You can
see that the changes are pretty minimal in common cases.

Note, the special case when SCALAR_VALUE register is a known constant
isn't changed, we'll emit constant value once, interpreted as signed
value.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20231011223728.3188086-5-andrii@kernel.org

selftests/bpf: Make align selftests more robust

Align subtest is very specific and finicky about expected verifier log
output and format. This is often completely unnecessary as in a bunch of
situations test actually cares about var_off part of register state. But
given how exact it is right now, any tiny verifier log changes can lead
to align tests failures, requiring constant adjustment.

This patch tries to make this a bit more robust by making logic first
search for specified register and then allowing to match only portion of
register state, not everything exactly. This will come handly with
follow up changes to SCALAR register output disambiguation.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20231011223728.3188086-4-andrii@kernel.org

selftests/bpf: Improve missed_kprobe_recursion test robustness

Given missed_kprobe_recursion is non-serial and uses common testing
kfuncs to count number of recursion misses it's possible that some other
parallel test can trigger extraneous recursion misses. So we can't
expect exactly 1 miss. Relax conditions and expect at least one.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20231011223728.3188086-3-andrii@kernel.org

selftests/bpf: Improve percpu_alloc test robustness

Make these non-serial tests filter BPF programs by intended PID of
a test runner process. This makes it isolated from other parallel tests
that might interfere accidentally.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20231011223728.3188086-2-andrii@kernel.org

tsnep: Inline small fragments within TX descriptor

The tsnep network controller is able to extend the descriptor directly
with data to be transmitted. In this case no TX data DMA address is
necessary. Instead of the TX data DMA address the TX data buffer is
placed at the end of the descriptor.

The descriptor is read with a 64 bytes DMA read by the tsnep network
controller. If the sum of descriptor data and TX data is less than or
equal to 64 bytes, then no additional DMA read is necessary to read the
TX data. Therefore, it makes sense to inline small fragments up to this
limit within the descriptor ring.

Inlined fragments need to be copied to the descriptor ring. On the other
hand DMA mapping is not necessary. At most 40 bytes are copied, so
copying should be faster than DMA mapping.

For A53 1.2 GHz copying takes <100ns and DMA mapping takes >200ns. So
inlining small fragments should result in lower CPU load. Performance
improvement is small. Thus, comparision of CPU load with and without
inlining of small fragments did not show any significant difference.
With this optimization less DMA reads will be done, which decreases the
load of the interconnect.

Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'udp-tunnel-route-lookups'

Beniamino Galvani says:

====================
net: consolidate IPv4 route lookup for UDP tunnels

At the moment different UDP tunnels rely on different functions for
IPv4 route lookup, and those functions all implement the same
logic. Only bareudp uses the generic ip_route_output_tunnel(), while
geneve and vxlan basically duplicate it slightly differently.

This series first extends the generic lookup function so that it is
suitable for all UDP tunnel implementations. Then, bareudp, geneve and
vxlan are adapted to use them.

This results in code with less duplication and hopefully better
maintainability.

After this series is merged, IPv6 will be converted in a similar way.

Changelog:
v2
- fix compilation with IPv6 disabled
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

vxlan: use generic function for tunnel IPv4 route lookup

The route lookup can be done now via generic function
udp_tunnel_dst_lookup() to replace the custom implementations in
vxlan_get_route().

Note that this patch only touches IPv4, while IPv6 still uses
vxlan6_get_route(). After IPv6 route lookup gets converted as well,
vxlan_xmit_one() can be simplified by removing local variables that
will be passed via "struct ip_tunnel_key", such as remote_ip,
local_ip, flow_flags, label.

Suggested-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Beniamino Galvani <b.galvani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

geneve: use generic function for tunnel IPv4 route lookup

The route lookup can be done now via generic function
udp_tunnel_dst_lookup() to replace the custom implementation in
geneve_get_v4_rt().

Suggested-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Beniamino Galvani <b.galvani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

geneve: add dsfield helper function

Add a helper function to compute the tos/dsfield. In this way, we can
factor out some duplicate code. Also, the helper will be called from
more places in the next commit.

Suggested-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Beniamino Galvani <b.galvani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: use tunnel flow flags for tunnel route lookups

Commit 451ef36bd229 ("ip_tunnels: Add new flow flags field to
ip_tunnel_key") added a new field to struct ip_tunnel_key to control
route lookups. Currently the flag is used by vxlan and geneve tunnels;
use it also in udp_tunnel_dst_lookup() so that it affects all tunnel
types relying on this function.

Signed-off-by: Beniamino Galvani <b.galvani@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: add new arguments to udp_tunnel_dst_lookup()

We want to make the function more generic so that it can be used by
other UDP tunnel implementations such as geneve and vxlan. To do that,
add the following arguments:

- source and destination UDP port;
- ifindex of the output interface, needed by vxlan;
- the tos, because in some cases it is not taken from struct
   ip_tunnel_info (for example, when it's inherited from the inner
   packet);
- the dst cache, because not all tunnel types (e.g. vxlan) want to
   use the one from struct ip_tunnel_info.

With these parameters, the function no longer needs the full struct
ip_tunnel_info as argument and we can pass only the relevant part of
it (struct ip_tunnel_key).

Suggested-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Beniamino Galvani <b.galvani@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: remove "proto" argument from udp_tunnel_dst_lookup()

The function is now UDP-specific, the protocol is always IPPROTO_UDP.

Suggested-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Beniamino Galvani <b.galvani@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: rename and move ip_route_output_tunnel()

At the moment ip_route_output_tunnel() is used only by bareudp.
Ideally, other UDP tunnel implementations should use it, but to do so
the function needs to accept new parameters that are specific for UDP
tunnels, such as the ports.

Prepare for these changes by renaming the function to
udp_tunnel_dst_lookup() and move it to file
net/ipv4/udp_tunnel_core.c.

Suggested-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Beniamino Galvani <b.galvani@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: net: remove unused variables

These variables are never referenced in the code, just remove them

Signed-off-by: zhujun2 <zhujun2@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: cxgb3: simplify logic for rspq_check_napi

Simplify logic for rspq_check_napi.
Drop redundant and wrong napi_is_scheduled call as it's not race free
and directly use the output of napi_schedule to understand if a napi is
pending or not.

rspq_check_napi main logic is to check if is_new_response is true and
check if a napi is not scheduled. The result of this function is then
used to detect if we are missing some interrupt and act on top of
this... With this knowing, we can rework and simplify the logic and make
it less problematic with testing an internal bit for napi.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ptp-multiple-readers'

Xabier Marquiegui says:

====================
ptp: Support for multiple filtered timestamp event queue readers

On systems with multiple timestamp event channels, there can be scenarios
where multiple userspace readers want to access the timestamping data for
various purposes.

One such example is wanting to use a pps out for time synchronization, and
wanting to timestamp external events with the synchronized time base
simultaneously.

Timestmp event consumers on the other hand, are often interested in a
subset of the available timestamp channels. linuxptp ts2phc, for example,
is not happy if more than one timestamping channel is active on the device
it is reading from.

Linked lists are introduced to support multiple timestamp event queue
consumers, and timestamp event channel filters through IOCTLs, as well as
a debugfs interface to do some simple verifications.

Xabier Marquiegui (6):
  posix-clock: introduce posix_clock_context concept
  ptp: Replace timestamp event queue with linked list
  ptp: support multiple timestamp event readers
  ptp: support event queue reader channel masks
  ptp: add debugfs interface to see applied channel masks
  ptp: add testptp mask test

drivers/ptp/ptp_chardev.c                   | 129 ++++++++++++++++----
drivers/ptp/ptp_clock.c                     |  45 ++++++-
drivers/ptp/ptp_private.h                   |  28 +++--
drivers/ptp/ptp_sysfs.c                     |  13 +-
include/linux/posix-clock.h                 |  35 ++++--
include/uapi/linux/ptp_clock.h              |   2 +
kernel/time/posix-clock.c                   |  36 ++++--
tools/testing/selftests/ptp/ptpchmaskfmt.sh |  14 +++
tools/testing/selftests/ptp/testptp.c       |  19 ++-
9 files changed, 261 insertions(+), 60 deletions(-)
create mode 100644 tools/testing/selftests/ptp/ptpchmaskfmt.sh

---
v6:
  - correct commit message
  - correct coding style
v5: https://lore.kernel.org/netdev/cover.1696804243.git.reibax@gmail.com/
  - fix spelling on commit message
  - fix memory leak on ptp_open
v4: https://lore.kernel.org/netdev/cover.1696511486.git.reibax@gmail.com/
  - split modifications in different patches for improved organization
  - rename posix_clock_user to posix_clock_context
  - remove unnecessary flush_users clock operation
  - remove unnecessary tests
  - simpler queue clean procedure
  - fix/clean comment lines
  - simplified release procedures
  - filter modifications exclusive to currently open instance for
    simplicity and security
  - expand mask to 2048 channels
  - make more secure and simple: mask is only applied to the testptp
    instance. Use debugfs to verify effects.
v3: https://lore.kernel.org/netdev/20230928133544.3642650-1-reibax@gmail.com/
  - add this patchset overview file
  - fix use of safe and non safe linked lists for loops
  - introduce new posix_clock private_data and ida object ids for better
    dicrimination of timestamp consumers
  - safer resource release procedures
  - filter application by object id, aided by process id
  - friendlier testptp implementation of event queue channel filters
v2: https://lore.kernel.org/netdev/20230912220217.2008895-1-reibax@gmail.com/
  - fix ptp_poll() return value
  - Style changes to comform to checkpatch strict suggestions
  - more coherent ptp_read error exit routines
  - fix testptp compilation error: unknown type name 'pid_t'
  - rename mask variable for easier code traceability
  - more detailed commit message with two examples
v1: https://lore.kernel.org/netdev/20230906104754.1324412-2-reibax@gmail.com/
====================

Signed-off-by: Xabier Marquiegui <reibax@gmail.com>
Suggested-by: Richard Cochran <richardcochran@gmail.com>
Suggested-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: add testptp mask test

Add option to test timestamp event queue mask manipulation in testptp.

Option -F allows the user to specify a single channel that will be
applied on the mask filter via IOCTL.

The test program will maintain the file open until user input is
received.

This allows checking the effect of the IOCTL in debugfs.

eg:

Console 1:
```
Channel 12 exclusively enabled. Check on debugfs.
Press any key to continue
```

Console 2:
```
0x00000000 0x00000001 0x00000000 0x00000000 0x00000000 0x00000000
0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
0x00000000 0x00000000 0x00000000 0x00000000
```

Signed-off-by: Xabier Marquiegui <reibax@gmail.com>
Suggested-by: Richard Cochran <richardcochran@gmail.com>
Suggested-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: add debugfs interface to see applied channel masks

Use debugfs to be able to view channel mask applied to every timestamp
event queue.

Every time the device is opened, a new entry is created in
`$DEBUGFS_MOUNTPOINT/ptpN/$INSTANCE_ADDRESS/mask`.

The mask value can be viewed grouped in 32bit decimal values using cat,
or converted to hexadecimal with the included `ptpchmaskfmt.sh` script.
32 bit values are listed from least significant to most significant.

Signed-off-by: Xabier Marquiegui <reibax@gmail.com>
Suggested-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: support event queue reader channel masks

On systems with multiple timestamp event channels, some readers might
want to receive only a subset of those channels.

Add the necessary modifications to support timestamp event channel
filtering, including two IOCTL operations:

- Clear all channels
- Enable one channel

The mask modification operations will be applied exclusively on the
event queue assigned to the file descriptor used on the IOCTL operation,
so the typical procedure to have a reader receiving only a subset of the
enabled channels would be:

- Open device file
- ioctl: clear all channels
- ioctl: enable one channel
- start reading

Calling the enable one channel ioctl more than once will result in
multiple enabled channels.

Signed-off-by: Xabier Marquiegui <reibax@gmail.com>
Suggested-by: Richard Cochran <richardcochran@gmail.com>
Suggested-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: support multiple timestamp event readers

Use linked lists to create one event queue per open file. This enables
simultaneous readers for timestamp event queues.

Signed-off-by: Xabier Marquiegui <reibax@gmail.com>
Suggested-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: Replace timestamp event queue with linked list

Introduce linked lists to access the timestamp event queue.

Signed-off-by: Xabier Marquiegui <reibax@gmail.com>
Suggested-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

posix-clock: introduce posix_clock_context concept

Add the necessary structure to support custom private-data per
posix-clock user.

The previous implementation of posix-clock assumed all file open
instances need access to the same clock structure on private_data.

The need for individual data structures per file open instance has been
identified when developing support for multiple timestamp event queue
users for ptp_clock.

Signed-off-by: Xabier Marquiegui <reibax@gmail.com>
Suggested-by: Richard Cochran <richardcochran@gmail.com>
Suggested-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'dpll-phase-offset-phase-adjust'

Arkadiusz Kubalewski says:

====================
dpll: add phase-offset and phase-adjust

Improve monitoring and control over dpll devices.
Allow user to receive measurement of phase difference between signals
on pin and dpll (phase-offset).
Allow user to receive and control adjustable value of pin's signal
phase (phase-adjust).

v4->v5:
- rebase series on top of net-next/main, fix conflict - remove redundant
  attribute type definition in subset definition

v3->v4:
- do not increase do version of uAPI header as it is not needed (v3 did
  not have this change)
- fix spelling around commit messages, argument descriptions and docs
- add missing extack errors on failure set callbacks for pin phase
  adjust and frequency
- remove ice check if value is already set, now redundant as checked in
  the dpll subsystem

v2->v3:
- do not increase do version of uAPI header as it is not needed

v1->v2:
- improve handling for error case of requesting the phase adjust set
- align handling for error case of frequency set request with the
approach introduced for phase adjust
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

dpll: netlink/core: change pin frequency set behavior

Align the approach of pin frequency set behavior with the approach
introduced with pin phase adjust set.
Fail the request if any of devices did not registered the callback ops.
If callback op on any pin's registered device fails, return error and
rollback the value to previous one.

Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ice: dpll: implement phase related callbacks

Implement new callback ops related to measurement and adjustment of
signal phase for pin-dpll in ice driver.

Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dpll: netlink/core: add support for pin-dpll signal phase offset/adjust

Add callback ops for pin-dpll phase measurement.
Add callback for pin signal phase adjustment.
Add min and max phase adjustment values to pin proprties.
Invoke callbacks in dpll_netlink.c when filling the pin details to
provide user with phase related attribute values.

Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dpll: spec: add support for pin-dpll signal phase offset/adjust

Add attributes for providing the user with:
- measurement of signals phase offset between pin and dpll
- ability to adjust the phase of pin signal

Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dpll: docs: add support for pin signal phase offset/adjust

Add documentation on:
- measurement of phase of signal between pin and dpll
- adjustment of pin signal phase

Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'i40e-devlink'

Ivan Vecera says:

====================
i40e: Add basic devlink support

The series adds initial support for devlink to i40e driver.

Patch-set overview:
Patch 1: Adds initial devlink support (devlink and port registration)
Patch 2: Refactors and split i40e_nvm_version_str()
Patch 3: Adds support for 'devlink dev info'
Patch 4: Refactors existing helper function to read PBA ID
Patch 5: Adds 'board.id' to 'devlink dev info' using PBA ID
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

i40e: Add PBA as board id info to devlink .info_get

Expose stored PBA ID string as unique board identifier via
devlink's .info_get command.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

i40e: Refactor and rename i40e_read_pba_string()

Function i40e_read_pba_string() is currently unused but will be used
by subsequent patch to provide board ID via devlink device info.

The function reads PBA block from NVM so it cannot be called during
adapter reset and as we would like to provide PBA ID via devlink
info it is better to read the PBA ID during i40e_probe() and cache
it in i40e_hw structure to avoid a waiting for potential adapter
reset in devlink info callback.

So...
- Remove pba_num and pba_num_size arguments from the function,
  allocate resource managed buffer to store PBA ID string and
  save resulting pointer to i40e_hw->pba_id field
- Make the function void as the PBA ID can be missing and in this
  case (or in case of NVM reading failure) the i40e_hw->pba_id
  will be NULL
- Rename the function to i40e_get_pba_string() to align with other
  functions like i40e_get_oem_version() i40e_get_port_mac_addr()...
- Call this function on init during i40e_probe()

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

i40e: Add handler for devlink .info_get

Provide devlink .info_get callback to allow the driver to report
detailed version information. The following info is reported:

"serial_number" -> The PCI DSN of the adapter
"fw.mgmt" -> The version of the firmware
"fw.mgmt.api" -> The API version of interface exposed over the AdminQ
"fw.psid" -> The version of the NVM image
"fw.bundle_id" -> Unique identifier for the combined flash image
"fw.undi" -> The combo image version

With this, 'devlink dev info' provides at least the same amount
information as is reported by ETHTOOL_GDRVINFO:

$ ethtool -i enp2s0f0 | egrep '(driver|firmware)'
driver: i40e
firmware-version: 9.30 0x8000e5f3 1.3429.0

$ devlink dev info pci/0000:02:00.0
pci/0000:02:00.0:
  driver i40e
  serial_number c0-de-b7-ff-ff-ef-ec-3c
  versions:
      running:
        fw.mgmt 9.130.73618
        fw.mgmt.api 1.15
        fw.psid 9.30
        fw.bundle_id 0x8000e5f3
        fw.undi 1.3429.0

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

i40e: Split and refactor i40e_nvm_version_str()

The function formats NVM version string according adapter's
EETrackID value. If this value OEM specific (0xffffffff) then
the reported version is with format:
"<gen>.<snap>.<release>"
and in other case
"<nvm_maj>.<nvm_min> <eetrackid> <cvid_maj>.<cvid_bld>.<cvid_min>"

These versions are reported in the subsequent patch in this series
that implements devlink .info_get but separately.

So split the function into separate ones, refactor it to use them
and remove ugly static string buffer.
Additionally convert NVM/OEM version mask macros to use GENMASK and
use FIELD_GET/FIELD_PREP for them in i40e_nvm_version_str() and
i40e_get_oem_version(). This makes code more readable and allows
us to remove related shift macros.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

i40e: Add initial devlink support

Add an initial support for devlink interface to i40e driver.

Similarly to ice driver the implementation doe not enable devlink
to manage device-wide configuration and devlink instance is created
for each physical function of PCIe device.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tg3: Improve PTP TX timestamping logic

When we are trying to timestamp a TX packet, there may be
occasions when the TX timestamp register is still not
updated with the latest timestamp even if the timestamp
packet descriptor is marked as complete.
This usually happens in cases where the system is under
stress or flow control is affecting the transmit side.

We will solve this problem by saving the snapshot of the
timestamp register when we are posting the TX descriptor.
At this time, the register contains previously timestamped
packet's value and valid timestamp of the current packet must
be different than this.
Upon completion of the current descriptor, we will check if
the timestamp register is updated or not before timestamping
the skb. If not updated, we will schedule the ptp worker to
fetch the updated time later and timestamp the skb.
Also now we restrict number of outstanding PTP TX packet
requests to 1.

Reported-by: Simon White <Simon.White@viavisolutions.com>
Link: https://lore.kernel.org/netdev/CACKFLikGdN9XPtWk-fdrzxdcD=+bv-GHBvfVfSpJzHY7hrW39g@mail.gmail.com/
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Andy Gospodarek <gospo@broadcom.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

docs: try to encourage (netdev?) reviewers

Add a section to netdev maintainer doc encouraging reviewers
to chime in on the mailing list.

The questions about "when is it okay to share feedback"
keep coming up (most recently at netconf) and the answer
is "pretty much always".

Extend the section of 7.AdvancedTopics.rst which deals
with reviews a little bit to add stuff we had been recommending
locally.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Martin Habets <habetsm.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'sfc-conntrack-offload'

Edward Cree says:

====================
sfc: support conntrack NAT offload

The EF100 MAE supports performing NAT (and NPT) on packets which match in
the conntrack table. This series adds that capability to the driver.
====================

Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

sfc: support offloading ct(nat) action in RHS rules

If an IP address and/or L4 port for NAPT is available from a CT match,
the MAE will perform the edits; if no CT lookup has been performed for
this packet, the CT lookup did not return a match, or the matched CT
entry did not include NAPT, the action will have no effect.

Reviewed-by: Pieter Jansen van Vuuren <pieter.jansen-van-vuuren@amd.com>
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sfc: parse mangle actions (NAT) in conntrack entries

The MAE can edit either address, L4 port, or both, for either source
or destination. These can't be mixed; i.e. it can edit source addr
and source port, but not (say) source addr and dest port.

Reviewed-by: Pieter Jansen van Vuuren <pieter.jansen-van-vuuren@amd.com>
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'vsock-virtio-vhost-zerocopy'

Arseniy Krasnov says:

====================
vsock/virtio/vhost: MSG_ZEROCOPY preparations

this patchset is first of three parts of another big patchset for
MSG_ZEROCOPY flag support:
https://lore.kernel.org/netdev/20230701063947.3422088-1-AVKrasnov@sberdevices.ru/

During review of this series, Stefano Garzarella <sgarzare@redhat.com>
suggested to split it for three parts to simplify review and merging:

1) virtio and vhost updates (for fragged skbs) <--- this patchset
2) AF_VSOCK updates (allows to enable MSG_ZEROCOPY mode and read
tx completions) and update for Documentation/.
3) Updates for tests and utils.

This series enables handling of fragged skbs in virtio and vhost parts.
Newly logic won't be triggered, because SO_ZEROCOPY options is still
impossible to enable at this moment (next bunch of patches from big
set above will enable it).

I've included changelog to some patches anyway, because there were some
comments during review of last big patchset from the link above.

Head for this patchset is:
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=f2fa1c812c91e99d0317d1fc7d845e1e05f39716

Link to v1:
https://lore.kernel.org/netdev/20230717210051.856388-1-AVKrasnov@sberdevices.ru/
Link to v2:
https://lore.kernel.org/netdev/20230718180237.3248179-1-AVKrasnov@sberdevices.ru/
Link to v3:
https://lore.kernel.org/netdev/20230720214245.457298-1-AVKrasnov@sberdevices.ru/
Link to v4:
https://lore.kernel.org/netdev/20230727222627.1895355-1-AVKrasnov@sberdevices.ru/
Link to v5:
https://lore.kernel.org/netdev/20230730085905.3420811-1-AVKrasnov@sberdevices.ru/
Link to v6:
https://lore.kernel.org/netdev/20230814212720.3679058-1-AVKrasnov@sberdevices.ru/
Link to v7:
https://lore.kernel.org/netdev/20230827085436.941183-1-avkrasnov@salutedevices.com/
Link to v8:
https://lore.kernel.org/netdev/20230911202234.1932024-1-avkrasnov@salutedevices.com/

Changelog:
v3 -> v4:
* Patchset rebased and tested on new HEAD of net-next (see hash above).
v4 -> v5:
* See per-patch changelog after ---.
v5 -> v6:
* Patchset rebased and tested on new HEAD of net-next (see hash above).
* See per-patch changelog after ---.
v6 -> v7:
* Patchset rebased and tested on new HEAD of net-next (see hash above).
* See per-patch changelog after ---.
v7 -> v8:
* Patchset rebased and tested on new HEAD of net-next (see hash above).
* See per-patch changelog after ---.
v8 -> v9:
* Patchset rebased and tested on new HEAD of net-next (see hash above).
* See per-patch changelog after ---.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

test/vsock: io_uring rx/tx tests

This adds set of tests which use io_uring for rx/tx. This test suite is
implemented as separated util like 'vsock_test' and has the same set of
input arguments as 'vsock_test'. These tests only cover cases of data
transmission (no connect/bind/accept etc).

Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

test/vsock: MSG_ZEROCOPY support for vsock_perf

To use this option pass '--zerocopy' parameter:

./vsock_perf --zerocopy --sender <cid> ...

With this option MSG_ZEROCOPY flag will be passed to the 'send()' call.

Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

test/vsock: MSG_ZEROCOPY flag tests

This adds three tests for MSG_ZEROCOPY feature:
1) SOCK_STREAM tx with different buffers.
2) SOCK_SEQPACKET tx with different buffers.
3) SOCK_STREAM test to read empty error queue of the socket.

Patch also works as preparation for the next patches for tools in this
patchset: vsock_perf and vsock_uring_test:
1) Adds several new functions to util.c - they will be also used by
   vsock_uring_test.
2) Adds two new functions for MSG_ZEROCOPY handling to a new source
   file - such source will be shared between vsock_test, vsock_perf and
   vsock_uring_test, thus avoiding code copy-pasting.

Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

docs: net: description of MSG_ZEROCOPY for AF_VSOCK

This adds description of MSG_ZEROCOPY flag support for AF_VSOCK type of
socket.

Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vsock: enable setting SO_ZEROCOPY

For AF_VSOCK, zerocopy tx mode depends on transport, so this option must
be set in AF_VSOCK implementation where transport is accessible (if
transport is not set during setting SO_ZEROCOPY: for example socket is
not connected, then SO_ZEROCOPY will be enabled, but once transport will
be assigned, support of this type of transmission will be checked).

To handle SO_ZEROCOPY, AF_VSOCK implementation uses SOCK_CUSTOM_SOCKOPT
bit, thus handling SOL_SOCKET option operations, but all of them except
SO_ZEROCOPY will be forwarded to the generic handler by calling
'sock_setsockopt()'.

Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vsock/loopback: support MSG_ZEROCOPY for transport

Add 'msgzerocopy_allow()' callback for loopback transport.

Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vsock/virtio: support MSG_ZEROCOPY for transport

Add 'msgzerocopy_allow()' callback for virtio transport.

Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vhost/vsock: support MSG_ZEROCOPY for transport

Add 'msgzerocopy_allow()' callback for vhost transport.

Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vsock: enable SOCK_SUPPORT_ZC bit

This bit is used by io_uring in case of zerocopy tx mode. io_uring code
checks, that socket has this feature. This patch sets it in two places:
1) For socket in 'connect()' call.
2) For new socket which is returned by 'accept()' call.

Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vsock: check for MSG_ZEROCOPY support on send

This feature totally depends on transport, so if transport doesn't
support it, return error.

Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vsock: read from socket's error queue

This adds handling of MSG_ERRQUEUE input flag in receive call. This flag
is used to read socket's error queue instead of data queue. Possible
scenario of error queue usage is receiving completions for transmission
with MSG_ZEROCOPY flag. This patch also adds new defines: 'SOL_VSOCK'
and 'VSOCK_RECVERR'.

Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vsock: set EPOLLERR on non-empty error queue

If socket's error queue is not empty, EPOLLERR must be set. Otherwise,
reader of error queue won't detect data in it using EPOLLERR bit.
Currently for AF_VSOCK this is actual only with MSG_ZEROCOPY, as this
feature is the only user of an error queue of the socket.

Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

appletalk: remove special handling code for ipddp

After commit 1dab47139e61 ("appletalk: remove ipddp driver") removes the
config IPDDP, there is some minor code clean-up possible in the appletalk
network layer.

Remove some code in appletalk layer after the ipddp driver is gone.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231012063443.22368-1-lukas.bulwahn@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

qed: replace uses of strncpy

strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

This patch eliminates three uses of strncpy():

Firstly, `dest` is expected to be NUL-terminated which is evident by the
manual setting of a NUL-byte at size - 1. For this use specifically,
strscpy() is a viable replacement due to the fact that it guarantees
NUL-termination on the destination buffer.

The next two cases should simply be memcpy() as the size of the src
string is always 3 and the destination string just wants the first 3
bytes changed.

To be clear, there are no buffer overread bugs in the current code as
the sizes and offsets are carefully managed such that buffers are
NUL-terminated. However, with these changes, the code is now more robust
and less ambiguous (and hopefully easier to read).

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html
Link: https://github.com/KSPP/linux/issues/90
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20231012-strncpy-drivers-net-ethernet-qlogic-qed-qed_debug-c-v2-1-16d2c0162b80@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

r8169: fix rare issue with broken rx after link-down on RTL8125

In very rare cases (I've seen two reports so far about different
RTL8125 chip versions) it seems the MAC locks up when link goes down
and requires a software reset to get revived.
Realtek doesn't publish hw errata information, therefore the root cause
is unknown. Realtek vendor drivers do a full hw re-initialization on
each link-up event, the slimmed-down variant here was reported to fix
the issue for the reporting user.
It's not fully clear which parts of the NIC are reset as part of the
software reset, therefore I can't rule out side effects.

Fixes: f1bce4ad2f1c ("r8169: add support for RTL8125")
Reported-by: Martin Kjær Jørgensen <me@lagy.org>
Link: https://lore.kernel.org/netdev/97ec2232-3257-316c-c3e7-a08192ce16a6@gmail.com/T/
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/9edde757-9c3b-4730-be3b-0ef3a374ff71@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-netconsole-configfs-entries-for-boot-target'

Breno Leitao says:

====================
net: netconsole: configfs entries for boot target

There is a limitation in netconsole, where it is impossible to
disable or modify the target created from the command line parameter.
(netconsole=...).

"netconsole" cmdline parameter sets the remote IP, and if the remote IP
changes, the machine needs to be rebooted (with the new remote IP set in
the command line parameter).

This allows the user to modify a target without the need to restart the
machine.

This functionality sits on top of the dynamic target reconfiguration that is
already implemented in netconsole.

The way to modify a boot time target is creating special named configfs
directories, that will be associated with the targets coming from
`netconsole=...`.

Example:

Let's suppose you have two netconsole targets defined at boot time::

netconsole=4444@10.0.0.1/eth1,9353@10.0.0.2/12:34:56:78:9a:bc;4444@10.0.0.1/eth1,9353@10.0.0.3/12:34:56:78:9a:bc

You can modify these targets in runtime by creating the following targets::

$ mkdir cmdline1
$ cat cmdline1/remote_ip
10.0.0.3
$ echo 0 > cmdline1/enabled
$ echo 10.0.0.4 > cmdline1/remote_ip
$ echo 1 > cmdline1/enabled
====================

Link: https://lore.kernel.org/r/20231012111401.333798-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Documentation: netconsole: add support for cmdline targets

With the previous patches, there is no more limitation at modifying the
targets created at boot time (or module load time).

Document the way on how to create the configfs directories to be able to
modify these netconsole targets.

The design discussion about this topic could be found at:
https://lore.kernel.org/all/ZRWRal5bW93px4km@gmail.com/

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20231012111401.333798-5-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netconsole: Attach cmdline target to dynamic target

Enable the attachment of a dynamic target to the target created during
boot time. The boot-time targets are named as "cmdline\d", where "\d" is
a number starting at 0.

If the user creates a dynamic target named "cmdline0", it will attach to
the first target created at boot time (as defined in the
`netconsole=...` command line argument). `cmdline1` will attach to the
second target and so forth.

If there is no netconsole target created at boot time, then, the target
name could be reused.

Relevant design discussion:
https://lore.kernel.org/all/ZRWRal5bW93px4km@gmail.com/

Suggested-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20231012111401.333798-4-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netconsole: Initialize configfs_item for default targets

For netconsole targets allocated during the boot time (passing
netconsole=... argument), netconsole_target->item is not initialized.
That is not a problem because it is not used inside configfs.

An upcoming patch will be using it, thus, initialize the targets with
the name 'cmdline' plus a counter starting from 0. This name will match
entries in the configfs later.

Suggested-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20231012111401.333798-3-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netconsole: move init/cleanup functions lower

Move alloc_param_target() and its counterpart (free_param_target())
to the bottom of the file. These functions are called mostly at
initialization/cleanup of the module, and they should be just above the
callers, at the bottom of the file.

From a practical perspective, having alloc_param_target() at the bottom
of the file will avoid forward declaration later (in the following
patch).

Nothing changed other than the functions location.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20231012111401.333798-2-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sfc: replace deprecated strncpy with strscpy

strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

`desc` is expected to be NUL-terminated as evident by the manual
NUL-byte assignment. Moreover, NUL-padding does not seem to be
necessary.

The only caller of efx_mcdi_nvram_metadata() is
efx_devlink_info_nvram_partition() which provides a NULL for `desc`:
| rc = efx_mcdi_nvram_metadata(efx, partition_type, NULL, version, NULL, 0);

Due to this, I am not sure this code is even reached but we should still
favor something other than strncpy.

Considering the above, a suitable replacement is `strscpy` [2] due to
the fact that it guarantees NUL-termination on the destination buffer
without unnecessarily NUL-padding.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html
Link: https://github.com/KSPP/linux/issues/90
Signed-off-by: Justin Stitt <justinstitt@google.com>
Acked-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20231012-strncpy-drivers-net-ethernet-sfc-mcdi-c-v1-1-478c8de1039d@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: tja11xx: replace deprecated strncpy with ethtool_sprintf

strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

ethtool_sprintf() is designed specifically for get_strings() usage.
Let's replace strncpy in favor of this dedicated helper function.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html
Link: https://github.com/KSPP/linux/issues/90
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20231012-strncpy-drivers-net-phy-nxp-tja11xx-c-v1-1-5ad6c9dff5c4@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ionic: replace deprecated strncpy with strscpy

strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

NUL-padding is not needed due to `ident` being memset'd to 0 just before
the copy.

Considering the above, a suitable replacement is `strscpy` [2] due to
the fact that it guarantees NUL-termination on the destination buffer
without unnecessarily NUL-padding.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html
Link: https://github.com/KSPP/linux/issues/90
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20231011-strncpy-drivers-net-ethernet-pensando-ionic-ionic_main-c-v1-1-23c62a16ff58@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sparx5: replace deprecated strncpy with ethtool_sprintf

strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

ethtool_sprintf() is designed specifically for get_strings() usage.
Let's replace strncpy() in favor of this more robust and easier to
understand interface.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html
Link: https://github.com/KSPP/linux/issues/90
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20231011-strncpy-drivers-net-ethernet-microchip-sparx5-sparx5_ethtool-c-v1-1-410953d07f42@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx4_core: replace deprecated strncpy with strscpy

`strncpy` is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

We expect `dst` to be NUL-terminated based on its use with format
strings:
| mlx4_dbg(dev, "Reporting Driver Version to FW: %s\n", dst);

Moreover, NUL-padding is not required.

Considering the above, a suitable replacement is `strscpy` [2] due to
the fact that it guarantees NUL-termination on the destination buffer
without unnecessarily NUL-padding.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html
Link: https://github.com/KSPP/linux/issues/90
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20231011-strncpy-drivers-net-ethernet-mellanox-mlx4-fw-c-v1-1-4d7b5d34c933@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

nfp: replace deprecated strncpy with strscpy

strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

We expect res->name to be NUL-terminated based on its usage with format
strings:
|       dev_err(cpp->dev.parent, "Dangling area: %d:%d:%d:0x%0llx-0x%0llx%s%s\n",
|               NFP_CPP_ID_TARGET_of(res->cpp_id),
|               NFP_CPP_ID_ACTION_of(res->cpp_id),
|               NFP_CPP_ID_TOKEN_of(res->cpp_id),
|               res->start, res->end,
|               res->name ? " " : "",
|               res->name ? res->name : "");
... and with strcmp()
|       if (!strcmp(res->name, NFP_RESOURCE_TBL_NAME)) {

Moreover, NUL-padding is not required as `res` is already
zero-allocated:
|       res = kzalloc(sizeof(*res), GFP_KERNEL);

Considering the above, a suitable replacement is `strscpy` [2] due to
the fact that it guarantees NUL-termination on the destination buffer
without unnecessarily NUL-padding.

Let's also opt to use the more idiomatic strscpy() usage of (dest, src,
sizeof(dest)) rather than (dest, src, SOME_LEN).

Typically the pattern of 1) allocate memory for string, 2) copy string
into freshly-allocated memory is a candidate for kmemdup_nul() but in
this case we are allocating the entirety of the `res` struct and that
should stay as is. As mentioned above, simple 1:1 replacement of strncpy
-> strscpy :)

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html
Link: https://github.com/KSPP/linux/issues/90
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Louis Peens <louis.peens@corigine.com>
Link: https://lore.kernel.org/r/20231011-strncpy-drivers-net-ethernet-netronome-nfp-nfpcore-nfp_resource-c-v1-1-7d1c984f0eba@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: pci: Allocate skbs using GFP_KERNEL during initialization

The driver allocates skbs during initialization and during Rx
processing. Take advantage of the fact that the former happens in
process context and allocate the skbs using GFP_KERNEL to decrease the
probability of allocation failure.

Tested with CONFIG_DEBUG_ATOMIC_SLEEP=y.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/dfa6ed0926e045fe7c14f0894cc0c37fee81bf9d.1697034729.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: Enable hardware timestamping for VFs

Currently for VFs, mailbox returns ENODEV error when hardware timestamping
enable is requested. This patch fixes this issue. Modified this patch to
return EPERM error for the PF/VFs which are not attached to CGX/RPM.

Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: Sai Krishna <saikrishnag@marvell.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231011121551.1205211-1-saikrishnag@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'wangxun-ethtool-stats'

Jiawen Wu says:

====================
Wangxun ethtool stats

Support to show ethtool stats for txgbe/ngbe.
====================

Link: https://lore.kernel.org/r/20231011091906.70486-1-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ngbe: add ethtool stats support

Support to show ethtool statistics.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Link: https://lore.kernel.org/r/20231011091906.70486-4-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: txgbe: add ethtool stats support

Support to show ethtool statistics.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Link: https://lore.kernel.org/r/20231011091906.70486-3-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: libwx: support hardware statistics

Implement update and clear Rx/Tx statistics.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Link: https://lore.kernel.org/r/20231011091906.70486-2-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: vsc73xx: replace deprecated strncpy with ethtool_sprintf

`strncpy` is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

ethtool_sprintf() is designed specifically for get_strings() usage.
Let's replace strncpy in favor of this more robust and easier to
understand interface.

This change could result in misaligned strings when if(cnt) fails. To
combat this, use ternary to place empty string in buffer and properly
increment pointer to next string slot.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html
Link: https://github.com/KSPP/linux/issues/90
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20231010-strncpy-drivers-net-dsa-vitesse-vsc73xx-core-c-v2-1-ba4416a9ff23@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'Open-coded task_vma iter'

Dave Marchevsky says:

====================
At Meta we have a profiling daemon which periodically collects
information on many hosts. This collection usually involves grabbing
stacks (user and kernel) using perf_event BPF progs and later symbolicating
them. For user stacks we try to use BPF_F_USER_BUILD_ID and rely on
remote symbolication, but BPF_F_USER_BUILD_ID doesn't always succeed. In
those cases we must fall back to digging around in /proc/PID/maps to map
virtual address to (binary, offset). The /proc/PID/maps digging does not
occur synchronously with stack collection, so the process might already
be gone, in which case it won't have /proc/PID/maps and we will fail to
symbolicate.

This 'exited process problem' doesn't occur very often as
most of the prod services we care to profile are long-lived daemons, but
there are enough usecases to warrant a workaround: a BPF program which
can be optionally loaded at data collection time and essentially walks
/proc/PID/maps. Currently this is done by walking the vma list:

  struct vm_area_struct* mmap = BPF_CORE_READ(mm, mmap);
  mmap_next = BPF_CORE_READ(rmap, vm_next); /* in a loop */

Since commit 763ecb035029 ("mm: remove the vma linked list") there's no
longer a vma linked list to walk. Walking the vma maple tree is not as
simple as hopping struct vm_area_struct->vm_next. Luckily,
commit f39af05949a4 ("mm: add VMA iterator"), another commit in that series,
added struct vma_iterator and for_each_vma macro for easy vma iteration. If
similar functionality was exposed to BPF programs, it would be perfect for our
usecase.

This series adds such functionality, specifically a BPF equivalent of
for_each_vma using the open-coded iterator style.

Notes:
  * This approach was chosen after discussion on a previous series [0] which
    attempted to solve the same problem by adding a BPF_F_VMA_NEXT flag to
    bpf_find_vma.
  * Unlike the task_vma bpf_iter, the open-coded iterator kfuncs here do not
    drop the vma read lock between iterations. See Alexei's response in [0].
  * The [vsyscall] page isn't really part of task->mm's vmas, but
    /proc/PID/maps returns information about it anyways. The vma iter added
    here does not do the same. See comment on selftest in patch 3.
  * bpf_iter_task_vma allocates a _data struct which contains - among other
    things - struct vma_iterator, using BPF allocator and keeps a pointer to
    the bpf_iter_task_vma_data. This is done in order to prevent changes to
    struct ma_state - which is wrapped by struct vma_iterator - from
    necessitating changes to uapi struct bpf_iter_task_vma.

Changelog:

v6 -> v7: https://lore.kernel.org/bpf/20231010185944.3888849-1-davemarchevsky@fb.com/

Patch numbers correspond to their position in v6

Patch 2 ("selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c")
  * Add Andrii ack
Patch 3 ("bpf: Introduce task_vma open-coded iterator kfuncs")
  * Add Andrii ack
  * Add missing __diag_ignore_all for -Wmissing-prototypes (Song)
Patch 4 ("selftests/bpf: Add tests for open-coded task_vma iter")
  * Remove two unnecessary header includes (Andrii)
  * Remove extraneous !vmas_seen check (Andrii)
New Patch ("bpf: Add BPF_KFUNC_{START,END}_defs macros")
  * After talking to Andrii, this is an attempt to clean up __diag_ignore_all
    spam everywhere kfuncs are defined. If nontrivial changes are needed,
    let's apply the other 4 and I'll respin as a standalone patch.

v5 -> v6: https://lore.kernel.org/bpf/20231010175637.3405682-1-davemarchevsky@fb.com/

Patch 4 ("selftests/bpf: Add tests for open-coded task_vma iter")
  * Remove extraneous blank line. I did this manually to the .patch file
    for v5, which caused BPF CI to complain about failing to apply the
    series

v4 -> v5: https://lore.kernel.org/bpf/20231002195341.2940874-1-davemarchevsky@fb.com/

Patch numbers correspond to their position in v4

New Patch ("selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c")
  * Patch 2's renaming of this selftest, and associated changes in the
    userspace runner, are split out into this separate commit (Andrii)

Patch 2 ("bpf: Introduce task_vma open-coded iterator kfuncs")
  * Remove bpf_iter_task_vma kfuncs from libbpf's bpf_helpers.h, they'll be
    added to selftests' bpf_experimental.h in selftests patch below (Andrii)
  * Split bpf_iter_task_vma.c renaming into separate commit (Andrii)

Patch 3 ("selftests/bpf: Add tests for open-coded task_vma iter")
  * Add bpf_iter_task_vma kfuncs to bpf_experimental.h (Andrii)
  * Remove '?' from prog SEC, open_and_load the skel in one operation (Andrii)
  * Ensure that fclose() always happens in test runner (Andrii)
  * Use global var w/ 1000 (vm_start, vm_end) structs instead of two
    MAP_TYPE_ARRAY's w/ 1k u64s each (Andrii)

v3 -> v4: https://lore.kernel.org/bpf/20230822050558.2937659-1-davemarchevsky@fb.com/

Patch 1 ("bpf: Don't explicitly emit BTF for struct btf_iter_num")
  * Add Andrii ack
Patch 2 ("bpf: Introduce task_vma open-coded iterator kfuncs")
  * Mark bpf_iter_task_vma_new args KF_RCU and remove now-unnecessary !task
    check (Yonghong)
    * Although KF_RCU is a function-level flag, in reality it only applies to
      the task_struct *task parameter, as the other two params are a scalar int
      and a specially-handled KF_ARG_PTR_TO_ITER
   * Remove struct bpf_iter_task_vma definition from uapi headers, define in
     kernel/bpf/task_iter.c instead (Andrii)
Patch 3 ("selftests/bpf: Add tests for open-coded task_vma iter")
  * Use a local var when looping over vmas to track map idx. Update vmas_seen
    global after done iterating. Don't start iterating or update vmas_seen if
    vmas_seen global is nonzero. (Andrii)
  * Move getpgid() call to correct spot - above skel detach. (Andrii)

v2 -> v3: https://lore.kernel.org/bpf/20230821173415.1970776-1-davemarchevsky@fb.com/

Patch 1 ("bpf: Don't explicitly emit BTF for struct btf_iter_num")
  * Add Yonghong ack

Patch 2 ("bpf: Introduce task_vma open-coded iterator kfuncs")
  * UAPI bpf header and tools/ version should match
  * Add bpf_iter_task_vma_kern_data which bpf_iter_task_vma_kern points to,
    bpf_mem_alloc/free it instead of just vma_iterator. (Alexei)
    * Inner data ptr == NULL implies initialization failed

v1 -> v2: https://lore.kernel.org/bpf/20230810183513.684836-1-davemarchevsky@fb.com/
  * Patch 1
    * Now removes the unnecessary BTF_TYPE_EMIT instead of changing the
      type (Yonghong)
  * Patch 2
    * Don't do unnecessary BTF_TYPE_EMIT (Yonghong)
    * Bump task refcount to prevent ->mm reuse (Yonghong)
    * Keep a pointer to vma_iterator in bpf_iter_task_vma, alloc/free
      via BPF mem allocator (Yonghong, Stanislav)
  * Patch 3

  [0]: https://lore.kernel.org/bpf/20230801145414.418145-1-davemarchevsky@fb.com/
====================

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

selftests/bpf: Add tests for open-coded task_vma iter

The open-coded task_vma iter added earlier in this series allows for
natural iteration over a task's vmas using existing open-coded iter
infrastructure, specifically bpf_for_each.

This patch adds a test demonstrating this pattern and validating
correctness. The vma->vm_start and vma->vm_end addresses of the first
1000 vmas are recorded and compared to /proc/PID/maps output. As
expected, both see the same vmas and addresses - with the exception of
the [vsyscall] vma - which is explained in a comment in the prog_tests
program.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231013204426.1074286-5-davemarchevsky@fb.com

bpf: Introduce task_vma open-coded iterator kfuncs

This patch adds kfuncs bpf_iter_task_vma_{new,next,destroy} which allow
creation and manipulation of struct bpf_iter_task_vma in open-coded
iterator style. BPF programs can use these kfuncs directly or through
bpf_for_each macro for natural-looking iteration of all task vmas.

The implementation borrows heavily from bpf_find_vma helper's locking -
differing only in that it holds the mmap_read lock for all iterations
while the helper only executes its provided callback on a maximum of 1
vma. Aside from locking, struct vma_iterator and vma_next do all the
heavy lifting.

A pointer to an inner data struct, struct bpf_iter_task_vma_data, is the
only field in struct bpf_iter_task_vma. This is because the inner data
struct contains a struct vma_iterator (not ptr), whose size is likely to
change under us. If bpf_iter_task_vma_kern contained vma_iterator directly
such a change would require change in opaque bpf_iter_task_vma struct's
size. So better to allocate vma_iterator using BPF allocator, and since
that alloc must already succeed, might as well allocate all iter fields,
thereby freezing struct bpf_iter_task_vma size.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231013204426.1074286-4-davemarchevsky@fb.com

selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c

Further patches in this series will add a struct bpf_iter_task_vma,
which will result in a name collision with the selftest prog renamed in
this patch. Rename the selftest to avoid the collision.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231013204426.1074286-3-davemarchevsky@fb.com

bpf: Don't explicitly emit BTF for struct btf_iter_num

Commit 6018e1f407cc ("bpf: implement numbers iterator") added the
BTF_TYPE_EMIT line that this patch is modifying. The struct btf_iter_num
doesn't exist, so only a forward declaration is emitted in BTF:

FWD 'btf_iter_num' fwd_kind=struct

That commit was probably hoping to ensure that struct bpf_iter_num is
emitted in vmlinux BTF. A previous version of this patch changed the
line to emit the correct type, but Yonghong confirmed that it would
definitely be emitted regardless in [0], so this patch simply removes
the line.

This isn't marked "Fixes" because the extraneous btf_iter_num FWD wasn't
causing any issues that I noticed, aside from mild confusion when I
looked through the code.

[0]: https://lore.kernel.org/bpf/25d08207-43e6-36a8-5e0f-47a913d4cda5@linux.dev/

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231013204426.1074286-2-davemarchevsky@fb.com

bpf: Change syscall_nr type to int in struct syscall_tp_t

linux-rt-devel tree contains a patch (b1773eac3f29c ("sched: Add support
for lazy preemption")) that adds an extra member to struct trace_entry.
This causes the offset of args field in struct trace_event_raw_sys_enter
be different from the one in struct syscall_trace_enter:

struct trace_event_raw_sys_enter {
        struct trace_entry         ent;                  /*     0    12 */

        /* XXX last struct has 3 bytes of padding */
        /* XXX 4 bytes hole, try to pack */

        long int                   id;                   /*    16     8 */
        long unsigned int          args[6];              /*    24    48 */
        /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
        char                       __data[];             /*    72     0 */

        /* size: 72, cachelines: 2, members: 4 */
        /* sum members: 68, holes: 1, sum holes: 4 */
        /* paddings: 1, sum paddings: 3 */
        /* last cacheline: 8 bytes */
};

struct syscall_trace_enter {
        struct trace_entry         ent;                  /*     0    12 */

        /* XXX last struct has 3 bytes of padding */

        int                        nr;                   /*    12     4 */
        long unsigned int          args[];               /*    16     0 */

        /* size: 16, cachelines: 1, members: 3 */
        /* paddings: 1, sum paddings: 3 */
        /* last cacheline: 16 bytes */
};

This, in turn, causes perf_event_set_bpf_prog() fail while running bpf
test_profiler testcase because max_ctx_offset is calculated based on the
former struct, while off on the latter:

  10488         if (is_tracepoint || is_syscall_tp) {
  10489                 int off = trace_event_get_offsets(event->tp_event);
  10490
  10491                 if (prog->aux->max_ctx_offset > off)
  10492                         return -EACCES;
  10493         }

What bpf program is actually getting is a pointer to struct
syscall_tp_t, defined in kernel/trace/trace_syscalls.c. This patch fixes
the problem by aligning struct syscall_tp_t with struct
syscall_trace_(enter|exit) and changing the tests to use these structs
to dereference context.

Signed-off-by: Artem Savkov <asavkov@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/bpf/20231013054219.172920-1-asavkov@redhat.com

net/bpf: Avoid unused "sin_addr_len" warning when CONFIG_CGROUP_BPF is not set

It was reported that there is a compiler warning on the unused variable
"sin_addr_len" in af_inet.c when CONFIG_CGROUP_BPF is not set.
This patch is to address it similar to the ipv6 counterpart
in inet6_getname(). It is to "return sin_addr_len;"
instead of "return sizeof(*sin);".

Fixes: fefba7d1ae19 ("bpf: Propagate modified uaddrlen from cgroup sockaddr programs")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/bpf/20231013185702.3993710-1-martin.lau@linux.dev
Closes: https://lore.kernel.org/bpf/20231013114007.2fb09691@canb.auug.org.au/

bpf: Avoid unnecessary audit log for CPU security mitigations

Check cpu_mitigations_off() first to avoid calling capable() if it is off.
This can avoid unnecessary audit log.

Fixes: bc5bc309db45 ("bpf: Inherit system settings for CPU security mitigations")
Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/CAEf4Bza6UVUWqcWQ-66weZ-nMDr+TFU3Mtq=dumZFD-pSqU7Ow@mail.gmail.com/
Link: https://lore.kernel.org/bpf/20231013083916.4199-1-laoar.shao@gmail.com

net: fix IPSTATS_MIB_OUTFORWDATAGRAMS increment after fragment check

Reproduce environment:
network with 3 VM linuxs is connected as below:
VM1<---->VM2(latest kernel 6.5.0-rc7)<---->VM3
VM1: eth0 ip: 192.168.122.207 MTU 1800
VM2: eth0 ip: 192.168.122.208, eth1 ip: 192.168.123.224 MTU 1500
VM3: eth0 ip: 192.168.123.240 MTU 1800

Reproduce:
VM1 send 1600 bytes UDP data to VM3 using tools scapy with flags='DF'.
scapy command:
send(IP(dst="192.168.123.240",flags='DF')/UDP()/str('0'*1600),count=1,
inter=1.000000)

Result:
Before IP data is sent.
----------------------------------------------------------------------
root@qemux86-64:~# cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors
    ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests
    OutDiscards OutNoRoutes ReasmTimeout ReasmReqdss
Ip: 1 64 6 0 2 2 0 0 2 4 0 0 0 0 0 0 0 0 0
......
root@qemux86-64:~#
----------------------------------------------------------------------
After IP data is sent.
----------------------------------------------------------------------
root@qemux86-64:~# cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors
    ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests
    OutDiscards OutNoRoutes ReasmTimeout ReasmReqdss
Ip: 1 64 7 0 2 2 0 0 2 5 0 0 0 0 0 0 0 1 0
......
root@qemux86-64:~#
----------------------------------------------------------------------
ForwDatagrams is always keeping 2 without increment.

Issue description and patch:
ip_exceeds_mtu() in ip_forward() drops this IP datagram because skb len
(1600 sending by scapy) is over MTU(1500 in VM2) if "DF" is set.
According to RFC 4293 "3.2.3. IP Statistics Tables",
  +-------+------>------+----->-----+----->-----+
  | InForwDatagrams (6) | OutForwDatagrams (6)  |
  |                     V                       +->-+ OutFragReqds
  |                 InNoRoutes                  |   | (packets)
  / (local packet (3)                           |   |
  |  IF is that of the address                  |   +--> OutFragFails
  |  and may not be the receiving IF)           |   |    (packets)
the IPSTATS_MIB_OUTFORWDATAGRAMS should be counted before fragment
check.
The existing implementation, instead, would incease the counter after
fragment check: ip_exceeds_mtu() in ipv4 and ip6_pkt_too_big() in ipv6.
So do patch to move IPSTATS_MIB_OUTFORWDATAGRAMS counter to ip_forward()
for ipv4 and ip6_forward() for ipv6.

Test result with patch:
Before IP data is sent.
----------------------------------------------------------------------
root@qemux86-64:~# cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors
    ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests
    OutDiscards OutNoRoutes ReasmTimeout ReasmReqdss
Ip: 1 64 6 0 2 2 0 0 2 4 0 0 0 0 0 0 0 0 0
......
root@qemux86-64:~#
----------------------------------------------------------------------
After IP data is sent.
----------------------------------------------------------------------
root@qemux86-64:~# cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors
    ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests
    OutDiscards OutNoRoutes ReasmTimeout ReasmReqdss
Ip: 1 64 7 0 2 3 0 0 2 5 0 0 0 0 0 0 0 1 0
......
root@qemux86-64:~#
----------------------------------------------------------------------
ForwDatagrams is updated from 2 to 3.

Reviewed-by: Filip Pudak <filip.pudak@windriver.com>
Signed-off-by: Heng Guo <heng.guo@windriver.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20231011015137.27262-1-heng.guo@windriver.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'mlx5-next' of https://git./linux/kernel/git/mellanox/linux

Leon Romanovsky says:

====================
This PR is collected from
https://lore.kernel.org/all/cover.1695296682.git.leon@kernel.org

This series from Patrisious extends mlx5 to support IPsec packet offload
in multiport devices (MPV, see [1] for more details).

These devices have single flow steering logic and two netdev interfaces,
which require extra logic to manage IPsec configurations as they performed
on netdevs.

[1] https://lore.kernel.org/linux-rdma/20180104152544.28919-1-leon@kernel.org/

* 'mlx5-next' of https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
  net/mlx5: Handle IPsec steering upon master unbind/bind
  net/mlx5: Configure IPsec steering for ingress RoCEv2 MPV traffic
  net/mlx5: Configure IPsec steering for egress RoCEv2 MPV traffic
  net/mlx5: Add create alias flow table function to ipsec roce
  net/mlx5: Implement alias object allow and create functions
  net/mlx5: Add alias flow table bits
  net/mlx5: Store devcom pointer inside IPsec RoCE
  net/mlx5: Register mlx5e priv to devcom in MPV mode
  RDMA/mlx5: Send events from IB driver about device affiliation state
  net/mlx5: Introduce ifc bits for migration in a chunk mode

====================

Link: https://lore.kernel.org/r/20231002083832.19746-1-leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'tls-cleanups'

Sabrina Dubroca says:

====================
net: tls: various code cleanups and improvements

This series contains multiple cleanups and simplifications for the
config code of both TLS_SW and TLS_HW.

It also modifies the chcr_ktls driver to use driver_state like all
other drivers, so that we can then make driver_state fixed size
instead of a flex array always allocated to that same fixed size. As
reported by Gustavo A. R. Silva, the way chcr_ktls misuses
driver_state irritates GCC [1].

Patches 1 and 2 are follow-ups to my previous cipher_desc series.

[1] https://lore.kernel.org/netdev/ZRvzdlvlbX4+eIln@work/
====================

Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

tls: use fixed size for tls_offload_context_{tx,rx}.driver_state

driver_state is a flex array, but is always allocated by the tls core
to a fixed size (TLS_DRIVER_STATE_SIZE_{TX,RX}). Simplify the code by
making that size explicit so that sizeof(struct
tls_offload_context_{tx,rx}) works.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

chcr_ktls: use tls_offload_context_tx and driver_state like other drivers

chcr_ktls uses the space reserved in driver_state by
tls_set_device_offload, but makes up into own wrapper around
tls_offload_context_tx instead of accessing driver_state via the
__tls_driver_ctx helper.

In this driver, driver_state is only used to store a pointer to a
larger context struct allocated by the driver.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>