git.monstr.eu Git - linux-2.6-microblaze.git/log

tcp: use semicolons rather than commas to separate statements

Replace commas with semicolons. Commas introduce unnecessary
variability in the code structure and are hard to see. What is done
is essentially described by the following Coccinelle semantic patch
(http://coccinelle.lip6.fr/):

// <smpl>
@@ expression e1,e2; @@
e1
-,
+;
e2
... when any
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Link: https://lore.kernel.org/r/1602412498-32025-4-git-send-email-Julia.Lawall@inria.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mscc: ocelot: remove duplicate ocelot_port_dev_check

A helper for checking whether a net_device belongs to mscc_ocelot
already existed and did not need to be rewritten. Use it.

Fixes: 319e4dd11a20 ("net: mscc: ocelot: introduce conversion helpers between port and netdev")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20201011092041.3535101-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'macb-support-the-2-deep-Tx-queue-on-at91'

Willy Tarreau says:

====================
macb: support the 2-deep Tx queue on at91

while running some tests on my Breadbee board, I noticed poor network
Tx performance. I had a look at the driver (macb, at91ether variant)
and noticed that at91ether_start_xmit() immediately stops the queue
after sending a frame and waits for the interrupt to restart the queue,
causing a dead time after each packet is sent.

The AT91RM9200 datasheet states that the controller supports two frames,
one being sent and the other one being queued, so I performed minimal
changes to support this. The transmit performance on my board has
increased by 50% on medium-sized packets (HTTP traffic), and with large
packets I can now reach line rate.

Since this driver is shared by various platforms, I tried my best to
isolate and limit the changes as much as possible and I think it's pretty
reasonable as-is. I've run extensive tests and couldn't meet any
unexpected situation (no stall, overflow nor lockup).

There are 3 patches in this series. The first one adds the missing
interrupt flag for RM9200 (TBRE, indicating the tx buffer is willing
to take a new packet). The second one replaces the single skb with a
2-array and uses only index 0. It does no other change, this is just
to prepare the code for the third one. The third one implements the
queue. Packets are added at the tail of the queue, the queue is
stopped at 2 packets and the interrupt releases 0, 1 or 2 depending
on what the transmit status register reports.
====================

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

macb: support the two tx descriptors on at91rm9200

The at91rm9200 variant used by a few chips including the MSC313 supports
two Tx descriptors (one frame being serialized and another one queued).
However the driver only implemented a single one, which adds a dead time
after each transfer to receive and process the interrupt and wake the
queue up, preventing from reaching line rate.

This patch implements a very basic 2-deep queue to address this limitation.
The tests run on a Breadbee board equipped with an MSC313E show that at
1 GHz, HTTP traffic on medium-sized objects (45kB) was limited to exactly
50 Mbps before this patch, and jumped to 76 Mbps with this patch. And tests
on a single TCP stream with an MTU of 576 jump from 10kpps to 15kpps. With
1500 byte packets it's now possible to reach line rate versus 75 Mbps
before.

Cc: Nicolas Ferre <nicolas.ferre@microchip.com>
Cc: Claudiu Beznea <claudiu.beznea@microchip.com>
Cc: Daniel Palmer <daniel@0x0f.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20201011090944.10607-4-w@1wt.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

macb: prepare at91 to use a 2-frame TX queue

The RM9200 supports one frame being sent while another one is waiting in
queue. This avoids the dead time that follows the emission of a frame
and which prevents one from reaching line speed.

Right now the driver supports only a single skb, so we'll first replace
the rm9200-specific skb info with an array of two macb_tx_skb (already
used by other drivers). This patch only moves the skb_length to
txq[0].size and skb_physaddr to skb[0].mapping but doesn't perform any
other change. It already uses [desc] in order to minimize future changes.

Cc: Nicolas Ferre <nicolas.ferre@microchip.com>
Cc: Claudiu Beznea <claudiu.beznea@microchip.com>
Cc: Daniel Palmer <daniel@0x0f.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20201011090944.10607-3-w@1wt.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

macb: add RM9200's interrupt flag TBRE

Transmit Buffer Register Empty replaces TXERR on RM9200 and signals the
sender may try to send again becase the last queued frame is no longer
in queue (being transmitted or already transmitted).

Cc: Nicolas Ferre <nicolas.ferre@microchip.com>
Cc: Claudiu Beznea <claudiu.beznea@microchip.com>
Cc: Daniel Palmer <daniel@0x0f.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20201011090944.10607-2-w@1wt.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge git://git./linux/kernel/git/bpf/bpf-next

Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-10-12

The main changes are:

1) The BPF verifier improvements to track register allocation pattern, from Alexei and Yonghong.

2) libbpf relocation support for different size load/store, from Andrii.

3) bpf_redirect_peer() helper and support for inner map array with different max_entries, from Daniel.

4) BPF support for per-cpu variables, form Hao.

5) sockmap improvements, from John.
====================

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge git://git./linux/kernel/git/pablo/nf-next

Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for net-next:

1) Inspect the reply packets coming from DR/TUN and refresh connection
state and timeout, from longguang yue and Julian Anastasov.

2) Series to add support for the inet ingress chain type in nf_tables.
====================

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'bnxt_en-Updates-for-net-next'

Michael Chan says:

====================
bnxt_en: Updates for net-next.

This series contains these main changes:

1. Change of default message level to enable more logging.
2. Some cleanups related to processing async events from firmware.
3. Allow online ethtool selftest on multi-function PFs.
4. Return stored firmware version information to devlink.

v2:
Patch 3: Change bnxt_reset_task() to silent mode.
Patch 8 & 9: Ensure we copy NULL terminated fw strings to devlink.
Patch 8 & 9: Return directly after the last bnxt_dl_info_put() call.
Patch 9: If FW call to get stored dev info fails, return success to
devlink without the stored versions.
====================

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Add stored FW version info to devlink info_get cb.

This patch adds FW versions stored in the flash to devlink info_get
callback. Return the correct fw.psid running version using the
newly added bp->nvm_cfg_ver.

v2:
Ensure stored pkg_name string is NULL terminated when copied to
devlink.

Return directly from the last call to bnxt_dl_info_put().

If the FW call to get stored version fails for any reason, return
success immediately to devlink without the stored versions.

Reviewed-by: Andy Gospodarek <gospo@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/1602493854-29283-10-git-send-email-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Refactor bnxt_dl_info_get().

Add a new function bnxt_dl_info_put() to simplify the code, as there
are more stored firmware version fields to be added in the next patch.

Also, rename fw_ver variable name to ncsi_ver for better naming while
copying to devlink info_get cb.

v2:
Ensure active_pkg_name string is NULL terminated when copied to
devlink.

Return directly from the last call to bnxt_dl_info_put().

Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Andy Gospodarek <gospo@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/1602493854-29283-9-git-send-email-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Add bnxt_hwrm_nvm_get_dev_info() to query NVM info.

Add a new bnxt_hwrm_nvm_get_dev_info() to query firmware version
information via NVM_GET_DEV_INFO firmware command. Use it to
get the running version of the NVM configuration information.

This new function will also be used in subsequent patches to get the
stored firmware versions.

Reviewed-by: Andy Gospodarek <gospo@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/1602493854-29283-8-git-send-email-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Log unknown link speed appropriately.

If the VF virtual link is set to always enabled, the speed may be
unknown when the physical link is down. The driver currently logs
the link speed as 4294967295 Mbps which is SPEED_UNKNOWN. Modify
the link up log message as "speed unknown" which makes more sense.

Reviewed-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/1602493854-29283-7-git-send-email-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Log event_data1 and event_data2 when handling RESET_NOTIFY event.

Log these values that contain useful firmware state information.

Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Reviewed-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/1602493854-29283-6-git-send-email-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Simplify bnxt_async_event_process().

event_data1 and event_data2 are used when processing most events.
Store these in local variables at the beginning of the function to
simplify many of the case statements.

Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/1602493854-29283-5-git-send-email-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Set driver default message level.

Currently, bp->msg_enable has default value of 0. It is more useful
to have the commonly used NETIF_MSG_DRV and NETIF_MSG_HW enabled by
default.

v2: Change the fall back bnxt_reset_task() inside bnxt_rx_ring_reset()
to silent mode. With older fw, we would take the fall back path and
it would be very noisy.

Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Reviewed-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/1602493854-29283-4-git-send-email-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Enable online self tests for multi-host/NPAR mode.

Online self tests are not disruptive and can be run in NPAR mode
and in multi-host NIC as well.

Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/1602493854-29283-3-git-send-email-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Return -EROFS to user space, if NVM writes are not permitted.

If NVRAM resources are locked, NVM writes are not permitted. In such
scenarios, firmware returns HWRM_ERR_CODE_RESOURCE_LOCKED error to
firmware commands.

Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/1602493854-29283-2-git-send-email-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'linux-can-next-for-5.10-20201012' of git://git./linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
linux-can-next-for-5.10-20201012

Both patches are by Oliver Hartkopp, the first one addresses Jakub's review
comments of the ISOTP protocol, the other one removes version strings from
various CAN protocols.
====================

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

cx82310_eth: use netdev_err instead of dev_err

Use netdev_err for better device identification in syslog.

Signed-off-by: Ondrej Zary <linux@zary.sk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

cx82310_eth: re-enable ethernet mode after router reboot

When the router is rebooted without a power cycle, the USB device
remains connected but its configuration is reset. This results in
a non-working ethernet connection with messages like this in syslog:
usb 2-2: RX packet too long: 65535 B

Re-enable ethernet mode when receiving a packet with invalid size of
0xffff.

Signed-off-by: Ondrej Zary <linux@zary.sk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

can: remove obsolete version strings

As pointed out by Jakub Kicinski here:
http://lore.kernel.org/r/20201009175751.5c54097f@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com
this patch removes the obsolete version information of the different
CAN protocols and the AF_CAN core module.

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20201012074354.25839-2-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: isotp: implement cleanups / improvements from review

As pointed out by Jakub Kicinski here:
http://lore.kernel.org/r/20201009175751.5c54097f@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com
this patch addresses the remarked issues:

- remove empty line in comment
- remove default=y for CAN_ISOTP in Kconfig
- make use of pr_notice_once()
- use GFP_ATOMIC instead of gfp_any() in soft hrtimer context

The version strings in the CAN subsystem are removed by a separate patch.

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20201012074354.25839-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

Merge branch 'bpf, sockmap: allow verdict only sk_skb progs'

John Fastabend says:

====================

This allows a sockmap sk_skb verdict programs to run without a parser. For
some use cases, such as verdict program that support streaming data or a
l3/l4 proxy that does not use data in packet, loading the nop parser
'return skb->len' is an extra unnecessary complexity. With this series we
simply call the verdict program directly from data_ready instead of
bouncing through the strparser logic.

Patches 1,2 do the lifting on the sockmap side then patches 3,4 add the
selftests.

This applies on top of the series here,

sockmap/sk_skb program memory acct fixes
https://patchwork.ozlabs.org/project/netdev/list/?series=206975

it will apply without the above series cleanly, but will have an incorrect
memory accounting causing a failure in ./test_sockmap. I could have left
it so the series passed without above series, but it seemed odd to have
it out there and then require yet another patch to fix it up here.

Thanks.
---
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf, selftests: Add three new sockmap tests for verdict only programs

Here we add three new tests for sockmap to test having a verdict program
without setting the parser program.

The first test covers the most simply case,

   sender         proxy_recv proxy_send      recv
     |                |                       |
     |              verdict -----+            |
     |                |          |            |
     +----------------+          +------------+

We load the verdict program on the proxy_recv socket without a
parser program. It then does a redirect into the send path of the
proxy_send socket using sendpage_locked().

Next we test the drop case to ensure if we kfree_skb as a result of
the verdict program everything behaves as expected.

Next we test the same configuration above, but with ktls and a
redirect into socket ingress queue. Shown here

   tls                                       tls
   sender         proxy_recv proxy_send      recv
     |                |                       |
     |              verdict ------------------+
     |                |      redirect_ingress
     +----------------+

Also to set up ping/pong test

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/160239302638.8495.17125996694402793471.stgit@john-Precision-5820-Tower

bpf, selftests: Add option to test_sockmap to omit adding parser program

Add option to allow running without a parser program in place. To test
with ping/pong program use,

# test_sockmap -t ping --txmsg_omit_skb_parser

this will send packets between two socket bouncing through a proxy
socket that does not use a parser program.

   (ping)                                    (pong)
   sender         proxy_recv proxy_send      recv
     |                |                       |
     |              verdict -----+            |
     |                |          |            |
     +----------------+          +------------+

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/160239300387.8495.11908295143121563076.stgit@john-Precision-5820-Tower

bpf, sockmap: Allow skipping sk_skb parser program

Currently, we often run with a nop parser namely one that just does
this, 'return skb->len'. This happens when either our verdict program
can handle streaming data or it is only looking at socket data such
as IP addresses and other metadata associated with the flow. The second
case is common for a L3/L4 proxy for instance.

So lets allow loading programs without the parser then we can skip
the stream parser logic and avoid having to add a BPF program that
is effectively a nop.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/160239297866.8495.13345662302749219672.stgit@john-Precision-5820-Tower

bpf, sockmap: Check skb_verdict and skb_parser programs explicitly

We are about to allow skb_verdict to run without skb_parser programs
as a first step change code to check each program type specifically.
This should be a mechanical change without any impact to actual result.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/160239294756.8495.5796595770890272219.stgit@john-Precision-5820-Tower

Merge branch 'sockmap/sk_skb program memory acct fixes'

John Fastabend says:

====================

Users of sockmap and skmsg trying to build proxys and other tools
have pointed out to me the error handling can be problematic. If
the proxy is under-provisioned and/or the BPF admin does not have
the ability to update/modify memory provisions on the sockets
its possible data may be dropped. For some things we have retries
so everything works out OK, but for most things this is likely
not great. And things go bad.

The original design dropped memory accounting on the receive
socket as early as possible. We did this early in sk_skb
handling and then charged it to the redirect socket immediately
after running the BPF program.

But, this design caused a fundamental problem. Namely, what should we do
if we redirect to a socket that has already reached its socket memory
limits. For proxy use cases the network admin can tune memory limits.
But, in general we punted on this problem and told folks to simply make
your memory limits high enough to handle your workload. This is not a
really good answer. When deploying into environments where we expect this
to be transparent its no longer the case because we need to tune params.
In fact its really only viable in cases where we have fine grained
control over the application. For example a proxy redirecting from an
ingress socket to an egress socket. The result is I get bug
reports because its surprising for one, but more importantly also breaks
some use cases. So lets fix it.

This series cleans up the different cases so that in many common
modes, such as passing packet up to receive socket, we can simply
use the underlying assumption that the TCP stack already has done
memory accounting.

Next instead of trying to do memory accounting against the socket
we plan to redirect into we keep memory accounting on the receive
socket until the skb can be put on the redirect socket. This means
if we do an egress redirect to a socket and sock_writable() returns
EAGAIN we can requeue the skb on the workqueue and try again. The
same scenario plays out for ingress. If the skb can not be put on
the receive queue of the redirect socket than we simply requeue and
retry. In both cases memory is still accounted for against the
receiving socket.

This also handles head of line blocking. With the above scheme the
skb is on a queue associated with the socket it will be sent/recv'd
on, but the memory accounting is against the received socket. This
means the receive socket can advance to the next skb and avoid head
of line blocking. At least until its receive memory on the socket
runs out. This will put some maximum size on the amount of data any
socket can enqueue giving us bounds on the skb lists so they can't grow
indefinitely.

Overall I think this is a win. Tested with test_sockmap.

These are fixes, but I tagged it for bpf-next considering we are
at -rc8.

v1->v2: Fix uninitialized/unused variables (kernel test robot)
v2->v3: fix typo in patch2 err=0 needs to be <0 so use err=-EIO
---
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf, sockmap: Add memory accounting so skbs on ingress lists are visible

Move skb->sk assignment out of sk_psock_bpf_run() and into individual
callers. Then we can use proper skb_set_owner_r() call to assign a
sk to a skb. This improves things by also charging the truesize against
the sockets sk_rmem_alloc counter. With this done we get some accounting
in place to ensure the memory associated with skbs on the workqueue are
still being accounted for somewhere. Finally, by using skb_set_owner_r
the destructor is setup so we can just let the normal skb_kfree logic
recover the memory. Combined with previous patch dropping skb_orphan()
we now can recover from memory pressure and maintain accounting.

Note, we will charge the skbs against their originating socket even
if being redirected into another socket. Once the skb completes the
redirect op the kfree_skb will give the memory back. This is important
because if we charged the socket we are redirecting to (like it was
done before this series) the sock_writeable() test could fail because
of the skb trying to be sent is already charged against the socket.

Also TLS case is special. Here we wait until we have decided not to
simply PASS the packet up the stack. In the case where we PASS the
packet up the stack we already have an skb which is accounted for on
the TLS socket context.

For the parser case we continue to just set/clear skb->sk this is
because the skb being used here may be combined with other skbs or
turned into multiple skbs depending on the parser logic. For example
the parser could request a payload length greater than skb->len so
that the strparser needs to collect multiple skbs. At any rate
the final result will be handled in the strparser recv callback.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/160226867513.5692.10579573214635925960.stgit@john-Precision-5820-Tower

bpf, sockmap: Remove skb_orphan and let normal skb_kfree do cleanup

Calling skb_orphan() is unnecessary in the strp rcv handler because the skb
is from a skb_clone() in __strp_recv. So it never has a destructor or a
sk assigned. Plus its confusing to read because it might hint to the reader
that the skb could have an sk assigned which is not true. Even if we did
have an sk assigned it would be cleaner to simply wait for the upcoming
kfree_skb().

Additionally, move the comment about strparser clone up so its closer to
the logic it is describing and add to it so that it is more complete.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/160226865548.5692.9098315689984599579.stgit@john-Precision-5820-Tower

bpf, sockmap: Remove dropped data on errors in redirect case

In the sk_skb redirect case we didn't handle the case where we overrun
the sk_rmem_alloc entry on ingress redirect or sk_wmem_alloc on egress.
Because we didn't have anything implemented we simply dropped the skb.
This meant data could be dropped if socket memory accounting was in
place.

This fixes the above dropped data case by moving the memory checks
later in the code where we actually do the send or recv. This pushes
those checks into the workqueue and allows us to return an EAGAIN error
which in turn allows us to try again later from the workqueue.

Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/160226863689.5692.13861422742592309285.stgit@john-Precision-5820-Tower

bpf, sockmap: Remove skb_set_owner_w wmem will be taken later from sendpage

The skb_set_owner_w is unnecessary here. The sendpage call will create a
fresh skb and set the owner correctly from workqueue. Its also not entirely
harmless because it consumes cycles, but also impacts resource accounting
by increasing sk_wmem_alloc. This is charging the socket we are going to
send to for the skb, but we will put it on the workqueue for some time
before this happens so we are artifically inflating sk_wmem_alloc for
this period. Further, we don't know how many skbs will be used to send the
packet or how it will be broken up when sent over the new socket so
charging it with one big sum is also not correct when the workqueue may
break it up if facing memory pressure. Seeing we don't know how/when
this is going to be sent drop the early accounting.

A later patch will do proper accounting charged on receive socket for
the case where skbs get enqueued on the workqueue.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/160226861708.5692.17964237936462425136.stgit@john-Precision-5820-Tower

bpf, sockmap: On receive programs try to fast track SK_PASS ingress

When we receive an skb and the ingress skb verdict program returns
SK_PASS we currently set the ingress flag and put it on the workqueue
so it can be turned into a sk_msg and put on the sk_msg ingress queue.
Then finally telling userspace with data_ready hook.

Here we observe that if the workqueue is empty then we can try to
convert into a sk_msg type and call data_ready directly without
bouncing through a workqueue. Its a common pattern to have a recv
verdict program for visibility that always returns SK_PASS. In this
case unless there is an ENOMEM error or we overrun the socket we
can avoid the workqueue completely only using it when we fall back
to error cases caused by memory pressure.

By doing this we eliminate another case where data may be dropped
if errors occur on memory limits in workqueue.

Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/160226859704.5692.12929678876744977669.stgit@john-Precision-5820-Tower

bpf, sockmap: Skb verdict SK_PASS to self already checked rmem limits

For sk_skb case where skb_verdict program returns SK_PASS to continue to
pass packet up the stack, the memory limits were already checked before
enqueuing in skb_queue_tail from TCP side. So, lets remove the extra checks
here. The theory is if the TCP stack believes we have memory to receive
the packet then lets trust the stack and not double check the limits.

In fact the accounting here can cause a drop if sk_rmem_alloc has increased
after the stack accepted this packet, but before the duplicate check here.
And worse if this happens because TCP stack already believes the data has
been received there is no retransmit.

Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/160226857664.5692.668205469388498375.stgit@john-Precision-5820-Tower

netfilter: flowtable: reduce calls to pskb_may_pull()

Make two unfront calls to pskb_may_pull() to linearize the network and
transport header.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add inet ingress support

This patch adds a new ingress hook for the inet family. The inet ingress
hook emulates the IP receive path code, therefore, unclean packets are
drop before walking over the ruleset in this basechain.

This patch also introduces the nft_base_chain_netdev() helper function
to check if this hook is bound to one or more devices (through the hook
list infrastructure). This check allows to perform the same handling for
the inet ingress as it would be a netdev ingress chain from the control
plane.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: add inet ingress support

This patch adds the NF_INET_INGRESS pseudohook for the NFPROTO_INET
family. This is a mapping this new hook to the existing NFPROTO_NETDEV
and NF_NETDEV_INGRESS hook. The hook does not guarantee that packets are
inet only, users must filter out non-ip traffic explicitly.

This infrastructure makes it easier to support this new hook in nf_tables.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: add nf_ingress_hook() helper function

Add helper function to check if this is an ingress hook.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: add nf_static_key_{inc,dec}

Add helper functions increment and decrement the hook static keys.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

ipvs: inspect reply packets from DR/TUN real servers

Just like for MASQ, inspect the reply packets coming from DR/TUN
real servers and alter the connection's state and timeout
according to the protocol.

It's ipvs's duty to do traffic statistic if packets get hit,
no matter what mode it is.

Signed-off-by: longguang.yue <bigclouds@163.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

bpf: Migrate from patchwork.ozlabs.org to patchwork..

Move the bpf/bpf-next patch processing queue to patchwork.kernel.org.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20201011200149.66537-1-alexei.starovoitov@gmail.com

bpf: Always return target ifindex in bpf_fib_lookup

The bpf_fib_lookup() helper performs a neighbour lookup for the destination
IP and returns BPF_FIB_LKUP_NO_NEIGH if this fails, with the expectation
that the BPF program will pass the packet up the stack in this case.
However, with the addition of bpf_redirect_neigh() that can be used instead
to perform the neighbour lookup, at the cost of a bit of duplicated work.

For that we still need the target ifindex, and since bpf_fib_lookup()
already has that at the time it performs the neighbour lookup, there is
really no reason why it can't just return it in any case. So let's just
always return the ifindex if the FIB lookup itself succeeds.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Ahern <dsahern@gmail.com>
Link: https://lore.kernel.org/bpf/20201009184234.134214-1-toke@redhat.com

Merge branch 'samples: bpf: Refactor XDP programs with libbpf'

"Daniel T. Lee" says:

====================
To avoid confusion caused by the increasing fragmentation of the BPF
Loader program, this commit would like to convert the previous bpf_load
loader with the libbpf loader.

Thanks to libbpf's bpf_link interface, managing the tracepoint BPF
program is much easier. bpf_program__attach_tracepoint manages the
enable of tracepoint event and attach of BPF programs to it with a
single interface bpf_link, so there is no need to manage event_fd and
prog_fd separately.

And due to addition of generic bpf_program__attach() to libbpf, it is
now possible to attach BPF programs with __attach() instead of
explicitly calling __attach_<type>().

This patchset refactors xdp_monitor with using this libbpf API, and the
bpf_load is removed and migrated to libbpf. Also, attach_tracepoint()
is replaced with the generic __attach() method in xdp_redirect_cpu.
Moreover, maps in kern program have been converted to BTF-defined map.
---
Changes in v2:
- added cleanup logic for bpf_link and bpf_object in xdp_monitor
- program section match with bpf_program__is_<type> instead of strncmp
- revert BTF key/val type to default of BPF_MAP_TYPE_PERF_EVENT_ARRAY
- split increment into seperate satement
- refactor pointer array initialization
- error code cleanup
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

samples: bpf: Refactor XDP kern program maps with BTF-defined map

Most of the samples were converted to use the new BTF-defined MAP as
they moved to libbpf, but some of the samples were missing.

Instead of using the previous BPF MAP definition, this commit refactors
xdp_monitor and xdp_sample_pkts_kern MAP definition with the new
BTF-defined MAP format.

Also, this commit removes the max_entries attribute at PERF_EVENT_ARRAY
map type. The libbpf's bpf_object__create_map() will automatically
set max_entries to the maximum configured number of CPUs on the host.

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20201010181734.1109-4-danieltimlee@gmail.com

samples: bpf: Replace attach_tracepoint() to attach() in xdp_redirect_cpu

>From commit d7a18ea7e8b6 ("libbpf: Add generic bpf_program__attach()"),
for some BPF programs, it is now possible to attach BPF programs
with __attach() instead of explicitly calling __attach_<type>().

This commit refactors the __attach_tracepoint() with libbpf's generic
__attach() method. In addition, this refactors the logic of setting
the map FD to simplify the code. Also, the missing removal of
bpf_load.o in Makefile has been fixed.

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20201010181734.1109-3-danieltimlee@gmail.com

samples: bpf: Refactor xdp_monitor with libbpf

To avoid confusion caused by the increasing fragmentation of the BPF
Loader program, this commit would like to change to the libbpf loader
instead of using the bpf_load.

Thanks to libbpf's bpf_link interface, managing the tracepoint BPF
program is much easier. bpf_program__attach_tracepoint manages the
enable of tracepoint event and attach of BPF programs to it with a
single interface bpf_link, so there is no need to manage event_fd and
prog_fd separately.

This commit refactors xdp_monitor with using this libbpf API, and the
bpf_load is removed and migrated to libbpf.

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20201010181734.1109-2-danieltimlee@gmail.com

Merge branch 'Offload-tc-vlan-mangle-to-mscc_ocelot-switch'

Vladimir Oltean says:

====================
Offload tc-vlan mangle to mscc_ocelot switch

This series offloads one more action to the VCAP IS1 ingress TCAM, which
is to change the classified VLAN for packets, according to the VCAP IS1
keys (VLAN, source MAC, source IP, EtherType, etc).
====================

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: mscc: ocelot: add test for VLAN modify action

Create a test that changes a VLAN ID from 200 to 300.

We also need to modify the preferences of the filters installed for the
other rules so that they are unique, because we now install the "tc-vlan
modify" filter in VCAP IS1 only temporarily, and we need to perform the
deletion by filter preference number.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: tag_ocelot: use VLAN information from tagging header when available

When the Extraction Frame Header contains a valid classified VLAN, use
that instead of the VLAN header present in the packet.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mscc: ocelot: offload VLAN mangle action to VCAP IS1

The VCAP_IS1_ACT_VID_REPLACE_ENA action, from the VCAP IS1 ingress TCAM,
changes the classified VLAN.

We are only exposing this ability for switch ports that are under VLAN
aware bridges. This is because in standalone ports mode and under a
bridge with vlan_filtering=0, the ocelot driver configures the switch to
operate as VLAN-unaware, so the classified VLAN is not derived from the
802.1Q header from the packet, but instead is always equal to the
port-based VLAN ID of the ingress port. We _can_ still change the
classified VLAN for packets when operating in this mode, but the end
result will most likely be a drop, since both the ingress and the egress
port need to be members of the modified VLAN. And even if we install the
new classified VLAN into the VLAN table of the switch, the result would
still not be as expected: we wouldn't see, on the output port, the
modified VLAN tag, but the original one, even though the classified VLAN
was indeed modified. This is because of how the hardware works: on
egress, what is pushed to the frame is a "port tag", which gives us the
following options:

- Tag all frames with port tag (derived from the classified VLAN)
- Tag all frames with port tag, except if the classified VLAN is 0 or
equal to the native VLAN of the egress port
- No port tag

Needless to say, in VLAN-unaware mode we are disabling the port tag.
Otherwise, the existing VLAN tag would be ignored, and a second VLAN
tag (the port tag), holding the classified VLAN, would be pushed
(instead of replacing the existing 802.1Q tag). This is definitely not
what the user wanted when installing a "vlan modify" action.

So it is simply not worth bothering with VLAN modify rules under other
configurations except when the ports are fully VLAN-aware.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'enetc-Migrate-to-PHYLINK-and-PCS_LYNX'

Claudiu Manoil says:

====================
enetc: Migrate to PHYLINK and PCS_LYNX

Transitioning the enetc driver from phylib to phylink.
Offloading the serdes configuration to the PCS_LYNX
module is a mandatory part of this transition. Aiming
for a cleaner, more maintainable design, and better
code reuse.
The first 2 patches are clean up prerequisites.

Tested on a p1028rdb board.

v2: validate() explicitly rejects now all interface modes not
supported by the driver instead of relying on the device tree
to provide only supported interfaces, and dropped redundant
activation of pcs_poll (addressing Ioana's findings)
====================

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

enetc: Migrate to PHYLINK and PCS_LYNX

This is a methodical transition of the driver from phylib
to phylink, following the guidelines from sfp-phylink.rst.
The MAC register configurations based on interface mode
were moved from the probing path to the mac_config() hook.
MAC enable and disable commands (enabling Rx and Tx paths
at MAC level) were also extracted and assigned to their
corresponding phylink hooks.
As part of the migration to phylink, the serdes configuration
from the driver was offloaded to the PCS_LYNX module,
introduced in commit 0da4c3d393e4 ("net: phy: add Lynx PCS module"),
the PCS_LYNX module being a mandatory component required to
make the enetc driver work with phylink.

Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Reviewed-by: Ioana Ciornei <ioana.cionei@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

arm64: dts: fsl-ls1028a-rdb: Specify in-band mode for ENETC port 0

As part of the transition of the enetc ethernet driver from phylib
to phylink, the in-band operation mode of the SGMII interface
from enetc port 0 needs to be specified explicitly for phylink.

Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

enetc: Clean up serdes configuration

Decouple internal mdio bus creation from serdes
configuration, as a prerequisite to offloading
serdes configuration to a different module.
Group together mdio bus creation routines, cleanup.

Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

enetc: Clean up MAC and link configuration

Decouple level MAC configuration based on phy interface type
from general port configuration.
Group together MAC and link configuration code.
Decouple external mdio bus creation from interface type
parsing. No longer return an (unhandled) error code when
phy_node not found, use phy_node to indicate whether the
port has a phy or not. No longer fall-through when serdes
configuration fails for the link modes that require
internal link configuration.

Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'Follow-up BPF helper improvements'

Daniel Borkmann says:

====================

This series addresses most of the feedback [0] that was to be followed
up from the last series, that is, UAPI helper comment improvements and
getting rid of the ifindex obj file hacks in the selftest by using a
BPF map instead. The __sk_buff data/data_end pointer work, I'm planning
to do in a later round as well as the mem*() BPF improvements we have
in Cilium for libbpf. Next, the series adds two features, i) a helper
called redirect_peer() to improve latency on netns switch, and ii) to
allow map in map with dynamic inner array map sizes. Selftests for each
are added as well. For details, please check individual patches, thanks!

  [0] https://lore.kernel.org/bpf/cover.1601477936.git.daniel@iogearbox.net/

v5 -> v6:
  - Going with Andrii's suggestion to make the misconfigured verifier
    test more robust, and only probe on -EOPNOTSUPP (Andrii)
v4 -> v5:
  - Replace cnt == -EOPNOTSUPP check with cnt < 0; I've used < 0
    here as I think it's useful to keep the existing cnt == 0 ||
    cnt >= ARRAY_SIZE(insn_buf) for error detection (Andrii)
v3 -> v4:
  - Rename new array map flag to BPF_F_INNER_MAP (Alexei)
v2 -> v3:
  - Remove tab that slipped into uapi helper desc (Jakub)
  - Rework map in map for array to error from map_gen_lookup (Andrii)
v1 -> v2:
  - Fixed selftest comment wrt inner1/inner2 value (Yonghong)
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf, selftests: Add redirect_peer selftest

Extend the test_tc_redirect test and add a small test that exercises the new
redirect_peer() helper for the IPv4 and IPv6 case.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201010234006.7075-7-daniel@iogearbox.net

bpf, selftests: Make redirect_neigh test more extensible

Rename into test_tc_redirect.sh and move setup and test code into separate
functions so they can be reused for newly added tests in here. Also remove
the crude hack to override ifindex inside the object file via xxd and sed
and just use a simple map instead. Map given iproute2 does not support BTF
fully and therefore neither global data at this point.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20201010234006.7075-6-daniel@iogearbox.net

bpf, selftests: Add test for different array inner map size

Extend the "diff_size" subtest to also include a non-inlined array map variant
where dynamic inner #elems are possible.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20201010234006.7075-5-daniel@iogearbox.net

bpf: Allow for map-in-map with dynamic inner array map entries

Recent work in f4d05259213f ("bpf: Add map_meta_equal map ops") and 134fede4eecf
("bpf: Relax max_entries check for most of the inner map types") added support
for dynamic inner max elements for most map-in-map types. Exceptions were maps
like array or prog array where the map_gen_lookup() callback uses the maps'
max_entries field as a constant when emitting instructions.

We recently implemented Maglev consistent hashing into Cilium's load balancer
which uses map-in-map with an outer map being hash and inner being array holding
the Maglev backend table for each service. This has been designed this way in
order to reduce overall memory consumption given the outer hash map allows to
avoid preallocating a large, flat memory area for all services. Also, the
number of service mappings is not always known a-priori.

The use case for dynamic inner array map entries is to further reduce memory
overhead, for example, some services might just have a small number of back
ends while others could have a large number. Right now the Maglev backend table
for small and large number of backends would need to have the same inner array
map entries which adds a lot of unneeded overhead.

Dynamic inner array map entries can be realized by avoiding the inlined code
generation for their lookup. The lookup will still be efficient since it will
be calling into array_map_lookup_elem() directly and thus avoiding retpoline.
The patch adds a BPF_F_INNER_MAP flag to map creation which therefore skips
inline code generation and relaxes array_map_meta_equal() check to ignore both
maps' max_entries. This also still allows to have faster lookups for map-in-map
when BPF_F_INNER_MAP is not specified and hence dynamic max_entries not needed.

Example code generation where inner map is dynamic sized array:

  # bpftool p d x i 125
  int handle__sys_enter(void * ctx):
  ; int handle__sys_enter(void *ctx)
     0: (b4) w1 = 0
  ; int key = 0;
     1: (63) *(u32 *)(r10 -4) = r1
     2: (bf) r2 = r10
  ;
     3: (07) r2 += -4
  ; inner_map = bpf_map_lookup_elem(&outer_arr_dyn, &key);
     4: (18) r1 = map[id:468]
     6: (07) r1 += 272
     7: (61) r0 = *(u32 *)(r2 +0)
     8: (35) if r0 >= 0x3 goto pc+5
     9: (67) r0 <<= 3
    10: (0f) r0 += r1
    11: (79) r0 = *(u64 *)(r0 +0)
    12: (15) if r0 == 0x0 goto pc+1
    13: (05) goto pc+1
    14: (b7) r0 = 0
    15: (b4) w6 = -1
  ; if (!inner_map)
    16: (15) if r0 == 0x0 goto pc+6
    17: (bf) r2 = r10
  ;
    18: (07) r2 += -4
  ; val = bpf_map_lookup_elem(inner_map, &key);
    19: (bf) r1 = r0                               | No inlining but instead
    20: (85) call array_map_lookup_elem#149280     | call to array_map_lookup_elem()
  ; return val ? *val : -1;                        | for inner array lookup.
    21: (15) if r0 == 0x0 goto pc+1
  ; return val ? *val : -1;
    22: (61) r6 = *(u32 *)(r0 +0)
  ; }
    23: (bc) w0 = w6
    24: (95) exit

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20201010234006.7075-4-daniel@iogearbox.net

bpf: Add redirect_peer helper

Add an efficient ingress to ingress netns switch that can be used out of tc BPF
programs in order to redirect traffic from host ns ingress into a container
veth device ingress without having to go via CPU backlog queue [0]. For local
containers this can also be utilized and path via CPU backlog queue only needs
to be taken once, not twice. On a high level this borrows from ipvlan which does
similar switch in __netif_receive_skb_core() and then iterates via another_round.
This helps to reduce latency for mentioned use cases.

Pod to remote pod with redirect(), TCP_RR [1]:

  # percpu_netperf 10.217.1.33
          RT_LATENCY:         122.450         (per CPU:         122.666         122.401         122.333         122.401 )
        MEAN_LATENCY:         121.210         (per CPU:         121.100         121.260         121.320         121.160 )
      STDDEV_LATENCY:         120.040         (per CPU:         119.420         119.910         125.460         115.370 )
         MIN_LATENCY:          46.500         (per CPU:          47.000          47.000          47.000          45.000 )
         P50_LATENCY:         118.500         (per CPU:         118.000         119.000         118.000         119.000 )
         P90_LATENCY:         127.500         (per CPU:         127.000         128.000         127.000         128.000 )
         P99_LATENCY:         130.750         (per CPU:         131.000         131.000         129.000         132.000 )

    TRANSACTION_RATE:       32666.400         (per CPU:        8152.200        8169.842        8174.439        8169.897 )

Pod to remote pod with redirect_peer(), TCP_RR:

  # percpu_netperf 10.217.1.33
          RT_LATENCY:          44.449         (per CPU:          43.767          43.127          45.279          45.622 )
        MEAN_LATENCY:          45.065         (per CPU:          44.030          45.530          45.190          45.510 )
      STDDEV_LATENCY:          84.823         (per CPU:          66.770          97.290          84.380          90.850 )
         MIN_LATENCY:          33.500         (per CPU:          33.000          33.000          34.000          34.000 )
         P50_LATENCY:          43.250         (per CPU:          43.000          43.000          43.000          44.000 )
         P90_LATENCY:          46.750         (per CPU:          46.000          47.000          47.000          47.000 )
         P99_LATENCY:          52.750         (per CPU:          51.000          54.000          53.000          53.000 )

    TRANSACTION_RATE:       90039.500         (per CPU:       22848.186       23187.089       22085.077       21919.130 )

  [0] https://linuxplumbersconf.org/event/7/contributions/674/attachments/568/1002/plumbers_2020_cilium_load_balancer.pdf
  [1] https://github.com/borkmann/netperf_scripts/blob/master/percpu_netperf

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201010234006.7075-3-daniel@iogearbox.net

bpf: Improve bpf_redirect_neigh helper description

Follow-up to address David's feedback that we should better describe internals
of the bpf_redirect_neigh() helper.

Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: David Ahern <dsahern@gmail.com>
Link: https://lore.kernel.org/bpf/20201010234006.7075-2-daniel@iogearbox.net

drivers/net/wan/hdlc_fr: Move the skb_headroom check out of fr_hard_header

Move the skb_headroom check out of fr_hard_header and into pvc_xmit.
This has two benefits:

1. Originally we only do this check for skbs sent by users on Ethernet-
emulating PVC devices. After the change we do this check for skbs sent on
normal PVC devices, too.
(Also add a comment to make it clear that this is only a protection
against upper layers that don't take dev->needed_headroom into account.
Such upper layers should be rare and I believe they should be fixed.)

2. After the change we can simplify the parameter list of fr_hard_header.
We no longer need to use a pointer to pointers (skb_p) because we no
longer need to replace the skb inside fr_hard_header.

Cc: Krzysztof Halasa <khc@pm.waw.pl>
Signed-off-by: Xie He <xie.he.0141@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: rtl8366rb: Roof MTU for switch

The MTU setting for this DSA switch is global so we need
to keep track of the MTU set for each port, then as soon
as any MTU changes, roof the MTU to the biggest common
denominator and poke that into the switch MTU setting.

To achieve this we need a per-chip-variant state container
for the RTL8366RB to use for the RTL8366RB-specific
stuff. Other SMI switches does seem to have per-port
MTU setting capabilities.

Fixes: 5f4a8ef384db ("net: dsa: rtl8366rb: Support setting MTU")
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: Move of_mdio from drivers/of to drivers/net/mdio

Better place for of_mdio.c is drivers/net/mdio.
Move of_mdio.c from drivers/of to drivers/net/mdio

Signed-off-by: Calvin Johnson <calvin.johnson@oss.nxp.com>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dpaa_eth: enable NETIF_MSG_HW by default

When packets are received on the error queue, this function under
net_ratelimit():

netif_err(priv, hw, net_dev, "Err FD status = 0x%08x\n");

does not get printed. Instead we only see:

[ 3658.845592] net_ratelimit: 244 callbacks suppressed
[ 3663.969535] net_ratelimit: 230 callbacks suppressed
[ 3669.085478] net_ratelimit: 228 callbacks suppressed

Enabling NETIF_MSG_HW fixes this issue, and we can see some information
about the frame descriptors of packets.

Signed-off-by: Maxim Kochetkov <fido_max@inbox.ru>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Madalin Bucur <madalin.bucur@oss.nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

r8169: factor out handling rtl8169_stats

Factor out handling the private packet/byte counters to new
functions rtl_get_priv_stats() and rtl_inc_priv_stats().

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: usbnet: remove driver version

Obviously this driver version doesn't make sense. Go with the default
and let ethtool display the kernel version.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: thunderx: Use struct_size() helper in kmalloc()

Make use of the new struct_size() helper instead of the offsetof() idiom.

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'wireless-drivers-next-2020-10-09' of git://git./linux/kernel/git/kvalo/wireless-drivers-next

Kalle Valo says:

====================
wireless-drivers-next patches for v5.10

Fourth and last set of patches for v5.10. Most of these are iwlwifi
patches, but few small fixes to other drivers as well.

Major changes:

iwlwifi

* PNVM support (platform-specific phy config data)

* bump the FW API support to 59
====================

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'mac80211-next-for-net-next-2020-10-08' of git://git./linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
A handful of changes:
* fixes for the recent S1G work
* a docbook build time improvement
* API to pass beacon rate to lower-level driver
====================

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'netlink-export-policy-on-validation-failures'

Johannes Berg says:

====================
netlink: export policy on validation failures

Export the policy used for attribute validation when it fails,
so e.g. for an out-of-range attribute userspace immediately gets
the valid ranges back.

v2 incorporates the suggestion from Jakub to have a function to
estimate the size (netlink_policy_dump_attr_size_estimate()) and
check that it does the right thing on the *normal* policy dumps,
not (just) when calling it from the error scenario.

v3 only addresses a few minor style issues.

v4 fixes up a forgotten 'git add' ... sorry.

v5 is a resend, I messed up v4's cover letter subject (saying v3)
and apparently the second patch didn't go out at all.

Tested using nl80211/iw in a few scenarios, seems to work fine
and return the policy back, e.g.

kernel reports: integer out of range
policy: 04 00 0b 00 0c 00 04 00 01 00 00 00 00 00 00 00
        ^ padding
                    ^ minimum allowed value
policy: 04 00 0b 00 0c 00 05 00 ff ff ff ff 00 00 00 00
        ^ padding
                    ^ maximum allowed value
policy: 08 00 01 00 04 00 00 00
        ^ type 4 == U32

for an out-of-range case.
====================

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netlink: export policy in extended ACK

Add a new attribute NLMSGERR_ATTR_POLICY to the extended ACK
to advertise the policy, e.g. if an attribute was out of range,
you'll know the range that's permissible.

Add new NL_SET_ERR_MSG_ATTR_POL() and NL_SET_ERR_MSG_ATTR_POL()
macros to set this, since realistically it's only useful to do
this when the bad attribute (offset) is also returned.

Use it in lib/nlattr.c which practically does all the policy
validation.

v2:
- add and use netlink_policy_dump_attr_size_estimate()
v3:
- remove redundant break
v4:
- really remove redundant break ... sorry

Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netlink: policy: refactor per-attr policy writing

Refactor the per-attribute policy writing into a new
helper function, to be used later for dumping out the
policy of a rejected attribute.

v2:
- fix some indentation
v3:
- change variable order in netlink_policy_dump_write()

Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-smc-updates-2020-10-07'

Karsten Graul says:

====================
net/smc: updates 2020-10-07

Patch 1 and 2 address warnings from static code checkers, and patch 3
handles a case when all proposed ISM V2 devices fail to init and no V1
devices are tried afterwards.
====================

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/smc: restore smcd_version when all ISM V2 devices failed to init

Field ini->smcd_version is set to SMC_V2 before calling
smc_listen_ism_init(). This clears the V1 bit that may be set. When all
matching ISM V2 devices fail to initialize then the smcd_version field
needs to get restored to allow any possible V1 devices to initialize.
And be consistent, always go to the not_found label when no device was
found.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/smc: cleanup buffer usage in smc_listen_work()

coccinelle informs about
net/smc/af_smc.c:1770:10-11: WARNING: opportunity for kzfree/kvfree_sensitive

Its not that kzfree() would help here, the memset() is done to prepare
the buffer for another socket receive.
Fix that warning message by reordering the calls, while at it eliminate
the unneeded variable cclc2 and use sizeof(*buf) as above in the same
function. No functional changes.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/smc: consolidate unlocking in same function

Static code checkers warn of inconsistent returns because the lgr mutex
is locked in one function and unlocked in a function called by the
locking function:
net/smc/af_smc.c:823 smc_connect_rdma() warn: inconsistent returns 'smc_client_lgr_pending'.
net/smc/af_smc.c:897 smc_connect_ism() warn: inconsistent returns 'smc_server_lgr_pending'.

Make the code consistent by doing the unlock in the same function that
fetches the lock. No functional changes.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'linux-can-next-for-5.10-20201007' of git://git./linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
linux-can-next-for-5.10-20201007

The first 3 patches are by me and fix several warnings found
when compiling the kernel with W=1.

Lukas Bulwahn's patch adjusts the MAINTAINERS file, to accommodate
the renaming of the mcp251xfd driver.

Vincent Mailhol contributes 3 patches for the CAN networking layer.
First error queue support is added the the CAN RAW protocol.
The second patch converts the get_can_dlc() and get_canfd_dlc()
in-Kernel-only macros from using __u8 to u8.
The third patch adds a helper function to calculate the length of
one bit in in multiple of time quanta.

Oliver Hartkopp's patch add support for the ISO 15765-2:2016
transport protocol to the CAN stack.

Three patches by Lad Prabhakar add documentation for various
new rcar controllers to the device tree bindings of the rcar_can
and rcan_canfd driver.

Michael Walle's patch adds various processors to the flexcan
driver binding documentation.

The next two patches are by me and target the flexcan driver aswell.
The remove the ack_grp and ack_bit from the fsl,stop-mode DT property
and the driver, as they are not used anymore. As these are the last
two arguments this change will not break existing device trees.

The last three patches are by Srinivas Neeli and target
the xilinx_can driver.
The first one increases the lower limit for the bit rate
prescaler to 2, the other two fix sparse and coverity findings.
====================

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch '100GbE-Intel-Wired-LAN-Driver-Updates-2020-10-07'

Tony Nguyen says:

====================
100GbE Intel Wired LAN Driver Updates 2020-10-07

This series contains updates to ice driver only.

Andy Shevchenko changes usage to %*phD format to print small buffer as hex
string.

Bruce removes repeated words reported by checkpatch.

Ani changes ice_info_get_dsn() to return void as it always returns
success.

Jake adds devlink reporting of fw.app.bundle_id. Moves devlink_port
structure to ice_vsi to resolve issues with cleanup. Adds additional
debug info for firmware updates.

Bixuan Cui resolves -Wpointer-to-int-cast warnings.

Dan adds additional packet type masks and checks to prevent overwriting
existing Flow Director rules.
====================

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ice: fix adding IP4 IP6 Flow Director rules

A subsequent addition of an IP4 or IP6 rule after other rules would
overwrite any existing TCAM entries of related L4 protocols(ex: tcp4 or
udp6). This was due to the mask including too many TCAM entries. Add new
packet type masks with bits properly excluded so rules are not overwritten.

Signed-off-by: Dan Nowlin <dan.nowlin@intel.com>
Signed-off-by: Henry Tieman <henry.w.tieman@intel.com>
Tested-by: Brijesh Behera <brijeshx.behera@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ice: Fix pointer cast warnings

pointers should be casted to unsigned long to avoid
-Wpointer-to-int-cast warnings:

drivers/net/ethernet/intel/ice/ice_flow.h:197:33: warning:
cast from pointer to integer of different size
drivers/net/ethernet/intel/ice/ice_flow.h:198:32: warning:
cast to pointer from integer of different size

Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ice: add additional debug logging for firmware update

While debugging a recent failure to update the flash of an ice device,
I found it helpful to add additional logging which helped determine the
root cause of the problem being a timeout issue.

Add some extra dev_dbg() logging messages which can be enabled using the
dynamic debug facility, including one for ice_aq_wait_for_event that
will use jiffies to capture a rough estimate of how long we waited for
the completion of a firmware command.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Brijesh Behera <brijeshx.behera@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ice: refactor devlink_port to be per-VSI

Currently, the devlink_port structure is stored within the ice_pf. This
made sense because we create a single devlink_port for each PF. This
setup does not mesh with the abstractions in the driver very well, and
led to a flow where we accidentally call devlink_port_unregister twice
during error cleanup.

In particular, if devlink_port_register or devlink_port_unregister are
called twice, this leads to a kernel panic. This appears to occur during
some possible flows while cleaning up from a failure during driver
probe.

If register_netdev fails, then we will call devlink_port_unregister in
ice_cfg_netdev as it cleans up. Later, we again call
devlink_port_unregister since we assume that we must cleanup the port
that is associated with the PF structure.

This occurs because we cleanup the devlink_port for the main PF even
though it was not allocated. We allocated the port within a per-VSI
function for managing the main netdev, but did not release the port when
cleaning up that VSI, the allocation and destruction are not aligned.

Instead of attempting to manage the devlink_port as part of the PF
structure, manage it as part of the PF VSI. Doing this has advantages,
as we can match the de-allocation of the devlink_port with the
unregister_netdev associated with the main PF VSI.

Moving the port to the VSI is preferable as it paves the way for
handling devlink ports allocated for other purposes such as SR-IOV VFs.

Since we're changing up how we allocate the devlink_port, also change
the indexing. Originally, we indexed the port using the PF id number.
This came from an old goal of sharing a devlink for each physical
function. Managing devlink instances across multiple function drivers is
not workable. Instead, lets set the port number to the logical port
number returned by firmware and set the index using the VSI index
(sometimes referred to as VSI handle).

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ice: add the DDP Track ID to devlink info

Add "fw.app.bundle_id" to display the DDP Track ID of the active DDP
package. This id is similar to "fw.bundle_id" and is a unique identifier
for the DDP package that is loaded in the device. Each new DDP has
a unique Track ID generated for it, and the ID can be used to identify
and track the DDP package.

Add documentation for the new devlink info version.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ice: Change ice_info_get_dsn to be void

ice_info_get_dsn always returns 0, so just make it void.

Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ice: remove repeated words

A new test in checkpatch detects repeated words; cleanup all pre-existing
occurrences of those now.

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Co-developed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ice: devlink: use %*phD to print small buffer

Use %*phD format to print small buffer as hex string.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: microchip: add ksz9563 to ksz9477 I2C driver

Add support for the KSZ9563 3-Port Gigabit Ethernet Switch to the
ksz9477 driver. The KSZ9563 supports both SPI (already in) and I2C. The
ksz9563 is already in the device tree binding documentation.

Signed-off-by: Christian Eggers <ceggers@arri.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'bpf-llvm-reg-alloc-patterns'

Alexei Starovoitov says:

====================
Make two verifier improvements:

- The llvm register allocator may use two different registers representing
  the same virtual register. Teach the verifier to recognize that.
- Track bounded scalar spill/fill.

The profiler[123] test in patch 3 will fail to load without patches 1 and 2.
The profiler[23] test may fail to load on older llvm due to speculative
code motion nd instruction combining optimizations that are fixed in
https://reviews.llvm.org/D85570

v1 -> v2:
  - fixed 32-bit mov issue spotted by John.
  - allowed r2=r1; r3=r2; sequence as suggested by John.
  - added comments, acks, more tests.
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

selftests/bpf: Asm tests for the verifier regalloc tracking.

Add asm tests for register allocator tracking logic.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20201009011240.48506-5-alexei.starovoitov@gmail.com

selftests/bpf: Add profiler test

The main purpose of the profiler test to check different llvm generation
patterns to make sure the verifier can load these large programs.

Note that profiler.inc.h test doesn't follow strict kernel coding style.
The code was formatted in the kernel style, but variable declarations are
kept as-is to preserve original llvm IR pattern.

profiler1.c should pass with older and newer llvm

profiler[23].c may fail on older llvm that don't have:
https://reviews.llvm.org/D85570
because llvm may do speculative code motion optimization that
will generate code like this:

// r9 is a pointer to map_value
// r7 is a scalar
17:       bf 96 00 00 00 00 00 00 r6 = r9
18:       0f 76 00 00 00 00 00 00 r6 += r7
19:       a5 07 01 00 01 01 00 00 if r7 < 257 goto +1
20:       bf 96 00 00 00 00 00 00 r6 = r9
// r6 is used here

The verifier will reject such code with the error:
"math between map_value pointer and register with unbounded min value is not allowed"
At insn 18 the r7 is indeed unbounded. The later insn 19 checks the bounds and
the insn 20 undoes map_value addition. It is currently impossible for the
verifier to understand such speculative pointer arithmetic. Hence llvm D85570
addresses it on the compiler side.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20201009011240.48506-4-alexei.starovoitov@gmail.com

bpf: Track spill/fill of bounded scalars.

Under register pressure the llvm may spill registers with bounds into the stack.
The verifier has to track them through spill/fill otherwise many kinds of bound
errors will be seen. The spill/fill of induction variables was already
happening. This patch extends this logic from tracking spill/fill of a constant
into any bounded register. There is no need to track spill/fill of unbounded,
since no new information will be retrieved from the stack during register fill.

Though extra stack difference could cause state pruning to be less effective, no
adverse affects were seen from this patch on selftests and on cilium programs.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20201009011240.48506-3-alexei.starovoitov@gmail.com

bpf: Propagate scalar ranges through register assignments.

The llvm register allocator may use two different registers representing the
same virtual register. In such case the following pattern can be observed:
1047: (bf) r9 = r6
1048: (a5) if r6 < 0x1000 goto pc+1
1050: ...
1051: (a5) if r9 < 0x2 goto pc+66
1052: ...
1053: (bf) r2 = r9 /* r2 needs to have upper and lower bounds */

This is normal behavior of greedy register allocator.
The slides 137+ explain why regalloc introduces such register copy:
http://llvm.org/devmtg/2018-04/slides/Yatsina-LLVM%20Greedy%20Register%20Allocator.pdf
There is no way to tell llvm 'not to do this'.
Hence the verifier has to recognize such patterns.

In order to track this information without backtracking allocate ID
for scalars in a similar way as it's done for find_good_pkt_pointers().

When the verifier encounters r9 = r6 assignment it will assign the same ID
to both registers. Later if either register range is narrowed via conditional
jump propagate the register state into the other register.

Clear register ID in adjust_reg_min_max_vals() for any alu instruction. The
register ID is ignored for scalars in regsafe() and doesn't affect state
pruning. mark_reg_unknown() clears the ID. It's used to process call, endian
and other instructions. Hence ID is explicitly cleared only in
adjust_reg_min_max_vals() and in 32-bit mov.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20201009011240.48506-2-alexei.starovoitov@gmail.com

net/rds: suppress page allocation failure error in recv buffer refill

RDS/IB tries to refill the recv buffer in softirq context using
GFP_NOWAIT flag. However alloc failure is handled by queueing a work to
refill the recv buffer with GFP_KERNEL flag. This means failure to
allocate with GFP_NOWAIT isn't fatal. Do not print the PAF warnings if
softirq context fails to refill the recv buffer. We will see the PAF
warnings when worker also fails to allocate.

Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com>
Reviewed-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'devlink-add-reload-action-and-limit-options'

Moshe Shemesh says:

====================
Add devlink reload action and limit options

Introduce new options on devlink reload API to enable the user to select
the reload action required and constrains limits on these actions that he
may want to ensure. Complete support for reload actions in mlx5.
The following reload actions are supported:
  driver_reinit: driver entities re-initialization, applying devlink-param
                 and devlink-resource values.
  fw_activate: firmware activate.

The uAPI is backward compatible, if the reload action option is omitted
from the reload command, the driver reinit action will be used.
Note that when required to do firmware activation some drivers may need
to reload the driver. On the other hand some drivers may need to reset
the firmware to reinitialize the driver entities. Therefore, the devlink
reload command returns the actions which were actually performed.

By default reload actions are not limited and driver implementation may
include reset or downtime as needed to perform the actions.
However, if reload limit is selected, the driver should perform only if
it can do it while keeping the limit constraints.
Reload limit added:
  no_reset: No reset allowed, no down time allowed, no link flap and no
            configuration is lost.

Each driver which supports devlink reload command should expose the
reload actions and limits supported.

Add reload stats to hold the history per reload action per limit.
For example, the number of times fw_activate has been done on this
device since the driver module was added or if the firmware activation
was done with or without reset.

Patch 1 changes devlink_reload_supported() param type to enable using
        it before allocating devlink.
Patch 2-3 add the new API reload action and reload limit options to
          devlink reload.
Patch 4-5 add reload stats and remote reload stats. These stats are
          exposed through devlink dev get.
Patches 6-11 add support on mlx5 for devlink reload action fw_activate
            and handle the firmware reset events.
Patches 12-13 add devlink enable remote dev reset parameter and use it
             in mlx5.
Patches 14-15 mlx5 add devlink reload limit no_reset support for
              fw_activate reload action.
Patch 16 adds documentation file devlink-reload.rst
====================

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: Add Documentation/networking/devlink/devlink-reload.rst

Add devlink reload rst documentation file.
Update index file to include it.

Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Add support for devlink reload limit no reset

Add support for devlink reload action fw_activate with reload limit
no_reset which does firmware live patching, updating the firmware image
without reset, no downtime and no configuration lose. The driver checks
if the firmware is capable of handling the pending firmware changes as a
live patch. If it is then it triggers firmware live patching flow.

Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Add support for fw live patch event

Firmware live patch event notifies the driver that the firmware was just
updated using live patch. In such case the driver should not reload or
re-initiate entities, part to updating the firmware version and
re-initiate the firmware tracer which can be updated by live patch with
new strings database to help debugging an issue.

Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>