git.monstr.eu Git - linux-2.6-microblaze.git/log

Bluetooth: Add definitions for advertisement monitor features

This adds support for Advertisement Monitor API. Here are the commands
and events added.
- Read Advertisement Monitor Feature command
- Add Advertisement Pattern Monitor command
- Remove Advertisement Monitor command
- Advertisement Monitor Added event
- Advertisement Monitor Removed event

Signed-off-by: Miao-chen Chou <mcchou@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>

Bluetooth: Add get/set device flags mgmt op

Add the get device flags and set device flags mgmt ops and the device
flags changed event. Their behavior is described in detail in
mgmt-api.txt in bluez.

Sample btmon trace when a HID device is added (trimmed to 75 chars):

@ MGMT Command: Unknown (0x0050) plen 11        {0x0001} [hci0] 18:06:14.98
        90 c5 13 cd f3 cd 02 01 00 00 00                 ...........
@ MGMT Event: Unknown (0x002a) plen 15          {0x0004} [hci0] 18:06:14.98
        90 c5 13 cd f3 cd 02 01 00 00 00 01 00 00 00     ...............
@ MGMT Event: Unknown (0x002a) plen 15          {0x0003} [hci0] 18:06:14.98
        90 c5 13 cd f3 cd 02 01 00 00 00 01 00 00 00     ...............
@ MGMT Event: Unknown (0x002a) plen 15          {0x0002} [hci0] 18:06:14.98
        90 c5 13 cd f3 cd 02 01 00 00 00 01 00 00 00     ...............
@ MGMT Event: Command Compl.. (0x0001) plen 10  {0x0001} [hci0] 18:06:14.98
      Unknown (0x0050) plen 7
        Status: Success (0x00)
        90 c5 13 cd f3 cd 02                             .......
@ MGMT Command: Add Device (0x0033) plen 8      {0x0001} [hci0] 18:06:14.98
        LE Address: CD:F3:CD:13:C5:90 (Static)
        Action: Auto-connect remote device (0x02)
@ MGMT Event: Device Added (0x001a) plen 8      {0x0004} [hci0] 18:06:14.98
        LE Address: CD:F3:CD:13:C5:90 (Static)
        Action: Auto-connect remote device (0x02)
@ MGMT Event: Device Added (0x001a) plen 8      {0x0003} [hci0] 18:06:14.98
        LE Address: CD:F3:CD:13:C5:90 (Static)
        Action: Auto-connect remote device (0x02)
@ MGMT Event: Device Added (0x001a) plen 8      {0x0002} [hci0] 18:06:14.98
        LE Address: CD:F3:CD:13:C5:90 (Static)
        Action: Auto-connect remote device (0x02)
@ MGMT Event: Unknown (0x002a) plen 15          {0x0004} [hci0] 18:06:14.98
        90 c5 13 cd f3 cd 02 01 00 00 00 01 00 00 00     ...............
@ MGMT Event: Unknown (0x002a) plen 15          {0x0003} [hci0] 18:06:14.98
        90 c5 13 cd f3 cd 02 01 00 00 00 01 00 00 00     ...............
@ MGMT Event: Unknown (0x002a) plen 15          {0x0002} [hci0] 18:06:14.98
        90 c5 13 cd f3 cd 02 01 00 00 00 01 00 00 00     ...............
@ MGMT Event: Unknown (0x002a) plen 15          {0x0001} [hci0] 18:06:14.98
        90 c5 13 cd f3 cd 02 01 00 00 00 01 00 00 00     ...............

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>

Bluetooth: Replace wakeable in hci_conn_params

Replace the wakeable boolean with flags in hci_conn_params and all users
of this boolean. This will be used by the get/set device flags mgmt op.

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>

Bluetooth: Replace wakeable list with flag

Since the classic device list now supports flags, convert the wakeable
list into a flag on the existing device list.

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>

Bluetooth: Add bdaddr_list_with_flags for classic whitelist

In order to more easily add device flags to classic devices, create
a new type of bdaddr_list that supports setting flags.

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>

Bluetooth: mgmt: Add commands for runtime configuration

This adds the required read/set commands for runtime configuration. Even
while currently no parameters are specified, the commands are made
available.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>

Bluetooth: implement read/set default system parameters mgmt

This patch implements the read default system parameters and the set
default system parameters mgmt commands.

Signed-off-by: Alain Michaud <alainm@chromium.org>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

Bluetooth: centralize default value initialization.

This patch centralized the initialization of default parameters. This
is required to allow clients to more easily customize the default
system parameters.

Signed-off-by: Alain Michaud <alainm@chromium.org>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

Bluetooth: mgmt: read/set system parameter definitions

This patch submits the corresponding kernel definitions to mgmt.h.
This is submitted before the implementation to avoid any conflicts in
values allocations.

Signed-off-by: Alain Michaud <alainm@chromium.org>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reviewed-by: Yu Liu <yudiliu@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

Bluetooth: hci_qca: Request Tx clock vote off only when Tx is pending

Tx pending flag is set to true when HOST IBS state is AWAKE or
AWAKEING. If IBS state is ASLEEP, then Tx clock is already voted
off. To optimize further directly calling serial_clock_vote()
instead of qca_wq_serial_tx_clock_vote_off(), at this point of
qca_suspend() already data is sent out. No need to wake up hci to
send data.

Signed-off-by: Balakrishna Godavarthi <bgodavar@codeaurora.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

Bluetooth: hci_qca: Increase SoC idle timeout to 200ms

In some version of WCN399x, SoC idle timeout is configured
as 80ms instead of 20ms or 40ms. To honor all the SoC's
supported in the driver increasing SoC idle timeout to 200ms.

Fixes: 41d5b25fed0a0 ("Bluetooth: hci_qca: add PM support")
Signed-off-by: Balakrishna Godavarthi <bgodavar@codeaurora.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

Bluetooth: hci_qca: Disable SoC debug logging for WCN3991

By default, WCN3991 sent debug packets to HOST via ACL packet
with header 0xDC2E. This logging is not required on commercial
devices. With this patch SoC logging is disabled post fw
download.

Signed-off-by: Balakrishna Godavarthi <bgodavar@codeaurora.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

Bluetooth: Use only 8 bits for the HCI CMSG state flags

This change implements suggestions from the code review of the SCO CMSG
state flag patch.

Signed-off-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

Bluetooth: Add support for BT_PKT_STATUS CMSG data for SCO connections

This change adds support for reporting the BT_PKT_STATUS to the socket
CMSG data to allow the implementation of a packet loss correction on
erroneous data received on the SCO socket.

The patch was partially developed by Marcel Holtmann and validated by
Hsin-yu Chao.

Signed-off-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

Bluetooth: btmrvl_sdio: Refactor irq wakeup

Use device_init_wakeup to allow the Bluetooth dev to wake the system
from suspend. Currently, the device can wake the system but no
power/wakeup entry is created in sysfs to allow userspace to disable
wakeup.

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

Bluetooth: btmrvl_sdio: Implement prevent_wake

Use the parent device's power/wakeup to control whether we support
remote wake. If remote wakeup is disabled, Bluetooth will not enable
scanning for incoming connections.

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

Bluetooth: btmrvl_sdio: Set parent dev to hdev

Set the correct parent dev when registering hdev. This allows userspace
tools to find the parent device (for example, to set the power/wakeup
property).

Before this change, the path was /sys/devices/virtual/bluetooth/hci0
and after this change, it looks more like:
/sys/bus/mmc/devices/mmc1:0001/mmc1:0001:2/bluetooth/hci0

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

Bluetooth: btusb: Configure Intel debug feature based on available support

This patch shall enable the Intel telemetry exception format
based on the supported features

Signed-off-by: Chethan T N <chethan.tumkur.narayan@intel.com>
Signed-off-by: Ps AyappadasX <AyappadasX.Ps@intel.com>
Signed-off-by: Kiran K <kiran.k@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

Bluetooth: btusb: Add support to read Intel debug feature

The command shall read the Intel controller supported
debug feature. Based on the supported features additional debug
configuration shall be enabled.

Signed-off-by: Chethan T N <chethan.tumkur.narayan@intel.com>
Signed-off-by: Ps AyappadasX <AyappadasX.Ps@intel.com>
Signed-off-by: Kiran K <kiran.k@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

Bluetooth: hci_qca: Bug fix during SSR timeout

Due to race conditions between qca_hw_error and qca_controller_memdump
during SSR timeout,the same pointer is freed twice. This results in a
double free. Now a lock is acquired before checking the stauts of SSR
state.

Fixes: d841502c79e3 ("Bluetooth: hci_qca: Collect controller memory dump during SSR")
Signed-off-by: Venkata Lakshmi Narayana Gubba <gubbaven@codeaurora.org>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

Bluetooth: Allow suspend even when preparation has failed

It is preferable to allow suspend even when Bluetooth has problems
preparing for sleep. When Bluetooth fails to finish preparing for
suspend, log the error and allow the suspend notifier to continue
instead.

To also make it clearer why suspend failed, change bt_dev_dbg to
bt_dev_err when handling the suspend timeout.

Fixes: dd522a7429b07e ("Bluetooth: Handle LE devices during suspend")
Reported-by: Len Brown <len.brown@intel.com>
Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

Bluetooth: hci_qca: Refactor error handling in qca_suspend()

If waiting for IBS sleep times out jump to the error handler, this is
easier to read than multiple 'if' branches and a fall through to the
error handler.

Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

Bluetooth: hci_qca: Skip serdev wait when no transfer is pending

qca_suspend() calls serdev_device_wait_until_sent() regardless of
whether a transfer is pending. While it does no active harm since
the function should return immediately it makes the code more
confusing. Add a flag to track whether a transfer is pending and
only call serdev_device_wait_until_sent() is needed.

Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

Bluetooth: hci_qca: Only remove TX clock vote after TX is completed

qca_suspend() removes the vote for the UART TX clock after
writing an IBS sleep request to the serial buffer. This is
not a good idea since there is no guarantee that the request
has been sent at this point. Instead remove the vote after
successfully entering IBS sleep. This also fixes the issue
of the vote being removed in case of an aborted suspend due
to a failure of entering IBS sleep.

Fixes: 41d5b25fed0a0 ("Bluetooth: hci_qca: add PM support")
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

Bluetooth: hci_qca: Simplify determination of serial clock on/off state from votes

The serial clocks should be on when there is a vote for at least one
of the clocks (RX or TX), and off when there is no 'on' vote. The
current logic to determine the combined state is a bit redundant
in the code paths for different types of votes, use a single
statement in the common path instead.

Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

Bluetooth: hci_qca: Fix an error pointer dereference

When a function like devm_clk_get_optional() function returns both error
pointers on error and NULL then the NULL return means that the optional
feature is deliberately disabled. It is a special sort of success and
should not trigger an error message. The surrounding code should be
written to check for NULL and not crash.

On the other hand, if we encounter an error, then the probe from should
clean up and return a failure.

In this code, if devm_clk_get_optional() returns an error pointer then
the kernel will crash inside the call to:

clk_set_rate(qcadev->susclk, SUSCLK_RATE_32KHZ);

The error handling must be updated to prevent that.

Fixes: 77131dfec6af ("Bluetooth: hci_qca: Replace devm_gpiod_get() with devm_gpiod_get_optional()")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

Bluetooth: Removing noisy dbg message

This patch removes a particularly noisy dbg message. The debug message
isn't particularly interesting for debuggability so it was simply
removed to reduce noise in dbg logs.

Signed-off-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

Bluetooth: Check scan state before disabling during suspend

Check current scan state by checking HCI_LE_SCAN flag and send scan
disable command only if scan is already enabled.

Signed-off-by: Manish Mandlik <mmandlik@google.com>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

btmrvl: Fix firmware filename for sd8997 chipset

Firmware for sd8997 chipset is distributed by Marvell package and also as
part of the linux-firmware repository in filename sdsd8997_combo_v4.bin.

This patch fixes mwifiex driver to load correct firmware file for sd8997.

Fixes: f0ef67485f591 ("Bluetooth: btmrvl: add sd8997 chipset support")
Signed-off-by: Pali Rohár <pali@kernel.org>
Acked-by: Ganapathi Bhat <ganapathi.bhat@nxp.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

btmrvl: Fix firmware filename for sd8977 chipset

Firmware for sd8977 chipset is distributed by Marvell package and also as
part of the linux-firmware repository in filename sdsd8977_combo_v2.bin.

This patch fixes mwifiex driver to load correct firmware file for sd8977.

Fixes: 8c57983bf7a79 ("Bluetooth: btmrvl: add support for sd8977 chipset")
Signed-off-by: Pali Rohár <pali@kernel.org>
Acked-by: Ganapathi Bhat <ganapathi.bhat@nxp.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

mwifiex: Fix firmware filename for sd8997 chipset

Firmware for sd8997 chipset is distributed by Marvell package and also as
part of the linux-firmware repository in filename sdsd8997_combo_v4.bin.

This patch fixes mwifiex driver to load correct firmware file for sd8997.

Fixes: 6d85ef00d9dfe ("mwifiex: add support for 8997 chipset")
Signed-off-by: Pali Rohár <pali@kernel.org>
Acked-by: Ganapathi Bhat <ganapathi.bhat@nxp.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

mwifiex: Fix firmware filename for sd8977 chipset

Firmware for sd8977 chipset is distributed by Marvell package and also as
part of the linux-firmware repository in filename sdsd8977_combo_v2.bin.

This patch fixes mwifiex driver to load correct firmware file for sd8977.

Fixes: 1a0f547831dce ("mwifiex: add support for sd8977 chipset")
Signed-off-by: Pali Rohár <pali@kernel.org>
Acked-by: Ganapathi Bhat <ganapathi.bhat@nxp.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

selftests: net: ip_defrag: ignore EPERM

When running with conntrack rules, the dropped overlap fragments may cause
EPERM to be returned to sendto. Instead of completely failing, just ignore
those errors and continue. If this causes packets with overlap fragments to
be dropped as expected, that is okay. And if it causes packets that are
expected to be received to be dropped, which should not happen, it will be
detected as failure.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net_failover: fixed rollback in net_failover_open()

found by smatch:
drivers/net/net_failover.c:65 net_failover_open() error:
we previously assumed 'primary_dev' could be null (see line 43)

Fixes: cfc80d9a1163 ("net: Introduce net_failover driver")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'tipc-revert-two-patches'

Tuong Lien says:

====================
tipc: revert two patches

We revert two patches:

tipc: Fix potential tipc_node refcnt leak in tipc_rcv
tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv

which prevented TIPC encryption from working properly and caused kernel
panic.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"

This reverts commit 441870ee4240cf67b5d3ab8e16216a9ff42eb5d6.

Like the previous patch in this series, we revert the above commit that
causes similar issues with the 'aead' object.

Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"

This reverts commit de058420767df21e2b6b0f3bb36d1616fb962032.

There is no actual tipc_node refcnt leak as stated in the above commit.
The refcnt is hold carefully for the case of an asynchronous decryption
(i.e. -EINPROGRESS/-EBUSY and skb = NULL is returned), so that the node
object cannot be freed in the meantime. The counter will be re-balanced
when the operation's callback arrives with the decrypted buffer if any.
In other cases, e.g. a synchronous crypto the counter will be decreased
immediately when it is done.

Now with that commit, a kernel panic will occur when there is no node
found (i.e. n = NULL) in the 'tipc_rcv()' or a premature release of the
node object.

This commit solves the issues by reverting the said commit, but keeping
one valid case that the 'skb_linearize()' is failed.

Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Tested-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

vmxnet3: allow rx flow hash ops only when rss is enabled

It makes sense to allow changes to get/set rx flow hash callback only
when rss is enabled. This patch restricts get_rss_hash_opts and
set_rss_hash_opts methods to allow querying and configuring different
Rx flow hash configurations only when rss is enabled

Signed-off-by: Ronak Doshi <doshir@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

hinic: add set_channels ethtool_ops support

add support to change TX/RX queue number with "ethtool -L combined".

V5 -> V6: remove check for carrier in hinic_xmit_frame
V4 -> V5: change time zone in patch header
V3 -> V4: update date in patch header
V2 -> V3: remove check for zero channels->combined_count
V1 -> V2: update commit message("ethtool -L" to "ethtool -L combined")
V0 -> V1: remove check for channels->tx_count/rx_count/other_count

Signed-off-by: Luo bin <luobin9@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git./linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2020-06-02

The following pull-request contains BPF _fixes-only_ for your *net-next*
tree.

We've added 10 non-merge commits during the last 1 day(s) which contain
a total of 15 files changed, 229 insertions(+), 74 deletions(-).

The main changes are:

1) Several fixes to s390 BPF JIT e.g. fixing kernel panic when BPF stack is
   not 8-byte aligned, from Ilya Leoshkevich.

2) Fix bpf_skb_adjust_room() helper's CHECKSUM_UNNECESSARY handling which
   was wrongly bypassing TCP checksum verification, from Daniel Borkmann.

3) Fix tools/bpf/ build under MAKEFLAGS=rR which causes built-in CXX and
   others vars to be undefined, also from Ilya Leoshkevich.

4) Fix BPF ringbuf's selftest shared sample_cnt variable to avoid compiler
   optimizations on it, from Andrii Nakryiko.

5) Fix up test_verifier selftest due to addition of rx_queue_mapping to
   the bpf_sock structure, from Alexei Starovoitov.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

selftests/bpf: Add a default $(CXX) value

When using make kselftest TARGETS=bpf, tools/bpf is built with
MAKEFLAGS=rR, which causes $(CXX) to be undefined, which in turn causes
the build to fail with

CXX test_cpp
/bin/sh: 2: g: not found

Fix by adding a default $(CXX) value, like tools/build/feature/Makefile
already does.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200602175649.2501580-3-iii@linux.ibm.com

tools/bpf: Don't use $(COMPILE.c)

When using make kselftest TARGETS=bpf, tools/bpf is built with
MAKEFLAGS=rR, which causes $(COMPILE.c) to be undefined, which in turn
causes the build to fail with

CC kselftest/bpf/tools/build/bpftool/map_perf_ring.o
/bin/sh: 1: -MMD: not found

Fix by using $(CC) $(CFLAGS) -c instead of $(COMPILE.c).

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200602175649.2501580-2-iii@linux.ibm.com

bpf, selftests: Use bpf_probe_read_kernel

Since commit 0ebeea8ca8a4 ("bpf: Restrict bpf_probe_read{, str}() only to
archs where they work") 44 verifier tests fail on s390 due to not having
bpf_probe_read anymore. Fix by using bpf_probe_read_kernel.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200602174448.2501214-1-iii@linux.ibm.com

s390/bpf: Use bcr 0,%0 as tail call nop filler

Currently used 0x0000 filler confuses bfd disassembler, making bpftool
prog dump xlated output nearly useless. Fix by using a real instruction.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200602174555.2501389-1-iii@linux.ibm.com

s390/bpf: Maintain 8-byte stack alignment

Certain kernel functions (e.g. get_vtimer/set_vtimer) cause kernel
panic when the stack is not 8-byte aligned. Currently JITed BPF programs
may trigger this by allocating stack frames with non-rounded sizes and
then being interrupted. Fix by using rounded fp->aux->stack_depth.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200602174339.2501066-1-iii@linux.ibm.com

selftests/bpf: Fix verifier test

Adjust verifier test due to addition of new field.

Fixes: c3c16f2ea6d2 ("bpf: Add rx_queue_mapping to bpf_sock")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Fix sample_cnt shared between two threads

Make sample_cnt volatile to fix possible selftests failure due to compiler
optimization preventing latest sample_cnt value to be visible to main thread.
sample_cnt is incremented in background thread, which is then joined into main
thread. So in terms of visibility sample_cnt update is ok. But because it's
not volatile, compiler might make optimizations that would prevent main thread
to see latest updated value. Fix this by marking global variable volatile.

Fixes: cb1c9ddd5525 ("selftests/bpf: Add BPF ringbuf selftests")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200602050349.215037-1-andriin@fb.com

Merge branch 'csum-fixes'

Daniel Borkmann says:

====================
This series fixes an issue originally reported by Lorenz Bauer where using
the bpf_skb_adjust_room() helper hid a checksum bug since it wasn't adjusting
CHECKSUM_UNNECESSARY's skb->csum_level after decap. The fix is two-fold:
i) We do a safe reset in bpf_skb_adjust_room() to CHECKSUM_NONE with an opt-
out flag BPF_F_ADJ_ROOM_NO_CSUM_RESET.
ii) We add a new bpf_csum_level() for the latter in order to allow users to
manually inc/dec the skb->csum_level when needed.
The series is rebased against latest bpf-next tree. It can be applied there,
or to bpf after the merge win sync from net-next.

Thanks!
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf, selftests: Adapt cls_redirect to call csum_level helper

Adapt bpf_skb_adjust_room() to pass in BPF_F_ADJ_ROOM_NO_CSUM_RESET flag and
use the new bpf_csum_level() helper to inc/dec the checksum level by one after
the encap/decap.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Link: https://lore.kernel.org/bpf/e7458f10e3f3d795307cbc5ad870112671d9c6f7.1591108731.git.daniel@iogearbox.net

bpf: Add csum_level helper for fixing up csum levels

Add a bpf_csum_level() helper which BPF programs can use in combination
with bpf_skb_adjust_room() when they pass in BPF_F_ADJ_ROOM_NO_CSUM_RESET
flag to the latter to avoid falling back to CHECKSUM_NONE.

The bpf_csum_level() allows to adjust CHECKSUM_UNNECESSARY skb->csum_levels
via BPF_CSUM_LEVEL_{INC,DEC} which calls __skb_{incr,decr}_checksum_unnecessary()
on the skb. The helper also allows a BPF_CSUM_LEVEL_RESET which sets the skb's
csum to CHECKSUM_NONE as well as a BPF_CSUM_LEVEL_QUERY to just return the
current level. Without this helper, there is no way to otherwise adjust the
skb->csum_level. I did not add an extra dummy flags as there is plenty of free
bitspace in level argument itself iff ever needed in future.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Lorenz Bauer <lmb@cloudflare.com>
Link: https://lore.kernel.org/bpf/279ae3717cb3d03c0ffeb511493c93c450a01e1a.1591108731.git.daniel@iogearbox.net

bpf: Fix up bpf_skb_adjust_room helper's skb csum setting

Lorenz recently reported:

  In our TC classifier cls_redirect [0], we use the following sequence of
  helper calls to decapsulate a GUE (basically IP + UDP + custom header)
  encapsulated packet:

    bpf_skb_adjust_room(skb, -encap_len, BPF_ADJ_ROOM_MAC, BPF_F_ADJ_ROOM_FIXED_GSO)
    bpf_redirect(skb->ifindex, BPF_F_INGRESS)

  It seems like some checksums of the inner headers are not validated in
  this case. For example, a TCP SYN packet with invalid TCP checksum is
  still accepted by the network stack and elicits a SYN ACK. [...]

  That is, we receive the following packet from the driver:

    | ETH | IP | UDP | GUE | IP | TCP |
    skb->ip_summed == CHECKSUM_UNNECESSARY

  ip_summed is CHECKSUM_UNNECESSARY because our NICs do rx checksum offloading.
  On this packet we run skb_adjust_room_mac(-encap_len), and get the following:

    | ETH | IP | TCP |
    skb->ip_summed == CHECKSUM_UNNECESSARY

  Note that ip_summed is still CHECKSUM_UNNECESSARY. After bpf_redirect()'ing
  into the ingress, we end up in tcp_v4_rcv(). There, skb_checksum_init() is
  turned into a no-op due to CHECKSUM_UNNECESSARY.

The bpf_skb_adjust_room() helper is not aware of protocol specifics. Internally,
it handles the CHECKSUM_COMPLETE case via skb_postpull_rcsum(), but that does
not cover CHECKSUM_UNNECESSARY. In this case skb->csum_level of the original
skb prior to bpf_skb_adjust_room() call was 0, that is, covering UDP. Right now
there is no way to adjust the skb->csum_level. NICs that have checksum offload
disabled (CHECKSUM_NONE) or that support CHECKSUM_COMPLETE are not affected.

Use a safe default for CHECKSUM_UNNECESSARY by resetting to CHECKSUM_NONE and
add a flag to the helper called BPF_F_ADJ_ROOM_NO_CSUM_RESET that allows users
from opting out. Opting out is useful for the case where we don't remove/add
full protocol headers, or for the case where a user wants to adjust the csum
level manually e.g. through bpf_csum_level() helper that is added in subsequent
patch.

The bpf_skb_proto_{4_to_6,6_to_4}() for NAT64/46 translation from the BPF
bpf_skb_change_proto() helper uses bpf_skb_net_hdr_{push,pop}() pair internally
as well but doesn't change layers, only transitions between v4 to v6 and vice
versa, therefore no adoption is required there.

  [0] https://lore.kernel.org/bpf/20200424185556.7358-1-lmb@cloudflare.com/

Fixes: 2be7e212d541 ("bpf: add bpf_skb_adjust_room helper")
Reported-by: Lorenz Bauer <lmb@cloudflare.com>
Reported-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Link: https://lore.kernel.org/bpf/CACAyw9-uU_52esMd1JjuA80fRPHJv5vsSg8GnfW3t_qDU4aVKQ@mail.gmail.com/
Link: https://lore.kernel.org/bpf/11a90472e7cce83e76ddbfce81fdfce7bfc68808.1591108731.git.daniel@iogearbox.net

Merge git://git./linux/kernel/git/bpf/bpf-next

Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-06-01

The following pull-request contains BPF updates for your *net-next* tree.

We've added 55 non-merge commits during the last 1 day(s) which contain
a total of 91 files changed, 4986 insertions(+), 463 deletions(-).

The main changes are:

1) Add rx_queue_mapping to bpf_sock from Amritha.

2) Add BPF ring buffer, from Andrii.

3) Attach and run programs through devmap, from David.

4) Allow SO_BINDTODEVICE opt in bpf_setsockopt, from Ferenc.

5) link based flow_dissector, from Jakub.

6) Use tracing helpers for lsm programs, from Jiri.

7) Several sk_msg fixes and extensions, from John.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()

Sparse reports a warning at efx_ef10_try_update_nic_stats_vf()
warning: context imbalance in efx_ef10_try_update_nic_stats_vf()
- unexpected unlock
The root cause is the missing annotation at
efx_ef10_try_update_nic_stats_vf()
Add the missing _must_hold(&efx->stats_lock) annotation

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

crypto/chtls: IPv6 support for inline TLS

Extends support to IPv6 for Inline TLS server.

Signed-off-by: Vinay Kumar Yadav <vinay.yadav@chelsio.com>
v1->v2:
- cc'd tcp folks.

v2->v3:
- changed EXPORT_SYMBOL() to EXPORT_SYMBOL_GPL()

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'chelsio-crypto-fixes'

Ayush Sawal says:

====================
Fixing compilation warnings and errors

Patch 1: Fixes the warnings seen when compiling using sparse tool.

Patch 2: Fixes a cocci check error introduced after commit
567be3a5d227 ("crypto: chelsio -
Use multiple txq/rxq per tfm to process the requests").

V1->V2

patch1: Avoid type casting by using get_unaligned_be32() and
put_unaligned_be16/32() functions.

patch2: Modified subject of the patch.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Crypto/chcr: Fixes a coccinile check error

This fixes an error observed after running coccinile check.
drivers/crypto/chelsio/chcr_algo.c:1462:5-8: Unneeded variable:
"err". Return "0" on line 1480

This line is missed in the commit 567be3a5d227 ("crypto:
chelsio - Use multiple txq/rxq per tfm to process the requests").

Fixes: 567be3a5d227 ("crypto:
chelsio - Use multiple txq/rxq per tfm to process the requests").

V1->V2
-Modified subject.

Signed-off-by: Ayush Sawal <ayush.sawal@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Crypto/chcr: Fixes compilations warnings

This patch fixes the compilation warnings displayed by sparse tool for
chcr driver.

V1->V2

Avoid type casting by using get_unaligned_be32() and
put_unaligned_be16/32() functions.

The key which comes from stack is an u8 byte stream so we store it in
an unsigned char array(ablkctx->key). The function get_aes_decrypt_key()
is a used to calculate the reverse round key for decryption, for this
operation the key has to be divided into 4 bytes, so to extract 4 bytes
from an u8 byte stream and store it in an u32 variable, get_aligned_be32()
is used. Similarly for copying back the key from u32 variable to the
original u8 key stream, put_aligned_be32() is used.

Signed-off-by: Ayush Sawal <ayush.sawal@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

crypto/chcr: IPV6 code needs to be in CONFIG_IPV6

Error messages seen while building kernel with CONFIG_IPV6
disabled.

Signed-off-by: Rohit Maheshwari <rohitm@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4/chcr: Enable ktls settings at run time

Current design enables ktls setting from start, which is not
efficient. Now the feature will be enabled when user demands
TLS offload on any interface.

v1->v2:
- taking ULD module refcount till any single connection exists.
- taking rtnl_lock() before clearing tls_devops.

v2->v3:
- cxgb4 is now registering to tlsdev_ops.
- module refcount inc/dec in chcr.
- refcount is only for connections.
- removed new code from cxgb_set_feature().

v3->v4:
- fixed warning message.

Signed-off-by: Rohit Maheshwari <rohitm@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: fix IPV6_ADDRFORM operation logic

Socket option IPV6_ADDRFORM supports UDP/UDPLITE and TCP at present.
Previously the checking logic looks like:
if (sk->sk_protocol == IPPROTO_UDP || sk->sk_protocol == IPPROTO_UDPLITE)
do_some_check;
else if (sk->sk_protocol != IPPROTO_TCP)
break;

After commit b6f6118901d1 ("ipv6: restrict IPV6_ADDRFORM operation"), TCP
was blocked as the logic changed to:
if (sk->sk_protocol == IPPROTO_UDP || sk->sk_protocol == IPPROTO_UDPLITE)
do_some_check;
else if (sk->sk_protocol == IPPROTO_TCP)
do_some_check;
break;
else
break;

Then after commit 82c9ae440857 ("ipv6: fix restrict IPV6_ADDRFORM operation")
UDP/UDPLITE were blocked as the logic changed to:
if (sk->sk_protocol == IPPROTO_UDP || sk->sk_protocol == IPPROTO_UDPLITE)
do_some_check;
if (sk->sk_protocol == IPPROTO_TCP)
do_some_check;

if (sk->sk_protocol != IPPROTO_TCP)
break;

Fix it by using Eric's code and simply remove the break in TCP check, which
looks like:
if (sk->sk_protocol == IPPROTO_UDP || sk->sk_protocol == IPPROTO_UDPLITE)
do_some_check;
else if (sk->sk_protocol == IPPROTO_TCP)
do_some_check;
else
break;

Fixes: 82c9ae440857 ("ipv6: fix restrict IPV6_ADDRFORM operation")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tipc: Fix NULL pointer dereference in __tipc_sendstream()

tipc_sendstream() may send zero length packet, then tipc_msg_append()
do not alloc skb, skb_peek_tail() will get NULL, msg_set_ack_required
will trigger NULL pointer dereference.

Reported-by: syzbot+8eac6d030e7807c21d32@syzkaller.appspotmail.com
Fixes: 0a3e060f340d ("tipc: add test for Nagle algorithm effectiveness")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'Link-based-attach-to-netns'

Jakub Sitnicki says:

====================
One of the pieces of feedback from recent review of BPF hooks for socket
lookup [0] was that new program types should use bpf_link-based
attachment.

This series introduces new bpf_link type for attaching to network
namespace. All link operations are supported. Errors returned from ops
follow cgroup example. Patch 4 description goes into error semantics.

The major change in v2 is a switch away from RCU to mutex-only
synchronization. Andrii pointed out that it is not needed, and it makes
sense to keep locking straightforward.

Also, there were a couple of bugs in update_prog and fill_info initial
implementation, one picked up by kbuild. Those are now fixed. Tests have
been extended to cover them. Full changelog below.

Series is organized as so:

Patches 1-3 prepare a space in struct net to keep state for attached BPF
programs, and massage the code in flow_dissector to make it attach type
agnostic, to finally move it under kernel/bpf/.

Patch 4, the most important one, introduces new bpf_link link type for
attaching to network namespace.

Patch 5 unifies the update error (ENOLINK) between BPF cgroup and netns.

Patches 6-8 make libbpf and bpftool aware of the new link type.

Patches 9-12 Add and extend tests to check that link low- and high-level
API for operating on links to netns works as intended.

Thanks to Alexei, Andrii, Lorenz, Marek, and Stanislav for feedback.

-jkbs

[0] https://lore.kernel.org/bpf/20200511185218.1422406-1-jakub@cloudflare.com/

Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Cc: Marek Majkowski <marek@cloudflare.com>
Cc: Stanislav Fomichev <sdf@google.com>
v1 -> v2:

- Switch to mutex-only synchronization. Don't rely on RCU grace period
  guarantee when accessing struct net from link release / update /
  fill_info, and when accessing bpf_link from pernet pre_exit
  callback. (Andrii)
- Drop patch 1, no longer needed with mutex-only synchronization.
- Don't leak uninitialized variable contents from fill_info callback
  when link is in defunct state. (kbuild)
- Make fill_info treat the link as defunct (i.e. no attached netns) when
  struct net refcount is 0, but link has not been yet auto-detached.
- Add missing BPF_LINK_TYPE define in bpf_types.h for new link type.
- Fix link update_prog callback to update the prog that will run, and
  not just the link itself.
- Return EEXIST on prog attach when link already exists, and on link
  create when prog is already attached directly. (Andrii)
- Return EINVAL on prog detach when link is attached. (Andrii)
- Fold __netns_bpf_link_attach into its only caller. (Stanislav)
- Get rid of a wrapper around container_of() (Andrii)
- Use rcu_dereference_protected instead of rcu_access_pointer on
  update-side. (Stanislav)
- Make return-on-success from netns_bpf_link_create less
  confusing. (Andrii)
- Adapt bpf_link for cgroup to return ENOLINK when updating a defunct
  link. (Andrii, Alexei)
- Order new exported symbols in libbpf.map alphabetically (Andrii)
- Keep libbpf's "failed to attach link" warning message clear as to what
  we failed to attach to (cgroup vs netns). (Andrii)
- Extract helpers for printing link attach type. (bpftool, Andrii)
- Switch flow_dissector tests to BPF skeleton and extend them to
  exercise link-based flow dissector attachment. (Andrii)
- Harden flow dissector attachment tests with prog query checks after
  prog attach/detach, or link create/update/close.
- Extend flow dissector tests to cover fill_info for defunct links.
- Rebase onto recent bpf-next
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Extend test_flow_dissector to cover link creation

Extend the existing flow_dissector test case to run tests once using direct
prog attachments, and then for the second time using indirect attachment
via link.

The intention is to exercises the newly added high-level API for attaching
programs to network namespace with links (bpf_program__attach_netns).

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200531082846.2117903-13-jakub@cloudflare.com

selftests/bpf: Convert test_flow_dissector to use BPF skeleton

Switch flow dissector test setup from custom BPF object loader to BPF
skeleton to save boilerplate and prepare for testing higher-level API for
attaching flow dissector with bpf_link.

To avoid depending on program order in the BPF object when populating the
flow dissector PROG_ARRAY map, change the program section names to contain
the program index into the map. This follows the example set by tailcall
tests.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200531082846.2117903-12-jakub@cloudflare.com

selftests/bpf, flow_dissector: Close TAP device FD after the test

test_flow_dissector leaves a TAP device after it's finished, potentially
interfering with other tests that will run after it. Fix it by closing the
TAP descriptor on cleanup.

Fixes: 0905beec9f52 ("selftests/bpf: run flow dissector tests in skb-less mode")
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200531082846.2117903-11-jakub@cloudflare.com

selftests/bpf: Add tests for attaching bpf_link to netns

Extend the existing test case for flow dissector attaching to cover:

- link creation,
- link updates,
- link info querying,
- mixing links with direct prog attachment.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200531082846.2117903-10-jakub@cloudflare.com

bpftool: Support link show for netns-attached links

Make `bpf link show` aware of new link type, that is links attached to
netns. When listing netns-attached links, display netns inode number as its
identifier and link attach type.

Sample session:

  # readlink /proc/self/ns/net
  net:[4026532251]
  # bpftool prog show
  357: flow_dissector  tag a04f5eef06a7f555  gpl
          loaded_at 2020-05-30T16:53:51+0200  uid 0
          xlated 16B  jited 37B  memlock 4096B
  358: flow_dissector  tag a04f5eef06a7f555  gpl
          loaded_at 2020-05-30T16:53:51+0200  uid 0
          xlated 16B  jited 37B  memlock 4096B
  # bpftool link show
  108: netns  prog 357
          netns_ino 4026532251  attach_type flow_dissector
  # bpftool link -jp show
  [{
          "id": 108,
          "type": "netns",
          "prog_id": 357,
          "netns_ino": 4026532251,
          "attach_type": "flow_dissector"
      }
  ]

  (... after netns is gone ...)

  # bpftool link show
  108: netns  prog 357
          netns_ino 0  attach_type flow_dissector
  # bpftool link -jp show
  [{
          "id": 108,
          "type": "netns",
          "prog_id": 357,
          "netns_ino": 0,
          "attach_type": "flow_dissector"
      }
  ]

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200531082846.2117903-9-jakub@cloudflare.com

bpftool: Extract helpers for showing link attach type

Code for printing link attach_type is duplicated in a couple of places, and
likely will be duplicated for future link types as well. Create helpers to
prevent duplication.

Suggested-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200531082846.2117903-8-jakub@cloudflare.com

libbpf: Add support for bpf_link-based netns attachment

Add bpf_program__attach_nets(), which uses LINK_CREATE subcommand to create
an FD-based kernel bpf_link, for attach types tied to network namespace,
that is BPF_FLOW_DISSECTOR for the moment.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200531082846.2117903-7-jakub@cloudflare.com

bpf, cgroup: Return ENOLINK for auto-detached links on update

Failure to update a bpf_link because it has been auto-detached by a dying
cgroup currently results in EINVAL error, even though the arguments passed
to bpf() syscall are not wrong.

bpf_links attaching to netns in this case will return ENOLINK, which
carries the message that the link is no longer attached to anything.

Change cgroup bpf_links to do the same to keep the uAPI errors consistent.

Fixes: 0c991ebc8c69 ("bpf: Implement bpf_prog replacement for an active bpf_cgroup_link")
Suggested-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200531082846.2117903-6-jakub@cloudflare.com

bpf: Add link-based BPF program attachment to network namespace

Extend bpf() syscall subcommands that operate on bpf_link, that is
LINK_CREATE, LINK_UPDATE, OBJ_GET_INFO, to accept attach types tied to
network namespaces (only flow dissector at the moment).

Link-based and prog-based attachment can be used interchangeably, but only
one can exist at a time. Attempts to attach a link when a prog is already
attached directly, and the other way around, will be met with -EEXIST.
Attempts to detach a program when link exists result in -EINVAL.

Attachment of multiple links of same attach type to one netns is not
supported with the intention to lift the restriction when a use-case
presents itself. Because of that link create returns -E2BIG when trying to
create another netns link, when one already exists.

Link-based attachments to netns don't keep a netns alive by holding a ref
to it. Instead links get auto-detached from netns when the latter is being
destroyed, using a pernet pre_exit callback.

When auto-detached, link lives in defunct state as long there are open FDs
for it. -ENOLINK is returned if a user tries to update a defunct link.

Because bpf_link to netns doesn't hold a ref to struct net, special care is
taken when releasing, updating, or filling link info. The netns might be
getting torn down when any of these link operations are in progress. That
is why auto-detach and update/release/fill_info are synchronized by the
same mutex. Also, link ops have to always check if auto-detach has not
happened yet and if netns is still alive (refcnt > 0).

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200531082846.2117903-5-jakub@cloudflare.com

flow_dissector: Move out netns_bpf prog callbacks

Move functions to manage BPF programs attached to netns that are not
specific to flow dissector to a dedicated module named
bpf/net_namespace.c.

The set of functions will grow with the addition of bpf_link support for
netns attached programs. This patch prepares ground by creating a place
for it.

This is a code move with no functional changes intended.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200531082846.2117903-4-jakub@cloudflare.com

net: Introduce netns_bpf for BPF programs attached to netns

In order to:

(1) attach more than one BPF program type to netns, or
(2) support attaching BPF programs to netns with bpf_link, or
(3) support multi-prog attach points for netns

we will need to keep more state per netns than a single pointer like we
have now for BPF flow dissector program.

Prepare for the above by extracting netns_bpf that is part of struct net,
for storing all state related to BPF programs attached to netns.

Turn flow dissector callbacks for querying/attaching/detaching a program
into generic ones that operate on netns_bpf. Next patch will move the
generic callbacks into their own module.

This is similar to how it is organized for cgroup with cgroup_bpf.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20200531082846.2117903-3-jakub@cloudflare.com

flow_dissector: Pull locking up from prog attach callback

Split out the part of attach callback that happens with attach/detach lock
acquired. This structures the prog attach callback in a way that opens up
doors for moving the locking out of flow_dissector and into generic
callbacks for attaching/detaching progs to netns in subsequent patches.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20200531082846.2117903-2-jakub@cloudflare.com

libbpf: Add _GNU_SOURCE for reallocarray to ringbuf.c

On systems with recent enough glibc, reallocarray compat won't kick in, so
reallocarray() itself has to come from stdlib.h include. But _GNU_SOURCE is
necessary to enable it. So add it.

Fixes: bf99c936f947 ("libbpf: Add BPF ring buffer support")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200601202601.2139477-1-andriin@fb.com

bpf: Use tracing helpers for lsm programs

Currenty lsm uses bpf_tracing_func_proto helpers which do
not include stack trace or perf event output. It's useful
to have those for bpftrace lsm support [1].

Using tracing_prog_func_proto helpers for lsm programs.

[1] https://github.com/iovisor/bpftrace/pull/1347

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20200531154255.896551-1-jolsa@kernel.org

xdp: Rename convert_to_xdp_frame in xdp_convert_buff_to_frame

In order to use standard 'xdp' prefix, rename convert_to_xdp_frame
utility routine in xdp_convert_buff_to_frame and replace all the
occurrences

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/6344f739be0d1a08ab2b9607584c4d5478c8c083.1590698295.git.lorenzo@kernel.org

xdp: Introduce xdp_convert_frame_to_buff utility routine

Introduce xdp_convert_frame_to_buff utility routine to initialize xdp_buff
fields from xdp_frames ones. Rely on xdp_convert_frame_to_buff in veth xdp
code.

Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/87acf133073c4b2d4cbb8097e8c2480c0a0fac32.1590698295.git.lorenzo@kernel.org

Merge branch 'bpf_setsockopt-SO_BINDTODEVICE'

Ferenc Fejes says:

====================
This option makes it possible to programatically bind sockets
to netdevices. With the help of this option sockets
of VRF unaware applications could be distributed between
multiple VRFs with an eBPF program. This lets the applications
benefit from multiple possible routes.

v2:
- splitting up the patch to three parts
- lock_sk parameter for optional locking in sock_bindtoindex - Stanislav Fomichev
- testing the SO_BINDTODEVICE option - Andrii Nakryiko
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add test for SO_BINDTODEVICE opt of bpf_setsockopt

This test intended to verify if SO_BINDTODEVICE option works in
bpf_setsockopt. Because we already in the SOL_SOCKET level in this
connect bpf prog its safe to verify the sanity in the beginning of
the connect_v4_prog by calling the bind_to_device test helper.

The testing environment already created by the test_sock_addr.sh
script so this test assume that two netdevices already existing in
the system: veth pair with names test_sock_addr1 and test_sock_addr2.
The test will try to bind the socket to those devices first.
Then the test assume there are no netdevice with "nonexistent_dev"
name so the bpf_setsockopt will give use ENODEV error.
At the end the test remove the device binding from the socket
by binding it to an empty name.

Signed-off-by: Ferenc Fejes <fejes@inf.elte.hu>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/3f055b8e45c65639c5c73d0b4b6c589e60b86f15.1590871065.git.fejes@inf.elte.hu

bpf: Allow SO_BINDTODEVICE opt in bpf_setsockopt

Extending the supported sockopts in bpf_setsockopt with
SO_BINDTODEVICE. We call sock_bindtoindex with parameter
lock_sk = false in this context because we already owning
the socket.

Signed-off-by: Ferenc Fejes <fejes@inf.elte.hu>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/4149e304867b8d5a606a305bc59e29b063e51f49.1590871065.git.fejes@inf.elte.hu

net: Make locking in sock_bindtoindex optional

The sock_bindtoindex intended for kernel wide usage however
it will lock the socket regardless of the context. This modification
relax this behavior optionally: locking the socket will be optional
by calling the sock_bindtoindex with lock_sk = true.

The modification applied to all users of the sock_bindtoindex.

Signed-off-by: Ferenc Fejes <fejes@inf.elte.hu>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/bee6355da40d9e991b2f2d12b67d55ebb5f5b207.1590871065.git.fejes@inf.elte.hu

bpf: Change kvfree to kfree in generic_map_lookup_batch()

buf_prevkey in generic_map_lookup_batch() is allocated with
kmalloc(). It's safe to free it with kfree().

Fixes: cb4d03ab499d ("bpf: Add generic support for lookup batch op")
Signed-off-by: Denis Efremov <efremov@linux.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200601162814.17426-1-efremov@linux.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'fix-ktls-with-sk_skb_verdict'

John Fastabend says:

====================
If a socket is running a BPF_SK_SKB_SREAM_VERDICT program and KTLS is
enabled the data stream may be broken if both TLS stream parser and
BPF stream parser try to handle data. Fix this here by making KTLS
stream parser run first to ensure TLS messages are received correctly
and then calling the verdict program. This analogous to how we handle
a similar conflict on the TX side.

Note, this is a fix but it doesn't make sense to push this late to
bpf tree so targeting bpf-next and keeping fixes tags.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

tools/bpf: sync bpf.h

Sync bpf.h into tool/include/uapi/

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf, selftests: Add test for ktls with skb bpf ingress policy

This adds a test for bpf ingress policy. To ensure data writes happen
as expected with extra TLS headers we run these tests with data
verification enabled by default. This will test receive packets have
"PASS" stamped into the front of the payload.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/159079363965.5745.3390806911628980210.stgit@john-Precision-5820-Tower
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'xdp_devmap'

David Ahern says:

====================
Implementation of Daniel's proposal for allowing DEVMAP entries to be
a device index, program fd pair.

Programs are run after XDP_REDIRECT and have access to both Rx device
and Tx device.

v4
- moved struct bpf_devmap_val from uapi to devmap.c, named the union
  and dropped the prefix from the elements - Jesper
- fixed 2 bugs in selftests

v3
- renamed struct to bpf_devmap_val
- used offsetofend to check for expected map size, modification of
  Toke's comment
- check for explicit value sizes
- adjusted switch statement in dev_map_run_prog per Andrii's comment
- changed SEC shortcut to xdp_devmap
- changed selftests to use skeleton and new map declaration

v2
- moved dev_map_ext_val definition to uapi to formalize the API for devmap
  extensions; add bpf_ prefix to the prog_fd and prog_id entries
- changed devmap code to handle struct in a way that it can support future
  extensions
- fixed subject in libbpf patch

v1
- fixed prog put on invalid program - Toke
- changed write value from id to fd per Toke's comments about capabilities
- add test cases
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Fix running sk_skb program types with ktls

KTLS uses a stream parser to collect TLS messages and send them to
the upper layer tls receive handler. This ensures the tls receiver
has a full TLS header to parse when it is run. However, when a
socket has BPF_SK_SKB_STREAM_VERDICT program attached before KTLS
is enabled we end up with two stream parsers running on the same
socket.

The result is both try to run on the same socket. First the KTLS
stream parser runs and calls read_sock() which will tcp_read_sock
which in turn calls tcp_rcv_skb(). This dequeues the skb from the
sk_receive_queue. When this is done KTLS code then data_ready()
callback which because we stacked KTLS on top of the bpf stream
verdict program has been replaced with sk_psock_start_strp(). This
will in turn kick the stream parser again and eventually do the
same thing KTLS did above calling into tcp_rcv_skb() and dequeuing
a skb from the sk_receive_queue.

At this point the data stream is broke. Part of the stream was
handled by the KTLS side some other bytes may have been handled
by the BPF side. Generally this results in either missing data
or more likely a "Bad Message" complaint from the kTLS receive
handler as the BPF program steals some bytes meant to be in a
TLS header and/or the TLS header length is no longer correct.

We've already broke the idealized model where we can stack ULPs
in any order with generic callbacks on the TX side to handle this.
So in this patch we do the same thing but for RX side. We add
a sk_psock_strp_enabled() helper so TLS can learn a BPF verdict
program is running and add a tls_sw_has_ctx_rx() helper so BPF
side can learn there is a TLS ULP on the socket.

Then on BPF side we omit calling our stream parser to avoid
breaking the data stream for the KTLS receiver. Then on the
KTLS side we call BPF_SK_SKB_STREAM_VERDICT once the KTLS
receiver is done with the packet but before it posts the
msg to userspace. This gives us symmetry between the TX and
RX halfs and IMO makes it usable again. On the TX side we
process packets in this order BPF -> TLS -> TCP and on
the receive side in the reverse order TCP -> TLS -> BPF.

Discovered while testing OpenSSL 3.0 Alpha2.0 release.

Fixes: d829e9c4112b5 ("tls: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/159079361946.5745.605854335665044485.stgit@john-Precision-5820-Tower
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftest: Add tests for XDP programs in devmap entries

Add tests to verify ability to add an XDP program to a
entry in a DEVMAP.

Add negative tests to show DEVMAP programs can not be
attached to devices as a normal XDP program, and accesses
to egress_ifindex require BPF_XDP_DEVMAP attach type.

Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200529220716.75383-6-dsahern@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Refactor sockmap redirect code so its easy to reuse

We will need this block of code called from tls context shortly
lets refactor the redirect logic so its easy to use. This also
cleans up the switch stmt so we have fewer fallthrough cases.

No logic changes are intended.

Fixes: d829e9c4112b5 ("tls: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/159079360110.5745.7024009076049029819.stgit@john-Precision-5820-Tower
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

libbpf: Add SEC name for xdp programs attached to device map

Support SEC("xdp_devmap*") as a short cut for loading the program with
type BPF_PROG_TYPE_XDP and expected attach type BPF_XDP_DEVMAP.

Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200529220716.75383-5-dsahern@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

xdp: Add xdp_txq_info to xdp_buff

Add xdp_txq_info as the Tx counterpart to xdp_rxq_info. At the
moment only the device is added. Other fields (queue_index)
can be added as use cases arise.

>From a UAPI perspective, add egress_ifindex to xdp context for
bpf programs to see the Tx device.

Update the verifier to only allow accesses to egress_ifindex by
XDP programs with BPF_XDP_DEVMAP expected attach type.

Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200529220716.75383-4-dsahern@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Add support to attach bpf program to a devmap entry

Add BPF_XDP_DEVMAP attach type for use with programs associated with a
DEVMAP entry.

Allow DEVMAPs to associate a program with a device entry by adding
a bpf_prog.fd to 'struct bpf_devmap_val'. Values read show the program
id, so the fd and id are a union. bpf programs can get access to the
struct via vmlinux.h.

The program associated with the fd must have type XDP with expected
attach type BPF_XDP_DEVMAP. When a program is associated with a device
index, the program is run on an XDP_REDIRECT and before the buffer is
added to the per-cpu queue. At this point rxq data is still valid; the
next patch adds tx device information allowing the prorgam to see both
ingress and egress device indices.

XDP generic is skb based and XDP programs do not work with skb's. Block
the use case by walking maps used by a program that is to be attached
via xdpgeneric and fail if any of them are DEVMAP / DEVMAP_HASH with

Block attach of BPF_XDP_DEVMAP programs to devices.

Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200529220716.75383-3-dsahern@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Use strncpy_from_unsafe_strict() in bpf_seq_printf() helper

In bpf_seq_printf() helper, when user specified a "%s" in the
format string, strncpy_from_unsafe() is used to read the actual string
to a buffer. The string could be a format string or a string in
the kernel data structure. It is really unlikely that the string
will reside in the user memory.

This is different from Commit b2a5212fb634 ("bpf: Restrict bpf_trace_printk()'s %s
usage and add %pks, %pus specifier") which still used
strncpy_from_unsafe() for "%s" to preserve the old behavior.

If in the future, bpf_seq_printf() indeed needs to read user
memory, we can implement "%pus" format string.

Based on discussion in [1], if the intent is to read kernel memory,
strncpy_from_unsafe_strict() should be used. So this patch
changed to use strncpy_from_unsafe_strict().

[1]: https://lore.kernel.org/bpf/20200521152301.2587579-1-hch@lst.de/T/

Fixes: 492e639f0c22 ("bpf: Add bpf_seq_printf and bpf_seq_write helpers")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/bpf/20200529004810.3352219-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

devmap: Formalize map value as a named struct

Add 'struct bpf_devmap_val' to formalize the expected values that can
be passed in for a DEVMAP. Update devmap code to use the struct.

Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200529220716.75383-2-dsahern@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Add rx_queue_mapping to bpf_sock

Add "rx_queue_mapping" to bpf_sock. This gives read access for the
existing field (sk_rx_queue_mapping) of struct sock from bpf_sock.
Semantics for the bpf_sock rx_queue_mapping access are similar to
sk_rx_queue_get(), i.e the value NO_QUEUE_MAPPING is not allowed
and -1 is returned in that case. This is useful for transmit queue
selection based on the received queue index which is cached in the
socket in the receive path.

v3: Addressed review comments to add usecase in patch description,
and fixed default value for rx_queue_mapping.
v2: fixed build error for CONFIG_XPS wrapping, reported by
kbuild test robot <lkp@intel.com>

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'bpf-ring-buffer'

Andrii Nakryiko says:

====================
Implement a new BPF ring buffer, as presented at BPF virtual conference ([0]).
It presents an alternative to perf buffer, following its semantics closely,
but allowing sharing same instance of ring buffer across multiple CPUs
efficiently.

Most patches have extensive commentary explaining various aspects, so I'll
keep cover letter short. Overall structure of the patch set:
- patch #1 adds BPF ring buffer implementation to kernel and necessary
  verifier support;
- patch #2 adds libbpf consumer implementation for BPF ringbuf;
- patch #3 adds selftest, both for single BPF ring buf use case, as well as
  using it with array/hash of maps;
- patch #4 adds extensive benchmarks and provide some analysis in commit
  message, it builds upon selftests/bpf's bench runner.
- patch #5 adds most of patch #1 commit message as a doc under
  Documentation/bpf/ringbuf.rst.

Litmus tests, validating consumer/producer protocols and memory orderings,
were moved out as discussed in [1] and are going to be posted against -rcu
tree and put under Documentation/litmus-tests/bpf-rb.

  [0] https://docs.google.com/presentation/d/18ITdg77Bj6YDOH2LghxrnFxiPWe0fAqcmJY95t_qr0w
  [1] https://lkml.org/lkml/2020/5/22/1011

v3->v4:
- fix ringbuf freeing (vunmap, __free_page); verified with a trivial loop
  creating and closing ringbuf map endlessly (Daniel);

v2->v3:
- dropped unnecessary smp_wmb() (Paul);
- verifier reference type enhancement patch was dropped (Alexei);
- better verifier message for various memory access checks (Alexei);
- clarified a bit roundup_len() bit shifting (Alexei);
- converted doc to .rst (Alexei);
- fixed warning on 32-bit arches regarding tautological ring area size check.

v1->v2:
- commit()/discard()/output() accept flags (NO_WAKEUP/FORCE_WAKEUP) (Stanislav);
- bpf_ringbuf_query() added, returning available data size, ringbuf size,
  consumer/producer positions, needed to implement smarter notification policy
  (Stanislav);
- added ringbuf UAPI constants to include/uapi/linux/bpf.h (Jonathan);
- fixed sample size check, added proper ringbuf size check (Jonathan, Alexei);
- wake_up_all() is done through irq_work (Alexei);
- consistent use of smp_load_acquire/smp_store_release, no
  READ_ONCE/WRITE_ONCE (Alexei);
- added Documentation/bpf/ringbuf.txt (Stanislav);
- updated litmus test with smp_load_acquire/smp_store_release changes;
- added ring_buffer__consume() API to libbpf for busy-polling;
- ring_buffer__poll() on success returns number of records consumed;
- fixed EPOLL notifications, don't assume available data, done similarly to
  perfbuf's implementation;
- both ringbuf and perfbuf now have --rb-sampled mode, instead of
  pb-raw/pb-custom mode, updated benchmark results;
- extended ringbuf selftests to validate epoll logic/manual notification
  logic, as well as bpf_ringbuf_query().
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

selftests/bpf: Add tests for write-only stacks/queues

For write-only stacks and queues bpf_map_update_elem should be allowed, but
bpf_map_lookup_elem and bpf_map_lookup_and_delete_elem should fail with EPERM.

Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200527185700.14658-6-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

docs/bpf: Add BPF ring buffer design notes

Add commit description from patch #1 as a stand-alone documentation under
Documentation/bpf, as it might be more convenient format, in long term
perspective.

Suggested-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200529075424.3139988-6-andriin@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Add BPF ringbuf and perf buffer benchmarks

Extend bench framework with ability to have benchmark-provided child argument
parser for custom benchmark-specific parameters. This makes bench generic code
modular and independent from any specific benchmark.

Also implement a set of benchmarks for new BPF ring buffer and existing perf
buffer. 4 benchmarks were implemented: 2 variations for each of BPF ringbuf
and perfbuf:,
  - rb-libbpf utilizes stock libbpf ring_buffer manager for reading data;
  - rb-custom implements custom ring buffer setup and reading code, to
    eliminate overheads inherent in generic libbpf code due to callback
    functions and the need to update consumer position after each consumed
    record, instead of batching updates (due to pessimistic assumption that
    user callback might take long time and thus could unnecessarily hold ring
    buffer space for too long);
  - pb-libbpf uses stock libbpf perf_buffer code with all the default
    settings, though uses higher-performance raw event callback to minimize
    unnecessary overhead;
  - pb-custom implements its own custom consumer code to minimize any possible
    overhead of generic libbpf implementation and indirect function calls.

All of the test support default, no data notification skipped, mode, as well
as sampled mode (with --rb-sampled flag), which allows to trigger epoll
notification less frequently and reduce overhead. As will be shown, this mode
is especially critical for perf buffer, which suffers from high overhead of
wakeups in kernel.

Otherwise, all benchamrks implement similar way to generate a batch of records
by using fentry/sys_getpgid BPF program, which pushes a bunch of records in
a tight loop and records number of successful and dropped samples. Each record
is a small 8-byte integer, to minimize the effect of memory copying with
bpf_perf_event_output() and bpf_ringbuf_output().

Benchmarks that have only one producer implement optional back-to-back mode,
in which record production and consumption is alternating on the same CPU.
This is the highest-throughput happy case, showing ultimate performance
achievable with either BPF ringbuf or perfbuf.

All the below scenarios are implemented in a script in
benchs/run_bench_ringbufs.sh. Tests were performed on 28-core/56-thread
Intel Xeon CPU E5-2680 v4 @ 2.40GHz CPU.

Single-producer, parallel producer
==================================
rb-libbpf            12.054 ± 0.320M/s (drops 0.000 ± 0.000M/s)
rb-custom            8.158 ± 0.118M/s (drops 0.001 ± 0.003M/s)
pb-libbpf            0.931 ± 0.007M/s (drops 0.000 ± 0.000M/s)
pb-custom            0.965 ± 0.003M/s (drops 0.000 ± 0.000M/s)

Single-producer, parallel producer, sampled notification
========================================================
rb-libbpf            11.563 ± 0.067M/s (drops 0.000 ± 0.000M/s)
rb-custom            15.895 ± 0.076M/s (drops 0.000 ± 0.000M/s)
pb-libbpf            9.889 ± 0.032M/s (drops 0.000 ± 0.000M/s)
pb-custom            9.866 ± 0.028M/s (drops 0.000 ± 0.000M/s)

Single producer on one CPU, consumer on another one, both running at full
speed. Curiously, rb-libbpf has higher throughput than objectively faster (due
to more lightweight consumer code path) rb-custom. It appears that faster
consumer causes kernel to send notifications more frequently, because consumer
appears to be caught up more frequently. Performance of perfbuf suffers from
default "no sampling" policy and huge overhead that causes.

In sampled mode, rb-custom is winning very significantly eliminating too
frequent in-kernel wakeups, the gain appears to be more than 2x.

Perf buffer achieves even more impressive wins, compared to stock perfbuf
settings, with 10x improvements in throughput with 1:500 sampling rate. The
trade-off is that with sampling, application might not get next X events until
X+1st arrives, which is not always acceptable. With steady influx of events,
though, this shouldn't be a problem.

Overall, single-producer performance of ring buffers seems to be better no
matter the sampled/non-sampled modes, but it especially beats ring buffer
without sampling due to its adaptive notification approach.

Single-producer, back-to-back mode
==================================
rb-libbpf            15.507 ± 0.247M/s (drops 0.000 ± 0.000M/s)
rb-libbpf-sampled    14.692 ± 0.195M/s (drops 0.000 ± 0.000M/s)
rb-custom            21.449 ± 0.157M/s (drops 0.000 ± 0.000M/s)
rb-custom-sampled    20.024 ± 0.386M/s (drops 0.000 ± 0.000M/s)
pb-libbpf            1.601 ± 0.015M/s (drops 0.000 ± 0.000M/s)
pb-libbpf-sampled    8.545 ± 0.064M/s (drops 0.000 ± 0.000M/s)
pb-custom            1.607 ± 0.022M/s (drops 0.000 ± 0.000M/s)
pb-custom-sampled    8.988 ± 0.144M/s (drops 0.000 ± 0.000M/s)

Here we test a back-to-back mode, which is arguably best-case scenario both
for BPF ringbuf and perfbuf, because there is no contention and for ringbuf
also no excessive notification, because consumer appears to be behind after
the first record. For ringbuf, custom consumer code clearly wins with 21.5 vs
16 million records per second exchanged between producer and consumer. Sampled
mode actually hurts a bit due to slightly slower producer logic (it needs to
fetch amount of data available to decide whether to skip or force notification).

Perfbuf with wakeup sampling gets 5.5x throughput increase, compared to
no-sampling version. There also doesn't seem to be noticeable overhead from
generic libbpf handling code.

Perfbuf back-to-back, effect of sample rate
===========================================
pb-sampled-1         1.035 ± 0.012M/s (drops 0.000 ± 0.000M/s)
pb-sampled-5         3.476 ± 0.087M/s (drops 0.000 ± 0.000M/s)
pb-sampled-10        5.094 ± 0.136M/s (drops 0.000 ± 0.000M/s)
pb-sampled-25        7.118 ± 0.153M/s (drops 0.000 ± 0.000M/s)
pb-sampled-50        8.169 ± 0.156M/s (drops 0.000 ± 0.000M/s)
pb-sampled-100       8.887 ± 0.136M/s (drops 0.000 ± 0.000M/s)
pb-sampled-250       9.180 ± 0.209M/s (drops 0.000 ± 0.000M/s)
pb-sampled-500       9.353 ± 0.281M/s (drops 0.000 ± 0.000M/s)
pb-sampled-1000      9.411 ± 0.217M/s (drops 0.000 ± 0.000M/s)
pb-sampled-2000      9.464 ± 0.167M/s (drops 0.000 ± 0.000M/s)
pb-sampled-3000      9.575 ± 0.273M/s (drops 0.000 ± 0.000M/s)

This benchmark shows the effect of event sampling for perfbuf. Back-to-back
mode for highest throughput. Just doing every 5th record notification gives
3.5x speed up. 250-500 appears to be the point of diminishing return, with
almost 9x speed up. Most benchmarks use 500 as the default sampling for pb-raw
and pb-custom.

Ringbuf back-to-back, effect of sample rate
===========================================
rb-sampled-1         1.106 ± 0.010M/s (drops 0.000 ± 0.000M/s)
rb-sampled-5         4.746 ± 0.149M/s (drops 0.000 ± 0.000M/s)
rb-sampled-10        7.706 ± 0.164M/s (drops 0.000 ± 0.000M/s)
rb-sampled-25        12.893 ± 0.273M/s (drops 0.000 ± 0.000M/s)
rb-sampled-50        15.961 ± 0.361M/s (drops 0.000 ± 0.000M/s)
rb-sampled-100       18.203 ± 0.445M/s (drops 0.000 ± 0.000M/s)
rb-sampled-250       19.962 ± 0.786M/s (drops 0.000 ± 0.000M/s)
rb-sampled-500       20.881 ± 0.551M/s (drops 0.000 ± 0.000M/s)
rb-sampled-1000      21.317 ± 0.532M/s (drops 0.000 ± 0.000M/s)
rb-sampled-2000      21.331 ± 0.535M/s (drops 0.000 ± 0.000M/s)
rb-sampled-3000      21.688 ± 0.392M/s (drops 0.000 ± 0.000M/s)

Similar benchmark for ring buffer also shows a great advantage (in terms of
throughput) of skipping notifications. Skipping every 5th one gives 4x boost.
Also similar to perfbuf case, 250-500 seems to be the point of diminishing
returns, giving roughly 20x better results.

Keep in mind, for this test, notifications are controlled manually with
BPF_RB_NO_WAKEUP and BPF_RB_FORCE_WAKEUP. As can be seen from previous
benchmarks, adaptive notifications based on consumer's positions provides same
(or even slightly better due to simpler load generator on BPF side) benefits in
favorable back-to-back scenario. Over zealous and fast consumer, which is
almost always caught up, will make thoughput numbers smaller. That's the case
when manual notification control might prove to be extremely beneficial.

Ringbuf back-to-back, reserve+commit vs output
==============================================
reserve              22.819 ± 0.503M/s (drops 0.000 ± 0.000M/s)
output               18.906 ± 0.433M/s (drops 0.000 ± 0.000M/s)

Ringbuf sampled, reserve+commit vs output
=========================================
reserve-sampled      15.350 ± 0.132M/s (drops 0.000 ± 0.000M/s)
output-sampled       14.195 ± 0.144M/s (drops 0.000 ± 0.000M/s)

BPF ringbuf supports two sets of APIs with various usability and performance
tradeoffs: bpf_ringbuf_reserve()+bpf_ringbuf_commit() vs bpf_ringbuf_output().
This benchmark clearly shows superiority of reserve+commit approach, despite
using a small 8-byte record size.

Single-producer, consumer/producer competing on the same CPU, low batch count
=============================================================================
rb-libbpf            3.045 ± 0.020M/s (drops 3.536 ± 0.148M/s)
rb-custom            3.055 ± 0.022M/s (drops 3.893 ± 0.066M/s)
pb-libbpf            1.393 ± 0.024M/s (drops 0.000 ± 0.000M/s)
pb-custom            1.407 ± 0.016M/s (drops 0.000 ± 0.000M/s)

This benchmark shows one of the worst-case scenarios, in which producer and
consumer do not coordinate *and* fight for the same CPU. No batch count and
sampling settings were able to eliminate drops for ringbuffer, producer is
just too fast for consumer to keep up. But ringbuf and perfbuf still able to
pass through quite a lot of messages, which is more than enough for a lot of
applications.

Ringbuf, multi-producer contention
==================================
rb-libbpf nr_prod 1  10.916 ± 0.399M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 2  4.931 ± 0.030M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 3  4.880 ± 0.006M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 4  3.926 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 8  4.011 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 12 3.967 ± 0.016M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 16 2.604 ± 0.030M/s (drops 0.001 ± 0.002M/s)
rb-libbpf nr_prod 20 2.233 ± 0.003M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 24 2.085 ± 0.015M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 28 2.055 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 32 1.962 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 36 2.089 ± 0.005M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 40 2.118 ± 0.006M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 44 2.105 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 48 2.120 ± 0.058M/s (drops 0.000 ± 0.001M/s)
rb-libbpf nr_prod 52 2.074 ± 0.024M/s (drops 0.007 ± 0.014M/s)

Ringbuf uses a very short-duration spinlock during reservation phase, to check
few invariants, increment producer count and set record header. This is the
biggest point of contention for ringbuf implementation. This benchmark
evaluates the effect of multiple competing writers on overall throughput of
a single shared ringbuffer.

Overall throughput drops almost 2x when going from single to two
highly-contended producers, gradually dropping with additional competing
producers.  Performance drop stabilizes at around 20 producers and hovers
around 2mln even with 50+ fighting producers, which is a 5x drop compared to
non-contended case. Good kernel implementation in kernel helps maintain decent
performance here.

Note, that in the intended real-world scenarios, it's not expected to get even
close to such a high levels of contention. But if contention will become
a problem, there is always an option of sharding few ring buffers across a set
of CPUs.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200529075424.3139988-5-andriin@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>