Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
authorJakub Kicinski <kuba@kernel.org>
Fri, 22 Jul 2022 23:55:43 +0000 (16:55 -0700)
committerJakub Kicinski <kuba@kernel.org>
Fri, 22 Jul 2022 23:55:44 +0000 (16:55 -0700)
Daniel Borkmann says:

====================
bpf-next 2022-07-22

We've added 73 non-merge commits during the last 12 day(s) which contain
a total of 88 files changed, 3458 insertions(+), 860 deletions(-).

The main changes are:

1) Implement BPF trampoline for arm64 JIT, from Xu Kuohai.

2) Add ksyscall/kretsyscall section support to libbpf to simplify tracing kernel
   syscalls through the kprobe mechanism, from Andrii Nakryiko.

3) Allow for livepatch (KLP) and BPF trampolines to attach to the same kernel
   function, from Song Liu & Jiri Olsa.

4) Add new kfunc infrastructure for netfilter's CT e.g. to insert and change
   entries, from Kumar Kartikeya Dwivedi & Lorenzo Bianconi.

5) Add a ksym BPF iterator to allow for more flexible and efficient interactions
   with kernel symbols, from Alan Maguire.

6) Bug fixes in libbpf e.g. for uprobe binary path resolution, from Dan Carpenter.

7) Fix BPF subprog function names in stack traces, from Alexei Starovoitov.

8) libbpf support for writing custom perf event readers, from Jon Doron.

9) Switch to use SPDX tag for BPF helper man page, from Alejandro Colomar.

10) Fix xsk send-only sockets when in busy poll mode, from Maciej Fijalkowski.

11) Reparent BPF maps and their charging on memcg offlining, from Roman Gushchin.

12) Multiple follow-up fixes around BPF lsm cgroup infra, from Stanislav Fomichev.

13) Use bootstrap version of bpftool where possible to speed up builds, from Pu Lehui.

14) Cleanup BPF verifier's check_func_arg() handling, from Joanne Koong.

15) Make non-prealloced BPF map allocations low priority to play better with
    memcg limits, from Yafang Shao.

16) Fix BPF test runner to reject zero-length data for skbs, from Zhengchao Shao.

17) Various smaller cleanups and improvements all over the place.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (73 commits)
  bpf: Simplify bpf_prog_pack_[size|mask]
  bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch)
  bpf, x64: Allow to use caller address from stack
  ftrace: Allow IPMODIFY and DIRECT ops on the same function
  ftrace: Add modify_ftrace_direct_multi_nolock
  bpf/selftests: Fix couldn't retrieve pinned program in xdp veth test
  bpf: Fix build error in case of !CONFIG_DEBUG_INFO_BTF
  selftests/bpf: Fix test_verifier failed test in unprivileged mode
  selftests/bpf: Add negative tests for new nf_conntrack kfuncs
  selftests/bpf: Add tests for new nf_conntrack kfuncs
  selftests/bpf: Add verifier tests for trusted kfunc args
  net: netfilter: Add kfuncs to set and change CT status
  net: netfilter: Add kfuncs to set and change CT timeout
  net: netfilter: Add kfuncs to allocate and insert CT
  net: netfilter: Deduplicate code in bpf_{xdp,skb}_ct_lookup
  bpf: Add documentation for kfuncs
  bpf: Add support for forcing kfunc args to be trusted
  bpf: Switch to new kfunc flags infrastructure
  tools/resolve_btfids: Add support for 8-byte BTF sets
  bpf: Introduce 8-byte BTF set
  ...
====================

Link: https://lore.kernel.org/r/20220722221218.29943-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
88 files changed:
Documentation/bpf/btf.rst
Documentation/bpf/index.rst
Documentation/bpf/kfuncs.rst [new file with mode: 0644]
Documentation/bpf/map_hash.rst [new file with mode: 0644]
arch/arm64/include/asm/insn.h
arch/arm64/lib/insn.c
arch/arm64/net/bpf_jit.h
arch/arm64/net/bpf_jit_comp.c
arch/x86/net/bpf_jit_comp.c
include/linux/bpf.h
include/linux/bpf_verifier.h
include/linux/btf.h
include/linux/btf_ids.h
include/linux/filter.h
include/linux/ftrace.h
include/linux/skbuff.h
include/net/netfilter/nf_conntrack_core.h
include/net/xdp_sock_drv.h
include/uapi/linux/bpf.h
kernel/bpf/arraymap.c
kernel/bpf/bpf_lsm.c
kernel/bpf/bpf_struct_ops.c
kernel/bpf/btf.c
kernel/bpf/core.c
kernel/bpf/devmap.c
kernel/bpf/hashtab.c
kernel/bpf/local_storage.c
kernel/bpf/lpm_trie.c
kernel/bpf/preload/iterators/Makefile
kernel/bpf/syscall.c
kernel/bpf/trampoline.c
kernel/bpf/verifier.c
kernel/kallsyms.c
kernel/trace/ftrace.c
net/bpf/test_run.c
net/core/dev.c
net/core/filter.c
net/core/skmsg.c
net/ipv4/bpf_tcp_ca.c
net/ipv4/tcp_bbr.c
net/ipv4/tcp_cubic.c
net/ipv4/tcp_dctcp.c
net/netfilter/nf_conntrack_bpf.c
net/netfilter/nf_conntrack_core.c
net/netfilter/nf_conntrack_netlink.c
net/xdp/xsk.c
samples/bpf/Makefile
samples/bpf/fds_example.c
samples/bpf/sock_example.c
samples/bpf/test_cgrp2_attach.c
samples/bpf/test_lru_dist.c
samples/bpf/test_map_in_map_user.c
samples/bpf/tracex5_user.c
samples/bpf/xdp_redirect_map.bpf.c
samples/bpf/xdp_redirect_map_user.c
scripts/bpf_doc.py
tools/bpf/resolve_btfids/main.c
tools/bpf/runqslower/Makefile
tools/include/uapi/linux/bpf.h
tools/lib/bpf/bpf_tracing.h
tools/lib/bpf/btf_dump.c
tools/lib/bpf/gen_loader.c
tools/lib/bpf/libbpf.c
tools/lib/bpf/libbpf.h
tools/lib/bpf/libbpf.map
tools/lib/bpf/libbpf_internal.h
tools/lib/bpf/usdt.bpf.h
tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
tools/testing/selftests/bpf/prog_tests/bpf_iter.c
tools/testing/selftests/bpf/prog_tests/bpf_nf.c
tools/testing/selftests/bpf/prog_tests/btf.c
tools/testing/selftests/bpf/prog_tests/core_extern.c
tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
tools/testing/selftests/bpf/prog_tests/ringbuf_multi.c
tools/testing/selftests/bpf/prog_tests/skeleton.c
tools/testing/selftests/bpf/progs/bpf_iter.h
tools/testing/selftests/bpf/progs/bpf_iter_ksym.c [new file with mode: 0644]
tools/testing/selftests/bpf/progs/bpf_syscall_macro.c
tools/testing/selftests/bpf/progs/test_attach_probe.c
tools/testing/selftests/bpf/progs/test_bpf_nf.c
tools/testing/selftests/bpf/progs/test_bpf_nf_fail.c [new file with mode: 0644]
tools/testing/selftests/bpf/progs/test_core_extern.c
tools/testing/selftests/bpf/progs/test_probe_user.c
tools/testing/selftests/bpf/progs/test_skeleton.c
tools/testing/selftests/bpf/progs/test_xdp_noinline.c
tools/testing/selftests/bpf/test_xdp_veth.sh
tools/testing/selftests/bpf/verifier/bpf_loop_inline.c
tools/testing/selftests/bpf/verifier/calls.c

index f49aeef..cf8722f 100644 (file)
@@ -369,7 +369,8 @@ No additional type data follow ``btf_type``.
   * ``name_off``: offset to a valid C identifier
   * ``info.kind_flag``: 0
   * ``info.kind``: BTF_KIND_FUNC
-  * ``info.vlen``: 0
+  * ``info.vlen``: linkage information (BTF_FUNC_STATIC, BTF_FUNC_GLOBAL
+                   or BTF_FUNC_EXTERN)
   * ``type``: a BTF_KIND_FUNC_PROTO type
 
 No additional type data follow ``btf_type``.
@@ -380,6 +381,9 @@ type. The BTF_KIND_FUNC may in turn be referenced by a func_info in the
 :ref:`BTF_Ext_Section` (ELF) or in the arguments to :ref:`BPF_Prog_Load`
 (ABI).
 
+Currently, only linkage values of BTF_FUNC_STATIC and BTF_FUNC_GLOBAL are
+supported in the kernel.
+
 2.2.13 BTF_KIND_FUNC_PROTO
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 
index 96056a7..1bc2c5c 100644 (file)
@@ -19,6 +19,7 @@ that goes into great technical depth about the BPF Architecture.
    faq
    syscall_api
    helpers
+   kfuncs
    programs
    maps
    bpf_prog_run
diff --git a/Documentation/bpf/kfuncs.rst b/Documentation/bpf/kfuncs.rst
new file mode 100644 (file)
index 0000000..c0b7dae
--- /dev/null
@@ -0,0 +1,170 @@
+=============================
+BPF Kernel Functions (kfuncs)
+=============================
+
+1. Introduction
+===============
+
+BPF Kernel Functions, more commonly known as kfuncs, are functions in the
+Linux kernel which are exposed for use by BPF programs. Unlike normal BPF
+helpers, kfuncs do not have a stable interface and can change from one kernel
+release to another. Hence, BPF programs need to be updated in response to
+changes in the kernel.
+
+2. Defining a kfunc
+===================
+
+There are two ways to expose a kernel function to BPF programs: either make an
+existing function in the kernel visible, or add a new wrapper for BPF. In both
+cases, care must be taken that the BPF program can only call such functions in
+a valid context. To enforce this, the visibility of a kfunc can be restricted
+per program type.
+
+If you are not creating a BPF wrapper for an existing kernel function, skip
+ahead to :ref:`BPF_kfunc_nodef`.
+
+2.1 Creating a wrapper kfunc
+----------------------------
+
+When defining a wrapper kfunc, the wrapper function should have extern linkage.
+This prevents the compiler from optimizing away dead code, as this wrapper kfunc
+is not invoked anywhere in the kernel itself. It is not necessary to provide a
+prototype in a header for the wrapper kfunc.
+
+An example is given below::
+
+        /* Disables missing prototype warnings */
+        __diag_push();
+        __diag_ignore_all("-Wmissing-prototypes",
+                          "Global kfuncs as their definitions will be in BTF");
+
+        struct task_struct *bpf_find_get_task_by_vpid(pid_t nr)
+        {
+                return find_get_task_by_vpid(nr);
+        }
+
+        __diag_pop();
+
+A wrapper kfunc is often needed when we need to annotate parameters of the
+kfunc. Otherwise one may directly make the kfunc visible to the BPF program by
+registering it with the BPF subsystem. See :ref:`BPF_kfunc_nodef`.
+
+2.2 Annotating kfunc parameters
+-------------------------------
+
+Similar to BPF helpers, the verifier sometimes needs additional context to make
+the usage of kernel functions safer and more useful. Hence, a parameter can be
+annotated by suffixing the name of the kfunc argument with a __tag, where tag
+may be one of the supported annotations.
+
+2.2.1 __sz Annotation
+---------------------
+
+This annotation is used to indicate a memory and size pair in the argument list.
+An example is given below::
+
+        void bpf_memzero(void *mem, int mem__sz)
+        {
+        ...
+        }
+
+Here, the verifier will treat the first argument as a PTR_TO_MEM, and the
+second argument as its size. By default, without the __sz annotation, the size
+of the type of the pointer is used. Without the __sz annotation, a kfunc cannot
+accept a void pointer.
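+
+From the BPF program side, the memory region and its size are simply passed as
+two arguments. A minimal sketch of calling the illustrative ``bpf_memzero``
+kfunc from above (both the kfunc and the buffer are assumptions made for this
+example)::
+
+        char buf[64] = {};
+
+        bpf_memzero(buf, sizeof(buf));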
+
+.. _BPF_kfunc_nodef:
+
+2.3 Using an existing kernel function
+-------------------------------------
+
+When an existing function in the kernel is fit for consumption by BPF programs,
+it can be directly registered with the BPF subsystem. However, care must still
+be taken to review the context in which it will be invoked by the BPF program
+and whether it is safe to do so.
+
+2.4 Annotating kfuncs
+---------------------
+
+In addition to kfunc arguments, the verifier may need more information about
+the type of kfunc(s) being registered with the BPF subsystem. To do so, we
+define flags on a set of kfuncs as follows::
+
+        BTF_SET8_START(bpf_task_set)
+        BTF_ID_FLAGS(func, bpf_get_task_pid, KF_ACQUIRE | KF_RET_NULL)
+        BTF_ID_FLAGS(func, bpf_put_pid, KF_RELEASE)
+        BTF_SET8_END(bpf_task_set)
+
+This set encodes the BTF ID of each kfunc listed above, and encodes the flags
+along with it. Of course, it is also allowed to specify no flags.
+
+2.4.1 KF_ACQUIRE flag
+---------------------
+
+The KF_ACQUIRE flag is used to indicate that the kfunc returns a pointer to a
+refcounted object. The verifier will then ensure that the pointer to the object
+is eventually released using a release kfunc, or transferred to a map using a
+referenced kptr (by invoking bpf_kptr_xchg). If not, the verifier rejects the
+BPF program as long as lingering references remain in any possible explored
+state of the program.
+
+2.4.2 KF_RET_NULL flag
+----------------------
+
+The KF_RET_NULL flag is used to indicate that the pointer returned by the kfunc
+may be NULL. Hence, it forces the user to do a NULL check on the pointer
+returned from the kfunc before making use of it (dereferencing or passing to
+another helper). This flag is often paired with the KF_ACQUIRE flag, but the
+two are orthogonal to each other.
+
+2.4.3 KF_RELEASE flag
+---------------------
+
+The KF_RELEASE flag is used to indicate that the kfunc releases the pointer
+passed in to it. Only one referenced pointer can be passed in. All copies of
+the pointer being released are invalidated as a result of invoking a kfunc
+with this flag.
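+
+Taken together, KF_ACQUIRE, KF_RET_NULL and KF_RELEASE describe the usual
+acquire, NULL-check and release pattern from the BPF program's point of view.
+A minimal sketch using the example kfuncs above (their exact prototypes are
+assumed here purely for illustration)::
+
+        struct pid *pid;
+
+        pid = bpf_get_task_pid(task);  /* KF_ACQUIRE | KF_RET_NULL */
+        if (!pid)                      /* NULL check is mandatory */
+                return 0;
+
+        /* ... use pid ... */
+
+        bpf_put_pid(pid);              /* KF_RELEASE drops the reference */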
+
+2.4.4 KF_KPTR_GET flag
+----------------------
+
+The KF_KPTR_GET flag is used to indicate that the kfunc takes its first argument
+as a pointer to a kptr, safely increments the refcount of the object it points to,
+and returns a reference to the user. The rest of the arguments may be normal
+arguments of a kfunc. The KF_KPTR_GET flag should be used in conjunction with
+KF_ACQUIRE and KF_RET_NULL flags.
+
+2.4.5 KF_TRUSTED_ARGS flag
+--------------------------
+
+The KF_TRUSTED_ARGS flag is used for kfuncs taking pointer arguments. It
+indicates that all pointer arguments will always be refcounted, and have
+their offset set to 0. It can be used to enforce that a pointer to a refcounted
+object acquired from a kfunc or BPF helper is passed as an argument to this
+kfunc without any modifications (e.g. pointer arithmetic) such that it is
+trusted and points to the original object. This flag is often used for kfuncs
+that operate (change some property, perform some operation) on an object that
+was obtained using an acquire kfunc. Such kfuncs need an unchanged pointer to
+ensure the integrity of the operation being performed on the expected object.
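+
+For example, assuming a hypothetical kfunc ``bpf_pid_set_prop`` registered
+with KF_TRUSTED_ARGS, only the unmodified pointer obtained from an acquire
+kfunc may be passed to it::
+
+        struct pid *pid;
+
+        pid = bpf_get_task_pid(task);         /* task comes from program context */
+        if (!pid)
+                return 0;
+
+        bpf_pid_set_prop(pid, 1);             /* accepted: unmodified pointer */
+        /* bpf_pid_set_prop(pid + 1, 1);         rejected: non-zero offset */
+
+        bpf_put_pid(pid);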
+
+2.5 Registering the kfuncs
+--------------------------
+
+Once the kfunc is prepared for use, the final step in making it visible is
+registering it with the BPF subsystem. Registration is done per BPF program
+type. An example is shown below::
+
+        BTF_SET8_START(bpf_task_set)
+        BTF_ID_FLAGS(func, bpf_get_task_pid, KF_ACQUIRE | KF_RET_NULL)
+        BTF_ID_FLAGS(func, bpf_put_pid, KF_RELEASE)
+        BTF_SET8_END(bpf_task_set)
+
+        static const struct btf_kfunc_id_set bpf_task_kfunc_set = {
+                .owner = THIS_MODULE,
+                .set   = &bpf_task_set,
+        };
+
+        static int init_subsystem(void)
+        {
+                return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_task_kfunc_set);
+        }
+        late_initcall(init_subsystem);
diff --git a/Documentation/bpf/map_hash.rst b/Documentation/bpf/map_hash.rst
new file mode 100644 (file)
index 0000000..e851208
--- /dev/null
@@ -0,0 +1,185 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright (C) 2022 Red Hat, Inc.
+
+===============================================
+BPF_MAP_TYPE_HASH, with PERCPU and LRU Variants
+===============================================
+
+.. note::
+   - ``BPF_MAP_TYPE_HASH`` was introduced in kernel version 3.19
+   - ``BPF_MAP_TYPE_PERCPU_HASH`` was introduced in version 4.6
+   - Both ``BPF_MAP_TYPE_LRU_HASH`` and ``BPF_MAP_TYPE_LRU_PERCPU_HASH``
+     were introduced in version 4.10
+
+``BPF_MAP_TYPE_HASH`` and ``BPF_MAP_TYPE_PERCPU_HASH`` provide general
+purpose hash map storage. Both the key and the value can be structs,
+allowing for composite keys and values.
+
+The kernel is responsible for allocating and freeing key/value pairs, up
+to the max_entries limit that you specify. Hash maps use pre-allocation
+of hash table elements by default. The ``BPF_F_NO_PREALLOC`` flag can be
+used to disable pre-allocation when it is too memory expensive.
+
+``BPF_MAP_TYPE_PERCPU_HASH`` provides a separate value slot per
+CPU. The per-cpu values are stored internally in an array.
+
+The ``BPF_MAP_TYPE_LRU_HASH`` and ``BPF_MAP_TYPE_LRU_PERCPU_HASH``
+variants add LRU semantics to their respective hash tables. An LRU hash
+will automatically evict the least recently used entries when the hash
+table reaches capacity. An LRU hash maintains an internal LRU list that
+is used to select elements for eviction. This internal LRU list is
+shared across CPUs, but it is possible to request a per CPU LRU list with
+the ``BPF_F_NO_COMMON_LRU`` flag when calling ``bpf_map_create``.
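+
+For example, a per CPU LRU list can be requested by passing
+``BPF_F_NO_COMMON_LRU`` in the map creation flags. A minimal userspace sketch
+using libbpf's ``bpf_map_create()`` (the map name, key/value sizes and entry
+count are arbitrary):
+
+.. code-block:: c
+
+    LIBBPF_OPTS(bpf_map_create_opts, opts, .map_flags = BPF_F_NO_COMMON_LRU);
+
+    int fd = bpf_map_create(BPF_MAP_TYPE_LRU_HASH, "lru_map",
+                            sizeof(__u32), sizeof(__u64), 1024, &opts);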
+
+Usage
+=====
+
+.. c:function::
+   long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)
+
+Hash entries can be added or updated using the ``bpf_map_update_elem()``
+helper. This helper replaces existing elements atomically. The ``flags``
+parameter can be used to control the update behaviour:
+
+- ``BPF_ANY`` will create a new element or update an existing element
+- ``BPF_NOEXIST`` will create a new element only if one did not already
+  exist
+- ``BPF_EXIST`` will update an existing element
+
+``bpf_map_update_elem()`` returns 0 on success, or negative error in
+case of failure.
+
+.. c:function::
+   void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
+
+Hash entries can be retrieved using the ``bpf_map_lookup_elem()``
+helper. This helper returns a pointer to the value associated with
+``key``, or ``NULL`` if no entry was found.
+
+.. c:function::
+   long bpf_map_delete_elem(struct bpf_map *map, const void *key)
+
+Hash entries can be deleted using the ``bpf_map_delete_elem()``
+helper. This helper will return 0 on success, or negative error in case
+of failure.
+
+Per CPU Hashes
+--------------
+
+For ``BPF_MAP_TYPE_PERCPU_HASH`` and ``BPF_MAP_TYPE_LRU_PERCPU_HASH``
+the ``bpf_map_update_elem()`` and ``bpf_map_lookup_elem()`` helpers
+automatically access the hash slot for the current CPU.
+
+.. c:function::
+   void *bpf_map_lookup_percpu_elem(struct bpf_map *map, const void *key, u32 cpu)
+
+The ``bpf_map_lookup_percpu_elem()`` helper can be used to look up the
+value in the hash slot for a specific CPU. It returns the value associated
+with ``key`` on ``cpu``, or ``NULL`` if no entry was found or ``cpu`` is
+invalid.
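+
+For example, a BPF program can aggregate a per-CPU value across CPUs. This is
+a minimal sketch; ``percpu_stats`` is an assumed ``BPF_MAP_TYPE_PERCPU_HASH``
+map using the key/value structs from the Examples section, and ``MAX_CPUS`` is
+an assumed compile-time bound:
+
+.. code-block:: c
+
+    struct key key = {};
+    __u64 total = 0;
+    __u32 cpu;
+
+    for (cpu = 0; cpu < MAX_CPUS; cpu++) {
+            struct value *v;
+
+            v = bpf_map_lookup_percpu_elem(&percpu_stats, &key, cpu);
+            if (v)
+                    total += v->packets;
+    }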
+
+Concurrency
+-----------
+
+Values stored in ``BPF_MAP_TYPE_HASH`` can be accessed concurrently by
+programs running on different CPUs. Since kernel version 5.1, the BPF
+infrastructure provides ``struct bpf_spin_lock`` to synchronise access.
+See ``tools/testing/selftests/bpf/progs/test_spin_lock.c``.
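+
+A minimal sketch of spin lock usage inside a BPF program (the value layout is
+an assumption made for this example; see the selftest above for a complete
+program):
+
+.. code-block:: c
+
+    struct locked_value {
+            struct bpf_spin_lock lock;
+            __u64 counter;
+    };
+
+    /* val points to a map value of type struct locked_value */
+    bpf_spin_lock(&val->lock);
+    val->counter++;
+    bpf_spin_unlock(&val->lock);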
+
+Userspace
+---------
+
+.. c:function::
+   int bpf_map_get_next_key(int fd, const void *cur_key, void *next_key)
+
+In userspace, it is possible to iterate through the keys of a hash using
+libbpf's ``bpf_map_get_next_key()`` function. The first key can be fetched by
+calling ``bpf_map_get_next_key()`` with ``cur_key`` set to
+``NULL``. Subsequent calls will fetch the next key that follows the
+current key. ``bpf_map_get_next_key()`` returns 0 on success, -ENOENT if
+cur_key is the last key in the hash, or negative error in case of
+failure.
+
+Note that if ``cur_key`` gets deleted then ``bpf_map_get_next_key()``
+will instead return the *first* key in the hash table, which is
+undesirable. It is recommended to use batched lookup if key deletions
+are going to be intermixed with ``bpf_map_get_next_key()`` calls.
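+
+A batched lookup sketch using libbpf's ``bpf_map_lookup_batch()`` (buffer
+sizes and error handling are simplified here for illustration):
+
+.. code-block:: c
+
+    struct key keys[64];
+    struct value values[64];
+    __u32 out_batch, count = 64;
+    int err;
+
+    err = bpf_map_lookup_batch(map_fd, NULL, &out_batch, keys, values,
+                               &count, NULL);
+    /* on return, count holds the number of entries copied out; an
+     * ENOENT error indicates the whole map has been traversed
+     */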
+
+Examples
+========
+
+Please see the ``tools/testing/selftests/bpf`` directory for functional
+examples. The code snippets below demonstrate API usage.
+
+This example shows how to declare an LRU hash with a struct key and a
+struct value.
+
+.. code-block:: c
+
+    #include <linux/bpf.h>
+    #include <bpf/bpf_helpers.h>
+
+    struct key {
+        __u32 srcip;
+    };
+
+    struct value {
+        __u64 packets;
+        __u64 bytes;
+    };
+
+    struct {
+            __uint(type, BPF_MAP_TYPE_LRU_HASH);
+            __uint(max_entries, 32);
+            __type(key, struct key);
+            __type(value, struct value);
+    } packet_stats SEC(".maps");
+
+This example shows how to create or update hash values using atomic
+instructions:
+
+.. code-block:: c
+
+    static void update_stats(__u32 srcip, int bytes)
+    {
+            struct key key = {
+                    .srcip = srcip,
+            };
+            struct value *value = bpf_map_lookup_elem(&packet_stats, &key);
+
+            if (value) {
+                    __sync_fetch_and_add(&value->packets, 1);
+                    __sync_fetch_and_add(&value->bytes, bytes);
+            } else {
+                    struct value newval = { 1, bytes };
+
+                    bpf_map_update_elem(&packet_stats, &key, &newval, BPF_NOEXIST);
+            }
+    }
+
+This example shows how userspace can walk the elements of the map declared
+above:
+
+.. code-block:: c
+
+    #include <bpf/libbpf.h>
+    #include <bpf/bpf.h>
+
+    static void walk_hash_elements(int map_fd)
+    {
+            struct key *cur_key = NULL;
+            struct key next_key;
+            struct value value;
+            int err;
+
+            for (;;) {
+                    err = bpf_map_get_next_key(map_fd, cur_key, &next_key);
+                    if (err)
+                            break;
+
+                    bpf_map_lookup_elem(map_fd, &next_key, &value);
+
+                    // Use key and value here
+
+                    cur_key = &next_key;
+            }
+    }
index 6aa2dc8..834bff7 100644 (file)
@@ -510,6 +510,9 @@ u32 aarch64_insn_gen_load_store_imm(enum aarch64_insn_register reg,
                                    unsigned int imm,
                                    enum aarch64_insn_size_type size,
                                    enum aarch64_insn_ldst_type type);
+u32 aarch64_insn_gen_load_literal(unsigned long pc, unsigned long addr,
+                                 enum aarch64_insn_register reg,
+                                 bool is64bit);
 u32 aarch64_insn_gen_load_store_pair(enum aarch64_insn_register reg1,
                                     enum aarch64_insn_register reg2,
                                     enum aarch64_insn_register base,
index 695d736..49e972b 100644 (file)
@@ -323,7 +323,7 @@ static u32 aarch64_insn_encode_ldst_size(enum aarch64_insn_size_type type,
        return insn;
 }
 
-static inline long branch_imm_common(unsigned long pc, unsigned long addr,
+static inline long label_imm_common(unsigned long pc, unsigned long addr,
                                     long range)
 {
        long offset;
@@ -354,7 +354,7 @@ u32 __kprobes aarch64_insn_gen_branch_imm(unsigned long pc, unsigned long addr,
         * ARM64 virtual address arrangement guarantees all kernel and module
         * texts are within +/-128M.
         */
-       offset = branch_imm_common(pc, addr, SZ_128M);
+       offset = label_imm_common(pc, addr, SZ_128M);
        if (offset >= SZ_128M)
                return AARCH64_BREAK_FAULT;
 
@@ -382,7 +382,7 @@ u32 aarch64_insn_gen_comp_branch_imm(unsigned long pc, unsigned long addr,
        u32 insn;
        long offset;
 
-       offset = branch_imm_common(pc, addr, SZ_1M);
+       offset = label_imm_common(pc, addr, SZ_1M);
        if (offset >= SZ_1M)
                return AARCH64_BREAK_FAULT;
 
@@ -421,7 +421,7 @@ u32 aarch64_insn_gen_cond_branch_imm(unsigned long pc, unsigned long addr,
        u32 insn;
        long offset;
 
-       offset = branch_imm_common(pc, addr, SZ_1M);
+       offset = label_imm_common(pc, addr, SZ_1M);
 
        insn = aarch64_insn_get_bcond_value();
 
@@ -543,6 +543,28 @@ u32 aarch64_insn_gen_load_store_imm(enum aarch64_insn_register reg,
        return aarch64_insn_encode_immediate(AARCH64_INSN_IMM_12, insn, imm);
 }
 
+u32 aarch64_insn_gen_load_literal(unsigned long pc, unsigned long addr,
+                                 enum aarch64_insn_register reg,
+                                 bool is64bit)
+{
+       u32 insn;
+       long offset;
+
+       offset = label_imm_common(pc, addr, SZ_1M);
+       if (offset >= SZ_1M)
+               return AARCH64_BREAK_FAULT;
+
+       insn = aarch64_insn_get_ldr_lit_value();
+
+       if (is64bit)
+               insn |= BIT(30);
+
+       insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RT, insn, reg);
+
+       return aarch64_insn_encode_immediate(AARCH64_INSN_IMM_19, insn,
+                                            offset >> 2);
+}
+
 u32 aarch64_insn_gen_load_store_pair(enum aarch64_insn_register reg1,
                                     enum aarch64_insn_register reg2,
                                     enum aarch64_insn_register base,
index 194c95c..a6acb94 100644 (file)
 #define A64_STR64I(Xt, Xn, imm) A64_LS_IMM(Xt, Xn, imm, 64, STORE)
 #define A64_LDR64I(Xt, Xn, imm) A64_LS_IMM(Xt, Xn, imm, 64, LOAD)
 
+/* LDR (literal) */
+#define A64_LDR32LIT(Wt, offset) \
+       aarch64_insn_gen_load_literal(0, offset, Wt, false)
+#define A64_LDR64LIT(Xt, offset) \
+       aarch64_insn_gen_load_literal(0, offset, Xt, true)
+
 /* Load/store register pair */
 #define A64_LS_PAIR(Rt, Rt2, Rn, offset, ls, type) \
        aarch64_insn_gen_load_store_pair(Rt, Rt2, Rn, offset, \
 #define A64_BTI_C  A64_HINT(AARCH64_INSN_HINT_BTIC)
 #define A64_BTI_J  A64_HINT(AARCH64_INSN_HINT_BTIJ)
 #define A64_BTI_JC A64_HINT(AARCH64_INSN_HINT_BTIJC)
+#define A64_NOP    A64_HINT(AARCH64_INSN_HINT_NOP)
 
 /* DMB */
 #define A64_DMB_ISH aarch64_insn_gen_dmb(AARCH64_INSN_MB_ISH)
index f08a444..7ca8779 100644 (file)
@@ -10,6 +10,7 @@
 #include <linux/bitfield.h>
 #include <linux/bpf.h>
 #include <linux/filter.h>
+#include <linux/memory.h>
 #include <linux/printk.h>
 #include <linux/slab.h>
 
@@ -18,6 +19,7 @@
 #include <asm/cacheflush.h>
 #include <asm/debug-monitors.h>
 #include <asm/insn.h>
+#include <asm/patching.h>
 #include <asm/set_memory.h>
 
 #include "bpf_jit.h"
@@ -78,6 +80,15 @@ struct jit_ctx {
        int fpb_offset;
 };
 
+struct bpf_plt {
+       u32 insn_ldr; /* load target */
+       u32 insn_br;  /* branch to target */
+       u64 target;   /* target value */
+};
+
+#define PLT_TARGET_SIZE   sizeof_field(struct bpf_plt, target)
+#define PLT_TARGET_OFFSET offsetof(struct bpf_plt, target)
+
 static inline void emit(const u32 insn, struct jit_ctx *ctx)
 {
        if (ctx->image != NULL)
@@ -140,6 +151,12 @@ static inline void emit_a64_mov_i64(const int reg, const u64 val,
        }
 }
 
+static inline void emit_bti(u32 insn, struct jit_ctx *ctx)
+{
+       if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
+               emit(insn, ctx);
+}
+
 /*
  * Kernel addresses in the vmalloc space use at most 48 bits, and the
  * remaining bits are guaranteed to be 0x1. So we can compose the address
@@ -159,6 +176,14 @@ static inline void emit_addr_mov_i64(const int reg, const u64 val,
        }
 }
 
+static inline void emit_call(u64 target, struct jit_ctx *ctx)
+{
+       u8 tmp = bpf2a64[TMP_REG_1];
+
+       emit_addr_mov_i64(tmp, target, ctx);
+       emit(A64_BLR(tmp), ctx);
+}
+
 static inline int bpf2a64_offset(int bpf_insn, int off,
                                 const struct jit_ctx *ctx)
 {
@@ -235,13 +260,30 @@ static bool is_lsi_offset(int offset, int scale)
        return true;
 }
 
+/* generated prologue:
+ *      bti c // if CONFIG_ARM64_BTI_KERNEL
+ *      mov x9, lr
+ *      nop  // POKE_OFFSET
+ *      paciasp // if CONFIG_ARM64_PTR_AUTH_KERNEL
+ *      stp x29, lr, [sp, #-16]!
+ *      mov x29, sp
+ *      stp x19, x20, [sp, #-16]!
+ *      stp x21, x22, [sp, #-16]!
+ *      stp x25, x26, [sp, #-16]!
+ *      stp x27, x28, [sp, #-16]!
+ *      mov x25, sp
+ *      mov tcc, #0
+ *      // PROLOGUE_OFFSET
+ */
+
+#define BTI_INSNS (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) ? 1 : 0)
+#define PAC_INSNS (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL) ? 1 : 0)
+
+/* Offset of nop instruction in bpf prog entry to be poked */
+#define POKE_OFFSET (BTI_INSNS + 1)
+
 /* Tail call offset to jump into */
-#if IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) || \
-       IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL)
-#define PROLOGUE_OFFSET 9
-#else
-#define PROLOGUE_OFFSET 8
-#endif
+#define PROLOGUE_OFFSET (BTI_INSNS + 2 + PAC_INSNS + 8)
 
 static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
 {
@@ -280,12 +322,14 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
         *
         */
 
+       emit_bti(A64_BTI_C, ctx);
+
+       emit(A64_MOV(1, A64_R(9), A64_LR), ctx);
+       emit(A64_NOP, ctx);
+
        /* Sign lr */
        if (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL))
                emit(A64_PACIASP, ctx);
-       /* BTI landing pad */
-       else if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
-               emit(A64_BTI_C, ctx);
 
        /* Save FP and LR registers to stay align with ARM64 AAPCS */
        emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx);
@@ -312,8 +356,7 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
                }
 
                /* BTI landing pad for the tail call, done with a BR */
-               if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
-                       emit(A64_BTI_J, ctx);
+               emit_bti(A64_BTI_J, ctx);
        }
 
        emit(A64_SUB_I(1, fpb, fp, ctx->fpb_offset), ctx);
@@ -557,6 +600,53 @@ static int emit_ll_sc_atomic(const struct bpf_insn *insn, struct jit_ctx *ctx)
        return 0;
 }
 
+void dummy_tramp(void);
+
+asm (
+"      .pushsection .text, \"ax\", @progbits\n"
+"      .global dummy_tramp\n"
+"      .type dummy_tramp, %function\n"
+"dummy_tramp:"
+#if IS_ENABLED(CONFIG_ARM64_BTI_KERNEL)
+"      bti j\n" /* dummy_tramp is called via "br x10" */
+#endif
+"      mov x10, x30\n"
+"      mov x30, x9\n"
+"      ret x10\n"
+"      .size dummy_tramp, .-dummy_tramp\n"
+"      .popsection\n"
+);
+
+/* build a plt initialized like this:
+ *
+ * plt:
+ *      ldr tmp, target
+ *      br tmp
+ * target:
+ *      .quad dummy_tramp
+ *
+ * when a long jump trampoline is attached, target is filled with the
+ * trampoline address, and when the trampoline is removed, target is
+ * restored to the dummy_tramp address.
+ */
+static void build_plt(struct jit_ctx *ctx)
+{
+       const u8 tmp = bpf2a64[TMP_REG_1];
+       struct bpf_plt *plt = NULL;
+
+       /* make sure target is 64-bit aligned */
+       if ((ctx->idx + PLT_TARGET_OFFSET / AARCH64_INSN_SIZE) % 2)
+               emit(A64_NOP, ctx);
+
+       plt = (struct bpf_plt *)(ctx->image + ctx->idx);
+       /* plt is called via bl, no BTI needed here */
+       emit(A64_LDR64LIT(tmp, 2 * AARCH64_INSN_SIZE), ctx);
+       emit(A64_BR(tmp), ctx);
+
+       if (ctx->image)
+               plt->target = (u64)&dummy_tramp;
+}
+
 static void build_epilogue(struct jit_ctx *ctx)
 {
        const u8 r0 = bpf2a64[BPF_REG_0];
@@ -991,8 +1081,7 @@ emit_cond_jmp:
                                            &func_addr, &func_addr_fixed);
                if (ret < 0)
                        return ret;
-               emit_addr_mov_i64(tmp, func_addr, ctx);
-               emit(A64_BLR(tmp), ctx);
+               emit_call(func_addr, ctx);
                emit(A64_MOV(1, r0, A64_R(0)), ctx);
                break;
        }
@@ -1336,6 +1425,13 @@ static int validate_code(struct jit_ctx *ctx)
                if (a64_insn == AARCH64_BREAK_FAULT)
                        return -1;
        }
+       return 0;
+}
+
+static int validate_ctx(struct jit_ctx *ctx)
+{
+       if (validate_code(ctx))
+               return -1;
 
        if (WARN_ON_ONCE(ctx->exentry_idx != ctx->prog->aux->num_exentries))
                return -1;
@@ -1356,7 +1452,7 @@ struct arm64_jit_data {
 
 struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 {
-       int image_size, prog_size, extable_size;
+       int image_size, prog_size, extable_size, extable_align, extable_offset;
        struct bpf_prog *tmp, *orig_prog = prog;
        struct bpf_binary_header *header;
        struct arm64_jit_data *jit_data;
@@ -1426,13 +1522,17 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 
        ctx.epilogue_offset = ctx.idx;
        build_epilogue(&ctx);
+       build_plt(&ctx);
 
+       extable_align = __alignof__(struct exception_table_entry);
        extable_size = prog->aux->num_exentries *
                sizeof(struct exception_table_entry);
 
        /* Now we know the actual image size. */
        prog_size = sizeof(u32) * ctx.idx;
-       image_size = prog_size + extable_size;
+       /* also allocate space for plt target */
+       extable_offset = round_up(prog_size + PLT_TARGET_SIZE, extable_align);
+       image_size = extable_offset + extable_size;
        header = bpf_jit_binary_alloc(image_size, &image_ptr,
                                      sizeof(u32), jit_fill_hole);
        if (header == NULL) {
@@ -1444,7 +1544,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 
        ctx.image = (__le32 *)image_ptr;
        if (extable_size)
-               prog->aux->extable = (void *)image_ptr + prog_size;
+               prog->aux->extable = (void *)image_ptr + extable_offset;
 skip_init_ctx:
        ctx.idx = 0;
        ctx.exentry_idx = 0;
@@ -1458,9 +1558,10 @@ skip_init_ctx:
        }
 
        build_epilogue(&ctx);
+       build_plt(&ctx);
 
        /* 3. Extra pass to validate JITed code. */
-       if (validate_code(&ctx)) {
+       if (validate_ctx(&ctx)) {
                bpf_jit_binary_free(header);
                prog = orig_prog;
                goto out_off;
@@ -1537,3 +1638,583 @@ bool bpf_jit_supports_subprog_tailcalls(void)
 {
        return true;
 }
+
+static void invoke_bpf_prog(struct jit_ctx *ctx, struct bpf_tramp_link *l,
+                           int args_off, int retval_off, int run_ctx_off,
+                           bool save_ret)
+{
+       u32 *branch;
+       u64 enter_prog;
+       u64 exit_prog;
+       struct bpf_prog *p = l->link.prog;
+       int cookie_off = offsetof(struct bpf_tramp_run_ctx, bpf_cookie);
+
+       if (p->aux->sleepable) {
+               enter_prog = (u64)__bpf_prog_enter_sleepable;
+               exit_prog = (u64)__bpf_prog_exit_sleepable;
+       } else {
+               enter_prog = (u64)__bpf_prog_enter;
+               exit_prog = (u64)__bpf_prog_exit;
+       }
+
+       if (l->cookie == 0) {
+               /* if cookie is zero, one instruction is enough to store it */
+               emit(A64_STR64I(A64_ZR, A64_SP, run_ctx_off + cookie_off), ctx);
+       } else {
+               emit_a64_mov_i64(A64_R(10), l->cookie, ctx);
+               emit(A64_STR64I(A64_R(10), A64_SP, run_ctx_off + cookie_off),
+                    ctx);
+       }
+
+       /* save p to callee saved register x19 to avoid loading p with mov_i64
+        * each time.
+        */
+       emit_addr_mov_i64(A64_R(19), (const u64)p, ctx);
+
+       /* arg1: prog */
+       emit(A64_MOV(1, A64_R(0), A64_R(19)), ctx);
+       /* arg2: &run_ctx */
+       emit(A64_ADD_I(1, A64_R(1), A64_SP, run_ctx_off), ctx);
+
+       emit_call(enter_prog, ctx);
+
+       /* if (__bpf_prog_enter(prog) == 0)
+        *         goto skip_exec_of_prog;
+        */
+       branch = ctx->image + ctx->idx;
+       emit(A64_NOP, ctx);
+
+       /* save return value to callee saved register x20 */
+       emit(A64_MOV(1, A64_R(20), A64_R(0)), ctx);
+
+       emit(A64_ADD_I(1, A64_R(0), A64_SP, args_off), ctx);
+       if (!p->jited)
+               emit_addr_mov_i64(A64_R(1), (const u64)p->insnsi, ctx);
+
+       emit_call((const u64)p->bpf_func, ctx);
+
+       if (save_ret)
+               emit(A64_STR64I(A64_R(0), A64_SP, retval_off), ctx);
+
+       if (ctx->image) {
+               int offset = &ctx->image[ctx->idx] - branch;
+               *branch = A64_CBZ(1, A64_R(0), offset);
+       }
+
+       /* arg1: prog */
+       emit(A64_MOV(1, A64_R(0), A64_R(19)), ctx);
+       /* arg2: start time */
+       emit(A64_MOV(1, A64_R(1), A64_R(20)), ctx);
+       /* arg3: &run_ctx */
+       emit(A64_ADD_I(1, A64_R(2), A64_SP, run_ctx_off), ctx);
+
+       emit_call(exit_prog, ctx);
+}
+
+static void invoke_bpf_mod_ret(struct jit_ctx *ctx, struct bpf_tramp_links *tl,
+                              int args_off, int retval_off, int run_ctx_off,
+                              u32 **branches)
+{
+       int i;
+
+       /* The first fmod_ret program will receive a garbage return value.
+        * Set this to 0 to avoid confusing the program.
+        */
+       emit(A64_STR64I(A64_ZR, A64_SP, retval_off), ctx);
+       for (i = 0; i < tl->nr_links; i++) {
+               invoke_bpf_prog(ctx, tl->links[i], args_off, retval_off,
+                               run_ctx_off, true);
+               /* if (*(u64 *)(sp + retval_off) !=  0)
+                *      goto do_fexit;
+                */
+               emit(A64_LDR64I(A64_R(10), A64_SP, retval_off), ctx);
+               /* Save the location of branch, and generate a nop.
+                * This nop will be replaced with a cbnz later.
+                */
+               branches[i] = ctx->image + ctx->idx;
+               emit(A64_NOP, ctx);
+       }
+}
+
+static void save_args(struct jit_ctx *ctx, int args_off, int nargs)
+{
+       int i;
+
+       for (i = 0; i < nargs; i++) {
+               emit(A64_STR64I(i, A64_SP, args_off), ctx);
+               args_off += 8;
+       }
+}
+
+static void restore_args(struct jit_ctx *ctx, int args_off, int nargs)
+{
+       int i;
+
+       for (i = 0; i < nargs; i++) {
+               emit(A64_LDR64I(i, A64_SP, args_off), ctx);
+               args_off += 8;
+       }
+}
+
+/* Based on the x86's implementation of arch_prepare_bpf_trampoline().
+ *
+ * bpf prog and function entry before bpf trampoline hooked:
+ *   mov x9, lr
+ *   nop
+ *
+ * bpf prog and function entry after bpf trampoline hooked:
+ *   mov x9, lr
+ *   bl  <bpf_trampoline or plt>
+ *
+ */
+static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
+                             struct bpf_tramp_links *tlinks, void *orig_call,
+                             int nargs, u32 flags)
+{
+       int i;
+       int stack_size;
+       int retaddr_off;
+       int regs_off;
+       int retval_off;
+       int args_off;
+       int nargs_off;
+       int ip_off;
+       int run_ctx_off;
+       struct bpf_tramp_links *fentry = &tlinks[BPF_TRAMP_FENTRY];
+       struct bpf_tramp_links *fexit = &tlinks[BPF_TRAMP_FEXIT];
+       struct bpf_tramp_links *fmod_ret = &tlinks[BPF_TRAMP_MODIFY_RETURN];
+       bool save_ret;
+       u32 **branches = NULL;
+
+       /* trampoline stack layout:
+        *                  [ parent ip         ]
+        *                  [ FP                ]
+        * SP + retaddr_off [ self ip           ]
+        *                  [ FP                ]
+        *
+        *                  [ padding           ] align SP to multiples of 16
+        *
+        *                  [ x20               ] callee saved reg x20
+        * SP + regs_off    [ x19               ] callee saved reg x19
+        *
+        * SP + retval_off  [ return value      ] BPF_TRAMP_F_CALL_ORIG or
+        *                                        BPF_TRAMP_F_RET_FENTRY_RET
+        *
+        *                  [ argN              ]
+        *                  [ ...               ]
+        * SP + args_off    [ arg1              ]
+        *
+        * SP + nargs_off   [ args count        ]
+        *
+        * SP + ip_off      [ traced function   ] BPF_TRAMP_F_IP_ARG flag
+        *
+        * SP + run_ctx_off [ bpf_tramp_run_ctx ]
+        */
+
+       stack_size = 0;
+       run_ctx_off = stack_size;
+       /* room for bpf_tramp_run_ctx */
+       stack_size += round_up(sizeof(struct bpf_tramp_run_ctx), 8);
+
+       ip_off = stack_size;
+       /* room for IP address argument */
+       if (flags & BPF_TRAMP_F_IP_ARG)
+               stack_size += 8;
+
+       nargs_off = stack_size;
+       /* room for args count */
+       stack_size += 8;
+
+       args_off = stack_size;
+       /* room for args */
+       stack_size += nargs * 8;
+
+       /* room for return value */
+       retval_off = stack_size;
+       save_ret = flags & (BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_RET_FENTRY_RET);
+       if (save_ret)
+               stack_size += 8;
+
+       /* room for callee saved registers, currently x19 and x20 are used */
+       regs_off = stack_size;
+       stack_size += 16;
+
+       /* round up to multiples of 16 to avoid SPAlignmentFault */
+       stack_size = round_up(stack_size, 16);
+
+       /* the return address is located above FP */
+       retaddr_off = stack_size + 8;
+
+       /* bpf trampoline may be invoked by 3 instruction types:
+        * 1. bl, attached to bpf prog or kernel function via short jump
+        * 2. br, attached to bpf prog or kernel function via long jump
+        * 3. blr, working as a function pointer, used by struct_ops.
+        * So BTI_JC should be used here to support both br and blr.
+        */
+       emit_bti(A64_BTI_JC, ctx);
+
+       /* frame for parent function */
+       emit(A64_PUSH(A64_FP, A64_R(9), A64_SP), ctx);
+       emit(A64_MOV(1, A64_FP, A64_SP), ctx);
+
+       /* frame for patched function */
+       emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx);
+       emit(A64_MOV(1, A64_FP, A64_SP), ctx);
+
+       /* allocate stack space */
+       emit(A64_SUB_I(1, A64_SP, A64_SP, stack_size), ctx);
+
+       if (flags & BPF_TRAMP_F_IP_ARG) {
+               /* save ip address of the traced function */
+               emit_addr_mov_i64(A64_R(10), (const u64)orig_call, ctx);
+               emit(A64_STR64I(A64_R(10), A64_SP, ip_off), ctx);
+       }
+
+       /* save args count */
+       emit(A64_MOVZ(1, A64_R(10), nargs, 0), ctx);
+       emit(A64_STR64I(A64_R(10), A64_SP, nargs_off), ctx);
+
+       /* save args */
+       save_args(ctx, args_off, nargs);
+
+       /* save callee saved registers */
+       emit(A64_STR64I(A64_R(19), A64_SP, regs_off), ctx);
+       emit(A64_STR64I(A64_R(20), A64_SP, regs_off + 8), ctx);
+
+       if (flags & BPF_TRAMP_F_CALL_ORIG) {
+               emit_addr_mov_i64(A64_R(0), (const u64)im, ctx);
+               emit_call((const u64)__bpf_tramp_enter, ctx);
+       }
+
+       for (i = 0; i < fentry->nr_links; i++)
+               invoke_bpf_prog(ctx, fentry->links[i], args_off,
+                               retval_off, run_ctx_off,
+                               flags & BPF_TRAMP_F_RET_FENTRY_RET);
+
+       if (fmod_ret->nr_links) {
+               branches = kcalloc(fmod_ret->nr_links, sizeof(u32 *),
+                                  GFP_KERNEL);
+               if (!branches)
+                       return -ENOMEM;
+
+               invoke_bpf_mod_ret(ctx, fmod_ret, args_off, retval_off,
+                                  run_ctx_off, branches);
+       }
+
+       if (flags & BPF_TRAMP_F_CALL_ORIG) {
+               restore_args(ctx, args_off, nargs);
+               /* call original func */
+               emit(A64_LDR64I(A64_R(10), A64_SP, retaddr_off), ctx);
+               emit(A64_BLR(A64_R(10)), ctx);
+               /* store return value */
+               emit(A64_STR64I(A64_R(0), A64_SP, retval_off), ctx);
+               /* reserve a nop for bpf_tramp_image_put */
+               im->ip_after_call = ctx->image + ctx->idx;
+               emit(A64_NOP, ctx);
+       }
+
+       /* update the branches saved in invoke_bpf_mod_ret with cbnz */
+       for (i = 0; i < fmod_ret->nr_links && ctx->image != NULL; i++) {
+               int offset = &ctx->image[ctx->idx] - branches[i];
+               *branches[i] = A64_CBNZ(1, A64_R(10), offset);
+       }
+
+       for (i = 0; i < fexit->nr_links; i++)
+               invoke_bpf_prog(ctx, fexit->links[i], args_off, retval_off,
+                               run_ctx_off, false);
+
+       if (flags & BPF_TRAMP_F_CALL_ORIG) {
+               im->ip_epilogue = ctx->image + ctx->idx;
+               emit_addr_mov_i64(A64_R(0), (const u64)im, ctx);
+               emit_call((const u64)__bpf_tramp_exit, ctx);
+       }
+
+       if (flags & BPF_TRAMP_F_RESTORE_REGS)
+               restore_args(ctx, args_off, nargs);
+
+       /* restore callee saved register x19 and x20 */
+       emit(A64_LDR64I(A64_R(19), A64_SP, regs_off), ctx);
+       emit(A64_LDR64I(A64_R(20), A64_SP, regs_off + 8), ctx);
+
+       if (save_ret)
+               emit(A64_LDR64I(A64_R(0), A64_SP, retval_off), ctx);
+
+       /* reset SP  */
+       emit(A64_MOV(1, A64_SP, A64_FP), ctx);
+
+       /* pop frames  */
+       emit(A64_POP(A64_FP, A64_LR, A64_SP), ctx);
+       emit(A64_POP(A64_FP, A64_R(9), A64_SP), ctx);
+
+       if (flags & BPF_TRAMP_F_SKIP_FRAME) {
+               /* skip patched function, return to parent */
+               emit(A64_MOV(1, A64_LR, A64_R(9)), ctx);
+               emit(A64_RET(A64_R(9)), ctx);
+       } else {
+               /* return to patched function */
+               emit(A64_MOV(1, A64_R(10), A64_LR), ctx);
+               emit(A64_MOV(1, A64_LR, A64_R(9)), ctx);
+               emit(A64_RET(A64_R(10)), ctx);
+       }
+
+       if (ctx->image)
+               bpf_flush_icache(ctx->image, ctx->image + ctx->idx);
+
+       kfree(branches);
+
+       return ctx->idx;
+}
+
+int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image,
+                               void *image_end, const struct btf_func_model *m,
+                               u32 flags, struct bpf_tramp_links *tlinks,
+                               void *orig_call)
+{
+       int ret;
+       int nargs = m->nr_args;
+       int max_insns = ((long)image_end - (long)image) / AARCH64_INSN_SIZE;
+       struct jit_ctx ctx = {
+               .image = NULL,
+               .idx = 0,
+       };
+
+       /* the first 8 arguments are passed by registers */
+       if (nargs > 8)
+               return -ENOTSUPP;
+
+       ret = prepare_trampoline(&ctx, im, tlinks, orig_call, nargs, flags);
+       if (ret < 0)
+               return ret;
+
+       if (ret > max_insns)
+               return -EFBIG;
+
+       ctx.image = image;
+       ctx.idx = 0;
+
+       jit_fill_hole(image, (unsigned int)(image_end - image));
+       ret = prepare_trampoline(&ctx, im, tlinks, orig_call, nargs, flags);
+
+       if (ret > 0 && validate_code(&ctx) < 0)
+               ret = -EINVAL;
+
+       if (ret > 0)
+               ret *= AARCH64_INSN_SIZE;
+
+       return ret;
+}
+
+static bool is_long_jump(void *ip, void *target)
+{
+       long offset;
+
+       /* NULL target means this is a NOP */
+       if (!target)
+               return false;
+
+       offset = (long)target - (long)ip;
+       return offset < -SZ_128M || offset >= SZ_128M;
+}
+
+static int gen_branch_or_nop(enum aarch64_insn_branch_type type, void *ip,
+                            void *addr, void *plt, u32 *insn)
+{
+       void *target;
+
+       if (!addr) {
+               *insn = aarch64_insn_gen_nop();
+               return 0;
+       }
+
+       if (is_long_jump(ip, addr))
+               target = plt;
+       else
+               target = addr;
+
+       *insn = aarch64_insn_gen_branch_imm((unsigned long)ip,
+                                           (unsigned long)target,
+                                           type);
+
+       return *insn != AARCH64_BREAK_FAULT ? 0 : -EFAULT;
+}
+
+/* Replace the branch instruction from @ip to @old_addr in a bpf prog or a bpf
+ * trampoline with the branch instruction from @ip to @new_addr. If @old_addr
+ * or @new_addr is NULL, the old or new instruction is NOP.
+ *
+ * When @ip is the bpf prog entry, a bpf trampoline is being attached or
+ * detached. Since bpf trampoline and bpf prog are allocated separately with
+ * vmalloc, the address distance may exceed 128MB, the maximum branch range.
+ * So long jump should be handled.
+ *
+ * When a bpf prog is constructed, a plt pointing to empty trampoline
+ * dummy_tramp is placed at the end:
+ *
+ *      bpf_prog:
+ *              mov x9, lr
+ *              nop // patchsite
+ *              ...
+ *              ret
+ *
+ *      plt:
+ *              ldr x10, target
+ *              br x10
+ *      target:
+ *              .quad dummy_tramp // plt target
+ *
+ * This is also the state when no trampoline is attached.
+ *
+ * When a short-jump bpf trampoline is attached, the patchsite is patched
+ * to a bl instruction to the trampoline directly:
+ *
+ *      bpf_prog:
+ *              mov x9, lr
+ *              bl <short-jump bpf trampoline address> // patchsite
+ *              ...
+ *              ret
+ *
+ *      plt:
+ *              ldr x10, target
+ *              br x10
+ *      target:
+ *              .quad dummy_tramp // plt target
+ *
+ * When a long-jump bpf trampoline is attached, the plt target is filled with
+ * the trampoline address and the patchsite is patched to a bl instruction to
+ * the plt:
+ *
+ *      bpf_prog:
+ *              mov x9, lr
+ *              bl plt // patchsite
+ *              ...
+ *              ret
+ *
+ *      plt:
+ *              ldr x10, target
+ *              br x10
+ *      target:
+ *              .quad <long-jump bpf trampoline address> // plt target
+ *
+ * The dummy_tramp is used to prevent another CPU from jumping to an unknown
+ * location during the patching process, which makes the patching easier.
+ */
+int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type poke_type,
+                      void *old_addr, void *new_addr)
+{
+       int ret;
+       u32 old_insn;
+       u32 new_insn;
+       u32 replaced;
+       struct bpf_plt *plt = NULL;
+       unsigned long size = 0UL;
+       unsigned long offset = ~0UL;
+       enum aarch64_insn_branch_type branch_type;
+       char namebuf[KSYM_NAME_LEN];
+       void *image = NULL;
+       u64 plt_target = 0ULL;
+       bool poking_bpf_entry;
+
+       if (!__bpf_address_lookup((unsigned long)ip, &size, &offset, namebuf))
+               /* Only poking bpf text is supported. Since kernel function
+                * entry is set up by ftrace, we rely on ftrace to poke kernel
+                * functions.
+                */
+               return -ENOTSUPP;
+
+       image = ip - offset;
+       /* zero offset means we're poking bpf prog entry */
+       poking_bpf_entry = (offset == 0UL);
+
+       /* bpf prog entry, find plt and the real patchsite */
+       if (poking_bpf_entry) {
+               /* the plt is located at the end of the bpf prog */
+               plt = image + size - PLT_TARGET_OFFSET;
+
+               /* skip to the nop instruction in bpf prog entry:
+                * bti c // if BTI enabled
+                * mov x9, x30
+                * nop
+                */
+               ip = image + POKE_OFFSET * AARCH64_INSN_SIZE;
+       }
+
+       /* long jump is only possible at bpf prog entry */
+       if (WARN_ON((is_long_jump(ip, new_addr) || is_long_jump(ip, old_addr)) &&
+                   !poking_bpf_entry))
+               return -EINVAL;
+
+       if (poke_type == BPF_MOD_CALL)
+               branch_type = AARCH64_INSN_BRANCH_LINK;
+       else
+               branch_type = AARCH64_INSN_BRANCH_NOLINK;
+
+       if (gen_branch_or_nop(branch_type, ip, old_addr, plt, &old_insn) < 0)
+               return -EFAULT;
+
+       if (gen_branch_or_nop(branch_type, ip, new_addr, plt, &new_insn) < 0)
+               return -EFAULT;
+
+       if (is_long_jump(ip, new_addr))
+               plt_target = (u64)new_addr;
+       else if (is_long_jump(ip, old_addr))
+               /* if the old target is a long jump and the new target is not,
+                * restore the plt target to dummy_tramp, so there is always a
+                * legal and harmless address stored in plt target, and we'll
+                * never jump from plt to an unknown place.
+                */
+               plt_target = (u64)&dummy_tramp;
+
+       if (plt_target) {
+               /* non-zero plt_target indicates we're patching a bpf prog,
+                * which is read only.
+                */
+               if (set_memory_rw(PAGE_MASK & ((uintptr_t)&plt->target), 1))
+                       return -EFAULT;
+               WRITE_ONCE(plt->target, plt_target);
+               set_memory_ro(PAGE_MASK & ((uintptr_t)&plt->target), 1);
+               /* since plt target points to either the new trampoline
+                * or dummy_tramp, even if another CPU reads the old plt
+                * target value before fetching the bl instruction to plt,
+                * it will be brought back by dummy_tramp, so no barrier is
+                * required here.
+                */
+       }
+
+       /* if the old target and the new target are both long jumps, no
+        * patching is required
+        */
+       if (old_insn == new_insn)
+               return 0;
+
+       mutex_lock(&text_mutex);
+       if (aarch64_insn_read(ip, &replaced)) {
+               ret = -EFAULT;
+               goto out;
+       }
+
+       if (replaced != old_insn) {
+               ret = -EFAULT;
+               goto out;
+       }
+
+       /* We call aarch64_insn_patch_text_nosync() to replace instruction
+        * atomically, so no other CPUs will fetch a half-new and half-old
+        * instruction. But there is chance that another CPU executes the
+        * old instruction after the patching operation finishes (e.g.,
+        * pipeline not flushed, or icache not synchronized yet).
+        *
+        * 1. when a new trampoline is attached, it is not a problem for
+        *    different CPUs to jump to different trampolines temporarily.
+        *
+        * 2. when an old trampoline is freed, we should wait for all other
+        *    CPUs to exit the trampoline and make sure the trampoline is no
+        *    longer reachable, since bpf_tramp_image_put() function already
+        *    uses percpu_ref and task-based rcu to do the sync, no need to call
+        *    the sync version here, see bpf_tramp_image_put() for details.
+        */
+       ret = aarch64_insn_patch_text_nosync(ip, new_insn);
+out:
+       mutex_unlock(&text_mutex);
+
+       return ret;
+}
index 7e95697..c1f6c1c 100644 (file)
@@ -1950,23 +1950,6 @@ static int invoke_bpf_mod_ret(const struct btf_func_model *m, u8 **pprog,
        return 0;
 }
 
-static bool is_valid_bpf_tramp_flags(unsigned int flags)
-{
-       if ((flags & BPF_TRAMP_F_RESTORE_REGS) &&
-           (flags & BPF_TRAMP_F_SKIP_FRAME))
-               return false;
-
-       /*
-        * BPF_TRAMP_F_RET_FENTRY_RET is only used by bpf_struct_ops,
-        * and it must be used alone.
-        */
-       if ((flags & BPF_TRAMP_F_RET_FENTRY_RET) &&
-           (flags & ~BPF_TRAMP_F_RET_FENTRY_RET))
-               return false;
-
-       return true;
-}
-
 /* Example:
  * __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev);
  * its 'struct btf_func_model' will be nr_args=2
@@ -2045,9 +2028,6 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *i
        if (nr_args > 6)
                return -ENOTSUPP;
 
-       if (!is_valid_bpf_tramp_flags(flags))
-               return -EINVAL;
-
        /* Generated trampoline stack layout:
         *
         * RBP + 8         [ return address  ]
@@ -2153,10 +2133,15 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *i
        if (flags & BPF_TRAMP_F_CALL_ORIG) {
                restore_regs(m, &prog, nr_args, regs_off);
 
-               /* call original function */
-               if (emit_call(&prog, orig_call, prog)) {
-                       ret = -EINVAL;
-                       goto cleanup;
+               if (flags & BPF_TRAMP_F_ORIG_STACK) {
+                       emit_ldx(&prog, BPF_DW, BPF_REG_0, BPF_REG_FP, 8);
+                       EMIT2(0xff, 0xd0); /* call *rax */
+               } else {
+                       /* call original function */
+                       if (emit_call(&prog, orig_call, prog)) {
+                               ret = -EINVAL;
+                               goto cleanup;
+                       }
                }
                /* remember return value in a stack for bpf prog to access */
                emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -8);
@@ -2520,3 +2505,28 @@ bool bpf_jit_supports_subprog_tailcalls(void)
 {
        return true;
 }
+
+void bpf_jit_free(struct bpf_prog *prog)
+{
+       if (prog->jited) {
+               struct x64_jit_data *jit_data = prog->aux->jit_data;
+               struct bpf_binary_header *hdr;
+
+               /*
+                * If we fail the final pass of JIT (from jit_subprogs),
+                * the program may not be finalized yet. Call finalize here
+                * before freeing it.
+                */
+               if (jit_data) {
+                       bpf_jit_binary_pack_finalize(prog, jit_data->header,
+                                                    jit_data->rw_header);
+                       kvfree(jit_data->addrs);
+                       kfree(jit_data);
+               }
+               hdr = bpf_jit_binary_pack_hdr(prog);
+               bpf_jit_binary_pack_free(hdr, NULL);
+               WARN_ON_ONCE(!bpf_prog_kallsyms_verify_off(prog));
+       }
+
+       bpf_prog_unlock_free(prog);
+}
index 2b21f2a..20c26ae 100644 (file)
@@ -47,6 +47,7 @@ struct kobject;
 struct mem_cgroup;
 struct module;
 struct bpf_func_state;
+struct ftrace_ops;
 
 extern struct idr btf_idr;
 extern spinlock_t btf_idr_lock;
@@ -221,7 +222,7 @@ struct bpf_map {
        u32 btf_vmlinux_value_type_id;
        struct btf *btf;
 #ifdef CONFIG_MEMCG_KMEM
-       struct mem_cgroup *memcg;
+       struct obj_cgroup *objcg;
 #endif
        char name[BPF_OBJ_NAME_LEN];
        struct bpf_map_off_arr *off_arr;
@@ -751,6 +752,16 @@ struct btf_func_model {
 /* Return the return value of fentry prog. Only used by bpf_struct_ops. */
 #define BPF_TRAMP_F_RET_FENTRY_RET     BIT(4)
 
+/* Get original function from stack instead of from provided direct address.
+ * Makes sense for trampolines with fexit or fmod_ret programs.
+ */
+#define BPF_TRAMP_F_ORIG_STACK         BIT(5)
+
+/* This trampoline is on a function with another ftrace_ops with IPMODIFY,
+ * e.g., a live patch. This flag is set and cleared by ftrace callbacks.
+ */
+#define BPF_TRAMP_F_SHARE_IPMODIFY     BIT(6)
+
 /* Each call __bpf_prog_enter + call bpf_func + call __bpf_prog_exit is ~50
  * bytes on x86.
  */
@@ -833,9 +844,11 @@ struct bpf_tramp_image {
 struct bpf_trampoline {
        /* hlist for trampoline_table */
        struct hlist_node hlist;
+       struct ftrace_ops *fops;
        /* serializes access to fields of this trampoline */
        struct mutex mutex;
        refcount_t refcnt;
+       u32 flags;
        u64 key;
        struct {
                struct btf_func_model model;
@@ -1044,7 +1057,6 @@ struct bpf_prog_aux {
        bool sleepable;
        bool tail_call_reachable;
        bool xdp_has_frags;
-       bool use_bpf_prog_pack;
        /* BTF_KIND_FUNC_PROTO for valid attach_btf_id */
        const struct btf_type *attach_func_proto;
        /* function name for valid attach_btf_id */
@@ -1255,9 +1267,6 @@ struct bpf_dummy_ops {
 int bpf_struct_ops_test_run(struct bpf_prog *prog, const union bpf_attr *kattr,
                            union bpf_attr __user *uattr);
 #endif
-int bpf_trampoline_link_cgroup_shim(struct bpf_prog *prog,
-                                   int cgroup_atype);
-void bpf_trampoline_unlink_cgroup_shim(struct bpf_prog *prog);
 #else
 static inline const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id)
 {
@@ -1281,6 +1290,13 @@ static inline int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map,
 {
        return -EINVAL;
 }
+#endif
+
+#if defined(CONFIG_CGROUP_BPF) && defined(CONFIG_BPF_LSM)
+int bpf_trampoline_link_cgroup_shim(struct bpf_prog *prog,
+                                   int cgroup_atype);
+void bpf_trampoline_unlink_cgroup_shim(struct bpf_prog *prog);
+#else
 static inline int bpf_trampoline_link_cgroup_shim(struct bpf_prog *prog,
                                                  int cgroup_atype)
 {
@@ -1921,7 +1937,8 @@ int btf_check_subprog_arg_match(struct bpf_verifier_env *env, int subprog,
                                struct bpf_reg_state *regs);
 int btf_check_kfunc_arg_match(struct bpf_verifier_env *env,
                              const struct btf *btf, u32 func_id,
-                             struct bpf_reg_state *regs);
+                             struct bpf_reg_state *regs,
+                             u32 kfunc_flags);
 int btf_prepare_func_args(struct bpf_verifier_env *env, int subprog,
                          struct bpf_reg_state *reg);
 int btf_check_type_match(struct bpf_verifier_log *log, const struct bpf_prog *prog,
index 81b1966..2e3bad8 100644 (file)
@@ -345,10 +345,10 @@ struct bpf_verifier_state_list {
 };
 
 struct bpf_loop_inline_state {
-       int initialized:1; /* set to true upon first entry */
-       int fit_for_inline:1; /* true if callback function is the same
-                              * at each call and flags are always zero
-                              */
+       unsigned int initialized:1; /* set to true upon first entry */
+       unsigned int fit_for_inline:1; /* true if callback function is the same
+                                       * at each call and flags are always zero
+                                       */
        u32 callback_subprogno; /* valid when fit_for_inline is true */
 };
 
index 1bfed7f..cdb376d 100644 (file)
 #define BTF_TYPE_EMIT(type) ((void)(type *)0)
 #define BTF_TYPE_EMIT_ENUM(enum_val) ((void)enum_val)
 
-enum btf_kfunc_type {
-       BTF_KFUNC_TYPE_CHECK,
-       BTF_KFUNC_TYPE_ACQUIRE,
-       BTF_KFUNC_TYPE_RELEASE,
-       BTF_KFUNC_TYPE_RET_NULL,
-       BTF_KFUNC_TYPE_KPTR_ACQUIRE,
-       BTF_KFUNC_TYPE_MAX,
-};
+/* These need to be macros, as the expressions are used in assembler input */
+#define KF_ACQUIRE     (1 << 0) /* kfunc is an acquire function */
+#define KF_RELEASE     (1 << 1) /* kfunc is a release function */
+#define KF_RET_NULL    (1 << 2) /* kfunc returns a pointer that may be NULL */
+#define KF_KPTR_GET    (1 << 3) /* kfunc returns reference to a kptr */
+/* Trusted arguments are those which are meant to be referenced arguments with
+ * unchanged offset. It is used to enforce that pointers obtained from acquire
+ * kfuncs remain unmodified when being passed to helpers taking trusted args.
+ *
+ * Consider
+ *     struct foo {
+ *             int data;
+ *             struct foo *next;
+ *     };
+ *
+ *     struct bar {
+ *             int data;
+ *             struct foo f;
+ *     };
+ *
+ *     struct foo *f = alloc_foo(); // Acquire kfunc
+ *     struct bar *b = alloc_bar(); // Acquire kfunc
+ *
+ * If a kfunc set_foo_data() wants to operate only on the allocated object, it
+ * will set the KF_TRUSTED_ARGS flag, which will prevent unsafe usage like:
+ *
+ *     set_foo_data(f, 42);       // Allowed
+ *     set_foo_data(f->next, 42); // Rejected, non-referenced pointer
+ *     set_foo_data(&f->next, 42);// Rejected, referenced, but wrong type
+ *     set_foo_data(&b->f, 42);   // Rejected, referenced, but bad offset
+ *
+ * In the final case, the member type at the given offset would normally be
+ * used to deduce the argument type for matching purposes. Because a trusted
+ * argument is required here, that deduction is skipped and the check stays
+ * strict.
+ */
+#define KF_TRUSTED_ARGS (1 << 4) /* kfunc only takes trusted pointer arguments */
 
 struct btf;
 struct btf_member;
@@ -30,16 +59,7 @@ struct btf_id_set;
 
 struct btf_kfunc_id_set {
        struct module *owner;
-       union {
-               struct {
-                       struct btf_id_set *check_set;
-                       struct btf_id_set *acquire_set;
-                       struct btf_id_set *release_set;
-                       struct btf_id_set *ret_null_set;
-                       struct btf_id_set *kptr_acquire_set;
-               };
-               struct btf_id_set *sets[BTF_KFUNC_TYPE_MAX];
-       };
+       struct btf_id_set8 *set;
 };
 
 struct btf_id_dtor_kfunc {
@@ -378,9 +398,9 @@ const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id);
 const char *btf_name_by_offset(const struct btf *btf, u32 offset);
 struct btf *btf_parse_vmlinux(void);
 struct btf *bpf_prog_get_target_btf(const struct bpf_prog *prog);
-bool btf_kfunc_id_set_contains(const struct btf *btf,
+u32 *btf_kfunc_id_set_contains(const struct btf *btf,
                               enum bpf_prog_type prog_type,
-                              enum btf_kfunc_type type, u32 kfunc_btf_id);
+                              u32 kfunc_btf_id);
 int register_btf_kfunc_id_set(enum bpf_prog_type prog_type,
                              const struct btf_kfunc_id_set *s);
 s32 btf_find_dtor_kfunc(struct btf *btf, u32 btf_id);
@@ -397,12 +417,11 @@ static inline const char *btf_name_by_offset(const struct btf *btf,
 {
        return NULL;
 }
-static inline bool btf_kfunc_id_set_contains(const struct btf *btf,
+static inline u32 *btf_kfunc_id_set_contains(const struct btf *btf,
                                             enum bpf_prog_type prog_type,
-                                            enum btf_kfunc_type type,
                                             u32 kfunc_btf_id)
 {
-       return false;
+       return NULL;
 }
 static inline int register_btf_kfunc_id_set(enum bpf_prog_type prog_type,
                                            const struct btf_kfunc_id_set *s)
index 252a4be..2aea877 100644 (file)
@@ -8,6 +8,15 @@ struct btf_id_set {
        u32 ids[];
 };
 
+struct btf_id_set8 {
+       u32 cnt;
+       u32 flags;
+       struct {
+               u32 id;
+               u32 flags;
+       } pairs[];
+};
+
 #ifdef CONFIG_DEBUG_INFO_BTF
 
 #include <linux/compiler.h> /* for __PASTE */
@@ -25,7 +34,7 @@ struct btf_id_set {
 
 #define BTF_IDS_SECTION ".BTF_ids"
 
-#define ____BTF_ID(symbol)                             \
+#define ____BTF_ID(symbol, word)                       \
 asm(                                                   \
 ".pushsection " BTF_IDS_SECTION ",\"a\";       \n"     \
 ".local " #symbol " ;                          \n"     \
@@ -33,10 +42,11 @@ asm(                                                        \
 ".size  " #symbol ", 4;                        \n"     \
 #symbol ":                                     \n"     \
 ".zero 4                                       \n"     \
+word                                                   \
 ".popsection;                                  \n");
 
-#define __BTF_ID(symbol) \
-       ____BTF_ID(symbol)
+#define __BTF_ID(symbol, word) \
+       ____BTF_ID(symbol, word)
 
 #define __ID(prefix) \
        __PASTE(prefix, __COUNTER__)
@@ -46,7 +56,14 @@ asm(                                                 \
  * to 4 zero bytes.
  */
 #define BTF_ID(prefix, name) \
-       __BTF_ID(__ID(__BTF_ID__##prefix##__##name##__))
+       __BTF_ID(__ID(__BTF_ID__##prefix##__##name##__), "")
+
+#define ____BTF_ID_FLAGS(prefix, name, flags) \
+       __BTF_ID(__ID(__BTF_ID__##prefix##__##name##__), ".long " #flags "\n")
+#define __BTF_ID_FLAGS(prefix, name, flags, ...) \
+       ____BTF_ID_FLAGS(prefix, name, flags)
+#define BTF_ID_FLAGS(prefix, name, ...) \
+       __BTF_ID_FLAGS(prefix, name, ##__VA_ARGS__, 0)
 
 /*
  * The BTF_ID_LIST macro defines pure (unsorted) list
@@ -145,10 +162,51 @@ asm(                                                      \
 ".popsection;                                 \n");    \
 extern struct btf_id_set name;
 
+/*
+ * The BTF_SET8_START/END macro pair defines a sorted list of
+ * BTF IDs and their flags, plus its member count, with the
+ * following layout:
+ *
+ * BTF_SET8_START(list)
+ * BTF_ID_FLAGS(type1, name1, flags)
+ * BTF_ID_FLAGS(type2, name2, flags)
+ * BTF_SET8_END(list)
+ *
+ * __BTF_ID__set8__list:
+ * .zero 8
+ * list:
+ * __BTF_ID__type1__name1__3:
+ * .zero 4
+ * .word (1 << 0) | (1 << 2)
+ * __BTF_ID__type2__name2__5:
+ * .zero 4
+ * .word (1 << 3) | (1 << 1) | (1 << 2)
+ *
+ */
+#define __BTF_SET8_START(name, scope)                  \
+asm(                                                   \
+".pushsection " BTF_IDS_SECTION ",\"a\";       \n"     \
+"." #scope " __BTF_ID__set8__" #name ";        \n"     \
+"__BTF_ID__set8__" #name ":;                   \n"     \
+".zero 8                                       \n"     \
+".popsection;                                  \n");
+
+#define BTF_SET8_START(name)                           \
+__BTF_ID_LIST(name, local)                             \
+__BTF_SET8_START(name, local)
+
+#define BTF_SET8_END(name)                             \
+asm(                                                   \
+".pushsection " BTF_IDS_SECTION ",\"a\";      \n"      \
+".size __BTF_ID__set8__" #name ", .-" #name "  \n"     \
+".popsection;                                 \n");    \
+extern struct btf_id_set8 name;
+
 #else
 
 #define BTF_ID_LIST(name) static u32 __maybe_unused name[5];
 #define BTF_ID(prefix, name)
+#define BTF_ID_FLAGS(prefix, name, ...)
 #define BTF_ID_UNUSED
 #define BTF_ID_LIST_GLOBAL(name, n) u32 __maybe_unused name[n];
 #define BTF_ID_LIST_SINGLE(name, prefix, typename) static u32 __maybe_unused name[1];
@@ -156,6 +214,8 @@ extern struct btf_id_set name;
 #define BTF_SET_START(name) static struct btf_id_set __maybe_unused name = { 0 };
 #define BTF_SET_START_GLOBAL(name) static struct btf_id_set __maybe_unused name = { 0 };
 #define BTF_SET_END(name)
+#define BTF_SET8_START(name) static struct btf_id_set8 __maybe_unused name = { 0 };
+#define BTF_SET8_END(name)
 
 #endif /* CONFIG_DEBUG_INFO_BTF */
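Taken together, BTF_SET8_START/BTF_ID_FLAGS/BTF_SET8_END and the KF_* flags replace the
old per-type btf_id_set registration. A minimal sketch of declaring and registering a
flagged kfunc set under the new interface; the function names and the XDP hook are
illustrative placeholders, not mandated by this series:

    /* Sketch only: any in-kernel functions exposed as kfuncs could be listed here. */
    BTF_SET8_START(example_kfunc_ids)
    BTF_ID_FLAGS(func, example_obj_alloc, KF_ACQUIRE | KF_RET_NULL)
    BTF_ID_FLAGS(func, example_obj_release, KF_RELEASE | KF_TRUSTED_ARGS)
    BTF_SET8_END(example_kfunc_ids)

    static const struct btf_kfunc_id_set example_kfunc_set = {
            .owner = THIS_MODULE,
            .set   = &example_kfunc_ids,
    };

    /* register_btf_kfunc_id_set() must be called from an initcall or module init. */
    static int __init example_kfunc_init(void)
    {
            return register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &example_kfunc_set);
    }
    late_initcall(example_kfunc_init);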
 
index 4c1a8b2..a5f21dc 100644 (file)
@@ -1027,6 +1027,14 @@ u64 bpf_jit_alloc_exec_limit(void);
 void *bpf_jit_alloc_exec(unsigned long size);
 void bpf_jit_free_exec(void *addr);
 void bpf_jit_free(struct bpf_prog *fp);
+struct bpf_binary_header *
+bpf_jit_binary_pack_hdr(const struct bpf_prog *fp);
+
+static inline bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp)
+{
+       return list_empty(&fp->aux->ksym.lnode) ||
+              fp->aux->ksym.lnode.prev == LIST_POISON2;
+}
 
 struct bpf_binary_header *
 bpf_jit_binary_pack_alloc(unsigned int proglen, u8 **ro_image,
index 979f6bf..0b61371 100644 (file)
@@ -208,6 +208,43 @@ enum {
        FTRACE_OPS_FL_DIRECT                    = BIT(17),
 };
 
+/*
+ * FTRACE_OPS_CMD_* commands allow the ftrace core logic to request changes
+ * to a ftrace_ops. Note, the requests may fail.
+ *
+ * ENABLE_SHARE_IPMODIFY_SELF - enable a DIRECT ops to work on the same
+ *                              function as an ops with IPMODIFY. Called
+ *                              when the DIRECT ops is being registered.
+ *                              This is called with both direct_mutex and
+ *                              ftrace_lock locked.
+ *
+ * ENABLE_SHARE_IPMODIFY_PEER - enable a DIRECT ops to work on the same
+ *                              function as an ops with IPMODIFY. Called
+ *                              when the other ops (the one with IPMODIFY)
+ *                              is being registered.
+ *                              This is called with direct_mutex locked.
+ *
+ * DISABLE_SHARE_IPMODIFY_PEER - stop a DIRECT ops from working on the same
+ *                               function as an ops with IPMODIFY. Called
+ *                               when the other ops (the one with IPMODIFY)
+ *                               is being unregistered.
+ *                               This is called with direct_mutex locked.
+ */
+enum ftrace_ops_cmd {
+       FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY_SELF,
+       FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY_PEER,
+       FTRACE_OPS_CMD_DISABLE_SHARE_IPMODIFY_PEER,
+};
+
+/*
+ * For most ftrace_ops_cmd values:
+ * Returns:
+ *        0 - Success.
+ *        Negative on failure. The exact return value depends on the
+ *        callback.
+ */
+typedef int (*ftrace_ops_func_t)(struct ftrace_ops *op, enum ftrace_ops_cmd cmd);
+
 #ifdef CONFIG_DYNAMIC_FTRACE
 /* The hash used to know what functions callbacks trace */
 struct ftrace_ops_hash {
@@ -250,6 +287,7 @@ struct ftrace_ops {
        unsigned long                   trampoline;
        unsigned long                   trampoline_size;
        struct list_head                list;
+       ftrace_ops_func_t               ops_func;
 #endif
 };
 
@@ -340,6 +378,7 @@ unsigned long ftrace_find_rec_direct(unsigned long ip);
 int register_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr);
 int unregister_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr);
 int modify_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr);
+int modify_ftrace_direct_multi_nolock(struct ftrace_ops *ops, unsigned long addr);
 
 #else
 struct ftrace_ops;
@@ -384,6 +423,10 @@ static inline int modify_ftrace_direct_multi(struct ftrace_ops *ops, unsigned lo
 {
        return -ENODEV;
 }
+static inline int modify_ftrace_direct_multi_nolock(struct ftrace_ops *ops, unsigned long addr)
+{
+       return -ENODEV;
+}
 #endif /* CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS */
 
 #ifndef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
index 5d7ff88..ca8afa3 100644 (file)
@@ -2487,6 +2487,14 @@ static inline void skb_set_tail_pointer(struct sk_buff *skb, const int offset)
 
 #endif /* NET_SKBUFF_DATA_USES_OFFSET */
 
+static inline void skb_assert_len(struct sk_buff *skb)
+{
+#ifdef CONFIG_DEBUG_NET
+       if (WARN_ONCE(!skb->len, "%s\n", __func__))
+               DO_ONCE_LITE(skb_dump, KERN_ERR, skb, false);
+#endif /* CONFIG_DEBUG_NET */
+}
+
 /*
  *     Add data to an sk_buff
  */
index 37866c8..3cd3a6e 100644 (file)
@@ -84,4 +84,23 @@ void nf_conntrack_lock(spinlock_t *lock);
 
 extern spinlock_t nf_conntrack_expect_lock;
 
+/* ctnetlink code shared by both ctnetlink and nf_conntrack_bpf */
+
+#if (IS_BUILTIN(CONFIG_NF_CONNTRACK) && IS_ENABLED(CONFIG_DEBUG_INFO_BTF)) || \
+    (IS_MODULE(CONFIG_NF_CONNTRACK) && IS_ENABLED(CONFIG_DEBUG_INFO_BTF_MODULES) || \
+    IS_ENABLED(CONFIG_NF_CT_NETLINK))
+
+static inline void __nf_ct_set_timeout(struct nf_conn *ct, u64 timeout)
+{
+       if (timeout > INT_MAX)
+               timeout = INT_MAX;
+       WRITE_ONCE(ct->timeout, nfct_time_stamp + (u32)timeout);
+}
+
+int __nf_ct_change_timeout(struct nf_conn *ct, u64 cta_timeout);
+void __nf_ct_change_status(struct nf_conn *ct, unsigned long on, unsigned long off);
+int nf_ct_change_status_common(struct nf_conn *ct, unsigned int status);
+
+#endif
+
 #endif /* _NF_CONNTRACK_CORE_H */
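These helpers are shared between ctnetlink and the CT kfuncs added later in this series.
A hedged sketch of how a timeout-setting kfunc might build on __nf_ct_set_timeout; the
confirmed-entry guard and the millisecond unit are assumptions for illustration, not
copied from the series:

    /* Sketch: refuse to touch already-confirmed entries, then delegate. */
    int example_ct_set_timeout(struct nf_conn *ct, u32 timeout_ms)
    {
            if (nf_ct_is_confirmed(ct))
                    return -EPERM;

            __nf_ct_set_timeout(ct, msecs_to_jiffies(timeout_ms));
            return 0;
    }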
index 4aa0318..4277b0d 100644 (file)
@@ -44,6 +44,15 @@ static inline void xsk_pool_set_rxq_info(struct xsk_buff_pool *pool,
        xp_set_rxq_info(pool, rxq);
 }
 
+static inline unsigned int xsk_pool_get_napi_id(struct xsk_buff_pool *pool)
+{
+#ifdef CONFIG_NET_RX_BUSY_POLL
+       return pool->heads[0].xdp.rxq->napi_id;
+#else
+       return 0;
+#endif
+}
+
 static inline void xsk_pool_dma_unmap(struct xsk_buff_pool *pool,
                                      unsigned long attrs)
 {
@@ -198,6 +207,11 @@ static inline void xsk_pool_set_rxq_info(struct xsk_buff_pool *pool,
 {
 }
 
+static inline unsigned int xsk_pool_get_napi_id(struct xsk_buff_pool *pool)
+{
+       return 0;
+}
+
 static inline void xsk_pool_dma_unmap(struct xsk_buff_pool *pool,
                                      unsigned long attrs)
 {
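xsk_pool_get_napi_id() exists so the XSK core can associate a NAPI ID with send-only
sockets in busy-poll mode. A rough sketch of the intended use on the Tx path; the
function name and placement are assumptions, the actual call site lives in a separate
xsk patch of this series:

    /* Sketch: remember the pool's NAPI ID on the socket so busy polling
     * spins on the right NAPI context even if the socket never receives.
     */
    static void example_mark_xsk_napi_id(struct sock *sk, struct xdp_sock *xs)
    {
            __sk_mark_napi_id_once(sk, xsk_pool_get_napi_id(xs->pool));
    }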
index 3dd13fe..59a217c 100644 (file)
@@ -2361,7 +2361,8 @@ union bpf_attr {
  *             Pull in non-linear data in case the *skb* is non-linear and not
  *             all of *len* are part of the linear section. Make *len* bytes
  *             from *skb* readable and writable. If a zero value is passed for
- *             *len*, then the whole length of the *skb* is pulled.
+ *             *len*, then all bytes in the linear part of *skb* will be made
+ *             readable and writable.
  *
  *             This helper is only needed for reading and writing with direct
  *             packet access.
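The clarified semantics matter for the common pattern of pulling headers into the linear
area before direct packet access. A short sketch for a TC classifier, with illustrative
bounds values:

    /* Sketch: ensure 64 bytes are linear and writable; len == 0 would
     * instead make the whole current linear part readable and writable.
     */
    if (bpf_skb_pull_data(skb, 64) < 0)
            return TC_ACT_OK;

    void *data     = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    if (data + 64 > data_end)
            return TC_ACT_OK;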
index fe40d3b..d3e734b 100644 (file)
@@ -70,10 +70,8 @@ int array_map_alloc_check(union bpf_attr *attr)
            attr->map_flags & BPF_F_PRESERVE_ELEMS)
                return -EINVAL;
 
-       if (attr->value_size > KMALLOC_MAX_SIZE)
-               /* if value_size is bigger, the user space won't be able to
-                * access the elements.
-                */
+       /* avoid overflow on round_up(map->value_size) */
+       if (attr->value_size > INT_MAX)
                return -E2BIG;
 
        return 0;
@@ -156,6 +154,11 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
        return &array->map;
 }
 
+static void *array_map_elem_ptr(struct bpf_array* array, u32 index)
+{
+       return array->value + (u64)array->elem_size * index;
+}
+
 /* Called from syscall or from eBPF program */
 static void *array_map_lookup_elem(struct bpf_map *map, void *key)
 {
@@ -165,7 +168,7 @@ static void *array_map_lookup_elem(struct bpf_map *map, void *key)
        if (unlikely(index >= array->map.max_entries))
                return NULL;
 
-       return array->value + array->elem_size * (index & array->index_mask);
+       return array->value + (u64)array->elem_size * (index & array->index_mask);
 }
 
 static int array_map_direct_value_addr(const struct bpf_map *map, u64 *imm,
@@ -203,7 +206,7 @@ static int array_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf)
 {
        struct bpf_array *array = container_of(map, struct bpf_array, map);
        struct bpf_insn *insn = insn_buf;
-       u32 elem_size = round_up(map->value_size, 8);
+       u32 elem_size = array->elem_size;
        const int ret = BPF_REG_0;
        const int map_ptr = BPF_REG_1;
        const int index = BPF_REG_2;
@@ -272,7 +275,7 @@ int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value)
         * access 'value_size' of them, so copying rounded areas
         * will not leak any kernel data
         */
-       size = round_up(map->value_size, 8);
+       size = array->elem_size;
        rcu_read_lock();
        pptr = array->pptrs[index & array->index_mask];
        for_each_possible_cpu(cpu) {
@@ -339,7 +342,7 @@ static int array_map_update_elem(struct bpf_map *map, void *key, void *value,
                       value, map->value_size);
        } else {
                val = array->value +
-                       array->elem_size * (index & array->index_mask);
+                       (u64)array->elem_size * (index & array->index_mask);
                if (map_flags & BPF_F_LOCK)
                        copy_map_value_locked(map, val, value, false);
                else
@@ -376,7 +379,7 @@ int bpf_percpu_array_update(struct bpf_map *map, void *key, void *value,
         * returned or zeros which were zero-filled by percpu_alloc,
         * so no kernel data leaks possible
         */
-       size = round_up(map->value_size, 8);
+       size = array->elem_size;
        rcu_read_lock();
        pptr = array->pptrs[index & array->index_mask];
        for_each_possible_cpu(cpu) {
@@ -408,8 +411,7 @@ static void array_map_free_timers(struct bpf_map *map)
                return;
 
        for (i = 0; i < array->map.max_entries; i++)
-               bpf_timer_cancel_and_free(array->value + array->elem_size * i +
-                                         map->timer_off);
+               bpf_timer_cancel_and_free(array_map_elem_ptr(array, i) + map->timer_off);
 }
 
 /* Called when map->refcnt goes to zero, either from workqueue or from syscall */
@@ -420,7 +422,7 @@ static void array_map_free(struct bpf_map *map)
 
        if (map_value_has_kptrs(map)) {
                for (i = 0; i < array->map.max_entries; i++)
-                       bpf_map_free_kptrs(map, array->value + array->elem_size * i);
+                       bpf_map_free_kptrs(map, array_map_elem_ptr(array, i));
                bpf_map_free_kptr_off_tab(map);
        }
 
@@ -556,7 +558,7 @@ static void *bpf_array_map_seq_start(struct seq_file *seq, loff_t *pos)
        index = info->index & array->index_mask;
        if (info->percpu_value_buf)
               return array->pptrs[index];
-       return array->value + array->elem_size * index;
+       return array_map_elem_ptr(array, index);
 }
 
 static void *bpf_array_map_seq_next(struct seq_file *seq, void *v, loff_t *pos)
@@ -575,7 +577,7 @@ static void *bpf_array_map_seq_next(struct seq_file *seq, void *v, loff_t *pos)
        index = info->index & array->index_mask;
        if (info->percpu_value_buf)
               return array->pptrs[index];
-       return array->value + array->elem_size * index;
+       return array_map_elem_ptr(array, index);
 }
 
 static int __bpf_array_map_seq_show(struct seq_file *seq, void *v)
@@ -583,6 +585,7 @@ static int __bpf_array_map_seq_show(struct seq_file *seq, void *v)
        struct bpf_iter_seq_array_map_info *info = seq->private;
        struct bpf_iter__bpf_map_elem ctx = {};
        struct bpf_map *map = info->map;
+       struct bpf_array *array = container_of(map, struct bpf_array, map);
        struct bpf_iter_meta meta;
        struct bpf_prog *prog;
        int off = 0, cpu = 0;
@@ -603,7 +606,7 @@ static int __bpf_array_map_seq_show(struct seq_file *seq, void *v)
                        ctx.value = v;
                } else {
                        pptr = v;
-                       size = round_up(map->value_size, 8);
+                       size = array->elem_size;
                        for_each_possible_cpu(cpu) {
                                bpf_long_memcpy(info->percpu_value_buf + off,
                                                per_cpu_ptr(pptr, cpu),
@@ -633,11 +636,12 @@ static int bpf_iter_init_array_map(void *priv_data,
 {
        struct bpf_iter_seq_array_map_info *seq_info = priv_data;
        struct bpf_map *map = aux->map;
+       struct bpf_array *array = container_of(map, struct bpf_array, map);
        void *value_buf;
        u32 buf_size;
 
        if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
-               buf_size = round_up(map->value_size, 8) * num_possible_cpus();
+               buf_size = array->elem_size * num_possible_cpus();
                value_buf = kmalloc(buf_size, GFP_USER | __GFP_NOWARN);
                if (!value_buf)
                        return -ENOMEM;
@@ -690,7 +694,7 @@ static int bpf_for_each_array_elem(struct bpf_map *map, bpf_callback_t callback_
                if (is_percpu)
                        val = this_cpu_ptr(array->pptrs[i]);
                else
-                       val = array->value + array->elem_size * i;
+                       val = array_map_elem_ptr(array, i);
                num_elems++;
                key = i;
                ret = callback_fn((u64)(long)map, (u64)(long)&key,
@@ -1322,7 +1326,7 @@ static int array_of_map_gen_lookup(struct bpf_map *map,
                                   struct bpf_insn *insn_buf)
 {
        struct bpf_array *array = container_of(map, struct bpf_array, map);
-       u32 elem_size = round_up(map->value_size, 8);
+       u32 elem_size = array->elem_size;
        struct bpf_insn *insn = insn_buf;
        const int ret = BPF_REG_0;
        const int map_ptr = BPF_REG_1;
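The (u64) casts and the cached array->elem_size guard against 32-bit overflow now that
value_size may be as large as INT_MAX: elem_size and index are both u32, so their product
must be widened before it is added to the base pointer. An illustration of the failure
mode being avoided, with values chosen only so the product hits 2^32:

    u32 elem_size = 1U << 20;   /* ~1 MiB element, now permitted */
    u32 index     = 1U << 12;   /* 4096th element                */

    void *bad  = array->value + elem_size * index;       /* 32-bit product wraps to 0 */
    void *good = array->value + (u64)elem_size * index;  /* full 64-bit offset        */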
index d469b7f..fa71d58 100644 (file)
@@ -63,10 +63,11 @@ BTF_ID(func, bpf_lsm_socket_post_create)
 BTF_ID(func, bpf_lsm_socket_socketpair)
 BTF_SET_END(bpf_lsm_unlocked_sockopt_hooks)
 
+#ifdef CONFIG_CGROUP_BPF
 void bpf_lsm_find_cgroup_shim(const struct bpf_prog *prog,
                             bpf_func_t *bpf_func)
 {
-       const struct btf_param *args;
+       const struct btf_param *args __maybe_unused;
 
        if (btf_type_vlen(prog->aux->attach_func_proto) < 1 ||
            btf_id_set_contains(&bpf_lsm_current_hooks,
@@ -75,9 +76,9 @@ void bpf_lsm_find_cgroup_shim(const struct bpf_prog *prog,
                return;
        }
 
+#ifdef CONFIG_NET
        args = btf_params(prog->aux->attach_func_proto);
 
-#ifdef CONFIG_NET
        if (args[0].type == btf_sock_ids[BTF_SOCK_TYPE_SOCKET])
                *bpf_func = __cgroup_bpf_run_lsm_socket;
        else if (args[0].type == btf_sock_ids[BTF_SOCK_TYPE_SOCK])
@@ -86,6 +87,7 @@ void bpf_lsm_find_cgroup_shim(const struct bpf_prog *prog,
 #endif
                *bpf_func = __cgroup_bpf_run_lsm_current;
 }
+#endif
 
 int bpf_lsm_verify_prog(struct bpf_verifier_log *vlog,
                        const struct bpf_prog *prog)
@@ -219,6 +221,7 @@ bpf_lsm_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
        case BPF_FUNC_get_retval:
                return prog->expected_attach_type == BPF_LSM_CGROUP ?
                        &bpf_get_retval_proto : NULL;
+#ifdef CONFIG_NET
        case BPF_FUNC_setsockopt:
                if (prog->expected_attach_type != BPF_LSM_CGROUP)
                        return NULL;
@@ -239,6 +242,7 @@ bpf_lsm_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
                                        prog->aux->attach_btf_id))
                        return &bpf_unlocked_sk_getsockopt_proto;
                return NULL;
+#endif
        default:
                return tracing_prog_func_proto(func_id, prog);
        }
index 7e0068c..84b2d9d 100644 (file)
@@ -341,6 +341,9 @@ int bpf_struct_ops_prepare_trampoline(struct bpf_tramp_links *tlinks,
 
        tlinks[BPF_TRAMP_FENTRY].links[0] = link;
        tlinks[BPF_TRAMP_FENTRY].nr_links = 1;
+       /* BPF_TRAMP_F_RET_FENTRY_RET is only used by bpf_struct_ops,
+        * and it must be used alone.
+        */
        flags = model->ret_size > 0 ? BPF_TRAMP_F_RET_FENTRY_RET : 0;
        return arch_prepare_bpf_trampoline(NULL, image, image_end,
                                           model, flags, tlinks, NULL);
index 4423045..7ac971e 100644 (file)
@@ -213,7 +213,7 @@ enum {
 };
 
 struct btf_kfunc_set_tab {
-       struct btf_id_set *sets[BTF_KFUNC_HOOK_MAX][BTF_KFUNC_TYPE_MAX];
+       struct btf_id_set8 *sets[BTF_KFUNC_HOOK_MAX];
 };
 
 struct btf_id_dtor_kfunc_tab {
@@ -1116,7 +1116,8 @@ __printf(2, 3) static void btf_show(struct btf_show *show, const char *fmt, ...)
  */
 #define btf_show_type_value(show, fmt, value)                                 \
        do {                                                                   \
-               if ((value) != 0 || (show->flags & BTF_SHOW_ZERO) ||           \
+               if ((value) != (__typeof__(value))0 ||                         \
+                   (show->flags & BTF_SHOW_ZERO) ||                           \
                    show->state.depth == 0) {                                  \
                        btf_show(show, "%s%s" fmt "%s%s",                      \
                                 btf_show_indent(show),                        \
@@ -1615,7 +1616,7 @@ static void btf_free_id(struct btf *btf)
 static void btf_free_kfunc_set_tab(struct btf *btf)
 {
        struct btf_kfunc_set_tab *tab = btf->kfunc_set_tab;
-       int hook, type;
+       int hook;
 
        if (!tab)
                return;
@@ -1624,10 +1625,8 @@ static void btf_free_kfunc_set_tab(struct btf *btf)
         */
        if (btf_is_module(btf))
                goto free_tab;
-       for (hook = 0; hook < ARRAY_SIZE(tab->sets); hook++) {
-               for (type = 0; type < ARRAY_SIZE(tab->sets[0]); type++)
-                       kfree(tab->sets[hook][type]);
-       }
+       for (hook = 0; hook < ARRAY_SIZE(tab->sets); hook++)
+               kfree(tab->sets[hook]);
 free_tab:
        kfree(tab);
        btf->kfunc_set_tab = NULL;
@@ -6171,13 +6170,14 @@ static bool is_kfunc_arg_mem_size(const struct btf *btf,
 static int btf_check_func_arg_match(struct bpf_verifier_env *env,
                                    const struct btf *btf, u32 func_id,
                                    struct bpf_reg_state *regs,
-                                   bool ptr_to_mem_ok)
+                                   bool ptr_to_mem_ok,
+                                   u32 kfunc_flags)
 {
        enum bpf_prog_type prog_type = resolve_prog_type(env->prog);
+       bool rel = false, kptr_get = false, trusted_arg = false;
        struct bpf_verifier_log *log = &env->log;
        u32 i, nargs, ref_id, ref_obj_id = 0;
        bool is_kfunc = btf_is_kernel(btf);
-       bool rel = false, kptr_get = false;
        const char *func_name, *ref_tname;
        const struct btf_type *t, *ref_t;
        const struct btf_param *args;
@@ -6209,10 +6209,9 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env,
 
        if (is_kfunc) {
                /* Only kfunc can be release func */
-               rel = btf_kfunc_id_set_contains(btf, resolve_prog_type(env->prog),
-                                               BTF_KFUNC_TYPE_RELEASE, func_id);
-               kptr_get = btf_kfunc_id_set_contains(btf, resolve_prog_type(env->prog),
-                                                    BTF_KFUNC_TYPE_KPTR_ACQUIRE, func_id);
+               rel = kfunc_flags & KF_RELEASE;
+               kptr_get = kfunc_flags & KF_KPTR_GET;
+               trusted_arg = kfunc_flags & KF_TRUSTED_ARGS;
        }
 
        /* check that BTF function arguments match actual types that the
@@ -6237,10 +6236,19 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env,
                        return -EINVAL;
                }
 
+               /* Check if argument must be a referenced pointer; args + i has
+                * been verified to be a pointer (after skipping modifiers).
+                */
+               if (is_kfunc && trusted_arg && !reg->ref_obj_id) {
+                       bpf_log(log, "R%d must be referenced\n", regno);
+                       return -EINVAL;
+               }
+
                ref_t = btf_type_skip_modifiers(btf, t->type, &ref_id);
                ref_tname = btf_name_by_offset(btf, ref_t->name_off);
 
-               if (rel && reg->ref_obj_id)
+               /* Trusted args have the same offset checks as release arguments */
+               if (trusted_arg || (rel && reg->ref_obj_id))
                        arg_type |= OBJ_RELEASE;
                ret = check_func_arg_reg_off(env, reg, regno, arg_type);
                if (ret < 0)
@@ -6338,7 +6346,8 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env,
                        reg_ref_tname = btf_name_by_offset(reg_btf,
                                                           reg_ref_t->name_off);
                        if (!btf_struct_ids_match(log, reg_btf, reg_ref_id,
-                                                 reg->off, btf, ref_id, rel && reg->ref_obj_id)) {
+                                                 reg->off, btf, ref_id,
+                                                 trusted_arg || (rel && reg->ref_obj_id))) {
                                bpf_log(log, "kernel function %s args#%d expected pointer to %s %s but R%d has a pointer to %s %s\n",
                                        func_name, i,
                                        btf_type_str(ref_t), ref_tname,
@@ -6441,7 +6450,7 @@ int btf_check_subprog_arg_match(struct bpf_verifier_env *env, int subprog,
                return -EINVAL;
 
        is_global = prog->aux->func_info_aux[subprog].linkage == BTF_FUNC_GLOBAL;
-       err = btf_check_func_arg_match(env, btf, btf_id, regs, is_global);
+       err = btf_check_func_arg_match(env, btf, btf_id, regs, is_global, 0);
 
        /* Compiler optimizations can remove arguments from static functions
         * or mismatched type can be passed into a global function.
@@ -6454,9 +6463,10 @@ int btf_check_subprog_arg_match(struct bpf_verifier_env *env, int subprog,
 
 int btf_check_kfunc_arg_match(struct bpf_verifier_env *env,
                              const struct btf *btf, u32 func_id,
-                             struct bpf_reg_state *regs)
+                             struct bpf_reg_state *regs,
+                             u32 kfunc_flags)
 {
-       return btf_check_func_arg_match(env, btf, func_id, regs, true);
+       return btf_check_func_arg_match(env, btf, func_id, regs, true, kfunc_flags);
 }
 
 /* Convert BTF of a function into bpf_reg_state if possible
@@ -6853,6 +6863,11 @@ bool btf_id_set_contains(const struct btf_id_set *set, u32 id)
        return bsearch(&id, set->ids, set->cnt, sizeof(u32), btf_id_cmp_func) != NULL;
 }
 
+static void *btf_id_set8_contains(const struct btf_id_set8 *set, u32 id)
+{
+       return bsearch(&id, set->pairs, set->cnt, sizeof(set->pairs[0]), btf_id_cmp_func);
+}
+
 enum {
        BTF_MODULE_F_LIVE = (1 << 0),
 };
@@ -7101,16 +7116,16 @@ BTF_TRACING_TYPE_xxx
 
 /* Kernel Function (kfunc) BTF ID set registration API */
 
-static int __btf_populate_kfunc_set(struct btf *btf, enum btf_kfunc_hook hook,
-                                   enum btf_kfunc_type type,
-                                   struct btf_id_set *add_set, bool vmlinux_set)
+static int btf_populate_kfunc_set(struct btf *btf, enum btf_kfunc_hook hook,
+                                 struct btf_id_set8 *add_set)
 {
+       bool vmlinux_set = !btf_is_module(btf);
        struct btf_kfunc_set_tab *tab;
-       struct btf_id_set *set;
+       struct btf_id_set8 *set;
        u32 set_cnt;
        int ret;
 
-       if (hook >= BTF_KFUNC_HOOK_MAX || type >= BTF_KFUNC_TYPE_MAX) {
+       if (hook >= BTF_KFUNC_HOOK_MAX) {
                ret = -EINVAL;
                goto end;
        }
@@ -7126,7 +7141,7 @@ static int __btf_populate_kfunc_set(struct btf *btf, enum btf_kfunc_hook hook,
                btf->kfunc_set_tab = tab;
        }
 
-       set = tab->sets[hook][type];
+       set = tab->sets[hook];
        /* Warn when register_btf_kfunc_id_set is called twice for the same hook
         * for module sets.
         */
@@ -7140,7 +7155,7 @@ static int __btf_populate_kfunc_set(struct btf *btf, enum btf_kfunc_hook hook,
         * pointer and return.
         */
        if (!vmlinux_set) {
-               tab->sets[hook][type] = add_set;
+               tab->sets[hook] = add_set;
                return 0;
        }
 
@@ -7149,7 +7164,7 @@ static int __btf_populate_kfunc_set(struct btf *btf, enum btf_kfunc_hook hook,
         * and concatenate all individual sets being registered. While each set
         * is individually sorted, they may become unsorted when concatenated,
         * hence re-sorting the final set again is required to make binary
-        * searching the set using btf_id_set_contains function work.
+        * searching the set using btf_id_set8_contains function work.
         */
        set_cnt = set ? set->cnt : 0;
 
@@ -7164,8 +7179,8 @@ static int __btf_populate_kfunc_set(struct btf *btf, enum btf_kfunc_hook hook,
        }
 
        /* Grow set */
-       set = krealloc(tab->sets[hook][type],
-                      offsetof(struct btf_id_set, ids[set_cnt + add_set->cnt]),
+       set = krealloc(tab->sets[hook],
+                      offsetof(struct btf_id_set8, pairs[set_cnt + add_set->cnt]),
                       GFP_KERNEL | __GFP_NOWARN);
        if (!set) {
                ret = -ENOMEM;
@@ -7173,15 +7188,15 @@ static int __btf_populate_kfunc_set(struct btf *btf, enum btf_kfunc_hook hook,
        }
 
        /* For newly allocated set, initialize set->cnt to 0 */
-       if (!tab->sets[hook][type])
+       if (!tab->sets[hook])
                set->cnt = 0;
-       tab->sets[hook][type] = set;
+       tab->sets[hook] = set;
 
        /* Concatenate the two sets */
-       memcpy(set->ids + set->cnt, add_set->ids, add_set->cnt * sizeof(set->ids[0]));
+       memcpy(set->pairs + set->cnt, add_set->pairs, add_set->cnt * sizeof(set->pairs[0]));
        set->cnt += add_set->cnt;
 
-       sort(set->ids, set->cnt, sizeof(set->ids[0]), btf_id_cmp_func, NULL);
+       sort(set->pairs, set->cnt, sizeof(set->pairs[0]), btf_id_cmp_func, NULL);
 
        return 0;
 end:
@@ -7189,38 +7204,25 @@ end:
        return ret;
 }
 
-static int btf_populate_kfunc_set(struct btf *btf, enum btf_kfunc_hook hook,
-                                 const struct btf_kfunc_id_set *kset)
-{
-       bool vmlinux_set = !btf_is_module(btf);
-       int type, ret = 0;
-
-       for (type = 0; type < ARRAY_SIZE(kset->sets); type++) {
-               if (!kset->sets[type])
-                       continue;
-
-               ret = __btf_populate_kfunc_set(btf, hook, type, kset->sets[type], vmlinux_set);
-               if (ret)
-                       break;
-       }
-       return ret;
-}
-
-static bool __btf_kfunc_id_set_contains(const struct btf *btf,
+static u32 *__btf_kfunc_id_set_contains(const struct btf *btf,
                                        enum btf_kfunc_hook hook,
-                                       enum btf_kfunc_type type,
                                        u32 kfunc_btf_id)
 {
-       struct btf_id_set *set;
+       struct btf_id_set8 *set;
+       u32 *id;
 
-       if (hook >= BTF_KFUNC_HOOK_MAX || type >= BTF_KFUNC_TYPE_MAX)
-               return false;
+       if (hook >= BTF_KFUNC_HOOK_MAX)
+               return NULL;
        if (!btf->kfunc_set_tab)
-               return false;
-       set = btf->kfunc_set_tab->sets[hook][type];
+               return NULL;
+       set = btf->kfunc_set_tab->sets[hook];
        if (!set)
-               return false;
-       return btf_id_set_contains(set, kfunc_btf_id);
+               return NULL;
+       id = btf_id_set8_contains(set, kfunc_btf_id);
+       if (!id)
+               return NULL;
+       /* The flags for BTF ID are located next to it */
+       return id + 1;
 }
 
 static int bpf_prog_type_to_kfunc_hook(enum bpf_prog_type prog_type)
@@ -7248,14 +7250,14 @@ static int bpf_prog_type_to_kfunc_hook(enum bpf_prog_type prog_type)
  * keeping the reference for the duration of the call provides the necessary
  * protection for looking up a well-formed btf->kfunc_set_tab.
  */
-bool btf_kfunc_id_set_contains(const struct btf *btf,
+u32 *btf_kfunc_id_set_contains(const struct btf *btf,
                               enum bpf_prog_type prog_type,
-                              enum btf_kfunc_type type, u32 kfunc_btf_id)
+                              u32 kfunc_btf_id)
 {
        enum btf_kfunc_hook hook;
 
        hook = bpf_prog_type_to_kfunc_hook(prog_type);
-       return __btf_kfunc_id_set_contains(btf, hook, type, kfunc_btf_id);
+       return __btf_kfunc_id_set_contains(btf, hook, kfunc_btf_id);
 }
 
 /* This function must be invoked only from initcalls/module init functions */
@@ -7282,7 +7284,7 @@ int register_btf_kfunc_id_set(enum bpf_prog_type prog_type,
                return PTR_ERR(btf);
 
        hook = bpf_prog_type_to_kfunc_hook(prog_type);
-       ret = btf_populate_kfunc_set(btf, hook, kset);
+       ret = btf_populate_kfunc_set(btf, hook, kset->set);
        btf_put(btf);
        return ret;
 }
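On the caller side, the boolean-per-type queries collapse into a single flags lookup.
A hedged sketch of how a verifier-side caller might consume the new return value; the
surrounding verifier changes are in a separate hunk of this series and the variable
names are illustrative:

    /* Sketch: one lookup yields permission plus all kfunc properties. */
    u32 *kfunc_flags;
    bool acq;
    int err;

    kfunc_flags = btf_kfunc_id_set_contains(btf, resolve_prog_type(env->prog),
                                            func_id);
    if (!kfunc_flags)
            return -EACCES;     /* kfunc not allowed for this program type */

    acq = *kfunc_flags & KF_ACQUIRE;
    err = btf_check_kfunc_arg_match(env, btf, func_id, regs, *kfunc_flags);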
index bfeb9b9..c1e10d0 100644 (file)
@@ -652,12 +652,6 @@ static bool bpf_prog_kallsyms_candidate(const struct bpf_prog *fp)
        return fp->jited && !bpf_prog_was_classic(fp);
 }
 
-static bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp)
-{
-       return list_empty(&fp->aux->ksym.lnode) ||
-              fp->aux->ksym.lnode.prev == LIST_POISON2;
-}
-
 void bpf_prog_kallsyms_add(struct bpf_prog *fp)
 {
        if (!bpf_prog_kallsyms_candidate(fp) ||
@@ -833,15 +827,6 @@ struct bpf_prog_pack {
 
 #define BPF_PROG_SIZE_TO_NBITS(size)   (round_up(size, BPF_PROG_CHUNK_SIZE) / BPF_PROG_CHUNK_SIZE)
 
-static size_t bpf_prog_pack_size = -1;
-static size_t bpf_prog_pack_mask = -1;
-
-static int bpf_prog_chunk_count(void)
-{
-       WARN_ON_ONCE(bpf_prog_pack_size == -1);
-       return bpf_prog_pack_size / BPF_PROG_CHUNK_SIZE;
-}
-
 static DEFINE_MUTEX(pack_mutex);
 static LIST_HEAD(pack_list);
 
@@ -849,55 +834,33 @@ static LIST_HEAD(pack_list);
  * CONFIG_MMU=n. Use PAGE_SIZE in these cases.
  */
 #ifdef PMD_SIZE
-#define BPF_HPAGE_SIZE PMD_SIZE
-#define BPF_HPAGE_MASK PMD_MASK
+#define BPF_PROG_PACK_SIZE (PMD_SIZE * num_possible_nodes())
 #else
-#define BPF_HPAGE_SIZE PAGE_SIZE
-#define BPF_HPAGE_MASK PAGE_MASK
+#define BPF_PROG_PACK_SIZE PAGE_SIZE
 #endif
 
-static size_t select_bpf_prog_pack_size(void)
-{
-       size_t size;
-       void *ptr;
-
-       size = BPF_HPAGE_SIZE * num_online_nodes();
-       ptr = module_alloc(size);
-
-       /* Test whether we can get huge pages. If not just use PAGE_SIZE
-        * packs.
-        */
-       if (!ptr || !is_vm_area_hugepages(ptr)) {
-               size = PAGE_SIZE;
-               bpf_prog_pack_mask = PAGE_MASK;
-       } else {
-               bpf_prog_pack_mask = BPF_HPAGE_MASK;
-       }
-
-       vfree(ptr);
-       return size;
-}
+#define BPF_PROG_CHUNK_COUNT (BPF_PROG_PACK_SIZE / BPF_PROG_CHUNK_SIZE)
 
 static struct bpf_prog_pack *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_insns)
 {
        struct bpf_prog_pack *pack;
 
-       pack = kzalloc(struct_size(pack, bitmap, BITS_TO_LONGS(bpf_prog_chunk_count())),
+       pack = kzalloc(struct_size(pack, bitmap, BITS_TO_LONGS(BPF_PROG_CHUNK_COUNT)),
                       GFP_KERNEL);
        if (!pack)
                return NULL;
-       pack->ptr = module_alloc(bpf_prog_pack_size);
+       pack->ptr = module_alloc(BPF_PROG_PACK_SIZE);
        if (!pack->ptr) {
                kfree(pack);
                return NULL;
        }
-       bpf_fill_ill_insns(pack->ptr, bpf_prog_pack_size);
-       bitmap_zero(pack->bitmap, bpf_prog_pack_size / BPF_PROG_CHUNK_SIZE);
+       bpf_fill_ill_insns(pack->ptr, BPF_PROG_PACK_SIZE);
+       bitmap_zero(pack->bitmap, BPF_PROG_PACK_SIZE / BPF_PROG_CHUNK_SIZE);
        list_add_tail(&pack->list, &pack_list);
 
        set_vm_flush_reset_perms(pack->ptr);
-       set_memory_ro((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
-       set_memory_x((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
+       set_memory_ro((unsigned long)pack->ptr, BPF_PROG_PACK_SIZE / PAGE_SIZE);
+       set_memory_x((unsigned long)pack->ptr, BPF_PROG_PACK_SIZE / PAGE_SIZE);
        return pack;
 }
 
@@ -909,10 +872,7 @@ static void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insn
        void *ptr = NULL;
 
        mutex_lock(&pack_mutex);
-       if (bpf_prog_pack_size == -1)
-               bpf_prog_pack_size = select_bpf_prog_pack_size();
-
-       if (size > bpf_prog_pack_size) {
+       if (size > BPF_PROG_PACK_SIZE) {
                size = round_up(size, PAGE_SIZE);
                ptr = module_alloc(size);
                if (ptr) {
@@ -924,9 +884,9 @@ static void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insn
                goto out;
        }
        list_for_each_entry(pack, &pack_list, list) {
-               pos = bitmap_find_next_zero_area(pack->bitmap, bpf_prog_chunk_count(), 0,
+               pos = bitmap_find_next_zero_area(pack->bitmap, BPF_PROG_CHUNK_COUNT, 0,
                                                 nbits, 0);
-               if (pos < bpf_prog_chunk_count())
+               if (pos < BPF_PROG_CHUNK_COUNT)
                        goto found_free_area;
        }
 
@@ -950,18 +910,15 @@ static void bpf_prog_pack_free(struct bpf_binary_header *hdr)
        struct bpf_prog_pack *pack = NULL, *tmp;
        unsigned int nbits;
        unsigned long pos;
-       void *pack_ptr;
 
        mutex_lock(&pack_mutex);
-       if (hdr->size > bpf_prog_pack_size) {
+       if (hdr->size > BPF_PROG_PACK_SIZE) {
                module_memfree(hdr);
                goto out;
        }
 
-       pack_ptr = (void *)((unsigned long)hdr & bpf_prog_pack_mask);
-
        list_for_each_entry(tmp, &pack_list, list) {
-               if (tmp->ptr == pack_ptr) {
+               if ((void *)hdr >= tmp->ptr && (tmp->ptr + BPF_PROG_PACK_SIZE) > (void *)hdr) {
                        pack = tmp;
                        break;
                }
@@ -971,14 +928,14 @@ static void bpf_prog_pack_free(struct bpf_binary_header *hdr)
                goto out;
 
        nbits = BPF_PROG_SIZE_TO_NBITS(hdr->size);
-       pos = ((unsigned long)hdr - (unsigned long)pack_ptr) >> BPF_PROG_CHUNK_SHIFT;
+       pos = ((unsigned long)hdr - (unsigned long)pack->ptr) >> BPF_PROG_CHUNK_SHIFT;
 
        WARN_ONCE(bpf_arch_text_invalidate(hdr, hdr->size),
                  "bpf_prog_pack bug: missing bpf_arch_text_invalidate?\n");
 
        bitmap_clear(pack->bitmap, pos, nbits);
-       if (bitmap_find_next_zero_area(pack->bitmap, bpf_prog_chunk_count(), 0,
-                                      bpf_prog_chunk_count(), 0) == 0) {
+       if (bitmap_find_next_zero_area(pack->bitmap, BPF_PROG_CHUNK_COUNT, 0,
+                                      BPF_PROG_CHUNK_COUNT, 0) == 0) {
                list_del(&pack->list);
                module_memfree(pack->ptr);
                kfree(pack);
@@ -1155,7 +1112,6 @@ int bpf_jit_binary_pack_finalize(struct bpf_prog *prog,
                bpf_prog_pack_free(ro_header);
                return PTR_ERR(ptr);
        }
-       prog->aux->use_bpf_prog_pack = true;
        return 0;
 }
 
@@ -1179,17 +1135,23 @@ void bpf_jit_binary_pack_free(struct bpf_binary_header *ro_header,
        bpf_jit_uncharge_modmem(size);
 }
 
+struct bpf_binary_header *
+bpf_jit_binary_pack_hdr(const struct bpf_prog *fp)
+{
+       unsigned long real_start = (unsigned long)fp->bpf_func;
+       unsigned long addr;
+
+       addr = real_start & BPF_PROG_CHUNK_MASK;
+       return (void *)addr;
+}
+
 static inline struct bpf_binary_header *
 bpf_jit_binary_hdr(const struct bpf_prog *fp)
 {
        unsigned long real_start = (unsigned long)fp->bpf_func;
        unsigned long addr;
 
-       if (fp->aux->use_bpf_prog_pack)
-               addr = real_start & BPF_PROG_CHUNK_MASK;
-       else
-               addr = real_start & PAGE_MASK;
-
+       addr = real_start & PAGE_MASK;
        return (void *)addr;
 }
 
@@ -1202,11 +1164,7 @@ void __weak bpf_jit_free(struct bpf_prog *fp)
        if (fp->jited) {
                struct bpf_binary_header *hdr = bpf_jit_binary_hdr(fp);
 
-               if (fp->aux->use_bpf_prog_pack)
-                       bpf_jit_binary_pack_free(hdr, NULL /* rw_buffer */);
-               else
-                       bpf_jit_binary_free(hdr);
-
+               bpf_jit_binary_free(hdr);
                WARN_ON_ONCE(!bpf_prog_kallsyms_verify_off(fp));
        }
 
index c286706..1400561 100644 (file)
@@ -845,7 +845,7 @@ static struct bpf_dtab_netdev *__dev_map_alloc_node(struct net *net,
        struct bpf_dtab_netdev *dev;
 
        dev = bpf_map_kmalloc_node(&dtab->map, sizeof(*dev),
-                                  GFP_ATOMIC | __GFP_NOWARN,
+                                  GFP_NOWAIT | __GFP_NOWARN,
                                   dtab->map.numa_node);
        if (!dev)
                return ERR_PTR(-ENOMEM);
index 17fb69c..da75784 100644 (file)
@@ -61,7 +61,7 @@
  *
  * As regular device interrupt handlers and soft interrupts are forced into
  * thread context, the existing code which does
- *   spin_lock*(); alloc(GPF_ATOMIC); spin_unlock*();
+ *   spin_lock*(); alloc(GFP_ATOMIC); spin_unlock*();
  * just works.
  *
  * In theory the BPF locks could be converted to regular spinlocks as well,
@@ -978,7 +978,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
                                goto dec_count;
                        }
                l_new = bpf_map_kmalloc_node(&htab->map, htab->elem_size,
-                                            GFP_ATOMIC | __GFP_NOWARN,
+                                            GFP_NOWAIT | __GFP_NOWARN,
                                             htab->map.numa_node);
                if (!l_new) {
                        l_new = ERR_PTR(-ENOMEM);
@@ -996,7 +996,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
                } else {
                        /* alloc_percpu zero-fills */
                        pptr = bpf_map_alloc_percpu(&htab->map, size, 8,
-                                                   GFP_ATOMIC | __GFP_NOWARN);
+                                                   GFP_NOWAIT | __GFP_NOWARN);
                        if (!pptr) {
                                kfree(l_new);
                                l_new = ERR_PTR(-ENOMEM);
index 8654fc9..49ef0ce 100644 (file)
@@ -165,7 +165,7 @@ static int cgroup_storage_update_elem(struct bpf_map *map, void *key,
        }
 
        new = bpf_map_kmalloc_node(map, struct_size(new, data, map->value_size),
-                                  __GFP_ZERO | GFP_ATOMIC | __GFP_NOWARN,
+                                  __GFP_ZERO | GFP_NOWAIT | __GFP_NOWARN,
                                   map->numa_node);
        if (!new)
                return -ENOMEM;
index f0d05a3..d789e3b 100644 (file)
@@ -285,7 +285,7 @@ static struct lpm_trie_node *lpm_trie_node_alloc(const struct lpm_trie *trie,
        if (value)
                size += trie->map.value_size;
 
-       node = bpf_map_kmalloc_node(&trie->map, size, GFP_ATOMIC | __GFP_NOWARN,
+       node = bpf_map_kmalloc_node(&trie->map, size, GFP_NOWAIT | __GFP_NOWARN,
                                    trie->map.numa_node);
        if (!node)
                return NULL;
index bfe24f8..6762b12 100644 (file)
@@ -9,7 +9,7 @@ LLVM_STRIP ?= llvm-strip
 TOOLS_PATH := $(abspath ../../../../tools)
 BPFTOOL_SRC := $(TOOLS_PATH)/bpf/bpftool
 BPFTOOL_OUTPUT := $(abs_out)/bpftool
-DEFAULT_BPFTOOL := $(OUTPUT)/sbin/bpftool
+DEFAULT_BPFTOOL := $(BPFTOOL_OUTPUT)/bootstrap/bpftool
 BPFTOOL ?= $(DEFAULT_BPFTOOL)
 
 LIBBPF_SRC := $(TOOLS_PATH)/lib/bpf
@@ -61,9 +61,5 @@ $(BPFOBJ): $(wildcard $(LIBBPF_SRC)/*.[ch] $(LIBBPF_SRC)/Makefile) | $(LIBBPF_OU
                    OUTPUT=$(abspath $(dir $@))/ prefix=                       \
                    DESTDIR=$(LIBBPF_DESTDIR) $(abspath $@) install_headers
 
-$(DEFAULT_BPFTOOL): $(BPFOBJ) | $(BPFTOOL_OUTPUT)
-       $(Q)$(MAKE) $(submake_extras) -C $(BPFTOOL_SRC)                        \
-                   OUTPUT=$(BPFTOOL_OUTPUT)/                                  \
-                   LIBBPF_OUTPUT=$(LIBBPF_OUTPUT)/                            \
-                   LIBBPF_DESTDIR=$(LIBBPF_DESTDIR)/                          \
-                   prefix= DESTDIR=$(abs_out)/ install-bin
+$(DEFAULT_BPFTOOL): | $(BPFTOOL_OUTPUT)
+       $(Q)$(MAKE) $(submake_extras) -C $(BPFTOOL_SRC) OUTPUT=$(BPFTOOL_OUTPUT)/ bootstrap
index ab688d8..83c7136 100644 (file)
@@ -419,35 +419,53 @@ void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock)
 #ifdef CONFIG_MEMCG_KMEM
 static void bpf_map_save_memcg(struct bpf_map *map)
 {
-       map->memcg = get_mem_cgroup_from_mm(current->mm);
+       /* Currently if a map is created by a process belonging to the root
+        * memory cgroup, get_obj_cgroup_from_current() will return NULL.
+        * So map->objcg has to be checked for NULL each time it is
+        * used.
+        */
+       map->objcg = get_obj_cgroup_from_current();
 }
 
 static void bpf_map_release_memcg(struct bpf_map *map)
 {
-       mem_cgroup_put(map->memcg);
+       if (map->objcg)
+               obj_cgroup_put(map->objcg);
+}
+
+static struct mem_cgroup *bpf_map_get_memcg(const struct bpf_map *map)
+{
+       if (map->objcg)
+               return get_mem_cgroup_from_objcg(map->objcg);
+
+       return root_mem_cgroup;
 }
 
 void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
                           int node)
 {
-       struct mem_cgroup *old_memcg;
+       struct mem_cgroup *memcg, *old_memcg;
        void *ptr;
 
-       old_memcg = set_active_memcg(map->memcg);
+       memcg = bpf_map_get_memcg(map);
+       old_memcg = set_active_memcg(memcg);
        ptr = kmalloc_node(size, flags | __GFP_ACCOUNT, node);
        set_active_memcg(old_memcg);
+       mem_cgroup_put(memcg);
 
        return ptr;
 }
 
 void *bpf_map_kzalloc(const struct bpf_map *map, size_t size, gfp_t flags)
 {
-       struct mem_cgroup *old_memcg;
+       struct mem_cgroup *memcg, *old_memcg;
        void *ptr;
 
-       old_memcg = set_active_memcg(map->memcg);
+       memcg = bpf_map_get_memcg(map);
+       old_memcg = set_active_memcg(memcg);
        ptr = kzalloc(size, flags | __GFP_ACCOUNT);
        set_active_memcg(old_memcg);
+       mem_cgroup_put(memcg);
 
        return ptr;
 }
@@ -455,12 +473,14 @@ void *bpf_map_kzalloc(const struct bpf_map *map, size_t size, gfp_t flags)
 void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size,
                                    size_t align, gfp_t flags)
 {
-       struct mem_cgroup *old_memcg;
+       struct mem_cgroup *memcg, *old_memcg;
        void __percpu *ptr;
 
-       old_memcg = set_active_memcg(map->memcg);
+       memcg = bpf_map_get_memcg(map);
+       old_memcg = set_active_memcg(memcg);
        ptr = __alloc_percpu_gfp(size, align, flags | __GFP_ACCOUNT);
        set_active_memcg(old_memcg);
+       mem_cgroup_put(memcg);
 
        return ptr;
 }
index 6cd2265..42e387a 100644 (file)
@@ -13,6 +13,7 @@
 #include <linux/static_call.h>
 #include <linux/bpf_verifier.h>
 #include <linux/bpf_lsm.h>
+#include <linux/delay.h>
 
 /* dummy _ops. The verifier will operate on target program's ops. */
 const struct bpf_verifier_ops bpf_extension_verifier_ops = {
@@ -29,6 +30,81 @@ static struct hlist_head trampoline_table[TRAMPOLINE_TABLE_SIZE];
 /* serializes access to trampoline_table */
 static DEFINE_MUTEX(trampoline_mutex);
 
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
+static int bpf_trampoline_update(struct bpf_trampoline *tr, bool lock_direct_mutex);
+
+static int bpf_tramp_ftrace_ops_func(struct ftrace_ops *ops, enum ftrace_ops_cmd cmd)
+{
+       struct bpf_trampoline *tr = ops->private;
+       int ret = 0;
+
+       if (cmd == FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY_SELF) {
+               /* This is called inside register_ftrace_direct_multi(), so
+                * tr->mutex is already locked.
+                */
+               lockdep_assert_held_once(&tr->mutex);
+
+               /* Instead of updating the trampoline here, we propagate
+                * -EAGAIN to register_ftrace_direct_multi(). Then we can
+                * retry register_ftrace_direct_multi() after updating the
+                * trampoline.
+                */
+               if ((tr->flags & BPF_TRAMP_F_CALL_ORIG) &&
+                   !(tr->flags & BPF_TRAMP_F_ORIG_STACK)) {
+                       if (WARN_ON_ONCE(tr->flags & BPF_TRAMP_F_SHARE_IPMODIFY))
+                               return -EBUSY;
+
+                       tr->flags |= BPF_TRAMP_F_SHARE_IPMODIFY;
+                       return -EAGAIN;
+               }
+
+               return 0;
+       }
+
+       /* The normal locking order is
+        *    tr->mutex => direct_mutex (ftrace.c) => ftrace_lock (ftrace.c)
+        *
+        * The following two commands are called from
+        *
+        *   prepare_direct_functions_for_ipmodify
+        *   cleanup_direct_functions_after_ipmodify
+        *
+        * In both cases, direct_mutex is already locked. Use
+        * mutex_trylock(&tr->mutex) to avoid a deadlock if something
+        * else is racing to change this same trampoline.
+        */
+       if (!mutex_trylock(&tr->mutex)) {
+               /* sleep 1 ms to make sure whatever is holding tr->mutex
+                * makes some progress.
+                */
+               msleep(1);
+               return -EAGAIN;
+       }
+
+       switch (cmd) {
+       case FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY_PEER:
+               tr->flags |= BPF_TRAMP_F_SHARE_IPMODIFY;
+
+               if ((tr->flags & BPF_TRAMP_F_CALL_ORIG) &&
+                   !(tr->flags & BPF_TRAMP_F_ORIG_STACK))
+                       ret = bpf_trampoline_update(tr, false /* lock_direct_mutex */);
+               break;
+       case FTRACE_OPS_CMD_DISABLE_SHARE_IPMODIFY_PEER:
+               tr->flags &= ~BPF_TRAMP_F_SHARE_IPMODIFY;
+
+               if (tr->flags & BPF_TRAMP_F_ORIG_STACK)
+                       ret = bpf_trampoline_update(tr, false /* lock_direct_mutex */);
+               break;
+       default:
+               ret = -EINVAL;
+               break;
+       }
+
+       mutex_unlock(&tr->mutex);
+       return ret;
+}
+#endif
+
 bool bpf_prog_has_trampoline(const struct bpf_prog *prog)
 {
        enum bpf_attach_type eatype = prog->expected_attach_type;
@@ -89,6 +165,16 @@ static struct bpf_trampoline *bpf_trampoline_lookup(u64 key)
        tr = kzalloc(sizeof(*tr), GFP_KERNEL);
        if (!tr)
                goto out;
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
+       tr->fops = kzalloc(sizeof(struct ftrace_ops), GFP_KERNEL);
+       if (!tr->fops) {
+               kfree(tr);
+               tr = NULL;
+               goto out;
+       }
+       tr->fops->private = tr;
+       tr->fops->ops_func = bpf_tramp_ftrace_ops_func;
+#endif
 
        tr->key = key;
        INIT_HLIST_NODE(&tr->hlist);
@@ -128,7 +214,7 @@ static int unregister_fentry(struct bpf_trampoline *tr, void *old_addr)
        int ret;
 
        if (tr->func.ftrace_managed)
-               ret = unregister_ftrace_direct((long)ip, (long)old_addr);
+               ret = unregister_ftrace_direct_multi(tr->fops, (long)old_addr);
        else
                ret = bpf_arch_text_poke(ip, BPF_MOD_CALL, old_addr, NULL);
 
@@ -137,15 +223,20 @@ static int unregister_fentry(struct bpf_trampoline *tr, void *old_addr)
        return ret;
 }
 
-static int modify_fentry(struct bpf_trampoline *tr, void *old_addr, void *new_addr)
+static int modify_fentry(struct bpf_trampoline *tr, void *old_addr, void *new_addr,
+                        bool lock_direct_mutex)
 {
        void *ip = tr->func.addr;
        int ret;
 
-       if (tr->func.ftrace_managed)
-               ret = modify_ftrace_direct((long)ip, (long)old_addr, (long)new_addr);
-       else
+       if (tr->func.ftrace_managed) {
+               if (lock_direct_mutex)
+                       ret = modify_ftrace_direct_multi(tr->fops, (long)new_addr);
+               else
+                       ret = modify_ftrace_direct_multi_nolock(tr->fops, (long)new_addr);
+       } else {
                ret = bpf_arch_text_poke(ip, BPF_MOD_CALL, old_addr, new_addr);
+       }
        return ret;
 }
 
@@ -163,10 +254,12 @@ static int register_fentry(struct bpf_trampoline *tr, void *new_addr)
        if (bpf_trampoline_module_get(tr))
                return -ENOENT;
 
-       if (tr->func.ftrace_managed)
-               ret = register_ftrace_direct((long)ip, (long)new_addr);
-       else
+       if (tr->func.ftrace_managed) {
+               ftrace_set_filter_ip(tr->fops, (unsigned long)ip, 0, 0);
+               ret = register_ftrace_direct_multi(tr->fops, (long)new_addr);
+       } else {
                ret = bpf_arch_text_poke(ip, BPF_MOD_CALL, NULL, new_addr);
+       }
 
        if (ret)
                bpf_trampoline_module_put(tr);
@@ -332,11 +425,11 @@ out:
        return ERR_PTR(err);
 }
 
-static int bpf_trampoline_update(struct bpf_trampoline *tr)
+static int bpf_trampoline_update(struct bpf_trampoline *tr, bool lock_direct_mutex)
 {
        struct bpf_tramp_image *im;
        struct bpf_tramp_links *tlinks;
-       u32 flags = BPF_TRAMP_F_RESTORE_REGS;
+       u32 orig_flags = tr->flags;
        bool ip_arg = false;
        int err, total;
 
@@ -358,15 +451,31 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr)
                goto out;
        }
 
+       /* clear all bits except SHARE_IPMODIFY */
+       tr->flags &= BPF_TRAMP_F_SHARE_IPMODIFY;
+
        if (tlinks[BPF_TRAMP_FEXIT].nr_links ||
-           tlinks[BPF_TRAMP_MODIFY_RETURN].nr_links)
-               flags = BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_SKIP_FRAME;
+           tlinks[BPF_TRAMP_MODIFY_RETURN].nr_links) {
+               /* NOTE: BPF_TRAMP_F_RESTORE_REGS and BPF_TRAMP_F_SKIP_FRAME
+                * should not be set together.
+                */
+               tr->flags |= BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_SKIP_FRAME;
+       } else {
+               tr->flags |= BPF_TRAMP_F_RESTORE_REGS;
+       }
 
        if (ip_arg)
-               flags |= BPF_TRAMP_F_IP_ARG;
+               tr->flags |= BPF_TRAMP_F_IP_ARG;
+
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
+again:
+       if ((tr->flags & BPF_TRAMP_F_SHARE_IPMODIFY) &&
+           (tr->flags & BPF_TRAMP_F_CALL_ORIG))
+               tr->flags |= BPF_TRAMP_F_ORIG_STACK;
+#endif
 
        err = arch_prepare_bpf_trampoline(im, im->image, im->image + PAGE_SIZE,
-                                         &tr->func.model, flags, tlinks,
+                                         &tr->func.model, tr->flags, tlinks,
                                          tr->func.addr);
        if (err < 0)
                goto out;
@@ -375,17 +484,34 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr)
        WARN_ON(!tr->cur_image && tr->selector);
        if (tr->cur_image)
                /* progs already running at this address */
-               err = modify_fentry(tr, tr->cur_image->image, im->image);
+               err = modify_fentry(tr, tr->cur_image->image, im->image, lock_direct_mutex);
        else
                /* first time registering */
                err = register_fentry(tr, im->image);
+
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
+       if (err == -EAGAIN) {
+               /* -EAGAIN from bpf_tramp_ftrace_ops_func. Now that
+                * BPF_TRAMP_F_SHARE_IPMODIFY is set, we can generate the
+                * trampoline again and retry the registration.
+                */
+               /* reset fops->func and fops->trampoline for re-registration */
+               tr->fops->func = NULL;
+               tr->fops->trampoline = 0;
+               goto again;
+       }
+#endif
        if (err)
                goto out;
+
        if (tr->cur_image)
                bpf_tramp_image_put(tr->cur_image);
        tr->cur_image = im;
        tr->selector++;
 out:
+       /* If any error happens, restore previous flags */
+       if (err)
+               tr->flags = orig_flags;
        kfree(tlinks);
        return err;
 }
@@ -451,7 +577,7 @@ static int __bpf_trampoline_link_prog(struct bpf_tramp_link *link, struct bpf_tr
 
        hlist_add_head(&link->tramp_hlist, &tr->progs_hlist[kind]);
        tr->progs_cnt[kind]++;
-       err = bpf_trampoline_update(tr);
+       err = bpf_trampoline_update(tr, true /* lock_direct_mutex */);
        if (err) {
                hlist_del_init(&link->tramp_hlist);
                tr->progs_cnt[kind]--;
@@ -484,7 +610,7 @@ static int __bpf_trampoline_unlink_prog(struct bpf_tramp_link *link, struct bpf_
        }
        hlist_del_init(&link->tramp_hlist);
        tr->progs_cnt[kind]--;
-       return bpf_trampoline_update(tr);
+       return bpf_trampoline_update(tr, true /* lock_direct_mutex */);
 }
 
 /* bpf_trampoline_unlink_prog() should never fail. */
@@ -498,7 +624,7 @@ int bpf_trampoline_unlink_prog(struct bpf_tramp_link *link, struct bpf_trampolin
        return err;
 }
 
-#if defined(CONFIG_BPF_JIT) && defined(CONFIG_BPF_SYSCALL)
+#if defined(CONFIG_CGROUP_BPF) && defined(CONFIG_BPF_LSM)
 static void bpf_shim_tramp_link_release(struct bpf_link *link)
 {
        struct bpf_shim_tramp_link *shim_link =
@@ -712,6 +838,7 @@ void bpf_trampoline_put(struct bpf_trampoline *tr)
         * multiple rcu callbacks.
         */
        hlist_del(&tr->hlist);
+       kfree(tr->fops);
        kfree(tr);
 out:
        mutex_unlock(&trampoline_mutex);
index 328cfab..096fdac 100644 (file)
@@ -5533,17 +5533,6 @@ static bool arg_type_is_mem_size(enum bpf_arg_type type)
               type == ARG_CONST_SIZE_OR_ZERO;
 }
 
-static bool arg_type_is_alloc_size(enum bpf_arg_type type)
-{
-       return type == ARG_CONST_ALLOC_SIZE_OR_ZERO;
-}
-
-static bool arg_type_is_int_ptr(enum bpf_arg_type type)
-{
-       return type == ARG_PTR_TO_INT ||
-              type == ARG_PTR_TO_LONG;
-}
-
 static bool arg_type_is_release(enum bpf_arg_type type)
 {
        return type & OBJ_RELEASE;
@@ -5929,7 +5918,8 @@ skip_type_check:
                meta->ref_obj_id = reg->ref_obj_id;
        }
 
-       if (arg_type == ARG_CONST_MAP_PTR) {
+       switch (base_type(arg_type)) {
+       case ARG_CONST_MAP_PTR:
                /* bpf_map_xxx(map_ptr) call: remember that map_ptr */
                if (meta->map_ptr) {
                        /* Use map_uid (which is unique id of inner map) to reject:
@@ -5954,7 +5944,8 @@ skip_type_check:
                }
                meta->map_ptr = reg->map_ptr;
                meta->map_uid = reg->map_uid;
-       } else if (arg_type == ARG_PTR_TO_MAP_KEY) {
+               break;
+       case ARG_PTR_TO_MAP_KEY:
                /* bpf_map_xxx(..., map_ptr, ..., key) call:
                 * check that [key, key + map->key_size) are within
                 * stack limits and initialized
@@ -5971,7 +5962,8 @@ skip_type_check:
                err = check_helper_mem_access(env, regno,
                                              meta->map_ptr->key_size, false,
                                              NULL);
-       } else if (base_type(arg_type) == ARG_PTR_TO_MAP_VALUE) {
+               break;
+       case ARG_PTR_TO_MAP_VALUE:
                if (type_may_be_null(arg_type) && register_is_null(reg))
                        return 0;
 
@@ -5987,14 +5979,16 @@ skip_type_check:
                err = check_helper_mem_access(env, regno,
                                              meta->map_ptr->value_size, false,
                                              meta);
-       } else if (arg_type == ARG_PTR_TO_PERCPU_BTF_ID) {
+               break;
+       case ARG_PTR_TO_PERCPU_BTF_ID:
                if (!reg->btf_id) {
                        verbose(env, "Helper has invalid btf_id in R%d\n", regno);
                        return -EACCES;
                }
                meta->ret_btf = reg->btf;
                meta->ret_btf_id = reg->btf_id;
-       } else if (arg_type == ARG_PTR_TO_SPIN_LOCK) {
+               break;
+       case ARG_PTR_TO_SPIN_LOCK:
                if (meta->func_id == BPF_FUNC_spin_lock) {
                        if (process_spin_lock(env, regno, true))
                                return -EACCES;
@@ -6005,12 +5999,15 @@ skip_type_check:
                        verbose(env, "verifier internal error\n");
                        return -EFAULT;
                }
-       } else if (arg_type == ARG_PTR_TO_TIMER) {
+               break;
+       case ARG_PTR_TO_TIMER:
                if (process_timer_func(env, regno, meta))
                        return -EACCES;
-       } else if (arg_type == ARG_PTR_TO_FUNC) {
+               break;
+       case ARG_PTR_TO_FUNC:
                meta->subprogno = reg->subprogno;
-       } else if (base_type(arg_type) == ARG_PTR_TO_MEM) {
+               break;
+       case ARG_PTR_TO_MEM:
                /* The access to this pointer is only checked when we hit the
                 * next is_mem_size argument below.
                 */
@@ -6020,11 +6017,14 @@ skip_type_check:
                                                      fn->arg_size[arg], false,
                                                      meta);
                }
-       } else if (arg_type_is_mem_size(arg_type)) {
-               bool zero_size_allowed = (arg_type == ARG_CONST_SIZE_OR_ZERO);
-
-               err = check_mem_size_reg(env, reg, regno, zero_size_allowed, meta);
-       } else if (arg_type_is_dynptr(arg_type)) {
+               break;
+       case ARG_CONST_SIZE:
+               err = check_mem_size_reg(env, reg, regno, false, meta);
+               break;
+       case ARG_CONST_SIZE_OR_ZERO:
+               err = check_mem_size_reg(env, reg, regno, true, meta);
+               break;
+       case ARG_PTR_TO_DYNPTR:
                if (arg_type & MEM_UNINIT) {
                        if (!is_dynptr_reg_valid_uninit(env, reg)) {
                                verbose(env, "Dynptr has to be an uninitialized dynptr\n");
@@ -6058,21 +6058,28 @@ skip_type_check:
                                err_extra, arg + 1);
                        return -EINVAL;
                }
-       } else if (arg_type_is_alloc_size(arg_type)) {
+               break;
+       case ARG_CONST_ALLOC_SIZE_OR_ZERO:
                if (!tnum_is_const(reg->var_off)) {
                        verbose(env, "R%d is not a known constant'\n",
                                regno);
                        return -EACCES;
                }
                meta->mem_size = reg->var_off.value;
-       } else if (arg_type_is_int_ptr(arg_type)) {
+               break;
+       case ARG_PTR_TO_INT:
+       case ARG_PTR_TO_LONG:
+       {
                int size = int_ptr_type_to_size(arg_type);
 
                err = check_helper_mem_access(env, regno, size, false, meta);
                if (err)
                        return err;
                err = check_ptr_alignment(env, reg, 0, size, true);
-       } else if (arg_type == ARG_PTR_TO_CONST_STR) {
+               break;
+       }
+       case ARG_PTR_TO_CONST_STR:
+       {
                struct bpf_map *map = reg->map_ptr;
                int map_off;
                u64 map_addr;
@@ -6111,9 +6118,12 @@ skip_type_check:
                        verbose(env, "string is not zero-terminated\n");
                        return -EINVAL;
                }
-       } else if (arg_type == ARG_PTR_TO_KPTR) {
+               break;
+       }
+       case ARG_PTR_TO_KPTR:
                if (process_kptr_func(env, regno, meta))
                        return -EACCES;
+               break;
        }
 
        return err;
@@ -7160,6 +7170,7 @@ static void update_loop_inline_state(struct bpf_verifier_env *env, u32 subprogno
 static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
                             int *insn_idx_p)
 {
+       enum bpf_prog_type prog_type = resolve_prog_type(env->prog);
        const struct bpf_func_proto *fn = NULL;
        enum bpf_return_type ret_type;
        enum bpf_type_flag ret_flag;
@@ -7321,7 +7332,8 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
                }
                break;
        case BPF_FUNC_set_retval:
-               if (env->prog->expected_attach_type == BPF_LSM_CGROUP) {
+               if (prog_type == BPF_PROG_TYPE_LSM &&
+                   env->prog->expected_attach_type == BPF_LSM_CGROUP) {
                        if (!env->prog->aux->attach_func_proto->type) {
                                /* Make sure programs that attach to void
                                 * hooks don't try to modify return value.
@@ -7550,6 +7562,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
        int err, insn_idx = *insn_idx_p;
        const struct btf_param *args;
        struct btf *desc_btf;
+       u32 *kfunc_flags;
        bool acq;
 
        /* skip for now, but return error when we find this in fixup_kfunc_call */
@@ -7565,18 +7578,16 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
        func_name = btf_name_by_offset(desc_btf, func->name_off);
        func_proto = btf_type_by_id(desc_btf, func->type);
 
-       if (!btf_kfunc_id_set_contains(desc_btf, resolve_prog_type(env->prog),
-                                     BTF_KFUNC_TYPE_CHECK, func_id)) {
+       kfunc_flags = btf_kfunc_id_set_contains(desc_btf, resolve_prog_type(env->prog), func_id);
+       if (!kfunc_flags) {
                verbose(env, "calling kernel function %s is not allowed\n",
                        func_name);
                return -EACCES;
        }
-
-       acq = btf_kfunc_id_set_contains(desc_btf, resolve_prog_type(env->prog),
-                                       BTF_KFUNC_TYPE_ACQUIRE, func_id);
+       acq = *kfunc_flags & KF_ACQUIRE;
 
        /* Check the arguments */
-       err = btf_check_kfunc_arg_match(env, desc_btf, func_id, regs);
+       err = btf_check_kfunc_arg_match(env, desc_btf, func_id, regs, *kfunc_flags);
        if (err < 0)
                return err;
        /* In case of release function, we get register number of refcounted
@@ -7620,8 +7631,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
                regs[BPF_REG_0].btf = desc_btf;
                regs[BPF_REG_0].type = PTR_TO_BTF_ID;
                regs[BPF_REG_0].btf_id = ptr_type_id;
-               if (btf_kfunc_id_set_contains(desc_btf, resolve_prog_type(env->prog),
-                                             BTF_KFUNC_TYPE_RET_NULL, func_id)) {
+               if (*kfunc_flags & KF_RET_NULL) {
                        regs[BPF_REG_0].type |= PTR_MAYBE_NULL;
                        /* For mark_ptr_or_null_reg, see 93c230e3f5bd6 */
                        regs[BPF_REG_0].id = ++env->id_gen;
@@ -12562,6 +12572,7 @@ static bool is_tracing_prog_type(enum bpf_prog_type type)
        case BPF_PROG_TYPE_TRACEPOINT:
        case BPF_PROG_TYPE_PERF_EVENT:
        case BPF_PROG_TYPE_RAW_TRACEPOINT:
+       case BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE:
                return true;
        default:
                return false;
@@ -13620,6 +13631,7 @@ static int jit_subprogs(struct bpf_verifier_env *env)
                /* Below members will be freed only at prog->aux */
                func[i]->aux->btf = prog->aux->btf;
                func[i]->aux->func_info = prog->aux->func_info;
+               func[i]->aux->func_info_cnt = prog->aux->func_info_cnt;
                func[i]->aux->poke_tab = prog->aux->poke_tab;
                func[i]->aux->size_poke_tab = prog->aux->size_poke_tab;
 
@@ -13632,9 +13644,6 @@ static int jit_subprogs(struct bpf_verifier_env *env)
                                poke->aux = func[i]->aux;
                }
 
-               /* Use bpf_prog_F_tag to indicate functions in stack traces.
-                * Long term would need debug info to populate names
-                */
                func[i]->aux->name[0] = 'F';
                func[i]->aux->stack_depth = env->subprog_info[i].stack_depth;
                func[i]->jit_requested = 1;
index fbdf8d3..79a8583 100644 (file)
@@ -30,6 +30,7 @@
 #include <linux/module.h>
 #include <linux/kernel.h>
 #include <linux/bsearch.h>
+#include <linux/btf_ids.h>
 
 /*
  * These will be re-linked against their real values
@@ -799,6 +800,96 @@ static const struct seq_operations kallsyms_op = {
        .show = s_show
 };
 
+#ifdef CONFIG_BPF_SYSCALL
+
+struct bpf_iter__ksym {
+       __bpf_md_ptr(struct bpf_iter_meta *, meta);
+       __bpf_md_ptr(struct kallsym_iter *, ksym);
+};
+
+static int ksym_prog_seq_show(struct seq_file *m, bool in_stop)
+{
+       struct bpf_iter__ksym ctx;
+       struct bpf_iter_meta meta;
+       struct bpf_prog *prog;
+
+       meta.seq = m;
+       prog = bpf_iter_get_info(&meta, in_stop);
+       if (!prog)
+               return 0;
+
+       ctx.meta = &meta;
+       ctx.ksym = m ? m->private : NULL;
+       return bpf_iter_run_prog(prog, &ctx);
+}
+
+static int bpf_iter_ksym_seq_show(struct seq_file *m, void *p)
+{
+       return ksym_prog_seq_show(m, false);
+}
+
+static void bpf_iter_ksym_seq_stop(struct seq_file *m, void *p)
+{
+       if (!p)
+               (void) ksym_prog_seq_show(m, true);
+       else
+               s_stop(m, p);
+}
+
+static const struct seq_operations bpf_iter_ksym_ops = {
+       .start = s_start,
+       .next = s_next,
+       .stop = bpf_iter_ksym_seq_stop,
+       .show = bpf_iter_ksym_seq_show,
+};
+
+static int bpf_iter_ksym_init(void *priv_data, struct bpf_iter_aux_info *aux)
+{
+       struct kallsym_iter *iter = priv_data;
+
+       reset_iter(iter, 0);
+
+       /* cache here as in kallsyms_open() case; use current process
+        * credentials to tell BPF iterators if values should be shown.
+        */
+       iter->show_value = kallsyms_show_value(current_cred());
+
+       return 0;
+}
+
+DEFINE_BPF_ITER_FUNC(ksym, struct bpf_iter_meta *meta, struct kallsym_iter *ksym)
+
+static const struct bpf_iter_seq_info ksym_iter_seq_info = {
+       .seq_ops                = &bpf_iter_ksym_ops,
+       .init_seq_private       = bpf_iter_ksym_init,
+       .fini_seq_private       = NULL,
+       .seq_priv_size          = sizeof(struct kallsym_iter),
+};
+
+static struct bpf_iter_reg ksym_iter_reg_info = {
+       .target                 = "ksym",
+       .feature                = BPF_ITER_RESCHED,
+       .ctx_arg_info_size      = 1,
+       .ctx_arg_info           = {
+               { offsetof(struct bpf_iter__ksym, ksym),
+                 PTR_TO_BTF_ID_OR_NULL },
+       },
+       .seq_info               = &ksym_iter_seq_info,
+};
+
+BTF_ID_LIST(btf_ksym_iter_id)
+BTF_ID(struct, kallsym_iter)
+
+static int __init bpf_ksym_iter_register(void)
+{
+       ksym_iter_reg_info.ctx_arg_info[0].btf_id = *btf_ksym_iter_id;
+       return bpf_iter_reg_target(&ksym_iter_reg_info);
+}
+
+late_initcall(bpf_ksym_iter_register);
+
+#endif /* CONFIG_BPF_SYSCALL */
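
With the ksym target registered above, a BPF iterator program can walk the kernel symbol table. A minimal BPF-side sketch, assuming a vmlinux.h generated from a kernel carrying this patch (so struct bpf_iter__ksym and struct kallsym_iter are available); the program name is illustrative:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

SEC("iter/ksym")
int dump_ksym(struct bpf_iter__ksym *ctx)
{
	struct seq_file *seq = ctx->meta->seq;
	struct kallsym_iter *iter = ctx->ksym;

	if (!iter)
		return 0;

	/* show_value was filled in from the opener's credentials above */
	if (!iter->show_value)
		return 0;

	BPF_SEQ_PRINTF(seq, "0x%llx %c %s\n",
		       (__u64)iter->value, iter->type, iter->name);
	return 0;
}

On the user-space side such a program would typically be attached with bpf_program__attach_iter() and read through a descriptor from bpf_iter_create(), or pinned in bpffs and read like a regular file.
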
+
 static inline int kallsyms_for_perf(void)
 {
 #ifdef CONFIG_PERF_EVENTS
index 601ccf1..bc921a3 100644 (file)
@@ -1861,6 +1861,8 @@ static void ftrace_hash_rec_enable_modify(struct ftrace_ops *ops,
        ftrace_hash_rec_update_modify(ops, filter_hash, 1);
 }
 
+static bool ops_references_ip(struct ftrace_ops *ops, unsigned long ip);
+
 /*
  * Try to update IPMODIFY flag on each ftrace_rec. Return 0 if it is OK
  * or no-needed to update, -EBUSY if it detects a conflict of the flag
@@ -1869,6 +1871,13 @@ static void ftrace_hash_rec_enable_modify(struct ftrace_ops *ops,
  *  - If the hash is NULL, it hits all recs (if IPMODIFY is set, this is rejected)
  *  - If the hash is EMPTY_HASH, it hits nothing
  *  - Anything else hits the recs which match the hash entries.
+ *
+ * A DIRECT ops does not have the IPMODIFY flag, but we still need to check
+ * it against functions with FTRACE_FL_IPMODIFY set. If there is any overlap,
+ * call ops_func(SHARE_IPMODIFY_SELF) to make sure the current ops can share
+ * with IPMODIFY. If ops_func(SHARE_IPMODIFY_SELF) returns non-zero, propagate
+ * the return value to the caller and eventually to the owner of the DIRECT
+ * ops.
  */
 static int __ftrace_hash_update_ipmodify(struct ftrace_ops *ops,
                                         struct ftrace_hash *old_hash,
@@ -1877,17 +1886,26 @@ static int __ftrace_hash_update_ipmodify(struct ftrace_ops *ops,
        struct ftrace_page *pg;
        struct dyn_ftrace *rec, *end = NULL;
        int in_old, in_new;
+       bool is_ipmodify, is_direct;
 
        /* Only update if the ops has been registered */
        if (!(ops->flags & FTRACE_OPS_FL_ENABLED))
                return 0;
 
-       if (!(ops->flags & FTRACE_OPS_FL_IPMODIFY))
+       is_ipmodify = ops->flags & FTRACE_OPS_FL_IPMODIFY;
+       is_direct = ops->flags & FTRACE_OPS_FL_DIRECT;
+
+       /* neither IPMODIFY nor DIRECT, skip */
+       if (!is_ipmodify && !is_direct)
+               return 0;
+
+       if (WARN_ON_ONCE(is_ipmodify && is_direct))
                return 0;
 
        /*
-        * Since the IPMODIFY is a very address sensitive action, we do not
-        * allow ftrace_ops to set all functions to new hash.
+        * Since IPMODIFY and DIRECT are very address-sensitive actions, we
+        * do not allow ftrace_ops to set all functions to a new hash.
         */
        if (!new_hash || !old_hash)
                return -EINVAL;
@@ -1905,12 +1923,32 @@ static int __ftrace_hash_update_ipmodify(struct ftrace_ops *ops,
                        continue;
 
                if (in_new) {
-                       /* New entries must ensure no others are using it */
-                       if (rec->flags & FTRACE_FL_IPMODIFY)
-                               goto rollback;
-                       rec->flags |= FTRACE_FL_IPMODIFY;
-               } else /* Removed entry */
+                       if (rec->flags & FTRACE_FL_IPMODIFY) {
+                               int ret;
+
+                               /* Cannot have two ipmodify on same rec */
+                               if (is_ipmodify)
+                                       goto rollback;
+
+                               FTRACE_WARN_ON(rec->flags & FTRACE_FL_DIRECT);
+
+                               /*
+                                * Another ops with IPMODIFY is already
+                                * attached. We are now attaching a direct
+                                * ops. Run SHARE_IPMODIFY_SELF, to check
+                                * whether sharing is supported.
+                                */
+                               if (!ops->ops_func)
+                                       return -EBUSY;
+                               ret = ops->ops_func(ops, FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY_SELF);
+                               if (ret)
+                                       return ret;
+                       } else if (is_ipmodify) {
+                               rec->flags |= FTRACE_FL_IPMODIFY;
+                       }
+               } else if (is_ipmodify) {
                        rec->flags &= ~FTRACE_FL_IPMODIFY;
+               }
        } while_for_each_ftrace_rec();
 
        return 0;
@@ -2454,8 +2492,7 @@ static void call_direct_funcs(unsigned long ip, unsigned long pip,
 
 struct ftrace_ops direct_ops = {
        .func           = call_direct_funcs,
-       .flags          = FTRACE_OPS_FL_IPMODIFY
-                         | FTRACE_OPS_FL_DIRECT | FTRACE_OPS_FL_SAVE_REGS
+       .flags          = FTRACE_OPS_FL_DIRECT | FTRACE_OPS_FL_SAVE_REGS
                          | FTRACE_OPS_FL_PERMANENT,
        /*
         * By declaring the main trampoline as this trampoline
@@ -3072,14 +3109,14 @@ static inline int ops_traces_mod(struct ftrace_ops *ops)
 }
 
 /*
- * Check if the current ops references the record.
+ * Check if the current ops references the given ip.
  *
  * If the ops traces all functions, then it was already accounted for.
  * If the ops does not trace the current record function, skip it.
  * If the ops ignores the function via notrace filter, skip it.
  */
-static inline bool
-ops_references_rec(struct ftrace_ops *ops, struct dyn_ftrace *rec)
+static bool
+ops_references_ip(struct ftrace_ops *ops, unsigned long ip)
 {
        /* If ops isn't enabled, ignore it */
        if (!(ops->flags & FTRACE_OPS_FL_ENABLED))
@@ -3091,16 +3128,29 @@ ops_references_rec(struct ftrace_ops *ops, struct dyn_ftrace *rec)
 
        /* The function must be in the filter */
        if (!ftrace_hash_empty(ops->func_hash->filter_hash) &&
-           !__ftrace_lookup_ip(ops->func_hash->filter_hash, rec->ip))
+           !__ftrace_lookup_ip(ops->func_hash->filter_hash, ip))
                return false;
 
        /* If in notrace hash, we ignore it too */
-       if (ftrace_lookup_ip(ops->func_hash->notrace_hash, rec->ip))
+       if (ftrace_lookup_ip(ops->func_hash->notrace_hash, ip))
                return false;
 
        return true;
 }
 
+/*
+ * Check if the current ops references the record.
+ *
+ * If the ops traces all functions, then it was already accounted for.
+ * If the ops does not trace the current record function, skip it.
+ * If the ops ignores the function via notrace filter, skip it.
+ */
+static bool
+ops_references_rec(struct ftrace_ops *ops, struct dyn_ftrace *rec)
+{
+       return ops_references_ip(ops, rec->ip);
+}
+
 static int ftrace_update_code(struct module *mod, struct ftrace_page *new_pgs)
 {
        bool init_nop = ftrace_need_init_nop();
@@ -5215,6 +5265,8 @@ static struct ftrace_direct_func *ftrace_alloc_direct_func(unsigned long addr)
        return direct;
 }
 
+static int register_ftrace_function_nolock(struct ftrace_ops *ops);
+
 /**
  * register_ftrace_direct - Call a custom trampoline directly
  * @ip: The address of the nop at the beginning of a function
@@ -5286,7 +5338,7 @@ int register_ftrace_direct(unsigned long ip, unsigned long addr)
        ret = ftrace_set_filter_ip(&direct_ops, ip, 0, 0);
 
        if (!ret && !(direct_ops.flags & FTRACE_OPS_FL_ENABLED)) {
-               ret = register_ftrace_function(&direct_ops);
+               ret = register_ftrace_function_nolock(&direct_ops);
                if (ret)
                        ftrace_set_filter_ip(&direct_ops, ip, 1, 0);
        }
@@ -5545,8 +5597,7 @@ int modify_ftrace_direct(unsigned long ip,
 }
 EXPORT_SYMBOL_GPL(modify_ftrace_direct);
 
-#define MULTI_FLAGS (FTRACE_OPS_FL_IPMODIFY | FTRACE_OPS_FL_DIRECT | \
-                    FTRACE_OPS_FL_SAVE_REGS)
+#define MULTI_FLAGS (FTRACE_OPS_FL_DIRECT | FTRACE_OPS_FL_SAVE_REGS)
 
 static int check_direct_multi(struct ftrace_ops *ops)
 {
@@ -5639,7 +5690,7 @@ int register_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr)
        ops->flags = MULTI_FLAGS;
        ops->trampoline = FTRACE_REGS_ADDR;
 
-       err = register_ftrace_function(ops);
+       err = register_ftrace_function_nolock(ops);
 
  out_remove:
        if (err)
@@ -5691,22 +5742,8 @@ int unregister_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr)
 }
 EXPORT_SYMBOL_GPL(unregister_ftrace_direct_multi);
 
-/**
- * modify_ftrace_direct_multi - Modify an existing direct 'multi' call
- * to call something else
- * @ops: The address of the struct ftrace_ops object
- * @addr: The address of the new trampoline to call at @ops functions
- *
- * This is used to unregister currently registered direct caller and
- * register new one @addr on functions registered in @ops object.
- *
- * Note there's window between ftrace_shutdown and ftrace_startup calls
- * where there will be no callbacks called.
- *
- * Returns: zero on success. Non zero on error, which includes:
- *  -EINVAL - The @ops object was not properly registered.
- */
-int modify_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr)
+static int
+__modify_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr)
 {
        struct ftrace_hash *hash;
        struct ftrace_func_entry *entry, *iter;
@@ -5717,20 +5754,15 @@ int modify_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr)
        int i, size;
        int err;
 
-       if (check_direct_multi(ops))
-               return -EINVAL;
-       if (!(ops->flags & FTRACE_OPS_FL_ENABLED))
-               return -EINVAL;
-
-       mutex_lock(&direct_mutex);
+       lockdep_assert_held_once(&direct_mutex);
 
        /* Enable the tmp_ops to have the same functions as the direct ops */
        ftrace_ops_init(&tmp_ops);
        tmp_ops.func_hash = ops->func_hash;
 
-       err = register_ftrace_function(&tmp_ops);
+       err = register_ftrace_function_nolock(&tmp_ops);
        if (err)
-               goto out_direct;
+               return err;
 
        /*
         * Now the ftrace_ops_list_func() is called to do the direct callers.
@@ -5754,7 +5786,64 @@ int modify_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr)
        /* Removing the tmp_ops will add the updated direct callers to the functions */
        unregister_ftrace_function(&tmp_ops);
 
- out_direct:
+       return err;
+}
+
+/**
+ * modify_ftrace_direct_multi_nolock - Modify an existing direct 'multi' call
+ * to call something else
+ * @ops: The address of the struct ftrace_ops object
+ * @addr: The address of the new trampoline to call at @ops functions
+ *
+ * This is used to unregister the currently registered direct caller and
+ * register a new one, @addr, on the functions registered in the @ops object.
+ *
+ * Note there's a window between the ftrace_shutdown and ftrace_startup calls
+ * where no callbacks will be called.
+ *
+ * The caller must already hold direct_mutex, so we don't take
+ * direct_mutex here.
+ *
+ * Returns: zero on success. Non zero on error, which includes:
+ *  -EINVAL - The @ops object was not properly registered.
+ */
+int modify_ftrace_direct_multi_nolock(struct ftrace_ops *ops, unsigned long addr)
+{
+       if (check_direct_multi(ops))
+               return -EINVAL;
+       if (!(ops->flags & FTRACE_OPS_FL_ENABLED))
+               return -EINVAL;
+
+       return __modify_ftrace_direct_multi(ops, addr);
+}
+EXPORT_SYMBOL_GPL(modify_ftrace_direct_multi_nolock);
+
+/**
+ * modify_ftrace_direct_multi - Modify an existing direct 'multi' call
+ * to call something else
+ * @ops: The address of the struct ftrace_ops object
+ * @addr: The address of the new trampoline to call at @ops functions
+ *
+ * This is used to unregister the currently registered direct caller and
+ * register a new one, @addr, on the functions registered in the @ops object.
+ *
+ * Note there's a window between the ftrace_shutdown and ftrace_startup calls
+ * where no callbacks will be called.
+ *
+ * Returns: zero on success. Non zero on error, which includes:
+ *  -EINVAL - The @ops object was not properly registered.
+ */
+int modify_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr)
+{
+       int err;
+
+       if (check_direct_multi(ops))
+               return -EINVAL;
+       if (!(ops->flags & FTRACE_OPS_FL_ENABLED))
+               return -EINVAL;
+
+       mutex_lock(&direct_mutex);
+       err = __modify_ftrace_direct_multi(ops, addr);
        mutex_unlock(&direct_mutex);
        return err;
 }
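
Taken together with register_ftrace_direct_multi() and unregister_ftrace_direct_multi(), this is the interface the BPF trampoline code earlier in this series moves to. A minimal sketch of the expected calling sequence for a generic direct-call user (names are illustrative; error handling trimmed):

#include <linux/ftrace.h>

static struct ftrace_ops example_direct_ops;

static int example_attach(unsigned long ip, unsigned long tramp)
{
	int err;

	/* select which function(s) this ops covers */
	err = ftrace_set_filter_ip(&example_direct_ops, ip, 0, 0);
	if (err)
		return err;

	/* install the direct trampoline; takes direct_mutex internally */
	return register_ftrace_direct_multi(&example_direct_ops, tramp);
}

static int example_swap(unsigned long new_tramp)
{
	/* redirect all covered functions to a new trampoline */
	return modify_ftrace_direct_multi(&example_direct_ops, new_tramp);
}

static void example_detach(unsigned long tramp)
{
	unregister_ftrace_direct_multi(&example_direct_ops, tramp);
}
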
@@ -7965,6 +8054,143 @@ int ftrace_is_dead(void)
        return ftrace_disabled;
 }
 
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
+/*
+ * When registering an ftrace_ops with IPMODIFY, it is necessary to make sure
+ * it doesn't conflict with any direct ftrace_ops. If there is an existing
+ * direct ftrace_ops on a kernel function being patched, call
+ * FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY_PEER on it to enable sharing.
+ *
+ * @ops:     ftrace_ops being registered.
+ *
+ * Returns:
+ *         0 on success;
+ *         Negative on failure.
+ */
+static int prepare_direct_functions_for_ipmodify(struct ftrace_ops *ops)
+{
+       struct ftrace_func_entry *entry;
+       struct ftrace_hash *hash;
+       struct ftrace_ops *op;
+       int size, i, ret;
+
+       lockdep_assert_held_once(&direct_mutex);
+
+       if (!(ops->flags & FTRACE_OPS_FL_IPMODIFY))
+               return 0;
+
+       hash = ops->func_hash->filter_hash;
+       size = 1 << hash->size_bits;
+       for (i = 0; i < size; i++) {
+               hlist_for_each_entry(entry, &hash->buckets[i], hlist) {
+                       unsigned long ip = entry->ip;
+                       bool found_op = false;
+
+                       mutex_lock(&ftrace_lock);
+                       do_for_each_ftrace_op(op, ftrace_ops_list) {
+                               if (!(op->flags & FTRACE_OPS_FL_DIRECT))
+                                       continue;
+                               if (ops_references_ip(op, ip)) {
+                                       found_op = true;
+                                       break;
+                               }
+                       } while_for_each_ftrace_op(op);
+                       mutex_unlock(&ftrace_lock);
+
+                       if (found_op) {
+                               if (!op->ops_func)
+                                       return -EBUSY;
+
+                               ret = op->ops_func(op, FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY_PEER);
+                               if (ret)
+                                       return ret;
+                       }
+               }
+       }
+
+       return 0;
+}
+
+/*
+ * Similar to prepare_direct_functions_for_ipmodify, clean up after an ops
+ * with IPMODIFY is unregistered. The cleanup is optional for most DIRECT
+ * ops.
+ */
+static void cleanup_direct_functions_after_ipmodify(struct ftrace_ops *ops)
+{
+       struct ftrace_func_entry *entry;
+       struct ftrace_hash *hash;
+       struct ftrace_ops *op;
+       int size, i;
+
+       if (!(ops->flags & FTRACE_OPS_FL_IPMODIFY))
+               return;
+
+       mutex_lock(&direct_mutex);
+
+       hash = ops->func_hash->filter_hash;
+       size = 1 << hash->size_bits;
+       for (i = 0; i < size; i++) {
+               hlist_for_each_entry(entry, &hash->buckets[i], hlist) {
+                       unsigned long ip = entry->ip;
+                       bool found_op = false;
+
+                       mutex_lock(&ftrace_lock);
+                       do_for_each_ftrace_op(op, ftrace_ops_list) {
+                               if (!(op->flags & FTRACE_OPS_FL_DIRECT))
+                                       continue;
+                               if (ops_references_ip(op, ip)) {
+                                       found_op = true;
+                                       break;
+                               }
+                       } while_for_each_ftrace_op(op);
+                       mutex_unlock(&ftrace_lock);
+
+                       /* The cleanup is optional, ignore any errors */
+                       if (found_op && op->ops_func)
+                               op->ops_func(op, FTRACE_OPS_CMD_DISABLE_SHARE_IPMODIFY_PEER);
+               }
+       }
+       mutex_unlock(&direct_mutex);
+}
+
+#define lock_direct_mutex()    mutex_lock(&direct_mutex)
+#define unlock_direct_mutex()  mutex_unlock(&direct_mutex)
+
+#else  /* CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS */
+
+static int prepare_direct_functions_for_ipmodify(struct ftrace_ops *ops)
+{
+       return 0;
+}
+
+static void cleanup_direct_functions_after_ipmodify(struct ftrace_ops *ops)
+{
+}
+
+#define lock_direct_mutex()    do { } while (0)
+#define unlock_direct_mutex()  do { } while (0)
+
+#endif  /* CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS */
+
+/*
+ * Similar to register_ftrace_function, except we don't lock direct_mutex.
+ */
+static int register_ftrace_function_nolock(struct ftrace_ops *ops)
+{
+       int ret;
+
+       ftrace_ops_init(ops);
+
+       mutex_lock(&ftrace_lock);
+
+       ret = ftrace_startup(ops, 0);
+
+       mutex_unlock(&ftrace_lock);
+
+       return ret;
+}
+
 /**
  * register_ftrace_function - register a function for profiling
  * @ops:       ops structure that holds the function for profiling.
@@ -7980,14 +8206,15 @@ int register_ftrace_function(struct ftrace_ops *ops)
 {
        int ret;
 
-       ftrace_ops_init(ops);
-
-       mutex_lock(&ftrace_lock);
-
-       ret = ftrace_startup(ops, 0);
+       lock_direct_mutex();
+       ret = prepare_direct_functions_for_ipmodify(ops);
+       if (ret < 0)
+               goto out_unlock;
 
-       mutex_unlock(&ftrace_lock);
+       ret = register_ftrace_function_nolock(ops);
 
+out_unlock:
+       unlock_direct_mutex();
        return ret;
 }
 EXPORT_SYMBOL_GPL(register_ftrace_function);
@@ -8006,6 +8233,7 @@ int unregister_ftrace_function(struct ftrace_ops *ops)
        ret = ftrace_shutdown(ops, 0);
        mutex_unlock(&ftrace_lock);
 
+       cleanup_direct_functions_after_ipmodify(ops);
        return ret;
 }
 EXPORT_SYMBOL_GPL(unregister_ftrace_function);
index 2ca96ac..cbc9cd5 100644 (file)
@@ -691,52 +691,35 @@ noinline void bpf_kfunc_call_test_mem_len_fail2(u64 *mem, int len)
 {
 }
 
+noinline void bpf_kfunc_call_test_ref(struct prog_test_ref_kfunc *p)
+{
+}
+
 __diag_pop();
 
 ALLOW_ERROR_INJECTION(bpf_modify_return_test, ERRNO);
 
-BTF_SET_START(test_sk_check_kfunc_ids)
-BTF_ID(func, bpf_kfunc_call_test1)
-BTF_ID(func, bpf_kfunc_call_test2)
-BTF_ID(func, bpf_kfunc_call_test3)
-BTF_ID(func, bpf_kfunc_call_test_acquire)
-BTF_ID(func, bpf_kfunc_call_memb_acquire)
-BTF_ID(func, bpf_kfunc_call_test_release)
-BTF_ID(func, bpf_kfunc_call_memb_release)
-BTF_ID(func, bpf_kfunc_call_memb1_release)
-BTF_ID(func, bpf_kfunc_call_test_kptr_get)
-BTF_ID(func, bpf_kfunc_call_test_pass_ctx)
-BTF_ID(func, bpf_kfunc_call_test_pass1)
-BTF_ID(func, bpf_kfunc_call_test_pass2)
-BTF_ID(func, bpf_kfunc_call_test_fail1)
-BTF_ID(func, bpf_kfunc_call_test_fail2)
-BTF_ID(func, bpf_kfunc_call_test_fail3)
-BTF_ID(func, bpf_kfunc_call_test_mem_len_pass1)
-BTF_ID(func, bpf_kfunc_call_test_mem_len_fail1)
-BTF_ID(func, bpf_kfunc_call_test_mem_len_fail2)
-BTF_SET_END(test_sk_check_kfunc_ids)
-
-BTF_SET_START(test_sk_acquire_kfunc_ids)
-BTF_ID(func, bpf_kfunc_call_test_acquire)
-BTF_ID(func, bpf_kfunc_call_memb_acquire)
-BTF_ID(func, bpf_kfunc_call_test_kptr_get)
-BTF_SET_END(test_sk_acquire_kfunc_ids)
-
-BTF_SET_START(test_sk_release_kfunc_ids)
-BTF_ID(func, bpf_kfunc_call_test_release)
-BTF_ID(func, bpf_kfunc_call_memb_release)
-BTF_ID(func, bpf_kfunc_call_memb1_release)
-BTF_SET_END(test_sk_release_kfunc_ids)
-
-BTF_SET_START(test_sk_ret_null_kfunc_ids)
-BTF_ID(func, bpf_kfunc_call_test_acquire)
-BTF_ID(func, bpf_kfunc_call_memb_acquire)
-BTF_ID(func, bpf_kfunc_call_test_kptr_get)
-BTF_SET_END(test_sk_ret_null_kfunc_ids)
-
-BTF_SET_START(test_sk_kptr_acquire_kfunc_ids)
-BTF_ID(func, bpf_kfunc_call_test_kptr_get)
-BTF_SET_END(test_sk_kptr_acquire_kfunc_ids)
+BTF_SET8_START(test_sk_check_kfunc_ids)
+BTF_ID_FLAGS(func, bpf_kfunc_call_test1)
+BTF_ID_FLAGS(func, bpf_kfunc_call_test2)
+BTF_ID_FLAGS(func, bpf_kfunc_call_test3)
+BTF_ID_FLAGS(func, bpf_kfunc_call_test_acquire, KF_ACQUIRE | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_kfunc_call_memb_acquire, KF_ACQUIRE | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_kfunc_call_test_release, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_kfunc_call_memb_release, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_kfunc_call_memb1_release, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_kfunc_call_test_kptr_get, KF_ACQUIRE | KF_RET_NULL | KF_KPTR_GET)
+BTF_ID_FLAGS(func, bpf_kfunc_call_test_pass_ctx)
+BTF_ID_FLAGS(func, bpf_kfunc_call_test_pass1)
+BTF_ID_FLAGS(func, bpf_kfunc_call_test_pass2)
+BTF_ID_FLAGS(func, bpf_kfunc_call_test_fail1)
+BTF_ID_FLAGS(func, bpf_kfunc_call_test_fail2)
+BTF_ID_FLAGS(func, bpf_kfunc_call_test_fail3)
+BTF_ID_FLAGS(func, bpf_kfunc_call_test_mem_len_pass1)
+BTF_ID_FLAGS(func, bpf_kfunc_call_test_mem_len_fail1)
+BTF_ID_FLAGS(func, bpf_kfunc_call_test_mem_len_fail2)
+BTF_ID_FLAGS(func, bpf_kfunc_call_test_ref, KF_TRUSTED_ARGS)
+BTF_SET8_END(test_sk_check_kfunc_ids)
 
 static void *bpf_test_init(const union bpf_attr *kattr, u32 user_size,
                           u32 size, u32 headroom, u32 tailroom)
@@ -955,6 +938,9 @@ static int convert___skb_to_skb(struct sk_buff *skb, struct __sk_buff *__skb)
 {
        struct qdisc_skb_cb *cb = (struct qdisc_skb_cb *)skb->cb;
 
+       if (!skb->len)
+               return -EINVAL;
+
        if (!__skb)
                return 0;
 
@@ -1617,12 +1603,8 @@ out:
 }
 
 static const struct btf_kfunc_id_set bpf_prog_test_kfunc_set = {
-       .owner        = THIS_MODULE,
-       .check_set        = &test_sk_check_kfunc_ids,
-       .acquire_set      = &test_sk_acquire_kfunc_ids,
-       .release_set      = &test_sk_release_kfunc_ids,
-       .ret_null_set     = &test_sk_ret_null_kfunc_ids,
-       .kptr_acquire_set = &test_sk_kptr_acquire_kfunc_ids
+       .owner = THIS_MODULE,
+       .set   = &test_sk_check_kfunc_ids,
 };
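
The set above still has to be registered for one or more program types before the verifier will consult it; that call sits outside this hunk. A minimal sketch of the registration step (the program type and init hook chosen here are assumptions, not taken from this diff):

#include <linux/bpf.h>
#include <linux/btf.h>
#include <linux/init.h>

static int __init example_register_test_kfuncs(void)
{
	/* make the kfuncs in test_sk_check_kfunc_ids callable from tc programs */
	return register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS,
					 &bpf_prog_test_kfunc_set);
}
late_initcall(example_register_test_kfuncs);
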
 
 BTF_ID_LIST(bpf_prog_test_dtor_kfunc_ids)
index d588fd0..716df64 100644 (file)
@@ -4168,6 +4168,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
        bool again = false;
 
        skb_reset_mac_header(skb);
+       skb_assert_len(skb);
 
        if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
                __skb_tstamp_tx(skb, NULL, NULL, skb->sk, SCM_TSTAMP_SCHED);
index a0c6109..57c5e4c 100644 (file)
@@ -237,7 +237,7 @@ BPF_CALL_2(bpf_skb_load_helper_8_no_cache, const struct sk_buff *, skb,
 BPF_CALL_4(bpf_skb_load_helper_16, const struct sk_buff *, skb, const void *,
           data, int, headlen, int, offset)
 {
-       u16 tmp, *ptr;
+       __be16 tmp, *ptr;
        const int len = sizeof(tmp);
 
        if (offset >= 0) {
@@ -264,7 +264,7 @@ BPF_CALL_2(bpf_skb_load_helper_16_no_cache, const struct sk_buff *, skb,
 BPF_CALL_4(bpf_skb_load_helper_32, const struct sk_buff *, skb, const void *,
           data, int, headlen, int, offset)
 {
-       u32 tmp, *ptr;
+       __be32 tmp, *ptr;
        const int len = sizeof(tmp);
 
        if (likely(offset >= 0)) {
index 266d3b7..8162789 100644 (file)
@@ -462,7 +462,7 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
 
                        if (copied == len)
                                break;
-               } while (i != msg_rx->sg.end);
+               } while (!sg_is_last(sge));
 
                if (unlikely(peek)) {
                        msg_rx = sk_psock_next_msg(psock, msg_rx);
@@ -472,7 +472,7 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
                }
 
                msg_rx->sg.start = i;
-               if (!sge->length && msg_rx->sg.start == msg_rx->sg.end) {
+               if (!sge->length && sg_is_last(sge)) {
                        msg_rx = sk_psock_dequeue_msg(psock);
                        kfree_sk_msg(msg_rx);
                }
index 7a18163..85a9e50 100644 (file)
@@ -197,17 +197,17 @@ bpf_tcp_ca_get_func_proto(enum bpf_func_id func_id,
        }
 }
 
-BTF_SET_START(bpf_tcp_ca_check_kfunc_ids)
-BTF_ID(func, tcp_reno_ssthresh)
-BTF_ID(func, tcp_reno_cong_avoid)
-BTF_ID(func, tcp_reno_undo_cwnd)
-BTF_ID(func, tcp_slow_start)
-BTF_ID(func, tcp_cong_avoid_ai)
-BTF_SET_END(bpf_tcp_ca_check_kfunc_ids)
+BTF_SET8_START(bpf_tcp_ca_check_kfunc_ids)
+BTF_ID_FLAGS(func, tcp_reno_ssthresh)
+BTF_ID_FLAGS(func, tcp_reno_cong_avoid)
+BTF_ID_FLAGS(func, tcp_reno_undo_cwnd)
+BTF_ID_FLAGS(func, tcp_slow_start)
+BTF_ID_FLAGS(func, tcp_cong_avoid_ai)
+BTF_SET8_END(bpf_tcp_ca_check_kfunc_ids)
 
 static const struct btf_kfunc_id_set bpf_tcp_ca_kfunc_set = {
-       .owner     = THIS_MODULE,
-       .check_set = &bpf_tcp_ca_check_kfunc_ids,
+       .owner = THIS_MODULE,
+       .set   = &bpf_tcp_ca_check_kfunc_ids,
 };
 
 static const struct bpf_verifier_ops bpf_tcp_ca_verifier_ops = {
index 075e744..54eec33 100644 (file)
@@ -1154,24 +1154,24 @@ static struct tcp_congestion_ops tcp_bbr_cong_ops __read_mostly = {
        .set_state      = bbr_set_state,
 };
 
-BTF_SET_START(tcp_bbr_check_kfunc_ids)
+BTF_SET8_START(tcp_bbr_check_kfunc_ids)
 #ifdef CONFIG_X86
 #ifdef CONFIG_DYNAMIC_FTRACE
-BTF_ID(func, bbr_init)
-BTF_ID(func, bbr_main)
-BTF_ID(func, bbr_sndbuf_expand)
-BTF_ID(func, bbr_undo_cwnd)
-BTF_ID(func, bbr_cwnd_event)
-BTF_ID(func, bbr_ssthresh)
-BTF_ID(func, bbr_min_tso_segs)
-BTF_ID(func, bbr_set_state)
+BTF_ID_FLAGS(func, bbr_init)
+BTF_ID_FLAGS(func, bbr_main)
+BTF_ID_FLAGS(func, bbr_sndbuf_expand)
+BTF_ID_FLAGS(func, bbr_undo_cwnd)
+BTF_ID_FLAGS(func, bbr_cwnd_event)
+BTF_ID_FLAGS(func, bbr_ssthresh)
+BTF_ID_FLAGS(func, bbr_min_tso_segs)
+BTF_ID_FLAGS(func, bbr_set_state)
 #endif
 #endif
-BTF_SET_END(tcp_bbr_check_kfunc_ids)
+BTF_SET8_END(tcp_bbr_check_kfunc_ids)
 
 static const struct btf_kfunc_id_set tcp_bbr_kfunc_set = {
-       .owner     = THIS_MODULE,
-       .check_set = &tcp_bbr_check_kfunc_ids,
+       .owner = THIS_MODULE,
+       .set   = &tcp_bbr_check_kfunc_ids,
 };
 
 static int __init bbr_register(void)
index 68178e7..768c10c 100644 (file)
@@ -485,22 +485,22 @@ static struct tcp_congestion_ops cubictcp __read_mostly = {
        .name           = "cubic",
 };
 
-BTF_SET_START(tcp_cubic_check_kfunc_ids)
+BTF_SET8_START(tcp_cubic_check_kfunc_ids)
 #ifdef CONFIG_X86
 #ifdef CONFIG_DYNAMIC_FTRACE
-BTF_ID(func, cubictcp_init)
-BTF_ID(func, cubictcp_recalc_ssthresh)
-BTF_ID(func, cubictcp_cong_avoid)
-BTF_ID(func, cubictcp_state)
-BTF_ID(func, cubictcp_cwnd_event)
-BTF_ID(func, cubictcp_acked)
+BTF_ID_FLAGS(func, cubictcp_init)
+BTF_ID_FLAGS(func, cubictcp_recalc_ssthresh)
+BTF_ID_FLAGS(func, cubictcp_cong_avoid)
+BTF_ID_FLAGS(func, cubictcp_state)
+BTF_ID_FLAGS(func, cubictcp_cwnd_event)
+BTF_ID_FLAGS(func, cubictcp_acked)
 #endif
 #endif
-BTF_SET_END(tcp_cubic_check_kfunc_ids)
+BTF_SET8_END(tcp_cubic_check_kfunc_ids)
 
 static const struct btf_kfunc_id_set tcp_cubic_kfunc_set = {
-       .owner     = THIS_MODULE,
-       .check_set = &tcp_cubic_check_kfunc_ids,
+       .owner = THIS_MODULE,
+       .set   = &tcp_cubic_check_kfunc_ids,
 };
 
 static int __init cubictcp_register(void)
index ab034a4..2a6c0dd 100644 (file)
@@ -239,22 +239,22 @@ static struct tcp_congestion_ops dctcp_reno __read_mostly = {
        .name           = "dctcp-reno",
 };
 
-BTF_SET_START(tcp_dctcp_check_kfunc_ids)
+BTF_SET8_START(tcp_dctcp_check_kfunc_ids)
 #ifdef CONFIG_X86
 #ifdef CONFIG_DYNAMIC_FTRACE
-BTF_ID(func, dctcp_init)
-BTF_ID(func, dctcp_update_alpha)
-BTF_ID(func, dctcp_cwnd_event)
-BTF_ID(func, dctcp_ssthresh)
-BTF_ID(func, dctcp_cwnd_undo)
-BTF_ID(func, dctcp_state)
+BTF_ID_FLAGS(func, dctcp_init)
+BTF_ID_FLAGS(func, dctcp_update_alpha)
+BTF_ID_FLAGS(func, dctcp_cwnd_event)
+BTF_ID_FLAGS(func, dctcp_ssthresh)
+BTF_ID_FLAGS(func, dctcp_cwnd_undo)
+BTF_ID_FLAGS(func, dctcp_state)
 #endif
 #endif
-BTF_SET_END(tcp_dctcp_check_kfunc_ids)
+BTF_SET8_END(tcp_dctcp_check_kfunc_ids)
 
 static const struct btf_kfunc_id_set tcp_dctcp_kfunc_set = {
-       .owner     = THIS_MODULE,
-       .check_set = &tcp_dctcp_check_kfunc_ids,
+       .owner = THIS_MODULE,
+       .set   = &tcp_dctcp_check_kfunc_ids,
 };
 
 static int __init dctcp_register(void)
index bc4d5cd..1cd87b2 100644 (file)
@@ -55,57 +55,131 @@ enum {
        NF_BPF_CT_OPTS_SZ = 12,
 };
 
-static struct nf_conn *__bpf_nf_ct_lookup(struct net *net,
-                                         struct bpf_sock_tuple *bpf_tuple,
-                                         u32 tuple_len, u8 protonum,
-                                         s32 netns_id, u8 *dir)
+static int bpf_nf_ct_tuple_parse(struct bpf_sock_tuple *bpf_tuple,
+                                u32 tuple_len, u8 protonum, u8 dir,
+                                struct nf_conntrack_tuple *tuple)
 {
-       struct nf_conntrack_tuple_hash *hash;
-       struct nf_conntrack_tuple tuple;
-       struct nf_conn *ct;
+       union nf_inet_addr *src = dir ? &tuple->dst.u3 : &tuple->src.u3;
+       union nf_inet_addr *dst = dir ? &tuple->src.u3 : &tuple->dst.u3;
+       union nf_conntrack_man_proto *sport = dir ? (void *)&tuple->dst.u
+                                                 : &tuple->src.u;
+       union nf_conntrack_man_proto *dport = dir ? &tuple->src.u
+                                                 : (void *)&tuple->dst.u;
 
        if (unlikely(protonum != IPPROTO_TCP && protonum != IPPROTO_UDP))
-               return ERR_PTR(-EPROTO);
-       if (unlikely(netns_id < BPF_F_CURRENT_NETNS))
-               return ERR_PTR(-EINVAL);
+               return -EPROTO;
+
+       memset(tuple, 0, sizeof(*tuple));
 
-       memset(&tuple, 0, sizeof(tuple));
        switch (tuple_len) {
        case sizeof(bpf_tuple->ipv4):
-               tuple.src.l3num = AF_INET;
-               tuple.src.u3.ip = bpf_tuple->ipv4.saddr;
-               tuple.src.u.tcp.port = bpf_tuple->ipv4.sport;
-               tuple.dst.u3.ip = bpf_tuple->ipv4.daddr;
-               tuple.dst.u.tcp.port = bpf_tuple->ipv4.dport;
+               tuple->src.l3num = AF_INET;
+               src->ip = bpf_tuple->ipv4.saddr;
+               sport->tcp.port = bpf_tuple->ipv4.sport;
+               dst->ip = bpf_tuple->ipv4.daddr;
+               dport->tcp.port = bpf_tuple->ipv4.dport;
                break;
        case sizeof(bpf_tuple->ipv6):
-               tuple.src.l3num = AF_INET6;
-               memcpy(tuple.src.u3.ip6, bpf_tuple->ipv6.saddr, sizeof(bpf_tuple->ipv6.saddr));
-               tuple.src.u.tcp.port = bpf_tuple->ipv6.sport;
-               memcpy(tuple.dst.u3.ip6, bpf_tuple->ipv6.daddr, sizeof(bpf_tuple->ipv6.daddr));
-               tuple.dst.u.tcp.port = bpf_tuple->ipv6.dport;
+               tuple->src.l3num = AF_INET6;
+               memcpy(src->ip6, bpf_tuple->ipv6.saddr, sizeof(bpf_tuple->ipv6.saddr));
+               sport->tcp.port = bpf_tuple->ipv6.sport;
+               memcpy(dst->ip6, bpf_tuple->ipv6.daddr, sizeof(bpf_tuple->ipv6.daddr));
+               dport->tcp.port = bpf_tuple->ipv6.dport;
                break;
        default:
-               return ERR_PTR(-EAFNOSUPPORT);
+               return -EAFNOSUPPORT;
+       }
+       tuple->dst.protonum = protonum;
+       tuple->dst.dir = dir;
+
+       return 0;
+}
+
+static struct nf_conn *
+__bpf_nf_ct_alloc_entry(struct net *net, struct bpf_sock_tuple *bpf_tuple,
+                       u32 tuple_len, struct bpf_ct_opts *opts, u32 opts_len,
+                       u32 timeout)
+{
+       struct nf_conntrack_tuple otuple, rtuple;
+       struct nf_conn *ct;
+       int err;
+
+       if (!opts || !bpf_tuple || opts->reserved[0] || opts->reserved[1] ||
+           opts_len != NF_BPF_CT_OPTS_SZ)
+               return ERR_PTR(-EINVAL);
+
+       if (unlikely(opts->netns_id < BPF_F_CURRENT_NETNS))
+               return ERR_PTR(-EINVAL);
+
+       err = bpf_nf_ct_tuple_parse(bpf_tuple, tuple_len, opts->l4proto,
+                                   IP_CT_DIR_ORIGINAL, &otuple);
+       if (err < 0)
+               return ERR_PTR(err);
+
+       err = bpf_nf_ct_tuple_parse(bpf_tuple, tuple_len, opts->l4proto,
+                                   IP_CT_DIR_REPLY, &rtuple);
+       if (err < 0)
+               return ERR_PTR(err);
+
+       if (opts->netns_id >= 0) {
+               net = get_net_ns_by_id(net, opts->netns_id);
+               if (unlikely(!net))
+                       return ERR_PTR(-ENONET);
        }
 
-       tuple.dst.protonum = protonum;
+       ct = nf_conntrack_alloc(net, &nf_ct_zone_dflt, &otuple, &rtuple,
+                               GFP_ATOMIC);
+       if (IS_ERR(ct))
+               goto out;
+
+       memset(&ct->proto, 0, sizeof(ct->proto));
+       __nf_ct_set_timeout(ct, timeout * HZ);
+       ct->status |= IPS_CONFIRMED;
+
+out:
+       if (opts->netns_id >= 0)
+               put_net(net);
+
+       return ct;
+}
+
+static struct nf_conn *__bpf_nf_ct_lookup(struct net *net,
+                                         struct bpf_sock_tuple *bpf_tuple,
+                                         u32 tuple_len, struct bpf_ct_opts *opts,
+                                         u32 opts_len)
+{
+       struct nf_conntrack_tuple_hash *hash;
+       struct nf_conntrack_tuple tuple;
+       struct nf_conn *ct;
+       int err;
+
+       if (!opts || !bpf_tuple || opts->reserved[0] || opts->reserved[1] ||
+           opts_len != NF_BPF_CT_OPTS_SZ)
+               return ERR_PTR(-EINVAL);
+       if (unlikely(opts->l4proto != IPPROTO_TCP && opts->l4proto != IPPROTO_UDP))
+               return ERR_PTR(-EPROTO);
+       if (unlikely(opts->netns_id < BPF_F_CURRENT_NETNS))
+               return ERR_PTR(-EINVAL);
+
+       err = bpf_nf_ct_tuple_parse(bpf_tuple, tuple_len, opts->l4proto,
+                                   IP_CT_DIR_ORIGINAL, &tuple);
+       if (err < 0)
+               return ERR_PTR(err);
 
-       if (netns_id >= 0) {
-               net = get_net_ns_by_id(net, netns_id);
+       if (opts->netns_id >= 0) {
+               net = get_net_ns_by_id(net, opts->netns_id);
                if (unlikely(!net))
                        return ERR_PTR(-ENONET);
        }
 
        hash = nf_conntrack_find_get(net, &nf_ct_zone_dflt, &tuple);
-       if (netns_id >= 0)
+       if (opts->netns_id >= 0)
                put_net(net);
        if (!hash)
                return ERR_PTR(-ENOENT);
 
        ct = nf_ct_tuplehash_to_ctrack(hash);
-       if (dir)
-               *dir = NF_CT_DIRECTION(hash);
+       opts->dir = NF_CT_DIRECTION(hash);
 
        return ct;
 }
@@ -114,6 +188,43 @@ __diag_push();
 __diag_ignore_all("-Wmissing-prototypes",
                  "Global functions as their definitions will be in nf_conntrack BTF");
 
+struct nf_conn___init {
+       struct nf_conn ct;
+};
+
+/* bpf_xdp_ct_alloc - Allocate a new CT entry
+ *
+ * Parameters:
+ * @xdp_ctx    - Pointer to ctx (xdp_md) in XDP program
+ *                 Cannot be NULL
+ * @bpf_tuple  - Pointer to memory representing the tuple to look up
+ *                 Cannot be NULL
+ * @tuple__sz  - Length of the tuple structure
+ *                 Must be one of sizeof(bpf_tuple->ipv4) or
+ *                 sizeof(bpf_tuple->ipv6)
+ * @opts       - Additional options for allocation (documented above)
+ *                 Cannot be NULL
+ * @opts__sz   - Length of the bpf_ct_opts structure
+ *                 Must be NF_BPF_CT_OPTS_SZ (12)
+ */
+struct nf_conn___init *
+bpf_xdp_ct_alloc(struct xdp_md *xdp_ctx, struct bpf_sock_tuple *bpf_tuple,
+                u32 tuple__sz, struct bpf_ct_opts *opts, u32 opts__sz)
+{
+       struct xdp_buff *ctx = (struct xdp_buff *)xdp_ctx;
+       struct nf_conn *nfct;
+
+       nfct = __bpf_nf_ct_alloc_entry(dev_net(ctx->rxq->dev), bpf_tuple, tuple__sz,
+                                      opts, opts__sz, 10);
+       if (IS_ERR(nfct)) {
+               if (opts)
+                       opts->error = PTR_ERR(nfct);
+               return NULL;
+       }
+
+       return (struct nf_conn___init *)nfct;
+}
+
 /* bpf_xdp_ct_lookup - Lookup CT entry for the given tuple, and acquire a
  *                    reference to it
  *
@@ -138,25 +249,50 @@ bpf_xdp_ct_lookup(struct xdp_md *xdp_ctx, struct bpf_sock_tuple *bpf_tuple,
        struct net *caller_net;
        struct nf_conn *nfct;
 
-       BUILD_BUG_ON(sizeof(struct bpf_ct_opts) != NF_BPF_CT_OPTS_SZ);
-
-       if (!opts)
-               return NULL;
-       if (!bpf_tuple || opts->reserved[0] || opts->reserved[1] ||
-           opts__sz != NF_BPF_CT_OPTS_SZ) {
-               opts->error = -EINVAL;
-               return NULL;
-       }
        caller_net = dev_net(ctx->rxq->dev);
-       nfct = __bpf_nf_ct_lookup(caller_net, bpf_tuple, tuple__sz, opts->l4proto,
-                                 opts->netns_id, &opts->dir);
+       nfct = __bpf_nf_ct_lookup(caller_net, bpf_tuple, tuple__sz, opts, opts__sz);
        if (IS_ERR(nfct)) {
-               opts->error = PTR_ERR(nfct);
+               if (opts)
+                       opts->error = PTR_ERR(nfct);
                return NULL;
        }
        return nfct;
 }
 
+/* bpf_skb_ct_alloc - Allocate a new CT entry
+ *
+ * Parameters:
+ * @skb_ctx    - Pointer to ctx (__sk_buff) in TC program
+ *                 Cannot be NULL
+ * @bpf_tuple  - Pointer to memory representing the tuple to look up
+ *                 Cannot be NULL
+ * @tuple__sz  - Length of the tuple structure
+ *                 Must be one of sizeof(bpf_tuple->ipv4) or
+ *                 sizeof(bpf_tuple->ipv6)
+ * @opts       - Additional options for allocation (documented above)
+ *                 Cannot be NULL
+ * @opts__sz   - Length of the bpf_ct_opts structure
+ *                 Must be NF_BPF_CT_OPTS_SZ (12)
+ */
+struct nf_conn___init *
+bpf_skb_ct_alloc(struct __sk_buff *skb_ctx, struct bpf_sock_tuple *bpf_tuple,
+                u32 tuple__sz, struct bpf_ct_opts *opts, u32 opts__sz)
+{
+       struct sk_buff *skb = (struct sk_buff *)skb_ctx;
+       struct nf_conn *nfct;
+       struct net *net;
+
+       net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
+       nfct = __bpf_nf_ct_alloc_entry(net, bpf_tuple, tuple__sz, opts, opts__sz, 10);
+       if (IS_ERR(nfct)) {
+               if (opts)
+                       opts->error = PTR_ERR(nfct);
+               return NULL;
+       }
+
+       return (struct nf_conn___init *)nfct;
+}
+
 /* bpf_skb_ct_lookup - Lookup CT entry for the given tuple, and acquire a
  *                    reference to it
  *
@@ -181,20 +317,31 @@ bpf_skb_ct_lookup(struct __sk_buff *skb_ctx, struct bpf_sock_tuple *bpf_tuple,
        struct net *caller_net;
        struct nf_conn *nfct;
 
-       BUILD_BUG_ON(sizeof(struct bpf_ct_opts) != NF_BPF_CT_OPTS_SZ);
-
-       if (!opts)
-               return NULL;
-       if (!bpf_tuple || opts->reserved[0] || opts->reserved[1] ||
-           opts__sz != NF_BPF_CT_OPTS_SZ) {
-               opts->error = -EINVAL;
-               return NULL;
-       }
        caller_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
-       nfct = __bpf_nf_ct_lookup(caller_net, bpf_tuple, tuple__sz, opts->l4proto,
-                                 opts->netns_id, &opts->dir);
+       nfct = __bpf_nf_ct_lookup(caller_net, bpf_tuple, tuple__sz, opts, opts__sz);
        if (IS_ERR(nfct)) {
-               opts->error = PTR_ERR(nfct);
+               if (opts)
+                       opts->error = PTR_ERR(nfct);
+               return NULL;
+       }
+       return nfct;
+}
+
+/* bpf_ct_insert_entry - Add the provided entry into a CT map
+ *
+ * This must be invoked for referenced PTR_TO_BTF_ID.
+ *
+ * @nfct        - Pointer to referenced nf_conn___init object, obtained
+ *                using bpf_xdp_ct_alloc or bpf_skb_ct_alloc.
+ */
+struct nf_conn *bpf_ct_insert_entry(struct nf_conn___init *nfct_i)
+{
+       struct nf_conn *nfct = (struct nf_conn *)nfct_i;
+       int err;
+
+       err = nf_conntrack_hash_check_insert(nfct);
+       if (err < 0) {
+               nf_conntrack_free(nfct);
                return NULL;
        }
        return nfct;
@@ -217,50 +364,90 @@ void bpf_ct_release(struct nf_conn *nfct)
        nf_ct_put(nfct);
 }
 
+/* bpf_ct_set_timeout - Set timeout of allocated nf_conn
+ *
+ * Sets the default timeout of a newly allocated nf_conn before insertion.
+ * This helper must be invoked for refcounted pointer to nf_conn___init.
+ *
+ * Parameters:
+ * @nfct        - Pointer to referenced nf_conn object, obtained using
+ *                 bpf_xdp_ct_alloc or bpf_skb_ct_alloc.
+ * @timeout      - Timeout in msecs.
+ */
+void bpf_ct_set_timeout(struct nf_conn___init *nfct, u32 timeout)
+{
+       __nf_ct_set_timeout((struct nf_conn *)nfct, msecs_to_jiffies(timeout));
+}
+
+/* bpf_ct_change_timeout - Change timeout of inserted nf_conn
+ *
+ * Change the timeout associated with the inserted or looked up nf_conn.
+ * This helper must be invoked for refcounted pointer to nf_conn.
+ *
+ * Parameters:
+ * @nfct        - Pointer to referenced nf_conn object, obtained using
+ *                bpf_ct_insert_entry, bpf_xdp_ct_lookup, or bpf_skb_ct_lookup.
+ * @timeout      - New timeout in msecs.
+ */
+int bpf_ct_change_timeout(struct nf_conn *nfct, u32 timeout)
+{
+       return __nf_ct_change_timeout(nfct, msecs_to_jiffies(timeout));
+}
+
+/* bpf_ct_set_status - Set status field of allocated nf_conn
+ *
+ * Set the status field of the newly allocated nf_conn before insertion.
+ * This must be invoked for referenced PTR_TO_BTF_ID to nf_conn___init.
+ *
+ * Parameters:
+ * @nfct        - Pointer to referenced nf_conn object, obtained using
+ *                bpf_xdp_ct_alloc or bpf_skb_ct_alloc.
+ * @status       - New status value.
+ */
+int bpf_ct_set_status(const struct nf_conn___init *nfct, u32 status)
+{
+       return nf_ct_change_status_common((struct nf_conn *)nfct, status);
+}
+
+/* bpf_ct_change_status - Change status of inserted nf_conn
+ *
+ * Change the status field of the provided connection tracking entry.
+ * This must be invoked for referenced PTR_TO_BTF_ID to nf_conn.
+ *
+ * Parameters:
+ * @nfct        - Pointer to referenced nf_conn object, obtained using
+ *                bpf_ct_insert_entry, bpf_xdp_ct_lookup or bpf_skb_ct_lookup.
+ * @status       - New status value.
+ */
+int bpf_ct_change_status(struct nf_conn *nfct, u32 status)
+{
+       return nf_ct_change_status_common(nfct, status);
+}
+
 __diag_pop()
 
-BTF_SET_START(nf_ct_xdp_check_kfunc_ids)
-BTF_ID(func, bpf_xdp_ct_lookup)
-BTF_ID(func, bpf_ct_release)
-BTF_SET_END(nf_ct_xdp_check_kfunc_ids)
-
-BTF_SET_START(nf_ct_tc_check_kfunc_ids)
-BTF_ID(func, bpf_skb_ct_lookup)
-BTF_ID(func, bpf_ct_release)
-BTF_SET_END(nf_ct_tc_check_kfunc_ids)
-
-BTF_SET_START(nf_ct_acquire_kfunc_ids)
-BTF_ID(func, bpf_xdp_ct_lookup)
-BTF_ID(func, bpf_skb_ct_lookup)
-BTF_SET_END(nf_ct_acquire_kfunc_ids)
-
-BTF_SET_START(nf_ct_release_kfunc_ids)
-BTF_ID(func, bpf_ct_release)
-BTF_SET_END(nf_ct_release_kfunc_ids)
-
-/* Both sets are identical */
-#define nf_ct_ret_null_kfunc_ids nf_ct_acquire_kfunc_ids
-
-static const struct btf_kfunc_id_set nf_conntrack_xdp_kfunc_set = {
-       .owner        = THIS_MODULE,
-       .check_set    = &nf_ct_xdp_check_kfunc_ids,
-       .acquire_set  = &nf_ct_acquire_kfunc_ids,
-       .release_set  = &nf_ct_release_kfunc_ids,
-       .ret_null_set = &nf_ct_ret_null_kfunc_ids,
-};
+BTF_SET8_START(nf_ct_kfunc_set)
+BTF_ID_FLAGS(func, bpf_xdp_ct_alloc, KF_ACQUIRE | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_xdp_ct_lookup, KF_ACQUIRE | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_skb_ct_alloc, KF_ACQUIRE | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_skb_ct_lookup, KF_ACQUIRE | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_ct_insert_entry, KF_ACQUIRE | KF_RET_NULL | KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_ct_release, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_ct_set_timeout, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_ct_change_timeout, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_ct_set_status, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_ct_change_status, KF_TRUSTED_ARGS)
+BTF_SET8_END(nf_ct_kfunc_set)
 
-static const struct btf_kfunc_id_set nf_conntrack_tc_kfunc_set = {
-       .owner        = THIS_MODULE,
-       .check_set    = &nf_ct_tc_check_kfunc_ids,
-       .acquire_set  = &nf_ct_acquire_kfunc_ids,
-       .release_set  = &nf_ct_release_kfunc_ids,
-       .ret_null_set = &nf_ct_ret_null_kfunc_ids,
+static const struct btf_kfunc_id_set nf_conntrack_kfunc_set = {
+       .owner = THIS_MODULE,
+       .set   = &nf_ct_kfunc_set,
 };
 
 int register_nf_conntrack_bpf(void)
 {
        int ret;
 
-       ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &nf_conntrack_xdp_kfunc_set);
-       return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &nf_conntrack_tc_kfunc_set);
+       ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &nf_conntrack_kfunc_set);
+       return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &nf_conntrack_kfunc_set);
 }
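
A minimal sketch of how an XDP program could consume these kfuncs end to end; this is not part of the diff. The __ksym extern declarations simply mirror the signatures above, the tuple/address values are placeholders, and struct bpf_ct_opts / struct nf_conn___init are assumed to be available via vmlinux.h or a local CO-RE definition (as the selftests added in this series do).

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

char LICENSE[] SEC("license") = "GPL";

/* kfunc prototypes, assumed declarations mirroring the definitions above */
struct nf_conn___init *
bpf_xdp_ct_alloc(struct xdp_md *xdp_ctx, struct bpf_sock_tuple *bpf_tuple,
		 __u32 tuple__sz, struct bpf_ct_opts *opts, __u32 opts__sz) __ksym;
struct nf_conn *bpf_ct_insert_entry(struct nf_conn___init *nfct_i) __ksym;
void bpf_ct_set_timeout(struct nf_conn___init *nfct, __u32 timeout) __ksym;
int bpf_ct_set_status(const struct nf_conn___init *nfct, __u32 status) __ksym;
void bpf_ct_release(struct nf_conn *nfct) __ksym;

SEC("xdp")
int xdp_add_ct_entry(struct xdp_md *ctx)
{
	struct bpf_ct_opts opts = { .netns_id = -1, .l4proto = IPPROTO_TCP };
	struct bpf_sock_tuple tup = {
		.ipv4 = {
			.saddr = bpf_htonl(0x01010101),	/* 1.1.1.1, placeholder */
			.daddr = bpf_htonl(0x02020202),	/* 2.2.2.2, placeholder */
			.sport = bpf_htons(12345),
			.dport = bpf_htons(80),
		},
	};
	struct nf_conn___init *ct_i;
	struct nf_conn *ct;

	ct_i = bpf_xdp_ct_alloc(ctx, &tup, sizeof(tup.ipv4), &opts, sizeof(opts));
	if (!ct_i)
		return XDP_PASS;	/* opts.error carries the reason */

	bpf_ct_set_timeout(ct_i, 30000);	/* 30s, argument is in msecs */
	/* CONFIRMED is already set by the alloc path and cannot be cleared;
	 * SEEN_REPLY may be added per nf_ct_change_status_common() rules.
	 */
	bpf_ct_set_status(ct_i, IPS_CONFIRMED | IPS_SEEN_REPLY);

	ct = bpf_ct_insert_entry(ct_i);	/* consumes the ct_i reference */
	if (ct)
		bpf_ct_release(ct);	/* drop the reference returned on success */
	return XDP_PASS;
}

The TC flavor is identical apart from using bpf_skb_ct_alloc()/bpf_skb_ct_lookup() with a __sk_buff context.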
index 8c97d06..71c2f4f 100644 (file)
@@ -2806,3 +2806,65 @@ err_expect:
        free_percpu(net->ct.stat);
        return ret;
 }
+
+#if (IS_BUILTIN(CONFIG_NF_CONNTRACK) && IS_ENABLED(CONFIG_DEBUG_INFO_BTF)) || \
+    (IS_MODULE(CONFIG_NF_CONNTRACK) && IS_ENABLED(CONFIG_DEBUG_INFO_BTF_MODULES) || \
+    IS_ENABLED(CONFIG_NF_CT_NETLINK))
+
+/* ctnetlink code shared by both ctnetlink and nf_conntrack_bpf */
+
+int __nf_ct_change_timeout(struct nf_conn *ct, u64 timeout)
+{
+       if (test_bit(IPS_FIXED_TIMEOUT_BIT, &ct->status))
+               return -EPERM;
+
+       __nf_ct_set_timeout(ct, timeout);
+
+       if (test_bit(IPS_DYING_BIT, &ct->status))
+               return -ETIME;
+
+       return 0;
+}
+EXPORT_SYMBOL_GPL(__nf_ct_change_timeout);
+
+void __nf_ct_change_status(struct nf_conn *ct, unsigned long on, unsigned long off)
+{
+       unsigned int bit;
+
+       /* Ignore these unchangeable bits */
+       on &= ~IPS_UNCHANGEABLE_MASK;
+       off &= ~IPS_UNCHANGEABLE_MASK;
+
+       for (bit = 0; bit < __IPS_MAX_BIT; bit++) {
+               if (on & (1 << bit))
+                       set_bit(bit, &ct->status);
+               else if (off & (1 << bit))
+                       clear_bit(bit, &ct->status);
+       }
+}
+EXPORT_SYMBOL_GPL(__nf_ct_change_status);
+
+int nf_ct_change_status_common(struct nf_conn *ct, unsigned int status)
+{
+       unsigned long d;
+
+       d = ct->status ^ status;
+
+       if (d & (IPS_EXPECTED|IPS_CONFIRMED|IPS_DYING))
+               /* unchangeable */
+               return -EBUSY;
+
+       if (d & IPS_SEEN_REPLY && !(status & IPS_SEEN_REPLY))
+               /* SEEN_REPLY bit can only be set */
+               return -EBUSY;
+
+       if (d & IPS_ASSURED && !(status & IPS_ASSURED))
+               /* ASSURED bit can only be set */
+               return -EBUSY;
+
+       __nf_ct_change_status(ct, status, 0);
+       return 0;
+}
+EXPORT_SYMBOL_GPL(nf_ct_change_status_common);
+
+#endif
index f8dd4ed..04169b5 100644 (file)
@@ -1891,45 +1891,10 @@ ctnetlink_parse_nat_setup(struct nf_conn *ct,
 }
 #endif
 
-static void
-__ctnetlink_change_status(struct nf_conn *ct, unsigned long on,
-                         unsigned long off)
-{
-       unsigned int bit;
-
-       /* Ignore these unchangable bits */
-       on &= ~IPS_UNCHANGEABLE_MASK;
-       off &= ~IPS_UNCHANGEABLE_MASK;
-
-       for (bit = 0; bit < __IPS_MAX_BIT; bit++) {
-               if (on & (1 << bit))
-                       set_bit(bit, &ct->status);
-               else if (off & (1 << bit))
-                       clear_bit(bit, &ct->status);
-       }
-}
-
 static int
 ctnetlink_change_status(struct nf_conn *ct, const struct nlattr * const cda[])
 {
-       unsigned long d;
-       unsigned int status = ntohl(nla_get_be32(cda[CTA_STATUS]));
-       d = ct->status ^ status;
-
-       if (d & (IPS_EXPECTED|IPS_CONFIRMED|IPS_DYING))
-               /* unchangeable */
-               return -EBUSY;
-
-       if (d & IPS_SEEN_REPLY && !(status & IPS_SEEN_REPLY))
-               /* SEEN_REPLY bit can only be set */
-               return -EBUSY;
-
-       if (d & IPS_ASSURED && !(status & IPS_ASSURED))
-               /* ASSURED bit can only be set */
-               return -EBUSY;
-
-       __ctnetlink_change_status(ct, status, 0);
-       return 0;
+       return nf_ct_change_status_common(ct, ntohl(nla_get_be32(cda[CTA_STATUS])));
 }
 
 static int
@@ -2024,16 +1989,7 @@ static int ctnetlink_change_helper(struct nf_conn *ct,
 static int ctnetlink_change_timeout(struct nf_conn *ct,
                                    const struct nlattr * const cda[])
 {
-       u64 timeout = (u64)ntohl(nla_get_be32(cda[CTA_TIMEOUT])) * HZ;
-
-       if (timeout > INT_MAX)
-               timeout = INT_MAX;
-       WRITE_ONCE(ct->timeout, nfct_time_stamp + (u32)timeout);
-
-       if (test_bit(IPS_DYING_BIT, &ct->status))
-               return -ETIME;
-
-       return 0;
+       return __nf_ct_change_timeout(ct, (u64)ntohl(nla_get_be32(cda[CTA_TIMEOUT])) * HZ);
 }
 
 #if defined(CONFIG_NF_CONNTRACK_MARK)
@@ -2293,9 +2249,7 @@ ctnetlink_create_conntrack(struct net *net,
                goto err1;
 
        timeout = (u64)ntohl(nla_get_be32(cda[CTA_TIMEOUT])) * HZ;
-       if (timeout > INT_MAX)
-               timeout = INT_MAX;
-       ct->timeout = (u32)timeout + nfct_time_stamp;
+       __nf_ct_set_timeout(ct, timeout);
 
        rcu_read_lock();
        if (cda[CTA_HELP]) {
@@ -2837,7 +2791,7 @@ ctnetlink_update_status(struct nf_conn *ct, const struct nlattr * const cda[])
         * unchangeable bits but do not error out. Also user programs
         * are allowed to clear the bits that they are allowed to change.
         */
-       __ctnetlink_change_status(ct, status, ~status);
+       __nf_ct_change_status(ct, status, ~status);
        return 0;
 }
 
index 0900238..5b4ce6b 100644 (file)
@@ -639,8 +639,11 @@ static int __xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len
        if (unlikely(need_wait))
                return -EOPNOTSUPP;
 
-       if (sk_can_busy_loop(sk))
+       if (sk_can_busy_loop(sk)) {
+               if (xs->zc)
+                       __sk_mark_napi_id_once(sk, xsk_pool_get_napi_id(xs->pool));
                sk_busy_loop(sk, 1); /* only support non-blocking sockets */
+       }
 
        if (xs->zc && xsk_no_wakeup(sk))
                return 0;
index 5002a5b..727da3c 100644 (file)
@@ -282,12 +282,10 @@ $(LIBBPF): $(wildcard $(LIBBPF_SRC)/*.[ch] $(LIBBPF_SRC)/Makefile) | $(LIBBPF_OU
 
 BPFTOOLDIR := $(TOOLS_PATH)/bpf/bpftool
 BPFTOOL_OUTPUT := $(abspath $(BPF_SAMPLES_PATH))/bpftool
-BPFTOOL := $(BPFTOOL_OUTPUT)/bpftool
-$(BPFTOOL): $(LIBBPF) $(wildcard $(BPFTOOLDIR)/*.[ch] $(BPFTOOLDIR)/Makefile) | $(BPFTOOL_OUTPUT)
-           $(MAKE) -C $(BPFTOOLDIR) srctree=$(BPF_SAMPLES_PATH)/../../ \
-               OUTPUT=$(BPFTOOL_OUTPUT)/ \
-               LIBBPF_OUTPUT=$(LIBBPF_OUTPUT)/ \
-               LIBBPF_DESTDIR=$(LIBBPF_DESTDIR)/
+BPFTOOL := $(BPFTOOL_OUTPUT)/bootstrap/bpftool
+$(BPFTOOL): $(wildcard $(BPFTOOLDIR)/*.[ch] $(BPFTOOLDIR)/Makefile) | $(BPFTOOL_OUTPUT)
+       $(MAKE) -C $(BPFTOOLDIR) srctree=$(BPF_SAMPLES_PATH)/../../             \
+               OUTPUT=$(BPFTOOL_OUTPUT)/ bootstrap
 
 $(LIBBPF_OUTPUT) $(BPFTOOL_OUTPUT):
        $(call msg,MKDIR,$@)
index 16dbf49..88a26f3 100644 (file)
@@ -17,6 +17,7 @@
 #include <bpf/libbpf.h>
 #include "bpf_insn.h"
 #include "sock_example.h"
+#include "bpf_util.h"
 
 #define BPF_F_PIN      (1 << 0)
 #define BPF_F_GET      (1 << 1)
@@ -52,7 +53,7 @@ static int bpf_prog_create(const char *object)
                BPF_MOV64_IMM(BPF_REG_0, 1),
                BPF_EXIT_INSN(),
        };
-       size_t insns_cnt = sizeof(insns) / sizeof(struct bpf_insn);
+       size_t insns_cnt = ARRAY_SIZE(insns);
        struct bpf_object *obj;
        int err;
 
index a88f695..5b66f24 100644 (file)
@@ -29,6 +29,7 @@
 #include <bpf/bpf.h>
 #include "bpf_insn.h"
 #include "sock_example.h"
+#include "bpf_util.h"
 
 char bpf_log_buf[BPF_LOG_BUF_SIZE];
 
@@ -58,7 +59,7 @@ static int test_sock(void)
                BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */
                BPF_EXIT_INSN(),
        };
-       size_t insns_cnt = sizeof(prog) / sizeof(struct bpf_insn);
+       size_t insns_cnt = ARRAY_SIZE(prog);
        LIBBPF_OPTS(bpf_prog_load_opts, opts,
                .log_buf = bpf_log_buf,
                .log_size = BPF_LOG_BUF_SIZE,
index 6d90874..68ce694 100644 (file)
@@ -31,6 +31,7 @@
 #include <bpf/bpf.h>
 
 #include "bpf_insn.h"
+#include "bpf_util.h"
 
 enum {
        MAP_KEY_PACKETS,
@@ -70,7 +71,7 @@ static int prog_load(int map_fd, int verdict)
                BPF_MOV64_IMM(BPF_REG_0, verdict), /* r0 = verdict */
                BPF_EXIT_INSN(),
        };
-       size_t insns_cnt = sizeof(prog) / sizeof(struct bpf_insn);
+       size_t insns_cnt = ARRAY_SIZE(prog);
        LIBBPF_OPTS(bpf_prog_load_opts, opts,
                .log_buf = bpf_log_buf,
                .log_size = BPF_LOG_BUF_SIZE,
index be98ccb..5efb917 100644 (file)
@@ -523,7 +523,7 @@ int main(int argc, char **argv)
                return -1;
        }
 
-       for (f = 0; f < sizeof(map_flags) / sizeof(*map_flags); f++) {
+       for (f = 0; f < ARRAY_SIZE(map_flags); f++) {
                test_lru_loss0(BPF_MAP_TYPE_LRU_HASH, map_flags[f]);
                test_lru_loss1(BPF_MAP_TYPE_LRU_HASH, map_flags[f]);
                test_parallel_lru_loss(BPF_MAP_TYPE_LRU_HASH, map_flags[f],
index e8b4cc1..652ec72 100644 (file)
@@ -12,6 +12,8 @@
 #include <bpf/bpf.h>
 #include <bpf/libbpf.h>
 
+#include "bpf_util.h"
+
 static int map_fd[7];
 
 #define PORT_A         (map_fd[0])
@@ -28,7 +30,7 @@ static const char * const test_names[] = {
        "Hash of Hash",
 };
 
-#define NR_TESTS (sizeof(test_names) / sizeof(*test_names))
+#define NR_TESTS ARRAY_SIZE(test_names)
 
 static void check_map_id(int inner_map_fd, int map_in_map_fd, uint32_t key)
 {
index e910dc2..9d7d79f 100644 (file)
@@ -8,6 +8,7 @@
 #include <bpf/bpf.h>
 #include <bpf/libbpf.h>
 #include "trace_helpers.h"
+#include "bpf_util.h"
 
 #ifdef __mips__
 #define        MAX_ENTRIES  6000 /* MIPS n64 syscalls start at 5000 */
@@ -24,7 +25,7 @@ static void install_accept_all_seccomp(void)
                BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
        };
        struct sock_fprog prog = {
-               .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+               .len = (unsigned short)ARRAY_SIZE(filter),
                .filter = filter,
        };
        if (prctl(PR_SET_SECCOMP, 2, &prog))
index 415bac1..8557c27 100644 (file)
@@ -33,7 +33,7 @@ struct {
 } tx_port_native SEC(".maps");
 
 /* store egress interface mac address */
-const volatile char tx_mac_addr[ETH_ALEN];
+const volatile __u8 tx_mac_addr[ETH_ALEN];
 
 static __always_inline int xdp_redirect_map(struct xdp_md *ctx, void *redirect_map)
 {
@@ -73,6 +73,7 @@ int xdp_redirect_map_egress(struct xdp_md *ctx)
 {
        void *data_end = (void *)(long)ctx->data_end;
        void *data = (void *)(long)ctx->data;
+       u8 *mac_addr = (u8 *) tx_mac_addr;
        struct ethhdr *eth = data;
        u64 nh_off;
 
@@ -80,7 +81,8 @@ int xdp_redirect_map_egress(struct xdp_md *ctx)
        if (data + nh_off > data_end)
                return XDP_DROP;
 
-       __builtin_memcpy(eth->h_source, (const char *)tx_mac_addr, ETH_ALEN);
+       barrier_var(mac_addr); /* prevent optimizing out memcpy */
+       __builtin_memcpy(eth->h_source, mac_addr, ETH_ALEN);
 
        return XDP_PASS;
 }
index b6e4fc8..c889a13 100644 (file)
@@ -40,6 +40,8 @@ static const struct option long_options[] = {
        {}
 };
 
+static int verbose = 0;
+
 int main(int argc, char **argv)
 {
        struct bpf_devmap_val devmap_val = {};
@@ -79,6 +81,7 @@ int main(int argc, char **argv)
                        break;
                case 'v':
                        sample_switch_mode();
+                       verbose = 1;
                        break;
                case 's':
                        mask |= SAMPLE_REDIRECT_MAP_CNT;
@@ -134,6 +137,12 @@ int main(int argc, char **argv)
                        ret = EXIT_FAIL;
                        goto end_destroy;
                }
+               if (verbose)
+                       printf("Egress ifindex:%d using src MAC %02x:%02x:%02x:%02x:%02x:%02x\n",
+                              ifindex_out,
+                              skel->rodata->tx_mac_addr[0], skel->rodata->tx_mac_addr[1],
+                              skel->rodata->tx_mac_addr[2], skel->rodata->tx_mac_addr[3],
+                              skel->rodata->tx_mac_addr[4], skel->rodata->tx_mac_addr[5]);
        }
 
        skel->rodata->from_match[0] = ifindex_in;
index a0ec321..dfb260d 100755 (executable)
@@ -333,27 +333,7 @@ class PrinterRST(Printer):
 .. Copyright (C) All BPF authors and contributors from 2014 to present.
 .. See git log include/uapi/linux/bpf.h in kernel tree for details.
 .. 
-.. %%%LICENSE_START(VERBATIM)
-.. Permission is granted to make and distribute verbatim copies of this
-.. manual provided the copyright notice and this permission notice are
-.. preserved on all copies.
-.. 
-.. Permission is granted to copy and distribute modified versions of this
-.. manual under the conditions for verbatim copying, provided that the
-.. entire resulting derived work is distributed under the terms of a
-.. permission notice identical to this one.
-.. 
-.. Since the Linux kernel and libraries are constantly changing, this
-.. manual page may be incorrect or out-of-date.  The author(s) assume no
-.. responsibility for errors or omissions, or for damages resulting from
-.. the use of the information contained herein.  The author(s) may not
-.. have taken the same level of care in the production of this manual,
-.. which is licensed free of charge, as they might when working
-.. professionally.
-.. 
-.. Formatted or processed versions of this manual, if unaccompanied by
-.. the source, must acknowledge the copyright and authors of this work.
-.. %%%LICENSE_END
+.. SPDX-License-Identifier:  Linux-man-pages-copyleft
 .. 
 .. Please do not edit this file. It was generated from the documentation
 .. located in file include/uapi/linux/bpf.h of the Linux kernel sources
index 5d26f3c..80cd784 100644 (file)
  *             .zero 4
  *             __BTF_ID__func__vfs_fallocate__4:
  *             .zero 4
+ *
+ *   set8    - store symbol size into first 4 bytes and sort following
+ *             ID list
+ *
+ *             __BTF_ID__set8__list:
+ *             .zero 8
+ *             list:
+ *             __BTF_ID__func__vfs_getattr__3:
+ *             .zero 4
+ *            .word (1 << 0) | (1 << 2)
+ *             __BTF_ID__func__vfs_fallocate__5:
+ *             .zero 4
+ *            .word (1 << 3) | (1 << 1) | (1 << 2)
  */
 
 #define  _GNU_SOURCE
@@ -72,6 +85,7 @@
 #define BTF_TYPEDEF    "typedef"
 #define BTF_FUNC       "func"
 #define BTF_SET                "set"
+#define BTF_SET8       "set8"
 
 #define ADDR_CNT       100
 
@@ -84,6 +98,7 @@ struct btf_id {
        };
        int              addr_cnt;
        bool             is_set;
+       bool             is_set8;
        Elf64_Addr       addr[ADDR_CNT];
 };
 
@@ -231,14 +246,14 @@ static char *get_id(const char *prefix_end)
        return id;
 }
 
-static struct btf_id *add_set(struct object *obj, char *name)
+static struct btf_id *add_set(struct object *obj, char *name, bool is_set8)
 {
        /*
         * __BTF_ID__set__name
         * name =    ^
         * id   =         ^
         */
-       char *id = name + sizeof(BTF_SET "__") - 1;
+       char *id = name + (is_set8 ? sizeof(BTF_SET8 "__") : sizeof(BTF_SET "__")) - 1;
        int len = strlen(name);
 
        if (id >= name + len) {
@@ -444,9 +459,21 @@ static int symbols_collect(struct object *obj)
                } else if (!strncmp(prefix, BTF_FUNC, sizeof(BTF_FUNC) - 1)) {
                        obj->nr_funcs++;
                        id = add_symbol(&obj->funcs, prefix, sizeof(BTF_FUNC) - 1);
+               /* set8 */
+               } else if (!strncmp(prefix, BTF_SET8, sizeof(BTF_SET8) - 1)) {
+                       id = add_set(obj, prefix, true);
+                       /*
+                        * SET8 objects store list's count, which is encoded
+                        * in symbol's size, together with 'cnt' field hence
+                        * that - 1.
+                        */
+                       if (id) {
+                               id->cnt = sym.st_size / sizeof(uint64_t) - 1;
+                               id->is_set8 = true;
+                       }
                /* set */
                } else if (!strncmp(prefix, BTF_SET, sizeof(BTF_SET) - 1)) {
-                       id = add_set(obj, prefix);
+                       id = add_set(obj, prefix, false);
                        /*
                         * SET objects store list's count, which is encoded
                         * in symbol's size, together with 'cnt' field hence
@@ -571,7 +598,8 @@ static int id_patch(struct object *obj, struct btf_id *id)
        int *ptr = data->d_buf;
        int i;
 
-       if (!id->id && !id->is_set)
+       /* For set, set8, id->id may be 0 */
+       if (!id->id && !id->is_set && !id->is_set8)
                pr_err("WARN: resolve_btfids: unresolved symbol %s\n", id->name);
 
        for (i = 0; i < id->addr_cnt; i++) {
@@ -643,13 +671,13 @@ static int sets_patch(struct object *obj)
                }
 
                idx = idx / sizeof(int);
-               base = &ptr[idx] + 1;
+               base = &ptr[idx] + (id->is_set8 ? 2 : 1);
                cnt = ptr[idx];
 
                pr_debug("sorting  addr %5lu: cnt %6d [%s]\n",
                         (idx + 1) * sizeof(int), cnt, id->name);
 
-               qsort(base, cnt, sizeof(int), cmp_id);
+               qsort(base, cnt, id->is_set8 ? sizeof(uint64_t) : sizeof(int), cmp_id);
 
                next = rb_next(next);
        }
index da6de16..8b3d87b 100644 (file)
@@ -4,7 +4,7 @@ include ../../scripts/Makefile.include
 OUTPUT ?= $(abspath .output)/
 
 BPFTOOL_OUTPUT := $(OUTPUT)bpftool/
-DEFAULT_BPFTOOL := $(BPFTOOL_OUTPUT)bpftool
+DEFAULT_BPFTOOL := $(BPFTOOL_OUTPUT)bootstrap/bpftool
 BPFTOOL ?= $(DEFAULT_BPFTOOL)
 LIBBPF_SRC := $(abspath ../../lib/bpf)
 BPFOBJ_OUTPUT := $(OUTPUT)libbpf/
@@ -86,6 +86,5 @@ $(BPFOBJ): $(wildcard $(LIBBPF_SRC)/*.[ch] $(LIBBPF_SRC)/Makefile) | $(BPFOBJ_OU
        $(Q)$(MAKE) $(submake_extras) -C $(LIBBPF_SRC) OUTPUT=$(BPFOBJ_OUTPUT) \
                    DESTDIR=$(BPFOBJ_OUTPUT) prefix= $(abspath $@) install_headers
 
-$(DEFAULT_BPFTOOL): $(BPFOBJ) | $(BPFTOOL_OUTPUT)
-       $(Q)$(MAKE) $(submake_extras) -C ../bpftool OUTPUT=$(BPFTOOL_OUTPUT)   \
-                   ARCH= CROSS_COMPILE= CC=$(HOSTCC) LD=$(HOSTLD)
+$(DEFAULT_BPFTOOL): | $(BPFTOOL_OUTPUT)
+       $(Q)$(MAKE) $(submake_extras) -C ../bpftool OUTPUT=$(BPFTOOL_OUTPUT) bootstrap
index 3dd13fe..59a217c 100644 (file)
@@ -2361,7 +2361,8 @@ union bpf_attr {
  *             Pull in non-linear data in case the *skb* is non-linear and not
  *             all of *len* are part of the linear section. Make *len* bytes
  *             from *skb* readable and writable. If a zero value is passed for
- *             *len*, then the whole length of the *skb* is pulled.
+ *             *len*, then all bytes in the linear part of *skb* will be made
+ *             readable and writable.
  *
  *             This helper is only needed for reading and writing with direct
  *             packet access.
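
To make the clarified semantics concrete, a minimal TC sketch (not from this diff) that pulls the whole linear part before doing direct packet access:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

char LICENSE[] SEC("license") = "GPL";

SEC("tc")
int pull_linear(struct __sk_buff *skb)
{
	/* len == 0: all bytes of the linear part become readable/writable */
	if (bpf_skb_pull_data(skb, 0))
		return 0;	/* TC_ACT_OK */

	/* data/data_end must be re-read after the pull */
	void *data = (void *)(long)skb->data;
	void *data_end = (void *)(long)skb->data_end;
	struct ethhdr *eth = data;

	if ((void *)(eth + 1) > data_end)
		return 0;	/* TC_ACT_OK */

	bpf_printk("proto 0x%x", bpf_ntohs(eth->h_proto));
	return 0;	/* TC_ACT_OK */
}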
index 11f9096..f4d3e1e 100644 (file)
@@ -2,6 +2,8 @@
 #ifndef __BPF_TRACING_H__
 #define __BPF_TRACING_H__
 
+#include <bpf/bpf_helpers.h>
+
 /* Scan the ARCH passed in from ARCH env variable (see Makefile) */
 #if defined(__TARGET_ARCH_x86)
        #define bpf_target_x86
@@ -140,7 +142,7 @@ struct pt_regs___s390 {
 #define __PT_RC_REG gprs[2]
 #define __PT_SP_REG gprs[15]
 #define __PT_IP_REG psw.addr
-#define PT_REGS_PARM1_SYSCALL(x) ({ _Pragma("GCC error \"use PT_REGS_PARM1_CORE_SYSCALL() instead\""); 0l; })
+#define PT_REGS_PARM1_SYSCALL(x) PT_REGS_PARM1_CORE_SYSCALL(x)
 #define PT_REGS_PARM1_CORE_SYSCALL(x) BPF_CORE_READ((const struct pt_regs___s390 *)(x), orig_gpr2)
 
 #elif defined(bpf_target_arm)
@@ -174,7 +176,7 @@ struct pt_regs___arm64 {
 #define __PT_RC_REG regs[0]
 #define __PT_SP_REG sp
 #define __PT_IP_REG pc
-#define PT_REGS_PARM1_SYSCALL(x) ({ _Pragma("GCC error \"use PT_REGS_PARM1_CORE_SYSCALL() instead\""); 0l; })
+#define PT_REGS_PARM1_SYSCALL(x) PT_REGS_PARM1_CORE_SYSCALL(x)
 #define PT_REGS_PARM1_CORE_SYSCALL(x) BPF_CORE_READ((const struct pt_regs___arm64 *)(x), orig_x0)
 
 #elif defined(bpf_target_mips)
@@ -493,39 +495,62 @@ typeof(name(0)) name(struct pt_regs *ctx)                             \
 }                                                                          \
 static __always_inline typeof(name(0)) ____##name(struct pt_regs *ctx, ##args)
 
+/* If kernel has CONFIG_ARCH_HAS_SYSCALL_WRAPPER, read pt_regs directly */
 #define ___bpf_syscall_args0()           ctx
-#define ___bpf_syscall_args1(x)          ___bpf_syscall_args0(), (void *)PT_REGS_PARM1_CORE_SYSCALL(regs)
-#define ___bpf_syscall_args2(x, args...) ___bpf_syscall_args1(args), (void *)PT_REGS_PARM2_CORE_SYSCALL(regs)
-#define ___bpf_syscall_args3(x, args...) ___bpf_syscall_args2(args), (void *)PT_REGS_PARM3_CORE_SYSCALL(regs)
-#define ___bpf_syscall_args4(x, args...) ___bpf_syscall_args3(args), (void *)PT_REGS_PARM4_CORE_SYSCALL(regs)
-#define ___bpf_syscall_args5(x, args...) ___bpf_syscall_args4(args), (void *)PT_REGS_PARM5_CORE_SYSCALL(regs)
+#define ___bpf_syscall_args1(x)          ___bpf_syscall_args0(), (void *)PT_REGS_PARM1_SYSCALL(regs)
+#define ___bpf_syscall_args2(x, args...) ___bpf_syscall_args1(args), (void *)PT_REGS_PARM2_SYSCALL(regs)
+#define ___bpf_syscall_args3(x, args...) ___bpf_syscall_args2(args), (void *)PT_REGS_PARM3_SYSCALL(regs)
+#define ___bpf_syscall_args4(x, args...) ___bpf_syscall_args3(args), (void *)PT_REGS_PARM4_SYSCALL(regs)
+#define ___bpf_syscall_args5(x, args...) ___bpf_syscall_args4(args), (void *)PT_REGS_PARM5_SYSCALL(regs)
 #define ___bpf_syscall_args(args...)     ___bpf_apply(___bpf_syscall_args, ___bpf_narg(args))(args)
 
+/* If kernel doesn't have CONFIG_ARCH_HAS_SYSCALL_WRAPPER, we have to BPF_CORE_READ from pt_regs */
+#define ___bpf_syswrap_args0()           ctx
+#define ___bpf_syswrap_args1(x)          ___bpf_syswrap_args0(), (void *)PT_REGS_PARM1_CORE_SYSCALL(regs)
+#define ___bpf_syswrap_args2(x, args...) ___bpf_syswrap_args1(args), (void *)PT_REGS_PARM2_CORE_SYSCALL(regs)
+#define ___bpf_syswrap_args3(x, args...) ___bpf_syswrap_args2(args), (void *)PT_REGS_PARM3_CORE_SYSCALL(regs)
+#define ___bpf_syswrap_args4(x, args...) ___bpf_syswrap_args3(args), (void *)PT_REGS_PARM4_CORE_SYSCALL(regs)
+#define ___bpf_syswrap_args5(x, args...) ___bpf_syswrap_args4(args), (void *)PT_REGS_PARM5_CORE_SYSCALL(regs)
+#define ___bpf_syswrap_args(args...)     ___bpf_apply(___bpf_syswrap_args, ___bpf_narg(args))(args)
+
 /*
- * BPF_KPROBE_SYSCALL is a variant of BPF_KPROBE, which is intended for
+ * BPF_KSYSCALL is a variant of BPF_KPROBE, which is intended for
  * tracing syscall functions, like __x64_sys_close. It hides the underlying
  * platform-specific low-level way of getting syscall input arguments from
  * struct pt_regs, and provides a familiar typed and named function arguments
  * syntax and semantics of accessing syscall input parameters.
  *
- * Original struct pt_regs* context is preserved as 'ctx' argument. This might
+ * Original struct pt_regs * context is preserved as 'ctx' argument. This might
  * be necessary when using BPF helpers like bpf_perf_event_output().
  *
- * This macro relies on BPF CO-RE support.
+ * At the moment BPF_KSYSCALL does not handle all the calling convention
+ * quirks for mmap(), clone() and compat syscalls transparently. This may or
+ * may not change in the future. User needs to take extra measures to handle
+ * such quirks explicitly, if necessary.
+ *
+ * This macro relies on BPF CO-RE support and virtual __kconfig externs.
  */
-#define BPF_KPROBE_SYSCALL(name, args...)                                  \
+#define BPF_KSYSCALL(name, args...)                                        \
 name(struct pt_regs *ctx);                                                 \
+extern _Bool LINUX_HAS_SYSCALL_WRAPPER __kconfig;                          \
 static __attribute__((always_inline)) typeof(name(0))                      \
 ____##name(struct pt_regs *ctx, ##args);                                   \
 typeof(name(0)) name(struct pt_regs *ctx)                                  \
 {                                                                          \
-       struct pt_regs *regs = PT_REGS_SYSCALL_REGS(ctx);                   \
+       struct pt_regs *regs = LINUX_HAS_SYSCALL_WRAPPER                    \
+                              ? (struct pt_regs *)PT_REGS_PARM1(ctx)       \
+                              : ctx;                                       \
        _Pragma("GCC diagnostic push")                                      \
        _Pragma("GCC diagnostic ignored \"-Wint-conversion\"")              \
-       return ____##name(___bpf_syscall_args(args));                       \
+       if (LINUX_HAS_SYSCALL_WRAPPER)                                      \
+               return ____##name(___bpf_syswrap_args(args));               \
+       else                                                                \
+               return ____##name(___bpf_syscall_args(args));               \
        _Pragma("GCC diagnostic pop")                                       \
 }                                                                          \
 static __attribute__((always_inline)) typeof(name(0))                      \
 ____##name(struct pt_regs *ctx, ##args)
 
+#define BPF_KPROBE_SYSCALL BPF_KSYSCALL
+
 #endif
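
For reference, a minimal sketch of how the reworked macro is meant to be used together with the new "ksyscall" section type added on the libbpf side (program and argument names are illustrative, the pattern follows the selftests in this series):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

/* auto-attachable via the new "ksyscall/<syscall>" section handler */
SEC("ksyscall/tgkill")
int BPF_KSYSCALL(tgkill_entry, pid_t tgid, pid_t pid, int sig)
{
	/* arguments are already pulled out of pt_regs, wrapper or not */
	bpf_printk("tgkill(%d, %d, %d)", tgid, pid, sig);
	return 0;
}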
index 400e84f..627edb5 100644 (file)
@@ -2045,7 +2045,7 @@ static int btf_dump_get_enum_value(struct btf_dump *d,
                *value = *(__s64 *)data;
                return 0;
        case 4:
-               *value = is_signed ? *(__s32 *)data : *(__u32 *)data;
+               *value = is_signed ? (__s64)*(__s32 *)data : *(__u32 *)data;
                return 0;
        case 2:
                *value = is_signed ? *(__s16 *)data : *(__u16 *)data;
index 927745b..23f5c46 100644 (file)
@@ -533,7 +533,7 @@ void bpf_gen__record_attach_target(struct bpf_gen *gen, const char *attach_name,
        gen->attach_kind = kind;
        ret = snprintf(gen->attach_target, sizeof(gen->attach_target), "%s%s",
                       prefix, attach_name);
-       if (ret == sizeof(gen->attach_target))
+       if (ret >= sizeof(gen->attach_target))
                gen->error = -ENOSPC;
 }
 
index cb49408..b01fe01 100644 (file)
@@ -1694,7 +1694,7 @@ static int set_kcfg_value_tri(struct extern_desc *ext, void *ext_val,
        switch (ext->kcfg.type) {
        case KCFG_BOOL:
                if (value == 'm') {
-                       pr_warn("extern (kcfg) %s=%c should be tristate or char\n",
+                       pr_warn("extern (kcfg) '%s': value '%c' implies tristate or char type\n",
                                ext->name, value);
                        return -EINVAL;
                }
@@ -1715,7 +1715,7 @@ static int set_kcfg_value_tri(struct extern_desc *ext, void *ext_val,
        case KCFG_INT:
        case KCFG_CHAR_ARR:
        default:
-               pr_warn("extern (kcfg) %s=%c should be bool, tristate, or char\n",
+               pr_warn("extern (kcfg) '%s': value '%c' implies bool, tristate, or char type\n",
                        ext->name, value);
                return -EINVAL;
        }
@@ -1729,7 +1729,8 @@ static int set_kcfg_value_str(struct extern_desc *ext, char *ext_val,
        size_t len;
 
        if (ext->kcfg.type != KCFG_CHAR_ARR) {
-               pr_warn("extern (kcfg) %s=%s should be char array\n", ext->name, value);
+               pr_warn("extern (kcfg) '%s': value '%s' implies char array type\n",
+                       ext->name, value);
                return -EINVAL;
        }
 
@@ -1743,7 +1744,7 @@ static int set_kcfg_value_str(struct extern_desc *ext, char *ext_val,
        /* strip quotes */
        len -= 2;
        if (len >= ext->kcfg.sz) {
-               pr_warn("extern (kcfg) '%s': long string config %s of (%zu bytes) truncated to %d bytes\n",
+               pr_warn("extern (kcfg) '%s': long string '%s' of (%zu bytes) truncated to %d bytes\n",
                        ext->name, value, len, ext->kcfg.sz - 1);
                len = ext->kcfg.sz - 1;
        }
@@ -1800,13 +1801,20 @@ static bool is_kcfg_value_in_range(const struct extern_desc *ext, __u64 v)
 static int set_kcfg_value_num(struct extern_desc *ext, void *ext_val,
                              __u64 value)
 {
-       if (ext->kcfg.type != KCFG_INT && ext->kcfg.type != KCFG_CHAR) {
-               pr_warn("extern (kcfg) %s=%llu should be integer\n",
+       if (ext->kcfg.type != KCFG_INT && ext->kcfg.type != KCFG_CHAR &&
+           ext->kcfg.type != KCFG_BOOL) {
+               pr_warn("extern (kcfg) '%s': value '%llu' implies integer, char, or boolean type\n",
                        ext->name, (unsigned long long)value);
                return -EINVAL;
        }
+       if (ext->kcfg.type == KCFG_BOOL && value > 1) {
+               pr_warn("extern (kcfg) '%s': value '%llu' isn't boolean compatible\n",
+                       ext->name, (unsigned long long)value);
+               return -EINVAL;
+
+       }
        if (!is_kcfg_value_in_range(ext, value)) {
-               pr_warn("extern (kcfg) %s=%llu value doesn't fit in %d bytes\n",
+               pr_warn("extern (kcfg) '%s': value '%llu' doesn't fit in %d bytes\n",
                        ext->name, (unsigned long long)value, ext->kcfg.sz);
                return -ERANGE;
        }
@@ -1870,16 +1878,19 @@ static int bpf_object__process_kconfig_line(struct bpf_object *obj,
                /* assume integer */
                err = parse_u64(value, &num);
                if (err) {
-                       pr_warn("extern (kcfg) %s=%s should be integer\n",
-                               ext->name, value);
+                       pr_warn("extern (kcfg) '%s': value '%s' isn't a valid integer\n", ext->name, value);
                        return err;
                }
+               if (ext->kcfg.type != KCFG_INT && ext->kcfg.type != KCFG_CHAR) {
+                       pr_warn("extern (kcfg) '%s': value '%s' implies integer type\n", ext->name, value);
+                       return -EINVAL;
+               }
                err = set_kcfg_value_num(ext, ext_val, num);
                break;
        }
        if (err)
                return err;
-       pr_debug("extern (kcfg) %s=%s\n", ext->name, value);
+       pr_debug("extern (kcfg) '%s': set to %s\n", ext->name, value);
        return 0;
 }
 
@@ -2320,6 +2331,37 @@ int parse_btf_map_def(const char *map_name, struct btf *btf,
        return 0;
 }
 
+static size_t adjust_ringbuf_sz(size_t sz)
+{
+       __u32 page_sz = sysconf(_SC_PAGE_SIZE);
+       __u32 mul;
+
+       /* if user forgot to set any size, make sure they see error */
+       if (sz == 0)
+               return 0;
+       /* Kernel expects BPF_MAP_TYPE_RINGBUF's max_entries to be
+        * a power-of-2 multiple of kernel's page size. If user diligently
+        * satisified these conditions, pass the size through.
+        */
+       if ((sz % page_sz) == 0 && is_pow_of_2(sz / page_sz))
+               return sz;
+
+       /* Otherwise find closest (page_sz * power_of_2) product bigger than
+        * user-set size to satisfy both user size request and kernel
+        * requirements and substitute correct max_entries for map creation.
+        */
+       for (mul = 1; mul <= UINT_MAX / page_sz; mul <<= 1) {
+               if (mul * page_sz > sz)
+                       return mul * page_sz;
+       }
+
+       /* if it's impossible to satisfy the conditions (i.e., user size is
+        * very close to UINT_MAX but is not a power-of-2 multiple of
+        * page_size) then just return original size and let kernel reject it
+        */
+       return sz;
+}
+
 static void fill_map_from_def(struct bpf_map *map, const struct btf_map_def *def)
 {
        map->def.type = def->map_type;
@@ -2333,6 +2375,10 @@ static void fill_map_from_def(struct bpf_map *map, const struct btf_map_def *def
        map->btf_key_type_id = def->key_type_id;
        map->btf_value_type_id = def->value_type_id;
 
+       /* auto-adjust BPF ringbuf map max_entries to be a multiple of page size */
+       if (map->def.type == BPF_MAP_TYPE_RINGBUF)
+               map->def.max_entries = adjust_ringbuf_sz(map->def.max_entries);
+
        if (def->parts & MAP_DEF_MAP_TYPE)
                pr_debug("map '%s': found type = %u.\n", map->name, def->map_type);
 
@@ -3687,7 +3733,7 @@ static int bpf_object__collect_externs(struct bpf_object *obj)
                        ext->kcfg.type = find_kcfg_type(obj->btf, t->type,
                                                        &ext->kcfg.is_signed);
                        if (ext->kcfg.type == KCFG_UNKNOWN) {
-                               pr_warn("extern (kcfg) '%s' type is unsupported\n", ext_name);
+                               pr_warn("extern (kcfg) '%s': type is unsupported\n", ext_name);
                                return -ENOTSUP;
                        }
                } else if (strcmp(sec_name, KSYMS_SEC) == 0) {
@@ -4232,7 +4278,7 @@ int bpf_map__set_autocreate(struct bpf_map *map, bool autocreate)
 int bpf_map__reuse_fd(struct bpf_map *map, int fd)
 {
        struct bpf_map_info info = {};
-       __u32 len = sizeof(info);
+       __u32 len = sizeof(info), name_len;
        int new_fd, err;
        char *new_name;
 
@@ -4242,7 +4288,12 @@ int bpf_map__reuse_fd(struct bpf_map *map, int fd)
        if (err)
                return libbpf_err(err);
 
-       new_name = strdup(info.name);
+       name_len = strlen(info.name);
+       if (name_len == BPF_OBJ_NAME_LEN - 1 && strncmp(map->name, info.name, name_len) == 0)
+               new_name = strdup(map->name);
+       else
+               new_name = strdup(info.name);
+
        if (!new_name)
                return libbpf_err(-errno);
 
@@ -4301,9 +4352,15 @@ struct bpf_map *bpf_map__inner_map(struct bpf_map *map)
 
 int bpf_map__set_max_entries(struct bpf_map *map, __u32 max_entries)
 {
-       if (map->fd >= 0)
+       if (map->obj->loaded)
                return libbpf_err(-EBUSY);
+
        map->def.max_entries = max_entries;
+
+       /* auto-adjust BPF ringbuf map max_entries to be a multiple of page size */
+       if (map->def.type == BPF_MAP_TYPE_RINGBUF)
+               map->def.max_entries = adjust_ringbuf_sz(map->def.max_entries);
+
        return 0;
 }
 
@@ -4654,6 +4711,8 @@ static int probe_kern_btf_enum64(void)
                                             strs, sizeof(strs)));
 }
 
+static int probe_kern_syscall_wrapper(void);
+
 enum kern_feature_result {
        FEAT_UNKNOWN = 0,
        FEAT_SUPPORTED = 1,
@@ -4722,6 +4781,9 @@ static struct kern_feature_desc {
        [FEAT_BTF_ENUM64] = {
                "BTF_KIND_ENUM64 support", probe_kern_btf_enum64,
        },
+       [FEAT_SYSCALL_WRAPPER] = {
+               "Kernel using syscall wrapper", probe_kern_syscall_wrapper,
+       },
 };
 
 bool kernel_supports(const struct bpf_object *obj, enum kern_feature_id feat_id)
@@ -4854,37 +4916,6 @@ bpf_object__populate_internal_map(struct bpf_object *obj, struct bpf_map *map)
 
 static void bpf_map__destroy(struct bpf_map *map);
 
-static size_t adjust_ringbuf_sz(size_t sz)
-{
-       __u32 page_sz = sysconf(_SC_PAGE_SIZE);
-       __u32 mul;
-
-       /* if user forgot to set any size, make sure they see error */
-       if (sz == 0)
-               return 0;
-       /* Kernel expects BPF_MAP_TYPE_RINGBUF's max_entries to be
-        * a power-of-2 multiple of kernel's page size. If user diligently
-        * satisified these conditions, pass the size through.
-        */
-       if ((sz % page_sz) == 0 && is_pow_of_2(sz / page_sz))
-               return sz;
-
-       /* Otherwise find closest (page_sz * power_of_2) product bigger than
-        * user-set size to satisfy both user size request and kernel
-        * requirements and substitute correct max_entries for map creation.
-        */
-       for (mul = 1; mul <= UINT_MAX / page_sz; mul <<= 1) {
-               if (mul * page_sz > sz)
-                       return mul * page_sz;
-       }
-
-       /* if it's impossible to satisfy the conditions (i.e., user size is
-        * very close to UINT_MAX but is not a power-of-2 multiple of
-        * page_size) then just return original size and let kernel reject it
-        */
-       return sz;
-}
-
 static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, bool is_inner)
 {
        LIBBPF_OPTS(bpf_map_create_opts, create_attr);
@@ -4923,9 +4954,6 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, b
        }
 
        switch (def->type) {
-       case BPF_MAP_TYPE_RINGBUF:
-               map->def.max_entries = adjust_ringbuf_sz(map->def.max_entries);
-               /* fallthrough */
        case BPF_MAP_TYPE_PERF_EVENT_ARRAY:
        case BPF_MAP_TYPE_CGROUP_ARRAY:
        case BPF_MAP_TYPE_STACK_TRACE:
@@ -7282,14 +7310,14 @@ static int kallsyms_cb(unsigned long long sym_addr, char sym_type,
                return 0;
 
        if (ext->is_set && ext->ksym.addr != sym_addr) {
-               pr_warn("extern (ksym) '%s' resolution is ambiguous: 0x%llx or 0x%llx\n",
+               pr_warn("extern (ksym) '%s': resolution is ambiguous: 0x%llx or 0x%llx\n",
                        sym_name, ext->ksym.addr, sym_addr);
                return -EINVAL;
        }
        if (!ext->is_set) {
                ext->is_set = true;
                ext->ksym.addr = sym_addr;
-               pr_debug("extern (ksym) %s=0x%llx\n", sym_name, sym_addr);
+               pr_debug("extern (ksym) '%s': set to 0x%llx\n", sym_name, sym_addr);
        }
        return 0;
 }
@@ -7493,28 +7521,52 @@ static int bpf_object__resolve_externs(struct bpf_object *obj,
        for (i = 0; i < obj->nr_extern; i++) {
                ext = &obj->externs[i];
 
-               if (ext->type == EXT_KCFG &&
-                   strcmp(ext->name, "LINUX_KERNEL_VERSION") == 0) {
-                       void *ext_val = kcfg_data + ext->kcfg.data_off;
-                       __u32 kver = get_kernel_version();
+               if (ext->type == EXT_KSYM) {
+                       if (ext->ksym.type_id)
+                               need_vmlinux_btf = true;
+                       else
+                               need_kallsyms = true;
+                       continue;
+               } else if (ext->type == EXT_KCFG) {
+                       void *ext_ptr = kcfg_data + ext->kcfg.data_off;
+                       __u64 value = 0;
+
+                       /* Kconfig externs need actual /proc/config.gz */
+                       if (str_has_pfx(ext->name, "CONFIG_")) {
+                               need_config = true;
+                               continue;
+                       }
 
-                       if (!kver) {
-                               pr_warn("failed to get kernel version\n");
+                       /* Virtual kcfg externs are handled specially by libbpf */
+                       if (strcmp(ext->name, "LINUX_KERNEL_VERSION") == 0) {
+                               value = get_kernel_version();
+                               if (!value) {
+                                       pr_warn("extern (kcfg) '%s': failed to get kernel version\n", ext->name);
+                                       return -EINVAL;
+                               }
+                       } else if (strcmp(ext->name, "LINUX_HAS_BPF_COOKIE") == 0) {
+                               value = kernel_supports(obj, FEAT_BPF_COOKIE);
+                       } else if (strcmp(ext->name, "LINUX_HAS_SYSCALL_WRAPPER") == 0) {
+                               value = kernel_supports(obj, FEAT_SYSCALL_WRAPPER);
+                       } else if (!str_has_pfx(ext->name, "LINUX_") || !ext->is_weak) {
+                               /* Currently libbpf supports only CONFIG_ and LINUX_ prefixed
+                                * __kconfig externs, where LINUX_ ones are virtual and filled out
+                                * specially by libbpf (their values don't come from Kconfig).
+                                * If LINUX_xxx variable is not recognized by libbpf, but is marked
+                                * __weak, it defaults to zero value, just like for CONFIG_xxx
+                                * externs.
+                                */
+                               pr_warn("extern (kcfg) '%s': unrecognized virtual extern\n", ext->name);
                                return -EINVAL;
                        }
-                       err = set_kcfg_value_num(ext, ext_val, kver);
+
+                       err = set_kcfg_value_num(ext, ext_ptr, value);
                        if (err)
                                return err;
-                       pr_debug("extern (kcfg) %s=0x%x\n", ext->name, kver);
-               } else if (ext->type == EXT_KCFG && str_has_pfx(ext->name, "CONFIG_")) {
-                       need_config = true;
-               } else if (ext->type == EXT_KSYM) {
-                       if (ext->ksym.type_id)
-                               need_vmlinux_btf = true;
-                       else
-                               need_kallsyms = true;
+                       pr_debug("extern (kcfg) '%s': set to 0x%llx\n",
+                                ext->name, (long long)value);
                } else {
-                       pr_warn("unrecognized extern '%s'\n", ext->name);
+                       pr_warn("extern '%s': unrecognized extern kind\n", ext->name);
                        return -EINVAL;
                }
        }
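
On the BPF program side these virtual externs are declared like any other __kconfig variable; a sketch (only the LINUX_* names handled above are real, the __weak one is a hypothetical placeholder that defaults to zero):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

extern unsigned int LINUX_KERNEL_VERSION __kconfig;	/* filled by libbpf */
extern _Bool LINUX_HAS_SYSCALL_WRAPPER __kconfig;	/* probed by libbpf */
extern _Bool LINUX_HAS_BPF_COOKIE __kconfig;		/* probed by libbpf */
extern int CONFIG_HZ __kconfig;				/* read from Kconfig data */
/* unrecognized LINUX_* externs must be __weak; they default to zero */
extern _Bool LINUX_HAS_SOME_FUTURE_FEATURE __kconfig __weak;

SEC("raw_tp/sys_enter")
int report(void *ctx)
{
	if (LINUX_HAS_SYSCALL_WRAPPER)
		bpf_printk("kernel 0x%x uses syscall wrappers", LINUX_KERNEL_VERSION);
	return 0;
}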
@@ -7550,10 +7602,10 @@ static int bpf_object__resolve_externs(struct bpf_object *obj,
                ext = &obj->externs[i];
 
                if (!ext->is_set && !ext->is_weak) {
-                       pr_warn("extern %s (strong) not resolved\n", ext->name);
+                       pr_warn("extern '%s' (strong): not resolved\n", ext->name);
                        return -ESRCH;
                } else if (!ext->is_set) {
-                       pr_debug("extern %s (weak) not resolved, defaulting to zero\n",
+                       pr_debug("extern '%s' (weak): not resolved, defaulting to zero\n",
                                 ext->name);
                }
        }
@@ -8381,6 +8433,7 @@ int bpf_program__set_log_buf(struct bpf_program *prog, char *log_buf, size_t log
 
 static int attach_kprobe(const struct bpf_program *prog, long cookie, struct bpf_link **link);
 static int attach_uprobe(const struct bpf_program *prog, long cookie, struct bpf_link **link);
+static int attach_ksyscall(const struct bpf_program *prog, long cookie, struct bpf_link **link);
 static int attach_usdt(const struct bpf_program *prog, long cookie, struct bpf_link **link);
 static int attach_tp(const struct bpf_program *prog, long cookie, struct bpf_link **link);
 static int attach_raw_tp(const struct bpf_program *prog, long cookie, struct bpf_link **link);
@@ -8401,6 +8454,8 @@ static const struct bpf_sec_def section_defs[] = {
        SEC_DEF("uretprobe.s+",         KPROBE, 0, SEC_SLEEPABLE, attach_uprobe),
        SEC_DEF("kprobe.multi+",        KPROBE, BPF_TRACE_KPROBE_MULTI, SEC_NONE, attach_kprobe_multi),
        SEC_DEF("kretprobe.multi+",     KPROBE, BPF_TRACE_KPROBE_MULTI, SEC_NONE, attach_kprobe_multi),
+       SEC_DEF("ksyscall+",            KPROBE, 0, SEC_NONE, attach_ksyscall),
+       SEC_DEF("kretsyscall+",         KPROBE, 0, SEC_NONE, attach_ksyscall),
        SEC_DEF("usdt+",                KPROBE, 0, SEC_NONE, attach_usdt),
        SEC_DEF("tc",                   SCHED_CLS, 0, SEC_NONE),
        SEC_DEF("classifier",           SCHED_CLS, 0, SEC_NONE),
@@ -9757,7 +9812,7 @@ static int perf_event_open_probe(bool uprobe, bool retprobe, const char *name,
 {
        struct perf_event_attr attr = {};
        char errmsg[STRERR_BUFSIZE];
-       int type, pfd, err;
+       int type, pfd;
 
        if (ref_ctr_off >= (1ULL << PERF_UPROBE_REF_CTR_OFFSET_BITS))
                return -EINVAL;
@@ -9793,14 +9848,7 @@ static int perf_event_open_probe(bool uprobe, bool retprobe, const char *name,
                      pid < 0 ? -1 : pid /* pid */,
                      pid == -1 ? 0 : -1 /* cpu */,
                      -1 /* group_fd */, PERF_FLAG_FD_CLOEXEC);
-       if (pfd < 0) {
-               err = -errno;
-               pr_warn("%s perf_event_open() failed: %s\n",
-                       uprobe ? "uprobe" : "kprobe",
-                       libbpf_strerror_r(err, errmsg, sizeof(errmsg)));
-               return err;
-       }
-       return pfd;
+       return pfd >= 0 ? pfd : -errno;
 }
 
 static int append_to_file(const char *file, const char *fmt, ...)
@@ -9823,6 +9871,34 @@ static int append_to_file(const char *file, const char *fmt, ...)
        return err;
 }
 
+#define DEBUGFS "/sys/kernel/debug/tracing"
+#define TRACEFS "/sys/kernel/tracing"
+
+static bool use_debugfs(void)
+{
+       static int has_debugfs = -1;
+
+       if (has_debugfs < 0)
+               has_debugfs = access(DEBUGFS, F_OK) == 0;
+
+       return has_debugfs == 1;
+}
+
+static const char *tracefs_path(void)
+{
+       return use_debugfs() ? DEBUGFS : TRACEFS;
+}
+
+static const char *tracefs_kprobe_events(void)
+{
+       return use_debugfs() ? DEBUGFS"/kprobe_events" : TRACEFS"/kprobe_events";
+}
+
+static const char *tracefs_uprobe_events(void)
+{
+       return use_debugfs() ? DEBUGFS"/uprobe_events" : TRACEFS"/uprobe_events";
+}
+
 static void gen_kprobe_legacy_event_name(char *buf, size_t buf_sz,
                                         const char *kfunc_name, size_t offset)
 {
@@ -9835,9 +9911,7 @@ static void gen_kprobe_legacy_event_name(char *buf, size_t buf_sz,
 static int add_kprobe_event_legacy(const char *probe_name, bool retprobe,
                                   const char *kfunc_name, size_t offset)
 {
-       const char *file = "/sys/kernel/debug/tracing/kprobe_events";
-
-       return append_to_file(file, "%c:%s/%s %s+0x%zx",
+       return append_to_file(tracefs_kprobe_events(), "%c:%s/%s %s+0x%zx",
                              retprobe ? 'r' : 'p',
                              retprobe ? "kretprobes" : "kprobes",
                              probe_name, kfunc_name, offset);
@@ -9845,18 +9919,16 @@ static int add_kprobe_event_legacy(const char *probe_name, bool retprobe,
 
 static int remove_kprobe_event_legacy(const char *probe_name, bool retprobe)
 {
-       const char *file = "/sys/kernel/debug/tracing/kprobe_events";
-
-       return append_to_file(file, "-:%s/%s", retprobe ? "kretprobes" : "kprobes", probe_name);
+       return append_to_file(tracefs_kprobe_events(), "-:%s/%s",
+                             retprobe ? "kretprobes" : "kprobes", probe_name);
 }
 
 static int determine_kprobe_perf_type_legacy(const char *probe_name, bool retprobe)
 {
        char file[256];
 
-       snprintf(file, sizeof(file),
-                "/sys/kernel/debug/tracing/events/%s/%s/id",
-                retprobe ? "kretprobes" : "kprobes", probe_name);
+       snprintf(file, sizeof(file), "%s/events/%s/%s/id",
+                tracefs_path(), retprobe ? "kretprobes" : "kprobes", probe_name);
 
        return parse_uint_from_file(file, "%d\n");
 }
@@ -9905,6 +9977,60 @@ err_clean_legacy:
        return err;
 }
 
+static const char *arch_specific_syscall_pfx(void)
+{
+#if defined(__x86_64__)
+       return "x64";
+#elif defined(__i386__)
+       return "ia32";
+#elif defined(__s390x__)
+       return "s390x";
+#elif defined(__s390__)
+       return "s390";
+#elif defined(__arm__)
+       return "arm";
+#elif defined(__aarch64__)
+       return "arm64";
+#elif defined(__mips__)
+       return "mips";
+#elif defined(__riscv)
+       return "riscv";
+#else
+       return NULL;
+#endif
+}
+
+static int probe_kern_syscall_wrapper(void)
+{
+       char syscall_name[64];
+       const char *ksys_pfx;
+
+       ksys_pfx = arch_specific_syscall_pfx();
+       if (!ksys_pfx)
+               return 0;
+
+       snprintf(syscall_name, sizeof(syscall_name), "__%s_sys_bpf", ksys_pfx);
+
+       if (determine_kprobe_perf_type() >= 0) {
+               int pfd;
+
+               pfd = perf_event_open_probe(false, false, syscall_name, 0, getpid(), 0);
+               if (pfd >= 0)
+                       close(pfd);
+
+               return pfd >= 0 ? 1 : 0;
+       } else { /* legacy mode */
+               char probe_name[128];
+
+               gen_kprobe_legacy_event_name(probe_name, sizeof(probe_name), syscall_name, 0);
+               if (add_kprobe_event_legacy(probe_name, false, syscall_name, 0) < 0)
+                       return 0;
+
+               (void)remove_kprobe_event_legacy(probe_name, false);
+               return 1;
+       }
+}
+
 struct bpf_link *
 bpf_program__attach_kprobe_opts(const struct bpf_program *prog,
                                const char *func_name,
@@ -9990,6 +10116,29 @@ struct bpf_link *bpf_program__attach_kprobe(const struct bpf_program *prog,
        return bpf_program__attach_kprobe_opts(prog, func_name, &opts);
 }
 
+struct bpf_link *bpf_program__attach_ksyscall(const struct bpf_program *prog,
+                                             const char *syscall_name,
+                                             const struct bpf_ksyscall_opts *opts)
+{
+       LIBBPF_OPTS(bpf_kprobe_opts, kprobe_opts);
+       char func_name[128];
+
+       if (!OPTS_VALID(opts, bpf_ksyscall_opts))
+               return libbpf_err_ptr(-EINVAL);
+
+       if (kernel_supports(prog->obj, FEAT_SYSCALL_WRAPPER)) {
+               snprintf(func_name, sizeof(func_name), "__%s_sys_%s",
+                        arch_specific_syscall_pfx(), syscall_name);
+       } else {
+               snprintf(func_name, sizeof(func_name), "__se_sys_%s", syscall_name);
+       }
+
+       kprobe_opts.retprobe = OPTS_GET(opts, retprobe, false);
+       kprobe_opts.bpf_cookie = OPTS_GET(opts, bpf_cookie, 0);
+
+       return bpf_program__attach_kprobe_opts(prog, func_name, &kprobe_opts);
+}
+
 /* Adapted from perf/util/string.c */
 static bool glob_match(const char *str, const char *pat)
 {
@@ -10160,6 +10309,27 @@ static int attach_kprobe(const struct bpf_program *prog, long cookie, struct bpf
        return libbpf_get_error(*link);
 }
 
+static int attach_ksyscall(const struct bpf_program *prog, long cookie, struct bpf_link **link)
+{
+       LIBBPF_OPTS(bpf_ksyscall_opts, opts);
+       const char *syscall_name;
+
+       *link = NULL;
+
+       /* no auto-attach for SEC("ksyscall") and SEC("kretsyscall") */
+       if (strcmp(prog->sec_name, "ksyscall") == 0 || strcmp(prog->sec_name, "kretsyscall") == 0)
+               return 0;
+
+       opts.retprobe = str_has_pfx(prog->sec_name, "kretsyscall/");
+       if (opts.retprobe)
+               syscall_name = prog->sec_name + sizeof("kretsyscall/") - 1;
+       else
+               syscall_name = prog->sec_name + sizeof("ksyscall/") - 1;
+
+       *link = bpf_program__attach_ksyscall(prog, syscall_name, &opts);
+       return *link ? 0 : -errno;
+}
+
 static int attach_kprobe_multi(const struct bpf_program *prog, long cookie, struct bpf_link **link)
 {
        LIBBPF_OPTS(bpf_kprobe_multi_opts, opts);
@@ -10208,9 +10378,7 @@ static void gen_uprobe_legacy_event_name(char *buf, size_t buf_sz,
 static inline int add_uprobe_event_legacy(const char *probe_name, bool retprobe,
                                          const char *binary_path, size_t offset)
 {
-       const char *file = "/sys/kernel/debug/tracing/uprobe_events";
-
-       return append_to_file(file, "%c:%s/%s %s:0x%zx",
+       return append_to_file(tracefs_uprobe_events(), "%c:%s/%s %s:0x%zx",
                              retprobe ? 'r' : 'p',
                              retprobe ? "uretprobes" : "uprobes",
                              probe_name, binary_path, offset);
@@ -10218,18 +10386,16 @@ static inline int add_uprobe_event_legacy(const char *probe_name, bool retprobe,
 
 static inline int remove_uprobe_event_legacy(const char *probe_name, bool retprobe)
 {
-       const char *file = "/sys/kernel/debug/tracing/uprobe_events";
-
-       return append_to_file(file, "-:%s/%s", retprobe ? "uretprobes" : "uprobes", probe_name);
+       return append_to_file(tracefs_uprobe_events(), "-:%s/%s",
+                             retprobe ? "uretprobes" : "uprobes", probe_name);
 }
 
 static int determine_uprobe_perf_type_legacy(const char *probe_name, bool retprobe)
 {
        char file[512];
 
-       snprintf(file, sizeof(file),
-                "/sys/kernel/debug/tracing/events/%s/%s/id",
-                retprobe ? "uretprobes" : "uprobes", probe_name);
+       snprintf(file, sizeof(file), "%s/events/%s/%s/id",
+                tracefs_path(), retprobe ? "uretprobes" : "uprobes", probe_name);
 
        return parse_uint_from_file(file, "%d\n");
 }
@@ -10545,7 +10711,10 @@ bpf_program__attach_uprobe_opts(const struct bpf_program *prog, pid_t pid,
        ref_ctr_off = OPTS_GET(opts, ref_ctr_offset, 0);
        pe_opts.bpf_cookie = OPTS_GET(opts, bpf_cookie, 0);
 
-       if (binary_path && !strchr(binary_path, '/')) {
+       if (!binary_path)
+               return libbpf_err_ptr(-EINVAL);
+
+       if (!strchr(binary_path, '/')) {
                err = resolve_full_path(binary_path, full_binary_path,
                                        sizeof(full_binary_path));
                if (err) {
@@ -10559,11 +10728,6 @@ bpf_program__attach_uprobe_opts(const struct bpf_program *prog, pid_t pid,
        if (func_name) {
                long sym_off;
 
-               if (!binary_path) {
-                       pr_warn("prog '%s': name-based attach requires binary_path\n",
-                               prog->name);
-                       return libbpf_err_ptr(-EINVAL);
-               }
                sym_off = elf_find_func_offset(binary_path, func_name);
                if (sym_off < 0)
                        return libbpf_err_ptr(sym_off);
@@ -10711,6 +10875,9 @@ struct bpf_link *bpf_program__attach_usdt(const struct bpf_program *prog,
                return libbpf_err_ptr(-EINVAL);
        }
 
+       if (!binary_path)
+               return libbpf_err_ptr(-EINVAL);
+
        if (!strchr(binary_path, '/')) {
                err = resolve_full_path(binary_path, resolved_path, sizeof(resolved_path));
                if (err) {
@@ -10776,9 +10943,8 @@ static int determine_tracepoint_id(const char *tp_category,
        char file[PATH_MAX];
        int ret;
 
-       ret = snprintf(file, sizeof(file),
-                      "/sys/kernel/debug/tracing/events/%s/%s/id",
-                      tp_category, tp_name);
+       ret = snprintf(file, sizeof(file), "%s/events/%s/%s/id",
+                      tracefs_path(), tp_category, tp_name);
        if (ret < 0)
                return -errno;
        if (ret >= sizeof(file)) {
@@ -11728,6 +11894,22 @@ int perf_buffer__buffer_fd(const struct perf_buffer *pb, size_t buf_idx)
        return cpu_buf->fd;
 }
 
+int perf_buffer__buffer(struct perf_buffer *pb, int buf_idx, void **buf, size_t *buf_size)
+{
+       struct perf_cpu_buf *cpu_buf;
+
+       if (buf_idx >= pb->cpu_cnt)
+               return libbpf_err(-EINVAL);
+
+       cpu_buf = pb->cpu_bufs[buf_idx];
+       if (!cpu_buf)
+               return libbpf_err(-ENOENT);
+
+       *buf = cpu_buf->base;
+       *buf_size = pb->mmap_size;
+       return 0;
+}
+
 /*
  * Consume data from perf ring buffer corresponding to slot *buf_idx* in
  * PERF_EVENT_ARRAY BPF map without waiting/polling. If there is no data to
index e4d5353..61493c4 100644 (file)
@@ -457,6 +457,52 @@ bpf_program__attach_kprobe_multi_opts(const struct bpf_program *prog,
                                      const char *pattern,
                                      const struct bpf_kprobe_multi_opts *opts);
 
+struct bpf_ksyscall_opts {
+	/* size of this struct, for forward/backward compatibility */
+       size_t sz;
+       /* custom user-provided value fetchable through bpf_get_attach_cookie() */
+       __u64 bpf_cookie;
+       /* attach as return probe? */
+       bool retprobe;
+       size_t :0;
+};
+#define bpf_ksyscall_opts__last_field retprobe
+
+/**
+ * @brief **bpf_program__attach_ksyscall()** attaches a BPF program
+ * to the kernel syscall handler of a specified syscall. Optionally, it's
+ * possible to request a retprobe that is triggered at syscall exit instead.
+ * It's also possible to associate a BPF cookie (through options).
+ *
+ * Libbpf will automatically determine the correct full kernel function name,
+ * which, depending on system architecture and kernel version/configuration,
+ * could be of the form __<arch>_sys_<syscall> or __se_sys_<syscall>, and will
+ * attach the specified program using the kprobe/kretprobe mechanism.
+ *
+ * **bpf_program__attach_ksyscall()** is an API counterpart of declarative
+ * **SEC("ksyscall/<syscall>")** annotation of BPF programs.
+ *
+ * At the moment **SEC("ksyscall")** and **bpf_program__attach_ksyscall()** do
+ * not handle all the calling convention quirks for mmap(), clone() and compat
+ * syscalls. They also only attach to "native" syscall interfaces. If the host
+ * system supports compat syscalls or defines 32-bit syscalls in a 64-bit
+ * kernel, libbpf won't attach to such syscall interfaces.
+ *
+ * These limitations may or may not change in the future. Therefore it is
+ * recommended to use SEC("kprobe") for these syscalls, or when working with
+ * compat and 32-bit interfaces is required.
+ *
+ * @param prog BPF program to attach
+ * @param syscall_name Symbolic name of the syscall (e.g., "bpf")
+ * @param opts Additional options (see **struct bpf_ksyscall_opts**)
+ * @return Reference to the newly created BPF link; NULL is returned on
+ * error, and the error code is stored in errno
+ */
+LIBBPF_API struct bpf_link *
+bpf_program__attach_ksyscall(const struct bpf_program *prog,
+                            const char *syscall_name,
+                            const struct bpf_ksyscall_opts *opts);
+
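To make the declarative and programmatic flavors concrete, below is a minimal
sketch (not part of this patch set; the program name, syscall choice, helper
name and cookie value are illustrative), mirroring how the selftests in this
series use the new section type.

BPF side, auto-attachable through SEC("ksyscall/<syscall>"); BPF_KSYSCALL from
bpf_tracing.h hides the syscall-wrapper argument-fetching details:

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_tracing.h>

	char _license[] SEC("license") = "GPL";

	SEC("ksyscall/prctl")
	int BPF_KSYSCALL(ksys_prctl, int option, unsigned long arg2)
	{
		/* option and arg2 are fetched according to the running
		 * kernel's syscall calling convention
		 */
		bpf_printk("prctl(%d) by pid %d", option,
			   (int)(bpf_get_current_pid_tgid() >> 32));
		return 0;
	}

User-space side, attaching the same program explicitly with options instead of
relying on skeleton auto-attach:

	#include <errno.h>
	#include <stdio.h>
	#include <bpf/libbpf.h>

	static int attach_prctl_probe(struct bpf_object *obj)
	{
		LIBBPF_OPTS(bpf_ksyscall_opts, opts,
			.retprobe = false,
			.bpf_cookie = 0x1234, /* readable via bpf_get_attach_cookie() */
		);
		struct bpf_program *prog;
		struct bpf_link *link;

		prog = bpf_object__find_program_by_name(obj, "ksys_prctl");
		if (!prog)
			return -ENOENT;

		link = bpf_program__attach_ksyscall(prog, "prctl", &opts);
		if (!link) {
			fprintf(stderr, "ksyscall attach failed: %d\n", -errno);
			return -errno;
		}
		return 0;
	}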
 struct bpf_uprobe_opts {
 	/* size of this struct, for forward/backward compatibility */
        size_t sz;
@@ -1053,6 +1099,22 @@ LIBBPF_API int perf_buffer__consume(struct perf_buffer *pb);
 LIBBPF_API int perf_buffer__consume_buffer(struct perf_buffer *pb, size_t buf_idx);
 LIBBPF_API size_t perf_buffer__buffer_cnt(const struct perf_buffer *pb);
 LIBBPF_API int perf_buffer__buffer_fd(const struct perf_buffer *pb, size_t buf_idx);
+/**
+ * @brief **perf_buffer__buffer()** returns the per-cpu raw mmap()'ed underlying
+ * memory region of the ring buffer.
+ * This ring buffer can be used to implement a custom events consumer.
+ * The ring buffer starts with the *struct perf_event_mmap_page*, which
+ * holds the ring buffer management fields; when accessing the header
+ * structure it's important to be SMP aware.
+ * You can refer to *perf_event_read_simple* for a simple example.
+ * @param pb the perf buffer structure
+ * @param buf_idx the buffer index to retrieve
+ * @param buf (out) gets the base pointer of the mmap()'ed memory
+ * @param buf_size (out) gets the size of the mmap()'ed region
+ * @return 0 on success, negative error code for failure
+ */
+LIBBPF_API int perf_buffer__buffer(struct perf_buffer *pb, int buf_idx, void **buf,
+                                  size_t *buf_size);
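As an illustrative sketch of the custom-consumer use case mentioned above
(assuming the standard perf ring-buffer layout; record parsing and wrap-around
handling are elided, see perf_event_read_simple() for the complete logic):

	#include <linux/perf_event.h>
	#include <bpf/libbpf.h>

	static int drain_cpu_buf(struct perf_buffer *pb, int buf_idx)
	{
		struct perf_event_mmap_page *header;
		void *base, *data;
		size_t mmap_size;
		__u64 head, tail;
		int err;

		err = perf_buffer__buffer(pb, buf_idx, &base, &mmap_size);
		if (err)
			return err;

		header = base;
		data = (char *)base + header->data_offset;

		/* acquire pairs with the kernel's release of data_head */
		head = __atomic_load_n(&header->data_head, __ATOMIC_ACQUIRE);
		tail = header->data_tail;

		while (tail != head) {
			struct perf_event_header *ev;

			/* records may wrap around the end of the data area;
			 * a real consumer copies wrapped records out first
			 */
			ev = (void *)((char *)data + (tail & (header->data_size - 1)));
			tail += ev->size; /* record parsing elided */
		}

		/* tell the kernel the consumed space can be reused */
		__atomic_store_n(&header->data_tail, tail, __ATOMIC_RELEASE);
		return 0;
	}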
 
 struct bpf_prog_linfo;
 struct bpf_prog_info;
index 94b589e..0625adb 100644 (file)
@@ -356,10 +356,12 @@ LIBBPF_0.8.0 {
 LIBBPF_1.0.0 {
        global:
                bpf_prog_query_opts;
+               bpf_program__attach_ksyscall;
                btf__add_enum64;
                btf__add_enum64_value;
                libbpf_bpf_attach_type_str;
                libbpf_bpf_link_type_str;
                libbpf_bpf_map_type_str;
                libbpf_bpf_prog_type_str;
+               perf_buffer__buffer;
 };
index 9cd7829..4135ae0 100644 (file)
@@ -108,9 +108,9 @@ static inline bool str_has_sfx(const char *str, const char *sfx)
        size_t str_len = strlen(str);
        size_t sfx_len = strlen(sfx);
 
-       if (sfx_len <= str_len)
-               return strcmp(str + str_len - sfx_len, sfx);
-       return false;
+       if (sfx_len > str_len)
+               return false;
+       return strcmp(str + str_len - sfx_len, sfx) == 0;
 }
 
 /* Symbol versioning is different between static and shared library.
@@ -352,6 +352,8 @@ enum kern_feature_id {
        FEAT_BPF_COOKIE,
        /* BTF_KIND_ENUM64 support and BTF_KIND_ENUM kflag support */
        FEAT_BTF_ENUM64,
+       /* Kernel uses syscall wrapper (CONFIG_ARCH_HAS_SYSCALL_WRAPPER) */
+       FEAT_SYSCALL_WRAPPER,
        __FEAT_CNT,
 };
 
index 4181fdd..4f2adc0 100644 (file)
@@ -6,7 +6,6 @@
 #include <linux/errno.h>
 #include <bpf/bpf_helpers.h>
 #include <bpf/bpf_tracing.h>
-#include <bpf/bpf_core_read.h>
 
 /* Below types and maps are internal implementation details of libbpf's USDT
  * support and are subject to change. Also, bpf_usdt_xxx() API helpers should
 #ifndef BPF_USDT_MAX_IP_CNT
 #define BPF_USDT_MAX_IP_CNT (4 * BPF_USDT_MAX_SPEC_CNT)
 #endif
-/* We use BPF CO-RE to detect support for BPF cookie from BPF side. This is
- * the only dependency on CO-RE, so if it's undesirable, user can override
- * BPF_USDT_HAS_BPF_COOKIE to specify whether to BPF cookie is supported or not.
- */
-#ifndef BPF_USDT_HAS_BPF_COOKIE
-#define BPF_USDT_HAS_BPF_COOKIE \
-       bpf_core_enum_value_exists(enum bpf_func_id___usdt, BPF_FUNC_get_attach_cookie___usdt)
-#endif
 
 enum __bpf_usdt_arg_type {
        BPF_USDT_ARG_CONST,
@@ -83,15 +74,12 @@ struct {
        __type(value, __u32);
 } __bpf_usdt_ip_to_spec_id SEC(".maps") __weak;
 
-/* don't rely on user's BPF code to have latest definition of bpf_func_id */
-enum bpf_func_id___usdt {
-       BPF_FUNC_get_attach_cookie___usdt = 0xBAD, /* value doesn't matter */
-};
+extern const _Bool LINUX_HAS_BPF_COOKIE __kconfig;
 
 static __always_inline
 int __bpf_usdt_spec_id(struct pt_regs *ctx)
 {
-       if (!BPF_USDT_HAS_BPF_COOKIE) {
+       if (!LINUX_HAS_BPF_COOKIE) {
                long ip = PT_REGS_IP(ctx);
                int *spec_id_ptr;
 
index e585e1c..792cb15 100644 (file)
@@ -148,13 +148,13 @@ static struct bin_attribute bin_attr_bpf_testmod_file __ro_after_init = {
        .write = bpf_testmod_test_write,
 };
 
-BTF_SET_START(bpf_testmod_check_kfunc_ids)
-BTF_ID(func, bpf_testmod_test_mod_kfunc)
-BTF_SET_END(bpf_testmod_check_kfunc_ids)
+BTF_SET8_START(bpf_testmod_check_kfunc_ids)
+BTF_ID_FLAGS(func, bpf_testmod_test_mod_kfunc)
+BTF_SET8_END(bpf_testmod_check_kfunc_ids)
 
 static const struct btf_kfunc_id_set bpf_testmod_kfunc_set = {
-       .owner     = THIS_MODULE,
-       .check_set = &bpf_testmod_check_kfunc_ids,
+       .owner = THIS_MODULE,
+       .set   = &bpf_testmod_check_kfunc_ids,
 };
 
 extern int bpf_fentry_test1(int a);
index 7ff5fa9..a33874b 100644 (file)
@@ -27,6 +27,7 @@
 #include "bpf_iter_test_kern5.skel.h"
 #include "bpf_iter_test_kern6.skel.h"
 #include "bpf_iter_bpf_link.skel.h"
+#include "bpf_iter_ksym.skel.h"
 
 static int duration;
 
@@ -1120,6 +1121,19 @@ static void test_link_iter(void)
        bpf_iter_bpf_link__destroy(skel);
 }
 
+static void test_ksym_iter(void)
+{
+       struct bpf_iter_ksym *skel;
+
+       skel = bpf_iter_ksym__open_and_load();
+       if (!ASSERT_OK_PTR(skel, "bpf_iter_ksym__open_and_load"))
+               return;
+
+       do_dummy_read(skel->progs.dump_ksym);
+
+       bpf_iter_ksym__destroy(skel);
+}
+
 #define CMP_BUFFER_SIZE 1024
 static char task_vma_output[CMP_BUFFER_SIZE];
 static char proc_maps_output[CMP_BUFFER_SIZE];
@@ -1267,4 +1281,6 @@ void test_bpf_iter(void)
                test_buf_neg_offset();
        if (test__start_subtest("link-iter"))
                test_link_iter();
+       if (test__start_subtest("ksym"))
+               test_ksym_iter();
 }
index dd30b1e..7a74a15 100644 (file)
@@ -2,13 +2,29 @@
 #include <test_progs.h>
 #include <network_helpers.h>
 #include "test_bpf_nf.skel.h"
+#include "test_bpf_nf_fail.skel.h"
+
+static char log_buf[1024 * 1024];
+
+struct {
+       const char *prog_name;
+       const char *err_msg;
+} test_bpf_nf_fail_tests[] = {
+       { "alloc_release", "kernel function bpf_ct_release args#0 expected pointer to STRUCT nf_conn but" },
+       { "insert_insert", "kernel function bpf_ct_insert_entry args#0 expected pointer to STRUCT nf_conn___init but" },
+       { "lookup_insert", "kernel function bpf_ct_insert_entry args#0 expected pointer to STRUCT nf_conn___init but" },
+       { "set_timeout_after_insert", "kernel function bpf_ct_set_timeout args#0 expected pointer to STRUCT nf_conn___init but" },
+       { "set_status_after_insert", "kernel function bpf_ct_set_status args#0 expected pointer to STRUCT nf_conn___init but" },
+       { "change_timeout_after_alloc", "kernel function bpf_ct_change_timeout args#0 expected pointer to STRUCT nf_conn but" },
+       { "change_status_after_alloc", "kernel function bpf_ct_change_status args#0 expected pointer to STRUCT nf_conn but" },
+};
 
 enum {
        TEST_XDP,
        TEST_TC_BPF,
 };
 
-void test_bpf_nf_ct(int mode)
+static void test_bpf_nf_ct(int mode)
 {
        struct test_bpf_nf *skel;
        int prog_fd, err;
@@ -39,14 +55,60 @@ void test_bpf_nf_ct(int mode)
        ASSERT_EQ(skel->bss->test_enonet_netns_id, -ENONET, "Test ENONET for bad but valid netns_id");
        ASSERT_EQ(skel->bss->test_enoent_lookup, -ENOENT, "Test ENOENT for failed lookup");
        ASSERT_EQ(skel->bss->test_eafnosupport, -EAFNOSUPPORT, "Test EAFNOSUPPORT for invalid len__tuple");
+       ASSERT_EQ(skel->data->test_alloc_entry, 0, "Test for alloc new entry");
+       ASSERT_EQ(skel->data->test_insert_entry, 0, "Test for insert new entry");
+       ASSERT_EQ(skel->data->test_succ_lookup, 0, "Test for successful lookup");
+       /* allow some tolerance for test_delta_timeout value to avoid races. */
+       ASSERT_GT(skel->bss->test_delta_timeout, 8, "Test for min ct timeout update");
+       ASSERT_LE(skel->bss->test_delta_timeout, 10, "Test for max ct timeout update");
+       /* expected status is IPS_SEEN_REPLY */
+       ASSERT_EQ(skel->bss->test_status, 2, "Test for ct status update ");
 end:
        test_bpf_nf__destroy(skel);
 }
 
+static void test_bpf_nf_ct_fail(const char *prog_name, const char *err_msg)
+{
+       LIBBPF_OPTS(bpf_object_open_opts, opts, .kernel_log_buf = log_buf,
+                                               .kernel_log_size = sizeof(log_buf),
+                                               .kernel_log_level = 1);
+       struct test_bpf_nf_fail *skel;
+       struct bpf_program *prog;
+       int ret;
+
+       skel = test_bpf_nf_fail__open_opts(&opts);
+       if (!ASSERT_OK_PTR(skel, "test_bpf_nf_fail__open"))
+               return;
+
+       prog = bpf_object__find_program_by_name(skel->obj, prog_name);
+       if (!ASSERT_OK_PTR(prog, "bpf_object__find_program_by_name"))
+               goto end;
+
+       bpf_program__set_autoload(prog, true);
+
+       ret = test_bpf_nf_fail__load(skel);
+       if (!ASSERT_ERR(ret, "test_bpf_nf_fail__load must fail"))
+               goto end;
+
+       if (!ASSERT_OK_PTR(strstr(log_buf, err_msg), "expected error message")) {
+               fprintf(stderr, "Expected: %s\n", err_msg);
+               fprintf(stderr, "Verifier: %s\n", log_buf);
+       }
+
+end:
+       test_bpf_nf_fail__destroy(skel);
+}
+
 void test_bpf_nf(void)
 {
+       int i;
        if (test__start_subtest("xdp-ct"))
                test_bpf_nf_ct(TEST_XDP);
        if (test__start_subtest("tc-bpf-ct"))
                test_bpf_nf_ct(TEST_TC_BPF);
+       for (i = 0; i < ARRAY_SIZE(test_bpf_nf_fail_tests); i++) {
+               if (test__start_subtest(test_bpf_nf_fail_tests[i].prog_name))
+                       test_bpf_nf_ct_fail(test_bpf_nf_fail_tests[i].prog_name,
+                                           test_bpf_nf_fail_tests[i].err_msg);
+       }
 }
index 941b010..ef6528b 100644 (file)
@@ -5338,7 +5338,7 @@ static void do_test_pprint(int test_num)
        ret = snprintf(pin_path, sizeof(pin_path), "%s/%s",
                       "/sys/fs/bpf", test->map_name);
 
-       if (CHECK(ret == sizeof(pin_path), "pin_path %s/%s is too long",
+       if (CHECK(ret >= sizeof(pin_path), "pin_path %s/%s is too long",
                  "/sys/fs/bpf", test->map_name)) {
                err = -1;
                goto done;
index 1931a15..63a51e9 100644 (file)
@@ -39,6 +39,7 @@ static struct test_case {
                       "CONFIG_STR=\"abracad\"\n"
                       "CONFIG_MISSING=0",
                .data = {
+                       .unkn_virt_val = 0,
                        .bpf_syscall = false,
                        .tristate_val = TRI_MODULE,
                        .bool_val = true,
@@ -121,7 +122,7 @@ static struct test_case {
 void test_core_extern(void)
 {
        const uint32_t kern_ver = get_kernel_version();
-       int err, duration = 0, i, j;
+       int err, i, j;
        struct test_core_extern *skel = NULL;
        uint64_t *got, *exp;
        int n = sizeof(*skel->data) / sizeof(uint64_t);
@@ -136,19 +137,17 @@ void test_core_extern(void)
                        continue;
 
                skel = test_core_extern__open_opts(&opts);
-               if (CHECK(!skel, "skel_open", "skeleton open failed\n"))
+               if (!ASSERT_OK_PTR(skel, "skel_open"))
                        goto cleanup;
                err = test_core_extern__load(skel);
                if (t->fails) {
-                       CHECK(!err, "skel_load",
-                             "shouldn't succeed open/load of skeleton\n");
+                       ASSERT_ERR(err, "skel_load_should_fail");
                        goto cleanup;
-               } else if (CHECK(err, "skel_load",
-                                "failed to open/load skeleton\n")) {
+               } else if (!ASSERT_OK(err, "skel_load")) {
                        goto cleanup;
                }
                err = test_core_extern__attach(skel);
-               if (CHECK(err, "attach_raw_tp", "failed attach: %d\n", err))
+               if (!ASSERT_OK(err, "attach_raw_tp"))
                        goto cleanup;
 
                usleep(1);
@@ -158,9 +157,7 @@ void test_core_extern(void)
                got = (uint64_t *)skel->data;
                exp = (uint64_t *)&t->data;
                for (j = 0; j < n; j++) {
-                       CHECK(got[j] != exp[j], "check_res",
-                             "result #%d: expected %llx, but got %llx\n",
-                              j, (__u64)exp[j], (__u64)got[j]);
+                       ASSERT_EQ(got[j], exp[j], "result");
                }
 cleanup:
                test_core_extern__destroy(skel);
index 335917d..d457a55 100644 (file)
@@ -364,6 +364,8 @@ static int get_syms(char ***symsp, size_t *cntp)
                        continue;
                if (!strncmp(name, "rcu_", 4))
                        continue;
+               if (!strcmp(name, "bpf_dispatcher_xdp_func"))
+                       continue;
                if (!strncmp(name, "__ftrace_invalid_address__",
                             sizeof("__ftrace_invalid_address__") - 1))
                        continue;
index eb5f7f5..1455911 100644 (file)
@@ -50,6 +50,13 @@ void test_ringbuf_multi(void)
        if (CHECK(!skel, "skel_open", "skeleton open failed\n"))
                return;
 
+       /* validate ringbuf size adjustment logic */
+       ASSERT_EQ(bpf_map__max_entries(skel->maps.ringbuf1), page_size, "rb1_size_before");
+       ASSERT_OK(bpf_map__set_max_entries(skel->maps.ringbuf1, page_size + 1), "rb1_resize");
+       ASSERT_EQ(bpf_map__max_entries(skel->maps.ringbuf1), 2 * page_size, "rb1_size_after");
+       ASSERT_OK(bpf_map__set_max_entries(skel->maps.ringbuf1, page_size), "rb1_reset");
+       ASSERT_EQ(bpf_map__max_entries(skel->maps.ringbuf1), page_size, "rb1_size_final");
+
        proto_fd = bpf_map_create(BPF_MAP_TYPE_RINGBUF, NULL, 0, 0, page_size, NULL);
        if (CHECK(proto_fd < 0, "bpf_map_create", "bpf_map_create failed\n"))
                goto cleanup;
@@ -65,6 +72,10 @@ void test_ringbuf_multi(void)
        close(proto_fd);
        proto_fd = -1;
 
+       /* make sure we can't resize ringbuf after object load */
+       if (!ASSERT_ERR(bpf_map__set_max_entries(skel->maps.ringbuf1, 3 * page_size), "rb1_resize_after_load"))
+               goto cleanup;
+
        /* only trigger BPF program for current process */
        skel->bss->pid = getpid();
 
index 180afd6..99dac52 100644 (file)
@@ -122,6 +122,8 @@ void test_skeleton(void)
 
        ASSERT_EQ(skel->bss->out_mostly_var, 123, "out_mostly_var");
 
+       ASSERT_EQ(bss->huge_arr[ARRAY_SIZE(bss->huge_arr) - 1], 123, "huge_arr");
+
        elf_bytes = test_skeleton__elf_bytes(&elf_bytes_sz);
        ASSERT_OK_PTR(elf_bytes, "elf_bytes");
        ASSERT_GE(elf_bytes_sz, 0, "elf_bytes_sz");
index 97ec8bc..e984660 100644 (file)
@@ -22,6 +22,7 @@
 #define BTF_F_NONAME BTF_F_NONAME___not_used
 #define BTF_F_PTR_RAW BTF_F_PTR_RAW___not_used
 #define BTF_F_ZERO BTF_F_ZERO___not_used
+#define bpf_iter__ksym bpf_iter__ksym___not_used
 #include "vmlinux.h"
 #undef bpf_iter_meta
 #undef bpf_iter__bpf_map
@@ -44,6 +45,7 @@
 #undef BTF_F_NONAME
 #undef BTF_F_PTR_RAW
 #undef BTF_F_ZERO
+#undef bpf_iter__ksym
 
 struct bpf_iter_meta {
        struct seq_file *seq;
@@ -151,3 +153,8 @@ enum {
        BTF_F_PTR_RAW   =       (1ULL << 2),
        BTF_F_ZERO      =       (1ULL << 3),
 };
+
+struct bpf_iter__ksym {
+       struct bpf_iter_meta *meta;
+       struct kallsym_iter *ksym;
+};
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_ksym.c b/tools/testing/selftests/bpf/progs/bpf_iter_ksym.c
new file mode 100644 (file)
index 0000000..285c008
--- /dev/null
@@ -0,0 +1,74 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022, Oracle and/or its affiliates. */
+#include "bpf_iter.h"
+#include <bpf/bpf_helpers.h>
+
+char _license[] SEC("license") = "GPL";
+
+unsigned long last_sym_value = 0;
+
+static inline char tolower(char c)
+{
+       if (c >= 'A' && c <= 'Z')
+               c += ('a' - 'A');
+       return c;
+}
+
+static inline char toupper(char c)
+{
+       if (c >= 'a' && c <= 'z')
+               c -= ('a' - 'A');
+       return c;
+}
+
+/* Dump symbols with their max size; the latter is calculated by caching the
+ * value (address) of symbol N so that, when iterating over symbol N+1, the
+ * max size of symbol N can be printed as address of N+1 - address of N.
+ */
+SEC("iter/ksym")
+int dump_ksym(struct bpf_iter__ksym *ctx)
+{
+       struct seq_file *seq = ctx->meta->seq;
+       struct kallsym_iter *iter = ctx->ksym;
+       __u32 seq_num = ctx->meta->seq_num;
+       unsigned long value;
+       char type;
+       int ret;
+
+       if (!iter)
+               return 0;
+
+       if (seq_num == 0) {
+               BPF_SEQ_PRINTF(seq, "ADDR TYPE NAME MODULE_NAME KIND MAX_SIZE\n");
+               return 0;
+       }
+       if (last_sym_value)
+               BPF_SEQ_PRINTF(seq, "0x%x\n", iter->value - last_sym_value);
+       else
+               BPF_SEQ_PRINTF(seq, "\n");
+
+       value = iter->show_value ? iter->value : 0;
+
+       last_sym_value = value;
+
+       type = iter->type;
+
+       if (iter->module_name[0]) {
+               type = iter->exported ? toupper(type) : tolower(type);
+               BPF_SEQ_PRINTF(seq, "0x%llx %c %s [ %s ] ",
+                              value, type, iter->name, iter->module_name);
+       } else {
+               BPF_SEQ_PRINTF(seq, "0x%llx %c %s ", value, type, iter->name);
+       }
+       if (!iter->pos_arch_end || iter->pos_arch_end > iter->pos)
+               BPF_SEQ_PRINTF(seq, "CORE ");
+       else if (!iter->pos_mod_end || iter->pos_mod_end > iter->pos)
+               BPF_SEQ_PRINTF(seq, "MOD ");
+       else if (!iter->pos_ftrace_mod_end || iter->pos_ftrace_mod_end > iter->pos)
+               BPF_SEQ_PRINTF(seq, "FTRACE_MOD ");
+       else if (!iter->pos_bpf_end || iter->pos_bpf_end > iter->pos)
+               BPF_SEQ_PRINTF(seq, "BPF ");
+       else
+               BPF_SEQ_PRINTF(seq, "KPROBE ");
+       return 0;
+}
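For completeness, a rough user-space reader for the iterator program above (a
sketch, not part of this series; the dump_kernel_syms() wrapper name is
illustrative) would follow the usual bpf_iter flow with the generated skeleton:

	#include <stdio.h>
	#include <unistd.h>
	#include <bpf/bpf.h>
	#include <bpf/libbpf.h>
	#include "bpf_iter_ksym.skel.h" /* generated skeleton, as in the selftest */

	int dump_kernel_syms(void)
	{
		struct bpf_iter_ksym *skel;
		struct bpf_link *link;
		char buf[4096];
		int iter_fd, len, err = 0;

		skel = bpf_iter_ksym__open_and_load();
		if (!skel)
			return -1;

		link = bpf_program__attach_iter(skel->progs.dump_ksym, NULL);
		if (!link) {
			err = -1;
			goto out;
		}

		iter_fd = bpf_iter_create(bpf_link__fd(link));
		if (iter_fd < 0) {
			err = -1;
			goto out_link;
		}

		/* reading the iterator fd runs dump_ksym() for each symbol */
		while ((len = read(iter_fd, buf, sizeof(buf) - 1)) > 0) {
			buf[len] = '\0';
			fputs(buf, stdout);
		}

		close(iter_fd);
	out_link:
		bpf_link__destroy(link);
	out:
		bpf_iter_ksym__destroy(skel);
		return err;
	}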
index 05838ed..e1e1189 100644 (file)
@@ -64,9 +64,9 @@ int BPF_KPROBE(handle_sys_prctl)
        return 0;
 }
 
-SEC("kprobe/" SYS_PREFIX "sys_prctl")
-int BPF_KPROBE_SYSCALL(prctl_enter, int option, unsigned long arg2,
-                      unsigned long arg3, unsigned long arg4, unsigned long arg5)
+SEC("ksyscall/prctl")
+int BPF_KSYSCALL(prctl_enter, int option, unsigned long arg2,
+                unsigned long arg3, unsigned long arg4, unsigned long arg5)
 {
        pid_t pid = bpf_get_current_pid_tgid() >> 32;
 
index f1c88ad..a1e45fe 100644 (file)
@@ -1,11 +1,10 @@
 // SPDX-License-Identifier: GPL-2.0
 // Copyright (c) 2017 Facebook
 
-#include <linux/ptrace.h>
-#include <linux/bpf.h>
+#include "vmlinux.h"
 #include <bpf/bpf_helpers.h>
 #include <bpf/bpf_tracing.h>
-#include <stdbool.h>
+#include <bpf/bpf_core_read.h>
 #include "bpf_misc.h"
 
 int kprobe_res = 0;
@@ -31,8 +30,8 @@ int handle_kprobe(struct pt_regs *ctx)
        return 0;
 }
 
-SEC("kprobe/" SYS_PREFIX "sys_nanosleep")
-int BPF_KPROBE(handle_kprobe_auto)
+SEC("ksyscall/nanosleep")
+int BPF_KSYSCALL(handle_kprobe_auto, struct __kernel_timespec *req, struct __kernel_timespec *rem)
 {
        kprobe2_res = 11;
        return 0;
@@ -56,11 +55,11 @@ int handle_kretprobe(struct pt_regs *ctx)
        return 0;
 }
 
-SEC("kretprobe/" SYS_PREFIX "sys_nanosleep")
-int BPF_KRETPROBE(handle_kretprobe_auto)
+SEC("kretsyscall/nanosleep")
+int BPF_KRETPROBE(handle_kretprobe_auto, int ret)
 {
        kretprobe2_res = 22;
-       return 0;
+       return ret;
 }
 
 SEC("uprobe")
index f00a973..196cd8d 100644 (file)
@@ -8,6 +8,8 @@
 #define EINVAL 22
 #define ENOENT 2
 
+extern unsigned long CONFIG_HZ __kconfig;
+
 int test_einval_bpf_tuple = 0;
 int test_einval_reserved = 0;
 int test_einval_netns_id = 0;
@@ -16,6 +18,11 @@ int test_eproto_l4proto = 0;
 int test_enonet_netns_id = 0;
 int test_enoent_lookup = 0;
 int test_eafnosupport = 0;
+int test_alloc_entry = -EINVAL;
+int test_insert_entry = -EAFNOSUPPORT;
+int test_succ_lookup = -ENOENT;
+u32 test_delta_timeout = 0;
+u32 test_status = 0;
 
 struct nf_conn;
 
@@ -26,31 +33,44 @@ struct bpf_ct_opts___local {
        u8 reserved[3];
 } __attribute__((preserve_access_index));
 
+struct nf_conn *bpf_xdp_ct_alloc(struct xdp_md *, struct bpf_sock_tuple *, u32,
+                                struct bpf_ct_opts___local *, u32) __ksym;
 struct nf_conn *bpf_xdp_ct_lookup(struct xdp_md *, struct bpf_sock_tuple *, u32,
                                  struct bpf_ct_opts___local *, u32) __ksym;
+struct nf_conn *bpf_skb_ct_alloc(struct __sk_buff *, struct bpf_sock_tuple *, u32,
+                                struct bpf_ct_opts___local *, u32) __ksym;
 struct nf_conn *bpf_skb_ct_lookup(struct __sk_buff *, struct bpf_sock_tuple *, u32,
                                  struct bpf_ct_opts___local *, u32) __ksym;
+struct nf_conn *bpf_ct_insert_entry(struct nf_conn *) __ksym;
 void bpf_ct_release(struct nf_conn *) __ksym;
+void bpf_ct_set_timeout(struct nf_conn *, u32) __ksym;
+int bpf_ct_change_timeout(struct nf_conn *, u32) __ksym;
+int bpf_ct_set_status(struct nf_conn *, u32) __ksym;
+int bpf_ct_change_status(struct nf_conn *, u32) __ksym;
 
 static __always_inline void
-nf_ct_test(struct nf_conn *(*func)(void *, struct bpf_sock_tuple *, u32,
-                                  struct bpf_ct_opts___local *, u32),
+nf_ct_test(struct nf_conn *(*lookup_fn)(void *, struct bpf_sock_tuple *, u32,
+                                       struct bpf_ct_opts___local *, u32),
+          struct nf_conn *(*alloc_fn)(void *, struct bpf_sock_tuple *, u32,
+                                      struct bpf_ct_opts___local *, u32),
           void *ctx)
 {
        struct bpf_ct_opts___local opts_def = { .l4proto = IPPROTO_TCP, .netns_id = -1 };
        struct bpf_sock_tuple bpf_tuple;
        struct nf_conn *ct;
+       int err;
 
        __builtin_memset(&bpf_tuple, 0, sizeof(bpf_tuple.ipv4));
 
-       ct = func(ctx, NULL, 0, &opts_def, sizeof(opts_def));
+       ct = lookup_fn(ctx, NULL, 0, &opts_def, sizeof(opts_def));
        if (ct)
                bpf_ct_release(ct);
        else
                test_einval_bpf_tuple = opts_def.error;
 
        opts_def.reserved[0] = 1;
-       ct = func(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def, sizeof(opts_def));
+       ct = lookup_fn(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def,
+                      sizeof(opts_def));
        opts_def.reserved[0] = 0;
        opts_def.l4proto = IPPROTO_TCP;
        if (ct)
@@ -59,21 +79,24 @@ nf_ct_test(struct nf_conn *(*func)(void *, struct bpf_sock_tuple *, u32,
                test_einval_reserved = opts_def.error;
 
        opts_def.netns_id = -2;
-       ct = func(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def, sizeof(opts_def));
+       ct = lookup_fn(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def,
+                      sizeof(opts_def));
        opts_def.netns_id = -1;
        if (ct)
                bpf_ct_release(ct);
        else
                test_einval_netns_id = opts_def.error;
 
-       ct = func(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def, sizeof(opts_def) - 1);
+       ct = lookup_fn(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def,
+                      sizeof(opts_def) - 1);
        if (ct)
                bpf_ct_release(ct);
        else
                test_einval_len_opts = opts_def.error;
 
        opts_def.l4proto = IPPROTO_ICMP;
-       ct = func(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def, sizeof(opts_def));
+       ct = lookup_fn(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def,
+                      sizeof(opts_def));
        opts_def.l4proto = IPPROTO_TCP;
        if (ct)
                bpf_ct_release(ct);
@@ -81,37 +104,75 @@ nf_ct_test(struct nf_conn *(*func)(void *, struct bpf_sock_tuple *, u32,
                test_eproto_l4proto = opts_def.error;
 
        opts_def.netns_id = 0xf00f;
-       ct = func(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def, sizeof(opts_def));
+       ct = lookup_fn(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def,
+                      sizeof(opts_def));
        opts_def.netns_id = -1;
        if (ct)
                bpf_ct_release(ct);
        else
                test_enonet_netns_id = opts_def.error;
 
-       ct = func(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def, sizeof(opts_def));
+       ct = lookup_fn(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def,
+                      sizeof(opts_def));
        if (ct)
                bpf_ct_release(ct);
        else
                test_enoent_lookup = opts_def.error;
 
-       ct = func(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4) - 1, &opts_def, sizeof(opts_def));
+       ct = lookup_fn(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4) - 1, &opts_def,
+                      sizeof(opts_def));
        if (ct)
                bpf_ct_release(ct);
        else
                test_eafnosupport = opts_def.error;
+
+       bpf_tuple.ipv4.saddr = bpf_get_prandom_u32(); /* src IP */
+       bpf_tuple.ipv4.daddr = bpf_get_prandom_u32(); /* dst IP */
+       bpf_tuple.ipv4.sport = bpf_get_prandom_u32(); /* src port */
+       bpf_tuple.ipv4.dport = bpf_get_prandom_u32(); /* dst port */
+
+       ct = alloc_fn(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def,
+                     sizeof(opts_def));
+       if (ct) {
+               struct nf_conn *ct_ins;
+
+               bpf_ct_set_timeout(ct, 10000);
+               bpf_ct_set_status(ct, IPS_CONFIRMED);
+
+               ct_ins = bpf_ct_insert_entry(ct);
+               if (ct_ins) {
+                       struct nf_conn *ct_lk;
+
+                       ct_lk = lookup_fn(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4),
+                                         &opts_def, sizeof(opts_def));
+                       if (ct_lk) {
+                               /* update ct entry timeout */
+                               bpf_ct_change_timeout(ct_lk, 10000);
+                               test_delta_timeout = ct_lk->timeout - bpf_jiffies64();
+                               test_delta_timeout /= CONFIG_HZ;
+                               test_status = IPS_SEEN_REPLY;
+                               bpf_ct_change_status(ct_lk, IPS_SEEN_REPLY);
+                               bpf_ct_release(ct_lk);
+                               test_succ_lookup = 0;
+                       }
+                       bpf_ct_release(ct_ins);
+                       test_insert_entry = 0;
+               }
+               test_alloc_entry = 0;
+       }
 }
 
 SEC("xdp")
 int nf_xdp_ct_test(struct xdp_md *ctx)
 {
-       nf_ct_test((void *)bpf_xdp_ct_lookup, ctx);
+       nf_ct_test((void *)bpf_xdp_ct_lookup, (void *)bpf_xdp_ct_alloc, ctx);
        return 0;
 }
 
 SEC("tc")
 int nf_skb_ct_test(struct __sk_buff *ctx)
 {
-       nf_ct_test((void *)bpf_skb_ct_lookup, ctx);
+       nf_ct_test((void *)bpf_skb_ct_lookup, (void *)bpf_skb_ct_alloc, ctx);
        return 0;
 }
 
diff --git a/tools/testing/selftests/bpf/progs/test_bpf_nf_fail.c b/tools/testing/selftests/bpf/progs/test_bpf_nf_fail.c
new file mode 100644 (file)
index 0000000..bf79af1
--- /dev/null
@@ -0,0 +1,134 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+
+struct nf_conn;
+
+struct bpf_ct_opts___local {
+       s32 netns_id;
+       s32 error;
+       u8 l4proto;
+       u8 reserved[3];
+} __attribute__((preserve_access_index));
+
+struct nf_conn *bpf_skb_ct_alloc(struct __sk_buff *, struct bpf_sock_tuple *, u32,
+                                struct bpf_ct_opts___local *, u32) __ksym;
+struct nf_conn *bpf_skb_ct_lookup(struct __sk_buff *, struct bpf_sock_tuple *, u32,
+                                 struct bpf_ct_opts___local *, u32) __ksym;
+struct nf_conn *bpf_ct_insert_entry(struct nf_conn *) __ksym;
+void bpf_ct_release(struct nf_conn *) __ksym;
+void bpf_ct_set_timeout(struct nf_conn *, u32) __ksym;
+int bpf_ct_change_timeout(struct nf_conn *, u32) __ksym;
+int bpf_ct_set_status(struct nf_conn *, u32) __ksym;
+int bpf_ct_change_status(struct nf_conn *, u32) __ksym;
+
+SEC("?tc")
+int alloc_release(struct __sk_buff *ctx)
+{
+       struct bpf_ct_opts___local opts = {};
+       struct bpf_sock_tuple tup = {};
+       struct nf_conn *ct;
+
+       ct = bpf_skb_ct_alloc(ctx, &tup, sizeof(tup.ipv4), &opts, sizeof(opts));
+       if (!ct)
+               return 0;
+       bpf_ct_release(ct);
+       return 0;
+}
+
+SEC("?tc")
+int insert_insert(struct __sk_buff *ctx)
+{
+       struct bpf_ct_opts___local opts = {};
+       struct bpf_sock_tuple tup = {};
+       struct nf_conn *ct;
+
+       ct = bpf_skb_ct_alloc(ctx, &tup, sizeof(tup.ipv4), &opts, sizeof(opts));
+       if (!ct)
+               return 0;
+       ct = bpf_ct_insert_entry(ct);
+       if (!ct)
+               return 0;
+       ct = bpf_ct_insert_entry(ct);
+       return 0;
+}
+
+SEC("?tc")
+int lookup_insert(struct __sk_buff *ctx)
+{
+       struct bpf_ct_opts___local opts = {};
+       struct bpf_sock_tuple tup = {};
+       struct nf_conn *ct;
+
+       ct = bpf_skb_ct_lookup(ctx, &tup, sizeof(tup.ipv4), &opts, sizeof(opts));
+       if (!ct)
+               return 0;
+       bpf_ct_insert_entry(ct);
+       return 0;
+}
+
+SEC("?tc")
+int set_timeout_after_insert(struct __sk_buff *ctx)
+{
+       struct bpf_ct_opts___local opts = {};
+       struct bpf_sock_tuple tup = {};
+       struct nf_conn *ct;
+
+       ct = bpf_skb_ct_alloc(ctx, &tup, sizeof(tup.ipv4), &opts, sizeof(opts));
+       if (!ct)
+               return 0;
+       ct = bpf_ct_insert_entry(ct);
+       if (!ct)
+               return 0;
+       bpf_ct_set_timeout(ct, 0);
+       return 0;
+}
+
+SEC("?tc")
+int set_status_after_insert(struct __sk_buff *ctx)
+{
+       struct bpf_ct_opts___local opts = {};
+       struct bpf_sock_tuple tup = {};
+       struct nf_conn *ct;
+
+       ct = bpf_skb_ct_alloc(ctx, &tup, sizeof(tup.ipv4), &opts, sizeof(opts));
+       if (!ct)
+               return 0;
+       ct = bpf_ct_insert_entry(ct);
+       if (!ct)
+               return 0;
+       bpf_ct_set_status(ct, 0);
+       return 0;
+}
+
+SEC("?tc")
+int change_timeout_after_alloc(struct __sk_buff *ctx)
+{
+       struct bpf_ct_opts___local opts = {};
+       struct bpf_sock_tuple tup = {};
+       struct nf_conn *ct;
+
+       ct = bpf_skb_ct_alloc(ctx, &tup, sizeof(tup.ipv4), &opts, sizeof(opts));
+       if (!ct)
+               return 0;
+       bpf_ct_change_timeout(ct, 0);
+       return 0;
+}
+
+SEC("?tc")
+int change_status_after_alloc(struct __sk_buff *ctx)
+{
+       struct bpf_ct_opts___local opts = {};
+       struct bpf_sock_tuple tup = {};
+       struct nf_conn *ct;
+
+       ct = bpf_skb_ct_alloc(ctx, &tup, sizeof(tup.ipv4), &opts, sizeof(opts));
+       if (!ct)
+               return 0;
+       bpf_ct_change_status(ct, 0);
+       return 0;
+}
+
+char _license[] SEC("license") = "GPL";
index 3ac3603..a3c7c10 100644 (file)
@@ -11,6 +11,7 @@
 static int (*bpf_missing_helper)(const void *arg1, int arg2) = (void *) 999;
 
 extern int LINUX_KERNEL_VERSION __kconfig;
+extern int LINUX_UNKNOWN_VIRTUAL_EXTERN __kconfig __weak;
 extern bool CONFIG_BPF_SYSCALL __kconfig; /* strong */
 extern enum libbpf_tristate CONFIG_TRISTATE __kconfig __weak;
 extern bool CONFIG_BOOL __kconfig __weak;
@@ -22,6 +23,7 @@ extern const char CONFIG_STR[8] __kconfig __weak;
 extern uint64_t CONFIG_MISSING __kconfig __weak;
 
 uint64_t kern_ver = -1;
+uint64_t unkn_virt_val = -1;
 uint64_t bpf_syscall = -1;
 uint64_t tristate_val = -1;
 uint64_t bool_val = -1;
@@ -38,6 +40,7 @@ int handle_sys_enter(struct pt_regs *ctx)
        int i;
 
        kern_ver = LINUX_KERNEL_VERSION;
+       unkn_virt_val = LINUX_UNKNOWN_VIRTUAL_EXTERN;
        bpf_syscall = CONFIG_BPF_SYSCALL;
        tristate_val = CONFIG_TRISTATE;
        bool_val = CONFIG_BOOL;
index 702578a..8e14950 100644 (file)
@@ -1,35 +1,20 @@
 // SPDX-License-Identifier: GPL-2.0
-
-#include <linux/ptrace.h>
-#include <linux/bpf.h>
-
-#include <netinet/in.h>
-
+#include "vmlinux.h"
 #include <bpf/bpf_helpers.h>
 #include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
 #include "bpf_misc.h"
 
 static struct sockaddr_in old;
 
-SEC("kprobe/" SYS_PREFIX "sys_connect")
-int BPF_KPROBE(handle_sys_connect)
+SEC("ksyscall/connect")
+int BPF_KSYSCALL(handle_sys_connect, int fd, struct sockaddr_in *uservaddr, int addrlen)
 {
-#if SYSCALL_WRAPPER == 1
-       struct pt_regs *real_regs;
-#endif
        struct sockaddr_in new;
-       void *ptr;
-
-#if SYSCALL_WRAPPER == 0
-       ptr = (void *)PT_REGS_PARM2(ctx);
-#else
-       real_regs = (struct pt_regs *)PT_REGS_PARM1(ctx);
-       bpf_probe_read_kernel(&ptr, sizeof(ptr), &PT_REGS_PARM2(real_regs));
-#endif
 
-       bpf_probe_read_user(&old, sizeof(old), ptr);
+       bpf_probe_read_user(&old, sizeof(old), uservaddr);
        __builtin_memset(&new, 0xab, sizeof(new));
-       bpf_probe_write_user(ptr, &new, sizeof(new));
+       bpf_probe_write_user(uservaddr, &new, sizeof(new));
 
        return 0;
 }
index 1b1187d..1a4e93f 100644 (file)
@@ -51,6 +51,8 @@ int out_dynarr[4] SEC(".data.dyn") = { 1, 2, 3, 4 };
 int read_mostly_var __read_mostly;
 int out_mostly_var;
 
+char huge_arr[16 * 1024 * 1024];
+
 SEC("raw_tp/sys_enter")
 int handler(const void *ctx)
 {
@@ -71,6 +73,8 @@ int handler(const void *ctx)
 
        out_mostly_var = read_mostly_var;
 
+       huge_arr[sizeof(huge_arr) - 1] = 123;
+
        return 0;
 }
 
index 125d872..ba48fcb 100644 (file)
@@ -239,7 +239,7 @@ bool parse_udp(void *data, void *data_end,
        udp = data + off;
 
        if (udp + 1 > data_end)
-               return 0;
+               return false;
        if (!is_icmp) {
                pckt->flow.port16[0] = udp->source;
                pckt->flow.port16[1] = udp->dest;
@@ -247,7 +247,7 @@ bool parse_udp(void *data, void *data_end,
                pckt->flow.port16[0] = udp->dest;
                pckt->flow.port16[1] = udp->source;
        }
-       return 1;
+       return true;
 }
 
 static __attribute__ ((noinline))
@@ -261,7 +261,7 @@ bool parse_tcp(void *data, void *data_end,
 
        tcp = data + off;
        if (tcp + 1 > data_end)
-               return 0;
+               return false;
        if (tcp->syn)
                pckt->flags |= (1 << 1);
        if (!is_icmp) {
@@ -271,7 +271,7 @@ bool parse_tcp(void *data, void *data_end,
                pckt->flow.port16[0] = tcp->dest;
                pckt->flow.port16[1] = tcp->source;
        }
-       return 1;
+       return true;
 }
 
 static __attribute__ ((noinline))
@@ -287,7 +287,7 @@ bool encap_v6(struct xdp_md *xdp, struct ctl_value *cval,
        void *data;
 
        if (bpf_xdp_adjust_head(xdp, 0 - (int)sizeof(struct ipv6hdr)))
-               return 0;
+               return false;
        data = (void *)(long)xdp->data;
        data_end = (void *)(long)xdp->data_end;
        new_eth = data;
@@ -295,7 +295,7 @@ bool encap_v6(struct xdp_md *xdp, struct ctl_value *cval,
        old_eth = data + sizeof(struct ipv6hdr);
        if (new_eth + 1 > data_end ||
            old_eth + 1 > data_end || ip6h + 1 > data_end)
-               return 0;
+               return false;
        memcpy(new_eth->eth_dest, cval->mac, 6);
        memcpy(new_eth->eth_source, old_eth->eth_dest, 6);
        new_eth->eth_proto = 56710;
@@ -314,7 +314,7 @@ bool encap_v6(struct xdp_md *xdp, struct ctl_value *cval,
        ip6h->saddr.in6_u.u6_addr32[2] = 3;
        ip6h->saddr.in6_u.u6_addr32[3] = ip_suffix;
        memcpy(ip6h->daddr.in6_u.u6_addr32, dst->dstv6, 16);
-       return 1;
+       return true;
 }
 
 static __attribute__ ((noinline))
@@ -335,7 +335,7 @@ bool encap_v4(struct xdp_md *xdp, struct ctl_value *cval,
        ip_suffix <<= 15;
        ip_suffix ^= pckt->flow.src;
        if (bpf_xdp_adjust_head(xdp, 0 - (int)sizeof(struct iphdr)))
-               return 0;
+               return false;
        data = (void *)(long)xdp->data;
        data_end = (void *)(long)xdp->data_end;
        new_eth = data;
@@ -343,7 +343,7 @@ bool encap_v4(struct xdp_md *xdp, struct ctl_value *cval,
        old_eth = data + sizeof(struct iphdr);
        if (new_eth + 1 > data_end ||
            old_eth + 1 > data_end || iph + 1 > data_end)
-               return 0;
+               return false;
        memcpy(new_eth->eth_dest, cval->mac, 6);
        memcpy(new_eth->eth_source, old_eth->eth_dest, 6);
        new_eth->eth_proto = 8;
@@ -367,8 +367,8 @@ bool encap_v4(struct xdp_md *xdp, struct ctl_value *cval,
                csum += *next_iph_u16++;
        iph->check = ~((csum & 0xffff) + (csum >> 16));
        if (bpf_xdp_adjust_head(xdp, (int)sizeof(struct iphdr)))
-               return 0;
-       return 1;
+               return false;
+       return true;
 }
 
 static __attribute__ ((noinline))
@@ -386,10 +386,10 @@ bool decap_v6(struct xdp_md *xdp, void **data, void **data_end, bool inner_v4)
        else
                new_eth->eth_proto = 56710;
        if (bpf_xdp_adjust_head(xdp, (int)sizeof(struct ipv6hdr)))
-               return 0;
+               return false;
        *data = (void *)(long)xdp->data;
        *data_end = (void *)(long)xdp->data_end;
-       return 1;
+       return true;
 }
 
 static __attribute__ ((noinline))
@@ -404,10 +404,10 @@ bool decap_v4(struct xdp_md *xdp, void **data, void **data_end)
        memcpy(new_eth->eth_dest, old_eth->eth_dest, 6);
        new_eth->eth_proto = 8;
        if (bpf_xdp_adjust_head(xdp, (int)sizeof(struct iphdr)))
-               return 0;
+               return false;
        *data = (void *)(long)xdp->data;
        *data_end = (void *)(long)xdp->data_end;
-       return 1;
+       return true;
 }
 
 static __attribute__ ((noinline))
index 392d28c..49936c4 100755 (executable)
@@ -106,9 +106,9 @@ bpftool prog loadall \
 bpftool map update pinned $BPF_DIR/maps/tx_port key 0 0 0 0 value 122 0 0 0
 bpftool map update pinned $BPF_DIR/maps/tx_port key 1 0 0 0 value 133 0 0 0
 bpftool map update pinned $BPF_DIR/maps/tx_port key 2 0 0 0 value 111 0 0 0
-ip link set dev veth1 xdp pinned $BPF_DIR/progs/redirect_map_0
-ip link set dev veth2 xdp pinned $BPF_DIR/progs/redirect_map_1
-ip link set dev veth3 xdp pinned $BPF_DIR/progs/redirect_map_2
+ip link set dev veth1 xdp pinned $BPF_DIR/progs/xdp_redirect_map_0
+ip link set dev veth2 xdp pinned $BPF_DIR/progs/xdp_redirect_map_1
+ip link set dev veth3 xdp pinned $BPF_DIR/progs/xdp_redirect_map_2
 
 ip -n ${NS1} link set dev veth11 xdp obj xdp_dummy.o sec xdp
 ip -n ${NS2} link set dev veth22 xdp obj xdp_tx.o sec xdp
index 2d00236..a535d41 100644 (file)
        .expected_insns = { PSEUDO_CALL_INSN() },
        .unexpected_insns = { HELPER_CALL_INSN() },
        .result = ACCEPT,
+       .prog_type = BPF_PROG_TYPE_TRACEPOINT,
        .func_info = { { 0, MAIN_TYPE }, { 16, CALLBACK_TYPE } },
        .func_info_cnt = 2,
        BTF_TYPES
index 743ed34..3fb4f69 100644 (file)
        .result = REJECT,
        .errstr = "variable ptr_ access var_off=(0x0; 0x7) disallowed",
 },
+{
+       "calls: invalid kfunc call: referenced arg needs refcounted PTR_TO_BTF_ID",
+       .insns = {
+       BPF_MOV64_REG(BPF_REG_1, BPF_REG_10),
+       BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -8),
+       BPF_ST_MEM(BPF_DW, BPF_REG_1, 0, 0),
+       BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, BPF_PSEUDO_KFUNC_CALL, 0, 0),
+       BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+       BPF_EXIT_INSN(),
+       BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
+       BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+       BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, BPF_PSEUDO_KFUNC_CALL, 0, 0),
+       BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_6, 16),
+       BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, BPF_PSEUDO_KFUNC_CALL, 0, 0),
+       BPF_MOV64_IMM(BPF_REG_0, 0),
+       BPF_EXIT_INSN(),
+       },
+       .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+       .fixup_kfunc_btf_id = {
+               { "bpf_kfunc_call_test_acquire", 3 },
+               { "bpf_kfunc_call_test_ref", 8 },
+               { "bpf_kfunc_call_test_ref", 10 },
+       },
+       .result_unpriv = REJECT,
+       .result = REJECT,
+       .errstr = "R1 must be referenced",
+},
+{
+       "calls: valid kfunc call: referenced arg needs refcounted PTR_TO_BTF_ID",
+       .insns = {
+       BPF_MOV64_REG(BPF_REG_1, BPF_REG_10),
+       BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -8),
+       BPF_ST_MEM(BPF_DW, BPF_REG_1, 0, 0),
+       BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, BPF_PSEUDO_KFUNC_CALL, 0, 0),
+       BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+       BPF_EXIT_INSN(),
+       BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
+       BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+       BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, BPF_PSEUDO_KFUNC_CALL, 0, 0),
+       BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+       BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, BPF_PSEUDO_KFUNC_CALL, 0, 0),
+       BPF_MOV64_IMM(BPF_REG_0, 0),
+       BPF_EXIT_INSN(),
+       },
+       .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+       .fixup_kfunc_btf_id = {
+               { "bpf_kfunc_call_test_acquire", 3 },
+               { "bpf_kfunc_call_test_ref", 8 },
+               { "bpf_kfunc_call_test_release", 10 },
+       },
+       .result_unpriv = REJECT,
+       .result = ACCEPT,
+},
 {
        "calls: basic sanity",
        .insns = {