git.monstr.eu Git - linux-2.6-microblaze.git/log

sched_ext: Implement sched_ext_ops.cpu_acquire/release()

Scheduler classes are strictly ordered and when a higher priority class has
tasks to run, the lower priority ones lose access to the CPU. Being able to
monitor and act on these events are necessary for use cases includling
strict core-scheduling and latency management.

This patch adds two operations ops.cpu_acquire() and .cpu_release(). The
former is invoked when a CPU becomes available to the BPF scheduler and the
opposite for the latter. This patch also implements
scx_bpf_reenqueue_local() which can be called from .cpu_release() to trigger
requeueing of all tasks in the local dsq of the CPU so that the tasks can be
reassigned to other available CPUs.

scx_pair is updated to use .cpu_acquire/release() along with
%SCX_KICK_WAIT to make the pair scheduling guarantee strict even when a CPU
is preempted by a higher priority scheduler class.

scx_qmap is updated to use .cpu_acquire/release() to empty the local
dsq of a preempted CPU. A similar approach can be adopted by BPF schedulers
that want to have a tight control over latency.

v4: Use the new SCX_KICK_IDLE to wake up a CPU after re-enqueueing.

v3: Drop the const qualifier from scx_cpu_release_args.task. BPF enforces
    access control through the verifier, so the qualifier isn't actually
    operative and only gets in the way when interacting with various
    helpers.

v2: Add p->scx.kf_mask annotation to allow calling scx_bpf_reenqueue_local()
    from ops.cpu_release() nested inside ops.init() and other sleepable
    operations.

Signed-off-by: David Vernet <dvernet@meta.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>

sched_ext: Implement SCX_KICK_WAIT

If set when calling scx_bpf_kick_cpu(), the invoking CPU will busy wait for
the kicked cpu to enter the scheduler. See the following for example usage:

https://github.com/sched-ext/scx/blob/main/scheds/c/scx_pair.bpf.c

v2: - Updated to fit the updated kick_cpus_irq_workfn() implementation.

- Include SCX_KICK_WAIT related information in debug dump.

Signed-off-by: David Vernet <dvernet@meta.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>

sched_ext: Track tasks that are subjects of the in-flight SCX operation

When some SCX operations are in flight, it is known that the subject task's
rq lock is held throughout which makes it safe to access certain fields of
the task - e.g. its current task_group. We want to add SCX kfunc helpers
that can make use of this guarantee - e.g. to help determining the currently
associated CPU cgroup from the task's current task_group.

As it'd be dangerous call such a helper on a task which isn't rq lock
protected, the helper should be able to verify the input task and reject
accordingly. This patch adds sched_ext_entity.kf_tasks[] that track the
tasks which are currently being operated on by a terminal SCX operation. The
new SCX_CALL_OP_[2]TASK[_RET]() can be used when invoking SCX operations
which take tasks as arguments and the scx_kf_allowed_on_arg_tasks() can be
used by kfunc helpers to verify the input task status.

Note that as sched_ext_entity.kf_tasks[] can't handle nesting, the tracking
is currently only limited to terminal SCX operations. If needed in the
future, this restriction can be removed by moving the tracking to the task
side with a couple per-task counters.

v2: Updated to reflect the addition of SCX_KF_SELECT_CPU.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>

sched_ext: Implement tickless support

Allow BPF schedulers to indicate tickless operation by setting p->scx.slice
to SCX_SLICE_INF. A CPU whose current task has infinte slice goes into
tickless operation.

scx_central is updated to use tickless operations for all tasks and
instead use a BPF timer to expire slices. This also uses the SCX_ENQ_PREEMPT
and task state tracking added by the previous patches.

Currently, there is no way to pin the timer on the central CPU, so it may
end up on one of the worker CPUs; however, outside of that, the worker CPUs
can go tickless both while running sched_ext tasks and idling.

With schbench running, scx_central shows:

  root@test ~# grep ^LOC /proc/interrupts; sleep 10; grep ^LOC /proc/interrupts
  LOC:     142024        656        664        449   Local timer interrupts
  LOC:     161663        663        665        449   Local timer interrupts

Without it:

  root@test ~ [SIGINT]# grep ^LOC /proc/interrupts; sleep 10; grep ^LOC /proc/interrupts
  LOC:     188778       3142       3793       3993   Local timer interrupts
  LOC:     198993       5314       6323       6438   Local timer interrupts

While scx_central itself is too barebone to be useful as a
production scheduler, a more featureful central scheduler can be built using
the same approach. Google's experience shows that such an approach can have
significant benefits for certain applications such as VM hosting.

v4: Allow operation even if BPF_F_TIMER_CPU_PIN is not available.

v3: Pin the central scheduler's timer on the central_cpu using
    BPF_F_TIMER_CPU_PIN.

v2: Convert to BPF inline iterators.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>

sched_ext: Add task state tracking operations

Being able to track the task runnable and running state transitions are
useful for a variety of purposes including latency tracking and load factor
calculation.

Currently, BPF schedulers don't have a good way of tracking these
transitions. Becoming runnable can be determined from ops.enqueue() but
becoming quiescent can only be inferred from the lack of subsequent enqueue.
Also, as the local dsq can have multiple tasks and some events are handled
in the sched_ext core, it's difficult to determine when a given task starts
and stops executing.

This patch adds sched_ext_ops.runnable(), .running(), .stopping() and
.quiescent() operations to track the task runnable and running state
transitions. They're mostly self explanatory; however, we want to ensure
that running <-> stopping transitions are always contained within runnable
<-> quiescent transitions which is a bit different from how the scheduler
core behaves. This adds a bit of complication. See the comment in
dequeue_task_scx().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>

sched_ext: Make watchdog handle ops.dispatch() looping stall

The dispatch path retries if the local DSQ is still empty after
ops.dispatch() either dispatched or consumed a task. This is both out of
necessity and for convenience. It has to retry because the dispatch path
might lose the tasks to dequeue while the rq lock is released while trying
to migrate tasks across CPUs, and the retry mechanism makes ops.dispatch()
implementation easier as it only needs to make some forward progress each
iteration.

However, this makes it possible for ops.dispatch() to stall CPUs by
repeatedly dispatching ineligible tasks. If all CPUs are stalled that way,
the watchdog or sysrq handler can't run and the system can't be saved. Let's
address the issue by breaking out of the dispatch loop after 32 iterations.

It is unlikely but not impossible for ops.dispatch() to legitimately go over
the iteration limit. We want to come back to the dispatch path in such cases
as not doing so risks stalling the CPU by idling with runnable tasks
pending. As the previous task is still current in balance_scx(),
resched_curr() doesn't do anything - it will just get cleared. Let's instead
use scx_kick_bpf() which will trigger reschedule after switching to the next
task which will likely be the idle task.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>

sched_ext: Add a central scheduler which makes all scheduling decisions on one CPU

This patch adds a new example scheduler, scx_central, which demonstrates
central scheduling where one CPU is responsible for making all scheduling
decisions in the system using scx_bpf_kick_cpu(). The central CPU makes
scheduling decisions for all CPUs in the system, queues tasks on the
appropriate local dsq's and preempts the worker CPUs. The worker CPUs in
turn preempt the central CPU when it needs tasks to run.

Currently, every CPU depends on its own tick to expire the current task. A
follow-up patch implementing tickless support for sched_ext will allow the
worker CPUs to go full tickless so that they can run completely undisturbed.

v3: - Kumar fixed a bug where the dispatch path could overflow the dispatch
      buffer if too many are dispatched to the fallback DSQ.

    - Use the new SCX_KICK_IDLE to wake up non-central CPUs.

    - Dropped '-p' option.

v2: - Use RESIZABLE_ARRAY() instead of fixed MAX_CPUS and use SCX_BUG[_ON]()
      to simplify error handling.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Julia Lawall <julia.lawall@inria.fr>

sched_ext: Implement scx_bpf_kick_cpu() and task preemption support

It's often useful to wake up and/or trigger reschedule on other CPUs. This
patch adds scx_bpf_kick_cpu() kfunc helper that BPF scheduler can call to
kick the target CPU into the scheduling path.

As a sched_ext task relinquishes its CPU only after its slice is depleted,
this patch also adds SCX_KICK_PREEMPT and SCX_ENQ_PREEMPT which clears the
slice of the target CPU's current task to guarantee that sched_ext's
scheduling path runs on the CPU.

If SCX_KICK_IDLE is specified, the target CPU is kicked iff the CPU is idle
to guarantee that the target CPU will go through at least one full sched_ext
scheduling cycle after the kicking. This can be used to wake up idle CPUs
without incurring unnecessary overhead if it isn't currently idle.

As a demonstration of how backward compatibility can be supported using BPF
CO-RE, tools/sched_ext/include/scx/compat.bpf.h is added. It provides
__COMPAT_scx_bpf_kick_cpu_IDLE() which uses SCX_KICK_IDLE if available or
becomes a regular kicking otherwise. This allows schedulers to use the new
SCX_KICK_IDLE while maintaining support for older kernels. The plan is to
temporarily use compat helpers to ease API updates and drop them after a few
kernel releases.

v5: - SCX_KICK_IDLE added. Note that this also adds a compat mechanism for
      schedulers so that they can support kernels without SCX_KICK_IDLE.
      This is useful as a demonstration of how new feature flags can be
      added in a backward compatible way.

    - kick_cpus_irq_workfn() reimplemented so that it touches the pending
      cpumasks only as necessary to reduce kicking overhead on machines with
      a lot of CPUs.

    - tools/sched_ext/include/scx/compat.bpf.h added.

v4: - Move example scheduler to its own patch.

v3: - Make scx_example_central switch all tasks by default.

    - Convert to BPF inline iterators.

v2: - Julia Lawall reported that scx_example_central can overflow the
      dispatch buffer and malfunction. As scheduling for other CPUs can't be
      handled by the automatic retry mechanism, fix by implementing an
      explicit overflow and retry handling.

    - Updated to use generic BPF cpumask helpers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>

tools/sched_ext: Add scx_show_state.py

There are states which are interesting but don't quite fit the interface
exposed under /sys/kernel/sched_ext. Add tools/scx_show_state.py to show
them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>

sched_ext: Print debug dump after an error exit

If a BPF scheduler triggers an error, the scheduler is aborted and the
system is reverted to the built-in scheduler. In the process, a lot of
information which may be useful for figuring out what happened can be lost.

This patch adds debug dump which captures information which may be useful
for debugging including runqueue and runnable thread states at the time of
failure. The following shows a debug dump after triggering the watchdog:

  root@test ~# os/work/tools/sched_ext/build/bin/scx_qmap -t 100
  stats  : enq=1 dsp=0 delta=1 deq=0
  stats  : enq=90 dsp=90 delta=0 deq=0
  stats  : enq=156 dsp=156 delta=0 deq=0
  stats  : enq=218 dsp=218 delta=0 deq=0
  stats  : enq=255 dsp=255 delta=0 deq=0
  stats  : enq=271 dsp=271 delta=0 deq=0
  stats  : enq=284 dsp=284 delta=0 deq=0
  stats  : enq=293 dsp=293 delta=0 deq=0

  DEBUG DUMP
  ================================================================================

  kworker/u32:12[320] triggered exit kind 1026:
    runnable task stall (stress[1530] failed to run for 6.841s)

  Backtrace:
    scx_watchdog_workfn+0x136/0x1c0
    process_scheduled_works+0x2b5/0x600
    worker_thread+0x269/0x360
    kthread+0xeb/0x110
    ret_from_fork+0x36/0x40
    ret_from_fork_asm+0x1a/0x30

  QMAP FIFO[0]:
  QMAP FIFO[1]:
  QMAP FIFO[2]: 1436
  QMAP FIFO[3]:
  QMAP FIFO[4]:

  CPU states
  ----------

  CPU 0   : nr_run=1 ops_qseq=244
    curr=swapper/0[0] class=idle_sched_class

    QMAP: dsp_idx=1 dsp_cnt=0

    R stress[1530] -6841ms
scx_state/flags=3/0x1 ops_state/qseq=2/20
sticky/holding_cpu=-1/-1 dsq_id=(n/a)
cpus=ff

      QMAP: force_local=0

      asm_sysvec_apic_timer_interrupt+0x16/0x20

  CPU 2   : nr_run=2 ops_qseq=142
    curr=swapper/2[0] class=idle_sched_class

    QMAP: dsp_idx=1 dsp_cnt=0

    R sshd[1703] -5905ms
scx_state/flags=3/0x9 ops_state/qseq=2/88
sticky/holding_cpu=-1/-1 dsq_id=(n/a)
cpus=ff

      QMAP: force_local=1

      __x64_sys_ppoll+0xf6/0x120
      do_syscall_64+0x7b/0x150
      entry_SYSCALL_64_after_hwframe+0x76/0x7e

    R fish[1539] -4141ms
scx_state/flags=3/0x9 ops_state/qseq=2/124
sticky/holding_cpu=-1/-1 dsq_id=(n/a)
cpus=ff

      QMAP: force_local=1

      futex_wait+0x60/0xe0
      do_futex+0x109/0x180
      __x64_sys_futex+0x117/0x190
      do_syscall_64+0x7b/0x150
      entry_SYSCALL_64_after_hwframe+0x76/0x7e

  CPU 3   : nr_run=2 ops_qseq=162
    curr=kworker/u32:12[320] class=ext_sched_class

    QMAP: dsp_idx=1 dsp_cnt=0

   *R kworker/u32:12[320] +0ms
scx_state/flags=3/0xd ops_state/qseq=0/0
sticky/holding_cpu=-1/-1 dsq_id=(n/a)
cpus=ff

      QMAP: force_local=0

      scx_dump_state+0x613/0x6f0
      scx_ops_error_irq_workfn+0x1f/0x40
      irq_work_run_list+0x82/0xd0
      irq_work_run+0x14/0x30
      __sysvec_irq_work+0x40/0x140
      sysvec_irq_work+0x60/0x70
      asm_sysvec_irq_work+0x16/0x20
      scx_watchdog_workfn+0x15f/0x1c0
      process_scheduled_works+0x2b5/0x600
      worker_thread+0x269/0x360
      kthread+0xeb/0x110
      ret_from_fork+0x36/0x40
      ret_from_fork_asm+0x1a/0x30

    R kworker/3:2[1436] +0ms
scx_state/flags=3/0x9 ops_state/qseq=2/160
sticky/holding_cpu=-1/-1 dsq_id=(n/a)
cpus=08

      QMAP: force_local=0

      kthread+0xeb/0x110
      ret_from_fork+0x36/0x40
      ret_from_fork_asm+0x1a/0x30

  CPU 7   : nr_run=0 ops_qseq=76
    curr=swapper/7[0] class=idle_sched_class

  ================================================================================

  EXIT: runnable task stall (stress[1530] failed to run for 6.841s)

It shows that CPU 3 was running the watchdog when it triggered the error
condition and the scx_qmap thread has been queued on CPU 0 for over 5
seconds but failed to run. It also prints out scx_qmap specific information
- e.g. which tasks are queued on each FIFO and so on using the dump_*() ops.
This dump has proved pretty useful for developing and debugging BPF
schedulers.

Debug dump is generated automatically when the BPF scheduler exits due to an
error. The debug buffer used in such cases is determined by
sched_ext_ops.exit_dump_len and defaults to 32k. If the debug dump overruns
the available buffer, the output is truncated and marked accordingly.

Debug dump output can also be read through the sched_ext_dump tracepoint.
When read through the tracepoint, there is no length limit.

SysRq-D can be used to trigger debug dump at any time while a BPF scheduler
is loaded. This is non-destructive - the scheduler keeps running afterwards.
The output can be read through the sched_ext_dump tracepoint.

v2: - The size of exit debug dump buffer can now be customized using
      sched_ext_ops.exit_dump_len.

    - sched_ext_ops.dump*() added to enable dumping of BPF scheduler
      specific information.

    - Tracpoint output and SysRq-D triggering added.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>

sched_ext: Print sched_ext info when dumping stack

It would be useful to see what the sched_ext scheduler state is, and what
scheduler is running, when we're dumping a task's stack. This patch
therefore adds a new print_scx_info() function that's called in the same
context as print_worker_info() and print_stop_info(). An example dump
follows.

  BUG: kernel NULL pointer dereference, address: 0000000000000999
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  PGD 0 P4D 0
  Oops: 0002 [#1] PREEMPT SMP
  CPU: 13 PID: 2047 Comm: insmod Tainted: G           O       6.6.0-work-10323-gb58d4cae8e99-dirty #34
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS unknown 2/2/2022
  Sched_ext: qmap (enabled+all), task: runnable_at=-17ms
  RIP: 0010:init_module+0x9/0x1000 [test_module]
  ...

v3: - scx_ops_enable_state_str[] definition moved to an earlier patch as
      it's now used by core implementation.

    - Convert jiffy delta to msecs using jiffies_to_msecs() instead of
      multiplying by (HZ / MSEC_PER_SEC). The conversion is implemented in
      jiffies_delta_msecs().

v2: - We are now using scx_ops_enable_state_str[] outside
      CONFIG_SCHED_DEBUG. Move it outside of CONFIG_SCHED_DEBUG and to the
      top. This was reported by Changwoo and Andrea.

Signed-off-by: David Vernet <void@manifault.com>
Reported-by: Changwoo Min <changwoo@igalia.com>
Reported-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

sched_ext: Allow BPF schedulers to disallow specific tasks from joining SCHED_EXT

BPF schedulers might not want to schedule certain tasks - e.g. kernel
threads. This patch adds p->scx.disallow which can be set by BPF schedulers
in such cases. The field can be changed anytime and setting it in
ops.prep_enable() guarantees that the task can never be scheduled by
sched_ext.

scx_qmap is updated with the -d option to disallow a specific PID:

  # echo $$
  1092
  # grep -E '(policy)|(ext\.enabled)' /proc/self/sched
  policy                                       :                    0
  ext.enabled                                  :                    0
  # ./set-scx 1092
  # grep -E '(policy)|(ext\.enabled)' /proc/self/sched
  policy                                       :                    7
  ext.enabled                                  :                    0

Run "scx_qmap -p -d 1092" in another terminal.

  # cat /sys/kernel/sched_ext/nr_rejected
  1
  # grep -E '(policy)|(ext\.enabled)' /proc/self/sched
  policy                                       :                    0
  ext.enabled                                  :                    0
  # ./set-scx 1092
  setparam failed for 1092 (Permission denied)

- v4: Refreshed on top of tip:sched/core.

- v3: Update description to reflect /sys/kernel/sched_ext interface change.

- v2: Use atomic_long_t instead of atomic64_t for scx_kick_cpus_pnt_seqs to
      accommodate 32bit archs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Barret Rhoden <brho@google.com>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>

sched_ext: Implement runnable task stall watchdog

The most common and critical way that a BPF scheduler can misbehave is by
failing to run runnable tasks for too long. This patch implements a
watchdog.

* All tasks record when they become runnable.

* A watchdog work periodically scans all runnable tasks. If any task has
  stayed runnable for too long, the BPF scheduler is aborted.

* scheduler_tick() monitors whether the watchdog itself is stuck. If so, the
  BPF scheduler is aborted.

Because the watchdog only scans the tasks which are currently runnable and
usually very infrequently, the overhead should be negligible.
scx_qmap is updated so that it can be told to stall user and/or
kernel tasks.

A detected task stall looks like the following:

sched_ext: BPF scheduler "qmap" errored, disabling
sched_ext: runnable task stall (dbus-daemon[953] failed to run for 6.478s)
    scx_check_timeout_workfn+0x10e/0x1b0
    process_one_work+0x287/0x560
    worker_thread+0x234/0x420
    kthread+0xe9/0x100
    ret_from_fork+0x1f/0x30

A detected watchdog stall:

sched_ext: BPF scheduler "qmap" errored, disabling
sched_ext: runnable task stall (watchdog failed to check in for 5.001s)
    scheduler_tick+0x2eb/0x340
    update_process_times+0x7a/0x90
    tick_sched_timer+0xd8/0x130
    __hrtimer_run_queues+0x178/0x3b0
    hrtimer_interrupt+0xfc/0x390
    __sysvec_apic_timer_interrupt+0xb7/0x2b0
    sysvec_apic_timer_interrupt+0x90/0xb0
    asm_sysvec_apic_timer_interrupt+0x1b/0x20
    default_idle+0x14/0x20
    arch_cpu_idle+0xf/0x20
    default_idle_call+0x50/0x90
    do_idle+0xe8/0x240
    cpu_startup_entry+0x1d/0x20
    kernel_init+0x0/0x190
    start_kernel+0x0/0x392
    start_kernel+0x324/0x392
    x86_64_start_reservations+0x2a/0x2c
    x86_64_start_kernel+0x104/0x109
    secondary_startup_64_no_verify+0xce/0xdb

Note that this patch exposes scx_ops_error[_type]() in kernel/sched/ext.h to
inline scx_notify_sched_tick().

v4: - While disabling, cancel_delayed_work_sync(&scx_watchdog_work) was
      being called before forward progress was guaranteed and thus could
      lead to system lockup. Relocated.

    - While enabling, it was comparing msecs against jiffies without
      conversion leading to spurious load failures on lower HZ kernels.
      Fixed.

    - runnable list management is now used by core bypass logic and moved to
      the patch implementing sched_ext core.

v3: - bpf_scx_init_member() was incorrectly comparing ops->timeout_ms
      against SCX_WATCHDOG_MAX_TIMEOUT which is in jiffies without
      conversion leading to spurious load failures in lower HZ kernels.
      Fixed.

v2: - Julia Lawall noticed that the watchdog code was mixing msecs and
      jiffies. Fix by using jiffies for everything.

Signed-off-by: David Vernet <dvernet@meta.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
Cc: Julia Lawall <julia.lawall@inria.fr>

sched_ext: Add sysrq-S which disables the BPF scheduler

This enables the admin to abort the BPF scheduler and revert to CFS anytime.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>

sched_ext: Add scx_simple and scx_example_qmap example schedulers

Add two simple example BPF schedulers - simple and qmap.

* simple: In terms of scheduling, it behaves identical to not having any
  operation implemented at all. The two operations it implements are only to
  improve visibility and exit handling. On certain homogeneous
  configurations, this actually can perform pretty well.

* qmap: A fixed five level priority scheduler to demonstrate queueing PIDs
  on BPF maps for scheduling. While not very practical, this is useful as a
  simple example and will be used to demonstrate different features.

v7: - Compat helpers stripped out in prepartion of upstreaming as the
      upstreamed patchset will be the baselinfe. Utility macros that can be
      used to implement compat features are kept.

    - Explicitly disable map autoattach on struct_ops to avoid trying to
      attach twice while maintaining compatbility with older libbpf.

v6: - Common header files reorganized and cleaned up. Compat helpers are
      added to demonstrate how schedulers can maintain backward
      compatibility with older kernels while making use of newly added
      features.

    - simple_select_cpu() added to keep track of the number of local
      dispatches. This is needed because the default ops.select_cpu()
      implementation is updated to dispatch directly and won't call
      ops.enqueue().

    - Updated to reflect the sched_ext API changes. Switching all tasks is
      the default behavior now and scx_qmap supports partial switching when
      `-p` is specified.

    - tools/sched_ext/Kconfig dropped. This will be included in the doc
      instead.

v5: - Improve Makefile. Build artifects are now collected into a separate
      dir which change be changed. Install and help targets are added and
      clean actually cleans everything.

    - MEMBER_VPTR() improved to improve access to structs. ARRAY_ELEM_PTR()
      and RESIZEABLE_ARRAY() are added to support resizable arrays in .bss.

    - Add scx_common.h which provides common utilities to user code such as
      SCX_BUG[_ON]() and RESIZE_ARRAY().

    - Use SCX_BUG[_ON]() to simplify error handling.

v4: - Dropped _example prefix from scheduler names.

v3: - Rename scx_example_dummy to scx_example_simple and restructure a bit
      to ease later additions. Comment updates.

    - Added declarations for BPF inline iterators. In the future, hopefully,
      these will be consolidated into a generic BPF header so that they
      don't need to be replicated here.

v2: - Updated with the generic BPF cpumask helpers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>

sched_ext: Implement BPF extensible scheduler class

Implement a new scheduler class sched_ext (SCX), which allows scheduling
policies to be implemented as BPF programs to achieve the following:

1. Ease of experimentation and exploration: Enabling rapid iteration of new
   scheduling policies.

2. Customization: Building application-specific schedulers which implement
   policies that are not applicable to general-purpose schedulers.

3. Rapid scheduler deployments: Non-disruptive swap outs of scheduling
   policies in production environments.

sched_ext leverages BPF’s struct_ops feature to define a structure which
exports function callbacks and flags to BPF programs that wish to implement
scheduling policies. The struct_ops structure exported by sched_ext is
struct sched_ext_ops, and is conceptually similar to struct sched_class. The
role of sched_ext is to map the complex sched_class callbacks to the more
simple and ergonomic struct sched_ext_ops callbacks.

For more detailed discussion on the motivations and overview, please refer
to the cover letter.

Later patches will also add several example schedulers and documentation.

This patch implements the minimum core framework to enable implementation of
BPF schedulers. Subsequent patches will gradually add functionalities
including safety guarantee mechanisms, nohz and cgroup support.

include/linux/sched/ext.h defines struct sched_ext_ops. With the comment on
top, each operation should be self-explanatory. The followings are worth
noting:

- Both "sched_ext" and its shorthand "scx" are used. If the identifier
  already has "sched" in it, "ext" is used; otherwise, "scx".

- In sched_ext_ops, only .name is mandatory. Every operation is optional and
  if omitted a simple but functional default behavior is provided.

- A new policy constant SCHED_EXT is added and a task can select sched_ext
  by invoking sched_setscheduler(2) with the new policy constant. However,
  if the BPF scheduler is not loaded, SCHED_EXT is the same as SCHED_NORMAL
  and the task is scheduled by CFS. When the BPF scheduler is loaded, all
  tasks which have the SCHED_EXT policy are switched to sched_ext.

- To bridge the workflow imbalance between the scheduler core and
  sched_ext_ops callbacks, sched_ext uses simple FIFOs called dispatch
  queues (dsq's). By default, there is one global dsq (SCX_DSQ_GLOBAL), and
  one local per-CPU dsq (SCX_DSQ_LOCAL). SCX_DSQ_GLOBAL is provided for
  convenience and need not be used by a scheduler that doesn't require it.
  SCX_DSQ_LOCAL is the per-CPU FIFO that sched_ext pulls from when putting
  the next task on the CPU. The BPF scheduler can manage an arbitrary number
  of dsq's using scx_bpf_create_dsq() and scx_bpf_destroy_dsq().

- sched_ext guarantees system integrity no matter what the BPF scheduler
  does. To enable this, each task's ownership is tracked through
  p->scx.ops_state and all tasks are put on scx_tasks list. The disable path
  can always recover and revert all tasks back to CFS. See p->scx.ops_state
  and scx_tasks.

- A task is not tied to its rq while enqueued. This decouples CPU selection
  from queueing and allows sharing a scheduling queue across an arbitrary
  subset of CPUs. This adds some complexities as a task may need to be
  bounced between rq's right before it starts executing. See
  dispatch_to_local_dsq() and move_task_to_local_dsq().

- One complication that arises from the above weak association between task
  and rq is that synchronizing with dequeue() gets complicated as dequeue()
  may happen anytime while the task is enqueued and the dispatch path might
  need to release the rq lock to transfer the task. Solving this requires a
  bit of complexity. See the logic around p->scx.sticky_cpu and
  p->scx.ops_qseq.

- Both enable and disable paths are a bit complicated. The enable path
  switches all tasks without blocking to avoid issues which can arise from
  partially switched states (e.g. the switching task itself being starved).
  The disable path can't trust the BPF scheduler at all, so it also has to
  guarantee forward progress without blocking. See scx_ops_enable() and
  scx_ops_disable_workfn().

- When sched_ext is disabled, static_branches are used to shut down the
  entry points from hot paths.

v7: - scx_ops_bypass() was incorrectly and unnecessarily trying to grab
      scx_ops_enable_mutex which can lead to deadlocks in the disable path.
      Fixed.

    - Fixed TASK_DEAD handling bug in scx_ops_enable() path which could lead
      to use-after-free.

    - Consolidated per-cpu variable usages and other cleanups.

v6: - SCX_NR_ONLINE_OPS replaced with SCX_OPI_*_BEGIN/END so that multiple
      groups can be expressed. Later CPU hotplug operations are put into
      their own group.

    - SCX_OPS_DISABLING state is replaced with the new bypass mechanism
      which allows temporarily putting the system into simple FIFO
      scheduling mode bypassing the BPF scheduler. In addition to the shut
      down path, this will also be used to isolate the BPF scheduler across
      PM events. Enabling and disabling the bypass mode requires iterating
      all runnable tasks. rq->scx.runnable_list addition is moved from the
      later watchdog patch.

    - ops.prep_enable() is replaced with ops.init_task() and
      ops.enable/disable() are now called whenever the task enters and
      leaves sched_ext instead of when the task becomes schedulable on
      sched_ext and stops being so. A new operation - ops.exit_task() - is
      called when the task stops being schedulable on sched_ext.

    - scx_bpf_dispatch() can now be called from ops.select_cpu() too. This
      removes the need for communicating local dispatch decision made by
      ops.select_cpu() to ops.enqueue() via per-task storage.
      SCX_KF_SELECT_CPU is added to support the change.

    - SCX_TASK_ENQ_LOCAL which told the BPF scheudler that
      scx_select_cpu_dfl() wants the task to be dispatched to the local DSQ
      was removed. Instead, scx_bpf_select_cpu_dfl() now dispatches directly
      if it finds a suitable idle CPU. If such behavior is not desired,
      users can use scx_bpf_select_cpu_dfl() which returns the verdict in a
      bool out param.

    - scx_select_cpu_dfl() was mishandling WAKE_SYNC and could end up
      queueing many tasks on a local DSQ which makes tasks to execute in
      order while other CPUs stay idle which made some hackbench numbers
      really bad. Fixed.

    - The current state of sched_ext can now be monitored through files
      under /sys/sched_ext instead of /sys/kernel/debug/sched/ext. This is
      to enable monitoring on kernels which don't enable debugfs.

    - sched_ext wasn't telling BPF that ops.dispatch()'s @prev argument may
      be NULL and a BPF scheduler which derefs the pointer without checking
      could crash the kernel. Tell BPF. This is currently a bit ugly. A
      better way to annotate this is expected in the future.

    - scx_exit_info updated to carry pointers to message buffers instead of
      embedding them directly. This decouples buffer sizes from API so that
      they can be changed without breaking compatibility.

    - exit_code added to scx_exit_info. This is used to indicate different
      exit conditions on non-error exits and will be used to handle e.g. CPU
      hotplugs.

    - The patch "sched_ext: Allow BPF schedulers to switch all eligible
      tasks into sched_ext" is folded in and the interface is changed so
      that partial switching is indicated with a new ops flag
      %SCX_OPS_SWITCH_PARTIAL. This makes scx_bpf_switch_all() unnecessasry
      and in turn SCX_KF_INIT. ops.init() is now called with
      SCX_KF_SLEEPABLE.

    - Code reorganized so that only the parts necessary to integrate with
      the rest of the kernel are in the header files.

    - Changes to reflect the BPF and other kernel changes including the
      addition of bpf_sched_ext_ops.cfi_stubs.

v5: - To accommodate 32bit configs, p->scx.ops_state is now atomic_long_t
      instead of atomic64_t and scx_dsp_buf_ent.qseq which uses
      load_acquire/store_release is now unsigned long instead of u64.

    - Fix the bug where bpf_scx_btf_struct_access() was allowing write
      access to arbitrary fields.

    - Distinguish kfuncs which can be called from any sched_ext ops and from
      anywhere. e.g. scx_bpf_pick_idle_cpu() can now be called only from
      sched_ext ops.

    - Rename "type" to "kind" in scx_exit_info to make it easier to use on
      languages in which "type" is a reserved keyword.

    - Since cff9b2332ab7 ("kernel/sched: Modify initial boot task idle
      setup"), PF_IDLE is not set on idle tasks which haven't been online
      yet which made scx_task_iter_next_filtered() include those idle tasks
      in iterations leading to oopses. Update scx_task_iter_next_filtered()
      to directly test p->sched_class against idle_sched_class instead of
      using is_idle_task() which tests PF_IDLE.

    - Other updates to match upstream changes such as adding const to
      set_cpumask() param and renaming check_preempt_curr() to
      wakeup_preempt().

v4: - SCHED_CHANGE_BLOCK replaced with the previous
      sched_deq_and_put_task()/sched_enq_and_set_tsak() pair. This is
      because upstream is adaopting a different generic cleanup mechanism.
      Once that lands, the code will be adapted accordingly.

    - task_on_scx() used to test whether a task should be switched into SCX,
      which is confusing. Renamed to task_should_scx(). task_on_scx() now
      tests whether a task is currently on SCX.

    - scx_has_idle_cpus is barely used anymore and replaced with direct
      check on the idle cpumask.

    - SCX_PICK_IDLE_CORE added and scx_pick_idle_cpu() improved to prefer
      fully idle cores.

    - ops.enable() now sees up-to-date p->scx.weight value.

    - ttwu_queue path is disabled for tasks on SCX to avoid confusing BPF
      schedulers expecting ->select_cpu() call.

    - Use cpu_smt_mask() instead of topology_sibling_cpumask() like the rest
      of the scheduler.

v3: - ops.set_weight() added to allow BPF schedulers to track weight changes
      without polling p->scx.weight.

    - move_task_to_local_dsq() was losing SCX-specific enq_flags when
      enqueueing the task on the target dsq because it goes through
      activate_task() which loses the upper 32bit of the flags. Carry the
      flags through rq->scx.extra_enq_flags.

    - scx_bpf_dispatch(), scx_bpf_pick_idle_cpu(), scx_bpf_task_running()
      and scx_bpf_task_cpu() now use the new KF_RCU instead of
      KF_TRUSTED_ARGS to make it easier for BPF schedulers to call them.

    - The kfunc helper access control mechanism implemented through
      sched_ext_entity.kf_mask is improved. Now SCX_CALL_OP*() is always
      used when invoking scx_ops operations.

v2: - balance_scx_on_up() is dropped. Instead, on UP, balance_scx() is
      called from put_prev_taks_scx() and pick_next_task_scx() as necessary.
      To determine whether balance_scx() should be called from
      put_prev_task_scx(), SCX_TASK_DEQD_FOR_SLEEP flag is added. See the
      comment in put_prev_task_scx() for details.

    - sched_deq_and_put_task() / sched_enq_and_set_task() sequences replaced
      with SCHED_CHANGE_BLOCK().

    - Unused all_dsqs list removed. This was a left-over from previous
      iterations.

    - p->scx.kf_mask is added to track and enforce which kfunc helpers are
      allowed. Also, init/exit sequences are updated to make some kfuncs
      always safe to call regardless of the current BPF scheduler state.
      Combined, this should make all the kfuncs safe.

    - BPF now supports sleepable struct_ops operations. Hacky workaround
      removed and operations and kfunc helpers are tagged appropriately.

    - BPF now supports bitmask / cpumask helpers. scx_bpf_get_idle_cpumask()
      and friends are added so that BPF schedulers can use the idle masks
      with the generic helpers. This replaces the hacky kfunc helpers added
      by a separate patch in V1.

    - CONFIG_SCHED_CLASS_EXT can no longer be enabled if SCHED_CORE is
      enabled. This restriction will be removed by a later patch which adds
      core-sched support.

    - Add MAINTAINERS entries and other misc changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Co-authored-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
Cc: Andrea Righi <andrea.righi@canonical.com>

sched_ext: Add boilerplate for extensible scheduler class

This adds dummy implementations of sched_ext interfaces which interact with
the scheduler core and hook them in the correct places. As they're all
dummies, this doesn't cause any behavior changes. This is split out to help
reviewing.

v2: balance_scx_on_up() dropped. This will be handled in sched_ext proper.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>

sched: Add normal_policy()

A new BPF extensible sched_class will need to dynamically change how a task
picks its sched_class. For example, if the loaded BPF scheduler progs fail,
the tasks will be forced back on CFS even if the task's policy is set to the
new sched_class. To support such mapping, add normal_policy() which wraps
testing for %SCHED_NORMAL. This doesn't cause any behavior changes.

v2: Update the description with more details on the expected use.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>

sched: Factor out update_other_load_avgs() from __update_blocked_others()

RT, DL, thermal and irq load and utilization metrics need to be decayed and
updated periodically and before consumption to keep the numbers reasonable.
This is currently done from __update_blocked_others() as a part of the fair
class load balance path. Let's factor it out to update_other_load_avgs().
Pure refactor. No functional changes.

This will be used by the new BPF extensible scheduling class to ensure that
the above metrics are properly maintained.

v2: Refreshed on top of tip:sched/core.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>

sched: Factor out cgroup weight conversion functions

Factor out sched_weight_from/to_cgroup() which convert between scheduler
shares and cgroup weight. No functional change. The factored out functions
will be used by a new BPF extensible sched_class so that the weights can be
exposed to the BPF programs in a way which is consistent cgroup weights and
easier to interpret.

The weight conversions will be used regardless of cgroup usage. It's just
borrowing the cgroup weight range as it's more intuitive.
CGROUP_WEIGHT_MIN/DFL/MAX constants are moved outside CONFIG_CGROUPS so that
the conversion helpers can always be defined.

v2: The helpers are now defined regardless of COFNIG_CGROUPS.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>

sched: Add sched_class->switching_to() and expose check_class_changing/changed()

When a task switches to a new sched_class, the prev and new classes are
notified through ->switched_from() and ->switched_to(), respectively, after
the switching is done.

A new BPF extensible sched_class will have callbacks that allow the BPF
scheduler to keep track of relevant task states (like priority and cpumask).
Those callbacks aren't called while a task is on a different sched_class.
When a task comes back, we wanna tell the BPF progs the up-to-date state
before the task gets enqueued, so we need a hook which is called before the
switching is committed.

This patch adds ->switching_to() which is called during sched_class switch
through check_class_changing() before the task is restored. Also, this patch
exposes check_class_changing/changed() in kernel/sched/sched.h. They will be
used by the new BPF extensible sched_class to implement implicit sched_class
switching which is used e.g. when falling back to CFS when the BPF scheduler
fails or unloads.

This is a prep patch and doesn't cause any behavior changes. The new
operation and exposed functions aren't used yet.

v3: Refreshed on top of tip:sched/core.

v2: Improve patch description w/ details on planned use.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>

sched: Add sched_class->reweight_task()

Currently, during a task weight change, sched core directly calls
reweight_task() defined in fair.c if @p is on CFS. Let's make it a proper
sched_class operation instead. CFS's reweight_task() is renamed to
reweight_task_fair() and now called through sched_class.

While it turns a direct call into an indirect one, set_load_weight() isn't
called from a hot path and this change shouldn't cause any noticeable
difference. This will be used to implement reweight_task for a new BPF
extensible sched_class so that it can keep its cached task weight
up-to-date.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>

sched: Allow sched_cgroup_fork() to fail and introduce sched_cancel_fork()

A new BPF extensible sched_class will need more control over the forking
process. It wants to be able to fail from sched_cgroup_fork() after the new
task's sched_task_group is initialized so that the loaded BPF program can
prepare the task with its cgroup association is established and reject fork
if e.g. allocation fails.

Allow sched_cgroup_fork() to fail by making it return int instead of void
and adding sched_cancel_fork() to undo sched_fork() in the error path.

sched_cgroup_fork() doesn't fail yet and this patch shouldn't cause any
behavior changes.

v2: Patch description updated to detail the expected use.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>

sched: Restructure sched_class order sanity checks in sched_init()

Currently, sched_init() checks that the sched_class'es are in the expected
order by testing each adjacency which is a bit brittle and makes it
cumbersome to add optional sched_class'es. Instead, let's verify whether
they're in the expected order using sched_class_above() which is what
matters.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: David Vernet <dvernet@meta.com>

Merge branch 'bpf/for-next' into sched_ext-base

Merge branch 'bpf-support-resilient-split-btf'

Alan Maguire says:

====================
bpf: support resilient split BTF

Split BPF Type Format (BTF) provides huge advantages in that kernel
modules only have to provide type information for types that they do not
share with the core kernel; for core kernel types, split BTF refers to
core kernel BTF type ids.  So for a STRUCT sk_buff, a module that
uses that structure (or a pointer to it) simply needs to refer to the
core kernel type id, saving the need to define the structure and its many
dependents.  This cuts down on duplication and makes BTF as compact
as possible.

However, there is a downside.  This scheme requires the references from
split BTF to base BTF to be valid not just at encoding time, but at use
time (when the module is loaded).  Even a small change in kernel types
can perturb the type ids in core kernel BTF, and - if the new reproducible
BTF option is not used - pahole's parallel processing of compilation units
can lead to different type ids for the same kernel if the BTF is
regenerated.

So we have a robustness problem for split BTF for cases where a module is
not always compiled at the same time as the kernel.  This problem is
particularly acute for distros which generally want module builders to be
able to compile a module for the lifetime of a Linux stable-based release,
and have it continue to be valid over the lifetime of that release, even
as changes in data structures (and hence BTF types) accrue.  Today it's not
possible to generate BTF for modules that works beyond the initial
kernel it is compiled against - kernel bugfixes etc invalidate the split
BTF references to vmlinux BTF, and BTF is no longer usable for the
module.

The goal of this series is to provide options to provide additional
context for cases like this.  That context comes in the form of
distilled base BTF; it stands in for the base BTF, and contains
information about the types referenced from split BTF, but not their
full descriptions.  The modified split BTF will refer to type ids in
this .BTF.base section, and when the kernel loads such modules it
will use that .BTF.base to map references from split BTF to the
equivalent current vmlinux base BTF types.  Once this relocation
process has succeeded, the module BTF available in /sys/kernel/btf
will look exactly as if it was built with the current vmlinux;
references to base types will be fixed up etc.

A module builder - using this series along with the pahole changes -
can then build a module with distilled base BTF via an out-of-tree
module build, i.e.

make -C . M=path/2/module

The module will have a .BTF section (the split BTF) and a
.BTF.base section.  The latter is small in size - distilled base
BTF does not need full struct/union/enum information for named
types for example.  For 2667 modules built with distilled base BTF,
the average size observed was 1556 bytes (stddev 1563).  The overall
size added to this 2667 modules was 5.3Mb.

Note that for the in-tree modules, this approach is not needed as
split and base BTF in the case of in-tree modules are always built
and re-built together.

The series first focuses on generating split BTF with distilled base
BTF; then relocation support is added to allow split BTF with
an associated distlled base to be relocated with a new base BTF.

Next Eduard's patch allows BTF ELF parsing to work with both
.BTF and .BTF.base sections; this ensures that bpftool will be
able to dump BTF for a module with a .BTF.base section for example,
or indeed dump relocated BTF where a module and a "-B vmlinux"
is supplied.

Then we add support to resolve_btfids to ignore base BTF - i.e.
to avoid relocation - if a .BTF.base section is found.  This ensures
the .BTF.ids section is populated with ids relative to the distilled
base (these will be relocated as part of module load).

Finally the series supports storage of .BTF.base data/size in modules
and supports sharing of relocation code with the kernel to allow
relocation of module BTF.  For the kernel, this relocation
process happens at module load time, and we relocate split BTF
references to point at types in the current vmlinux BTF.  As part of
this, .BTF.ids references need to be mapped also.

So concretely, what happens is

- we generate split BTF in the .BTF section of a module that refers to
  types in the .BTF.base section as base types; the latter are not full
  type descriptions but provide information about the base type.  So
  a STRUCT sk_buff would be represented as a FWD struct sk_buff in
  distilled base BTF for example.
- when the module is loaded, the split BTF is relocated with vmlinux
  BTF; in the case of the FWD struct sk_buff, we find the STRUCT sk_buff
  in vmlinux BTF and map all split BTF references to the distilled base
  FWD sk_buff, replacing them with references to the vmlinux BTF
  STRUCT sk_buff.

A previous approach to this problem [1] utilized standalone BTF for such
cases - where the BTF is not defined relative to base BTF so there is no
relocation required.  The problem with that approach is that from
the verifier perspective, some types are special, and having a custom
representation of a core kernel type that did not necessarily match the
current representation is not tenable.  So the approach taken here was
to preserve the split BTF model while minimizing the representation of
the context needed to relocate split and current vmlinux BTF.

To generate distilled .BTF.base sections the associated dwarves
patch (to be applied on the "next" branch there) is needed [3]
Without it, things will still work but modules will not be built
with a .BTF.base section.

Changes since v5[4]:

- Update search of distilled types to return the first occurrence
  of a string (or a string+size pair); this allows us to iterate
  over all matches in distilled base BTF (Andrii, patch 3)
- Update to use BTF field iterators (Andrii, patches 1, 3 and 8)
- Update tests to cover multiple match and associated error cases
  (Eduard, patch 4)
- Rename elf_sections_info to btf_elf_secs, remove use of
  libbpf_get_error(), reset btf->owns_base when relocation
  succeeds (Andrii, patch 5)

Changes since v4[5]:

- Moved embeddedness, duplicate name checks to relocation time
  and record struct/union size for all distilled struct/unions
  instead of using forwards.  This allows us to carry out
  type compatibility checks based on the base BTF we want to
  relocate with (Eduard, patches 1, 3)
- Moved to using qsort() instead of qsort_r() as support for
  qsort_r() appears to be missing in Android libc (Andrii, patch 3)
- Sorting/searching now incorporates size matching depending
  on BTF kind and embeddedness of struct/union (Eduard, Andrii,
  patch 3)
- Improved naming of various types during relocation to avoid
  confusion (Andrii, patch 3)
- Incorporated Eduard's patch (patch 5) which handles .BTF.base
  sections internally in btf_parse_elf().  This makes ELF parsing
  work with split BTF, split BTF with a distilled base, split
  BTF with a distilled base _and_ base BTF (by relocating) etc.
  Having this avoids the need for bpftool changes; it will work
  as-is with .BTF.base sections (Eduard, patch 4)
- Updated resolve_btfids to _not_ relocate BTF for modules
  where a .BTF.base section is present; in that one case we
  do not want to relocate BTF as the .BTF.ids section should
  reflect ids in .BTF.base which will later be relocated on
  module load (Eduard, Andrii, patch 5)

Changes since v3[6]:

- distill now checks for duplicate-named struct/unions and records
  them as a sized struct/union to help identify which of the
  multiple base BTF structs/unions it refers to (Eduard, patch 1)
- added test support for multiple name handling (Eduard, patch 2)
- simplified the string mapping when updating split BTF to use
  base BTF instead of distilled base.  Since the only string
  references split BTF can make to base BTF are the names of
  the base types, create a string map from distilled string
  offset -> base BTF string offset and update string offsets
  by visiting all strings in split BTF; this saves having to
  do costly searches of base BTF (Eduard, patch 7,10)
- fixed bpftool manpage and indentation issues (Quentin, patch 11)

Also explored Eduard's suggestion of doing an implicit fallback
to checking for .BTF.base section in btf__parse() when it is
called to get base BTF.  However while it is doable, it turned
out to be difficult operationally.  Since fallback is implicit
we do not know the source of the BTF - was it from .BTF or
.BTF.base? In bpftool, we want to try first standalone BTF,
then split, then split with distilled base.  Having a way
to explicitly request .BTF.base via btf__parse_opts() fits
that model better.

Changes since v2[7]:

- submitted patch to use --btf_features in Makefile.btf for pahole
  v1.26 and later separately (Andrii).  That has landed in bpf-next
  now.
- distilled base now encodes ENUM64 as fwd ENUM (size 8), eliminating
  the need for support for ENUM64 in btf__add_fwd (patch 1, Andrii)
- moved to distilling only named types, augmenting split BTF with
  associated reference types; this simplifies greatly the distilled
  base BTF and the mapping operation between distilled and base
  BTF when relocating (most of the series changes, Andrii)
- relocation now iterates over base BTF, looking for matches based
  on name in distilled BTF.  Distilled BTF is pre-sorted by name
  (Andrii, patch 8)
- removed most redundant compabitiliby checks aside from struct
  size for base types/embedded structs and kind compatibility
  (since we only match on name) (Andrii, patch 8)
- btf__parse_opts() now replaces btf_parse() internally in libbpf
  (Eduard, patch 3)

Changes since RFC [8]:

- updated terminology; we replace clunky "base reference" BTF with
  distilling base BTF into a .BTF.base section. Similarly BTF
  reconcilation becomes BTF relocation (Andrii, most patches)
- add distilled base BTF by default for out-of-tree modules
  (Alexei, patch 8)
- distill algorithm updated to record size of embedded struct/union
  by recording it as a 0-vlen STRUCT/UNION with size preserved
  (Andrii, patch 2)
- verify size match on relocation for such STRUCT/UNIONs (Andrii,
  patch 9)
- with embedded STRUCT/UNION recording size, we can have bpftool
  dump a header representation using .BTF.base + .BTF sections
  rather than special-casing and refusing to use "format c" for
  that case (patch 5)
- match enum with enum64 and vice versa (Andrii, patch 9)
- ensure that resolve_btfids works with BTF without .BTF.base
  section (patch 7)
- update tests to cover embedded types, arrays and function
  prototypes (patches 3, 12)

[1] https://lore.kernel.org/bpf/20231112124834.388735-14-alan.maguire@oracle.com/
[2] https://lore.kernel.org/bpf/20240501175035.2476830-1-alan.maguire@oracle.com/
[3] https://lore.kernel.org/bpf/20240517102714.4072080-1-alan.maguire@oracle.com/
[4] https://lore.kernel.org/bpf/20240528122408.3154936-1-alan.maguire@oracle.com/
[5] https://lore.kernel.org/bpf/20240517102246.4070184-1-alan.maguire@oracle.com/
[6] https://lore.kernel.org/bpf/20240510103052.850012-1-alan.maguire@oracle.com/
[7] https://lore.kernel.org/bpf/20240424154806.3417662-1-alan.maguire@oracle.com/
[8] https://lore.kernel.org/bpf/20240322102455.98558-1-alan.maguire@oracle.com/
====================

Link: https://lore.kernel.org/r/20240613095014.357981-1-alan.maguire@oracle.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

resolve_btfids: Handle presence of .BTF.base section

Now that btf_parse_elf() handles .BTF.base section presence,
we need to ensure that resolve_btfids uses .BTF.base when present
rather than the vmlinux base BTF passed in via the -B option.
Detect .BTF.base section presence and unset the base BTF path
to ensure that BTF ELF parsing will do the right thing.

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240613095014.357981-7-alan.maguire@oracle.com

libbpf: Make btf_parse_elf process .BTF.base transparently

Update btf_parse_elf() to check if .BTF.base section is present.
The logic is as follows:

  if .BTF.base section exists:
     distilled_base := btf_new(.BTF.base)
  if distilled_base:
     btf := btf_new(.BTF, .base_btf=distilled_base)
     if base_btf:
        btf_relocate(btf, base_btf)
  else:
     btf := btf_new(.BTF)
  return btf

In other words:
- if .BTF.base section exists, load BTF from it and use it as a base
  for .BTF load;
- if base_btf is specified and .BTF.base section exist, relocate newly
  loaded .BTF against base_btf.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240613095014.357981-6-alan.maguire@oracle.com

selftests/bpf: Extend distilled BTF tests to cover BTF relocation

Ensure relocated BTF looks as expected; in this case identical to
original split BTF, with a few duplicate anonymous types added to
split BTF by the relocation process. Also add relocation tests
for edge cases like missing type in base BTF and multiple types
of the same name.

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240613095014.357981-5-alan.maguire@oracle.com

libbpf: Split BTF relocation

Map distilled base BTF type ids referenced in split BTF and their
references to the base BTF passed in, and if the mapping succeeds,
reparent the split BTF to the base BTF.

Relocation is done by first verifying that distilled base BTF
only consists of named INT, FLOAT, ENUM, FWD, STRUCT and
UNION kinds; then we sort these to speed lookups.  Once sorted,
the base BTF is iterated, and for each relevant kind we check
for an equivalent in distilled base BTF.  When found, the
mapping from distilled -> base BTF id and string offset is recorded.
In establishing mappings, we need to ensure we check STRUCT/UNION
size when the STRUCT/UNION is embedded in a split BTF STRUCT/UNION,
and when duplicate names exist for the same STRUCT/UNION.  Otherwise
size is ignored in matching STRUCT/UNIONs.

Once all mappings are established, we can update type ids
and string offsets in split BTF and reparent it to the new base.

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240613095014.357981-4-alan.maguire@oracle.com

selftests/bpf: Test distilled base, split BTF generation

Test generation of split+distilled base BTF, ensuring that

- named base BTF STRUCTs and UNIONs are represented as 0-vlen sized
STRUCT/UNIONs
- named ENUM[64]s are represented as 0-vlen named ENUM[64]s
- anonymous struct/unions are represented in full in split BTF
- anonymous enums are represented in full in split BTF
- types unreferenced from split BTF are not present in distilled
base BTF

Also test that with vmlinux BTF and split BTF based upon it,
we only represent needed base types referenced from split BTF
in distilled base.

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240613095014.357981-3-alan.maguire@oracle.com

libbpf: Add btf__distill_base() creating split BTF with distilled base BTF

To support more robust split BTF, adding supplemental context for the
base BTF type ids that split BTF refers to is required.  Without such
references, a simple shuffling of base BTF type ids (without any other
significant change) invalidates the split BTF.  Here the attempt is made
to store additional context to make split BTF more robust.

This context comes in the form of distilled base BTF providing minimal
information (name and - in some cases - size) for base INTs, FLOATs,
STRUCTs, UNIONs, ENUMs and ENUM64s along with modified split BTF that
points at that base and contains any additional types needed (such as
TYPEDEF, PTR and anonymous STRUCT/UNION declarations).  This
information constitutes the minimal BTF representation needed to
disambiguate or remove split BTF references to base BTF.  The rules
are as follows:

- INT, FLOAT, FWD are recorded in full.
- if a named base BTF STRUCT or UNION is referred to from split BTF, it
  will be encoded as a zero-member sized STRUCT/UNION (preserving
  size for later relocation checks).  Only base BTF STRUCT/UNIONs
  that are either embedded in split BTF STRUCT/UNIONs or that have
  multiple STRUCT/UNION instances of the same name will _need_ size
  checks at relocation time, but as it is possible a different set of
  types will be duplicates in the later to-be-resolved base BTF,
  we preserve size information for all named STRUCT/UNIONs.
- if an ENUM[64] is named, a ENUM forward representation (an ENUM
  with no values) of the same size is used.
- in all other cases, the type is added to the new split BTF.

Avoiding struct/union/enum/enum64 expansion is important to keep the
distilled base BTF representation to a minimum size.

When successful, new representations of the distilled base BTF and new
split BTF that refers to it are returned.  Both need to be freed by the
caller.

So to take a simple example, with split BTF with a type referring
to "struct sk_buff", we will generate distilled base BTF with a
0-member STRUCT sk_buff of the appropriate size, and the split BTF
will refer to it instead.

Tools like pahole can utilize such split BTF to populate the .BTF
section (split BTF) and an additional .BTF.base section.  Then
when the split BTF is loaded, the distilled base BTF can be used
to relocate split BTF to reference the current (and possibly changed)
base BTF.

So for example if "struct sk_buff" was id 502 when the split BTF was
originally generated,  we can use the distilled base BTF to see that
id 502 refers to a "struct sk_buff" and replace instances of id 502
with the current (relocated) base BTF sk_buff type id.

Distilled base BTF is small; when building a kernel with all modules
using distilled base BTF as a test, overall module size grew by only
5.3Mb total across ~2700 modules.

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240613095014.357981-2-alan.maguire@oracle.com

selftests/bpf: Add tests for add_const

Improve arena based tests and add several C and asm tests
with specific pattern.
These tests would have failed without add_const verifier support.

Also add several loop_inside_iter*() tests that are not related to add_const,
but nice to have.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240613013815.953-5-alexei.starovoitov@gmail.com

bpf: Support can_loop/cond_break on big endian

Add big endian support for can_loop/cond_break macros.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20240613013815.953-4-alexei.starovoitov@gmail.com

bpf: Track delta between "linked" registers.

Compilers can generate the code
  r1 = r2
  r1 += 0x1
  if r2 < 1000 goto ...
  use knowledge of r2 range in subsequent r1 operations

So remember constant delta between r2 and r1 and update r1 after 'if' condition.

Unfortunately LLVM still uses this pattern for loops with 'can_loop' construct:
for (i = 0; i < 1000 && can_loop; i++)

The "undo" pass was introduced in LLVM
https://reviews.llvm.org/D121937
to prevent this optimization, but it cannot cover all cases.
Instead of fighting middle end optimizer in BPF backend teach the verifier
about this pattern.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240613013815.953-3-alexei.starovoitov@gmail.com

bpf: Relax tuple len requirement for sk helpers.

__bpf_skc_lookup() safely handles incorrect values of tuple len,
hence we can allow zero to be passed as tuple len.
This patch alone doesn't make an observable verifier difference.
It's a trivial improvement that might simplify bpf programs.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240613013815.953-2-alexei.starovoitov@gmail.com

Merge branch 'bpf-make-trusted-args-nullable'

Vadim Fedorenko says:

====================
bpf: make trusted args nullable

Current verifier checks for the arg to be nullable after checking for
certain pointer types. It prevents programs to pass NULL to kfunc args
even if they are marked as nullable. This patchset adjusts verifier and
changes bpf crypto kfuncs to allow null for IV parameter which is
optional for some ciphers. Benchmark shows ~4% improvements when there
is no need to initialise 0-sized dynptr.

v3:
- add special selftest for nullable parameters
v2:
- adjust kdoc accordingly
====================

Link: https://lore.kernel.org/r/20240613211817.1551967-1-vadfed@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests: bpf: add testmod kfunc for nullable params

Add special test to be sure that only __nullable BTF params can be
replaced by NULL. This patch adds fake kfuncs in bpf_testmod to
properly test different params.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Link: https://lore.kernel.org/r/20240613211817.1551967-6-vadfed@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests: bpf: crypto: adjust bench to use nullable IV

The bench shows some improvements, around 4% faster on decrypt.

Before:

Benchmark 'crypto-decrypt' started.
Iter   0 (325.719us): hits    5.105M/s (  5.105M/prod), drops 0.000M/s, total operations    5.105M/s
Iter   1 (-17.295us): hits    5.224M/s (  5.224M/prod), drops 0.000M/s, total operations    5.224M/s
Iter   2 (  5.504us): hits    4.630M/s (  4.630M/prod), drops 0.000M/s, total operations    4.630M/s
Iter   3 (  9.239us): hits    5.148M/s (  5.148M/prod), drops 0.000M/s, total operations    5.148M/s
Iter   4 ( 37.885us): hits    5.198M/s (  5.198M/prod), drops 0.000M/s, total operations    5.198M/s
Iter   5 (-53.282us): hits    5.167M/s (  5.167M/prod), drops 0.000M/s, total operations    5.167M/s
Iter   6 (-17.809us): hits    5.186M/s (  5.186M/prod), drops 0.000M/s, total operations    5.186M/s
Summary: hits    5.092 ± 0.228M/s (  5.092M/prod), drops    0.000 ±0.000M/s, total operations    5.092 ± 0.228M/s

After:

Benchmark 'crypto-decrypt' started.
Iter   0 (268.912us): hits    5.312M/s (  5.312M/prod), drops 0.000M/s, total operations    5.312M/s
Iter   1 (124.869us): hits    5.354M/s (  5.354M/prod), drops 0.000M/s, total operations    5.354M/s
Iter   2 (-36.801us): hits    5.334M/s (  5.334M/prod), drops 0.000M/s, total operations    5.334M/s
Iter   3 (254.628us): hits    5.334M/s (  5.334M/prod), drops 0.000M/s, total operations    5.334M/s
Iter   4 (-77.691us): hits    5.275M/s (  5.275M/prod), drops 0.000M/s, total operations    5.275M/s
Iter   5 (-164.510us): hits    5.313M/s (  5.313M/prod), drops 0.000M/s, total operations    5.313M/s
Iter   6 (-81.376us): hits    5.346M/s (  5.346M/prod), drops 0.000M/s, total operations    5.346M/s
Summary: hits    5.326 ± 0.029M/s (  5.326M/prod), drops    0.000 ±0.000M/s, total operations    5.326 ± 0.029M/s

Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Link: https://lore.kernel.org/r/20240613211817.1551967-5-vadfed@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests: bpf: crypto: use NULL instead of 0-sized dynptr

Adjust selftests to use nullable option for state and IV arg.

Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Link: https://lore.kernel.org/r/20240613211817.1551967-4-vadfed@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: crypto: make state and IV dynptr nullable

Some ciphers do not require state and IV buffer, but with current
implementation 0-sized dynptr is always needed. With adjustment to
verifier we can provide NULL instead of 0-sized dynptr. Make crypto
kfuncs ready for this.

Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Link: https://lore.kernel.org/r/20240613211817.1551967-3-vadfed@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: verifier: make kfuncs args nullalble

Some arguments to kfuncs might be NULL in some cases. But currently it's
not possible to pass NULL to any BTF structures because the check for
the suffix is located after all type checks. Move it to earlier place
to allow nullable args.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Link: https://lore.kernel.org/r/20240613211817.1551967-2-vadfed@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'fixes-for-kfunc-prototype-generation'

Daniel Xu says:

====================
Fixes for kfunc prototype generation

This patchset fixes new warnings and errors that kfunc prototype
generation caused.
====================

Link: https://lore.kernel.org/r/cover.1718295425.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: selftests: Do not use generated kfunc prototypes for arena progs

When selftests are built with a new enough clang, the arena selftests
opt-in to use LLVM address_space attribute annotations for arena
pointers.

These annotations are not emitted by kfunc prototype generation. This
causes compilation errors when clang sees conflicting prototypes.

Fix by opting arena selftests out of using generated kfunc prototypes.

Fixes: 770abbb5a25a ("bpftool: Support dumping kfunc prototypes from BTF")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/r/202406131810.c1B8hTm8-lkp@intel.com/
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/fc59a617439ceea9ad8dfbb4786843c2169496ae.1718295425.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Fix bpf_dynptr documentation comments

The function argument names were changed but the doc comment was not.
Fix htmldocs build warning by updating doc comments.

Fixes: cce4c40b9606 ("bpf: treewide: Align kfunc signatures to prog point-of-view")
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/d0b0eb05f91e12e5795966153b11998d3fc1d433.1718295425.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Validate CHECKSUM_COMPLETE option

Adjust skb program test to run with checksum validation.

Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240606145851.229116-2-vadfed@meta.com

bpf: Add CHECKSUM_COMPLETE to bpf test progs

Add special flag to validate that TC BPF program properly updates
checksum information in skb.

Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240606145851.229116-1-vadfed@meta.com

Merge branch 'bpf-support-dumping-kfunc-prototypes-from-btf'

Daniel Xu says:

====================
bpf: Support dumping kfunc prototypes from BTF

This patchset enables both detecting as well as dumping compilable
prototypes for kfuncs.

The first commit instructs pahole to DECL_TAG kfuncs when available.
This requires v1.27 which was released on 6/11/24. With it, users will
be able to look at BTF inside vmlinux (or modules) and check if the
kfunc they want is available.

The final commit teaches bpftool how to dump kfunc prototypes. This
is done for developer convenience.

The rest of the commits are fixups to enable selftests to use the
newly dumped kfunc prototypes. With these, selftests will regularly
exercise the newly added codepaths.

Tested with and without the required pahole changes:

* https://github.com/kernel-patches/bpf/pull/7186
* https://github.com/kernel-patches/bpf/pull/7187

=== Changelog ===
From v4:
* Change bpf_session_cookie() return type
* Only fixup used fentry test kfunc prototypes
* Extract out projection detection into shared btf_is_projection_of()
* Fix kernel test robot build warnings about doc comments

From v3:
* Teach selftests to use dumped prototypes

From v2:
* Update Makefile.btf with pahole flag
* More error checking
* Output formatting changes
* Drop already-merged commit

From v1:
* Add __weak annotation
* Use btf_dump for kfunc prototypes
* Update kernel bpf_rdonly_cast() signature
====================

Link: https://lore.kernel.org/r/cover.1718207789.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpftool: Support dumping kfunc prototypes from BTF

This patch enables dumping kfunc prototypes from bpftool. This is useful
b/c with this patch, end users will no longer have to manually define
kfunc prototypes. For the kernel tree, this also means we can optionally
drop kfunc prototypes from:

        tools/testing/selftests/bpf/bpf_kfuncs.h
        tools/testing/selftests/bpf/bpf_experimental.h

Example usage:

        $ make PAHOLE=/home/dxu/dev/pahole/build/pahole -j30 vmlinux

        $ ./tools/bpf/bpftool/bpftool btf dump file ./vmlinux format c | rg "__ksym;" | head -3
        extern void cgroup_rstat_updated(struct cgroup *cgrp, int cpu) __weak __ksym;
        extern void cgroup_rstat_flush(struct cgroup *cgrp) __weak __ksym;
        extern struct bpf_key *bpf_lookup_user_key(u32 serial, u64 flags) __weak __ksym;

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/bf6c08f9263c4bd9d10a717de95199d766a13f61.1718207789.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: selftests: xfrm: Opt out of using generated kfunc prototypes

The xfrm_info selftest locally defines an aliased type such that folks
with CONFIG_XFRM_INTERFACE=m/n configs can still build the selftests.
See commit aa67961f3243 ("selftests/bpf: Allow building bpf tests with CONFIG_XFRM_INTERFACE=[m|n]").

Thus, it is simpler if this selftest opts out of using enerated kfunc
prototypes. The preprocessor macro this commit uses will be introduced
in the final commit.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/afe0bb1c50487f52542cdd5230c4aef9e36ce250.1718207789.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: selftests: nf: Opt out of using generated kfunc prototypes

The bpf-nf selftests play various games with aliased types such that
folks with CONFIG_NF_CONNTRACK=m/n configs can still build the
selftests. See commits:

1058b6a78db2 ("selftests/bpf: Do not fail build if CONFIG_NF_CONNTRACK=m/n")
92afc5329a5b ("selftests/bpf: Fix build errors if CONFIG_NF_CONNTRACK=m")

Thus, it is simpler if these selftests opt out of using generated kfunc
prototypes. The preprocessor macro this commit uses will be introduced
in the final commit.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/044a5b10cb3abd0d71cb1c818ee0bfc4a2239332.1718207789.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: treewide: Align kfunc signatures to prog point-of-view

Previously, kfunc declarations in bpf_kfuncs.h (and others) used "user
facing" types for kfuncs prototypes while the actual kfunc definitions
used "kernel facing" types. More specifically: bpf_dynptr vs
bpf_dynptr_kern, __sk_buff vs sk_buff, and xdp_md vs xdp_buff.

It wasn't an issue before, as the verifier allows aliased types.
However, since we are now generating kfunc prototypes in vmlinux.h (in
addition to keeping bpf_kfuncs.h around), this conflict creates
compilation errors.

Fix this conflict by using "user facing" types in kfunc definitions.
This results in more casts, but otherwise has no additional runtime
cost.

Note, similar to 5b268d1ebcdc ("bpf: Have bpf_rdonly_cast() take a const
pointer"), we also make kfuncs take const arguments where appropriate in
order to make the kfunc more permissive.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/b58346a63a0e66bc9b7504da751b526b0b189a67.1718207789.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: verifier: Relax caller requirements for kfunc projection type args

Currently, if a kfunc accepts a projection type as an argument (eg
struct __sk_buff *), the caller must exactly provide exactly the same
type with provable provenance.

However in practice, kfuncs that accept projection types _must_ cast to
the underlying type before use b/c projection type layouts are
completely made up. Thus, it is ok to relax the verifier rules around
implicit conversions.

We will use this functionality in the next commit when we align kfuncs
to user-facing types.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/e2c025cb09ccfd4af1ec9e18284dc3cecff7514d.1718207789.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: selftests: Namespace struct_opt callbacks in bpf_dctcp

With generated kfunc prototypes, the existing callback names will
conflict. Fix by namespacing with a bpf_ prefix.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/efe7aadad8a054e5aeeba94b1d2e4502eee09d7a.1718207789.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Make bpf_session_cookie() kfunc return long *

We will soon be generating kfunc prototypes from BTF. As part of that,
we need to align the manual signatures in bpf_kfuncs.h with the actual
kfunc definitions. There is currently a conflicting signature for
bpf_session_cookie() w.r.t. return type.

The original intent was to return long * and not __u64 *. You can see
evidence of that intent in a3a5113393cc ("selftests/bpf: Add kprobe
session cookie test").

Fix conflict by changing kfunc definition.

Fixes: 5c919acef851 ("bpf: Add support for kprobe session cookie")
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/7043e1c251ab33151d6e3830f8ea1902ed2604ac.1718207789.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: selftests: Fix bpf_map_sum_elem_count() kfunc prototype

The prototype in progs/map_percpu_stats.c is not in line with how the
actual kfuncs are defined in kernel/bpf/map_iter.c. This causes
compilation errors when kfunc prototypes are generated from BTF.

Fix by aligning with actual kfunc definitions.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/0497e11a71472dcb71ada7c90ad691523ae87c3b.1718207789.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: selftests: Fix bpf_cpumask_first_zero() kfunc prototype

The prototype in progs/nested_trust_common.h is not in line with how the
actual kfuncs are defined in kernel/bpf/cpumask.c. This causes compilation
errors when kfunc prototypes are generated from BTF.

Fix by aligning with actual kfunc definitions.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/437936a4e554b02e04566dd6e3f0a5d08370cc8c.1718207789.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: selftests: Fix fentry test kfunc prototypes

Some prototypes in progs/get_func_ip_test.c were not in line with how the
actual kfuncs are defined in net/bpf/test_run.c. This causes compilation
errors when kfunc prototypes are generated from BTF.

Fix by aligning with actual kfunc definitions.

Also remove two unused prototypes.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/1e68870e7626b7b9c6420e65076b307fc404a2f0.1718207789.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: selftests: Fix bpf_iter_task_vma_new() prototype

bpf_iter_task_vma_new() is defined as taking a u64 as its 3rd argument.
u64 is a unsigned long long. bpf_experimental.h was defining the
prototype as unsigned long.

Fix by using __u64.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/fab4509bfee914f539166a91c3ff41e949f3df30.1718207789.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

kbuild: bpf: Tell pahole to DECL_TAG kfuncs

With [0], pahole can now discover kfuncs and inject DECL_TAG
into BTF. With this commit, we will start shipping said DECL_TAGs
to downstream consumers if pahole supports it.

This is useful for feature probing kfuncs as well as generating
compilable prototypes. This is particularly important as kfuncs
do not have stable ABI.

[0]: https://git.kernel.org/pub/scm/devel/pahole/pahole.git/commit/?id=72e88f29c6f7e14201756e65bd66157427a61aaf

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/324aac5c627bddb80d9968c30df6382846994cc8.1718207789.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpftool: Query only cgroup-related attach types

When CONFIG_NETKIT=y,
bpftool-cgroup shows error even if the cgroup's path is correct:

$ bpftool cgroup tree /sys/fs/cgroup
CgroupPath
ID       AttachType      AttachFlags     Name
Error: can't query bpf programs attached to /sys/fs/cgroup: No such device or address

>From strace and kernel tracing, I found netkit returned ENXIO and this command failed.
I think this AttachType(BPF_NETKIT_PRIMARY) is not relevant to cgroup.

bpftool-cgroup should query just only cgroup-related attach types.

v2->v3:
  - removed an unnecessary check

v1->v2:
  - used an array of cgroup attach types

Signed-off-by: Kenta Tada <tadakentaso@gmail.com>
Reviewed-by: Quentin Monnet <qmo@kernel.org>
Link: https://lore.kernel.org/r/20240607111704.6716-1-tadakentaso@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'intel-wired-lan-driver-updates-2024-06-03'

Jacob Keller says:

====================
Intel Wired LAN Driver Updates 2024-06-03

This series includes miscellaneous improvements for the ice as well as a
cleanup to the Makefiles for all Intel net drivers.

Andy fixes all of the Intel net driver Makefiles to use the documented
'*-y' syntax for specifying object files to link into kernel driver
modules, rather than the '*-objs' syntax which works but is documented as
reserved for user-space host programs.

Jacob has a cleanup to refactor rounding logic in the ice driver into a
common roundup_u64 helper function.

Michal Schmidt replaces irq_set_affinity_hint() to use
irq_update_affinity_hint() which behaves better with user-applied affinity
settings.

v2: https://lore.kernel.org/r/20240605-next-2024-06-03-intel-next-batch-v2-0-39c23963fa78@intel.com
v1: https://lore.kernel.org/r/20240603-next-2024-06-03-intel-next-batch-v1-0-e0523b28f325@intel.com
====================

Link: https://lore.kernel.org/r/20240607-next-2024-06-03-intel-next-batch-v3-0-d1470cee3347@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ice: use irq_update_affinity_hint()

irq_set_affinity_hint() is deprecated. Use irq_update_affinity_hint()
instead. This removes the side-effect of actually applying the affinity.

The driver does not really need to worry about spreading its IRQs across
CPUs. The core code already takes care of that.
On the contrary, when the driver applies affinities by itself, it breaks
the users' expectations:
1. The user configures irqbalance with IRQBALANCE_BANNED_CPULIST in
    order to prevent IRQs from being moved to certain CPUs that run a
    real-time workload.
2. ice reconfigures VSIs at runtime due to a MIB change
    (ice_dcb_process_lldp_set_mib_change). Reopening a VSI resets the
    affinity in ice_vsi_req_irq_msix().
3. ice has no idea about irqbalance's config, so it may move an IRQ to
    a banned CPU. The real-time workload suffers unacceptable latency.

I am not sure if updating the affinity hints is at all useful, because
irqbalance ignores them since 2016 ([1]), but at least it's harmless.

This ice change is similar to i40e commit d34c54d1739c ("i40e: Use
irq_update_affinity_hint()").

[1] https://github.com/Irqbalance/irqbalance/commit/dcc411e7bfdd

Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Reviewed-by: Sunil Goutham <sgoutham@marvell.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240607-next-2024-06-03-intel-next-batch-v3-3-d1470cee3347@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ice: add and use roundup_u64 instead of open coding equivalent

In ice_ptp_cfg_clkout(), the ice driver needs to calculate the nearest next
second of a current time value specified in nanoseconds. It implements this
using div64_u64, because the time value is a u64. It could use div_u64
since NSEC_PER_SEC is smaller than 32-bits.

Ideally this would be implemented directly with roundup(), but that can't
work on all platforms due to a division which requires using the specific
macros and functions due to platform restrictions, and to ensure that the
most appropriate and fast instructions are used.

The kernel doesn't currently provide any 64-bit equivalents for doing
roundup. Attempting to use roundup() on a 32-bit platform will result in a
link failure due to not having a direct 64-bit division.

The closest equivalent for this is DIV64_U64_ROUND_UP, which does a
division always rounding up. However, this only computes the division, and
forces use of the div64_u64 in cases where the divisor is a 32bit value and
could make use of div_u64.

Introduce DIV_U64_ROUND_UP based on div_u64, and then use it to implement
roundup_u64 which takes a u64 input value and a u32 rounding value.

The name roundup_u64 matches the naming scheme of div_u64, and future
patches could implement roundup64_u64 if they need to round by a multiple
that is greater than 32-bits.

Replace the logic in ice_ptp.c which does this equivalent with the newly
added roundup_u64.

Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240607-next-2024-06-03-intel-next-batch-v3-2-d1470cee3347@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: intel: Use *-y instead of *-objs in Makefile

*-objs suffix is reserved rather for (user-space) host programs while
usually *-y suffix is used for kernel drivers (although *-objs works
for that purpose for now).

Let's correct the old usages of *-objs in Makefiles.

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240607-next-2024-06-03-intel-next-batch-v3-1-d1470cee3347@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

isdn: add missing MODULE_DESCRIPTION() macros

make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/isdn/hardware/mISDN/hfcpci.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/isdn/hardware/mISDN/hfcmulti.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/isdn/hardware/mISDN/hfcsusb.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/isdn/hardware/mISDN/avmfritz.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/isdn/hardware/mISDN/speedfax.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/isdn/hardware/mISDN/mISDNinfineon.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/isdn/hardware/mISDN/w6692.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/isdn/hardware/mISDN/netjet.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/isdn/hardware/mISDN/mISDNipac.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/isdn/hardware/mISDN/mISDNisar.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/isdn/mISDN/mISDN_core.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/isdn/mISDN/mISDN_dsp.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/isdn/mISDN/l1oip.o

Add the missing invocations of the MODULE_DESCRIPTION() macro.

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Link: https://lore.kernel.org/r/20240607-md-drivers-isdn-v1-1-81fb7001bc3a@quicinc.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'for-netdev' of https://git./linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-06-06

We've added 54 non-merge commits during the last 10 day(s) which contain
a total of 50 files changed, 1887 insertions(+), 527 deletions(-).

The main changes are:

1) Add a user space notification mechanism via epoll when a struct_ops
   object is getting detached/unregistered, from Kui-Feng Lee.

2) Big batch of BPF selftest refactoring for sockmap and BPF congctl
   tests, from Geliang Tang.

3) Add BTF field (type and string fields, right now) iterator support
   to libbpf instead of using existing callback-based approaches,
   from Andrii Nakryiko.

4) Extend BPF selftests for the latter with a new btf_field_iter
   selftest, from Alan Maguire.

5) Add new kfuncs for a generic, open-coded bits iterator,
   from Yafang Shao.

6) Fix BPF selftests' kallsyms_find() helper under kernels configured
   with CONFIG_LTO_CLANG_THIN, from Yonghong Song.

7) Remove a bunch of unused structs in BPF selftests,
   from David Alan Gilbert.

8) Convert test_sockmap section names into names understood by libbpf
   so it can deduce program type and attach type, from Jakub Sitnicki.

9) Extend libbpf with the ability to configure log verbosity
   via LIBBPF_LOG_LEVEL environment variable, from Mykyta Yatsenko.

10) Fix BPF selftests with regards to bpf_cookie and find_vma flakiness
    in nested VMs, from Song Liu.

11) Extend riscv32/64 JITs to introduce shift/add helpers to generate Zba
    optimization, from Xiao Wang.

12) Enable BPF programs to declare arrays and struct fields with kptr,
    bpf_rb_root, and bpf_list_head, from Kui-Feng Lee.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (54 commits)
  selftests/bpf: Drop useless arguments of do_test in bpf_tcp_ca
  selftests/bpf: Use start_test in test_dctcp in bpf_tcp_ca
  selftests/bpf: Use start_test in test_dctcp_fallback in bpf_tcp_ca
  selftests/bpf: Add start_test helper in bpf_tcp_ca
  selftests/bpf: Use connect_to_fd_opts in do_test in bpf_tcp_ca
  libbpf: Auto-attach struct_ops BPF maps in BPF skeleton
  selftests/bpf: Add btf_field_iter selftests
  selftests/bpf: Fix send_signal test with nested CONFIG_PARAVIRT
  libbpf: Remove callback-based type/string BTF field visitor helpers
  bpftool: Use BTF field iterator in btfgen
  libbpf: Make use of BTF field iterator in BTF handling code
  libbpf: Make use of BTF field iterator in BPF linker code
  libbpf: Add BTF field iterator
  selftests/bpf: Ignore .llvm.<hash> suffix in kallsyms_find()
  selftests/bpf: Fix bpf_cookie and find_vma in nested VM
  selftests/bpf: Test global bpf_list_head arrays.
  selftests/bpf: Test global bpf_rb_root arrays and fields in nested struct types.
  selftests/bpf: Test kptr arrays and kptrs in nested struct fields.
  bpf: limit the number of levels of a nested struct type.
  bpf: look into the types of the fields of a struct type recursively.
  ...
====================

Link: https://lore.kernel.org/r/20240606223146.23020-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'wireless-next-2024-06-07' of git://git./linux/kernel/git/wireless/wireless-next

Kalle Valo says:

====================
wireless-next patches for v6.11

The first "new features" pull request for v6.11 with changes both in
stack and in drivers. Nothing out of ordinary, except that we have
two conflicts this time:

net/mac80211/cfg.c
  https://lore.kernel.org/all/20240531124415.05b25e7a@canb.auug.org.au

drivers/net/wireless/microchip/wilc1000/netdev.c
  https://lore.kernel.org/all/20240603110023.23572803@canb.auug.org.au

Major changes:

cfg80211/mac80211
* parse Transmit Power Envelope (TPE) data in mac80211 instead of in drivers

wilc1000
* read MAC address during probe to make it visible to user space

iwlwifi
* bump FW API to 91 for BZ/SC devices
* report 64-bit radiotap timestamp
* enable P2P low latency by default
* handle Transmit Power Envelope (TPE) advertised by AP
* start using guard()

rtlwifi
* RTL8192DU support

ath12k
* remove unsupported tx monitor handling
* channel 2 in 6 GHz band support
* Spatial Multiplexing Power Save (SMPS) in 6 GHz band support
* multiple BSSID (MBSSID) and Enhanced Multi-BSSID Advertisements (EMA)
   support
* dynamic VLAN support
* add panic handler for resetting the firmware state

ath10k
* add qcom,no-msa-ready-indicator Device Tree property
* LED support for various chipsets

* tag 'wireless-next-2024-06-07' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (194 commits)
  wifi: ath12k: add hw_link_id in ath12k_pdev
  wifi: ath12k: add panic handler
  wifi: rtw89: chan: Use swap() in rtw89_swap_sub_entity()
  wifi: brcm80211: remove unused structs
  wifi: brcm80211: use sizeof(*pointer) instead of sizeof(type)
  wifi: ath12k: do not process consecutive RDDM event
  dt-bindings: net: wireless: ath11k: Drop "qcom,ipq8074-wcss-pil" from example
  wifi: ath12k: fix memory leak in ath12k_dp_rx_peer_frag_setup()
  wifi: rtlwifi: handle return value of usb init TX/RX
  wifi: rtlwifi: Enable the new rtl8192du driver
  wifi: rtlwifi: Add rtl8192du/sw.c
  wifi: rtlwifi: Constify rtl_hal_cfg.{ops,usb_interface_cfg} and rtl_priv.cfg
  wifi: rtlwifi: Add rtl8192du/dm.{c,h}
  wifi: rtlwifi: Add rtl8192du/fw.{c,h} and rtl8192du/led.{c,h}
  wifi: rtlwifi: Add rtl8192du/rf.{c,h}
  wifi: rtlwifi: Add rtl8192du/trx.{c,h}
  wifi: rtlwifi: Add rtl8192du/phy.{c,h}
  wifi: rtlwifi: Add rtl8192du/hw.{c,h}
  wifi: rtlwifi: Add new members to struct rtl_priv for RTL8192DU
  wifi: rtlwifi: Add rtl8192du/table.{c,h}
  ...

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================

Link: https://lore.kernel.org/r/20240607093517.41394C2BBFC@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'fix-changing-dsa-conduit'

Marek Behún says:

====================
Fix changing DSA conduit

This series fixes an issue in the DSA code related to host interface UC
address installed into port FDB and port conduit address database when
live-changing port conduit.

The first patch refactores/deduplicates the installation/uninstallation
of the interface's MAC address and the second patch fixes the issue.

Cover letter for v1 and v2:
https://patchwork.kernel.org/project/netdevbpf/cover/20240429163627.16031-1-kabel@kernel.org/
https://patchwork.kernel.org/project/netdevbpf/cover/20240502122922.28139-1-kabel@kernel.org/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: update the unicast MAC address when changing conduit

When changing DSA user interface conduit while the user interface is up,
DSA exhibits different behavior in comparison to when the interface is
down. This different behavior concerns the primary unicast MAC address
stored in the port standalone FDB and in the conduit device UC database.

If we put a switch port down while changing the conduit with
  ip link set sw0p0 down
  ip link set sw0p0 type dsa conduit conduit1
  ip link set sw0p0 up
we delete the address in dsa_user_close() and install the (possibly
different) address in dsa_user_open().

But when changing the conduit on the fly, the old address is not
deleted and the new one is not installed.

Since we explicitly want to support live-changing the conduit, uninstall
the old address before calling dsa_port_assign_conduit() and install the
(possibly different) new address after the call.

Because conduit change might also trigger address change (the user
interface is supposed to inherit the conduit interface MAC address if no
address is defined in hardware (dp->mac is a zero address)), move the
eth_hw_addr_inherit() call from dsa_user_change_conduit() to
dsa_port_change_conduit(), just before installing the new address.

Although this is in theory a flaw in DSA core, it needs not be
backported, since there is currently no DSA driver that can be affected
by this. The only DSA driver that supports changing conduit is felix,
and, as explained by Vladimir Oltean [1]:

  There are 2 reasons why with felix the bug does not manifest itself.

  First is because both the 'ocelot' and the alternate 'ocelot-8021q'
  tagging protocols have the 'promisc_on_conduit = true' flag. So the
  unicast address doesn't have to be in the conduit's RX filter -
  neither the old or the new conduit.

  Second, dsa_user_host_uc_install() theoretically leaves behind host
  FDB entries installed towards the wrong (old) CPU port. But in
  felix_fdb_add(), we treat any FDB entry requested towards any CPU port
  as if it was a multicast FDB entry programmed towards _all_ CPU ports.
  For that reason, it is installed towards the port mask of the PGID_CPU
  port group ID:

if (dsa_port_is_cpu(dp))
port = PGID_CPU;

Therefore no Fixes tag for this change.

[1] https://lore.kernel.org/netdev/20240507201827.47suw4fwcjrbungy@skbuf/
Signed-off-by: Marek Behún <kabel@kernel.org>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Tested-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: deduplicate code adding / deleting the port address to fdb

The sequence
  if (dsa_switch_supports_uc_filtering(ds))
    dsa_port_standalone_host_fdb_add(dp, addr, 0);
  if (!ether_addr_equal(addr, conduit->dev_addr))
    dev_uc_add(conduit, addr);
is executed both in dsa_user_open() and dsa_user_set_mac_addr().

Its reverse is executed both in dsa_user_close() and
dsa_user_set_mac_addr().

Refactor these sequences into new functions dsa_user_host_uc_install()
and dsa_user_host_uc_uninstall().

Signed-off-by: Marek Behún <kabel@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'rtnetlink-rtnl_lock'

Jakub Kicinski says:

====================
rtnetlink: move rtnl_lock handling out of af_netlink

With the changes done in commit 5b4b62a169e1 ("rtnetlink: make
the "split" NLM_DONE handling generic") we can also move the
rtnl locking out of af_netlink.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: netlink: remove the cb_mutex "injection" from netlink core

Back in 2007, in commit af65bdfce98d ("[NETLINK]: Switch cb_lock spinlock
to mutex and allow to override it") netlink core was extended to allow
subsystems to replace the dump mutex lock with its own lock.

The mechanism was used by rtnetlink to take rtnl_lock but it isn't
sufficiently flexible for other users. Over the 17 years since
it was added no other user appeared. Since rtnetlink needs conditional
locking now, and doesn't use it either, axe this feature complete.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rtnetlink: move rtnl_lock handling out of af_netlink

Now that we have an intermediate layer of code for handling
rtnl-level netlink dump quirks, we can move the rtnl_lock
taking there.

For dump handlers with RTNL_FLAG_DUMP_SPLIT_NLM_DONE we can
avoid taking rtnl_lock just to generate NLM_DONE, once again.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: hellcreek: Replace kernel.h with what is used

kernel.h is included solely for some other existing headers.
Include them directly and get rid of kernel.h.

While at it, sort headers alphabetically for easier maintenance.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'tcp-up-pin-tw-timer'

Florian Westphal says:

====================
net: tcp: un-pin tw timer

Changes since previous iteration:
- Patch 1: update a comment, I copied Erics v7 RvB tag.
- Patch 2: move bh off/on into hashdance_schedule and get rid of
comment mentioning pinned tw timer.
I did not copy Erics RvB tag over from v7 because of the change.
- Patch 3 is unchanged, so I kept Erics RvB tag.

This is v8 of the series where the tw_timer is un-pinned to get rid of
interferences in isolated CPUs setups.

First patch makes necessary preparations, existing code relies on
TIMER_PINNED to avoid races.

Second patch un-pins the TW timer. Could be folded into the first one,
but it might help wrt. bisection.

Third patch is a minor cleanup to move a helper from .h to the only
remaining compilation unit.

Tested with iperf3 and stress-ng socket mode.
====================

Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: move inet_twsk_schedule helper out of header

Its no longer used outside inet_timewait_sock.c, so move it there.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: tcp: un-pin the tw_timer

After previous patch, even if timer fires immediately on another CPU,
context that schedules the timer now holds the ehash spinlock, so timer
cannot reap tw socket until ehash lock is released.

BH disable is moved into hashdance_schedule.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: tcp/dccp: prepare for tw_timer un-pinning

The TCP timewait timer is proving to be problematic for setups where
scheduler CPU isolation is achieved at runtime via cpusets (as opposed to
statically via isolcpus=domains).

What happens there is a CPU goes through tcp_time_wait(), arming the
time_wait timer, then gets isolated. TCP_TIMEWAIT_LEN later, the timer
fires, causing interference for the now-isolated CPU. This is conceptually
similar to the issue described in commit e02b93124855 ("workqueue: Unbind
kworkers before sending them to exit()")

Move inet_twsk_schedule() to within inet_twsk_hashdance(), with the ehash
lock held. Expand the lock's critical section from inet_twsk_kill() to
inet_twsk_deschedule_put(), serializing the scheduling vs descheduling of
the timer. IOW, this prevents the following race:

     tcp_time_wait()
       inet_twsk_hashdance()
  inet_twsk_deschedule_put()
    del_timer_sync()
       inet_twsk_schedule()

Thanks to Paolo Abeni for suggesting to leverage the ehash lock.

This also restores a comment from commit ec94c2696f0b ("tcp/dccp: avoid
one atomic operation for timewait hashdance") as inet_twsk_hashdance() had
a "Step 1" and "Step 3" comment, but the "Step 2" had gone missing.

inet_twsk_deschedule_put() now acquires the ehash spinlock to synchronize
with inet_twsk_hashdance_schedule().

To ease possible regression search, actual un-pin is done in next patch.

Link: https://lore.kernel.org/all/ZPhpfMjSiHVjQkTk@localhost.localdomain/
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlxsw-acl-fixes'

Petr Machata says:

====================
mlxsw: ACL fixes

Ido Schimmel writes:

Patches #1-#3 fix various spelling mistakes I noticed while working on
the code base.

Patch #4 fixes a general protection fault by bailing out when the error
occurs and warning.

Patch #5 fixes the warning.

Patch #6 fixes ACL scale regression and firmware errors.

See the commit messages for more info.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_acl: Fix ACL scale regression and firmware errors

ACLs that reside in the algorithmic TCAM (A-TCAM) in Spectrum-2 and
newer ASICs can share the same mask if their masks only differ in up to
8 consecutive bits. For example, consider the following filters:

# tc filter add dev swp1 ingress pref 1 proto ip flower dst_ip 192.0.2.0/24 action drop
# tc filter add dev swp1 ingress pref 1 proto ip flower dst_ip 198.51.100.128/25 action drop

The second filter can use the same mask as the first (dst_ip/24) with a
delta of 1 bit.

However, the above only works because the two filters have different
values in the common unmasked part (dst_ip/24). When entries have the
same value in the common unmasked part they create undesired collisions
in the device since many entries now have the same key. This leads to
firmware errors such as [1] and to a reduced scale.

Fix by adjusting the hash table key to only include the value in the
common unmasked part. That is, without including the delta bits. That
way the driver will detect the collision during filter insertion and
spill the filter into the circuit TCAM (C-TCAM).

Add a test case that fails without the fix and adjust existing cases
that check C-TCAM spillage according to the above limitation.

[1]
mlxsw_spectrum2 0000:06:00.0: EMAD reg access failed (tid=3379b18a00003394,reg_id=3027(ptce3),type=write,status=8(resource not available))

Fixes: c22291f7cf45 ("mlxsw: spectrum: acl: Implement delta for ERP")
Reported-by: Alexander Zubkov <green@qrator.net>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Tested-by: Alexander Zubkov <green@qrator.net>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_acl_erp: Fix object nesting warning

ACLs in Spectrum-2 and newer ASICs can reside in the algorithmic TCAM
(A-TCAM) or in the ordinary circuit TCAM (C-TCAM). The former can
contain more ACLs (i.e., tc filters), but the number of masks in each
region (i.e., tc chain) is limited.

In order to mitigate the effects of the above limitation, the device
allows filters to share a single mask if their masks only differ in up
to 8 consecutive bits. For example, dst_ip/25 can be represented using
dst_ip/24 with a delta of 1 bit. The C-TCAM does not have a limit on the
number of masks being used (and therefore does not support mask
aggregation), but can contain a limited number of filters.

The driver uses the "objagg" library to perform the mask aggregation by
passing it objects that consist of the filter's mask and whether the
filter is to be inserted into the A-TCAM or the C-TCAM since filters in
different TCAMs cannot share a mask.

The set of created objects is dependent on the insertion order of the
filters and is not necessarily optimal. Therefore, the driver will
periodically ask the library to compute a more optimal set ("hints") by
looking at all the existing objects.

When the library asks the driver whether two objects can be aggregated
the driver only compares the provided masks and ignores the A-TCAM /
C-TCAM indication. This is the right thing to do since the goal is to
move as many filters as possible to the A-TCAM. The driver also forbids
two identical masks from being aggregated since this can only happen if
one was intentionally put in the C-TCAM to avoid a conflict in the
A-TCAM.

The above can result in the following set of hints:

H1: {mask X, A-TCAM} -> H2: {mask Y, A-TCAM} // X is Y + delta
H3: {mask Y, C-TCAM} -> H4: {mask Z, A-TCAM} // Y is Z + delta

After getting the hints from the library the driver will start migrating
filters from one region to another while consulting the computed hints
and instructing the device to perform a lookup in both regions during
the transition.

Assuming a filter with mask X is being migrated into the A-TCAM in the
new region, the hints lookup will return H1. Since H2 is the parent of
H1, the library will try to find the object associated with it and
create it if necessary in which case another hints lookup (recursive)
will be performed. This hints lookup for {mask Y, A-TCAM} will either
return H2 or H3 since the driver passes the library an object comparison
function that ignores the A-TCAM / C-TCAM indication.

This can eventually lead to nested objects which are not supported by
the library [1].

Fix by removing the object comparison function from both the driver and
the library as the driver was the only user. That way the lookup will
only return exact matches.

I do not have a reliable reproducer that can reproduce the issue in a
timely manner, but before the fix the issue would reproduce in several
minutes and with the fix it does not reproduce in over an hour.

Note that the current usefulness of the hints is limited because they
include the C-TCAM indication and represent aggregation that cannot
actually happen. This will be addressed in net-next.

[1]
WARNING: CPU: 0 PID: 153 at lib/objagg.c:170 objagg_obj_parent_assign+0xb5/0xd0
Modules linked in:
CPU: 0 PID: 153 Comm: kworker/0:18 Not tainted 6.9.0-rc6-custom-g70fbc2c1c38b #42
Hardware name: Mellanox Technologies Ltd. MSN3700C/VMOD0008, BIOS 5.11 10/10/2018
Workqueue: mlxsw_core mlxsw_sp_acl_tcam_vregion_rehash_work
RIP: 0010:objagg_obj_parent_assign+0xb5/0xd0
[...]
Call Trace:
<TASK>
__objagg_obj_get+0x2bb/0x580
objagg_obj_get+0xe/0x80
mlxsw_sp_acl_erp_mask_get+0xb5/0xf0
mlxsw_sp_acl_atcam_entry_add+0xe8/0x3c0
mlxsw_sp_acl_tcam_entry_create+0x5e/0xa0
mlxsw_sp_acl_tcam_vchunk_migrate_one+0x16b/0x270
mlxsw_sp_acl_tcam_vregion_rehash_work+0xbe/0x510
process_one_work+0x151/0x370

Fixes: 9069a3817d82 ("lib: objagg: implement optimization hints assembly and use hints for object creation")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Tested-by: Alexander Zubkov <green@qrator.net>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

lib: objagg: Fix general protection fault

The library supports aggregation of objects into other objects only if
the parent object does not have a parent itself. That is, nesting is not
supported.

Aggregation happens in two cases: Without and with hints, where hints
are a pre-computed recommendation on how to aggregate the provided
objects.

Nesting is not possible in the first case due to a check that prevents
it, but in the second case there is no check because the assumption is
that nesting cannot happen when creating objects based on hints. The
violation of this assumption leads to various warnings and eventually to
a general protection fault [1].

Before fixing the root cause, error out when nesting happens and warn.

[1]
general protection fault, probably for non-canonical address 0xdead000000000d90: 0000 [#1] PREEMPT SMP PTI
CPU: 1 PID: 1083 Comm: kworker/1:9 Tainted: G W 6.9.0-rc6-custom-gd9b4f1cca7fb #7
Hardware name: Mellanox Technologies Ltd. MSN3700/VMOD0005, BIOS 5.11 01/06/2019
Workqueue: mlxsw_core mlxsw_sp_acl_tcam_vregion_rehash_work
RIP: 0010:mlxsw_sp_acl_erp_bf_insert+0x25/0x80
[...]
Call Trace:
<TASK>
mlxsw_sp_acl_atcam_entry_add+0x256/0x3c0
mlxsw_sp_acl_tcam_entry_create+0x5e/0xa0
mlxsw_sp_acl_tcam_vchunk_migrate_one+0x16b/0x270
mlxsw_sp_acl_tcam_vregion_rehash_work+0xbe/0x510
process_one_work+0x151/0x370
worker_thread+0x2cb/0x3e0
kthread+0xd0/0x100
ret_from_fork+0x34/0x50
ret_from_fork_asm+0x1a/0x30
</TASK>

Fixes: 9069a3817d82 ("lib: objagg: implement optimization hints assembly and use hints for object creation")
Reported-by: Alexander Zubkov <green@qrator.net>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Tested-by: Alexander Zubkov <green@qrator.net>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_acl_atcam: Fix wrong comment

The key is encoded, not encrypted.

Fixes: c22291f7cf45 ("mlxsw: spectrum: acl: Implement delta for ERP")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Tested-by: Alexander Zubkov <green@qrator.net>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

lib: test_objagg: Fix spelling

Fixes: 0a020d416d0a ("lib: introduce initial implementation of object aggregation manager")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Tested-by: Alexander Zubkov <green@qrator.net>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

lib: objagg: Fix spelling

Fixes: 0a020d416d0a ("lib: introduce initial implementation of object aggregation manager")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Tested-by: Alexander Zubkov <green@qrator.net>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

dmaengine: ti: k3-udma-glue: clean up return in k3_udma_glue_rx_get_irq()

Currently the k3_udma_glue_rx_get_irq() function returns either negative
error codes or zero on error.  Generally, in the kernel, zero means
success so this be confusing and has caused bugs in the past.  Also the
"tx" version of this function only returns negative error codes.  Let's
clean this "rx" function so both functions match.

This patch has no effect on runtime.

Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Acked-by: Peter Ujfalusi <peter.ujfalusi@gmail.com>
Acked-by: Vinod Koul <vkoul@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

tools: ynl: make user space policies const

Dan, who's working on C++ YNL, pointed out that the C code
does not make policies const. Sprinkle some 'const's around.

Reported-by: Dan Melnic <dmm@meta.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

page_pool: remove WARN_ON() with OR

Having an OR in WARN_ON() makes me sad because it's impossible to tell
which condition is true when triggered.

Split a WARN_ON() with an OR in page_pool_disable_direct_recycling().

Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ti: icssg-prueth: Add multicast filtering support

Add multicast filtering support for ICSSG Driver. Multicast addresses will
be updated by __dev_mc_sync() API. icssg_prueth_add_macst () and
icssg_prueth_del_mcast() will be sync and unsync APIs for the driver
respectively.

To add a mac_address for a port, driver needs to call icssg_fdb_add_del()
and pass the mac_address and BIT(port_id) to the API. The ICSSG firmware
will then configure the rules and allow filtering.

If a mac_address is added to port0 and the same mac_address needs to be
added for port1, driver needs to pass BIT(port0) | BIT(port1) to the
icssg_fdb_add_del() API. If driver just pass BIT(port1) then the entry for
port0 will be overwritten / lost. This is a design constraint on the
firmware side.

To overcome this in the driver, to add any mac_address for let's say portX
driver first checks if the same mac_address is already added for any other
port. If yes driver calls icssg_fdb_add_del() with BIT(portX) |
BIT(other_existing_port). If not, driver calls icssg_fdb_add_del() with
BIT(portX).

The same thing is applicable for deleting mac_addresses as well. This
logic is in icssg_prueth_add_mcast / icssg_prueth_del_mcast APIs.

Signed-off-by: MD Danish Anwar <danishanwar@ti.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8152: Set NET_ADDR_STOLEN if using passthru MAC

Some docks support MAC pass-through - MAC address
is taken from another device.

Driver should indicate that with NET_ADDR_STOLEN flag.

This should help to avoid collisions if network interface
names are generated with MAC policy.

Reported and discussed here
https://github.com/systemd/systemd/issues/33104

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20240605153340.25694-1-gmazyland@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/bpf: Drop useless arguments of do_test in bpf_tcp_ca

bpf_map_lookup_elem() has been removed from do_test(), it makes the
sk_stg_map argument of do_test() useless. In addition, two exactly the
same opts are passed in all the places where do_test() is invoked, so
cli_opts argument can be dropped too.

This patch drops these two useless arguments of do_test() in bpf_tcp_ca.c.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/7056eab111d78a05bce29d2821228dc93f240de4.1717054461.git.tanggeliang@kylinos.cn

selftests/bpf: Use start_test in test_dctcp in bpf_tcp_ca

The "if (sk_stg_map)" block in do_test() is only used by test_dctcp(),
it makes sense to move it from do_test() into test_dctcp(). Then
do_test() can be used by other tests except test_dctcp().

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/9938916627b9810c877e5c03a621bc0ba5acf5c5.1717054461.git.tanggeliang@kylinos.cn

selftests/bpf: Use start_test in test_dctcp_fallback in bpf_tcp_ca

The newly added helper start_test() can be used in test_dctcp_fallback()
too, to replace start_server_str() and connect_to_fd_opts(). In that
way, two network_helper_opts srv_opts and cli_opts are used instead of
the previously shared opts.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/792ca3bb013fa06e618176da02d75e4f79a76733.1717054461.git.tanggeliang@kylinos.cn

selftests/bpf: Add start_test helper in bpf_tcp_ca

For moving the "if (sk_stg_map)" block out of do_test(), extract the
code before this block as a new function start_test(). It creates
server-side and client-side sockets and returns them to the caller.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/48f2921ff9be958f5d3d28fe6bb7269a61cafa9f.1717054461.git.tanggeliang@kylinos.cn

selftests/bpf: Use connect_to_fd_opts in do_test in bpf_tcp_ca

This patch uses connect_to_fd_opts() instead of using connect_fd_to_fd()
and settcpca() in do_test() in prog_tests/bpf_tcp_ca.c to accept a struct
network_helper_opts argument.

Then define a dctcp dedicated post_socket_cb callback stg_post_socket_cb(),
invoking both settcpca() and bpf_map_update_elem() in it, and set it in
test_dctcp(). For passing map_fd into stg_post_socket_cb() callback, a new
member map_fd is added in struct cb_opts.

Add another "const struct network_helper_opts *cli_opts" to do_test() to
separate it from the server "opts".

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/876ec90430865bc468e3b7f6fb2648420b075548.1717054461.git.tanggeliang@kylinos.cn

Merge git://git./linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:

drivers/net/ethernet/pensando/ionic/ionic_txrx.c
  d9c04209990b ("ionic: Mark error paths in the data path as unlikely")
  491aee894a08 ("ionic: fix kernel panic in XDP_TX action")

net/ipv6/ip6_fib.c
  b4cb4a1391dc ("net: use unrcu_pointer() helper")
  b01e1c030770 ("ipv6: fix possible race in __fib6_drop_pcpu_from()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

libbpf: Auto-attach struct_ops BPF maps in BPF skeleton

Similarly to `bpf_program`, support `bpf_map` automatic attachment in
`bpf_object__attach_skeleton`. Currently only struct_ops maps could be
attached.

On bpftool side, code-generate links in skeleton struct for struct_ops maps.
Similarly to `bpf_program_skeleton`, set links in `bpf_map_skeleton`.

On libbpf side, extend `bpf_map` with new `autoattach` field to support
enabling or disabling autoattach functionality, introducing
getter/setter for this field.

`bpf_object__(attach|detach)_skeleton` is extended with
attaching/detaching struct_ops maps logic.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240605175135.117127-1-yatsenko@meta.com

Merge tag 'net-6.10-rc3' of git://git./linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
"Including fixes from BPF and big collection of fixes for WiFi core and
  drivers.

  Current release - regressions:

   - vxlan: fix regression when dropping packets due to invalid src
     addresses

   - bpf: fix a potential use-after-free in bpf_link_free()

   - xdp: revert support for redirect to any xsk socket bound to the
     same UMEM as it can result in a corruption

   - virtio_net:
      - add missing lock protection when reading return code from
        control_buf
      - fix false-positive lockdep splat in DIM
      - Revert "wifi: wilc1000: convert list management to RCU"

   - wifi: ath11k: fix error path in ath11k_pcic_ext_irq_config

  Previous releases - regressions:

   - rtnetlink: make the "split" NLM_DONE handling generic, restore the
     old behavior for two cases where we started coalescing those
     messages with normal messages, breaking sloppily-coded userspace

   - wifi:
      - cfg80211: validate HE operation element parsing
      - cfg80211: fix 6 GHz scan request building
      - mt76: mt7615: add missing chanctx ops
      - ath11k: move power type check to ASSOC stage, fix connecting to
        6 GHz AP
      - ath11k: fix WCN6750 firmware crash caused by 17 num_vdevs
      - rtlwifi: ignore IEEE80211_CONF_CHANGE_RETRY_LIMITS
      - iwlwifi: mvm: fix a crash on 7265

  Previous releases - always broken:

   - ncsi: prevent multi-threaded channel probing, a spec violation

   - vmxnet3: disable rx data ring on dma allocation failure

   - ethtool: init tsinfo stats if requested, prevent unintentionally
     reporting all-zero stats on devices which don't implement any

   - dst_cache: fix possible races in less common IPv6 features

   - tcp: auth: don't consider TCP_CLOSE to be in TCP_AO_ESTABLISHED

   - ax25: fix two refcounting bugs

   - eth: ionic: fix kernel panic in XDP_TX action

  Misc:

   - tcp: count CLOSE-WAIT sockets for TCP_MIB_CURRESTAB"

* tag 'net-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (107 commits)
  selftests: net: lib: set 'i' as local
  selftests: net: lib: avoid error removing empty netns name
  selftests: net: lib: support errexit with busywait
  net: ethtool: fix the error condition in ethtool_get_phy_stats_ethtool()
  ipv6: fix possible race in __fib6_drop_pcpu_from()
  af_unix: Annotate data-race of sk->sk_shutdown in sk_diag_fill().
  af_unix: Use skb_queue_len_lockless() in sk_diag_show_rqlen().
  af_unix: Use skb_queue_empty_lockless() in unix_release_sock().
  af_unix: Use unix_recvq_full_lockless() in unix_stream_connect().
  af_unix: Annotate data-race of net->unx.sysctl_max_dgram_qlen.
  af_unix: Annotate data-races around sk->sk_sndbuf.
  af_unix: Annotate data-races around sk->sk_state in UNIX_DIAG.
  af_unix: Annotate data-race of sk->sk_state in unix_stream_read_skb().
  af_unix: Annotate data-races around sk->sk_state in sendmsg() and recvmsg().
  af_unix: Annotate data-race of sk->sk_state in unix_accept().
  af_unix: Annotate data-race of sk->sk_state in unix_stream_connect().
  af_unix: Annotate data-races around sk->sk_state in unix_write_space() and poll().
  af_unix: Annotate data-race of sk->sk_state in unix_inq_len().
  af_unix: Annodate data-races around sk->sk_state for writers.
  af_unix: Set sk->sk_state under unix_state_lock() for truly disconencted peer.
  ...

Merge tag 'tomoyo-pr-20240606' of git://git.code.sf.net/p/tomoyo/tomoyo

Pull tomoyo fixlet from Tetsuo Handa:
"Single patch to update project links, no behavior changes"

* tag 'tomoyo-pr-20240606' of git://git.code.sf.net/p/tomoyo/tomoyo:
tomoyo: update project links