linux-2.6-microblaze.git
2 years agoMerge branch 'for-next/entry' into for-next/core
Catalin Marinas [Thu, 26 Aug 2021 10:49:33 +0000 (11:49 +0100)]
Merge branch 'for-next/entry' into for-next/core

* for-next/entry:
  : More entry.S clean-ups and conversion to C.
  arm64: entry: call exit_to_user_mode() from C
  arm64: entry: move bulk of ret_to_user to C
  arm64: entry: clarify entry/exit helpers
  arm64: entry: consolidate entry/exit helpers

2 years agoMerge branches 'for-next/mte', 'for-next/misc' and 'for-next/kselftest', remote-track...
Catalin Marinas [Thu, 26 Aug 2021 10:49:27 +0000 (11:49 +0100)]
Merge branches 'for-next/mte', 'for-next/misc' and 'for-next/kselftest', remote-tracking branch 'arm64/for-next/perf' into for-next/core

* arm64/for-next/perf:
  arm64/perf: Replace '0xf' instances with ID_AA64DFR0_PMUVER_IMP_DEF

* for-next/mte:
  : Miscellaneous MTE improvements.
  arm64/cpufeature: Optionally disable MTE via command-line
  arm64: kasan: mte: remove redundant mte_report_once logic
  arm64: kasan: mte: use a constant kernel GCR_EL1 value
  arm64: avoid double ISB on kernel entry
  arm64: mte: optimize GCR_EL1 modification on kernel entry/exit
  Documentation: document the preferred tag checking mode feature
  arm64: mte: introduce a per-CPU tag checking mode preference
  arm64: move preemption disablement to prctl handlers
  arm64: mte: change ASYNC and SYNC TCF settings into bitfields
  arm64: mte: rename gcr_user_excl to mte_ctrl
  arm64: mte: avoid TFSRE0_EL1 related operations unless in async mode

* for-next/misc:
  : Miscellaneous updates.
  arm64: Do not trap PMSNEVFR_EL1
  arm64: mm: fix comment typo of pud_offset_phys()
  arm64: signal32: Drop pointless call to sigdelsetmask()
  arm64/sve: Better handle failure to allocate SVE register storage
  arm64: Document the requirement for SCR_EL3.HCE
  arm64: head: avoid over-mapping in map_memory
  arm64/sve: Add a comment documenting the binutils needed for SVE asm
  arm64/sve: Add some comments for sve_save/load_state()
  arm64: replace in_irq() with in_hardirq()
  arm64: mm: Fix TLBI vs ASID rollover
  arm64: entry: Add SYM_CODE annotation for __bad_stack
  arm64: fix typo in a comment
  arm64: move the (z)install rules to arch/arm64/Makefile
  arm64/sve: Make fpsimd_bind_task_to_cpu() static
  arm64: unnecessary end 'return;' in void functions
  arm64/sme: Document boot requirements for SME
  arm64: use __func__ to get function name in pr_err
  arm64: SSBS/DIT: print SSBS and DIT bit when printing PSTATE
  arm64: cpufeature: Use defined macro instead of magic numbers
  arm64/kexec: Test page size support with new TGRAN range values

* for-next/kselftest:
  : Kselftest additions for arm64.
  kselftest/arm64: signal: Add a TODO list for signal handling tests
  kselftest/arm64: signal: Add test case for SVE register state in signals
  kselftest/arm64: signal: Verify that signals can't change the SVE vector length
  kselftest/arm64: signal: Check SVE signal frame shows expected vector length
  kselftest/arm64: signal: Support signal frames with SVE register data
  kselftest/arm64: signal: Add SVE to the set of features we can check for
  kselftest/arm64: pac: Fix skipping of tests on systems without PAC
  kselftest/arm64: mte: Fix misleading output when skipping tests
  kselftest/arm64: Add a TODO list for floating point tests
  kselftest/arm64: Add tests for SVE vector configuration
  kselftest/arm64: Validate vector lengths are set in sve-probe-vls
  kselftest/arm64: Provide a helper binary and "library" for SVE RDVL
  kselftest/arm64: Ignore check_gcr_el1_cswitch binary

2 years agoarm64: Do not trap PMSNEVFR_EL1
Alexandru Elisei [Tue, 24 Aug 2021 15:45:23 +0000 (16:45 +0100)]
arm64: Do not trap PMSNEVFR_EL1

Commit 31c00d2aeaa2 ("arm64: Disable fine grained traps on boot") zeroed
the fine grained trap registers to prevent unwanted register traps from
occuring. However, for the PMSNEVFR_EL1 register, the corresponding
HDFG{R,W}TR_EL2.nPMSNEVFR_EL1 fields must be 1 to disable trapping. Set
both fields to 1 if FEAT_SPEv1p2 is detected to disable read and write
traps.

Fixes: 31c00d2aeaa2 ("arm64: Disable fine grained traps on boot")
Cc: <stable@vger.kernel.org> # 5.13.x
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210824154523.906270-1-alexandru.elisei@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64: mm: fix comment typo of pud_offset_phys()
Xujun Leng [Wed, 25 Aug 2021 15:05:26 +0000 (23:05 +0800)]
arm64: mm: fix comment typo of pud_offset_phys()

Fix a typo in the comment of macro pud_offset_phys().

Signed-off-by: Xujun Leng <lengxujun2007@126.com>
Link: https://lore.kernel.org/r/20210825150526.12582-1-lengxujun2007@126.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64: signal32: Drop pointless call to sigdelsetmask()
Will Deacon [Wed, 25 Aug 2021 09:39:11 +0000 (10:39 +0100)]
arm64: signal32: Drop pointless call to sigdelsetmask()

Commit 77097ae503b1 ("most of set_current_blocked() callers want
SIGKILL/SIGSTOP removed from set") extended set_current_blocked() to
remove SIGKILL and SIGSTOP from the new signal set and updated all
callers accordingly.

Unfortunately, this collided with the merge of the arm64 architecture,
which duly removes these signals when restoring the compat sigframe, as
this was what was previously done by arch/arm/.

Remove the redundant call to sigdelsetmask() from
compat_restore_sigframe().

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210825093911.24493-1-will@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64/sve: Better handle failure to allocate SVE register storage
Mark Brown [Tue, 24 Aug 2021 15:34:17 +0000 (16:34 +0100)]
arm64/sve: Better handle failure to allocate SVE register storage

Currently we "handle" failure to allocate the SVE register storage by
doing a BUG_ON() and hoping for the best. This is obviously not great and
the memory allocation failure will already be loud enough without the
BUG_ON(). As the comment says it is a corner case but let's try to do a bit
better, remove the BUG_ON() and add code to handle the failure in the
callers.

For the ptrace and signal code we can return -ENOMEM gracefully however
we have no real error reporting path available to us for the SVE access
trap so instead generate a SIGKILL if the allocation fails there. This
at least means that we won't try to soldier on and end up trying to
access the nonexistant state and while it's obviously not ideal for
userspace SIGKILL doesn't allow any handling so minimises the ABI
impact, making it easier to improve the interface later if we come up
with a better idea.

Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20210824153417.18371-1-broonie@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64: Document the requirement for SCR_EL3.HCE
Marc Zyngier [Thu, 12 Aug 2021 19:02:13 +0000 (20:02 +0100)]
arm64: Document the requirement for SCR_EL3.HCE

It is amazing that we never documented this absolutely basic
requirement: if you boot the kernel at EL2, you'd better
enable the HVC instruction from EL3.

Really, just do it.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20210812190213.2601506-6-maz@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64: head: avoid over-mapping in map_memory
Mark Rutland [Mon, 23 Aug 2021 10:12:53 +0000 (11:12 +0100)]
arm64: head: avoid over-mapping in map_memory

The `compute_indices` and `populate_entries` macros operate on inclusive
bounds, and thus the `map_memory` macro which uses them also operates
on inclusive bounds.

We pass `_end` and `_idmap_text_end` to `map_memory`, but these are
exclusive bounds, and if one of these is sufficiently aligned (as a
result of kernel configuration, physical placement, and KASLR), then:

* In `compute_indices`, the computed `iend` will be in the page/block *after*
  the final byte of the intended mapping.

* In `populate_entries`, an unnecessary entry will be created at the end
  of each level of table. At the leaf level, this entry will map up to
  SWAPPER_BLOCK_SIZE bytes of physical addresses that we did not intend
  to map.

As we may map up to SWAPPER_BLOCK_SIZE bytes more than intended, we may
violate the boot protocol and map physical address past the 2MiB-aligned
end address we are permitted to map. As we map these with Normal memory
attributes, this may result in further problems depending on what these
physical addresses correspond to.

The final entry at each level may require an additional table at that
level. As EARLY_ENTRIES() calculates an inclusive bound, we allocate
enough memory for this.

Avoid the extraneous mapping by having map_memory convert the exclusive
end address to an inclusive end address by subtracting one, and do
likewise in EARLY_ENTRIES() when calculating the number of required
tables. For clarity, comments are updated to more clearly document which
boundaries the macros operate on.  For consistency with the other
macros, the comments in map_memory are also updated to describe `vstart`
and `vend` as virtual addresses.

Fixes: 0370b31e4845 ("arm64: Extend early page table code to allow for larger kernels")
Cc: <stable@vger.kernel.org> # 4.16.x
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Will Deacon <will@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210823101253.55567-1-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64/sve: Add a comment documenting the binutils needed for SVE asm
Mark Brown [Mon, 16 Aug 2021 12:50:23 +0000 (13:50 +0100)]
arm64/sve: Add a comment documenting the binutils needed for SVE asm

At some point it would be nice to avoid the need to manually encode SVE
instructions, add a note of the binutils version required to save looking
it up.

Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20210816125024.8112-1-broonie@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64/sve: Add some comments for sve_save/load_state()
Mark Brown [Thu, 12 Aug 2021 20:11:43 +0000 (21:11 +0100)]
arm64/sve: Add some comments for sve_save/load_state()

The use of macros for the actual function bodies means legibility is always
going to be a bit of a challenge, especially while we can't rely on SVE
support in the toolchain, but this helps a little.

Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20210812201143.35578-1-broonie@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agokselftest/arm64: signal: Add a TODO list for signal handling tests
Mark Brown [Thu, 19 Aug 2021 13:42:45 +0000 (14:42 +0100)]
kselftest/arm64: signal: Add a TODO list for signal handling tests

Note down a few gaps in our coverage.

Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20210819134245.13935-7-broonie@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agokselftest/arm64: signal: Add test case for SVE register state in signals
Mark Brown [Thu, 19 Aug 2021 13:42:44 +0000 (14:42 +0100)]
kselftest/arm64: signal: Add test case for SVE register state in signals

Currently this doesn't actually verify that the register contents do the
right thing, it just verifes that a SVE context with appropriate size
appears.

Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20210819134245.13935-6-broonie@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agokselftest/arm64: signal: Verify that signals can't change the SVE vector length
Mark Brown [Thu, 19 Aug 2021 13:42:43 +0000 (14:42 +0100)]
kselftest/arm64: signal: Verify that signals can't change the SVE vector length

We do not support changing the SVE vector length as part of signal return,
verify that this is the case if the system supports multiple vector lengths.

Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20210819134245.13935-5-broonie@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agokselftest/arm64: signal: Check SVE signal frame shows expected vector length
Mark Brown [Thu, 19 Aug 2021 13:42:42 +0000 (14:42 +0100)]
kselftest/arm64: signal: Check SVE signal frame shows expected vector length

As a basic check that the SVE signal frame is being set up correctly
verify that the vector length in the signal frame is the vector length
that the process has.

Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20210819134245.13935-4-broonie@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agokselftest/arm64: signal: Support signal frames with SVE register data
Mark Brown [Thu, 19 Aug 2021 13:42:41 +0000 (14:42 +0100)]
kselftest/arm64: signal: Support signal frames with SVE register data

A signal frame with SVE may validly either be a bare struct sve_context or
a struct sve_context followed by vector length dependent register data.
Support either in the generic helpers for the signal tests, and while we're
at it validate the SVE vector length reported.

Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20210819134245.13935-3-broonie@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agokselftest/arm64: signal: Add SVE to the set of features we can check for
Mark Brown [Thu, 19 Aug 2021 13:42:40 +0000 (14:42 +0100)]
kselftest/arm64: signal: Add SVE to the set of features we can check for

Allow testcases for SVE signal handling to flag the dependency and be
skipped on systems without SVE support.

Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20210819134245.13935-2-broonie@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64: replace in_irq() with in_hardirq()
Changbin Du [Sat, 14 Aug 2021 00:54:05 +0000 (08:54 +0800)]
arm64: replace in_irq() with in_hardirq()

Replace the obsolete and ambiguos macro in_irq() with new
macro in_hardirq().

Signed-off-by: Changbin Du <changbin.du@gmail.com>
Link: https://lore.kernel.org/r/20210814005405.2658-1-changbin.du@gmail.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agokselftest/arm64: pac: Fix skipping of tests on systems without PAC
Mark Brown [Thu, 19 Aug 2021 16:57:23 +0000 (17:57 +0100)]
kselftest/arm64: pac: Fix skipping of tests on systems without PAC

The PAC tests check to see if the system supports the relevant PAC features
but instead of skipping the tests if they can't be executed they fail the
tests which makes things look like they're not working when they are.

Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20210819165723.43903-1-broonie@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agokselftest/arm64: mte: Fix misleading output when skipping tests
Mark Brown [Thu, 19 Aug 2021 17:29:02 +0000 (18:29 +0100)]
kselftest/arm64: mte: Fix misleading output when skipping tests

When skipping the tests due to a lack of system support for MTE we
currently print a message saying FAIL which makes it look like the test
failed even though the test did actually report KSFT_SKIP, creating some
confusion. Change the error message to say SKIP instead so things are
clearer.

Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20210819172902.56211-1-broonie@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64/perf: Replace '0xf' instances with ID_AA64DFR0_PMUVER_IMP_DEF
Anshuman Khandual [Wed, 11 Aug 2021 03:27:07 +0000 (08:57 +0530)]
arm64/perf: Replace '0xf' instances with ID_AA64DFR0_PMUVER_IMP_DEF

ID_AA64DFR0_PMUVER_IMP_DEF, indicating an "implementation defined" PMU,
never actually gets used although there are '0xf' instances scattered
all around. Use the symbolic name instead of the raw hex constant.

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/1628652427-24695-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2 years agoarm64: mm: Fix TLBI vs ASID rollover
Will Deacon [Fri, 6 Aug 2021 11:31:04 +0000 (12:31 +0100)]
arm64: mm: Fix TLBI vs ASID rollover

When switching to an 'mm_struct' for the first time following an ASID
rollover, a new ASID may be allocated and assigned to 'mm->context.id'.
This reassignment can happen concurrently with other operations on the
mm, such as unmapping pages and subsequently issuing TLB invalidation.

Consequently, we need to ensure that (a) accesses to 'mm->context.id'
are atomic and (b) all page-table updates made prior to a TLBI using the
old ASID are guaranteed to be visible to CPUs running with the new ASID.

This was found by inspection after reviewing the VMID changes from
Shameer but it looks like a real (yet hard to hit) bug.

Cc: <stable@vger.kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Jade Alglave <jade.alglave@arm.com>
Cc: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20210806113109.2475-2-will@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64: entry: Add SYM_CODE annotation for __bad_stack
Mark Brown [Wed, 4 Aug 2021 18:17:10 +0000 (19:17 +0100)]
arm64: entry: Add SYM_CODE annotation for __bad_stack

When converting arm64 to modern assembler annotations __bad_stack was left
as a raw local label without annotations. While this will have little if
any practical impact at present it may cause issues in the future if we
start using the annotations for things like reliable stack trace. Add
SYM_CODE annotations to fix this.

Signed-off-by: Mark Brown <broonie@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210804181710.19059-1-broonie@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64: entry: call exit_to_user_mode() from C
Mark Rutland [Mon, 2 Aug 2021 14:07:33 +0000 (15:07 +0100)]
arm64: entry: call exit_to_user_mode() from C

When handling an exception from EL0, we perform the entry work in that
exception's C handler, and once the C handler has finished, we return
back to the entry assembly. Subsequently in the common `ret_to_user`
assembly we perform the exit work that balances with the entry work.
This can be somewhat difficult to follow, and makes it hard to rework
the return paths (e.g. to pass additional context to the exit code, or
to have exception return logic for specific exceptions).

This patch reworks the entry code such that each EL0 C exception handler
is responsible for both the entry and exit work. This clearly balances
the two (and will permit additional variation in future), and avoids an
unnecessary bounce between assembly and C in the common case, leaving
`ret_from_fork` as the only place assembly has to call the exit code.
This means that the exit work is now inlined into the C handler, which
is already the case for the entry work, and allows the compiler to
generate better code (e.g. by immediately returning when there is no
exit work to perform).

To align with other exception entry/exit helpers, enter_from_user_mode()
is updated to take the EL0 pt_regs as a parameter, though this is
currently unused.

There should be no functional change as a result of this patch. However,
this should lead to slightly better backtraces when an error is
encountered within do_notify_resume(), as the C handler should appear in
the backtrace, indicating the specific exception that the kernel was
entered with.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20210802140733.52716-5-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64: entry: move bulk of ret_to_user to C
Mark Rutland [Mon, 2 Aug 2021 14:07:32 +0000 (15:07 +0100)]
arm64: entry: move bulk of ret_to_user to C

In `ret_to_user` we perform some conditional work depending on the
thread flags, then perform some IRQ/context tracking which is intended
to balance with the IRQ/context tracking performed in the entry C code.

For simplicity and consistency, it would be preferable to move this all
to C. As a step towards that, this patch moves the conditional work and
IRQ/context tracking into a C helper function. To aid bisectability,
this is called from the `ret_to_user` assembly, and a subsequent patch
will move the call to C code.

As local_daif_mask() handles all necessary tracing and PMR manipulation,
we no longer need to handle this explicitly. As we call
exit_to_user_mode() directly, the `user_enter_irqoff` macro is no longer
used, and can be removed. As enter_from_user_mode() and
exit_to_user_mode() are no longer called from assembly, these can be
made static, and as these are typically very small, they are marked
__always_inline to avoid the overhead of a function call.

For now, enablement of single-step is left in entry.S, and for this we
still need to read the flags in ret_to_user(). It is safe to read this
separately as TIF_SINGLESTEP is not part of _TIF_WORK_MASK.

There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20210802140733.52716-4-mark.rutland@arm.com
[catalin.marinas@arm.com: removed unused gic_prio_kentry_setup macro]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64: entry: clarify entry/exit helpers
Mark Rutland [Mon, 2 Aug 2021 14:07:31 +0000 (15:07 +0100)]
arm64: entry: clarify entry/exit helpers

When entering an exception, we must perform irq/context state management
before we can use instrumentable C code. Similarly, when exiting an
exception we cannot use instrumentable C code after we perform
irq/context state management.

Originally, we'd intended that the enter_from_*() and exit_to_*()
helpers would enforce this by virtue of being the first and last
functions called, respectively, in an exception handler. However, as
they now call instrumentable code themselves, this is not as clearly
true.

To make this more robust, this patch splits the irq/context state
management into separate helpers, with all the helpers commented to make
their intended purpose more obvious.

In exit_to_kernel_mode() we'll now check TFSR_EL1 before we assert that
IRQs are disabled, but this ordering is not important, and other than
this there should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20210802140733.52716-3-mark.rutland@arm.com
[catalin.marinas@arm.com: comment typos fix-up]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64: entry: consolidate entry/exit helpers
Mark Rutland [Mon, 2 Aug 2021 14:07:30 +0000 (15:07 +0100)]
arm64: entry: consolidate entry/exit helpers

To make the various entry/exit helpers easier to understand and easier
to compare, this patch moves all the entry/exit helpers to be adjacent
at the top of entry-common.c, rather than being spread out throughout
the file.

There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20210802140733.52716-2-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agokselftest/arm64: Add a TODO list for floating point tests
Mark Brown [Tue, 3 Aug 2021 14:04:50 +0000 (15:04 +0100)]
kselftest/arm64: Add a TODO list for floating point tests

Write down some ideas for additional coverage for floating point in case
someone feels inspired to look into them.

Signed-off-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Link: https://lore.kernel.org/r/20210803140450.46624-5-broonie@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agokselftest/arm64: Add tests for SVE vector configuration
Mark Brown [Tue, 3 Aug 2021 14:04:49 +0000 (15:04 +0100)]
kselftest/arm64: Add tests for SVE vector configuration

We provide interfaces for configuring the SVE vector length seen by
processes using prctl and also via /proc for configuring the default
values. Provide tests that exercise all these interfaces and verify that
they take effect as expected, though at present no test fully enumerates
all the possible vector lengths.

A subset of this is already tested via sve-probe-vls but the /proc
interfaces are not currently covered at all.

In preparation for the forthcoming support for SME, the Scalable Matrix
Extension, which has separately but similarly configured vector lengths
which we expect to offer similar userspace interfaces for, all the actual
files and prctls used are parameterised and we don't validate that the
architectural minimum vector length is the minimum we see.

Signed-off-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Link: https://lore.kernel.org/r/20210803140450.46624-4-broonie@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agokselftest/arm64: Validate vector lengths are set in sve-probe-vls
Mark Brown [Tue, 3 Aug 2021 14:04:48 +0000 (15:04 +0100)]
kselftest/arm64: Validate vector lengths are set in sve-probe-vls

Currently sve-probe-vls does not verify that the vector lengths reported
by the prctl() interface are actually what is reported by the architecture,
use the rdvl_sve() helper to validate this.

Signed-off-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Link: https://lore.kernel.org/r/20210803140450.46624-3-broonie@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agokselftest/arm64: Provide a helper binary and "library" for SVE RDVL
Mark Brown [Tue, 3 Aug 2021 14:04:47 +0000 (15:04 +0100)]
kselftest/arm64: Provide a helper binary and "library" for SVE RDVL

SVE provides an instruction RDVL which reports the currently configured
vector length. In order to validate that our vector length configuration
interfaces are working correctly without having to build the C code for
our test programs with SVE enabled provide a trivial assembly library
with a C callable function that executes RDVL. Since these interfaces
also control behaviour on exec*() provide a trivial wrapper program which
reports the currently configured vector length on stdout, tests can use
this to verify that behaviour on exec*() is as expected.

In preparation for providing similar helper functionality for SME, the
Scalable Matrix Extension, which allows separately configured vector
lengths to be read back both the assembler function and wrapper binary
have SVE included in their name.

Signed-off-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Link: https://lore.kernel.org/r/20210803140450.46624-2-broonie@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64: fix typo in a comment
Jason Wang [Tue, 3 Aug 2021 14:20:20 +0000 (22:20 +0800)]
arm64: fix typo in a comment

The double 'the' after 'If' in this comment "If the the TLB range ops
are supported..." is repeated. Consequently, one 'the' should be
removed from the comment.

Signed-off-by: Jason Wang <wangborong@cdjrlc.com>
Link: https://lore.kernel.org/r/20210803142020.124230-1-wangborong@cdjrlc.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64: move the (z)install rules to arch/arm64/Makefile
Masahiro Yamada [Thu, 29 Jul 2021 14:05:27 +0000 (23:05 +0900)]
arm64: move the (z)install rules to arch/arm64/Makefile

Currently, the (z)install targets in arch/arm64/Makefile descend into
arch/arm64/boot/Makefile to invoke the shell script, but there is no
good reason to do so.

arch/arm64/Makefile can run the shell script directly.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Link: https://lore.kernel.org/r/20210729140527.443116-1-masahiroy@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64/cpufeature: Optionally disable MTE via command-line
Yee Lee [Tue, 3 Aug 2021 07:08:22 +0000 (15:08 +0800)]
arm64/cpufeature: Optionally disable MTE via command-line

MTE support needs to be optionally disabled in runtime
for HW issue workaround, FW development and some
evaluation works on system resource and performance.

This patch makes two changes:
(1) moves init of tag-allocation bits(ATA/ATA0) to
cpu_enable_mte() as not cached in TLB.

(2) allows ID_AA64PFR1_EL1.MTE to be overridden on
its shadow value by giving "arm64.nomte" on cmdline.

When the feature value is off, ATA and TCF will not set
and the related functionalities are accordingly suppressed.

Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Suggested-by: Marc Zyngier <maz@kernel.org>
Suggested-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Yee Lee <yee.lee@mediatek.com>
Link: https://lore.kernel.org/r/20210803070824.7586-2-yee.lee@mediatek.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64: kasan: mte: remove redundant mte_report_once logic
Mark Rutland [Wed, 14 Jul 2021 14:38:43 +0000 (15:38 +0100)]
arm64: kasan: mte: remove redundant mte_report_once logic

We have special logic to suppress MTE tag check fault reporting, based
on a global `mte_report_once` and `reported` variables. These can be
used to suppress calling kasan_report() when taking a tag check fault,
but do not prevent taking the fault in the first place, nor does they
affect the way we disable tag checks upon taking a fault.

The core KASAN code already defaults to reporting a single fault, and
has a `multi_shot` control to permit reporting multiple faults. The only
place we transiently alter `mte_report_once` is in lib/test_kasan.c,
where we also the `multi_shot` state as the same time. Thus
`mte_report_once` and `reported` are redundant, and can be removed.

When a tag check fault is taken, tag checking will be disabled by
`do_tag_recovery` and must be explicitly re-enabled if desired. The test
code does this by calling kasan_enable_tagging_sync().

This patch removes the redundant mte_report_once() logic and associated
variables.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Tested-by: Andrey Konovalov <andreyknvl@gmail.com>
Link: https://lore.kernel.org/r/20210714143843.56537-4-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64: kasan: mte: use a constant kernel GCR_EL1 value
Mark Rutland [Wed, 14 Jul 2021 14:38:42 +0000 (15:38 +0100)]
arm64: kasan: mte: use a constant kernel GCR_EL1 value

When KASAN_HW_TAGS is selected, KASAN is enabled at boot time, and the
hardware supports MTE, we'll initialize `kernel_gcr_excl` with a value
dependent on KASAN_TAG_MAX. While the resulting value is a constant
which depends on KASAN_TAG_MAX, we have to perform some runtime work to
generate the value, and have to read the value from memory during the
exception entry path. It would be better if we could generate this as a
constant at compile-time, and use it as such directly.

Early in boot within __cpu_setup(), we initialize GCR_EL1 to a safe
value, and later override this with the value required by KASAN. If
CONFIG_KASAN_HW_TAGS is not selected, or if KASAN is disabeld at boot
time, the kernel will not use IRG instructions, and so the initial value
of GCR_EL1 is does not matter to the kernel. Thus, we can instead have
__cpu_setup() initialize GCR_EL1 to a value consistent with
KASAN_TAG_MAX, and avoid the need to re-initialize it during hotplug and
resume form suspend.

This patch makes arem64 use a compile-time constant KERNEL_GCR_EL1
value, which is compatible with KASAN_HW_TAGS when this is selected.
This removes the need to re-initialize GCR_EL1 dynamically, and acts as
an optimization to the entry assembly, which no longer needs to load
this value from memory. The redundant initialization hooks are removed.

In order to do this, KASAN_TAG_MAX needs to be visible outside of the
core KASAN code. To do this, I've moved the KASAN_TAG_* values into
<linux/kasan-tags.h>.

There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Tested-by: Andrey Konovalov <andreyknvl@gmail.com>
Link: https://lore.kernel.org/r/20210714143843.56537-3-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64/sve: Make fpsimd_bind_task_to_cpu() static
Mark Brown [Fri, 30 Jul 2021 16:58:46 +0000 (17:58 +0100)]
arm64/sve: Make fpsimd_bind_task_to_cpu() static

This function is not referenced outside fpsimd.c so can be static, making
it that little bit easier to follow what is called from where.

Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20210730165846.18558-1-broonie@kernel.org
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agokselftest/arm64: Ignore check_gcr_el1_cswitch binary
Mark Brown [Wed, 28 Jul 2021 17:35:39 +0000 (18:35 +0100)]
kselftest/arm64: Ignore check_gcr_el1_cswitch binary

We added check_gcr_el1_cswitch but did not ignore the generated binary,
add it.

Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20210728173539.6231-1-broonie@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64: unnecessary end 'return;' in void functions
Jason Wang [Mon, 26 Jul 2021 12:39:40 +0000 (20:39 +0800)]
arm64: unnecessary end 'return;' in void functions

The end 'return;' in a void function is useless and verbose. It can
be removed safely.

Signed-off-by: Jason Wang <wangborong@cdjrlc.com>
Link: https://lore.kernel.org/r/20210726123940.63232-1-wangborong@cdjrlc.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64/sme: Document boot requirements for SME
Mark Brown [Tue, 20 Jul 2021 20:42:20 +0000 (21:42 +0100)]
arm64/sme: Document boot requirements for SME

Document our requirements for initialisation of the Scalable Matrix
Extension (SME) at kernel start. While we do have the ability to handle
mismatched vector lengths we will reject any late CPUs that can't support
the minimum set we determine at boot so for clarity we document a
requirement that all CPUs make the same vector length available.

Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20210720204220.22951-1-broonie@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64: use __func__ to get function name in pr_err
Jason Wang [Mon, 26 Jul 2021 12:29:07 +0000 (20:29 +0800)]
arm64: use __func__ to get function name in pr_err

Prefer using '"%s...", __func__' to get current function's name in
a debug message.

Signed-off-by: Jason Wang <wangborong@cdjrlc.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210726122907.51529-1-wangborong@cdjrlc.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64: SSBS/DIT: print SSBS and DIT bit when printing PSTATE
Lingyan Huang [Thu, 22 Jul 2021 02:20:36 +0000 (10:20 +0800)]
arm64: SSBS/DIT: print SSBS and DIT bit when printing PSTATE

The current code to print PSTATE when generating backtraces does not
include SSBS bit and DIT bit, so add this information.

Cc: Vladimir Murzin <vladimir.murzin@arm.com>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Vladimir Murzin <vladimir.murzin@arm.com>
Signed-off-by: Lingyan Huang <huanglingyan2@huawei.com>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/1626920436-54816-1-git-send-email-zhangshaokun@hisilicon.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64: cpufeature: Use defined macro instead of magic numbers
Shaokun Zhang [Fri, 16 Jul 2021 05:58:09 +0000 (13:58 +0800)]
arm64: cpufeature: Use defined macro instead of magic numbers

Use defined macro to simplify the code and make it more readable.

Cc: Marc Zyngier <maz@kernel.org>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Link: https://lore.kernel.org/r/1626415089-57584-1-git-send-email-zhangshaokun@hisilicon.com
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64/kexec: Test page size support with new TGRAN range values
Anshuman Khandual [Wed, 14 Jul 2021 04:46:15 +0000 (10:16 +0530)]
arm64/kexec: Test page size support with new TGRAN range values

The commit 26f55386f964 ("arm64/mm: Fix __enable_mmu() for new TGRAN range
values") had already switched into testing ID_AA64MMFR0_TGRAN range values.
This just changes system_supports_[4|16|64]kb_granule() helpers to perform
similar range tests as well. While here, it standardizes page size specific
supported min and max TGRAN values.

Cc: Will Deacon <will@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/1626237975-1909-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64: avoid double ISB on kernel entry
Peter Collingbourne [Tue, 27 Jul 2021 20:54:39 +0000 (13:54 -0700)]
arm64: avoid double ISB on kernel entry

Although an ISB is required in order to make the MTE-related system
register update to GCR_EL1 effective, and the same is true for
PAC-related updates to SCTLR_EL1 or APIAKey{Hi,Lo}_EL1, we issue two
ISBs on machines that support both features while we only need to
issue one. To avoid the unnecessary additional ISB, remove the ISBs
from the PAC and MTE-specific alternative blocks and add a couple
of additional blocks that cause us to only execute one ISB if both
features are supported.

Signed-off-by: Peter Collingbourne <pcc@google.com>
Link: https://linux-review.googlesource.com/id/Idee7e8114d5ae5a0b171d06220a0eb4bb015a51c
Link: https://lore.kernel.org/r/20210727205439.2557419-1-pcc@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64: mte: optimize GCR_EL1 modification on kernel entry/exit
Peter Collingbourne [Wed, 14 Jul 2021 01:36:38 +0000 (18:36 -0700)]
arm64: mte: optimize GCR_EL1 modification on kernel entry/exit

Accessing GCR_EL1 and issuing an ISB can be expensive on some
microarchitectures. Although we must write to GCR_EL1, we can
restructure the code to avoid reading from it because the new value
can be derived entirely from the exclusion mask, which is already in
a GPR. Do so.

Signed-off-by: Peter Collingbourne <pcc@google.com>
Link: https://linux-review.googlesource.com/id/I560a190a74176ca4cc5191dad08f77f6b1577c75
Link: https://lore.kernel.org/r/20210714013638.3995315-1-pcc@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoDocumentation: document the preferred tag checking mode feature
Peter Collingbourne [Tue, 27 Jul 2021 20:52:59 +0000 (13:52 -0700)]
Documentation: document the preferred tag checking mode feature

Document the functionality added in the previous patches.

Link: https://linux-review.googlesource.com/id/I48217cc3e8b8da33abc08cbaddc11cf4360a1b86
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210727205300.2554659-6-pcc@google.com
Acked-by: Will Deacon <will@kernel.org>
[catalin.marinas@arm.com: clarify that the change happens on task scheduling]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64: mte: introduce a per-CPU tag checking mode preference
Peter Collingbourne [Tue, 27 Jul 2021 20:52:58 +0000 (13:52 -0700)]
arm64: mte: introduce a per-CPU tag checking mode preference

Add a per-CPU sysfs node, mte_tcf_preferred, that allows the preferred
tag checking mode to be configured. The current possible values are
async and sync.

Link: https://linux-review.googlesource.com/id/I7493dcd533a2785a1437b16c3f6b50919f840854
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20210727205300.2554659-5-pcc@google.com
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64: move preemption disablement to prctl handlers
Peter Collingbourne [Tue, 27 Jul 2021 20:52:57 +0000 (13:52 -0700)]
arm64: move preemption disablement to prctl handlers

In the next patch, we will start reading sctlr_user from
mte_update_sctlr_user and subsequently writing a new value based on the
task's TCF setting and potentially the per-CPU TCF preference. This
means that we need to be careful to disable preemption around any
code sequences that read from sctlr_user and subsequently write to
sctlr_user and/or SCTLR_EL1, so that we don't end up writing a stale
value (based on the previous CPU's TCF preference) to either of them.

We currently have four such sequences, in the prctl handlers for
PR_SET_TAGGED_ADDR_CTRL and PR_PAC_SET_ENABLED_KEYS, as well as in
the task initialization code that resets the prctl settings. Change
the prctl handlers to disable preemption in the handlers themselves
rather than the functions that they call, and change the task
initialization code to call the respective prctl handlers instead of
setting sctlr_user directly.

As a result of this change, we no longer need the helper function
set_task_sctlr_el1, nor does its behavior make sense any more, so
remove it.

Signed-off-by: Peter Collingbourne <pcc@google.com>
Link: https://linux-review.googlesource.com/id/Ic0e8a0c00bb47d786c1e8011df0b7fe99bee4bb5
Link: https://lore.kernel.org/r/20210727205300.2554659-4-pcc@google.com
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64: mte: change ASYNC and SYNC TCF settings into bitfields
Peter Collingbourne [Tue, 27 Jul 2021 20:52:56 +0000 (13:52 -0700)]
arm64: mte: change ASYNC and SYNC TCF settings into bitfields

Allow the user program to specify both ASYNC and SYNC TCF modes by
repurposing the existing constants as bitfields. This will allow the
kernel to select one of the modes on behalf of the user program. With
this patch the kernel will always select async mode, but a subsequent
patch will make this configurable.

Link: https://linux-review.googlesource.com/id/Icc5923c85a8ea284588cc399ae74fd19ec291230
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20210727205300.2554659-3-pcc@google.com
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64: mte: rename gcr_user_excl to mte_ctrl
Peter Collingbourne [Tue, 27 Jul 2021 20:52:55 +0000 (13:52 -0700)]
arm64: mte: rename gcr_user_excl to mte_ctrl

We are going to use this field to store more data. To prepare for
that, rename it and change the users to rely on the bit position of
gcr_user_excl in mte_ctrl.

Link: https://linux-review.googlesource.com/id/Ie1fd18e480100655f5d22137f5b22f4f3a9f9e2e
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20210727205300.2554659-2-pcc@google.com
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoarm64: mte: avoid TFSRE0_EL1 related operations unless in async mode
Peter Collingbourne [Fri, 9 Jul 2021 02:35:32 +0000 (19:35 -0700)]
arm64: mte: avoid TFSRE0_EL1 related operations unless in async mode

There is no reason to touch TFSRE0_EL1 nor issue a DSB unless our task
is in asynchronous mode. Since these operations (especially the DSB) may
be expensive on certain microarchitectures, only perform them if
necessary.

Furthermore, stop clearing TFSRE0_EL1 on entry because it will be
cleared on exit and it is not necessary to have any particular value in
TFSRE0_EL1 between entry and exit.

Signed-off-by: Peter Collingbourne <pcc@google.com>
Link: https://linux-review.googlesource.com/id/Ib353a63e3d0abc2b0b008e96aa2d9692cfc1b815
Link: https://lore.kernel.org/r/20210709023532.2133673-1-pcc@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2 years agoLinux 5.14-rc3
Linus Torvalds [Sun, 25 Jul 2021 22:35:14 +0000 (15:35 -0700)]
Linux 5.14-rc3

2 years agosmpboot: fix duplicate and misplaced inlining directive
Linus Torvalds [Sun, 25 Jul 2021 18:06:37 +0000 (11:06 -0700)]
smpboot: fix duplicate and misplaced inlining directive

gcc doesn't care, but clang quite reasonably pointed out that the recent
commit e9ba16e68cce ("smpboot: Mark idle_init() as __always_inlined to
work around aggressive compiler un-inlining") did some really odd
things:

    kernel/smpboot.c:50:20: warning: duplicate 'inline' declaration specifier [-Wduplicate-decl-specifier]
    static inline void __always_inline idle_init(unsigned int cpu)
                       ^

which not only has that duplicate inlining specifier, but the new
__always_inline was put in the wrong place of the function definition.

We put the storage class specifiers (ie things like "static" and
"extern") first, and the type information after that.  And while the
compiler may not care, we put the inline specifier before the types.

So it should be just

    static __always_inline void idle_init(unsigned int cpu)

instead.

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 years agoMerge tag 'powerpc-5.14-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc...
Linus Torvalds [Sun, 25 Jul 2021 17:33:48 +0000 (10:33 -0700)]
Merge tag 'powerpc-5.14-3' of git://git./linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:

 - Fix guest to host memory corruption in H_RTAS due to missing nargs
   check.

 - Fix guest triggerable host crashes due to bad handling of nested
   guest TM state.

 - Fix possible crashes due to incorrect reference counting in
   kvm_arch_vcpu_ioctl().

 - Two commits fixing some regressions in KVM transactional memory
   handling introduced by the recent rework of the KVM code.

Thanks to Nicholas Piggin, Alexey Kardashevskiy, and Michael Neuling.

* tag 'powerpc-5.14-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  KVM: PPC: Book3S HV Nested: Sanitise H_ENTER_NESTED TM state
  KVM: PPC: Book3S: Fix H_RTAS rets buffer overflow
  KVM: PPC: Fix kvm_arch_vcpu_ioctl vcpu_load leak
  KVM: PPC: Book3S: Fix CONFIG_TRANSACTIONAL_MEM=n crash
  KVM: PPC: Book3S HV P9: Fix guest TM support

2 years agoMerge tag 'timers-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 25 Jul 2021 17:27:44 +0000 (10:27 -0700)]
Merge tag 'timers-urgent-2021-07-25' of git://git./linux/kernel/git/tip/tip

Pull timer fixes from Thomas Gleixner:
 "A small set of timer related fixes:

   - Plug a race between rearm and process tick in the posix CPU timers
     code

   - Make the optimization to avoid recalculation of the next timer
     interrupt work correctly when there are no timers pending"

* tag 'timers-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers: Fix get_next_timer_interrupt() with no timers pending
  posix-cpu-timers: Fix rearm racing against process tick

2 years agoMerge tag 'locking-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 25 Jul 2021 17:21:19 +0000 (10:21 -0700)]
Merge tag 'locking-urgent-2021-07-25' of git://git./linux/kernel/git/tip/tip

Pull x86 jump label fix from Thomas Gleixner:
 "A single fix for jump labels to prevent the compiler from agressive
  un-inlining which results in a section mismatch"

* tag 'locking-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  jump_labels: Mark __jump_label_transform() as __always_inlined to work around aggressive compiler un-inlining

2 years agoMerge tag 'efi-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 25 Jul 2021 17:04:27 +0000 (10:04 -0700)]
Merge tag 'efi-urgent-2021-07-25' of git://git./linux/kernel/git/tip/tip

Pull EFI fixes from Thomas Gleixner:
 "A set of EFI fixes:

   - Prevent memblock and I/O reserved resources to get out of sync when
     EFI memreserve is in use.

   - Don't claim a non-existing table is invalid

   - Don't warn when firmware memory is already reserved correctly"

* tag 'efi-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  efi/mokvar: Reserve the table only if it is in boot services data
  efi/libstub: Fix the efi_load_initrd function description
  firmware/efi: Tell memblock about EFI iomem reservations
  efi/tpm: Differentiate missing and invalid final event log table.

2 years agoMerge tag 'core-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 25 Jul 2021 16:52:48 +0000 (09:52 -0700)]
Merge tag 'core-urgent-2021-07-25' of git://git./linux/kernel/git/tip/tip

Pull core fix from Thomas Gleixner:
 "A single update for the boot code to prevent aggressive un-inlining
  which causes a section mismatch"

* tag 'core-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  smpboot: Mark idle_init() as __always_inlined to work around aggressive compiler un-inlining

2 years agoMerge tag 'dma-mapping-5.14-1' of git://git.infradead.org/users/hch/dma-mapping
Linus Torvalds [Sun, 25 Jul 2021 16:46:17 +0000 (09:46 -0700)]
Merge tag 'dma-mapping-5.14-1' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fix from Christoph Hellwig:

 - handle vmalloc addresses in dma_common_{mmap,get_sgtable} (Roman
   Skakun)

* tag 'dma-mapping-5.14-1' of git://git.infradead.org/users/hch/dma-mapping:
  dma-mapping: handle vmalloc addresses in dma_common_{mmap,get_sgtable}

2 years agoMerge tag '5.14-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Linus Torvalds [Sun, 25 Jul 2021 00:26:47 +0000 (17:26 -0700)]
Merge tag '5.14-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Five cifs/smb3 fixes, including a DFS failover fix, two fallocate
  fixes, and two trivial coverity cleanups"

* tag '5.14-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: fix fallocate when trying to allocate a hole.
  CIFS: Clarify SMB1 code for POSIX delete file
  CIFS: Clarify SMB1 code for POSIX Create
  cifs: support share failover when remounting
  cifs: only write 64kb at a time when fallocating a small region of a file

2 years agoMerge tag 'riscv-for-linus-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 24 Jul 2021 22:34:04 +0000 (15:34 -0700)]
Merge tag 'riscv-for-linus-5.14-rc3' of git://git./linux/kernel/git/riscv/linux

Pull RISC-V fixes from Palmer Dabbelt:

 - properly set the memory size, which fixes 32-bit systems

 - allow initrd to load anywhere in memory, rather that restricting it
   to the first 256MiB

 - fix the 'mem=' parameter on 64-bit systems to properly account for
   the maximum supported memory now that the kernel is outside the
   linear map

 - avoid installing mappings into the last 4KiB of memory, which
   conflicts with error values

 - avoid the stack from being freed while it is being walked

 - a handful of fixes to the new copy to/from user routines

* tag 'riscv-for-linus-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: __asm_copy_to-from_user: Fix: Typos in comments
  riscv: __asm_copy_to-from_user: Remove unnecessary size check
  riscv: __asm_copy_to-from_user: Fix: fail on RV32
  riscv: __asm_copy_to-from_user: Fix: overrun copy
  riscv: stacktrace: pin the task's stack in get_wchan
  riscv: Make sure the kernel mapping does not overlap with IS_ERR_VALUE
  riscv: Make sure the linear mapping does not use the kernel mapping
  riscv: Fix memory_limit for 64-bit kernel
  RISC-V: load initrd wherever it fits into memory
  riscv: Fix 32-bit RISC-V boot failure

2 years agoACPI: fix NULL pointer dereference
Linus Torvalds [Sat, 24 Jul 2021 22:25:54 +0000 (15:25 -0700)]
ACPI: fix NULL pointer dereference

Commit 71f642833284 ("ACPI: utils: Fix reference counting in
for_each_acpi_dev_match()") started doing "acpi_dev_put()" on a pointer
that was possibly NULL.  That fails miserably, because that helper
inline function is not set up to handle that case.

Just make acpi_dev_put() silently accept a NULL pointer, rather than
calling down to put_device() with an invalid offset off that NULL
pointer.

Link: https://lore.kernel.org/lkml/a607c149-6bf6-0fd0-0e31-100378504da2@kernel.dk/
Reported-and-tested-by: Jens Axboe <axboe@kernel.dk>
Tested-by: Daniel Scally <djrscally@gmail.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 years agoMerge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Linus Torvalds [Sat, 24 Jul 2021 20:08:31 +0000 (13:08 -0700)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "Four fixes, all in drivers, all of which can lead to user visible
  problems in certain situations"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: target: Fix NULL dereference on XCOPY completion
  scsi: mpt3sas: Transition IOC to Ready state during shutdown
  scsi: target: Fix protect handling in WRITE SAME(32)
  scsi: iscsi: Fix iface sysfs attr detection

2 years agoMerge tag 'io_uring-5.14-2021-07-24' of git://git.kernel.dk/linux-block
Linus Torvalds [Sat, 24 Jul 2021 20:03:40 +0000 (13:03 -0700)]
Merge tag 'io_uring-5.14-2021-07-24' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:

 - Fix a memory leak due to a race condition in io_init_wq_offload
   (Yang)

 - Poll error handling fixes (Pavel)

 - Fix early fdput() regression (me)

 - Don't reissue iopoll requests off release path (me)

 - Add a safety check for io-wq queue off wrong path (me)

* tag 'io_uring-5.14-2021-07-24' of git://git.kernel.dk/linux-block:
  io_uring: explicitly catch any illegal async queue attempt
  io_uring: never attempt iopoll reissue from release path
  io_uring: fix early fdput() of file
  io_uring: fix memleak in io_init_wq_offload()
  io_uring: remove double poll entry on arm failure
  io_uring: explicitly count entries for poll reqs

2 years agoMerge tag 'block-5.14-2021-07-24' of git://git.kernel.dk/linux-block
Linus Torvalds [Sat, 24 Jul 2021 19:57:06 +0000 (12:57 -0700)]
Merge tag 'block-5.14-2021-07-24' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - NVMe pull request (Christoph):
    - tracing fix (Keith Busch)
    - fix multipath head refcounting (Hannes Reinecke)
    - Write Zeroes vs PI fix (me)
    - drop a bogus WARN_ON (Zhihao Cheng)

 - Increase max blk-cgroup policy size, now that mq-deadline
   uses it too (Oleksandr)

* tag 'block-5.14-2021-07-24' of git://git.kernel.dk/linux-block:
  nvme: set the PRACT bit when using Write Zeroes with T10 PI
  nvme: fix nvme_setup_command metadata trace event
  nvme: fix refcounting imbalance when all paths are down
  nvme-pci: don't WARN_ON in nvme_reset_work if ctrl.state is not RESETTING
  block: increase BLKCG_MAX_POLS

2 years agoMerge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa...
Linus Torvalds [Sat, 24 Jul 2021 19:55:06 +0000 (12:55 -0700)]
Merge branch 'i2c/for-current' of git://git./linux/kernel/git/wsa/linux

Pull i2c fixes from Wolfram Sang:
 "Two bugfixes for the I2C subsystem"

* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: mpc: Poll for MCF
  misc: eeprom: at24: Always append device id even if label property is set.

2 years agoMerge branch 'akpm' (patches from Andrew)
Linus Torvalds [Sat, 24 Jul 2021 19:27:16 +0000 (12:27 -0700)]
Merge branch 'akpm' (patches from Andrew)

Merge misc mm fixes from Andrew Morton:
 "15 patches.

  VM subsystems affected by this patch series: userfaultfd, kfence,
  highmem, pagealloc, memblock, pagecache, secretmem, pagemap, and
  hugetlbfs"

* akpm:
  hugetlbfs: fix mount mode command line processing
  mm: fix the deadlock in finish_fault()
  mm: mmap_lock: fix disabling preemption directly
  mm/secretmem: wire up ->set_page_dirty
  writeback, cgroup: do not reparent dax inodes
  writeback, cgroup: remove wb from offline list before releasing refcnt
  memblock: make for_each_mem_range() traverse MEMBLOCK_HOTPLUG regions
  mm: page_alloc: fix page_poison=1 / INIT_ON_ALLOC_DEFAULT_ON interaction
  mm: use kmap_local_page in memzero_page
  mm: call flush_dcache_page() in memcpy_to_page() and memzero_page()
  kfence: skip all GFP_ZONEMASK allocations
  kfence: move the size check to the beginning of __kfence_alloc()
  kfence: defer kfence_test_init to ensure that kunit debugfs is created
  selftest: use mmap instead of posix_memalign to allocate memory
  userfaultfd: do not untag user pointers

2 years agoriscv: __asm_copy_to-from_user: Fix: Typos in comments
Akira Tsukamoto [Tue, 20 Jul 2021 08:53:23 +0000 (17:53 +0900)]
riscv: __asm_copy_to-from_user: Fix: Typos in comments

Fixing typos and grammar mistakes and using more intuitive label
name.

Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com>
Fixes: ca6eaaa210de ("riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall")
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2 years agoriscv: __asm_copy_to-from_user: Remove unnecessary size check
Akira Tsukamoto [Tue, 20 Jul 2021 08:52:36 +0000 (17:52 +0900)]
riscv: __asm_copy_to-from_user: Remove unnecessary size check

Clean up:

The size of 0 will be evaluated in the next step. Not
required here.

Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com>
Fixes: ca6eaaa210de ("riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall")
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2 years agoriscv: __asm_copy_to-from_user: Fix: fail on RV32
Akira Tsukamoto [Tue, 20 Jul 2021 08:51:45 +0000 (17:51 +0900)]
riscv: __asm_copy_to-from_user: Fix: fail on RV32

Had a bug when converting bytes to bits when the cpu was rv32.

The a3 contains the number of bytes and multiple of 8
would be the bits. The LGREG is holding 2 for RV32 and 3 for
RV32, so to achieve multiple of 8 it must always be constant 3.
The 2 was mistakenly used for rv32.

Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com>
Fixes: ca6eaaa210de ("riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall")
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2 years agoriscv: __asm_copy_to-from_user: Fix: overrun copy
Akira Tsukamoto [Tue, 20 Jul 2021 08:50:52 +0000 (17:50 +0900)]
riscv: __asm_copy_to-from_user: Fix: overrun copy

There were two causes for the overrun memory access.

The threshold size was too small.
The aligning dst require one SZREG and unrolling word copy requires
8*SZREG, total have to be at least 9*SZREG.

Inside the unrolling copy, the subtracting -(8*SZREG-1) would make
iteration happening one extra loop. Proper value is -(8*SZREG).

Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com>
Fixes: ca6eaaa210de ("riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall")
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2 years agohugetlbfs: fix mount mode command line processing
Mike Kravetz [Fri, 23 Jul 2021 22:50:44 +0000 (15:50 -0700)]
hugetlbfs: fix mount mode command line processing

In commit 32021982a324 ("hugetlbfs: Convert to fs_context") processing
of the mount mode string was changed from match_octal() to fsparam_u32.

This changed existing behavior as match_octal does not require octal
values to have a '0' prefix, but fsparam_u32 does.

Use fsparam_u32oct which provides the same behavior as match_octal.

Link: https://lkml.kernel.org/r/20210721183326.102716-1-mike.kravetz@oracle.com
Fixes: 32021982a324 ("hugetlbfs: Convert to fs_context")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Dennis Camera <bugs+kernel.org@dtnr.ch>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 years agomm: fix the deadlock in finish_fault()
Qi Zheng [Fri, 23 Jul 2021 22:50:41 +0000 (15:50 -0700)]
mm: fix the deadlock in finish_fault()

Commit 63f3655f9501 ("mm, memcg: fix reclaim deadlock with writeback")
fix the following ABBA deadlock by pre-allocating the pte page table
without holding the page lock.

                                lock_page(A)
                                        SetPageWriteback(A)
                                        unlock_page(A)
  lock_page(B)
                                        lock_page(B)
  pte_alloc_one
    shrink_page_list
      wait_on_page_writeback(A)
                                        SetPageWriteback(B)
                                        unlock_page(B)

                                        # flush A, B to clear the writeback

Commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault()
codepaths") reworked the relevant code but ignored this race.  This will
cause the deadlock above to appear again, so fix it.

Link: https://lkml.kernel.org/r/20210721074849.57004-1-zhengqi.arch@bytedance.com
Fixes: f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths")
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 years agomm: mmap_lock: fix disabling preemption directly
Muchun Song [Fri, 23 Jul 2021 22:50:38 +0000 (15:50 -0700)]
mm: mmap_lock: fix disabling preemption directly

Commit 832b50725373 ("mm: mmap_lock: use local locks instead of
disabling preemption") fixed a bug by using local locks.

But commit d01079f3d0c0 ("mm/mmap_lock: remove dead code for
!CONFIG_TRACING configurations") changed those lines back to the
original version.

I guess it was introduced by fixing conflicts.

Link: https://lkml.kernel.org/r/20210720074228.76342-1-songmuchun@bytedance.com
Fixes: d01079f3d0c0 ("mm/mmap_lock: remove dead code for !CONFIG_TRACING configurations")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 years agomm/secretmem: wire up ->set_page_dirty
Mike Rapoport [Fri, 23 Jul 2021 22:50:35 +0000 (15:50 -0700)]
mm/secretmem: wire up ->set_page_dirty

Make secretmem up to date with the changes done in commit 0af573780b0b
("mm: require ->set_page_dirty to be explicitly wired up") so that
unconditional call to this method won't cause crashes.

Link: https://lkml.kernel.org/r/20210716063933.31633-1-rppt@kernel.org
Fixes: 0af573780b0b ("mm: require ->set_page_dirty to be explicitly wired up")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 years agowriteback, cgroup: do not reparent dax inodes
Roman Gushchin [Fri, 23 Jul 2021 22:50:32 +0000 (15:50 -0700)]
writeback, cgroup: do not reparent dax inodes

The inode switching code is not suited for dax inodes.  An attempt to
switch a dax inode to a parent writeback structure (as a part of a
writeback cleanup procedure) results in a panic like this:

  run fstests generic/270 at 2021-07-15 05:54:02
  XFS (pmem0p2): EXPERIMENTAL big timestamp feature in use.  Use at your own risk!
  XFS (pmem0p2): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
  XFS (pmem0p2): EXPERIMENTAL inode btree counters feature in use. Use at your own risk!
  XFS (pmem0p2): Mounting V5 Filesystem
  XFS (pmem0p2): Ending clean mount
  XFS (pmem0p2): Quotacheck needed: Please wait.
  XFS (pmem0p2): Quotacheck: Done.
  XFS (pmem0p2): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
  XFS (pmem0p2): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
  XFS (pmem0p2): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
  BUG: unable to handle page fault for address: 0000000005b0f669
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP PTI
  CPU: 13 PID: 10479 Comm: kworker/13:16 Not tainted 5.14.0-rc1-master-8096acd7442e+ #8
  Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 09/13/2016
  Workqueue: inode_switch_wbs inode_switch_wbs_work_fn
  RIP: 0010:inode_do_switch_wbs+0xaf/0x470
  Code: 00 30 0f 85 c1 03 00 00 0f 1f 44 00 00 31 d2 48 c7 c6 ff ff ff ff 48 8d 7c 24 08 e8 eb 49 1a 00 48 85 c0 74 4a bb ff ff ff ff <48> 8b 50 08 48 8d 4a ff 83 e2 01 48 0f 45 c1 48 8b 00 a8 08 0f 85
  RSP: 0018:ffff9c66691abdc8 EFLAGS: 00010002
  RAX: 0000000005b0f661 RBX: 00000000ffffffff RCX: ffff89e6a21382b0
  RDX: 0000000000000001 RSI: ffff89e350230248 RDI: ffffffffffffffff
  RBP: ffff89e681d19400 R08: 0000000000000000 R09: 0000000000000228
  R10: ffffffffffffffff R11: ffffffffffffffc0 R12: ffff89e6a2138130
  R13: ffff89e316af7400 R14: ffff89e316af6e78 R15: ffff89e6a21382b0
  FS:  0000000000000000(0000) GS:ffff89ee5fb40000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000005b0f669 CR3: 0000000cb2410004 CR4: 00000000001706e0
  Call Trace:
   inode_switch_wbs_work_fn+0xb6/0x2a0
   process_one_work+0x1e6/0x380
   worker_thread+0x53/0x3d0
   kthread+0x10f/0x130
   ret_from_fork+0x22/0x30
  Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_counter nf_tables nfnetlink bridge stp llc rfkill sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif kvm mgag200 i2c_algo_bit iTCO_wdt irqbypass drm_kms_helper iTCO_vendor_support acpi_ipmi rapl syscopyarea sysfillrect intel_cstate ipmi_si sysimgblt ioatdma dax_pmem_compat fb_sys_fops ipmi_devintf device_dax i2c_i801 pcspkr intel_uncore hpilo nd_pmem cec dax_pmem_core dca i2c_smbus acpi_tad lpc_ich ipmi_msghandler acpi_power_meter drm fuse xfs libcrc32c sd_mod t10_pi crct10dif_pclmul crc32_pclmul crc32c_intel tg3 ghash_clmulni_intel serio_raw hpsa hpwdt scsi_transport_sas wmi dm_mirror dm_region_hash dm_log dm_mod
  CR2: 0000000005b0f669
  ---[ end trace ed2105faff8384f3 ]---
  RIP: 0010:inode_do_switch_wbs+0xaf/0x470
  Code: 00 30 0f 85 c1 03 00 00 0f 1f 44 00 00 31 d2 48 c7 c6 ff ff ff ff 48 8d 7c 24 08 e8 eb 49 1a 00 48 85 c0 74 4a bb ff ff ff ff <48> 8b 50 08 48 8d 4a ff 83 e2 01 48 0f 45 c1 48 8b 00 a8 08 0f 85
  RSP: 0018:ffff9c66691abdc8 EFLAGS: 00010002
  RAX: 0000000005b0f661 RBX: 00000000ffffffff RCX: ffff89e6a21382b0
  RDX: 0000000000000001 RSI: ffff89e350230248 RDI: ffffffffffffffff
  RBP: ffff89e681d19400 R08: 0000000000000000 R09: 0000000000000228
  R10: ffffffffffffffff R11: ffffffffffffffc0 R12: ffff89e6a2138130
  R13: ffff89e316af7400 R14: ffff89e316af6e78 R15: ffff89e6a21382b0
  FS:  0000000000000000(0000) GS:ffff89ee5fb40000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000005b0f669 CR3: 0000000cb2410004 CR4: 00000000001706e0
  Kernel panic - not syncing: Fatal exception
  Kernel Offset: 0x15200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
  ---[ end Kernel panic - not syncing: Fatal exception ]---

The crash happens on an attempt to iterate over attached pagecache pages
and check the dirty flag: a dax inode's xarray contains pfn's instead of
generic struct page pointers.

This happens for DAX and not for other kinds of non-page entries in the
inodes because it's a tagged iteration, and shadow/swap entries are
never tagged; only DAX entries get tagged.

Fix the problem by bailing out (with the false return value) of
inode_prepare_sbs_switch() if a dax inode is passed.

[willy@infradead.org: changelog addition]

Link: https://lkml.kernel.org/r/20210719171350.3876830-1-guro@fb.com
Fixes: c22d70a162d3 ("writeback, cgroup: release dying cgwbs by switching attached inodes")
Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
Reported-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Murphy Zhou <jencce.kernel@gmail.com>
Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 years agowriteback, cgroup: remove wb from offline list before releasing refcnt
Roman Gushchin [Fri, 23 Jul 2021 22:50:29 +0000 (15:50 -0700)]
writeback, cgroup: remove wb from offline list before releasing refcnt

Boyang reported that the commit c22d70a162d3 ("writeback, cgroup:
release dying cgwbs by switching attached inodes") causes the kernel to
crash while running xfstests generic/256 on ext4 on aarch64 and ppc64le.

  run fstests generic/256 at 2021-07-12 05:41:40
  EXT4-fs (vda3): mounted filesystem with ordered data mode. Opts: . Quota mode: none.
  Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
  Mem abort info:
     ESR = 0x96000005
     EC = 0x25: DABT (current EL), IL = 32 bits
     SET = 0, FnV = 0
     EA = 0, S1PTW = 0
     FSC = 0x05: level 1 translation fault
  Data abort info:
     ISV = 0, ISS = 0x00000005
     CM = 0, WnR = 0
  user pgtable: 64k pages, 48-bit VAs, pgdp=00000000b0502000
  [0000000000000000] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
  Internal error: Oops: 96000005 [#1] SMP
  Modules linked in: dm_flakey dm_snapshot dm_bufio dm_zero dm_mod loop tls rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs rfkill sunrpc ext4 vfat fat mbcache jbd2 drm fuse xfs libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce virtio_blk virtio_net net_failover virtio_console failover virtio_mmio aes_neon_bs [last unloaded: scsi_debug]
  CPU: 0 PID: 408468 Comm: kworker/u8:5 Tainted: G X --------- ---  5.14.0-0.rc1.15.bx.el9.aarch64 #1
  Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
  Workqueue: events_unbound cleanup_offline_cgwbs_workfn
  pstate: 004000c5 (nzcv daIF +PAN -UAO -TCO BTYPE=--)
  pc : cleanup_offline_cgwbs_workfn+0x320/0x394
  lr : cleanup_offline_cgwbs_workfn+0xe0/0x394
  sp : ffff80001554fd10
  x29: ffff80001554fd10 x28: 0000000000000000 x27: 0000000000000001
  x26: 0000000000000000 x25: 00000000000000e0 x24: ffffd2a2fbe671a8
  x23: ffff80001554fd88 x22: ffffd2a2fbe67198 x21: ffffd2a2fc25a730
  x20: ffff210412bc3000 x19: ffff210412bc3280 x18: 0000000000000000
  x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
  x14: 0000000000000000 x13: 0000000000000030 x12: 0000000000000040
  x11: ffff210481572238 x10: ffff21048157223a x9 : ffffd2a2fa276c60
  x8 : ffff210484106b60 x7 : 0000000000000000 x6 : 000000000007d18a
  x5 : ffff210416a86400 x4 : ffff210412bc0280 x3 : 0000000000000000
  x2 : ffff80001554fd88 x1 : ffff210412bc0280 x0 : 0000000000000003
  Call trace:
     cleanup_offline_cgwbs_workfn+0x320/0x394
     process_one_work+0x1f4/0x4b0
     worker_thread+0x184/0x540
     kthread+0x114/0x120
     ret_from_fork+0x10/0x18
  Code: d63f0020 97f99963 17ffffa6 f8588263 (f9400061)
  ---[ end trace e250fe289272792a ]---
  Kernel panic - not syncing: Oops: Fatal exception
  SMP: stopping secondary CPUs
  SMP: failed to stop secondary CPUs 0-2
  Kernel Offset: 0x52a2e9fa0000 from 0xffff800010000000
  PHYS_OFFSET: 0xfff0defca0000000
  CPU features: 0x00200251,23200840
  Memory Limit: none
  ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---

The problem happens when cgwb_release_workfn() races with
cleanup_offline_cgwbs_workfn(): wb_tryget() in
cleanup_offline_cgwbs_workfn() can be called after percpu_ref_exit() is
cgwb_release_workfn(), which is basically a use-after-free error.

Fix the problem by making removing the writeback structure from the
offline list before releasing the percpu reference counter.  It will
guarantee that cleanup_offline_cgwbs_workfn() will not see and not
access writeback structures which are about to be released.

Link: https://lkml.kernel.org/r/20210716201039.3762203-1-guro@fb.com
Fixes: c22d70a162d3 ("writeback, cgroup: release dying cgwbs by switching attached inodes")
Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Boyang Xue <bxue@redhat.com>
Suggested-by: Jan Kara <jack@suse.cz>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Murphy Zhou <jencce.kernel@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 years agomemblock: make for_each_mem_range() traverse MEMBLOCK_HOTPLUG regions
Mike Rapoport [Fri, 23 Jul 2021 22:50:26 +0000 (15:50 -0700)]
memblock: make for_each_mem_range() traverse MEMBLOCK_HOTPLUG regions

Commit b10d6bca8720 ("arch, drivers: replace for_each_membock() with
for_each_mem_range()") didn't take into account that when there is
movable_node parameter in the kernel command line, for_each_mem_range()
would skip ranges marked with MEMBLOCK_HOTPLUG.

The page table setup code in POWER uses for_each_mem_range() to create
the linear mapping of the physical memory and since the regions marked
as MEMORY_HOTPLUG are skipped, they never make it to the linear map.

A later access to the memory in those ranges will fail:

  BUG: Unable to handle kernel data access on write at 0xc000000400000000
  Faulting instruction address: 0xc00000000008a3c0
  Oops: Kernel access of bad area, sig: 11 [#1]
  LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
  Modules linked in:
  CPU: 0 PID: 53 Comm: kworker/u2:0 Not tainted 5.13.0 #7
  NIP:  c00000000008a3c0 LR: c0000000003c1ed8 CTR: 0000000000000040
  REGS: c000000008a57770 TRAP: 0300   Not tainted  (5.13.0)
  MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 84222202  XER: 20040000
  CFAR: c0000000003c1ed4 DAR: c000000400000000 DSISR: 42000000 IRQMASK: 0
  GPR00: c0000000003c1ed8 c000000008a57a10 c0000000019da700 c000000400000000
  GPR04: 0000000000000280 0000000000000180 0000000000000400 0000000000000200
  GPR08: 0000000000000100 0000000000000080 0000000000000040 0000000000000300
  GPR12: 0000000000000380 c000000001bc0000 c0000000001660c8 c000000006337e00
  GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  GPR20: 0000000040000000 0000000020000000 c000000001a81990 c000000008c30000
  GPR24: c000000008c20000 c000000001a81998 000fffffffff0000 c000000001a819a0
  GPR28: c000000001a81908 c00c000001000000 c000000008c40000 c000000008a64680
  NIP clear_user_page+0x50/0x80
  LR __handle_mm_fault+0xc88/0x1910
  Call Trace:
    __handle_mm_fault+0xc44/0x1910 (unreliable)
    handle_mm_fault+0x130/0x2a0
    __get_user_pages+0x248/0x610
    __get_user_pages_remote+0x12c/0x3e0
    get_arg_page+0x54/0xf0
    copy_string_kernel+0x11c/0x210
    kernel_execve+0x16c/0x220
    call_usermodehelper_exec_async+0x1b0/0x2f0
    ret_from_kernel_thread+0x5c/0x70
  Instruction dump:
  79280fa4 79271764 79261f24 794ae8e2 7ca94214 7d683a14 7c893a14 7d893050
  7d4903a6 60000000 60000000 60000000 <7c001fec7c091fec 7c081fec 7c051fec
  ---[ end trace 490b8c67e6075e09 ]---

Making for_each_mem_range() include MEMBLOCK_HOTPLUG regions in the
traversal fixes this issue.

Link: https://bugzilla.redhat.com/show_bug.cgi?id=1976100
Link: https://lkml.kernel.org/r/20210712071132.20902-1-rppt@kernel.org
Fixes: b10d6bca8720 ("arch, drivers: replace for_each_membock() with for_each_mem_range()")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Tested-by: Greg Kurz <groug@kaod.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org> [5.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 years agomm: page_alloc: fix page_poison=1 / INIT_ON_ALLOC_DEFAULT_ON interaction
Sergei Trofimovich [Fri, 23 Jul 2021 22:50:23 +0000 (15:50 -0700)]
mm: page_alloc: fix page_poison=1 / INIT_ON_ALLOC_DEFAULT_ON interaction

To reproduce the failure we need the following system:

 - kernel command: page_poison=1 init_on_free=0 init_on_alloc=0

 - kernel config:
    * CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y
    * CONFIG_INIT_ON_FREE_DEFAULT_ON=y
    * CONFIG_PAGE_POISONING=y

Resulting in:

    0000000085629bdd: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    0000000022861832: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00000000c597f5b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    CPU: 11 PID: 15195 Comm: bash Kdump: loaded Tainted: G     U     O      5.13.1-gentoo-x86_64 #1
    Hardware name: System manufacturer System Product Name/PRIME Z370-A, BIOS 2801 01/13/2021
    Call Trace:
     dump_stack+0x64/0x7c
     __kernel_unpoison_pages.cold+0x48/0x84
     post_alloc_hook+0x60/0xa0
     get_page_from_freelist+0xdb8/0x1000
     __alloc_pages+0x163/0x2b0
     __get_free_pages+0xc/0x30
     pgd_alloc+0x2e/0x1a0
     mm_init+0x185/0x270
     dup_mm+0x6b/0x4f0
     copy_process+0x190d/0x1b10
     kernel_clone+0xba/0x3b0
     __do_sys_clone+0x8f/0xb0
     do_syscall_64+0x68/0x80
     entry_SYSCALL_64_after_hwframe+0x44/0xae

Before commit 51cba1ebc60d ("init_on_alloc: Optimize static branches")
init_on_alloc never enabled static branch by default.  It could only be
enabed explicitly by init_mem_debugging_and_hardening().

But after commit 51cba1ebc60d, a static branch could already be enabled
by default.  There was no code to ever disable it.  That caused
page_poison=1 / init_on_free=1 conflict.

This change extends init_mem_debugging_and_hardening() to also disable
static branch disabling.

Link: https://lkml.kernel.org/r/20210714031935.4094114-1-keescook@chromium.org
Link: https://lore.kernel.org/r/20210712215816.1512739-1-slyfox@gentoo.org
Fixes: 51cba1ebc60d ("init_on_alloc: Optimize static branches")
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Co-developed-by: Kees Cook <keescook@chromium.org>
Reported-by: Mikhail Morfikov <mmorfikov@gmail.com>
Reported-by: <bowsingbetee@pm.me>
Tested-by: <bowsingbetee@protonmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 years agomm: use kmap_local_page in memzero_page
Christoph Hellwig [Fri, 23 Jul 2021 22:50:20 +0000 (15:50 -0700)]
mm: use kmap_local_page in memzero_page

The commit message introducing the global memzero_page explicitly
mentions switching to kmap_local_page in the commit log but doesn't
actually do that.

Link: https://lkml.kernel.org/r/20210713055231.137602-3-hch@lst.de
Fixes: 28961998f858 ("iov_iter: lift memzero_page() to highmem.h")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 years agomm: call flush_dcache_page() in memcpy_to_page() and memzero_page()
Christoph Hellwig [Fri, 23 Jul 2021 22:50:17 +0000 (15:50 -0700)]
mm: call flush_dcache_page() in memcpy_to_page() and memzero_page()

memcpy_to_page and memzero_page can write to arbitrary pages, which
could be in the page cache or in high memory, so call
flush_kernel_dcache_pages to flush the dcache.

This is a problem when using these helpers on dcache challeneged
architectures.  Right now there are just a few users, chances are no one
used the PC floppy driver, the aha1542 driver for an ISA SCSI HBA, and a
few advanced and optional btrfs and ext4 features on those platforms yet
since the conversion.

Link: https://lkml.kernel.org/r/20210713055231.137602-2-hch@lst.de
Fixes: bb90d4bc7b6a ("mm/highmem: Lift memcpy_[to|from]_page to core")
Fixes: 28961998f858 ("iov_iter: lift memzero_page() to highmem.h")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 years agokfence: skip all GFP_ZONEMASK allocations
Alexander Potapenko [Fri, 23 Jul 2021 22:50:14 +0000 (15:50 -0700)]
kfence: skip all GFP_ZONEMASK allocations

Allocation requests outside ZONE_NORMAL (MOVABLE, HIGHMEM or DMA) cannot
be fulfilled by KFENCE, because KFENCE memory pool is located in a zone
different from the requested one.

Because callers of kmem_cache_alloc() may actually rely on the
allocation to reside in the requested zone (e.g.  memory allocations
done with __GFP_DMA must be DMAable), skip all allocations done with
GFP_ZONEMASK and/or respective SLAB flags (SLAB_CACHE_DMA and
SLAB_CACHE_DMA32).

Link: https://lkml.kernel.org/r/20210714092222.1890268-2-glider@google.com
Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure")
Signed-off-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Acked-by: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: <stable@vger.kernel.org> [5.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 years agokfence: move the size check to the beginning of __kfence_alloc()
Alexander Potapenko [Fri, 23 Jul 2021 22:50:11 +0000 (15:50 -0700)]
kfence: move the size check to the beginning of __kfence_alloc()

Check the allocation size before toggling kfence_allocation_gate.

This way allocations that can't be served by KFENCE will not result in
waiting for another CONFIG_KFENCE_SAMPLE_INTERVAL without allocating
anything.

Link: https://lkml.kernel.org/r/20210714092222.1890268-1-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Suggested-by: Marco Elver <elver@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org> [5.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 years agokfence: defer kfence_test_init to ensure that kunit debugfs is created
Weizhao Ouyang [Fri, 23 Jul 2021 22:50:08 +0000 (15:50 -0700)]
kfence: defer kfence_test_init to ensure that kunit debugfs is created

kfence_test_init and kunit_init both use the same level late_initcall,
which means if kfence_test_init linked ahead of kunit_init,
kfence_test_init will get a NULL debugfs_rootdir as parent dentry, then
kfence_test_init and kfence_debugfs_init both create a debugfs node
named "kfence" under debugfs_mount->mnt_root, and it will throw out
"debugfs: Directory 'kfence' with parent '/' already present!" with
EEXIST.  So kfence_test_init should be deferred.

Link: https://lkml.kernel.org/r/20210714113140.2949995-1-o451686892@gmail.com
Signed-off-by: Weizhao Ouyang <o451686892@gmail.com>
Tested-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 years agoselftest: use mmap instead of posix_memalign to allocate memory
Peter Collingbourne [Fri, 23 Jul 2021 22:50:04 +0000 (15:50 -0700)]
selftest: use mmap instead of posix_memalign to allocate memory

This test passes pointers obtained from anon_allocate_area to the
userfaultfd and mremap APIs.  This causes a problem if the system
allocator returns tagged pointers because with the tagged address ABI
the kernel rejects tagged addresses passed to these APIs, which would
end up causing the test to fail.  To make this test compatible with such
system allocators, stop using the system allocator to allocate memory in
anon_allocate_area, and instead just use mmap.

Link: https://lkml.kernel.org/r/20210714195437.118982-3-pcc@google.com
Link: https://linux-review.googlesource.com/id/Icac91064fcd923f77a83e8e133f8631c5b8fc241
Fixes: c47174fc362a ("userfaultfd: selftest")
Co-developed-by: Lokesh Gidra <lokeshgidra@google.com>
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Dave Martin <Dave.Martin@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Alistair Delva <adelva@google.com>
Cc: William McVicker <willmcvicker@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Mitch Phillips <mitchp@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: <stable@vger.kernel.org> [5.4]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 years agouserfaultfd: do not untag user pointers
Peter Collingbourne [Fri, 23 Jul 2021 22:50:01 +0000 (15:50 -0700)]
userfaultfd: do not untag user pointers

Patch series "userfaultfd: do not untag user pointers", v5.

If a user program uses userfaultfd on ranges of heap memory, it may end
up passing a tagged pointer to the kernel in the range.start field of
the UFFDIO_REGISTER ioctl.  This can happen when using an MTE-capable
allocator, or on Android if using the Tagged Pointers feature for MTE
readiness [1].

When a fault subsequently occurs, the tag is stripped from the fault
address returned to the application in the fault.address field of struct
uffd_msg.  However, from the application's perspective, the tagged
address *is* the memory address, so if the application is unaware of
memory tags, it may get confused by receiving an address that is, from
its point of view, outside of the bounds of the allocation.  We observed
this behavior in the kselftest for userfaultfd [2] but other
applications could have the same problem.

Address this by not untagging pointers passed to the userfaultfd ioctls.
Instead, let the system call fail.  Also change the kselftest to use
mmap so that it doesn't encounter this problem.

[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c

This patch (of 2):

Do not untag pointers passed to the userfaultfd ioctls.  Instead, let
the system call fail.  This will provide an early indication of problems
with tag-unaware userspace code instead of letting the code get confused
later, and is consistent with how we decided to handle brk/mmap/mremap
in commit dcde237319e6 ("mm: Avoid creating virtual address aliases in
brk()/mmap()/mremap()"), as well as being consistent with the existing
tagged address ABI documentation relating to how ioctl arguments are
handled.

The code change is a revert of commit 7d0325749a6c ("userfaultfd: untag
user pointers") plus some fixups to some additional calls to
validate_range that have appeared since then.

[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c

Link: https://lkml.kernel.org/r/20210714195437.118982-1-pcc@google.com
Link: https://lkml.kernel.org/r/20210714195437.118982-2-pcc@google.com
Link: https://linux-review.googlesource.com/id/I761aa9f0344454c482b83fcfcce547db0a25501b
Fixes: 63f0c6037965 ("arm64: Introduce prctl() options to control the tagged user addresses ABI")
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alistair Delva <adelva@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Martin <Dave.Martin@arm.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mitch Phillips <mitchp@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: William McVicker <willmcvicker@google.com>
Cc: <stable@vger.kernel.org> [5.4]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 years agoriscv: stacktrace: pin the task's stack in get_wchan
Jisheng Zhang [Fri, 23 Jul 2021 00:22:26 +0000 (08:22 +0800)]
riscv: stacktrace: pin the task's stack in get_wchan

Pin the task's stack before calling walk_stackframe() in get_wchan().
This can fix the panic as reported by Andreas when CONFIG_VMAP_STACK=y:

[   65.609696] Unable to handle kernel paging request at virtual address ffffffd0003bbde8
[   65.610460] Oops [#1]
[   65.610626] Modules linked in: virtio_blk virtio_mmio rtc_goldfish btrfs blake2b_generic libcrc32c xor raid6_pq sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs
[   65.611670] CPU: 2 PID: 1 Comm: systemd Not tainted 5.14.0-rc1-1.g34fe32a-default #1 openSUSE Tumbleweed (unreleased) c62f7109153e5a0897ee58ba52393ad99b070fd2
[   65.612334] Hardware name: riscv-virtio,qemu (DT)
[   65.613008] epc : get_wchan+0x5c/0x88
[   65.613334]  ra : get_wchan+0x42/0x88
[   65.613625] epc : ffffffff800048a4 ra : ffffffff8000488a sp : ffffffd00021bb90
[   65.614008]  gp : ffffffff817709f8 tp : ffffffe07fe91b80 t0 : 00000000000001f8
[   65.614411]  t1 : 0000000000020000 t2 : 0000000000000000 s0 : ffffffd00021bbd0
[   65.614818]  s1 : ffffffd0003bbdf0 a0 : 0000000000000001 a1 : 0000000000000002
[   65.615237]  a2 : ffffffff81618008 a3 : 0000000000000000 a4 : 0000000000000000
[   65.615637]  a5 : ffffffd0003bc000 a6 : 0000000000000002 a7 : ffffffe27d370000
[   65.616022]  s2 : ffffffd0003bbd90 s3 : ffffffff8071a81e s4 : 0000000000003fff
[   65.616407]  s5 : ffffffffffffc000 s6 : 0000000000000000 s7 : ffffffff81618008
[   65.616845]  s8 : 0000000000000001 s9 : 0000000180000040 s10: 0000000000000000
[   65.617248]  s11: 000000000000016b t3 : 000000ff00000000 t4 : 0c6aec92de5e3fd7
[   65.617672]  t5 : fff78f60608fcfff t6 : 0000000000000078
[   65.618088] status: 0000000000000120 badaddr: ffffffd0003bbde8 cause: 000000000000000d
[   65.618621] [<ffffffff800048a4>] get_wchan+0x5c/0x88
[   65.619008] [<ffffffff8022da88>] do_task_stat+0x7a2/0xa46
[   65.619325] [<ffffffff8022e87e>] proc_tgid_stat+0xe/0x16
[   65.619637] [<ffffffff80227dd6>] proc_single_show+0x46/0x96
[   65.619979] [<ffffffff801ccb1e>] seq_read_iter+0x190/0x31e
[   65.620341] [<ffffffff801ccd70>] seq_read+0xc4/0x104
[   65.620633] [<ffffffff801a6bfe>] vfs_read+0x6a/0x112
[   65.620922] [<ffffffff801a701c>] ksys_read+0x54/0xbe
[   65.621206] [<ffffffff801a7094>] sys_read+0xe/0x16
[   65.621474] [<ffffffff8000303e>] ret_from_syscall+0x0/0x2
[   65.622169] ---[ end trace f24856ed2b8789c5 ]---
[   65.622832] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b

Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2 years agoio_uring: explicitly catch any illegal async queue attempt
Jens Axboe [Fri, 23 Jul 2021 17:53:54 +0000 (11:53 -0600)]
io_uring: explicitly catch any illegal async queue attempt

Catch an illegal case to queue async from an unrelated task that got
the ring fd passed to it. This should not be possible to hit, but
better be proactive and catch it explicitly. io-wq is extended to
check for early IO_WQ_WORK_CANCEL being set on a work item as well,
so it can run the request through the normal cancelation path.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoio_uring: never attempt iopoll reissue from release path
Jens Axboe [Fri, 23 Jul 2021 17:49:29 +0000 (11:49 -0600)]
io_uring: never attempt iopoll reissue from release path

There are two reasons why this shouldn't be done:

1) Ring is exiting, and we're canceling requests anyway. Any request
   should be canceled anyway. In theory, this could iterate for a
   number of times if someone else is also driving the target block
   queue into request starvation, however the likelihood of this
   happening is miniscule.

2) If the original task decided to pass the ring to another task, then
   we don't want to be reissuing from this context as it may be an
   unrelated task or context. No assumptions should be made about
   the context in which ->release() is run. This can only happen for pure
   read/write, and we'll get -EFAULT on them anyway.

Link: https://lore.kernel.org/io-uring/YPr4OaHv0iv0KTOc@zeniv-ca.linux.org.uk/
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoMerge tag 'for-5.14-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave...
Linus Torvalds [Fri, 23 Jul 2021 19:49:07 +0000 (12:49 -0700)]
Merge tag 'for-5.14-rc2-tag' of git://git./linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few fixes and one patch to help some block layer API cleanups:

   - skip missing device when running fstrim

   - fix unpersisted i_size on fsync after expanding truncate

   - fix lock inversion problem when doing qgroup extent tracing

   - replace bdgrab/bdput usage, replace gendisk by block_device"

* tag 'for-5.14-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: store a block_device in struct btrfs_ordered_extent
  btrfs: fix lock inversion problem when doing qgroup extent tracing
  btrfs: check for missing device in btrfs_trim_fs
  btrfs: fix unpersisted i_size on fsync after expanding truncate

2 years agoMerge tag 'ceph-for-5.14-rc3' of git://github.com/ceph/ceph-client
Linus Torvalds [Fri, 23 Jul 2021 18:30:12 +0000 (11:30 -0700)]
Merge tag 'ceph-for-5.14-rc3' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "A subtle deadlock on lock_rwsem (marked for stable) and rbd fixes for
  a -rc1 regression.

  Also included a rare WARN condition tweak"

* tag 'ceph-for-5.14-rc3' of git://github.com/ceph/ceph-client:
  rbd: resurrect setting of disk->private_data in rbd_init_disk()
  ceph: don't WARN if we're still opening a session to an MDS
  rbd: don't hold lock_rwsem while running_list is being drained
  rbd: always kick acquire on "acquired" and "released" notifications

2 years agoMerge tag 'trace-v5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt...
Linus Torvalds [Fri, 23 Jul 2021 18:25:21 +0000 (11:25 -0700)]
Merge tag 'trace-v5.14-rc2' of git://git./linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix deadloop in ring buffer because of using stale "read" variable

 - Fix synthetic event use of field_pos as boolean and not an index

 - Fixed histogram special var "cpu" overriding event fields called
   "cpu"

 - Cleaned up error prone logic in alloc_synth_event()

 - Removed call to synchronize_rcu_tasks_rude() when not needed

 - Removed redundant initialization of a local variable "ret"

 - Fixed kernel crash when updating tracepoint callbacks of different
   priorities.

* tag 'trace-v5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracepoints: Update static_call before tp_funcs when adding a tracepoint
  ftrace: Remove redundant initialization of variable ret
  ftrace: Avoid synchronize_rcu_tasks_rude() call when not necessary
  tracing: Clean up alloc_synth_event()
  tracing/histogram: Rename "cpu" to "common_cpu"
  tracing: Synthetic event field_pos is an index not a boolean
  tracing: Fix bug in rb_per_cpu_empty() that might cause deadloop.

2 years agoMerge tag 'm68k-for-v5.14-tag2' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 23 Jul 2021 18:19:57 +0000 (11:19 -0700)]
Merge tag 'm68k-for-v5.14-tag2' of git://git./linux/kernel/git/geert/linux-m68k

Pull m68k fix from Geert Uytterhoeven:

 - Fix a Mac defconfig regression due to the IDE -> ATA switch

* tag 'm68k-for-v5.14-tag2' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
  m68k: MAC should select HAVE_PATA_PLATFORM

2 years agoMerge tag 'acpi-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael...
Linus Torvalds [Fri, 23 Jul 2021 18:08:06 +0000 (11:08 -0700)]
Merge tag 'acpi-5.14-rc3' of git://git./linux/kernel/git/rafael/linux-pm

Pull ACPI fixes from Rafael Wysocki:
 "These fix a recently broken Kconfig dependency and ACPI device
  reference counting in an iterator macro.

  Specifics:

   - Fix recently broken Kconfig dependency for the ACPI table override
     via built-in initrd (Robert Richter)

   - Fix ACPI device reference counting in the for_each_acpi_dev_match()
     helper macro to avoid use-after-free (Andy Shevchenko)"

* tag 'acpi-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: utils: Fix reference counting in for_each_acpi_dev_match()
  ACPI: Kconfig: Fix table override from built-in initrd

2 years agoMerge tag 'driver-core-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 23 Jul 2021 17:20:15 +0000 (10:20 -0700)]
Merge tag 'driver-core-5.14-rc3' of git://git./linux/kernel/git/gregkh/driver-core

Pull driver core fixes from Greg KH:
 "Here are two small driver core fixes to resolve some reported problems
  for 5.14-rc3. They include:

   - aux bus memory leak fix

   - unneeded warning message removed when removing a device link.

  Both have been in linux-next with no reported problems"

* tag 'driver-core-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  driver core: Prevent warning when removing a device link from unregistered consumer
  driver core: auxiliary bus: Fix memory leak when driver_register() fail

2 years agoMerge tag 'char-misc-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregk...
Linus Torvalds [Fri, 23 Jul 2021 17:14:56 +0000 (10:14 -0700)]
Merge tag 'char-misc-5.14-rc3' of git://git./linux/kernel/git/gregkh/char-misc

Pull char/misc fixes from Greg KH:
 "Here are some small char/misc driver fixes for 5.14-rc3.

  Included in here are:

   - MAINTAINERS file updates for two changes in different driver
     subsystems

   - mhi bus bugfixes

   - nds32 bugfix that resolves a reported problem

  All have been in linux-next with no reported problems"

* tag 'char-misc-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  nds32: fix up stack guard gap
  MAINTAINERS: Change ACRN HSM driver maintainer
  MAINTAINERS: Update for VMCI driver
  bus: mhi: pci_generic: Fix inbound IPCR channel
  bus: mhi: core: Validate channel ID when processing command completions
  bus: mhi: pci_generic: Apply no-op for wake using sideband wake boolean

2 years agoMerge tag 'usb-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Linus Torvalds [Fri, 23 Jul 2021 17:09:27 +0000 (10:09 -0700)]
Merge tag 'usb-5.14-rc3' of git://git./linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
 "Here are some USB fixes for 5.14-rc3 to resolve a bunch of tiny
  problems reported. Included in here are:

   - dtsi revert to resolve a problem which broke android systems that
     relied on the dts name to find the USB controller device.

     People are still working out the "real" solution for this, but for
     now the revert is needed.

   - core USB fix for pipe calculation found by syzbot

   - typec fixes

   - gadget driver fixes

   - new usb-serial device ids

   - new USB quirks

   - xhci fixes

   - usb hub fixes for power management issues reported

   - other tiny fixes

  All have been in linux-next with no reported problems"

* tag 'usb-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (27 commits)
  USB: serial: cp210x: add ID for CEL EM3588 USB ZigBee stick
  Revert "USB: quirks: ignore remote wake-up on Fibocom L850-GL LTE modem"
  usb: cdc-wdm: fix build error when CONFIG_WWAN_CORE is not set
  Revert "arm64: dts: qcom: Harmonize DWC USB3 DT nodes name"
  usb: dwc2: gadget: Fix sending zero length packet in DDMA mode.
  usb: dwc2: Skip clock gating on Samsung SoCs
  usb: renesas_usbhs: Fix superfluous irqs happen after usb_pkt_pop()
  usb: dwc2: gadget: Fix GOUTNAK flow for Slave mode.
  usb: phy: Fix page fault from usb_phy_uevent
  usb: xhci: avoid renesas_usb_fw.mem when it's unusable
  usb: gadget: u_serial: remove WARN_ON on null port
  usb: dwc3: avoid NULL access of usb_gadget_driver
  usb: max-3421: Prevent corruption of freed memory
  usb: gadget: Fix Unbalanced pm_runtime_enable in tegra_xudc_probe
  MAINTAINERS: repair reference in USB IP DRIVER FOR HISILICON KIRIN 970
  usb: typec: stusb160x: Don't block probing of consumer of "connector" nodes
  usb: typec: stusb160x: register role switch before interrupt registration
  USB: usb-storage: Add LaCie Rugged USB3-FW to IGNORE_UAS
  usb: ehci: Prevent missed ehci interrupts with edge-triggered MSI
  usb: hub: Disable USB 3 device initiated lpm if exit latency is too high
  ...

2 years agoMerge tag 'sound-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai...
Linus Torvalds [Fri, 23 Jul 2021 16:58:23 +0000 (09:58 -0700)]
Merge tag 'sound-5.14-rc3' of git://git./linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
 "A collection of small fixes, mostly covering device-specific
  regressions and bugs over ASoC, HD-audio and USB-audio, while
  the ALSA PCM core received a few additional fixes for the
  possible (new and old) regressions"

* tag 'sound-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (29 commits)
  ALSA: usb-audio: Add registration quirk for JBL Quantum headsets
  ALSA: hda/hdmi: Add quirk to force pin connectivity on NUC10
  ALSA: pcm: Fix mmap without buffer preallocation
  ALSA: pcm: Fix mmap capability check
  ALSA: hda: intel-dsp-cfg: add missing ElkhartLake PCI ID
  ASoC: ti: j721e-evm: Check for not initialized parent_clk_id
  ASoC: ti: j721e-evm: Fix unbalanced domain activity tracking during startup
  ALSA: hda/realtek: Fix pop noise and 2 Front Mic issues on a machine
  ALSA: hdmi: Expose all pins on MSI MS-7C94 board
  ALSA: sb: Fix potential ABBA deadlock in CSP driver
  ASoC: rt5682: Fix the issue of garbled recording after powerd_dbus_suspend
  ASoC: amd: reverse stop sequence for stoneyridge platform
  ASoC: soc-pcm: add a flag to reverse the stop sequence
  ASoC: codecs: wcd938x: setup irq during component bind
  ASoC: dt-bindings: renesas: rsnd: Fix incorrect 'port' regex schema
  ALSA: usb-audio: Add missing proc text entry for BESPOKEN type
  ASoC: codecs: wcd938x: make sdw dependency explicit in Kconfig
  ASoC: SOF: Intel: Update ADL descriptor to use ACPI power states
  ASoC: rt5631: Fix regcache sync errors on resume
  ALSA: pcm: Call substream ack() method upon compat mmap commit
  ...

2 years agoMerge branch 'acpi-utils'
Rafael J. Wysocki [Fri, 23 Jul 2021 15:06:15 +0000 (17:06 +0200)]
Merge branch 'acpi-utils'

* acpi-utils:
  ACPI: utils: Fix reference counting in for_each_acpi_dev_match()

2 years agotracepoints: Update static_call before tp_funcs when adding a tracepoint
Steven Rostedt (VMware) [Fri, 23 Jul 2021 01:52:18 +0000 (21:52 -0400)]
tracepoints: Update static_call before tp_funcs when adding a tracepoint

Because of the significant overhead that retpolines pose on indirect
calls, the tracepoint code was updated to use the new "static_calls" that
can modify the running code to directly call a function instead of using
an indirect caller, and this function can be changed at runtime.

In the tracepoint code that calls all the registered callbacks that are
attached to a tracepoint, the following is done:

it_func_ptr = rcu_dereference_raw((&__tracepoint_##name)->funcs);
if (it_func_ptr) {
__data = (it_func_ptr)->data;
static_call(tp_func_##name)(__data, args);
}

If there's just a single callback, the static_call is updated to just call
that callback directly. Once another handler is added, then the static
caller is updated to call the iterator, that simply loops over all the
funcs in the array and calls each of the callbacks like the old method
using indirect calling.

The issue was discovered with a race between updating the funcs array and
updating the static_call. The funcs array was updated first and then the
static_call was updated. This is not an issue as long as the first element
in the old array is the same as the first element in the new array. But
that assumption is incorrect, because callbacks also have a priority
field, and if there's a callback added that has a higher priority than the
callback on the old array, then it will become the first callback in the
new array. This means that it is possible to call the old callback with
the new callback data element, which can cause a kernel panic.

static_call = callback1()
funcs[] = {callback1,data1};
callback2 has higher priority than callback1

CPU 1 CPU 2
----- -----

   new_funcs = {callback2,data2},
               {callback1,data1}

   rcu_assign_pointer(tp->funcs, new_funcs);

  /*
   * Now tp->funcs has the new array
   * but the static_call still calls callback1
   */

it_func_ptr = tp->funcs [ new_funcs ]
data = it_func_ptr->data [ data2 ]
static_call(callback1, data);

/* Now callback1 is called with
 * callback2's data */

[ KERNEL PANIC ]

   update_static_call(iterator);

To prevent this from happening, always switch the static_call to the
iterator before assigning the tp->funcs to the new array. The iterator will
always properly match the callback with its data.

To trigger this bug:

  In one terminal:

    while :; do hackbench 50; done

  In another terminal

    echo 1 > /sys/kernel/tracing/events/sched/sched_waking/enable
    while :; do
        echo 1 > /sys/kernel/tracing/set_event_pid;
        sleep 0.5
        echo 0 > /sys/kernel/tracing/set_event_pid;
        sleep 0.5
   done

And it doesn't take long to crash. This is because the set_event_pid adds
a callback to the sched_waking tracepoint with a high priority, which will
be called before the sched_waking trace event callback is called.

Note, the removal to a single callback updates the array first, before
changing the static_call to single callback, which is the proper order as
the first element in the array is the same as what the static_call is
being changed to.

Link: https://lore.kernel.org/io-uring/4ebea8f0-58c9-e571-fd30-0ce4f6f09c70@samba.org/
Cc: stable@vger.kernel.org
Fixes: d25e37d89dd2f ("tracepoint: Optimize using static_call()")
Reported-by: Stefan Metzmacher <metze@samba.org>
tested-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>