linux-2.6-microblaze.git
3 years agopowerpc/32: Refactor booke critical registers saving
Christophe Leroy [Fri, 12 Mar 2021 12:50:31 +0000 (12:50 +0000)]
powerpc/32: Refactor booke critical registers saving

Refactor booke critical registers saving into a few macros
and move it into the exception prolog directly.

Keep the dedicated transfert_to_handler entry point for the
moment allthough they are empty. They will be removed in a
later patch to reduce churn.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/269171496f1f5f22afa621695bded22976c9d48d.1615552867.git.christophe.leroy@csgroup.eu
3 years agopowerpc/32: Provide a name to exception prolog continuation in virtual mode
Christophe Leroy [Fri, 12 Mar 2021 12:50:30 +0000 (12:50 +0000)]
powerpc/32: Provide a name to exception prolog continuation in virtual mode

Now that the prolog continuation is separated in .text, give it a name
and mark it _ASM_NOKPROBE_SYMBOL.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/d96374218815a6627e1e922ab2aba994050fb87a.1615552867.git.christophe.leroy@csgroup.eu
3 years agopowerpc/32: Move exception prolog code into .text once MMU is back on
Christophe Leroy [Fri, 12 Mar 2021 12:50:29 +0000 (12:50 +0000)]
powerpc/32: Move exception prolog code into .text once MMU is back on

The space in the head section is rather constrained by the fact that
exception vectors are spread every 0x100 bytes and sometimes we
need to have "out of line" code because it doesn't fit.

Now that we are enabling MMU early in the prolog, take that opportunity
to jump somewhere else in the .text section where we don't have any
space constraint.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/38b31ca4bc782a4985bc7952a675404d7ff27c24.1615552867.git.christophe.leroy@csgroup.eu
3 years agopowerpc/32: Use START_EXCEPTION() as much as possible
Christophe Leroy [Fri, 12 Mar 2021 12:50:28 +0000 (12:50 +0000)]
powerpc/32: Use START_EXCEPTION() as much as possible

Everywhere where it is possible, use START_EXCEPTION().

This will help for proper exception init in future patches.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/d47c1cc242bbbef8658327503726abdaef9b63ef.1615552867.git.christophe.leroy@csgroup.eu
3 years agopowerpc/32: Add vmap_stack_overflow label inside the macro
Christophe Leroy [Fri, 12 Mar 2021 12:50:27 +0000 (12:50 +0000)]
powerpc/32: Add vmap_stack_overflow label inside the macro

For consistency, add in the macro the label used by exception prolog
to branch to stack overflow processing.

While at it, enclose the macro in #ifdef CONFIG_VMAP_STACK on the 8xx
as already done on book3s/32.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/cf80056f5b946572ad98aea9d915dd25b23beda6.1615552867.git.christophe.leroy@csgroup.eu
3 years agopowerpc/32: Statically initialise first emergency context
Christophe Leroy [Fri, 12 Mar 2021 12:50:25 +0000 (12:50 +0000)]
powerpc/32: Statically initialise first emergency context

The check of the emergency context initialisation in
vmap_stack_overflow is buggy for the SMP case, as it
compares r1 with 0 while in the SMP case r1 is offseted
by the CPU id.

Instead of fixing it, just perform static initialisation
of the first emergency context.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/4a67ba422be75713286dca0c86ee0d3df2eb6dfa.1615552867.git.christophe.leroy@csgroup.eu
3 years agopowerpc/32: Enable instruction translation at the same time as data translation
Christophe Leroy [Fri, 12 Mar 2021 12:50:24 +0000 (12:50 +0000)]
powerpc/32: Enable instruction translation at the same time as data translation

On 40x and 8xx, kernel text is pinned.
On book3s/32, kernel text is mapped by BATs.

Enable instruction translation at the same time as data translation, it
makes things simpler.

In syscall handler, MSR_RI can also be set at the same time because
srr0/srr1 are already saved and r1 is set properly.

On booke, translation is always on, so at the end all PPC32
have translation on early. Just update msr.

Also update comment in power_save_ppc32_restore().

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/5269c7e5f5d2117358af3a89744d75a116be27b0.1615552867.git.christophe.leroy@csgroup.eu
3 years agopowerpc/32: Tag DAR in EXCEPTION_PROLOG_2 for the 8xx
Christophe Leroy [Fri, 12 Mar 2021 12:50:23 +0000 (12:50 +0000)]
powerpc/32: Tag DAR in EXCEPTION_PROLOG_2 for the 8xx

8xx requires to tag the DAR with a magic value in order to
fixup DAR on faults generated by 'dcbX', as the 8xx
forgets to update the DAR for those faults.

Do the tagging as early as possible, that is before enabling MMU.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/853a2e28ca7c5fc85617037030f99fe6070c9536.1615552867.git.christophe.leroy@csgroup.eu
3 years agopowerpc/32: Always enable data translation in exception prolog
Christophe Leroy [Fri, 12 Mar 2021 12:50:22 +0000 (12:50 +0000)]
powerpc/32: Always enable data translation in exception prolog

If the code can use a stack in vm area, it can also use a
stack in linear space.

Simplify code by removing old non VMAP stack code on PPC32.

That means the data translation is now re-enabled early in
exception prolog in all cases, not only when using VMAP stacks.

While we are touching EXCEPTION_PROLOG macros, remove the
unused for_rtas parameter in EXCEPTION_PROLOG_1.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/7cd6440c60a7e8f4f035b245c57720f51e225aae.1615552866.git.christophe.leroy@csgroup.eu
3 years agopowerpc/32: Remove ksp_limit
Christophe Leroy [Fri, 12 Mar 2021 12:50:21 +0000 (12:50 +0000)]
powerpc/32: Remove ksp_limit

ksp_limit is there to help detect stack overflows.
That is specific to ppc32 as it was removed from ppc64 in
commit cbc9565ee826 ("powerpc: Remove ksp_limit on ppc64").

There are other means for detecting stack overflows.

As ppc64 has proven to not need it, ppc32 should be able to do
without it too.

Lets remove it and simplify exception handling.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/d789c3385b22e07bedc997613c0d26074cb513e7.1615552866.git.christophe.leroy@csgroup.eu
3 years agopowerpc/32: Use fast instruction to set MSR RI in exception prolog on 8xx
Christophe Leroy [Fri, 12 Mar 2021 12:50:20 +0000 (12:50 +0000)]
powerpc/32: Use fast instruction to set MSR RI in exception prolog on 8xx

8xx has registers SPRN_NRI, SPRN_EID and SPRN_EIE for changing
MSR EE and RI.

Use SPRN_EID in exception prolog to set RI.

On an 8xx, it reduces the null_syscall test by 3 cycles.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/65f6bda827c2a2abce71ea7e07543e791163da33.1615552866.git.christophe.leroy@csgroup.eu
3 years agopowerpc/32: Handle bookE debugging in C in exception entry
Christophe Leroy [Fri, 12 Mar 2021 12:50:19 +0000 (12:50 +0000)]
powerpc/32: Handle bookE debugging in C in exception entry

The handling of SPRN_DBCR0 and other registers can easily
be done in C instead of ASM.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/6d6b2497115890b90cfa72a2b3ab1da5f78123c2.1615552866.git.christophe.leroy@csgroup.eu
3 years agopowerpc/32: Entry cpu time accounting in C
Christophe Leroy [Fri, 12 Mar 2021 12:50:18 +0000 (12:50 +0000)]
powerpc/32: Entry cpu time accounting in C

There is no need for this to be in asm,
use the new interrupt entry wrapper.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/daca4c3e05cdfe54d237162a0718b3aaca897662.1615552866.git.christophe.leroy@csgroup.eu
3 years agopowerpc/32: Reconcile interrupts in C
Christophe Leroy [Fri, 12 Mar 2021 12:50:17 +0000 (12:50 +0000)]
powerpc/32: Reconcile interrupts in C

There is no need for this to be in asm anymore,
use the new interrupt entry wrapper.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/602e1ec47e15ca540f7edb9cf6feb6c249911bd6.1615552866.git.christophe.leroy@csgroup.eu
3 years agopowerpc/40x: Prepare normal exception handler for enabling MMU early
Christophe Leroy [Fri, 12 Mar 2021 12:50:16 +0000 (12:50 +0000)]
powerpc/40x: Prepare normal exception handler for enabling MMU early

Ensure normal exception handler are able to manage stuff with
MMU enabled. For that we use CONFIG_VMAP_STACK related code
allthough there is no intention to really activate CONFIG_VMAP_STACK
on powerpc 40x for the moment.

40x uses SPRN_DEAR instead of SPRN_DAR and SPRN_ESR instead of
SPRN_DSISR. Take it into account in common macros.

40x MSR value doesn't fit on 15 bits, use LOAD_REG_IMMEDIATE() in
common macros that will be used also with 40x.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/01963af2b83037bca270d7bf1336ffcf35da8282.1615552866.git.christophe.leroy@csgroup.eu
3 years agopowerpc/40x: Prepare for enabling MMU in critical exception prolog
Christophe Leroy [Fri, 12 Mar 2021 12:50:15 +0000 (12:50 +0000)]
powerpc/40x: Prepare for enabling MMU in critical exception prolog

In order the enable MMU early in exception prolog, implement
CONFIG_VMAP_STACK principles in critical exception prolog.

There is no intention to use CONFIG_VMAP_STACK on 40x,
but related code will be used to enable MMU early in exception
in a later patch.

Also address (critirq_ctx - PAGE_OFFSET) directly instead of
using tophys() in order to win one instruction.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/3fd75ee54c48307119acdbf66cfea966c1463bbd.1615552866.git.christophe.leroy@csgroup.eu
3 years agopowerpc/40x: Reorder a few instructions in critical exception prolog
Christophe Leroy [Fri, 12 Mar 2021 12:50:14 +0000 (12:50 +0000)]
powerpc/40x: Reorder a few instructions in critical exception prolog

In order to ease preparation for CONFIG_VMAP_STACK, reorder
a few instruction, especially save r1 into stack frame earlier.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c895ecf958c86d1736bdd2ff6f36626b55f35fd2.1615552866.git.christophe.leroy@csgroup.eu
3 years agopowerpc/40x: Save SRR0/SRR1 and r10/r11 earlier in critical exception
Christophe Leroy [Fri, 12 Mar 2021 12:50:13 +0000 (12:50 +0000)]
powerpc/40x: Save SRR0/SRR1 and r10/r11 earlier in critical exception

In order to be able to switch MMU on in exception prolog, save
SRR0 and SRR1 earlier.

Also save r10 and r11 into stack earlier to better match with the
normal exception prolog.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/79a93f253d72dc97ac968c9c62b5066960b688ed.1615552866.git.christophe.leroy@csgroup.eu
3 years agopowerpc/40x: Change CRITICAL_EXCEPTION_PROLOG macro to a gas macro
Christophe Leroy [Fri, 12 Mar 2021 12:50:12 +0000 (12:50 +0000)]
powerpc/40x: Change CRITICAL_EXCEPTION_PROLOG macro to a gas macro

Change CRITICAL_EXCEPTION_PROLOG macro to a gas macro to
remove the ugly ; and \ on each line.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/73291fb9dc9ec58182c27a40dfc3db204e3f4024.1615552866.git.christophe.leroy@csgroup.eu
3 years agopowerpc/40x: Don't use SPRN_SPRG_SCRATCH0/1 in TLB miss handlers
Christophe Leroy [Fri, 12 Mar 2021 12:50:11 +0000 (12:50 +0000)]
powerpc/40x: Don't use SPRN_SPRG_SCRATCH0/1 in TLB miss handlers

SPRN_SPRG_SCRATCH5 is used to save SPRN_PID.
SPRN_SPRG_SCRATCH6 is already available.

SPRN_PID is only 8 bits. We have r12 that contains CR.
We only need to preserve CR0, so we have space available in r12
to save PID.

Keep PID in r12 and free up SPRN_SPRG_SCRATCH5.

Then In TLB miss handlers, instead of using SPRN_SPRG_SCRATCH0 and
SPRN_SPRG_SCRATCH1, use SPRN_SPRG_SCRATCH5 and SPRN_SPRG_SCRATCH6
to avoid future conflicts with normal exception prologs.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/4cdaa85d38e14d594ba902424060ec55babf2c42.1615552866.git.christophe.leroy@csgroup.eu
3 years agopowerpc/traps: Declare unrecoverable_exception() as __noreturn
Christophe Leroy [Fri, 12 Mar 2021 12:50:10 +0000 (12:50 +0000)]
powerpc/traps: Declare unrecoverable_exception() as __noreturn

unrecoverable_exception() is never expected to return, most callers
have an infiniteloop in case it returns.

Ensure it really never returns by terminating it with a BUG(), and
declare it __no_return.

It always GCC to really simplify functions calling it. In the exemple
below, it avoids the stack frame in the likely fast path and avoids
code duplication for the exit.

With this patch:

00000348 <interrupt_exit_kernel_prepare>:
 348: 81 43 00 84  lwz     r10,132(r3)
 34c: 71 48 00 02  andi.   r8,r10,2
 350: 41 82 00 2c  beq     37c <interrupt_exit_kernel_prepare+0x34>
 354: 71 4a 40 00  andi.   r10,r10,16384
 358: 40 82 00 20  bne     378 <interrupt_exit_kernel_prepare+0x30>
 35c: 80 62 00 70  lwz     r3,112(r2)
 360: 74 63 00 01  andis.  r3,r3,1
 364: 40 82 00 28  bne     38c <interrupt_exit_kernel_prepare+0x44>
 368: 7d 40 00 a6  mfmsr   r10
 36c: 7c 11 13 a6  mtspr   81,r0
 370: 7c 12 13 a6  mtspr   82,r0
 374: 4e 80 00 20  blr
 378: 48 00 00 00  b       378 <interrupt_exit_kernel_prepare+0x30>
 37c: 94 21 ff f0  stwu    r1,-16(r1)
 380: 7c 08 02 a6  mflr    r0
 384: 90 01 00 14  stw     r0,20(r1)
 388: 48 00 00 01  bl      388 <interrupt_exit_kernel_prepare+0x40>
388: R_PPC_REL24 unrecoverable_exception
 38c: 38 e2 00 70  addi    r7,r2,112
 390: 3d 00 00 01  lis     r8,1
 394: 7c c0 38 28  lwarx   r6,0,r7
 398: 7c c6 40 78  andc    r6,r6,r8
 39c: 7c c0 39 2d  stwcx.  r6,0,r7
 3a0: 40 a2 ff f4  bne     394 <interrupt_exit_kernel_prepare+0x4c>
 3a4: 38 60 00 01  li      r3,1
 3a8: 4b ff ff c0  b       368 <interrupt_exit_kernel_prepare+0x20>

Without this patch:

00000348 <interrupt_exit_kernel_prepare>:
 348: 94 21 ff f0  stwu    r1,-16(r1)
 34c: 93 e1 00 0c  stw     r31,12(r1)
 350: 7c 7f 1b 78  mr      r31,r3
 354: 81 23 00 84  lwz     r9,132(r3)
 358: 71 2a 00 02  andi.   r10,r9,2
 35c: 41 82 00 34  beq     390 <interrupt_exit_kernel_prepare+0x48>
 360: 71 29 40 00  andi.   r9,r9,16384
 364: 40 82 00 28  bne     38c <interrupt_exit_kernel_prepare+0x44>
 368: 80 62 00 70  lwz     r3,112(r2)
 36c: 74 63 00 01  andis.  r3,r3,1
 370: 40 82 00 3c  bne     3ac <interrupt_exit_kernel_prepare+0x64>
 374: 7d 20 00 a6  mfmsr   r9
 378: 7c 11 13 a6  mtspr   81,r0
 37c: 7c 12 13 a6  mtspr   82,r0
 380: 83 e1 00 0c  lwz     r31,12(r1)
 384: 38 21 00 10  addi    r1,r1,16
 388: 4e 80 00 20  blr
 38c: 48 00 00 00  b       38c <interrupt_exit_kernel_prepare+0x44>
 390: 7c 08 02 a6  mflr    r0
 394: 90 01 00 14  stw     r0,20(r1)
 398: 48 00 00 01  bl      398 <interrupt_exit_kernel_prepare+0x50>
398: R_PPC_REL24 unrecoverable_exception
 39c: 80 01 00 14  lwz     r0,20(r1)
 3a0: 81 3f 00 84  lwz     r9,132(r31)
 3a4: 7c 08 03 a6  mtlr    r0
 3a8: 4b ff ff b8  b       360 <interrupt_exit_kernel_prepare+0x18>
 3ac: 39 02 00 70  addi    r8,r2,112
 3b0: 3d 40 00 01  lis     r10,1
 3b4: 7c e0 40 28  lwarx   r7,0,r8
 3b8: 7c e7 50 78  andc    r7,r7,r10
 3bc: 7c e0 41 2d  stwcx.  r7,0,r8
 3c0: 40 a2 ff f4  bne     3b4 <interrupt_exit_kernel_prepare+0x6c>
 3c4: 38 60 00 01  li      r3,1
 3c8: 4b ff ff ac  b       374 <interrupt_exit_kernel_prepare+0x2c>

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1e883e9d93fdb256853d1434c8ad77c257349b2d.1615552866.git.christophe.leroy@csgroup.eu
3 years agocxl: don't manipulate the mm.mm_users field directly
Laurent Dufour [Wed, 10 Mar 2021 17:44:05 +0000 (18:44 +0100)]
cxl: don't manipulate the mm.mm_users field directly

It is better to rely on the API provided by the MM layer instead of
directly manipulating the mm_users field.

Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Acked-by: Frederic Barrat <fbarrat@linux.ibm.com>
Acked-by: Andrew Donnellan <ajd@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210310174405.51044-1-ldufour@linux.ibm.com
3 years agopowerpc/uprobes: Validation for prefixed instruction
Ravi Bangoria [Thu, 11 Mar 2021 09:15:38 +0000 (14:45 +0530)]
powerpc/uprobes: Validation for prefixed instruction

As per ISA 3.1, prefixed instruction should not cross 64-byte
boundary. So don't allow Uprobe on such prefixed instruction.

There are two ways probed instruction is changed in mapped pages.
First, when Uprobe is activated, it searches for all the relevant
pages and replace instruction in them. In this case, if that probe
is on the 64-byte unaligned prefixed instruction, error out
directly. Second, when Uprobe is already active and user maps a
relevant page via mmap(), instruction is replaced via mmap() code
path. But because Uprobe is invalid, entire mmap() operation can
not be stopped. In this case just print an error and continue.

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Sandipan Das <sandipan@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210311091538.368590-1-ravi.bangoria@linux.ibm.com
3 years agopowerpc/signal: Use __get_user() to copy sigset_t
Christopher M. Riedl [Sat, 27 Feb 2021 01:12:59 +0000 (19:12 -0600)]
powerpc/signal: Use __get_user() to copy sigset_t

Usually sigset_t is exactly 8B which is a "trivial" size and does not
warrant using __copy_from_user(). Use __get_user() directly in
anticipation of future work to remove the trivial size optimizations
from __copy_from_user().

The ppc32 implementation of get_sigset_t() previously called
copy_from_user() which, unlike __copy_from_user(), calls access_ok().
Replacing this w/ __get_user() (no access_ok()) is fine here since both
callsites in signal_32.c are preceded by an earlier access_ok().

Signed-off-by: Christopher M. Riedl <cmr@codefail.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210227011259.11992-11-cmr@codefail.de
3 years agopowerpc/signal64: Rewrite rt_sigreturn() to minimise uaccess switches
Daniel Axtens [Sat, 27 Feb 2021 01:12:58 +0000 (19:12 -0600)]
powerpc/signal64: Rewrite rt_sigreturn() to minimise uaccess switches

Add uaccess blocks and use the 'unsafe' versions of functions doing user
access where possible to reduce the number of times uaccess has to be
opened/closed.

Co-developed-by: Christopher M. Riedl <cmr@codefail.de>
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Christopher M. Riedl <cmr@codefail.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210227011259.11992-10-cmr@codefail.de
3 years agopowerpc/signal64: Rewrite handle_rt_signal64() to minimise uaccess switches
Daniel Axtens [Sat, 27 Feb 2021 01:12:57 +0000 (19:12 -0600)]
powerpc/signal64: Rewrite handle_rt_signal64() to minimise uaccess switches

Add uaccess blocks and use the 'unsafe' versions of functions doing user
access where possible to reduce the number of times uaccess has to be
opened/closed.

There is no 'unsafe' version of copy_siginfo_to_user, so move it
slightly to allow for a "longer" uaccess block.

Co-developed-by: Christopher M. Riedl <cmr@codefail.de>
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Christopher M. Riedl <cmr@codefail.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210227011259.11992-9-cmr@codefail.de
3 years agopowerpc/signal64: Replace restore_sigcontext() w/ unsafe_restore_sigcontext()
Christopher M. Riedl [Sat, 27 Feb 2021 01:12:56 +0000 (19:12 -0600)]
powerpc/signal64: Replace restore_sigcontext() w/ unsafe_restore_sigcontext()

Previously restore_sigcontext() performed a costly KUAP switch on every
uaccess operation. These repeated uaccess switches cause a significant
drop in signal handling performance.

Rewrite restore_sigcontext() to assume that a userspace read access
window is open by replacing all uaccess functions with their 'unsafe'
versions. Modify the callers to first open, call
unsafe_restore_sigcontext(), and then close the uaccess window.

Signed-off-by: Christopher M. Riedl <cmr@codefail.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210227011259.11992-8-cmr@codefail.de
3 years agopowerpc/signal64: Replace setup_sigcontext() w/ unsafe_setup_sigcontext()
Christopher M. Riedl [Sat, 27 Feb 2021 01:12:55 +0000 (19:12 -0600)]
powerpc/signal64: Replace setup_sigcontext() w/ unsafe_setup_sigcontext()

Previously setup_sigcontext() performed a costly KUAP switch on every
uaccess operation. These repeated uaccess switches cause a significant
drop in signal handling performance.

Rewrite setup_sigcontext() to assume that a userspace write access window
is open by replacing all uaccess functions with their 'unsafe' versions.
Modify the callers to first open, call unsafe_setup_sigcontext() and
then close the uaccess window.

Signed-off-by: Christopher M. Riedl <cmr@codefail.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210227011259.11992-7-cmr@codefail.de
3 years agopowerpc/signal64: Remove TM ifdefery in middle of if/else block
Christopher M. Riedl [Sat, 27 Feb 2021 01:12:54 +0000 (19:12 -0600)]
powerpc/signal64: Remove TM ifdefery in middle of if/else block

Both rt_sigreturn() and handle_rt_signal_64() contain TM-related ifdefs
which break-up an if/else block. Provide stubs for the ifdef-guarded TM
functions and remove the need for an ifdef in rt_sigreturn().

Rework the remaining TM ifdef in handle_rt_signal64() similar to
commit f1cf4f93de2f ("powerpc/signal32: Remove ifdefery in middle of if/else").

Unlike in the commit for ppc32, the ifdef can't be removed entirely
since uc_transact in sigframe depends on CONFIG_PPC_TRANSACTIONAL_MEM.

Signed-off-by: Christopher M. Riedl <cmr@codefail.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210227011259.11992-6-cmr@codefail.de
3 years agopowerpc: Reference parameter in MSR_TM_ACTIVE() macro
Christopher M. Riedl [Sat, 27 Feb 2021 01:12:53 +0000 (19:12 -0600)]
powerpc: Reference parameter in MSR_TM_ACTIVE() macro

Unlike the other MSR_TM_* macros, MSR_TM_ACTIVE does not reference or
use its parameter unless CONFIG_PPC_TRANSACTIONAL_MEM is defined. This
causes an 'unused variable' compile warning unless the variable is also
guarded with CONFIG_PPC_TRANSACTIONAL_MEM.

Reference but do nothing with the argument in the macro to avoid a
potential compile warning.

Signed-off-by: Christopher M. Riedl <cmr@codefail.de>
Reviewed-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210227011259.11992-5-cmr@codefail.de
3 years agopowerpc/signal64: Remove non-inline calls from setup_sigcontext()
Christopher M. Riedl [Sat, 27 Feb 2021 01:12:52 +0000 (19:12 -0600)]
powerpc/signal64: Remove non-inline calls from setup_sigcontext()

The majority of setup_sigcontext() can be refactored to execute in an
"unsafe" context assuming an open uaccess window except for some
non-inline function calls. Move these out into a separate
prepare_setup_sigcontext() function which must be called first and
before opening up a uaccess window. Non-inline function calls should be
avoided during a uaccess window for a few reasons:

- KUAP should be enabled for as much kernel code as possible.
  Opening a uaccess window disables KUAP which means any code
  executed during this time contributes to a potential attack
  surface.

- Non-inline functions default to traceable which means they are
  instrumented for ftrace. This adds more code which could run
  with KUAP disabled.

- Powerpc does not currently support the objtool UACCESS checks.
  All code running with uaccess must be audited manually which
  means: less code -> less work -> fewer problems (in theory).

A follow-up commit converts setup_sigcontext() to be "unsafe".

Signed-off-by: Christopher M. Riedl <cmr@codefail.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210227011259.11992-4-cmr@codefail.de
3 years agopowerpc/signal: Add unsafe_copy_{vsx, fpr}_from_user()
Christopher M. Riedl [Sat, 27 Feb 2021 01:12:51 +0000 (19:12 -0600)]
powerpc/signal: Add unsafe_copy_{vsx, fpr}_from_user()

Reuse the "safe" implementation from signal.c but call unsafe_get_user()
directly in a loop to avoid the intermediate copy into a local buffer.

Signed-off-by: Christopher M. Riedl <cmr@codefail.de>
Reviewed-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210227011259.11992-3-cmr@codefail.de
3 years agopowerpc/uaccess: Add unsafe_copy_from_user()
Christopher M. Riedl [Sat, 27 Feb 2021 01:12:50 +0000 (19:12 -0600)]
powerpc/uaccess: Add unsafe_copy_from_user()

Use the same approach as unsafe_copy_to_user() but instead call
unsafe_get_user() in a loop.

Signed-off-by: Christopher M. Riedl <cmr@codefail.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210227011259.11992-2-cmr@codefail.de
3 years agopowerpc/qspinlock: Use generic smp_cond_load_relaxed
Davidlohr Bueso [Tue, 9 Mar 2021 01:59:50 +0000 (17:59 -0800)]
powerpc/qspinlock: Use generic smp_cond_load_relaxed

49a7d46a06c3 (powerpc: Implement smp_cond_load_relaxed()) added
busy-waiting pausing with a preferred SMT priority pattern, lowering
the priority (reducing decode cycles) during the whole loop slowpath.

However, data shows that while this pattern works well with simple
spinlocks, queued spinlocks benefit more being kept in medium priority,
with a cpu_relax() instead, being a low+medium combo on powerpc.

Data is from three benchmarks on a Power9: 9008-22L 64 CPUs with
2 sockets and 8 threads per core.

1. locktorture.

This is data for the lowest and most artificial/pathological level,
with increasing thread counts pounding on the lock. Metrics are total
ops/minute. Despite some small hits in the 4-8 range, scenarios are
either neutral or favorable to this patch.

+=========+==========+==========+=======+
| # tasks | vanilla  | dirty    | %diff |
+=========+==========+==========+=======+
| 2       | 46718565 | 48751350 | 4.35  |
+---------+----------+----------+-------+
| 4       | 51740198 | 50369082 | -2.65 |
+---------+----------+----------+-------+
| 8       | 63756510 | 62568821 | -1.86 |
+---------+----------+----------+-------+
| 16      | 67824531 | 70966546 | 4.63  |
+---------+----------+----------+-------+
| 32      | 53843519 | 61155508 | 13.58 |
+---------+----------+----------+-------+
| 64      | 53005778 | 53104412 | 0.18  |
+---------+----------+----------+-------+
| 128     | 53331980 | 54606910 | 2.39  |
+=========+==========+==========+=======+

2. sockperf (tcp throughput)

Here a client will do one-way throughput tests to a localhost server, with
increasing message sizes, dealing with the sk_lock. This patch shows to put
the performance of the qspinlock back to par with that of the simple lock:

     simple-spinlock           vanilla dirty
Hmean     14        73.50 (   0.00%)       54.44 * -25.93%*       73.45 * -0.07%*
Hmean     100      654.47 (   0.00%)      385.61 * -41.08%*      771.43 * 17.87%*
Hmean     300     2719.39 (   0.00%)     2181.67 * -19.77%*     2666.50 * -1.94%*
Hmean     500     4400.59 (   0.00%)     3390.77 * -22.95%*     4322.14 * -1.78%*
Hmean     850     6726.21 (   0.00%)     5264.03 * -21.74%*     6863.12 * 2.04%*

3. dbench (tmpfs)

Configured to run with up to ncpusx8 clients, it shows both latency and
throughput metrics. For the latency, with the exception of the 64 case,
there is really nothing to go by:
     vanilla                dirty
Amean     latency-1          1.67 (   0.00%)        1.67 *   0.09%*
Amean     latency-2          2.15 (   0.00%)        2.08 *   3.36%*
Amean     latency-4          2.50 (   0.00%)        2.56 *  -2.27%*
Amean     latency-8          2.49 (   0.00%)        2.48 *   0.31%*
Amean     latency-16         2.69 (   0.00%)        2.72 *  -1.37%*
Amean     latency-32         2.96 (   0.00%)        3.04 *  -2.60%*
Amean     latency-64         7.78 (   0.00%)        8.17 *  -5.07%*
Amean     latency-512      186.91 (   0.00%)      186.41 *   0.27%*

For the dbench4 Throughput (misleading but traditional) there's a small
but rather constant improvement:

     vanilla                dirty
Hmean     1        849.13 (   0.00%)      851.51 *   0.28%*
Hmean     2       1664.03 (   0.00%)     1663.94 *  -0.01%*
Hmean     4       3073.70 (   0.00%)     3104.29 *   1.00%*
Hmean     8       5624.02 (   0.00%)     5694.16 *   1.25%*
Hmean     16      9169.49 (   0.00%)     9324.43 *   1.69%*
Hmean     32     11969.37 (   0.00%)    12127.09 *   1.32%*
Hmean     64     15021.12 (   0.00%)    15243.14 *   1.48%*
Hmean     512    14891.27 (   0.00%)    15162.11 *   1.82%*

Measuring the dbench4 Per-VFS Operation latency, shows some very minor
differences within the noise level, around the 0-1% ranges.

Fixes: 49a7d46a06c3 ("powerpc: Implement smp_cond_load_relaxed()")
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210318204702.71417-1-dave@stgolabs.net
3 years agopowerpc/spinlock: Unserialize spin_is_locked
Davidlohr Bueso [Tue, 9 Mar 2021 01:59:49 +0000 (17:59 -0800)]
powerpc/spinlock: Unserialize spin_is_locked

c6f5d02b6a0f (locking/spinlocks/arm64: Remove smp_mb() from
arch_spin_is_locked()) made it pretty official that the call
semantics do not imply any sort of barriers, and any user that
gets creative must explicitly do any serialization.

This creativity, however, is nowadays pretty limited:

1. spin_unlock_wait() has been removed from the kernel in favor
of a lock/unlock combo. Furthermore, queued spinlocks have now
for a number of years no longer relied on _Q_LOCKED_VAL for the
call, but any non-zero value to indicate a locked state. There
were cases where the delayed locked store could lead to breaking
mutual exclusion with crossed locking; such as with sysv ipc and
netfilter being the most extreme.

2. The auditing Andrea did in verified that remaining spin_is_locked()
no longer rely on such semantics. Most callers just use it to assert
a lock is taken, in a debug nature. The only user that gets cute is
NOLOCK qdisc, as of:

   96009c7d500e (sched: replace __QDISC_STATE_RUNNING bit with a spin lock)

... which ironically went in the next day after c6f5d02b6a0f. This
change replaces test_bit() with spin_is_locked() to know whether
to take the busylock heuristic to reduce contention on the main
qdisc lock. So any races against spin_is_locked() for archs that
use LL/SC for spin_lock() will be benign and not break any mutual
exclusion; furthermore, both the seqlock and busylock have the same
scope.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210309015950.27688-3-dave@stgolabs.net
3 years agopowerpc/spinlock: Define smp_mb__after_spinlock only once
Davidlohr Bueso [Tue, 9 Mar 2021 01:59:48 +0000 (17:59 -0800)]
powerpc/spinlock: Define smp_mb__after_spinlock only once

Instead of both queued and simple spinlocks doing it. Move
it into the arch's spinlock.h.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210309015950.27688-2-dave@stgolabs.net
3 years agopowerpc/ptrace: Convert gpr32_set_common() to user access block
Christophe Leroy [Wed, 10 Mar 2021 17:57:07 +0000 (17:57 +0000)]
powerpc/ptrace: Convert gpr32_set_common() to user access block

Use user access block in gpr32_set_common() instead of
repetitive __get_user() which imply repetitive KUAP open/close.

To get it clean, force inlining of the small set of tiny functions
called inside the block.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/bdcb8652c3bb4ab5b8b3bfd08147434be8fc04c9.1615398498.git.christophe.leroy@csgroup.eu
3 years agopowerpc/futex: Switch to user_access block
Christophe Leroy [Wed, 10 Mar 2021 17:57:06 +0000 (17:57 +0000)]
powerpc/futex: Switch to user_access block

Use user_access_begin() instead of the access_ok/allow_access sequence.

This brings the missing might_fault() check.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/6cd202cdc4f939d47822e4ddd3c0856210431a58.1615398498.git.christophe.leroy@csgroup.eu
3 years agopowerpc/net: Switch csum_and_copy_{to/from}_user to user_access block
Christophe Leroy [Wed, 10 Mar 2021 17:57:05 +0000 (17:57 +0000)]
powerpc/net: Switch csum_and_copy_{to/from}_user to user_access block

Use user_access_begin() instead of the
might_sleep/access_ok/allow_access sequence.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/2dee286d2d6dc9a27d99e31ac564bad4fae2cb49.1615398498.git.christophe.leroy@csgroup.eu
3 years agopowerpc/lib: Don't use __put_user_asm_goto() outside of uaccess.h
Christophe Leroy [Wed, 10 Mar 2021 17:57:04 +0000 (17:57 +0000)]
powerpc/lib: Don't use __put_user_asm_goto() outside of uaccess.h

__put_user_asm_goto() is internal to uaccess.h

Use __put_kernel_nofault() instead. The generated code is identical.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/3e32c4f0361933909368b68f5ee569e5de661c1b.1615398498.git.christophe.leroy@csgroup.eu
3 years agopowerpc/syscalls: Use sys_old_select() in ppc_select()
Christophe Leroy [Wed, 10 Mar 2021 17:57:03 +0000 (17:57 +0000)]
powerpc/syscalls: Use sys_old_select() in ppc_select()

Instead of opencodying the copy of parameters, use
the generic sys_old_select().

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/4de983ad254739da1fe6e9f273baf387b7043ae0.1615398498.git.christophe.leroy@csgroup.eu
3 years agopowerpc/uaccess: Move copy_mc_xxx() functions down
Christophe Leroy [Wed, 10 Mar 2021 17:57:02 +0000 (17:57 +0000)]
powerpc/uaccess: Move copy_mc_xxx() functions down

copy_mc_xxx() functions are in the middle of raw_copy functions.

For clarity, move them out of the raw_copy functions block.

They are using access_ok, so they need to be after the general
functions in order to eventually allow the inclusion of
asm-generic/uaccess.h in some future.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/2cdecb6e5a2fcee6c158d18dd254b71ec0e0da4d.1615398498.git.christophe.leroy@csgroup.eu
3 years agopowerpc/uaccess: Swap clear_user() and __clear_user()
Christophe Leroy [Wed, 10 Mar 2021 17:57:01 +0000 (17:57 +0000)]
powerpc/uaccess: Swap clear_user() and __clear_user()

It is clear_user() which is expected to call __clear_user(),
not the reverse.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/d8ec01fb22f33d87321451d5e5f01cb56dacaa39.1615398498.git.christophe.leroy@csgroup.eu
3 years agopowerpc/uaccess: Also perform 64 bits copies in unsafe_copy_to_user() on ppc32
Christophe Leroy [Wed, 10 Mar 2021 17:57:00 +0000 (17:57 +0000)]
powerpc/uaccess: Also perform 64 bits copies in unsafe_copy_to_user() on ppc32

ppc32 has an efficiant 64 bits __put_user(), so also use it in
order to unroll loops more.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/ccc08a16eea682d6fa4acc957ffe34003a8f0844.1615398498.git.christophe.leroy@csgroup.eu
3 years agopowerpc/pseries: export LPAR security flavor in lparcfg
Laurent Dufour [Fri, 5 Mar 2021 12:55:54 +0000 (13:55 +0100)]
powerpc/pseries: export LPAR security flavor in lparcfg

This is helpful to read the security flavor from inside the LPAR.

In /sys/kernel/debug/powerpc/security_features it can be seen if
mitigations are on or off but not the level set through the ASMI menu.
Furthermore, reporting it through /proc/powerpc/lparcfg allows an easy
processing by the lparstat command [1].

Export it like this in /proc/powerpc/lparcfg:

  $ grep security_flavor /proc/powerpc/lparcfg
  security_flavor=1

Value follows what is documented on the IBM support page [2]:

  0 Speculative execution fully enabled
  1 Speculative execution controls to mitigate user-to-kernel attacks
  2 Speculative execution controls to mitigate user-to-kernel and
    user-to-user side-channel attacks

[1] https://groups.google.com/g/powerpc-utils-devel/c/NaKXvdyl_UI/m/wa2stpIDAQAJ
[2] https://www.ibm.com/support/pages/node/715841

Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210305125554.5165-1-ldufour@linux.ibm.com
3 years agopowerpc: Enable KFENCE for PPC32
Christophe Leroy [Thu, 4 Mar 2021 14:35:09 +0000 (14:35 +0000)]
powerpc: Enable KFENCE for PPC32

Add architecture specific implementation details for KFENCE and enable
KFENCE for the ppc32 architecture. In particular, this implements the
required interface in <asm/kfence.h>.

KFENCE requires that attributes for pages from its memory pool can
individually be set. Therefore, force the Read/Write linear map to be
mapped at page granularity.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/8dfe1bd2abde26337c1d8c1ad0acfcc82185e0d5.1614868445.git.christophe.leroy@csgroup.eu
3 years agopowerpc/ptrace: Remove duplicate check from pt_regs_check()
Denis Efremov [Fri, 5 Mar 2021 11:28:07 +0000 (14:28 +0300)]
powerpc/ptrace: Remove duplicate check from pt_regs_check()

"offsetof(struct pt_regs, msr) == offsetof(struct user_pt_regs, msr)"
checked in pt_regs_check() twice in a row. Remove the second check.

Signed-off-by: Denis Efremov <efremov@linux.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210305112807.26299-1-efremov@linux.com
3 years agopowerpc/pseries: Move hvc_vio_init_early() prototype to shared location
Lee Jones [Wed, 3 Mar 2021 12:46:03 +0000 (12:46 +0000)]
powerpc/pseries: Move hvc_vio_init_early() prototype to shared location

Fixes the following W=1 kernel build warning(s):

 drivers/tty/hvc/hvc_vio.c:385:13: warning: no previous prototype for ‘hvc_vio_init_early’
 385 | void __init hvc_vio_init_early(void)
 | ^~~~~~~~~~~~~~~~~~

Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210303124603.3150175-1-lee.jones@linaro.org
3 years agopowerpc: Fix misspellings in tlbflush.h
Zhang Yunkai [Thu, 4 Mar 2021 03:13:18 +0000 (19:13 -0800)]
powerpc: Fix misspellings in tlbflush.h

The comment marking the end of the include guard is wrong, fix it up.

Signed-off-by: Zhang Yunkai <zhang.yunkai@zte.com.cn>
[mpe: Rewrite commit message]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210304031318.188447-1-zhang.yunkai@zte.com.cn
3 years agopowerpc: Remove duplicate includes
Zhang Yunkai [Thu, 4 Mar 2021 04:49:43 +0000 (20:49 -0800)]
powerpc: Remove duplicate includes

asm/tm.h included in traps.c is duplicated. It is also included on
the 62nd line.

asm/udbg.h included in setup-common.c is duplicated. It is also
included on the 61st line.

asm/bug.h included in arch/powerpc/include/asm/book3s/64/mmu-hash.h
is duplicated. It is also included on the 12th line.

asm/tlbflush.h included in arch/powerpc/include/asm/pgtable.h is
duplicated. It is also included on the 11th line.

asm/page.h included in arch/powerpc/include/asm/thread_info.h is
duplicated. It is also included on the 13th line.

Signed-off-by: Zhang Yunkai <zhang.yunkai@zte.com.cn>
[mpe: Squash together from multiple commits]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
3 years agopowerpc/prom: Mark identical_pvr_fixup as __init
Nathan Chancellor [Tue, 2 Mar 2021 20:08:29 +0000 (13:08 -0700)]
powerpc/prom: Mark identical_pvr_fixup as __init

If identical_pvr_fixup() is not inlined, there are two modpost warnings:

WARNING: modpost: vmlinux.o(.text+0x54e8): Section mismatch in reference
from the function identical_pvr_fixup() to the function
.init.text:of_get_flat_dt_prop()
The function identical_pvr_fixup() references
the function __init of_get_flat_dt_prop().
This is often because identical_pvr_fixup lacks a __init
annotation or the annotation of of_get_flat_dt_prop is wrong.

WARNING: modpost: vmlinux.o(.text+0x551c): Section mismatch in reference
from the function identical_pvr_fixup() to the function
.init.text:identify_cpu()
The function identical_pvr_fixup() references
the function __init identify_cpu().
This is often because identical_pvr_fixup lacks a __init
annotation or the annotation of identify_cpu is wrong.

identical_pvr_fixup() calls two functions marked as __init and is only
called by a function marked as __init so it should be marked as __init
as well. At the same time, remove the inline keywork as it is not
necessary to inline this function. The compiler is still free to do so
if it feels it is worthwhile since commit 889b3c1245de ("compiler:
remove CONFIG_OPTIMIZE_INLINING entirely").

Fixes: 14b3d926a22b ("[POWERPC] 4xx: update 440EP(x)/440GR(x) identical PVR issue workaround")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://github.com/ClangBuiltLinux/linux/issues/1316
Link: https://lore.kernel.org/r/20210302200829.2680663-1-nathan@kernel.org
3 years agopowerpc/fadump: Mark fadump_calculate_reserve_size as __init
Nathan Chancellor [Tue, 2 Mar 2021 19:50:14 +0000 (12:50 -0700)]
powerpc/fadump: Mark fadump_calculate_reserve_size as __init

If fadump_calculate_reserve_size() is not inlined, there is a modpost
warning:

WARNING: modpost: vmlinux.o(.text+0x5196c): Section mismatch in
reference from the function fadump_calculate_reserve_size() to the
function .init.text:parse_crashkernel()
The function fadump_calculate_reserve_size() references
the function __init parse_crashkernel().
This is often because fadump_calculate_reserve_size lacks a __init
annotation or the annotation of parse_crashkernel is wrong.

fadump_calculate_reserve_size() calls parse_crashkernel(), which is
marked as __init and fadump_calculate_reserve_size() is called from
within fadump_reserve_mem(), which is also marked as __init.

Mark fadump_calculate_reserve_size() as __init to fix the section
mismatch. Additionally, remove the inline keyword as it is not necessary
to inline this function; the compiler is still free to do so if it feels
it is worthwhile since commit 889b3c1245de ("compiler: remove
CONFIG_OPTIMIZE_INLINING entirely").

Fixes: 11550dc0a00b ("powerpc/fadump: reuse crashkernel parameter for fadump memory reservation")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://github.com/ClangBuiltLinux/linux/issues/1300
Link: https://lore.kernel.org/r/20210302195013.2626335-1-nathan@kernel.org
3 years agoselftests/powerpc: Fix L1D flushing tests for Power10
Russell Currey [Tue, 23 Feb 2021 07:02:27 +0000 (17:02 +1000)]
selftests/powerpc: Fix L1D flushing tests for Power10

The rfi_flush and entry_flush selftests work by using the PM_LD_MISS_L1
perf event to count L1D misses.  The value of this event has changed
over time:

- Power7 uses 0x400f0
- Power8 and Power9 use both 0x400f0 and 0x3e054
- Power10 uses only 0x3e054

Rather than relying on raw values, configure perf to count L1D read
misses in the most explicit way available.

This fixes the selftests to work on systems without 0x400f0 as
PM_LD_MISS_L1, and should change no behaviour for systems that the tests
already worked on.

The only potential downside is that referring to a specific perf event
requires PMU support implemented in the kernel for that platform.

Signed-off-by: Russell Currey <ruscur@russell.cc>
Acked-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210223070227.2916871-1-ruscur@russell.cc
3 years agopowerpc: Fix spelling of "droping" to "dropping" in traps.c
Bhaskar Chowdhury [Wed, 24 Feb 2021 07:55:47 +0000 (13:25 +0530)]
powerpc: Fix spelling of "droping" to "dropping" in traps.c

s/droping/dropping/

Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210224075547.763063-1-unixbhaskar@gmail.com
3 years agopowerpc: remove unneeded semicolon
Jiapeng Chong [Wed, 24 Feb 2021 07:29:21 +0000 (15:29 +0800)]
powerpc: remove unneeded semicolon

Fix the following coccicheck warnings:

./arch/powerpc/kernel/prom_init.c:2986:2-3: Unneeded semicolon.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1614151761-53721-1-git-send-email-jiapeng.chong@linux.alibaba.com
3 years agopowerpc/chrp: Make hydra_init() static
Geert Uytterhoeven [Tue, 23 Feb 2021 09:53:45 +0000 (10:53 +0100)]
powerpc/chrp: Make hydra_init() static

Commit 407d418f2fd4c20a ("powerpc/chrp: Move PHB discovery") moved the
sole call to hydra_init() to the source file where it is defined, so it
can be made static.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210223095345.2139416-1-geert@linux-m68k.org
3 years agopowerpc/mm: Move the linear_mapping_mutex to the ifdef where it is used
Sebastian Andrzej Siewior [Fri, 19 Feb 2021 16:56:48 +0000 (17:56 +0100)]
powerpc/mm: Move the linear_mapping_mutex to the ifdef where it is used

The mutex linear_mapping_mutex is defined at the of the file while its
only two user are within the CONFIG_MEMORY_HOTPLUG block.
A compile without CONFIG_MEMORY_HOTPLUG set fails on PREEMPT_RT because
its mutex implementation is smart enough to realize that it is unused.

Move the definition of linear_mapping_mutex to ifdef block where it is
used.

Fixes: 1f73ad3e8d755 ("powerpc/mm: print warning in arch_remove_linear_mapping()")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210219165648.2505482-1-bigeasy@linutronix.de
3 years agoLinux 5.12-rc3
Linus Torvalds [Sun, 14 Mar 2021 21:41:02 +0000 (14:41 -0700)]
Linux 5.12-rc3

3 years agoprctl: fix PR_SET_MM_AUXV kernel stack leak
Alexey Dobriyan [Sun, 14 Mar 2021 20:51:14 +0000 (23:51 +0300)]
prctl: fix PR_SET_MM_AUXV kernel stack leak

Doing a

prctl(PR_SET_MM, PR_SET_MM_AUXV, addr, 1);

will copy 1 byte from userspace to (quite big) on-stack array
and then stash everything to mm->saved_auxv.
AT_NULL terminator will be inserted at the very end.

/proc/*/auxv handler will find that AT_NULL terminator
and copy original stack contents to userspace.

This devious scheme requires CAP_SYS_RESOURCE.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoMerge tag 'irq-urgent-2021-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 14 Mar 2021 20:33:33 +0000 (13:33 -0700)]
Merge tag 'irq-urgent-2021-03-14' of git://git./linux/kernel/git/tip/tip

Pull irq fixes from Thomas Gleixner:
 "A set of irqchip updates:

   - Make the GENERIC_IRQ_MULTI_HANDLER configuration correct

   - Add a missing DT compatible string for the Ingenic driver

   - Remove the pointless debugfs_file pointer from struct irqdomain"

* tag 'irq-urgent-2021-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/ingenic: Add support for the JZ4760
  dt-bindings/irq: Add compatible string for the JZ4760B
  irqchip: Do not blindly select CONFIG_GENERIC_IRQ_MULTI_HANDLER
  ARM: ep93xx: Select GENERIC_IRQ_MULTI_HANDLER directly
  irqdomain: Remove debugfs_file from struct irq_domain

3 years agoMerge tag 'timers-urgent-2021-03-14' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 14 Mar 2021 20:29:38 +0000 (13:29 -0700)]
Merge tag 'timers-urgent-2021-03-14' of git://git./linux/kernel/git/tip/tip

Pull timer fix from Thomas Gleixner:
 "A single fix in for hrtimers to prevent an interrupt storm caused by
  the lack of reevaluation of the timers which expire in softirq context
  under certain circumstances, e.g. when the clock was set"

* tag 'timers-urgent-2021-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()

3 years agoMerge tag 'sched-urgent-2021-03-14' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 14 Mar 2021 20:27:06 +0000 (13:27 -0700)]
Merge tag 'sched-urgent-2021-03-14' of git://git./linux/kernel/git/tip/tip

Pull scheduler fixes from Thomas Gleixner:
 "A set of scheduler updates:

   - Prevent a NULL pointer dereference in the migration_stop_cpu()
     mechanims

   - Prevent self concurrency of affine_move_task()

   - Small fixes and cleanups related to task migration/affinity setting

   - Ensure that sync_runqueues_membarrier_state() is invoked on the
     current CPU when it is in the cpu mask"

* tag 'sched-urgent-2021-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/membarrier: fix missing local execution of ipi_sync_rq_state()
  sched: Simplify set_affinity_pending refcounts
  sched: Fix affine_move_task() self-concurrency
  sched: Optimize migration_cpu_stop()
  sched: Collate affine_move_task() stoppers
  sched: Simplify migration_cpu_stop()
  sched: Fix migration_cpu_stop() requeueing

3 years agoMerge tag 'objtool-urgent-2021-03-14' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 14 Mar 2021 20:15:55 +0000 (13:15 -0700)]
Merge tag 'objtool-urgent-2021-03-14' of git://git./linux/kernel/git/tip/tip

Pull objtool fix from Thomas Gleixner:
 "A single objtool fix to handle the PUSHF/POPF validation correctly for
  the paravirt changes which modified arch_local_irq_restore not to use
  popf"

* tag 'objtool-urgent-2021-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  objtool,x86: Fix uaccess PUSHF/POPF validation

3 years agoMerge tag 'locking-urgent-2021-03-14' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 14 Mar 2021 20:03:21 +0000 (13:03 -0700)]
Merge tag 'locking-urgent-2021-03-14' of git://git./linux/kernel/git/tip/tip

Pull locking fixes from Thomas Gleixner:
 "A couple of locking fixes:

   - A fix for the static_call mechanism so it handles unaligned
     addresses correctly.

   - Make u64_stats_init() a macro so every instance gets a seperate
     lockdep key.

   - Make seqcount_latch_init() a macro as well to preserve the static
     variable which is used for the lockdep key"

* tag 'locking-urgent-2021-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  seqlock,lockdep: Fix seqcount_latch_init()
  u64_stats,lockdep: Fix u64_stats_init() vs lockdep
  static_call: Fix the module key fixup

3 years agoMerge tag 'perf_urgent_for_v5.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 14 Mar 2021 19:57:17 +0000 (12:57 -0700)]
Merge tag 'perf_urgent_for_v5.12-rc3' of git://git./linux/kernel/git/tip/tip

Pull perf fixes from Borislav Petkov:

 - Make sure PMU internal buffers are flushed for per-CPU events too and
   properly handle PID/TID for large PEBS.

 - Handle the case properly when there's no PMU and therefore return an
   empty list of perf MSRs for VMX to switch instead of reading random
   garbage from the stack.

* tag 'perf_urgent_for_v5.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/perf: Use RET0 as default for guest_get_msrs to handle "no PMU" case
  perf/x86/intel: Set PERF_ATTACH_SCHED_CB for large PEBS and LBR
  perf/core: Flush PMU internal buffers for per-CPU events

3 years agoMerge tag 'efi-urgent-for-v5.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 14 Mar 2021 19:54:56 +0000 (12:54 -0700)]
Merge tag 'efi-urgent-for-v5.12-rc2' of git://git./linux/kernel/git/tip/tip

Pull EFI fix from Ard Biesheuvel via Borislav Petkov:
 "Fix an oversight in the handling of EFI_RT_PROPERTIES_TABLE, which was
  added v5.10, but failed to take the SetVirtualAddressMap() RT service
  into account"

* tag 'efi-urgent-for-v5.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  efi: stub: omit SetVirtualAddressMap() if marked unsupported in RT_PROP table

3 years agoMerge tag 'x86_urgent_for_v5.12_rc3' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 14 Mar 2021 19:48:10 +0000 (12:48 -0700)]
Merge tag 'x86_urgent_for_v5.12_rc3' of git://git./linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - A couple of SEV-ES fixes and robustifications: verify usermode stack
   pointer in NMI is not coming from the syscall gap, correctly track
   IRQ states in the #VC handler and access user insn bytes atomically
   in same handler as latter cannot sleep.

 - Balance 32-bit fast syscall exit path to do the proper work on exit
   and thus not confuse audit and ptrace frameworks.

 - Two fixes for the ORC unwinder going "off the rails" into KASAN
   redzones and when ORC data is missing.

* tag 'x86_urgent_for_v5.12_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sev-es: Use __copy_from_user_inatomic()
  x86/sev-es: Correctly track IRQ states in runtime #VC handler
  x86/sev-es: Check regs->sp is trusted before adjusting #VC IST stack
  x86/sev-es: Introduce ip_within_syscall_gap() helper
  x86/entry: Fix entry/exit mismatch on failed fast 32-bit syscalls
  x86/unwind/orc: Silence warnings caused by missing ORC data
  x86/unwind/orc: Disable KASAN checking in the ORC unwinder, part 2

3 years agoMerge tag 'powerpc-5.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc...
Linus Torvalds [Sun, 14 Mar 2021 19:37:43 +0000 (12:37 -0700)]
Merge tag 'powerpc-5.12-3' of git://git./linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "Some more powerpc fixes for 5.12:

   - Fix wrong instruction encoding for lis in ppc_function_entry(),
     which could potentially lead to missed kprobes.

   - Fix SET_FULL_REGS on 32-bit and 64e, which prevented ptrace of
     non-volatile GPRs immediately after exec.

   - Clean up a missed SRR specifier in the recent interrupt rework.

   - Don't treat unrecoverable_exception() as an interrupt handler, it's
     called from other handlers so shouldn't do the interrupt entry/exit
     accounting itself.

   - Fix build errors caused by missing declarations for
     [en/dis]able_kernel_vsx().

  Thanks to Christophe Leroy, Daniel Axtens, Geert Uytterhoeven, Jiri
  Olsa, Naveen N. Rao, and Nicholas Piggin"

* tag 'powerpc-5.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/traps: unrecoverable_exception() is not an interrupt handler
  powerpc: Fix missing declaration of [en/dis]able_kernel_vsx()
  powerpc/64s/exception: Clean up a missed SRR specifier
  powerpc: Fix inverted SET_FULL_REGS bitop
  powerpc/64s: Use symbolic macros for function entry encoding
  powerpc/64s: Fix instruction encoding for lis in ppc_function_entry()

3 years agoMerge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Linus Torvalds [Sun, 14 Mar 2021 19:35:02 +0000 (12:35 -0700)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "More fixes for ARM and x86"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: LAPIC: Advancing the timer expiration on guest initiated write
  KVM: x86/mmu: Skip !MMU-present SPTEs when removing SP in exclusive mode
  KVM: kvmclock: Fix vCPUs > 64 can't be online/hotpluged
  kvm: x86: annotate RCU pointers
  KVM: arm64: Fix exclusive limit for IPA size
  KVM: arm64: Reject VM creation when the default IPA size is unsupported
  KVM: arm64: Ensure I-cache isolation between vcpus of a same VM
  KVM: arm64: Don't use cbz/adr with external symbols
  KVM: arm64: Fix range alignment when walking page tables
  KVM: arm64: Workaround firmware wrongly advertising GICv2-on-v3 compatibility
  KVM: arm64: Rename __vgic_v3_get_ich_vtr_el2() to __vgic_v3_get_gic_config()
  KVM: arm64: Don't access PMSELR_EL0/PMUSERENR_EL0 when no PMU is available
  KVM: arm64: Turn kvm_arm_support_pmu_v3() into a static key
  KVM: arm64: Fix nVHE hyp panic host context restore
  KVM: arm64: Avoid corrupting vCPU context register in guest exit
  KVM: arm64: nvhe: Save the SPE context early
  kvm: x86: use NULL instead of using plain integer as pointer
  KVM: SVM: Connect 'npt' module param to KVM's internal 'npt_enabled'
  KVM: x86: Ensure deadline timer has truly expired before posting its IRQ

3 years agoMerge branch 'akpm' (patches from Andrew)
Linus Torvalds [Sun, 14 Mar 2021 19:23:34 +0000 (12:23 -0700)]
Merge branch 'akpm' (patches from Andrew)

Merge misc fixes from Andrew Morton:
 "28 patches.

  Subsystems affected by this series: mm (memblock, pagealloc, hugetlb,
  highmem, kfence, oom-kill, madvise, kasan, userfaultfd, memcg, and
  zram), core-kernel, kconfig, fork, binfmt, MAINTAINERS, kbuild, and
  ia64"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (28 commits)
  zram: fix broken page writeback
  zram: fix return value on writeback_store
  mm/memcg: set memcg when splitting page
  mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg and add nr_pages argument
  ia64: fix ptrace(PTRACE_SYSCALL_INFO_EXIT) sign
  ia64: fix ia64_syscall_get_set_arguments() for break-based syscalls
  mm/userfaultfd: fix memory corruption due to writeprotect
  kasan: fix KASAN_STACK dependency for HW_TAGS
  kasan, mm: fix crash with HW_TAGS and DEBUG_PAGEALLOC
  mm/madvise: replace ptrace attach requirement for process_madvise
  include/linux/sched/mm.h: use rcu_dereference in in_vfork()
  kfence: fix reports if constant function prefixes exist
  kfence, slab: fix cache_alloc_debugcheck_after() for bulk allocations
  kfence: fix printk format for ptrdiff_t
  linux/compiler-clang.h: define HAVE_BUILTIN_BSWAP*
  MAINTAINERS: exclude uapi directories in API/ABI section
  binfmt_misc: fix possible deadlock in bm_register_write
  mm/highmem.c: fix zero_user_segments() with start > end
  hugetlb: do early cow when page pinned on src mm
  mm: use is_cow_mapping() across tree where proper
  ...

3 years agoMerge tag 'irqchip-fixes-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git...
Thomas Gleixner [Sun, 14 Mar 2021 15:34:35 +0000 (16:34 +0100)]
Merge tag 'irqchip-fixes-5.12-1' of git://git./linux/kernel/git/maz/arm-platforms into irq/urgent

Pull irqchip fixes from Marc Zyngier:

  - More compatible strings for the Ingenic irqchip (introducing the
    JZ4760B SoC)
  - Select GENERIC_IRQ_MULTI_HANDLER on the ARM ep93xx platform
  - Drop all GENERIC_IRQ_MULTI_HANDLER selections from the irqchip
    Kconfig, now relying on the architecture to get it right
  - Drop the debugfs_file field from struct irq_domain, now that
    debugfs can track things on its own

3 years agoMerge tag 'char-misc-5.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregk...
Linus Torvalds [Sat, 13 Mar 2021 20:38:44 +0000 (12:38 -0800)]
Merge tag 'char-misc-5.12-rc3' of git://git./linux/kernel/git/gregkh/char-misc

Pull char/misc driver fixes from Greg KH:
 "Here are some small misc/char driver fixes to resolve some reported
  problems:

   - habanalabs driver fixes

   - Acrn build fixes (reported many times)

   - pvpanic module table export fix

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'char-misc-5.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  misc/pvpanic: Export module FDT device table
  misc: fastrpc: restrict user apps from sending kernel RPC messages
  virt: acrn: Correct type casting of argument of copy_from_user()
  virt: acrn: Use EPOLLIN instead of POLLIN
  virt: acrn: Use vfs_poll() instead of f_op->poll()
  virt: acrn: Make remove_cpu sysfs invisible with !CONFIG_HOTPLUG_CPU
  cpu/hotplug: Fix build error of using {add,remove}_cpu() with !CONFIG_SMP
  habanalabs: fix debugfs address translation
  habanalabs: Disable file operations after device is removed
  habanalabs: Call put_pid() when releasing control device
  drivers: habanalabs: remove unused dentry pointer for debugfs files
  habanalabs: mark hl_eq_inc_ptr() as static

3 years agoMerge tag 'staging-5.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh...
Linus Torvalds [Sat, 13 Mar 2021 20:36:53 +0000 (12:36 -0800)]
Merge tag 'staging-5.12-rc3' of git://git./linux/kernel/git/gregkh/staging

Pull staging driver fixes from Greg KH:
 "Here are some small staging driver fixes for reported problems. They
  include:

   - wfx header file cleanup patch reverted as it could cause problems

   - comedi driver endian fixes

   - buffer overflow problems for staging wifi drivers

   - build dependency issue for rtl8192e driver

  All have been in linux-next for a while with no reported problems"

* tag 'staging-5.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (23 commits)
  Revert "staging: wfx: remove unused included header files"
  staging: rtl8188eu: prevent ->ssid overflow in rtw_wx_set_scan()
  staging: rtl8188eu: fix potential memory corruption in rtw_check_beacon_data()
  staging: rtl8192u: fix ->ssid overflow in r8192_wx_set_scan()
  staging: comedi: pcl726: Use 16-bit 0 for interrupt data
  staging: comedi: ni_65xx: Use 16-bit 0 for interrupt data
  staging: comedi: ni_6527: Use 16-bit 0 for interrupt data
  staging: comedi: comedi_parport: Use 16-bit 0 for interrupt data
  staging: comedi: amplc_pc236_common: Use 16-bit 0 for interrupt data
  staging: comedi: pcl818: Fix endian problem for AI command data
  staging: comedi: pcl711: Fix endian problem for AI command data
  staging: comedi: me4000: Fix endian problem for AI command data
  staging: comedi: dmm32at: Fix endian problem for AI command data
  staging: comedi: das800: Fix endian problem for AI command data
  staging: comedi: das6402: Fix endian problem for AI command data
  staging: comedi: adv_pci1710: Fix endian problem for AI command data
  staging: comedi: addi_apci_1500: Fix endian problem for command sample
  staging: comedi: addi_apci_1032: Fix endian problem for COS sample
  staging: ks7010: prevent buffer overflow in ks_wlan_set_scan()
  staging: rtl8712: Fix possible buffer overflow in r8712_sitesurvey_cmd
  ...

3 years agoMerge tag 'tty-5.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Linus Torvalds [Sat, 13 Mar 2021 20:34:29 +0000 (12:34 -0800)]
Merge tag 'tty-5.12-rc3' of git://git./linux/kernel/git/gregkh/tty

Pull tty/serial fixes from Greg KH:
 "Here are some small tty and serial driver fixes to resolve some
  reported problems:

   - led tty trigger fixes based on review and were acked by the led
     maintainer

   - revert a max310x serial driver patch as it was causing problems

   - revert a pty change as it was also causing problems

  All of these have been in linux-next for a while with no reported
  problems"

* tag 'tty-5.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  Revert "drivers:tty:pty: Fix a race causing data loss on close"
  Revert "serial: max310x: rework RX interrupt handling"
  leds: trigger/tty: Use led_set_brightness_sync() from workqueue
  leds: trigger: Fix error path to not unlock the unlocked mutex

3 years agoMerge tag 'usb-5.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Linus Torvalds [Sat, 13 Mar 2021 20:32:57 +0000 (12:32 -0800)]
Merge tag 'usb-5.12-rc3' of git://git./linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
 "Here are a small number of USB fixes for 5.12-rc3 to resolve a bunch
  of reported issues:

   - usbip fixups for issues found by syzbot

   - xhci driver fixes and quirk additions

   - gadget driver fixes

   - dwc3 QCOM driver fix

   - usb-serial new ids and fixes

   - usblp fix for a long-time issue

   - cdc-acm quirk addition

   - other tiny fixes for reported problems

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'usb-5.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (25 commits)
  xhci: Fix repeated xhci wake after suspend due to uncleared internal wake state
  usb: xhci: Fix ASMedia ASM1042A and ASM3242 DMA addressing
  xhci: Improve detection of device initiated wake signal.
  usb: xhci: do not perform Soft Retry for some xHCI hosts
  usbip: fix vudc usbip_sockfd_store races leading to gpf
  usbip: fix vhci_hcd attach_store() races leading to gpf
  usbip: fix stub_dev usbip_sockfd_store() races leading to gpf
  usbip: fix vudc to check for stream socket
  usbip: fix vhci_hcd to check for stream socket
  usbip: fix stub_dev to check for stream socket
  usb: dwc3: qcom: Add missing DWC3 OF node refcount decrement
  USB: usblp: fix a hang in poll() if disconnected
  USB: gadget: udc: s3c2410_udc: fix return value check in s3c2410_udc_probe()
  usb: renesas_usbhs: Clear PIPECFG for re-enabling pipe with other EPNUM
  usb: dwc3: qcom: Honor wakeup enabled/disabled state
  usb: gadget: f_uac1: stop playback on function disable
  usb: gadget: f_uac2: always increase endpoint max_packet_size by one audio slot
  USB: gadget: u_ether: Fix a configfs return code
  usb: dwc3: qcom: add ACPI device id for sc8180x
  Goodix Fingerprint device is not a modem
  ...

3 years agoMerge tag 'erofs-for-5.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang...
Linus Torvalds [Sat, 13 Mar 2021 20:26:22 +0000 (12:26 -0800)]
Merge tag 'erofs-for-5.12-rc3' of git://git./linux/kernel/git/xiang/erofs

Pull erofs fix from Gao Xiang:
 "Fix an urgent regression introduced by commit baa2c7c97153 ("block:
  set .bi_max_vecs as actual allocated vector number"), which could
  cause unexpected hung since linux 5.12-rc1.

  Resolve it by avoiding using bio->bi_max_vecs completely"

* tag 'erofs-for-5.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: fix bio->bi_max_vecs behavior change

3 years agoMerge tag 'kbuild-fixes-v5.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sat, 13 Mar 2021 20:18:59 +0000 (12:18 -0800)]
Merge tag 'kbuild-fixes-v5.12-2' of git://git./linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

 - avoid 'make image_name' invoking syncconfig

 - fix a couple of bugs in scripts/dummy-tools

 - fix LLD_VENDOR and locale issues in scripts/ld-version.sh

 - rebuild GCC plugins when the compiler is upgraded

 - allow LTO to be enabled with KASAN_HW_TAGS

 - allow LTO to be enabled without LLVM=1

* tag 'kbuild-fixes-v5.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kbuild: fix ld-version.sh to not be affected by locale
  kbuild: remove meaningless parameter to $(call if_changed_rule,dtc)
  kbuild: remove LLVM=1 test from HAS_LTO_CLANG
  kbuild: remove unneeded -O option to dtc
  kbuild: dummy-tools: adjust to scripts/cc-version.sh
  kbuild: Allow LTO to be selected with KASAN_HW_TAGS
  kbuild: dummy-tools: support MPROFILE_KERNEL checks for ppc
  kbuild: rebuild GCC plugins when the compiler is upgraded
  kbuild: Fix ld-version.sh script if LLD was built with LLD_VENDOR
  kbuild: dummy-tools: fix inverted tests for gcc
  kbuild: add image_name to no-sync-config-targets

3 years agozram: fix broken page writeback
Minchan Kim [Sat, 13 Mar 2021 05:08:41 +0000 (21:08 -0800)]
zram: fix broken page writeback

commit 0d8359620d9b ("zram: support page writeback") introduced two
problems.  It overwrites writeback_store's return value as kstrtol's
return value, which makes return value zero so user could see zero as
return value of write syscall even though it wrote data successfully.

It also breaks index value in the loop in that it doesn't increase the
index any longer.  It means it can write only first starting block index
so user couldn't write all idle pages in the zram so lose memory saving
chance.

This patch fixes those issues.

Link: https://lkml.kernel.org/r/20210312173949.2197662-2-minchan@kernel.org
Fixes: 0d8359620d9b("zram: support page writeback")
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Amos Bianchi <amosbianchi@google.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: John Dias <joaodias@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agozram: fix return value on writeback_store
Minchan Kim [Sat, 13 Mar 2021 05:08:38 +0000 (21:08 -0800)]
zram: fix return value on writeback_store

writeback_store's return value is overwritten by submit_bio_wait's return
value.  Thus, writeback_store will return zero since there was no IO
error.  In the end, write syscall from userspace will see the zero as
return value, which could make the process stall to keep trying the write
until it will succeed.

Link: https://lkml.kernel.org/r/20210312173949.2197662-1-minchan@kernel.org
Fixes: 3b82a051c101("drivers/block/zram/zram_drv.c: fix error return codes not being returned in writeback_store")
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: John Dias <joaodias@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/memcg: set memcg when splitting page
Zhou Guanghui [Sat, 13 Mar 2021 05:08:33 +0000 (21:08 -0800)]
mm/memcg: set memcg when splitting page

As described in the split_page() comment, for the non-compound high order
page, the sub-pages must be freed individually.  If the memcg of the first
page is valid, the tail pages cannot be uncharged when be freed.

For example, when alloc_pages_exact is used to allocate 1MB continuous
physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
set).  When make_alloc_exact free the unused 1MB and free_pages_exact free
the applied 1MB, actually, only 4KB(one page) is uncharged.

Therefore, the memcg of the tail page needs to be set when splitting a
page.

Michel:

There are at least two explicit users of __GFP_ACCOUNT with
alloc_exact_pages added recently.  See 7efe8ef274024 ("KVM: arm64:
Allocate stage-2 pgd pages with GFP_KERNEL_ACCOUNT") and c419621873713
("KVM: s390: Add memcg accounting to KVM allocations"), so this is not
just a theoretical issue.

Link: https://lkml.kernel.org/r/20210304074053.65527-3-zhouguanghui1@huawei.com
Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Rui Xiang <rui.xiang@huawei.com>
Cc: Tianhong Ding <dingtianhong@huawei.com>
Cc: Weilong Chen <chenweilong@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg and add nr_pages...
Zhou Guanghui [Sat, 13 Mar 2021 05:08:30 +0000 (21:08 -0800)]
mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg and add nr_pages argument

Rename mem_cgroup_split_huge_fixup to split_page_memcg and explicitly pass
in page number argument.

In this way, the interface name is more common and can be used by
potential users.  In addition, the complete info(memcg and flag) of the
memcg needs to be set to the tail pages.

Link: https://lkml.kernel.org/r/20210304074053.65527-2-zhouguanghui1@huawei.com
Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Tianhong Ding <dingtianhong@huawei.com>
Cc: Weilong Chen <chenweilong@huawei.com>
Cc: Rui Xiang <rui.xiang@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoia64: fix ptrace(PTRACE_SYSCALL_INFO_EXIT) sign
Sergei Trofimovich [Sat, 13 Mar 2021 05:08:27 +0000 (21:08 -0800)]
ia64: fix ptrace(PTRACE_SYSCALL_INFO_EXIT) sign

In https://bugs.gentoo.org/769614 Dmitry noticed that
`ptrace(PTRACE_GET_SYSCALL_INFO)` does not return error sign properly.

The bug is in mismatch between get/set errors:

static inline long syscall_get_error(struct task_struct *task,
                                     struct pt_regs *regs)
{
        return regs->r10 == -1 ? regs->r8:0;
}

static inline long syscall_get_return_value(struct task_struct *task,
                                            struct pt_regs *regs)
{
        return regs->r8;
}

static inline void syscall_set_return_value(struct task_struct *task,
                                            struct pt_regs *regs,
                                            int error, long val)
{
        if (error) {
                /* error < 0, but ia64 uses > 0 return value */
                regs->r8 = -error;
                regs->r10 = -1;
        } else {
                regs->r8 = val;
                regs->r10 = 0;
        }
}

Tested on v5.10 on rx3600 machine (ia64 9040 CPU).

Link: https://lkml.kernel.org/r/20210221002554.333076-2-slyfox@gentoo.org
Link: https://bugs.gentoo.org/769614
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Reported-by: Dmitry V. Levin <ldv@altlinux.org>
Reviewed-by: Dmitry V. Levin <ldv@altlinux.org>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoia64: fix ia64_syscall_get_set_arguments() for break-based syscalls
Sergei Trofimovich [Sat, 13 Mar 2021 05:08:23 +0000 (21:08 -0800)]
ia64: fix ia64_syscall_get_set_arguments() for break-based syscalls

In https://bugs.gentoo.org/769614 Dmitry noticed that
`ptrace(PTRACE_GET_SYSCALL_INFO)` does not work for syscalls called via
glibc's syscall() wrapper.

ia64 has two ways to call syscalls from userspace: via `break` and via
`eps` instructions.

The difference is in stack layout:

1. `eps` creates simple stack frame: no locals, in{0..7} == out{0..8}
2. `break` uses userspace stack frame: may be locals (glibc provides
   one), in{0..7} == out{0..8}.

Both work fine in syscall handling cde itself.

But `ptrace(PTRACE_GET_SYSCALL_INFO)` uses unwind mechanism to
re-extract syscall arguments but it does not account for locals.

The change always skips locals registers. It should not change `eps`
path as kernel's handler already enforces locals=0 and fixes `break`.

Tested on v5.10 on rx3600 machine (ia64 9040 CPU).

Link: https://lkml.kernel.org/r/20210221002554.333076-1-slyfox@gentoo.org
Link: https://bugs.gentoo.org/769614
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Reported-by: Dmitry V. Levin <ldv@altlinux.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/userfaultfd: fix memory corruption due to writeprotect
Nadav Amit [Sat, 13 Mar 2021 05:08:17 +0000 (21:08 -0800)]
mm/userfaultfd: fix memory corruption due to writeprotect

Userfaultfd self-test fails occasionally, indicating a memory corruption.

Analyzing this problem indicates that there is a real bug since mmap_lock
is only taken for read in mwriteprotect_range() and defers flushes, and
since there is insufficient consideration of concurrent deferred TLB
flushes in wp_page_copy().  Although the PTE is flushed from the TLBs in
wp_page_copy(), this flush takes place after the copy has already been
performed, and therefore changes of the page are possible between the time
of the copy and the time in which the PTE is flushed.

To make matters worse, memory-unprotection using userfaultfd also poses a
problem.  Although memory unprotection is logically a promotion of PTE
permissions, and therefore should not require a TLB flush, the current
userrfaultfd code might actually cause a demotion of the architectural PTE
permission: when userfaultfd_writeprotect() unprotects memory region, it
unintentionally *clears* the RW-bit if it was already set.  Note that this
unprotecting a PTE that is not write-protected is a valid use-case: the
userfaultfd monitor might ask to unprotect a region that holds both
write-protected and write-unprotected PTEs.

The scenario that happens in selftests/vm/userfaultfd is as follows:

cpu0 cpu1 cpu2
---- ---- ----
[ Writable PTE
  cached in TLB ]
userfaultfd_writeprotect()
[ write-*unprotect* ]
mwriteprotect_range()
mmap_read_lock()
change_protection()

change_protection_range()
...
change_pte_range()
[ *clear* “write”-bit ]
[ defer TLB flushes ]
[ page-fault ]
...
wp_page_copy()
 cow_user_page()
  [ copy page ]
[ write to old
  page ]
...
 set_pte_at_notify()

A similar scenario can happen:

cpu0 cpu1 cpu2 cpu3
---- ---- ---- ----
[ Writable PTE
     cached in TLB ]
userfaultfd_writeprotect()
[ write-protect ]
[ deferred TLB flush ]
userfaultfd_writeprotect()
[ write-unprotect ]
[ deferred TLB flush]
[ page-fault ]
wp_page_copy()
 cow_user_page()
 [ copy page ]
 ... [ write to page ]
set_pte_at_notify()

This race exists since commit 292924b26024 ("userfaultfd: wp: apply
_PAGE_UFFD_WP bit").  Yet, as Yu Zhao pointed, these races became apparent
since commit 09854ba94c6a ("mm: do_wp_page() simplification") which made
wp_page_copy() more likely to take place, specifically if page_count(page)
> 1.

To resolve the aforementioned races, check whether there are pending
flushes on uffd-write-protected VMAs, and if there are, perform a flush
before doing the COW.

Further optimizations will follow to avoid during uffd-write-unprotect
unnecassary PTE write-protection and TLB flushes.

Link: https://lkml.kernel.org/r/20210304095423.3825684-1-namit@vmware.com
Fixes: 09854ba94c6a ("mm: do_wp_page() simplification")
Signed-off-by: Nadav Amit <namit@vmware.com>
Suggested-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Tested-by: Peter Xu <peterx@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org> [5.9+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agokasan: fix KASAN_STACK dependency for HW_TAGS
Andrey Konovalov [Sat, 13 Mar 2021 05:08:13 +0000 (21:08 -0800)]
kasan: fix KASAN_STACK dependency for HW_TAGS

There's a runtime failure when running HW_TAGS-enabled kernel built with
GCC on hardware that doesn't support MTE.  GCC-built kernels always have
CONFIG_KASAN_STACK enabled, even though stack instrumentation isn't
supported by HW_TAGS.  Having that config enabled causes KASAN to issue
MTE-only instructions to unpoison kernel stacks, which causes the failure.

Fix the issue by disallowing CONFIG_KASAN_STACK when HW_TAGS is used.

(The commit that introduced CONFIG_KASAN_HW_TAGS specified proper
 dependency for CONFIG_KASAN_STACK_ENABLE but not for CONFIG_KASAN_STACK.)

Link: https://lkml.kernel.org/r/59e75426241dbb5611277758c8d4d6f5f9298dac.1615215441.git.andreyknvl@google.com
Fixes: 6a63a63ff1ac ("kasan: introduce CONFIG_KASAN_HW_TAGS")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reported-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: <stable@vger.kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agokasan, mm: fix crash with HW_TAGS and DEBUG_PAGEALLOC
Andrey Konovalov [Sat, 13 Mar 2021 05:08:10 +0000 (21:08 -0800)]
kasan, mm: fix crash with HW_TAGS and DEBUG_PAGEALLOC

Currently, kasan_free_nondeferred_pages()->kasan_free_pages() is called
after debug_pagealloc_unmap_pages(). This causes a crash when
debug_pagealloc is enabled, as HW_TAGS KASAN can't set tags on an
unmapped page.

This patch puts kasan_free_nondeferred_pages() before
debug_pagealloc_unmap_pages() and arch_free_page(), which can also make
the page unavailable.

Link: https://lkml.kernel.org/r/24cd7db274090f0e5bc3adcdc7399243668e3171.1614987311.git.andreyknvl@google.com
Fixes: 94ab5b61ee16 ("kasan, arm64: enable CONFIG_KASAN_HW_TAGS")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/madvise: replace ptrace attach requirement for process_madvise
Suren Baghdasaryan [Sat, 13 Mar 2021 05:08:06 +0000 (21:08 -0800)]
mm/madvise: replace ptrace attach requirement for process_madvise

process_madvise currently requires ptrace attach capability.
PTRACE_MODE_ATTACH gives one process complete control over another
process.  It effectively removes the security boundary between the two
processes (in one direction).  Granting ptrace attach capability even to a
system process is considered dangerous since it creates an attack surface.
This severely limits the usage of this API.

The operations process_madvise can perform do not affect the correctness
of the operation of the target process; they only affect where the data is
physically located (and therefore, how fast it can be accessed).  What we
want is the ability for one process to influence another process in order
to optimize performance across the entire system while leaving the
security boundary intact.

Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ and
CAP_SYS_NICE.  PTRACE_MODE_READ to prevent leaking ASLR metadata and
CAP_SYS_NICE for influencing process performance.

Link: https://lkml.kernel.org/r/20210303185807.2160264-1-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: <stable@vger.kernel.org> [5.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoinclude/linux/sched/mm.h: use rcu_dereference in in_vfork()
Matthew Wilcox (Oracle) [Sat, 13 Mar 2021 05:08:03 +0000 (21:08 -0800)]
include/linux/sched/mm.h: use rcu_dereference in in_vfork()

Fix a sparse warning by using rcu_dereference().  Technically this is a
bug and a sufficiently aggressive compiler could reload the `real_parent'
pointer outside the protection of the rcu lock (and access freed memory),
but I think it's pretty unlikely to happen.

Link: https://lkml.kernel.org/r/20210221194207.1351703-1-willy@infradead.org
Fixes: b18dc5f291c0 ("mm, oom: skip vforked tasks from being selected")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agokfence: fix reports if constant function prefixes exist
Marco Elver [Sat, 13 Mar 2021 05:08:00 +0000 (21:08 -0800)]
kfence: fix reports if constant function prefixes exist

Some architectures prefix all functions with a constant string ('.' on
ppc64).  Add ARCH_FUNC_PREFIX, which may optionally be defined in
<asm/kfence.h>, so that get_stack_skipnr() can work properly.

Link: https://lkml.kernel.org/r/f036c53d-7e81-763c-47f4-6024c6c5f058@csgroup.eu
Link: https://lkml.kernel.org/r/20210304144000.1148590-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agokfence, slab: fix cache_alloc_debugcheck_after() for bulk allocations
Marco Elver [Sat, 13 Mar 2021 05:07:53 +0000 (21:07 -0800)]
kfence, slab: fix cache_alloc_debugcheck_after() for bulk allocations

cache_alloc_debugcheck_after() performs checks on an object, including
adjusting the returned pointer.  None of this should apply to KFENCE
objects.  While for non-bulk allocations, the checks are skipped when we
allocate via KFENCE, for bulk allocations cache_alloc_debugcheck_after()
is called via cache_alloc_debugcheck_after_bulk().

Fix it by skipping cache_alloc_debugcheck_after() for KFENCE objects.

Link: https://lkml.kernel.org/r/20210304205256.2162309-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agokfence: fix printk format for ptrdiff_t
Marco Elver [Sat, 13 Mar 2021 05:07:50 +0000 (21:07 -0800)]
kfence: fix printk format for ptrdiff_t

Use %td for ptrdiff_t.

Link: https://lkml.kernel.org/r/3abbe4c9-16ad-c168-a90f-087978ccd8f7@csgroup.eu
Link: https://lkml.kernel.org/r/20210303121157.3430807-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agolinux/compiler-clang.h: define HAVE_BUILTIN_BSWAP*
Arnd Bergmann [Sat, 13 Mar 2021 05:07:47 +0000 (21:07 -0800)]
linux/compiler-clang.h: define HAVE_BUILTIN_BSWAP*

Separating compiler-clang.h from compiler-gcc.h inadventently dropped the
definitions of the three HAVE_BUILTIN_BSWAP macros, which requires falling
back to the open-coded version and hoping that the compiler detects it.

Since all versions of clang support the __builtin_bswap interfaces, add
back the flags and have the headers pick these up automatically.

This results in a 4% improvement of compilation speed for arm defconfig.

Note: it might also be worth revisiting which architectures set
CONFIG_ARCH_USE_BUILTIN_BSWAP for one compiler or the other, today this is
set on six architectures (arm32, csky, mips, powerpc, s390, x86), while
another ten architectures define custom helpers (alpha, arc, ia64, m68k,
mips, nios2, parisc, sh, sparc, xtensa), and the rest (arm64, h8300,
hexagon, microblaze, nds32, openrisc, riscv) just get the unoptimized
version and rely on the compiler to detect it.

A long time ago, the compiler builtins were architecture specific, but
nowadays, all compilers that are able to build the kernel have correct
implementations of them, though some may not be as optimized as the inline
asm versions.

The patch that dropped the optimization landed in v4.19, so as discussed
it would be fairly safe to backport this revert to stable kernels to the
4.19/5.4/5.10 stable kernels, but there is a remaining risk for
regressions, and it has no known side-effects besides compile speed.

Link: https://lkml.kernel.org/r/20210226161151.2629097-1-arnd@kernel.org
Link: https://lore.kernel.org/lkml/20210225164513.3667778-1-arnd@kernel.org/
Fixes: 815f0ddb346c ("include/linux/compiler*.h: make compiler-*.h mutually exclusive")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Miguel Ojeda <ojeda@kernel.org>
Acked-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Guo Ren <guoren@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Arvind Sankar <nivedita@alum.mit.edu>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoMAINTAINERS: exclude uapi directories in API/ABI section
Vlastimil Babka [Sat, 13 Mar 2021 05:07:44 +0000 (21:07 -0800)]
MAINTAINERS: exclude uapi directories in API/ABI section

Commit 7b4693e644cb ("MAINTAINERS: add uapi directories to API/ABI
section") added include/uapi/ and arch/*/include/uapi/ so that patches
modifying them CC linux-api.  However that was already done in the past
and resulted in too much noise and thus later removed, as explained in
b14fd334ff3d ("MAINTAINERS: trim the file triggers for ABI/API")

To prevent another round of addition and removal in the future, change the
entries to X: (explicit exclusion) for documentation purposes, although
they are not subdirectories of broader included directories, as there is
apparently no defined way to add plain comments in subsystem sections.

Link: https://lkml.kernel.org/r/20210301100255.25229-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Michael Kerrisk (man-pages) <mtk.manpages@gmail.com>
Acked-by: Michael Kerrisk (man-pages) <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agobinfmt_misc: fix possible deadlock in bm_register_write
Lior Ribak [Sat, 13 Mar 2021 05:07:41 +0000 (21:07 -0800)]
binfmt_misc: fix possible deadlock in bm_register_write

There is a deadlock in bm_register_write:

First, in the begining of the function, a lock is taken on the binfmt_misc
root inode with inode_lock(d_inode(root)).

Then, if the user used the MISC_FMT_OPEN_FILE flag, the function will call
open_exec on the user-provided interpreter.

open_exec will call a path lookup, and if the path lookup process includes
the root of binfmt_misc, it will try to take a shared lock on its inode
again, but it is already locked, and the code will get stuck in a deadlock

To reproduce the bug:
$ echo ":iiiii:E::ii::/proc/sys/fs/binfmt_misc/bla:F" > /proc/sys/fs/binfmt_misc/register

backtrace of where the lock occurs (#5):
0  schedule () at ./arch/x86/include/asm/current.h:15
1  0xffffffff81b51237 in rwsem_down_read_slowpath (sem=0xffff888003b202e0, count=<optimized out>, state=state@entry=2) at kernel/locking/rwsem.c:992
2  0xffffffff81b5150a in __down_read_common (state=2, sem=<optimized out>) at kernel/locking/rwsem.c:1213
3  __down_read (sem=<optimized out>) at kernel/locking/rwsem.c:1222
4  down_read (sem=<optimized out>) at kernel/locking/rwsem.c:1355
5  0xffffffff811ee22a in inode_lock_shared (inode=<optimized out>) at ./include/linux/fs.h:783
6  open_last_lookups (op=0xffffc9000022fe34, file=0xffff888004098600, nd=0xffffc9000022fd10) at fs/namei.c:3177
7  path_openat (nd=nd@entry=0xffffc9000022fd10, op=op@entry=0xffffc9000022fe34, flags=flags@entry=65) at fs/namei.c:3366
8  0xffffffff811efe1c in do_filp_open (dfd=<optimized out>, pathname=pathname@entry=0xffff8880031b9000, op=op@entry=0xffffc9000022fe34) at fs/namei.c:3396
9  0xffffffff811e493f in do_open_execat (fd=fd@entry=-100, name=name@entry=0xffff8880031b9000, flags=<optimized out>, flags@entry=0) at fs/exec.c:913
10 0xffffffff811e4a92 in open_exec (name=<optimized out>) at fs/exec.c:948
11 0xffffffff8124aa84 in bm_register_write (file=<optimized out>, buffer=<optimized out>, count=19, ppos=<optimized out>) at fs/binfmt_misc.c:682
12 0xffffffff811decd2 in vfs_write (file=file@entry=0xffff888004098500, buf=buf@entry=0xa758d0 ":iiiii:E::ii::i:CF
", count=count@entry=19, pos=pos@entry=0xffffc9000022ff10) at fs/read_write.c:603
13 0xffffffff811defda in ksys_write (fd=<optimized out>, buf=0xa758d0 ":iiiii:E::ii::i:CF
", count=19) at fs/read_write.c:658
14 0xffffffff81b49813 in do_syscall_64 (nr=<optimized out>, regs=0xffffc9000022ff58) at arch/x86/entry/common.c:46
15 0xffffffff81c0007c in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120

To solve the issue, the open_exec call is moved to before the write
lock is taken by bm_register_write

Link: https://lkml.kernel.org/r/20210228224414.95962-1-liorribak@gmail.com
Fixes: 948b701a607f1 ("binfmt_misc: add persistent opened binary handler for containers")
Signed-off-by: Lior Ribak <liorribak@gmail.com>
Acked-by: Helge Deller <deller@gmx.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/highmem.c: fix zero_user_segments() with start > end
OGAWA Hirofumi [Sat, 13 Mar 2021 05:07:37 +0000 (21:07 -0800)]
mm/highmem.c: fix zero_user_segments() with start > end

zero_user_segments() is used from __block_write_begin_int(), for example
like the following

zero_user_segments(page, 4096, 1024, 512, 918)

But new the zero_user_segments() implementation for for HIGHMEM +
TRANSPARENT_HUGEPAGE doesn't handle "start > end" case correctly, and hits
BUG_ON().  (we can fix __block_write_begin_int() instead though, it is the
old and multiple usage)

Also it calls kmap_atomic() unnecessarily while start == end == 0.

Link: https://lkml.kernel.org/r/87v9ab60r4.fsf@mail.parknet.co.jp
Fixes: 0060ef3b4e6d ("mm: support THPs in zero_user_segments")
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agohugetlb: do early cow when page pinned on src mm
Peter Xu [Sat, 13 Mar 2021 05:07:33 +0000 (21:07 -0800)]
hugetlb: do early cow when page pinned on src mm

This is the last missing piece of the COW-during-fork effort when there're
pinned pages found.  One can reference 70e806e4e645 ("mm: Do early cow for
pinned pages during fork() for ptes", 2020-09-27) for more information,
since we do similar things here rather than pte this time, but just for
hugetlb.

Note that after Jason's recent work on 57efa1fe5957 ("mm/gup: prevent
gup_fast from racing with COW during fork", 2020-12-15) which is safer and
easier to understand, we're safe now within the whole copy_page_range()
against gup-fast, we don't need the wr-protect trick that proposed in
70e806e4e645 anymore.

Link: https://lkml.kernel.org/r/20210217233547.93892-6-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Gal Pressman <galpress@amazon.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Roland Scheidegger <sroland@vmware.com>
Cc: VMware Graphics <linux-graphics-maintainer@vmware.com>
Cc: Wei Zhang <wzam@amazon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm: use is_cow_mapping() across tree where proper
Peter Xu [Sat, 13 Mar 2021 05:07:30 +0000 (21:07 -0800)]
mm: use is_cow_mapping() across tree where proper

After is_cow_mapping() is exported in mm.h, replace some manual checks
elsewhere throughout the tree but start to use the new helper.

Link: https://lkml.kernel.org/r/20210217233547.93892-5-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca>
Cc: VMware Graphics <linux-graphics-maintainer@vmware.com>
Cc: Roland Scheidegger <sroland@vmware.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Gal Pressman <galpress@amazon.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Wei Zhang <wzam@amazon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm: introduce page_needs_cow_for_dma() for deciding whether cow
Peter Xu [Sat, 13 Mar 2021 05:07:26 +0000 (21:07 -0800)]
mm: introduce page_needs_cow_for_dma() for deciding whether cow

We've got quite a few places (pte, pmd, pud) that explicitly checked
against whether we should break the cow right now during fork().  It's
easier to provide a helper, especially before we work the same thing on
hugetlbfs.

Since we'll reference is_cow_mapping() in mm.h, move it there too.
Actually it suites mm.h more since internal.h is mm/ only, but mm.h is
exported to the whole kernel.  With that we should expect another patch to
use is_cow_mapping() whenever we can across the kernel since we do use it
quite a lot but it's always done with raw code against VM_* flags.

Link: https://lkml.kernel.org/r/20210217233547.93892-4-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Gal Pressman <galpress@amazon.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Roland Scheidegger <sroland@vmware.com>
Cc: VMware Graphics <linux-graphics-maintainer@vmware.com>
Cc: Wei Zhang <wzam@amazon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agohugetlb: break earlier in add_reservation_in_range() when we can
Peter Xu [Sat, 13 Mar 2021 05:07:22 +0000 (21:07 -0800)]
hugetlb: break earlier in add_reservation_in_range() when we can

All the regions maintained in hugetlb reserved map is inclusive on "from"
but exclusive on "to".  We can break earlier even if rg->from==t because
it already means no possible intersection.

This does not need a Fixes in all cases because when it happens
(rg->from==t) we'll not break out of the loop while we should, however the
next thing we'd do is still add the last file_region we'd need and quit
the loop in the next round.  So this change is not a bugfix (since the old
code should still run okay iiuc), but we'd better still touch it up to
make it logically sane.

Link: https://lkml.kernel.org/r/20210217233547.93892-3-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Gal Pressman <galpress@amazon.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Roland Scheidegger <sroland@vmware.com>
Cc: VMware Graphics <linux-graphics-maintainer@vmware.com>
Cc: Wei Zhang <wzam@amazon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agohugetlb: dedup the code to add a new file_region
Peter Xu [Sat, 13 Mar 2021 05:07:18 +0000 (21:07 -0800)]
hugetlb: dedup the code to add a new file_region

Patch series "mm/hugetlb: Early cow on fork, and a few cleanups", v5.

As reported by Gal [1], we still miss the code clip to handle early cow
for hugetlb case, which is true.  Again, it still feels odd to fork()
after using a few huge pages, especially if they're privately mapped to
me..  However I do agree with Gal and Jason in that we should still have
that since that'll complete the early cow on fork effort at least, and
it'll still fix issues where buffers are not well under control and not
easy to apply MADV_DONTFORK.

The first two patches (1-2) are some cleanups I noticed when reading into
the hugetlb reserve map code.  I think it's good to have but they're not
necessary for fixing the fork issue.

The last two patches (3-4) are the real fix.

I tested this with a fork() after some vfio-pci assignment, so I'm pretty
sure the page copy path could trigger well (page will be accounted right
after the fork()), but I didn't do data check since the card I assigned is
some random nic.

  https://github.com/xzpeter/linux/tree/fork-cow-pin-huge

[1] https://lore.kernel.org/lkml/27564187-4a08-f187-5a84-3df50009f6ca@amazon.com/

Introduce hugetlb_resv_map_add() helper to add a new file_region rather
than duplication the similar code twice in add_reservation_in_range().

Link: https://lkml.kernel.org/r/20210217233547.93892-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20210217233547.93892-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Gal Pressman <galpress@amazon.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Wei Zhang <wzam@amazon.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jann Horn <jannh@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: Roland Scheidegger <sroland@vmware.com>
Cc: VMware Graphics <linux-graphics-maintainer@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>