mm: use pmdp_get_lockless() without surplus barrier()
author	Hugh Dickins <hughd@google.com>
Fri, 9 Jun 2023 01:06:53 +0000 (18:06 -0700)
committer	Andrew Morton <akpm@linux-foundation.org>
Mon, 19 Jun 2023 23:19:12 +0000 (16:19 -0700)
Patch series "mm: allow pte_offset_map[_lock]() to fail", v2.

What is it all about?  Some mmap_lock avoidance i.e.  latency reduction.
Initially just for the case of collapsing shmem or file pages to THPs; but
likely to be relied upon later in other contexts e.g.  freeing of empty
page tables (but that's not work I'm doing).  mmap_write_lock avoidance
when collapsing to anon THPs?  Perhaps, but again that's not work I've
done: a quick attempt was not as easy as the shmem/file case.

I would much prefer not to have to make these small but wide-ranging
changes for such a niche case; but failed to find another way, and have
heard that shmem MADV_COLLAPSE's usefulness is being limited by that
mmap_write_lock it currently requires.

These changes (though of course not these exact patches) have been in
Google's data centre kernel for three years now: we do rely upon them.

What is this preparatory series about?

The current mmap locking will not be enough to guard against that tricky
transition between pmd entry pointing to page table, and empty pmd entry,
and pmd entry pointing to huge page: pte_offset_map() will have to
validate the pmd entry for itself, returning NULL if no page table is
there.  What to do about that varies: sometimes nearby error handling
indicates just to skip it; but in many cases an ACTION_AGAIN or "goto
again" is appropriate (and if that risks an infinite loop, then there must
have been an oops, or pfn 0 mistaken for page table, before).

Given the likely extension to freeing empty page tables, I have not
limited this set of changes to a THP config; and it has been easier, and
sets a better example, if each site is given appropriate handling: even
where deeper study might prove that failure could only happen if the pmd
table were corrupted.

Several of the patches are, or include, cleanup on the way; and by the
end, pmd_trans_unstable() and suchlike are deleted: pte_offset_map() and
pte_offset_map_lock() then handle those original races and more.  Most
uses of pte_lockptr() are deprecated, with pte_offset_map_nolock() taking
its place.

This patch (of 32):

Use pmdp_get_lockless() in preference to READ_ONCE(*pmdp), to get a more
reliable result with PAE (or READ_ONCE as before without PAE); and remove
the unnecessary extra barrier()s which got left behind in its callers.

HOWEVER: Note the small print in linux/pgtable.h, where it was designed
specifically for fast GUP, and depends on interrupts being disabled for
its full guarantee: most callers which have been added (here and before)
do NOT have interrupts disabled, so there is still some need for caution.

Link: https://lkml.kernel.org/r/f35279a9-9ac0-de22-d245-591afbfb4dc@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zack Rusin <zackr@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
fs/userfaultfd.c
include/linux/pgtable.h
mm/gup.c
mm/hmm.c
mm/khugepaged.c
mm/ksm.c
mm/memory.c
mm/mprotect.c
mm/page_vma_mapped.c

index 0fd96d6..f7a0817 100644 (file)
@@ -349,15 +349,7 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
        if (!pud_present(*pud))
                goto out;
        pmd = pmd_offset(pud, address);
-       /*
-        * READ_ONCE must function as a barrier with narrower scope
-        * and it must be equivalent to:
-        *      _pmd = *pmd; barrier();
-        *
-        * This is to deal with the instability (as in
-        * pmd_trans_unstable) of the pmd.
-        */
-       _pmd = READ_ONCE(*pmd);
+       _pmd = pmdp_get_lockless(pmd);
        if (pmd_none(_pmd))
                goto out;
 
index c5a5148..8ec27fe 100644 (file)
@@ -1344,23 +1344,6 @@ static inline int pud_trans_unstable(pud_t *pud)
 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
 {
        pmd_t pmdval = pmdp_get_lockless(pmd);
-       /*
-        * The barrier will stabilize the pmdval in a register or on
-        * the stack so that it will stop changing under the code.
-        *
-        * When CONFIG_TRANSPARENT_HUGEPAGE=y on x86 32bit PAE,
-        * pmdp_get_lockless is allowed to return a not atomic pmdval
-        * (for example pointing to an hugepage that has never been
-        * mapped in the pmd). The below checks will only care about
-        * the low part of the pmd with 32bit PAE x86 anyway, with the
-        * exception of pmd_none(). So the important thing is that if
-        * the low part of the pmd is found null, the high part will
-        * be also null or the pmd_none() check below would be
-        * confused.
-        */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-       barrier();
-#endif
        /*
         * !pmd_present() checks for pmd migration entries
         *
index a718b95..d448fd2 100644 (file)
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -654,11 +654,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
        struct mm_struct *mm = vma->vm_mm;
 
        pmd = pmd_offset(pudp, address);
-       /*
-        * The READ_ONCE() will stabilize the pmdval in a register or
-        * on the stack so that it will stop changing under the code.
-        */
-       pmdval = READ_ONCE(*pmd);
+       pmdval = pmdp_get_lockless(pmd);
        if (pmd_none(pmdval))
                return no_page_table(vma, flags);
        if (!pmd_present(pmdval))
index 6a151c0..e230433 100644 (file)
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -332,7 +332,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
        pmd_t pmd;
 
 again:
-       pmd = READ_ONCE(*pmdp);
+       pmd = pmdp_get_lockless(pmdp);
        if (pmd_none(pmd))
                return hmm_vma_walk_hole(start, end, -1, walk);
 
index 3649ba1..2d206e6 100644 (file)
@@ -959,11 +959,6 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
                return SCAN_PMD_NULL;
 
        pmde = pmdp_get_lockless(*pmd);
-
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-       /* See comments in pmd_none_or_trans_huge_or_clear_bad() */
-       barrier();
-#endif
        if (pmd_none(pmde))
                return SCAN_PMD_NONE;
        if (!pmd_present(pmde))
index 0156bde..df2aa28 100644 (file)
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1194,8 +1194,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
         * without holding anon_vma lock for write.  So when looking for a
         * genuine pmde (in which to find pte), test present and !THP together.
         */
-       pmde = *pmd;
-       barrier();
+       pmde = pmdp_get_lockless(pmd);
        if (!pmd_present(pmde) || pmd_trans_huge(pmde))
                goto out;
 
index 36082fd..221b216 100644 (file)
@@ -4923,18 +4923,9 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
                 * So now it's safe to run pte_offset_map().
                 */
                vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
-               vmf->orig_pte = *vmf->pte;
+               vmf->orig_pte = ptep_get_lockless(vmf->pte);
                vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
 
-               /*
-                * some architectures can have larger ptes than wordsize,
-                * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
-                * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic
-                * accesses.  The code below just needs a consistent view
-                * for the ifs and we later double check anyway with the
-                * ptl lock held. So here a barrier will do.
-                */
-               barrier();
                if (pte_none(vmf->orig_pte)) {
                        pte_unmap(vmf->pte);
                        vmf->pte = NULL;
@@ -5058,9 +5049,8 @@ retry_pud:
                if (!(ret & VM_FAULT_FALLBACK))
                        return ret;
        } else {
-               vmf.orig_pmd = *vmf.pmd;
+               vmf.orig_pmd = pmdp_get_lockless(vmf.pmd);
 
-               barrier();
                if (unlikely(is_swap_pmd(vmf.orig_pmd))) {
                        VM_BUG_ON(thp_migration_supported() &&
                                          !is_pmd_migration_entry(vmf.orig_pmd));
index 92d3d3c..c5a13c0 100644 (file)
@@ -309,11 +309,6 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
 {
        pmd_t pmdval = pmdp_get_lockless(pmd);
 
-       /* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-       barrier();
-#endif
-
        if (pmd_none(pmdval))
                return 1;
        if (pmd_trans_huge(pmdval))
index 4e448cf..64aff67 100644 (file)
@@ -210,7 +210,7 @@ restart:
                 * compiler and used as a stale value after we've observed a
                 * subsequent update.
                 */
-               pmde = READ_ONCE(*pvmw->pmd);
+               pmde = pmdp_get_lockless(pvmw->pmd);
 
                if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde) ||
                    (pmd_present(pmde) && pmd_devmap(pmde))) {