git.monstr.eu Git - linux-2.6-microblaze.git/log

mm: introduce PAGEFLAGS_MASK to replace ((1UL << NR_PAGEFLAGS) - 1)

Instead of hard-coding ((1UL << NR_PAGEFLAGS) - 1) everywhere, introducing
PAGEFLAGS_MASK to make the code clear to get the page flags.

Link: https://lkml.kernel.org/r/20210819150712.59948-1-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: in_irq() cleanup

Replace the obsolete and ambiguos macro in_irq() with new macro
in_hardirq().

Link: https://lkml.kernel.org/r/20210813145245.86070-1-changbin.du@gmail.com
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [kmemleak]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

highmem: don't disable preemption on RT in kmap_atomic()

kmap_atomic() disables preemption and pagefaults for historical reasons.
The conversion to kmap_local(), which only disables migration, cannot be
done wholesale because quite some call sites need to be updated to
accommodate with the changed semantics.

On PREEMPT_RT enabled kernels the kmap_atomic() semantics are problematic
due to the implicit disabling of preemption which makes it impossible to
acquire 'sleeping' spinlocks within the kmap atomic sections.

PREEMPT_RT replaces the preempt_disable() with a migrate_disable() for
more than a decade. It could be argued that this is a justification to do
this unconditionally, but PREEMPT_RT covers only a limited number of
architectures and it disables some functionality which limits the coverage
further.

Limit the replacement to PREEMPT_RT for now.

Link: https://lkml.kernel.org/r/20210810091116.pocdmaatdcogvdso@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/early_ioremap.c: remove redundant early_ioremap_shutdown()

early_ioremap_reset() reserved a weak function so that architectures can
provide a specific cleanup. Now no architectures use it, remove this
redundant function.

Link: https://lkml.kernel.org/r/20210901082917.399953-1-o451686892@gmail.com
Signed-off-by: Weizhao Ouyang <o451686892@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: don't allow executable ioremap mappings

There is no need to execute from iomem (and most platforms it is
impossible anyway), so add the pgprot_nx() call similar to vmap.

Link: https://lkml.kernel.org/r/20210824091259.1324527-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: move ioremap_page_range to vmalloc.c

Patch series "small ioremap cleanups".

The first patch moves a little code around the vmalloc/ioremap boundary
following a bigger move by Nick earlier. The second enforces
non-executable mapping on ioremap just like we do for vmap. No driver
currently uses executable mappings anyway, as they should.

This patch (of 2):

This keeps it together with the implementation, and to remove the
vmap_range wrapper.

Link: https://lkml.kernel.org/r/20210824091259.1324527-1-hch@lst.de
Link: https://lkml.kernel.org/r/20210824091259.1324527-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

riscv: only select GENERIC_IOREMAP if MMU support is enabled

nommu ioremap is an inline stub in asm-generic/io.h.

Link: https://lkml.kernel.org/r/20210825072036.GA29161@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: remove redundant compound_head() calling

There is a READ_ONCE() in the macro of compound_head(), which will prevent
compiler from optimizing the code when there are more than once calling of
it in a function. Remove the redundant calling of compound_head() from
page_to_index() and page_add_file_rmap() for better code generation.

Link: https://lkml.kernel.org/r/20210811101431.83940-1-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/memory_hotplug: use helper zone_is_zone_device() to simplify the code

Patch series "Cleanup and fixups for memory hotplug".

This series contains cleanup to use helper function to simplify the code.
Also we fix some potential bugs. More details can be found in the
respective changelogs.

This patch (of 3):

Use helper zone_is_zone_device() to simplify the code and remove some
explicit CONFIG_ZONE_DEVICE codes.

Link: https://lkml.kernel.org/r/20210821094246.10149-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20210821094246.10149-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Chris Goldsworthy <cgoldswo@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/memory_hotplug: improved dynamic memory group aware "auto-movable" online policy

Currently, the "auto-movable" online policy does not allow for hotplugged
KERNEL (ZONE_NORMAL) memory to increase the amount of MOVABLE memory we
can have, primarily, because there is no coordiantion across memory
devices and we don't want to create zone-imbalances accidentially when
unplugging memory.

However, within a single memory device it's different.  Let's allow for
KERNEL memory within a dynamic memory group to allow for more MOVABLE
within the same memory group.  The only thing we have to take care of is
that the managing driver avoids zone imbalances by unplugging MOVABLE
memory first, otherwise there can be corner cases where unplug of memory
could result in (accidential) zone imbalances.

virtio-mem is the only user of dynamic memory groups and recently added
support for prioritizing unplug of ZONE_MOVABLE over ZONE_NORMAL, so we
don't need a new toggle to enable it for dynamic memory groups.

We limit this handling to dynamic memory groups, because:

* We want to keep the runtime overhead for collecting stats when
  onlining a single memory block small.  We tend to have only a handful of
  dynamic memory groups, but we can have quite some static memory groups
  (e.g., 256 DIMMs).

* It doesn't make too much sense for static memory groups, as we try
  onlining all applicable memory blocks either completely to ZONE_MOVABLE
  or not.  In ordinary operation, we won't have a mixture of zones within
  a static memory group.

When adding memory to a dynamic memory group, we'll first online memory to
ZONE_MOVABLE as long as early KERNEL memory allows for it.  Then, we'll
online the next unit(s) to ZONE_NORMAL, until we can online the next
unit(s) to ZONE_MOVABLE.

For a simple virtio-mem device with a MOVABLE:KERNEL ratio of 3:1, it will
result in a layout like:

  [M][M][M][M][M][M][M][M][N][M][M][M][N][M][M][M]...
  ^ movable memory due to early kernel memory
   ^ allows for more movable memory ...
      ^-----^ ... here
       ^ allows for more movable memory ...
          ^-----^ ... here

While the created layout is sub-optimal when it comes to contiguous zones,
it gives us the maximum flexibility when dynamically growing/shrinking a
device; we can grow small VMs really big in small steps, and still shrink
reliably to e.g., 1/4 of the maximum VM size in this example, removing
full memory blocks along with meta data more reliably.

Mark dynamic memory groups in the xarray such that we can efficiently
iterate over them when collecting stats.  In usual setups, we have one
virtio-mem device per NUMA node, and usually only a small number of NUMA
nodes.

Note: for now, there seems to be no compelling reason to make this
behavior configurable.

Link: https://lkml.kernel.org/r/20210806124715.17090-10-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hui Zhu <teawater@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/memory_hotplug: memory group aware "auto-movable" online policy

Use memory groups to improve our "auto-movable" onlining policy:

1. For static memory groups (e.g., a DIMM), online a memory block MOVABLE
   only if all other memory blocks in the group are either MOVABLE or could
   be onlined MOVABLE. A DIMM will either be MOVABLE or not, not a mixture.

2. For dynamic memory groups (e.g., a virtio-mem device), online a
   memory block MOVABLE only if all other memory blocks inside the
   current unit are either MOVABLE or could be onlined MOVABLE. For a
   virtio-mem device with a device block size with 512 MiB, all 128 MiB
   memory blocks wihin a 512 MiB unit will either be MOVABLE or not, not
   a mixture.

We have to pass the memory group to zone_for_pfn_range() to take the
memory group into account.

Note: for now, there seems to be no compelling reason to make this
behavior configurable.

Link: https://lkml.kernel.org/r/20210806124715.17090-9-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hui Zhu <teawater@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

virtio-mem: use a single dynamic memory group for a single virtio-mem device

Let's use a single dynamic memory group.

Link: https://lkml.kernel.org/r/20210806124715.17090-8-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hui Zhu <teawater@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

dax/kmem: use a single static memory group for a single probed unit

Although dax/kmem users often disable auto-onlining and instead online
memory manually (usually to ZONE_MOVABLE), there is still value in having
auto-onlining be aware of the relationship of memory blocks.

Let's treat one probed unit as a single static memory device, similar to a
single ACPI memory device.

Link: https://lkml.kernel.org/r/20210806124715.17090-7-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hui Zhu <teawater@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ACPI: memhotplug: use a single static memory group for a single memory device

Let's group all memory we add for a single memory device - we want a
single node for that (which also seems to be the sane thing to do).

We won't care for now about memory that was already added to the system
(e.g., via e820) -- usually *all* memory of a memory device was already
added and we'll fail acpi_memory_enable_device().

Link: https://lkml.kernel.org/r/20210806124715.17090-6-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hui Zhu <teawater@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/memory_hotplug: track present pages in memory groups

Let's track all present pages in each memory group. Especially, track
memory present in ZONE_MOVABLE and memory present in one of the kernel
zones (which really only is ZONE_NORMAL right now as memory groups only
apply to hotplugged memory) separately within a memory group, to prepare
for making smart auto-online decision for individual memory blocks within
a memory group based on group statistics.

Link: https://lkml.kernel.org/r/20210806124715.17090-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hui Zhu <teawater@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

drivers/base/memory: introduce "memory groups" to logically group memory blocks

In our "auto-movable" memory onlining policy, we want to make decisions
across memory blocks of a single memory device.  Examples of memory
devices include ACPI memory devices (in the simplest case a single DIMM)
and virtio-mem.  For now, we don't have a connection between a single
memory block device and the real memory device.  Each memory device
consists of 1..X memory block devices.

Let's logically group memory blocks belonging to the same memory device in
"memory groups".  Memory groups can span multiple physical ranges and a
memory group itself does not contain any information regarding physical
ranges, only properties (e.g., "max_pages") necessary for improved memory
onlining.

Introduce two memory group types:

1) Static memory group: E.g., a single ACPI memory device, consisting
   of 1..X memory resources.  A memory group consists of 1..Y memory
   blocks.  The whole group is added/removed in one go.  If any part
   cannot get offlined, the whole group cannot be removed.

2) Dynamic memory group: E.g., a single virtio-mem device.  Memory is
   dynamically added/removed in a fixed granularity, called a "unit",
   consisting of 1..X memory blocks.  A unit is added/removed in one go.
   If any part of a unit cannot get offlined, the whole unit cannot be
   removed.

In case of 1) we usually want either all memory managed by ZONE_MOVABLE or
none.  In case of 2) we usually want to have as many units as possible
managed by ZONE_MOVABLE.  We want a single unit to be of the same type.

For now, memory groups are an internal concept that is not exposed to user
space; we might want to change that in the future, though.

add_memory() users can specify a mgid instead of a nid when passing the
MHP_NID_IS_MGID flag.

Link: https://lkml.kernel.org/r/20210806124715.17090-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hui Zhu <teawater@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/memory_hotplug: introduce "auto-movable" online policy

When onlining without specifying a zone (using "online" instead of
"online_kernel" or "online_movable"), we currently select a zone such that
existing zones are kept contiguous.  This online policy made sense in the
past, where contiguous zones where required.

We'd like to implement smarter policies, however:

* User space has little insight.  As one example, it has no idea which
  memory blocks logically belong together (e.g., to a DIMM or to a
  virtio-mem device).

* Drivers that add memory in separate memory blocks, especially
  virtio-mem, want memory to get onlined right from the kernel when
  adding.

So we really want to have onlining to differing zones managed in the
kernel, configured by user space.

We see more and more cases where we might eventually hotplug a lot of
memory in the future (e.g., eventually grow a 2 GiB VM to 64 GiB),
however:

* Resizing happens dynamically, in smaller steps in both directions
  (e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...)

* We still want as much flexibility as possible, especially,
  hotunplugging as much memory as possible later.

We can really only use "online_movable" if we know that the amount of
memory we are going to hotplug upfront, and we know that it won't result
in a zone imbalance.  So in our example, a 2 GiB VM that could grow to 64
GiB could currently not use "online_movable", and instead, "online_kernel"
would have to be used, resulting in worse (no) memory hotunplug
reliability.

Let's add a new "auto-movable" online policy that considers the current
zone ratios (global, per-node) to determine, whether we a memory block can
be onlined to ZONE_MOVABLE:

MOVABLE : KERNEL

However, internally we'll only consider the following ratio for now:

MOVABLE : KERNEL_EARLY

For now, we don't allow for hotplugged KERNEL memory to allow for more
MOVABLE memory, because there is no coordination across memory devices.
In follow-up patches, we will allow for more KERNEL memory within a memory
device to allow for more MOVABLE memory within the same memory device --
which only makes sense for special memory device types.

We base our calculation on "present pages", see the code comments for
details.  Hotplugged memory will get online to ZONE_MOVABLE if the
configured ratio allows for it.  Depending on the setup, this can result
in fragmented zones, which can make compaction slower and dynamic
allocation of gigantic pages when not using CMA less reliable (...  which
is already pretty unreliable).

The old policy will be the default and called "contig-zones".  In
follow-up patches, our new policy will use additional information, such as
memory groups, to make even smarter decisions across memory blocks.

Configuration:

* memory_hotplug.online_policy is used to switch between both polices
  and defaults to "contig-zones".

* memory_hotplug.auto_movable_ratio defines the maximum ratio is in
  percent and defaults to "301" -- allowing e.g., most 8 GiB machines to
  grow to 32 GiB and have all hotplugged memory in ZONE_MOVABLE.  The
  additional percent accounts for a handful of lost present pages (e.g.,
  firmware allocations).  User space is expected to adjust this ratio when
  enabling the new "auto-movable" policy, though.

* memory_hotplug.auto_movable_numa_aware considers numa node stats in
  addition to global stats, and defaults to "true".

Note: just like the old policy, the new policy won't take things like
unmovable huge pages or memory ballooning that doesn't support balloon
compaction into account.  User space has to configure onlining
accordingly.

Link: https://lkml.kernel.org/r/20210806124715.17090-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hui Zhu <teawater@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: track present early pages per zone

Patch series "mm/memory_hotplug: "auto-movable" online policy and memory groups", v3.

I. Goal

The goal of this series is improving in-kernel auto-online support.  It
tackles the fundamental problems that:

1) We can create zone imbalances when onlining all memory blindly to
    ZONE_MOVABLE, in the worst case crashing the system. We have to know
    upfront how much memory we are going to hotplug such that we can
    safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE
    via "online_movable". This is far from practical and only applicable in
    limited setups -- like inside VMs under the RHV/oVirt hypervisor which
    will never hotplug more than 3 times the boot memory (and the
    limitation is only in place due to the Linux limitation).

2) We see more setups that implement dynamic VM resizing, hot(un)plugging
    memory to resize VM memory. In these setups, we might hotplug a lot of
    memory, but it might happen in various small steps in both directions
    (e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the
    primary driver of this upstream right now, performing such dynamic
    resizing NUMA-aware via multiple virtio-mem devices.

    Onlining all hotplugged memory to ZONE_NORMAL means we basically have
    no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can
    easily run into zone imbalances when growing a VM. We want a mixture,
    and we want as much memory as reasonable/configured in ZONE_MOVABLE.
    Details regarding zone imbalances can be found at [1].

3) Memory devices consist of 1..X memory block devices, however, the
    kernel doesn't really track the relationship. Consequently, also user
    space has no idea. We want to make per-device decisions.

    As one example, for memory hotunplug it doesn't make sense to use a
    mixture of zones within a single DIMM: we want all MOVABLE if
    possible, otherwise all !MOVABLE, because any !MOVABLE part will easily
    block the whole DIMM from getting hotunplugged.

    As another example, virtio-mem operates on individual units that span
    1..X memory blocks. Similar to a DIMM, we want a unit to either be all
    MOVABLE or !MOVABLE. A "unit" can be thought of like a DIMM, however,
    all units of a virtio-mem device logically belong together and are
    managed (added/removed) by a single driver. We want as much memory of
    a virtio-mem device to be MOVABLE as possible.

4) We want memory onlining to be done right from the kernel while adding
    memory, not triggered by user space via udev rules; for example, this
    is reqired for fast memory hotplug for drivers that add individual
    memory blocks, like virito-mem. We want a way to configure a policy in
    the kernel and avoid implementing advanced policies in user space.

The auto-onlining support we have in the kernel is not sufficient.  All we
have is a) online everything MOVABLE (online_movable) b) online everything
!MOVABLE (online_kernel) c) keep zones contiguous (online).  This series
allows configuring c) to mean instead "online movable if possible
according to the coniguration, driven by a maximum MOVABLE:KERNEL ratio"
-- a new onlining policy.

II. Approach

This series does 3 things:

1) Introduces the "auto-movable" online policy that initially operates on
    individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio
    to make a decision whether a memory block will be onlined to
    ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL
    memory does not allow for more MOVABLE memory (details in the
    patches). CMA memory is treated like MOVABLE memory.

2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory
    groups and uses group information to make decisions in the
    "auto-movable" online policy across memory blocks of a single memory
    device (modeled as memory group). More details can be found in patch
    #3 or in the DIMM example below.

3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by
    allowing ZONE_NORMAL memory within a dynamic memory group to allow for
    more ZONE_MOVABLE memory within the same memory group. The target use
    case is dynamic VM resizing using virtio-mem. See the virtio-mem
    example below.

I remember that the basic idea of using a ratio to implement a policy in
the kernel was once mentioned by Vitaly Kuznetsov, but I might be wrong (I
lost the pointer to that discussion).

For me, the main use case is using it along with virtio-mem (and DIMMs /
ppc64 dlpar where necessary) for dynamic resizing of VMs, increasing the
amount of memory we can hotunplug reliably again if we might eventually
hotplug a lot of memory to a VM.

III. Target Usage

The target usage will be:

1) Linux boots with "mhp_default_online_type=offline"

2) User space (e.g., systemd unit) configures memory onlining (according
    to a config file and system properties), for example:
    * Setting memory_hotplug.online_policy=auto-movable
    * Setting memory_hotplug.auto_movable_ratio=301
    * Setting memory_hotplug.auto_movable_numa_aware=true

3) User space enabled auto onlining via "echo online >
    /sys/devices/system/memory/auto_online_blocks"

4) User space triggers manual onlining of all already-offline memory
    blocks (go over offline memory blocks and set them to "online")

IV. Example

For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of
301% results in the following layout:
Memory block 0-15:    DMA32   (early)
Memory block 32-47:   Normal  (early)
Memory block 48-79:   Movable (DIMM 0)
Memory block 80-111:  Movable (DIMM 1)
Memory block 112-143: Movable (DIMM 2)
Memory block 144-275: Normal  (DIMM 3)
Memory block 176-207: Normal  (DIMM 4)
... all Normal
(-> hotplugged Normal memory does not allow for more Movable memory)

For virtio-mem, using a simple, single virtio-mem device with a 4 GiB VM
will result in the following layout:
Memory block 0-15:    DMA32   (early)
Memory block 32-47:   Normal  (early)
Memory block 48-143:  Movable (virtio-mem, first 12 GiB)
Memory block 144:     Normal  (virtio-mem, next 128 MiB)
Memory block 145-147: Movable (virtio-mem, next 384 MiB)
Memory block 148:     Normal  (virtio-mem, next 128 MiB)
Memory block 149-151: Movable (virtio-mem, next 384 MiB)
... Normal/Movable mixture as above
(-> hotplugged Normal memory allows for more Movable memory within
    the same device)

Which gives us maximum flexibility when dynamically growing/shrinking a
VM in smaller steps.

V. Doc Update

I'll update the memory-hotplug.rst documentation, once the overhaul [1] is
usptream. Until then, details can be found in patch #2.

VI. Future Work

1) Use memory groups for ppc64 dlpar
2) Being able to specify a portion of (early) kernel memory that will be
    excluded from the ratio. Like "128 MiB globally/per node" are excluded.

    This might be helpful when starting VMs with extremely small memory
    footprint (e.g., 128 MiB) and hotplugging memory later -- not wanting
    the first hotplugged units getting onlined to ZONE_MOVABLE. One
    alternative would be a trigger to not consider ZONE_DMA memory
    in the ratio. We'll have to see if this is really rrequired.
3) Indicate to user space that MOVABLE might be a bad idea -- especially
    relevant when memory ballooning without support for balloon compaction
    is active.

This patch (of 9):

For implementing a new memory onlining policy, which determines when to
online memory blocks to ZONE_MOVABLE semi-automatically, we need the
number of present early (boot) pages -- present pages excluding hotplugged
pages.  Let's track these pages per zone.

Pass a page instead of the zone to adjust_present_page_count(), similar as
adjust_managed_page_count() and derive the zone from the page.

It's worth noting that a memory block to be offlined/onlined is either
completely "early" or "not early".  add_memory() and friends can only add
complete memory blocks and we only online/offline complete (individual)
memory blocks.

Link: https://lkml.kernel.org/r/20210806124715.17090-1-david@redhat.com
Link: https://lkml.kernel.org/r/20210806124715.17090-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: Hui Zhu <teawater@gmail.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ACPI: memhotplug: memory resources cannot be enabled yet

We allocate + initialize everything from scratch. In case enabling the
device fails, we free all memory resourcs.

Link: https://lkml.kernel.org/r/20210712124052.26491-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pierre Morel <pmorel@linux.ibm.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Rich Felker <dalias@libc.org>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Sergei Trofimovich <slyfox@gentoo.org>
Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/memory_hotplug: remove nid parameter from remove_memory() and friends

There is only a single user remaining. We can simply lookup the nid only
used for node offlining purposes when walking our memory blocks. We don't
expect to remove multi-nid ranges; and if we'd ever do, we most probably
don't care about removing multi-nid ranges that actually result in empty
nodes.

If ever required, we can detect the "multi-nid" scenario and simply try
offlining all online nodes.

Link: https://lkml.kernel.org/r/20210712124052.26491-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pankaj.gupta@ionos.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pierre Morel <pmorel@linux.ibm.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Sergei Trofimovich <slyfox@gentoo.org>
Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/memory_hotplug: remove nid parameter from arch_remove_memory()

The parameter is unused, let's remove it.

Link: https://lkml.kernel.org/r/20210712124052.26491-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Acked-by: Heiko Carstens <hca@linux.ibm.com> [s390]
Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Sergei Trofimovich <slyfox@gentoo.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Cc: Joe Perches <joe@perches.com>
Cc: Pierre Morel <pmorel@linux.ibm.com>
Cc: Jia He <justin.he@arm.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/memory_hotplug: use "unsigned long" for PFN in zone_for_pfn_range()

Patch series "mm/memory_hotplug: preparatory patches for new online policy and memory"

These are all cleanups and one fix previously sent as part of [1]:
[PATCH v1 00/12] mm/memory_hotplug: "auto-movable" online policy and memory
groups.

These patches make sense even without the other series, therefore I pulled
them out to make the other series easier to digest.

[1] https://lkml.kernel.org/r/20210607195430.48228-1-david@redhat.com

This patch (of 4):

Checkpatch complained on a follow-up patch that we are using "unsigned"
here, which defaults to "unsigned int" and checkpatch is correct.

As we will search for a fitting zone using the wrong pfn, we might end
up onlining memory to one of the special kernel zones, such as ZONE_DMA,
which can end badly as the onlined memory does not satisfy properties of
these zones.

Use "unsigned long" instead, just as we do in other places when handling
PFNs. This can bite us once we have physical addresses in the range of
multiple TB.

Link: https://lkml.kernel.org/r/20210712124052.26491-2-david@redhat.com
Fixes: e5e689302633 ("mm, memory_hotplug: display allowed zones in the preferred ordering")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: virtualization@lists.linux-foundation.org
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pierre Morel <pmorel@linux.ibm.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Sergei Trofimovich <slyfox@gentoo.org>
Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: memory_hotplug: cleanup after removal of pfn_valid_within()

When test_pages_in_a_zone() used pfn_valid_within() is has some logic
surrounding pfn_valid_within() checks.

Since pfn_valid_within() is gone, this logic can be removed.

Link: https://lkml.kernel.org/r/20210713080035.7464-3-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE

Patch series "mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE".

After recent updates to freeing unused parts of the memory map, no
architecture can have holes in the memory map within a pageblock. This
makes pfn_valid_within() check and CONFIG_HOLES_IN_ZONE configuration
option redundant.

The first patch removes them both in a mechanical way and the second patch
simplifies memory_hotplug::test_pages_in_a_zone() that had
pfn_valid_within() surrounded by more logic than simple if.

This patch (of 2):

After recent changes in freeing of the unused parts of the memory map and
rework of pfn_valid() in arm and arm64 there are no architectures that can
have holes in the memory map within a pageblock and so nothing can enable
CONFIG_HOLES_IN_ZONE which guards non trivial implementation of
pfn_valid_within().

With that, pfn_valid_within() is always hardwired to 1 and can be
completely removed.

Remove calls to pfn_valid_within() and CONFIG_HOLES_IN_ZONE.

Link: https://lkml.kernel.org/r/20210713080035.7464-1-rppt@kernel.org
Link: https://lkml.kernel.org/r/20210713080035.7464-2-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

memory-hotplug.rst: complete admin-guide overhaul

The memory hot(un)plug documentation is outdated and incomplete. Most of
the content dates back to 2007, so it's time for a major overhaul.

Let's rewrite, reorganize and update most parts of the documentation. In
addition to memory hot(un)plug, also add some details regarding
ZONE_MOVABLE, with memory hotunplug being one of its main consumers.

Drop the file history, that information can more reliably be had from the
git log.

The style of the document is also properly fixed that e.g., "restview"
renders it cleanly now.

In the future, we might add some more details about virt users like
virtio-mem, the XEN balloon, the Hyper-V balloon and ppc64 dlpar.

Link: https://lkml.kernel.org/r/20210707073205.3835-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

memory-hotplug.rst: remove locking details from admin-guide

Patch series "memory-hotplug.rst: complete admin-guide overhaul", v3.

This patch (of 2):

We have the same content at Documentation/core-api/memory-hotplug.rst and
it doesn't fit into the admin-guide. The documentation was accidentially
duplicated when merging.

Link: https://lkml.kernel.org/r/20210707073205.3835-1-david@redhat.com
Link: https://lkml.kernel.org/r/20210707073205.3835-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Makefile: use -Wno-main in the full kernel tree

When using gcc (SUSE Linux) 7.5.0 (on openSUSE 15.3), I see a build
warning:

  kernel/trace/trace_osnoise.c: In function 'start_kthread':
  kernel/trace/trace_osnoise.c:1461:8: warning: 'main' is usually a function [-Wmain]
    void *main = osnoise_main;
          ^~~~

Quieten that warning by using "-Wno-main".  It's OK to use "main" as a
declaration name in the kernel.

Build-tested on most ARCHes.

[ v2: only do it for gcc, since clang doesn't have that particular warning ]

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/lkml/20210813224131.25803-1-rdunlap@infradead.org/
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: linux-kbuild@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge branches 'acpi-pm' and 'acpi-docs'

* acpi-pm:
ACPI: PM: s2idle: Run both AMD and Microsoft methods if both are supported

* acpi-docs:
Documentation: ACPI: Align the SSDT overlays file with the code

Merge branch 'pm-opp'

* pm-opp:
  dt-bindings: opp: Convert to DT schema
  dt-bindings: Clean-up OPP binding node names in examples
  ARM: dts: omap: Drop references to opp.txt

Merge branch 'pm-cpufreq'

* pm-cpufreq:
  Revert "cpufreq: intel_pstate: Process HWP Guaranteed change notification"
  cpufreq: mediatek-hw: Add support for CPUFREQ HW
  cpufreq: Add of_perf_domain_get_sharing_cpumask
  dt-bindings: cpufreq: add bindings for MediaTek cpufreq HW
  cpufreq: Remove ready() callback
  cpufreq: sh: Remove sh_cpufreq_cpu_ready()
  cpufreq: acpi: Remove acpi_cpufreq_cpu_ready()
  cpufreq: qcom-hw: Set dvfs_possible_from_any_cpu cpufreq driver flag
  cpufreq: blocklist more Qualcomm platforms in cpufreq-dt-platdev
  cpufreq: qcom-cpufreq-hw: Add dcvs interrupt support
  cpufreq: scmi: Use .register_em() to register with energy model
  cpufreq: vexpress: Use .register_em() to register with energy model
  cpufreq: scpi: Use .register_em() to register with energy model
  cpufreq: qcom-cpufreq-hw: Use .register_em() to register with energy model
  cpufreq: omap: Use .register_em() to register with energy model
  cpufreq: mediatek: Use .register_em() to register with energy model
  cpufreq: imx6q: Use .register_em() to register with energy model
  cpufreq: dt: Use .register_em() to register with energy model
  cpufreq: Add callback to register with energy model
  cpufreq: vexpress: Set CPUFREQ_IS_COOLING_DEV flag

IB/hfi1: make hist static

This symbol is not used outside of trace.c, so marks it static.

Fix the following sparse warning:

drivers/infiniband/hw/hfi1/trace.c:491:23: warning: symbol 'hist' was not declared. Should it be static?

Link: https://lore.kernel.org/r/1630921723-21545-1-git-send-email-jiapeng.chong@linux.alibaba.com
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: chongjiapeng <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/bnxt_re: Prefer kcalloc over open coded arithmetic

As noted in the "Deprecated Interfaces, Language Features, Attributes, and
Conventions" documentation [1], size calculations (especially
multiplication) should not be performed in memory allocator (or similar)
function arguments due to the risk of them overflowing. This could lead to
values wrapping around and a smaller allocation being made than the caller
was expecting. Using those allocations could lead to linear overflows of
heap memory and other misbehaviors.

In this case this is not actually dynamic sizes: both sides of the
multiplication are constant values. However it is best to refactor this
anyway, just to keep the open-coded math idiom out of code.

So, use the purpose specific kcalloc() function instead of the argument
size * count in the kzalloc() function.

Also, remove the unnecessary initialization of the sqp_tbl variable since
it is set a few lines later.

[1] https://www.kernel.org/doc/html/v5.14/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments

Link: https://lore.kernel.org/r/20210905081812.17113-1-len.baker@gmx.com
Signed-off-by: Len Baker <len.baker@gmx.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

IB/qib: Fix null pointer subtraction compiler warning

>> drivers/infiniband/hw/qib/qib_sysfs.c:411:1: warning: performing pointer subtraction with a null pointer has undefined behavior
+[-Wnull-pointer-subtraction]
   QIB_DIAGC_ATTR(rc_resends);
   ^~~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/infiniband/hw/qib/qib_sysfs.c:408:51: note: expanded from macro 'QIB_DIAGC_ATTR'
                   .counter = &((struct qib_ibport *)0)->rvp.n_##N - (u64 *)0,    \

Use offsetof and accomplish the type check using static_assert.

Fixes: 4a7aaf88c89f ("RDMA/qib: Use attributes for the port sysfs")
Link: https://lore.kernel.org/r/0-v1-43ae3c759177+65-qib_type_jgg@nvidia.com
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/mlx5: Fix xlt_chunk_align calculation

The XLT chunk alignment depends on ent_size not sizeof(ent_size) aka
sizeof(size_t). The incoming ent_size is either 8 or 16, so the
miscalculation when 16 is required is only an over-alignment and
functional harmless.

Fixes: 8010d74b9965 ("RDMA/mlx5: Split the WR setup out of mlx5_ib_update_xlt()")
Link: https://lore.kernel.org/r/20210908081849.7948-2-schnelle@linux.ibm.com
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/mlx5: Fix number of allocated XLT entries

In commit 8010d74b9965b ("RDMA/mlx5: Split the WR setup out of
mlx5_ib_update_xlt()") the allocation logic was split out of
mlx5_ib_update_xlt() and the logic was changed to enable better OOM
handling. Sadly this change introduced a miscalculation of the number of
entries that were actually allocated when under memory pressure where it
can actually become 0 which on s390 lets dma_map_single() fail.

It can also lead to corruption of the free pages list when the wrong
number of entries is used in the calculation of sg->length which is used
as argument for free_pages().

Fix this by using the allocation size instead of misusing get_order(size).

Cc: stable@vger.kernel.org
Fixes: 8010d74b9965 ("RDMA/mlx5: Split the WR setup out of mlx5_ib_update_xlt()")
Link: https://lore.kernel.org/r/20210908081849.7948-1-schnelle@linux.ibm.com
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Merge tag 'pci-v5.15-changes' of git://git./linux/kernel/git/helgaas/pci

Pull PCI updates from Bjorn Helgaas:
"Enumeration:
   - Convert controller drivers to generic_handle_domain_irq() (Marc
     Zyngier)
   - Simplify VPD (Vital Product Data) access and search (Heiner
     Kallweit)
   - Update bnx2, bnx2x, bnxt, cxgb4, cxlflash, sfc, tg3 drivers to use
     simplified VPD interfaces (Heiner Kallweit)
   - Run Max Payload Size quirks before configuring MPS; work around
     ASMedia ASM1062 SATA MPS issue (Marek Behún)

  Resource management:
   - Refactor pci_ioremap_bar() and pci_ioremap_wc_bar() (Krzysztof
     Wilczyński)
   - Optimize pci_resource_len() to reduce kernel size (Zhen Lei)

  PCI device hotplug:
   - Fix a double unmap in ibmphp (Vishal Aslot)

  PCIe port driver:
   - Enable Bandwidth Notification only if port supports it (Stuart
     Hayes)

  Sysfs/proc/syscalls:
   - Add schedule point in proc_bus_pci_read() (Krzysztof Wilczyński)
   - Return ~0 data on pciconfig_read() CAP_SYS_ADMIN failure (Krzysztof
     Wilczyński)
   - Return "int" from pciconfig_read() syscall (Krzysztof Wilczyński)

  Virtualization:
   - Extend "pci=noats" to also turn on Translation Blocking to protect
     against some DMA attacks (Alex Williamson)
   - Add sysfs mechanism to control the type of reset used between
     device assignments to VMs (Amey Narkhede)
   - Add support for ACPI _RST reset method (Shanker Donthineni)
   - Add ACS quirks for Cavium multi-function devices (George Cherian)
   - Add ACS quirks for NXP LX2xx0 and LX2xx2 platforms (Wasim Khan)
   - Allow HiSilicon AMBA devices that appear as fake PCI devices to use
     PASID and SVA (Zhangfei Gao)

  Endpoint framework:
   - Add support for SR-IOV Endpoint devices (Kishon Vijay Abraham I)
   - Zero-initialize endpoint test tool parameters so we don't use
     random parameters (Shunyong Yang)

  APM X-Gene PCIe controller driver:
   - Remove redundant dev_err() call in xgene_msi_probe() (ErKun Yang)

  Broadcom iProc PCIe controller driver:
   - Don't fail devm_pci_alloc_host_bridge() on missing 'ranges' because
     it's optional on BCMA devices (Rob Herring)
   - Fix BCMA probe resource handling (Rob Herring)

  Cadence PCIe driver:
   - Work around J7200 Link training electrical issue by increasing
     delays in LTSSM (Nadeem Athani)

  Intel IXP4xx PCI controller driver:
   - Depend on ARCH_IXP4XX to avoid useless config questions (Geert
     Uytterhoeven)

  Intel Keembay PCIe controller driver:
   - Add Intel Keem Bay PCIe controller (Srikanth Thokala)

  Marvell Aardvark PCIe controller driver:
   - Work around config space completion handling issues (Evan Wang)
   - Increase timeout for config access completions (Pali Rohár)
   - Emulate CRS Software Visibility bit (Pali Rohár)
   - Configure resources from DT 'ranges' property to fix I/O space
     access (Pali Rohár)
   - Serialize INTx mask/unmask (Pali Rohár)

  MediaTek PCIe controller driver:
   - Add MT7629 support in DT (Chuanjia Liu)
   - Fix an MSI issue (Chuanjia Liu)
   - Get syscon regmap ("mediatek,generic-pciecfg"), IRQ number
     ("pci_irq"), PCI domain ("linux,pci-domain") from DT properties if
     present (Chuanjia Liu)

  Microsoft Hyper-V host bridge driver:
   - Add ARM64 support (Boqun Feng)
   - Support "Create Interrupt v3" message (Sunil Muthuswamy)

  NVIDIA Tegra PCIe controller driver:
   - Use seq_puts(), move err_msg from stack to static, fix OF node leak
     (Christophe JAILLET)

  NVIDIA Tegra194 PCIe driver:
   - Disable suspend when in Endpoint mode (Om Prakash Singh)
   - Fix MSI-X address programming error (Om Prakash Singh)
   - Disable interrupts during suspend to avoid spurious AER link down
     (Om Prakash Singh)

  Renesas R-Car PCIe controller driver:
   - Work around hardware issue that prevents Link L1->L0 transition
     (Marek Vasut)
   - Fix runtime PM refcount leak (Dinghao Liu)

  Rockchip DesignWare PCIe controller driver:
   - Add Rockchip RK356X host controller driver (Simon Xue)

  TI J721E PCIe driver:
   - Add support for J7200 and AM64 (Kishon Vijay Abraham I)

  Toshiba Visconti PCIe controller driver:
   - Add Toshiba Visconti PCIe host controller driver (Nobuhiro
     Iwamatsu)

  Xilinx NWL PCIe controller driver:
   - Enable PCIe reference clock via CCF (Hyun Kwon)

  Miscellaneous:
   - Convert sta2x11 from 'pci_' to 'dma_' API (Christophe JAILLET)
   - Fix pci_dev_str_match_path() alloc while atomic bug (used for
     kernel parameters that specify devices) (Dan Carpenter)
   - Remove pointless Precision Time Management warning when PTM is
     present but not enabled (Jakub Kicinski)
   - Remove surplus "break" statements (Krzysztof Wilczyński)"

* tag 'pci-v5.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (132 commits)
  PCI: ibmphp: Fix double unmap of io_mem
  x86/PCI: sta2x11: switch from 'pci_' to 'dma_' API
  PCI/VPD: Use unaligned access helpers
  PCI/VPD: Clean up public VPD defines and inline functions
  cxgb4: Use pci_vpd_find_id_string() to find VPD ID string
  PCI/VPD: Add pci_vpd_find_id_string()
  PCI/VPD: Include post-processing in pci_vpd_find_tag()
  PCI/VPD: Stop exporting pci_vpd_find_info_keyword()
  PCI/VPD: Stop exporting pci_vpd_find_tag()
  PCI: Set dma-can-stall for HiSilicon chips
  PCI: rockchip-dwc: Add Rockchip RK356X host controller driver
  PCI: dwc: Remove surplus break statement after return
  PCI: artpec6: Remove local code block from switch statement
  PCI: artpec6: Remove surplus break statement after return
  MAINTAINERS: Add entries for Toshiba Visconti PCIe controller
  PCI: visconti: Add Toshiba Visconti PCIe host controller driver
  PCI/portdrv: Enable Bandwidth Notification only if port supports it
  PCI: Allow PASID on fake PCIe devices without TLP prefixes
  PCI: mediatek: Use PCI domain to handle ports detection
  PCI: mediatek: Add new method to get irq number
  ...

kbuild: Only default to -Werror if COMPILE_TEST

The cross-product of the kernel's supported toolchains, architectures,
and configuration options is large. So large, that it's generally
accepted to be infeasible to enumerate and build+test them all
(many compile-testers rely on randomly generated configs).

Without the possibility to enumerate all possible combinations of
toolchains, architectures, and configuration options, it is inevitable
that compiler warnings in this space exist.

With -Werror, this means that an innumerable set of kernels are now
broken, yet had been perfectly usable before (confused compilers, code
with warnings unused, or luck).

Distributors will necessarily pick a point in the toolchain X arch X
config space, and if unlucky, will have a broken build. Granted, those
will likely disable CONFIG_WERROR and move on.

The kernel's default configuration is unlikely to be suitable for all
users, but it's inappropriate to force many users to set CONFIG_WERROR=n.

This also holds for CI systems which are focused on runtime testing,
where the odd warning in some subsystem will disrupt testing of the rest
of the kernel. Many of those runtime-focused CI systems run tests or
fuzz the kernel using runtime debugging tools. Runtime testing of
different subsystems can proceed in parallel, and potentially uncover
serious bugs; halting runtime testing of the entire kernel because of
the odd warning (now error) in a subsystem or driver is simply
inappropriate.

Therefore, runtime-focused CI systems will likely choose CONFIG_WERROR=n
as well.

The appropriate usecase for -Werror is therefore compile-test focused
builds (often done by developers or CI systems).

Reflect this in the Kconfig option by making the default value of WERROR
match COMPILE_TEST.

Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Reviwed-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'net-5.15-rc1' of git://git./linux/kernel/git/netdev/net

Pull networking fixes and stragglers from Jakub Kicinski:
"Networking stragglers and fixes, including changes from netfilter,
  wireless and can.

  Current release - regressions:

   - qrtr: revert check in qrtr_endpoint_post(), fixes audio and wifi

   - ip_gre: validate csum_start only on pull

   - bnxt_en: fix 64-bit doorbell operation on 32-bit kernels

   - ionic: fix double use of queue-lock, fix a sleeping in atomic

   - can: c_can: fix null-ptr-deref on ioctl()

   - cs89x0: disable compile testing on powerpc

  Current release - new code bugs:

   - bridge: mcast: fix vlan port router deadlock, consistently disable
     BH

  Previous releases - regressions:

   - dsa: tag_rtl4_a: fix egress tags, only port 0 was working

   - mptcp: fix possible divide by zero

   - netfilter: nft_ct: protect nft_ct_pcpu_template_refcnt with mutex

   - netfilter: socket: icmp6: fix use-after-scope

   - stmmac: fix MAC not working when system resume back with WoL active

  Previous releases - always broken:

   - ip/ip6_gre: use the same logic as SIT interfaces when computing
     v6LL address

   - seg6: set fc_nlinfo in nh_create_ipv4, nh_create_ipv6

   - mptcp: only send extra TCP acks in eligible socket states

   - dsa: lantiq_gswip: fix maximum frame length

   - stmmac: fix overall budget calculation for rxtx_napi

   - bnxt_en: fix firmware version reporting via devlink

   - renesas: sh_eth: add missing barrier to fix freeing wrong tx
     descriptor

  Stragglers:

   - netfilter: conntrack: switch to siphash

   - netfilter: refuse insertion if chain has grown too large

   - ncsi: add get MAC address command to get Intel i210 MAC address"

* tag 'net-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (76 commits)
  ieee802154: Remove redundant initialization of variable ret
  net: stmmac: fix MAC not working when system resume back with WoL active
  net: phylink: add suspend/resume support
  net: renesas: sh_eth: Fix freeing wrong tx descriptor
  bonding: 3ad: pass parameter bond_params by reference
  cxgb3: fix oops on module removal
  can: c_can: fix null-ptr-deref on ioctl()
  can: rcar_canfd: add __maybe_unused annotation to silence warning
  net: wwan: iosm: Unify IO accessors used in the driver
  net: wwan: iosm: Replace io.*64_lo_hi() with regular accessors
  net: qcom/emac: Replace strlcpy with strscpy
  ip6_gre: Revert "ip6_gre: add validation for csum_start"
  net: hns3: make hclgevf_cmd_caps_bit_map0 and hclge_cmd_caps_bit_map0 static
  selftests/bpf: Test XDP bonding nest and unwind
  bonding: Fix negative jump label count on nested bonding
  MAINTAINERS: add VM SOCKETS (AF_VSOCK) entry
  stmmac: dwmac-loongson:Fix missing return value
  iwlwifi: fix printk format warnings in uefi.c
  net: create netdev->dev_addr assignment helpers
  bnxt_en: Fix possible unintended driver initiated error recovery
  ...

Merge tag 'linux-watchdog-5.15-rc1' of git://linux-watchdog.org/linux-watchdog

Pull watchdog updates from Wim Van Sebroeck:

- add Mediatek MT7986 & MT8195 wdt support

- add Maxim MAX63xx

- drop bd70528 support

- rewrite ixp4xx to watchdog framework

- constify static struct watchdog_ops for sl28cpld_wdt, mpc8xxx_wdt and
   tqmx86

- introduce watchdog_dev_suspend/resume

- several fixes and improvements

* tag 'linux-watchdog-5.15-rc1' of git://www.linux-watchdog.org/linux-watchdog:
  dt-bindings: watchdog: Add compatible for Mediatek MT7986
  watchdog: ixp4xx: Rewrite driver to use core
  watchdog: Start watchdog in watchdog_set_last_hw_keepalive only if appropriate
  watchdog: max63xx_wdt: Add device tree probing
  dt-bindings: watchdog: Add Maxim MAX63xx bindings
  watchdog: mediatek: mt8195: add wdt support
  dt-bindings: reset: mt8195: add toprgu reset-controller header file
  watchdog: tqmx86: Constify static struct watchdog_ops
  watchdog: mpc8xxx_wdt: Constify static struct watchdog_ops
  watchdog: sl28cpld_wdt: Constify static struct watchdog_ops
  watchdog: iTCO_wdt: Fix detection of SMI-off case
  watchdog: bcm2835_wdt: consider system-power-controller property
  watchdog: imx2_wdg: notify wdog core to stop ping worker on suspend
  watchdog: introduce watchdog_dev_suspend/resume
  watchdog: Fix NULL pointer dereference when releasing cdev
  watchdog: only run driver set_pretimeout op if device supports it
  watchdog: bd70528 drop bd70528 support

Merge tag 'for-linus' of git://git./virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
"ARM:
   - Page ownership tracking between host EL1 and EL2
   - Rely on userspace page tables to create large stage-2 mappings
   - Fix incompatibility between pKVM and kmemleak
   - Fix the PMU reset state, and improve the performance of the virtual
     PMU
   - Move over to the generic KVM entry code
   - Address PSCI reset issues w.r.t. save/restore
   - Preliminary rework for the upcoming pKVM fixed feature
   - A bunch of MM cleanups
   - a vGIC fix for timer spurious interrupts
   - Various cleanups

  s390:
   - enable interpretation of specification exceptions
   - fix a vcpu_idx vs vcpu_id mixup

  x86:
   - fast (lockless) page fault support for the new MMU
   - new MMU now the default
   - increased maximum allowed VCPU count
   - allow inhibit IRQs on KVM_RUN while debugging guests
   - let Hyper-V-enabled guests run with virtualized LAPIC as long as
     they do not enable the Hyper-V "AutoEOI" feature
   - fixes and optimizations for the toggling of AMD AVIC (virtualized
     LAPIC)
   - tuning for the case when two-dimensional paging (EPT/NPT) is
     disabled
   - bugfixes and cleanups, especially with respect to vCPU reset and
     choosing a paging mode based on CR0/CR4/EFER
   - support for 5-level page table on AMD processors

  Generic:
   - MMU notifier invalidation callbacks do not take mmu_lock unless
     necessary
   - improved caching of LRU kvm_memory_slot
   - support for histogram statistics
   - add statistics for halt polling and remote TLB flush requests"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (210 commits)
  KVM: Drop unused kvm_dirty_gfn_invalid()
  KVM: x86: Update vCPU's hv_clock before back to guest when tsc_offset is adjusted
  KVM: MMU: mark role_regs and role accessors as maybe unused
  KVM: MIPS: Remove a "set but not used" variable
  x86/kvm: Don't enable IRQ when IRQ enabled in kvm_wait
  KVM: stats: Add VM stat for remote tlb flush requests
  KVM: Remove unnecessary export of kvm_{inc,dec}_notifier_count()
  KVM: x86/mmu: Move lpage_disallowed_link further "down" in kvm_mmu_page
  KVM: x86/mmu: Relocate kvm_mmu_page.tdp_mmu_page for better cache locality
  Revert "KVM: x86: mmu: Add guest physical address check in translate_gpa()"
  KVM: x86/mmu: Remove unused field mmio_cached in struct kvm_mmu_page
  kvm: x86: Increase KVM_SOFT_MAX_VCPUS to 710
  kvm: x86: Increase MAX_VCPUS to 1024
  kvm: x86: Set KVM_MAX_VCPU_ID to 4*KVM_MAX_VCPUS
  KVM: VMX: avoid running vmx_handle_exit_irqoff in case of emulation
  KVM: x86/mmu: Don't freak out if pml5_root is NULL on 4-level host
  KVM: s390: index kvm->arch.idle_mask by vcpu_idx
  KVM: s390: Enable specification exception interpretation
  KVM: arm64: Trim guest debug exception handling
  KVM: SVM: Add 5-level page table support for SVM
  ...

Merge branch 'dmi-for-linus' of git://git./linux/kernel/git/jdelvare/staging

Pull dmi fix from Jean Delvare.

Unbreak some existing udev/hwdb modalias matches due to misplaced
product_sku field.

* 'dmi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging:
firmware: dmi: Move product_sku info to the end of the modalias

Merge tag 'ntb-5.15' of git://github.com/jonmason/ntb

Pull NTB updates from Jon Mason:
"Bug fixes and clean-ups for Linux v5.15"

* tag 'ntb-5.15' of git://github.com/jonmason/ntb:
  NTB: switch from 'pci_' to 'dma_' API
  ntb: ntb_pingpong: remove redundant initialization of variables msg_data and spad_data
  NTB: perf: Fix an error code in perf_setup_inbuf()
  NTB: Fix an error code in ntb_msit_probe()
  ntb: intel: remove invalid email address in header comment

Merge tag 'rproc-v5.15' of git://git./linux/kernel/git/andersson/remoteproc

Pull remoteproc updates from Bjorn Andersson:

- move the crash recovery worker to the freezable work queue to avoid
   interaction with other drivers during suspend & resume

- fix a couple of typos in comments

- add support for handling the audio DSP on SDM660

- fix a race between the Qualcomm wireless subsystem driver and the
   associated driver for the RF chip

* tag 'rproc-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/andersson/remoteproc:
  remoteproc: q6v5_pas: Add sdm660 ADSP PIL compatible
  dt-bindings: remoteproc: qcom: adsp: Add SDM660 ADSP
  remoteproc: use freezable workqueue for crash notifications
  remoteproc: fix kernel doc for struct rproc_ops
  remoteproc: fix an typo in fw_elf_get_class code comments
  remoteproc: qcom: wcnss: Fix race with iris probe

Merge tag 'backlight-next-5.15' of git://git./linux/kernel/git/lee/backlight

Pull backlight updates from Lee Jones:
"Fix-ups:
   - Improve bootloader/kernel device handover

  Bug Fixes:
   - Stabilise backlight in ktd253 driver"

* tag 'backlight-next-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight:
  backlight: pwm_bl: Improve bootloader/kernel device handover
  backlight: ktd253: Stabilize backlight

Merge tag 'mfd-next-5.15' of git://git./linux/kernel/git/lee/mfd

Pull MFD updates from Lee Jones:
"Core Frameworks:
   - Add support for registering devices via MFD cells to Simple MFD (I2C)

  New Drivers:
   - Add support for Renesas Synchronization Management Unit (SMU)

  New Device Support:
   - Add support for N5010 to Intel M10 BMC
   - Add support for Cannon Lake to Intel LPSS ACPI
   - Add support for Samsung SSG{1,2} to ST-Ericsson's U8500 family
   - Add support for TQMx110EB and TQMxE40x to TQ-Systems PLD TQMx86

  New Functionality:
   - Add support for GPIO to Intel LPC ICH
   - Add support for Reset to Texas Instruments TPS65086

  Fix-ups:
   - Trivial, sorting, whitespace, renaming, etc; mt6360-core, db8500-prcmu-regs, tqmx86
   - Device Tree fiddling; syscon, axp20x, qcom,pm8008, ti,tps65086, brcm,cru
   - Use proper APIs for IRQ map resolution; ab8500-core, stmpe, tc3589x, wm8994-irq
   - Pass 'supplied-from' property through axp288_fuel_gauge via swnode
   - Remove unused file entry; MAINTAINERS
   - Make interrupt line optional; tps65086
   - Rename db8500-cpuidle driver symbol; db8500-prcmu
   - Remove support for unused hardware; tqmx86
   - Provide a standard LPC clock frequency for unknown boards; tqmx86
   - Remove unused code; ti_am335x_tscadc
   - Use of_iomap() instead of ioremap(); syscon

  Bug Fixes:
   - Clear GPIO IRQ resource flags when no IRQ is set; tqmx86
   - Fix incorrect/misleading frequencies; db8500-prcmu
   - Mitigate namespace clash with other GPIOBASE users"

* tag 'mfd-next-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (31 commits)
  mfd: lpc_sch: Rename GPIOBASE to prevent build error
  mfd: syscon: Use of_iomap() instead of ioremap()
  dt-bindings: mfd: Add Broadcom CRU
  mfd: ti_am335x_tscadc: Delete superfluous error message
  mfd: tqmx86: Assume 24MHz LPC clock for unknown boards
  mfd: tqmx86: Add support for TQ-Systems DMI IDs
  mfd: tqmx86: Add support for TQMx110EB and TQMxE40x
  mfd: tqmx86: Fix typo in "platform"
  mfd: tqmx86: Remove incorrect TQMx90UC board ID
  mfd: tqmx86: Clear GPIO IRQ resource when no IRQ is set
  mfd: simple-mfd-i2c: Add support for registering devices via MFD cells
  mfd/cpuidle: ux500: Rename driver symbol
  mfd: tps65086: Add cell entry for reset driver
  mfd: tps65086: Make interrupt line optional
  dt-bindings: mfd: Convert tps65086.txt to YAML
  MAINTAINERS: Adjust ARM/NOMADIK/Ux500 ARCHITECTURES to file renaming
  mfd: db8500-prcmu: Handle missing FW variant
  mfd: db8500-prcmu: Rename register header
  mfd: axp20x: Add supplied-from property to axp288_fuel_gauge cell
  mfd: Don't use irq_create_mapping() to resolve a mapping
  ...

Merge tag 'gpio-updates-for-v5.15' of git://git./linux/kernel/git/brgl/linux

Pull gpio updates from Bartosz Golaszewski:
"We mostly have various improvements and refactoring all over the place
  but also some interesting new features - like the virtio GPIO driver
  that allows guest VMs to use host's GPIOs. We also have a new/old GPIO
  driver for rockchip - this one has been split out of the pinctrl
  driver.

  Summary:

   - new driver: gpio-virtio allowing a guest VM running linux to access
     GPIO lines provided by the host

   - split the GPIO driver out of the rockchip pin control driver

   - add support for a new model to gpio-aspeed-sgpio, refactor the
     driver and use generic device property interfaces, improve property
     sanitization

   - add ACPI support to gpio-tegra186

   - improve the code setting the line names to support multiple GPIO
     banks per device

   - constify a bunch of OF functions in the core GPIO code and make the
     declaration for one of the core OF functions we use consistent
     within its header

   - use software nodes in intel_quark_i2c_gpio

   - add support for the gpio-line-names property in gpio-mt7621

   - use the standard GPIO function for setting the GPIO names in
     gpio-brcmstb

   - fix a bunch of leaks and other bugs in gpio-mpc8xxx

   - use generic pm callbacks in gpio-ml-ioh

   - improve resource management and PM handling in gpio-mlxbf2

   - modernize and improve the gpio-dwapb driver

   - coding style improvements in gpio-rcar

   - documentation fixes and improvements

   - update the MAINTAINERS entry for gpio-zynq

   - minor tweaks in several drivers"

* tag 'gpio-updates-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux: (35 commits)
  gpio: mpc8xxx: Use 'devm_gpiochip_add_data()' to simplify the code and avoid a leak
  gpio: mpc8xxx: Fix a potential double iounmap call in 'mpc8xxx_probe()'
  gpio: mpc8xxx: Fix a resources leak in the error handling path of 'mpc8xxx_probe()'
  gpio: viperboard: remove platform_set_drvdata() call in probe
  gpio: virtio: Add missing mailings lists in MAINTAINERS entry
  gpio: virtio: Fix sparse warnings
  gpio: remove the obsolete MX35 3DS BOARD MC9S08DZ60 GPIO functions
  gpio: max730x: Use the right include
  gpio: Add virtio-gpio driver
  gpio: mlxbf2: Use DEFINE_RES_MEM_NAMED() helper macro
  gpio: mlxbf2: Use devm_platform_ioremap_resource()
  gpio: mlxbf2: Drop wrong use of ACPI_PTR()
  gpio: mlxbf2: Convert to device PM ops
  gpio: dwapb: Get rid of legacy platform data
  mfd: intel_quark_i2c_gpio: Convert GPIO to use software nodes
  gpio: dwapb: Read GPIO base from gpio-base property
  gpio: dwapb: Unify ACPI enumeration checks in get_irq() and configure_irqs()
  gpiolib: Deduplicate forward declaration in the consumer.h header
  MAINTAINERS: update gpio-zynq.yaml reference
  gpio: tegra186: Add ACPI support
  ...

Merge tag 'fuse-update-5.15' of git://git./linux/kernel/git/mszeredi/fuse

Pull fuse updates from Miklos Szeredi:

- Allow mounting an active fuse device. Previously the fuse device
   would always be mounted during initialization, and sharing a fuse
   superblock was only possible through mount or namespace cloning

- Fix data flushing in syncfs (virtiofs only)

- Fix data flushing in copy_file_range()

- Fix a possible deadlock in atomic O_TRUNC

- Misc fixes and cleanups

* tag 'fuse-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: remove unused arg in fuse_write_file_get()
  fuse: wait for writepages in syncfs
  fuse: flush extending writes
  fuse: truncate pagecache on atomic_o_trunc
  fuse: allow sharing existing sb
  fuse: move fget() to fuse_get_tree()
  fuse: move option checking into fuse_fill_super()
  fuse: name fs_context consistently
  fuse: fix use after free in fuse_read_interrupt()

Merge tag 'kgdb-5.15-rc1' of git://git./linux/kernel/git/danielt/linux

Pull kgdb updates from Daniel Thompson:
"Changes for kgdb/kdb this cycle are dominated by a change from Sumit
  that removes as small (256K) private heap from kdb. This is change
  I've hoped for ever since I discovered how few users of this heap
  remained in the kernel, so many thanks to Sumit for hunting these
  down.

  The other change is an incremental step towards SPDX headers"

* tag 'kgdb-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
  kernel: debug: Convert to SPDX identifier
  kdb: Rename members of struct kdbtab_t
  kdb: Simplify kdb_defcmd macro logic
  kdb: Get rid of redundant kdb_register_flags()
  kdb: Rename struct defcmd_set to struct kdb_macro
  kdb: Get rid of custom debug heap allocator

Revert "memcg: enable accounting for pollfd and select bits arrays"

This reverts commit b655843444152c0a14b749308e4cb35d91cbcf0b.

Just like with the memcg lock accounting, the kernel test robot reports
a sizeable performance regression for this commit, and while it clearly
does the rigth thing in theory, we'll need to look at just how to avoid
or minimize the performance overhead of the memcg accounting.

People already have suggestions on how to do that, but it's "future
work".

So revert it for now.

[ Note: the first link below is for this same commit but a different
commit ID, because it's the kernel test robot ended up noticing it in
Andrew Morton's patch queue ]

Link: https://lore.kernel.org/lkml/20210905132732.GC15026@xsang-OptiPlex-9020/
Link: https://lore.kernel.org/lkml/20210907150757.GE17617@xsang-OptiPlex-9020/
Acked-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Revert "memcg: enable accounting for file lock caches"

This reverts commit 0f12156dff2862ac54235fc72703f18770769042.

The kernel test robot reports a sizeable performance regression for this
commit, and while it clearly does the rigth thing in theory, we'll need
to look at just how to avoid or minimize the performance overhead of the
memcg accounting.

People already have suggestions on how to do that, but it's "future
work".

So revert it for now.

Link: https://lore.kernel.org/lkml/20210907150757.GE17617@xsang-OptiPlex-9020/
Acked-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Revert "mm/gup: remove try_get_page(), call try_get_compound_head() directly"

This reverts commit 9857a17f206ff374aea78bccfb687f145368be2e.

That commit was completely broken, and I should have caught on to it
earlier.  But happily, the kernel test robot noticed the breakage fairly
quickly.

The breakage is because "try_get_page()" is about avoiding the page
reference count overflow case, but is otherwise the exact same as a
plain "get_page()".

In contrast, "try_get_compound_head()" is an entirely different beast,
and uses __page_cache_add_speculative() because it's not just about the
page reference count, but also about possibly racing with the underlying
page going away.

So all the commentary about how

"try_get_page() has fallen a little behind in terms of maintenance,
  try_get_compound_head() handles speculative page references more
  thoroughly"

was just completely wrong: yes, try_get_compound_head() handles
speculative page references, but the point is that try_get_page() does
not, and must not.

So there's no lack of maintainance - there are fundamentally different
semantics.

A speculative page reference would be entirely wrong in "get_page()",
and it's entirely wrong in "try_get_page()".  It's not about
speculation, it's purely about "uhhuh, you can't get this page because
you've tried to increment the reference count too much already".

The reason the kernel test robot noticed this bug was that it hit the
VM_BUG_ON() in __page_cache_add_speculative(), which is all about
verifying that the context of any speculative page access is correct.
But since that isn't what try_get_page() is all about, the VM_BUG_ON()
tests things that are not correct to test for try_get_page().

Reported-by: kernel test robot <oliver.sang@intel.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'cpufreq/arm/linux-next' of git://git./linux/kernel/git/vireshk/pm

Pull more ARM cpufreq changes for v5.15-rc1 from Viresh Kumar:

"This adds a new cpufreq driver for Mediatek, which had been going
through reviews since last one year."

* 'cpufreq/arm/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm:
  cpufreq: mediatek-hw: Add support for CPUFREQ HW
  cpufreq: Add of_perf_domain_get_sharing_cpumask
  dt-bindings: cpufreq: add bindings for MediaTek cpufreq HW

Revert "cpufreq: intel_pstate: Process HWP Guaranteed change notification"

Revert commit d0e936adbd22 ("cpufreq: intel_pstate: Process HWP
Guaranteed change notification"), because it causes a NULL pointer
dereference to occur on Lenovo X1 gen9 laptops due to an HWP
guaranteed performance change interrupt arriving prematurely.

This feature will be revisited in the next cycle.

Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

ieee802154: Remove redundant initialization of variable ret

The variable ret is being initialized with a value that is never read, it
is being updated later on. The assignment is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'stmmac-wol-fix'

Joakim Zhang says:

====================
net: stmmac: fix WoL issue

This patch set fixes stmmac not working after system resume back with WoL
active. Thanks a lot for Russell King keeps looking into this issue.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: fix MAC not working when system resume back with WoL active

We can reproduce this issue with below steps:
1) enable WoL on the host
2) host system suspended
3) remote client send out wakeup packets
We can see that host system resume back, but can't work, such as ping failed.

After a bit digging, this issue is introduced by the commit 46f69ded988d
("net: stmmac: Use resolved link config in mac_link_up()"), which use
the finalised link parameters in mac_link_up() rather than the
parameters in mac_config().

There are two scenarios for MAC suspend/resume in STMMAC driver:

1) MAC suspend with WoL inactive, stmmac_suspend() call
phylink_mac_change() to notify phylink machine that a change in MAC
state, then .mac_link_down callback would be invoked. Further, it will
call phylink_stop() to stop the phylink instance. When MAC resume back,
firstly phylink_start() is called to start the phylink instance, then
call phylink_mac_change() which will finally trigger phylink machine to
invoke .mac_config and .mac_link_up callback. All is fine since
configuration in these two callbacks will be initialized, that means MAC
can restore the state.

2) MAC suspend with WoL active, phylink_mac_change() will put link
down, but there is no phylink_stop() to stop the phylink instance, so it
will link up again, that means .mac_config and .mac_link_up would be
invoked before system suspended. After system resume back, it will do
DMA initialization and SW reset which let MAC lost the hardware setting
(i.e MAC_Configuration register(offset 0x0) is reset). Since link is up
before system suspended, so .mac_link_up would not be invoked after
system resume back, lead to there is no chance to initialize the
configuration in .mac_link_up callback, as a result, MAC can't work any
longer.

After discussed with Russell King [1], we confirm that phylink framework
have not take WoL into consideration yet. This patch calls
phylink_suspend()/phylink_resume() functions which is newly introduced
by Russell King to fix this issue.

[1] https://lore.kernel.org/netdev/20210901090228.11308-1-qiangqing.zhang@nxp.com/

Fixes: 46f69ded988d ("net: stmmac: Use resolved link config in mac_link_up()")
Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phylink: add suspend/resume support

Joakim Zhang reports that Wake-on-Lan with the stmmac ethernet driver broke
when moving the incorrect handling of mac link state out of mac_config().
This reason this breaks is because the stmmac's WoL is handled by the MAC
rather than the PHY, and phylink doesn't cater for that scenario.

This patch adds the necessary phylink code to handle suspend/resume events
according to whether the MAC still needs a valid link or not. This is the
barest minimum for this support.

Reported-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Tested-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: renesas: sh_eth: Fix freeing wrong tx descriptor

The cur_tx counter must be incremented after TACT bit of
txdesc->status was set. However, a CPU is possible to reorder
instructions and/or memory accesses between cur_tx and
txdesc->status. And then, if TX interrupt happened at such a
timing, the sh_eth_tx_free() may free the descriptor wrongly.
So, add wmb() before cur_tx++.
Otherwise NETDEV WATCHDOG timeout is possible to happen.

Fixes: 86a74ff21a7a ("net: sh_eth: add support for Renesas SuperH Ethernet")
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bonding: 3ad: pass parameter bond_params by reference

The parameter bond_params is a relatively large 192 byte sized
struct so pass it by reference rather than by value to reduce
copying.

Addresses-Coverity: ("Big parameter passed by value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'linux-can-fixes-for-5.15-20210907' of git://git./linux/kernel/git/mkl/linux-can

linux-can-fixes-for-5.15-20210907

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'wireless-drivers-2021-09-07' of git://git./linux/kernel/git/kvalo/wireless-drivers

Kalle Valo says:

====================
wireless-drivers fixes for v5.15

First set of fixes for v5.15 and only iwlwifi patches this time. Most
important being support for new hardware and new firmware API.

I had already earlier applied a fix which also Linus applied to this
tree as commit 1476ff21abb4 ("iwl: fix debug printf format strings"),
but this doesn't seem to cause any conflicts so I left it there.

iwlwifi

* add support for firmware API 66

* add support for Samsung Galaxy Book Flex2 Alpha

* fix a leak happening every time module is loaded

* fix a printk compiler warning
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb3: fix oops on module removal

When removing the driver module w/o bringing an interface up before
the error below occurs. Reason seems to be that cancel_work_sync() is
called in t3_sge_stop() for a queue that hasn't been initialized yet.

[10085.941785] ------------[ cut here ]------------
[10085.941799] WARNING: CPU: 1 PID: 5850 at kernel/workqueue.c:3074 __flush_work+0x3ff/0x480
[10085.941819] Modules linked in: vfat snd_hda_codec_hdmi fat snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio led_class ee1004 iTCO_
wdt intel_tcc_cooling x86_pkg_temp_thermal coretemp aesni_intel crypto_simd cryptd snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core r
8169 snd_pcm realtek mdio_devres snd_timer snd i2c_i801 i2c_smbus libphy i915 i2c_algo_bit cxgb3(-) intel_gtt ttm mdio drm_kms_helper mei_me s
yscopyarea sysfillrect sysimgblt mei fb_sys_fops acpi_pad sch_fq_codel crypto_user drm efivarfs ext4 mbcache jbd2 crc32c_intel
[10085.941944] CPU: 1 PID: 5850 Comm: rmmod Not tainted 5.14.0-rc7-next-20210826+ #6
[10085.941974] Hardware name: System manufacturer System Product Name/PRIME H310I-PLUS, BIOS 2603 10/21/2019
[10085.941992] RIP: 0010:__flush_work+0x3ff/0x480
[10085.942003] Code: c0 74 6b 65 ff 0d d1 bd 78 75 e8 bc 2f 06 00 48 c7 c6 68 b1 88 8a 48 c7 c7 e0 5f b4 8b 45 31 ff e8 e6 66 04 00 e9 4b fe ff ff <0f> 0b 45 31 ff e9 41 fe ff ff e8 72 c1 79 00 85 c0 74 87 80 3d 22
[10085.942036] RSP: 0018:ffffa1744383fc08 EFLAGS: 00010246
[10085.942048] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000923
[10085.942062] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff91c901710a88
[10085.942076] RBP: ffffa1744383fce8 R08: 0000000000000001 R09: 0000000000000001
[10085.942090] R10: 00000000000000c2 R11: 0000000000000000 R12: ffff91c901710a88
[10085.942104] R13: 0000000000000000 R14: ffff91c909a96100 R15: 0000000000000001
[10085.942118] FS:  00007fe417837740(0000) GS:ffff91c969d00000(0000) knlGS:0000000000000000
[10085.942134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10085.942146] CR2: 000055a8d567ecd8 CR3: 0000000121690003 CR4: 00000000003706e0
[10085.942160] Call Trace:
[10085.942166]  ? __lock_acquire+0x3af/0x22e0
[10085.942177]  ? cancel_work_sync+0xb/0x10
[10085.942187]  __cancel_work_timer+0x128/0x1b0
[10085.942197]  ? __pm_runtime_resume+0x5b/0x90
[10085.942208]  cancel_work_sync+0xb/0x10
[10085.942217]  t3_sge_stop+0x2f/0x50 [cxgb3]
[10085.942234]  remove_one+0x26/0x190 [cxgb3]
[10085.942248]  pci_device_remove+0x39/0xa0
[10085.942258]  __device_release_driver+0x15e/0x240
[10085.942269]  driver_detach+0xd9/0x120
[10085.942278]  bus_remove_driver+0x53/0xd0
[10085.942288]  driver_unregister+0x2c/0x50
[10085.942298]  pci_unregister_driver+0x31/0x90
[10085.942307]  cxgb3_cleanup_module+0x10/0x18c [cxgb3]
[10085.942324]  __do_sys_delete_module+0x191/0x250
[10085.942336]  ? syscall_enter_from_user_mode+0x21/0x60
[10085.942347]  ? trace_hardirqs_on+0x2a/0xe0
[10085.942357]  __x64_sys_delete_module+0x13/0x20
[10085.942368]  do_syscall_64+0x40/0x90
[10085.942377]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[10085.942389] RIP: 0033:0x7fe41796323b

Fixes: 5e0b8928927f ("net:cxgb3: replace tasklets with works")
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mfd: lpc_sch: Rename GPIOBASE to prevent build error

One MIPS platform (mach-rc32434) defines GPIOBASE. This macro
conflicts with one of the same name in lpc_sch.c. Rename the latter one
to prevent the build error.

../drivers/mfd/lpc_sch.c:25: error: "GPIOBASE" redefined [-Werror]
25 | #define GPIOBASE 0x44
../arch/mips/include/asm/mach-rc32434/rb.h:32: note: this is the location of the previous definition
32 | #define GPIOBASE 0x050000

Cc: Denis Turischev <denis@compulab.co.il>
Fixes: e82c60ae7d3a ("mfd: Introduce lpc_sch for Intel SCH LPC bridge")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Lee Jones <lee.jones@linaro.org>

mfd: syscon: Use of_iomap() instead of ioremap()

This automatically selects between ioremap() and ioremap_np() on
platforms that require it, such as Apple SoCs.

Signed-off-by: Hector Martin <marcan@marcan.st>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Lee Jones <lee.jones@linaro.org>

can: c_can: fix null-ptr-deref on ioctl()

The pdev maybe not a platform device, e.g. c_can_pci device, in this
case, calling to_platform_device() would not make sense. Also, per the
comment in drivers/net/can/c_can/c_can_ethtool.c, @bus_info should
match dev_name() string, so I am replacing this with dev_name() to fix
this issue.

[    1.458583] BUG: unable to handle page fault for address: 0000000100000000
[    1.460921] RIP: 0010:strnlen+0x1a/0x30
[    1.466336]  ? c_can_get_drvinfo+0x65/0xb0 [c_can]
[    1.466597]  ethtool_get_drvinfo+0xae/0x360
[    1.466826]  dev_ethtool+0x10f8/0x2970
[    1.467880]  sock_ioctl+0xef/0x300

Fixes: 2722ac986e93 ("can: c_can: add ethtool support")
Link: https://lore.kernel.org/r/20210906233704.1162666-1-ztong0001@gmail.com
Cc: stable@vger.kernel.org # 5.14+
Signed-off-by: Tong Zhang <ztong0001@gmail.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: rcar_canfd: add __maybe_unused annotation to silence warning

Since commit

| dd3bd23eb438 ("can: rcar_canfd: Add Renesas R-Car CAN FD driver")

the rcar_canfd driver can be compile tested on all architectures. On
non OF enabled archs, or archs where OF is optional (and disabled in
the .config) the compilation throws the following warning:

| drivers/net/can/rcar/rcar_canfd.c:2020:34: warning: unused variable 'rcar_canfd_of_table' [-Wunused-const-variable]
| static const struct of_device_id rcar_canfd_of_table[] = {
| ^

This patch fixes the warning by marking the variable
rcar_canfd_of_table as __maybe_unused.

Fixes: ac4224087312 ("can: rcar: Kconfig: Add helper dependency on COMPILE_TEST")
Fixes: dd3bd23eb438 ("can: rcar_canfd: Add Renesas R-Car CAN FD driver")
Link: https://lore.kernel.org/all/20210907064537.1054268-1-mkl@pengutronix.de
Cc: linux-renesas-soc@vger.kernel.org
Cc: Cai Huoqing <caihuoqing@baidu.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

docs: pdfdocs: Fix typo in CJK-language specific font settings

There were typos in the fallback definitions of dummy LaTeX macros
for systems without CJK fonts.
They cause build errors in "make pdfdocs" on such systems.
Fix them.

Fixes: e291ff6f5a03 ("docs: pdfdocs: Add CJK-language-specific font settings")
Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
Link: https://lore.kernel.org/r/ad7615a5-f8fa-2bc3-de6b-7ed49d458964@gmail.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

thunderbolt: test: split up test cases in tb_test_credit_alloc_all

The tb_test_credit_alloc_all() function had a huge number of
KUNIT_ASSERT() statements, all of which (though the magic of many many
layers of inscrutable macros) ended up allocating and initializing
various test assertion structures on the stack.

Don't do that.  The kernel stack isn't infinite, and we have compiler
warnings (now errors) for the case where a stack frame grows too large.

Like it did here, by not an inconsiderable margin:

   drivers/thunderbolt/test.c: In function ‘tb_test_credit_alloc_all’:
   drivers/thunderbolt/test.c:2367:1: error: the frame size of 4500 bytes is larger than 2048 bytes [-Werror=frame-larger-than=]
    2367 | }
         | ^

Solve this similarly to the lib/test_scanf case: split out the tests
into several smaller functions, each just testing one particular tunnel
credit allocation.

This makes the i386 allyesconfig build work for me again.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

lib/test_scanf: split up number parsing test routines

It turns out that gcc has real trouble merging all the temporary
on-stack buffer allocation.  So despite the fact that their lifetimes do
not overlap, gcc will allocate stack for all of them when they have
different types.  Which they do in the number scanning test routines.

This is unfortunate in general, but with lots of test-cases in one
function, it becomes a real problem.  gcc will allocate a huge stack
frame for no actual good reason.

We have tried to counteract this tendency of gcc not merging stack slots
(see "-fconserve-stack"), but that has limited effect (and should be on
by default these days, iirc).

So with all the debug options enabled on an i386 allmodconfig build, we
end up with overly big stack frames, and the resulting stack frame size
warnings (now errors):

   lib/test_scanf.c: In function ‘numbers_list_field_width_val_width’:
   lib/test_scanf.c:530:1: error: the frame size of 2088 bytes is larger than 2048 bytes [-Werror=frame-larger-than=]
     530 | }
         | ^
   lib/test_scanf.c: In function ‘numbers_list_field_width_typemax’:
   lib/test_scanf.c:488:1: error: the frame size of 2568 bytes is larger than 2048 bytes [-Werror=frame-larger-than=]
     488 | }
         | ^
   lib/test_scanf.c: In function ‘numbers_list’:
   lib/test_scanf.c:437:1: error: the frame size of 2088 bytes is larger than 2048 bytes [-Werror=frame-larger-than=]
     437 | }
         | ^

In this particular case, the reasonably straightforward solution is to
just split out the test routines into multiple more targeted versions.
That way we don't have one huge stack, but several smaller ones, and
they aren't active all at the same time.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

iwl: fix debug printf format strings

The variable 'package_size' is an unsigned long, and should be printed
out using '%lu', not '%zd' (that would be for a size_t).

Yes, on many architectures (including x86-64), 'size_t' is in fact the
same type as 'long', but that's a fairly random architecture definition,
and on some platforms 'size_t' is in fact 'int' rather than 'long'.

That is the case on traditional 32-bit x86. Yes, both types are the
exact same 32-bit size, and it would all print out perfectly correctly,
but '%zd' ends up still being wrong.

And we can't make 'package_size' be a 'size_t', because we get the
actual value using efivar_entry_get() that takes a pointer to an
'unsigned long'. So '%lu' it is.

This fixes two of the i386 allmodconfig build warnings (that is now an
error due to -Werror).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.15-2021-09-05' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
"Was going to send this one in later this week, but given that -Werror
  is now enabled (or at least available), the mq-deadline fix really
  should go in for the folks hitting that.

   - Ensure dd_queued() is only there if needed (Geert)

   - Fix a kerneldoc warning for bio_alloc_kiocb()

   - BFQ fix for queue merging

   - loop locking fix (Tetsuo)"

* tag 'block-5.15-2021-09-05' of git://git.kernel.dk/linux-block:
  loop: reduce the loop_ctl_mutex scope
  bio: fix kerneldoc documentation for bio_alloc_kiocb()
  block, bfq: honor already-setup queue merges
  block/mq-deadline: Move dd_queued() to fix defined but not used warning

Merge tag 'misc-5.15-2021-09-05' of git://git.kernel.dk/linux-block

Pull CDROM maintainer update from Jens Axboe:
"It's been about 22 years since I originally started maintaining the
  CDROM code, and I just haven't been able to even get reviews done in a
  timely fashion the last handful of years.

  Time to pass it on, and Phillip has volunteered take over these
  duties. I'll be helping as needed for the foreseeable future"

* tag 'misc-5.15-2021-09-05' of git://git.kernel.dk/linux-block:
  cdrom: update uniform CD-ROM maintainership in MAINTAINERS file

Merge tag 'libata-5.15-2021-09-05' of git://git.kernel.dk/linux-block

Pull libata fixes from Jens Axboe:
"Fixes for queued trim on certain Samsung SSDs, in conjunction with
  certain ATI controllers"

* tag 'libata-5.15-2021-09-05' of git://git.kernel.dk/linux-block:
  libata: Add ATA_HORKAGE_NO_NCQ_ON_ATI for Samsung 860 and 870 SSD.
  libata: add ATA_HORKAGE_NO_NCQ_TRIM for Samsung 860 and 870 SSDs

Merge tag 'for-5.15/io_uring-2021-09-04' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
"As sometimes happens, two reports came in around the merge window open
  that led to some fixes. Hence this one is a bit bigger than usual
  followup fixes, but most of it will be going towards stable, outside
  of the fixes that are addressing regressions from this merge window.

  In detail:

   - postgres is a heavy user of signals between tasks, and if we're
     unlucky this can interfere with io-wq worker creation. Make sure
     we're resilient against unrelated signal handling. This set of
     changes also includes hardening against allocation failures, which
     could previously had led to stalls.

   - Some use cases that end up having a mix of bounded and unbounded
     work would have starvation issues related to that. Split the
     pending work lists to handle that better.

   - Completion trace int -> unsigned -> long fix

   - Fix issue with REGISTER_IOWQ_MAX_WORKERS and SQPOLL

   - Fix regression with hash wait lock in this merge window

   - Fix retry issued on block devices (Ming)

   - Fix regression with links in this merge window (Pavel)

   - Fix race with multi-shot poll and completions (Xiaoguang)

   - Ensure regular file IO doesn't inadvertently skip completion
     batching (Pavel)

   - Ensure submissions are flushed after running task_work (Pavel)"

* tag 'for-5.15/io_uring-2021-09-04' of git://git.kernel.dk/linux-block:
  io_uring: io_uring_complete() trace should take an integer
  io_uring: fix possible poll event lost in multi shot mode
  io_uring: prolong tctx_task_work() with flushing
  io_uring: don't disable kiocb_done() CQE batching
  io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL
  io-wq: make worker creation resilient against signals
  io-wq: get rid of FIXED worker flag
  io-wq: only exit on fatal signals
  io-wq: split bounded and unbounded work into separate lists
  io-wq: fix queue stalling race
  io_uring: don't submit half-prepared drain request
  io_uring: fix queueing half-created requests
  io-wq: ensure that hash wait lock is IRQ disabling
  io_uring: retry in case of short read on block device
  io_uring: IORING_OP_WRITE needs hash_reg_file set
  io-wq: fix race between adding work and activating a free worker

don't make the syscall checking produce errors from warnings

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

net: wwan: iosm: Unify IO accessors used in the driver

Currently we have readl()/writel()/ioread*()/iowrite*() APIs in use.
Let's unify to use only ioread*()/iowrite*() variants.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: wwan: iosm: Replace io.*64_lo_hi() with regular accessors

The io.*_lo_hi() variants are not strictly needed on the x86 hardware
and especially the PCI bus. Replace them with regular accessors, but
leave headers in place in case of 32-bit build.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: qcom/emac: Replace strlcpy with strscpy

The strlcpy should not be used because it doesn't limit the source
length. As linus says, it's a completely useless function if you
can't implicitly trust the source string - but that is almost always
why people think they should use it! All in all the BSD function
will lead some potential bugs.

But the strscpy doesn't require reading memory from the src string
beyond the specified "count" bytes, and since the return value is
easier to error-check than strlcpy()'s. In addition, the implementation
is robust to the string changing out from underneath it, unlike the
current strlcpy() implementation.

Thus, We prefer using strscpy instead of strlcpy.

Signed-off-by: Jason Wang <wangborong@cdjrlc.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ip6_gre: Revert "ip6_gre: add validation for csum_start"

This reverts commit 9cf448c200ba9935baa94e7a0964598ce947db9d.

This commit was added for equivalence with a similar fix to ip_gre.
That fix proved to have a bug. Upon closer inspection, ip6_gre is not
susceptible to the original bug.

So revert the unnecessary extra check.

In short, ipgre_xmit calls skb_pull to remove ipv4 headers previously
inserted by dev_hard_header. ip6gre_tunnel_xmit does not.

Link: https://lore.kernel.org/netdev/CA+FuTSe+vJgTVLc9SojGuN-f9YQ+xWLPKE_S4f=f+w+_P2hgUg@mail.gmail.com/#t
Fixes: 9cf448c200ba ("ip6_gre: add validation for csum_start")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

kernel: debug: Convert to SPDX identifier

use SPDX-License-Identifier instead of a verbose license text

Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
Link: https://lore.kernel.org/r/20210906112302.937-1-caihuoqing@baidu.com
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>

KVM: Drop unused kvm_dirty_gfn_invalid()

Drop the unused function as reported by test bot.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210901230904.15164-1-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

net: hns3: make hclgevf_cmd_caps_bit_map0 and hclge_cmd_caps_bit_map0 static

This symbols is not used outside of hclge_cmd.c and hclgevf_cmd.c, so marks
it static.

Fix the following sparse warning:

drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.c:345:35:
warning: symbol 'hclgevf_cmd_caps_bit_map0' was not declared. Should it
be static?

drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c:365:33: warning:
symbol 'hclge_cmd_caps_bit_map0' was not declared. Should it be static?

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: chongjiapeng <jiapeng.chong@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bonding-fix'

Jussi Maki says:

====================
bonding: Fix negative jump count reported by syzbot

This patch set fixes a negative jump count warning encountered by
syzbot [1] and extends the tests to cover nested bonding devices.

[1]: https://lore.kernel.org/lkml/0000000000000a9f3605cb1d2455@google.com/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

selftests/bpf: Test XDP bonding nest and unwind

Modify the test to check that enslaving a bond slave with a XDP program
is now allowed.

Extend attach test to exercise the program unwinding in bond_xdp_set and
add a new test for loading XDP program on doubly nested bond device to
verify that static key incr/decr is correct.

Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bonding: Fix negative jump label count on nested bonding

With nested bonding devices the nested bond device's ndo_bpf was
called without a program causing it to decrement the static key
without a prior increment leading to negative count.

Fix the issue by 1) only calling slave's ndo_bpf when there's a
program to be loaded and 2) only decrement the count when a program
is unloaded.

Fixes: 9e2ee5c7e7c3 ("net, bonding: Add XDP support to the bonding driver")
Reported-by: syzbot+30622fb04ddd72a4d167@syzkaller.appspotmail.com
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

MAINTAINERS: add VM SOCKETS (AF_VSOCK) entry

Add a new entry for VM Sockets (AF_VSOCK) that covers vsock core,
tests, and headers. Move some general vsock stuff from virtio-vsock
entry into this new more general vsock entry.

I've been reviewing and contributing for the last few years,
so I'm available to help maintain this code.

Cc: Dexuan Cui <decui@microsoft.com>
Cc: Jorgen Hansen <jhansen@vmware.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: dwmac-loongson:Fix missing return value

Add the return value when phy_mode < 0.

Signed-off-by: zhaoxiao <long870912@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fuse: remove unused arg in fuse_write_file_get()

The struct fuse_conn argument is not used and can be removed.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: wait for writepages in syncfs

In case of fuse the MM subsystem doesn't guarantee that page writeback
completes by the time ->sync_fs() is called.  This is because fuse
completes page writeback immediately to prevent DoS of memory reclaim by
the userspace file server.

This means that fuse itself must ensure that writes are synced before
sending the SYNCFS request to the server.

Introduce sync buckets, that hold a counter for the number of outstanding
write requests.  On syncfs replace the current bucket with a new one and
wait until the old bucket's counter goes down to zero.

It is possible to have multiple syncfs calls in parallel, in which case
there could be more than one waited-on buckets.  Descendant buckets must
not complete until the parent completes.  Add a count to the child (new)
bucket until the (parent) old bucket completes.

Use RCU protection to dereference the current bucket and to wake up an
emptied bucket.  Use fc->lock to protect against parallel assignments to
the current bucket.

This leaves just the counter to be a possible scalability issue.  The
fc->num_waiting counter has a similar issue, so both should be addressed at
the same time.

Reported-by: Amir Goldstein <amir73il@gmail.com>
Fixes: 2d82ab251ef0 ("virtiofs: propagate sync() to file server")
Cc: <stable@vger.kernel.org> # v5.14
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

KVM: x86: Update vCPU's hv_clock before back to guest when tsc_offset is adjusted

When MSR_IA32_TSC_ADJUST is written by guest due to TSC ADJUST feature
especially there's a big tsc warp (like a new vCPU is hot-added into VM
which has been up for a long time), tsc_offset is added by a large value
then go back to guest. This causes system time jump as tsc_timestamp is
not adjusted in the meantime and pvclock monotonic character.
To fix this, just notify kvm to update vCPU's guest time before back to
guest.

Cc: stable@vger.kernel.org
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <1619576521-81399-2-git-send-email-zelin.deng@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: MMU: mark role_regs and role accessors as maybe unused

It is reasonable for these functions to be used only in some configurations,
for example only if the host is 64-bits (and therefore supports 64-bit
guests). It is also reasonable to keep the role_regs and role accessors
in sync even though some of the accessors may be used only for one of the
two sets (as is the case currently for CR4.LA57)..

Because clang reports warnings for unused inlines declared in a .c file,
mark both sets of accessors as __maybe_unused.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: MIPS: Remove a "set but not used" variable

This fix a build warning:

   arch/mips/kvm/vz.c: In function '_kvm_vz_restore_htimer':
>> arch/mips/kvm/vz.c:392:10: warning: variable 'freeze_time' set but not used [-Wunused-but-set-variable]
     392 |  ktime_t freeze_time;
         |          ^~~~~~~~~~~

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Message-Id: <20210406024911.2008046-1-chenhuacai@loongson.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Merge tag 'kvmarm-5.15' of git://git./linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 updates for 5.15

- Page ownership tracking between host EL1 and EL2

- Rely on userspace page tables to create large stage-2 mappings

- Fix incompatibility between pKVM and kmemleak

- Fix the PMU reset state, and improve the performance of the virtual PMU

- Move over to the generic KVM entry code

- Address PSCI reset issues w.r.t. save/restore

- Preliminary rework for the upcoming pKVM fixed feature

- A bunch of MM cleanups

- a vGIC fix for timer spurious interrupts

- Various cleanups

Merge tag 'kvm-s390-next-5.15-1' of git://git./linux/kernel/git/kvms390/linux into HEAD

KVM: s390: Fix and feature for 5.15

- enable interpretion of specification exceptions
- fix a vcpu_idx vs vcpu_id mixup

x86/kvm: Don't enable IRQ when IRQ enabled in kvm_wait

Commit f4e61f0c9add3 ("x86/kvm: Fix broken irq restoration in kvm_wait")
replaced "local_irq_restore() when IRQ enabled" with "local_irq_enable()
when IRQ enabled" to suppress a warnning.

Although there is no similar debugging warnning for doing local_irq_enable()
when IRQ enabled as doing local_irq_restore() in the same IRQ situation. But
doing local_irq_enable() when IRQ enabled is no less broken as doing
local_irq_restore() and we'd better avoid it.

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20210814035129.154242-1-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: stats: Add VM stat for remote tlb flush requests

Add a new stat that counts the number of times a remote TLB flush is
requested, regardless of whether it kicks vCPUs out of guest mode. This
allows us to look at how often flushes are initiated.

Unlike remote_tlb_flush, this one applies to ARM's instruction-set-based
TLB flush implementation, so apply it there too.

Original-by: David Matlack <dmatlack@google.com>
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210817002639.3856694-1-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: Remove unnecessary export of kvm_{inc,dec}_notifier_count()

Don't export KVM's MMU notifier count helpers, under no circumstance
should any downstream module, including x86's vendor code, have a
legitimate reason to piggyback KVM's MMU notifier logic. E.g in the x86
case, only KVM's MMU should be elevating the notifier count, and that
code is always built into the core kvm.ko module.

Fixes: edb298c663fc ("KVM: x86/mmu: bump mmu notifier count in kvm_zap_gfn_range")
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210902175951.1387989-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: x86/mmu: Move lpage_disallowed_link further "down" in kvm_mmu_page

Move "lpage_disallowed_link" out of the first 64 bytes, i.e. out of the
first cache line, of kvm_mmu_page so that "spt" and to a lesser extent
"gfns" land in the first cache line. "lpage_disallowed_link" is accessed
relatively infrequently compared to "spt", which is accessed any time KVM
is walking and/or manipulating the shadow page tables.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210901221023.1303578-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: x86/mmu: Relocate kvm_mmu_page.tdp_mmu_page for better cache locality

Move "tdp_mmu_page" into the 1-byte void left by the recently removed
"mmio_cached" so that it resides in the first 64 bytes of kvm_mmu_page,
i.e. in the same cache line as the most commonly accessed fields.

Don't bother wrapping tdp_mmu_page in CONFIG_X86_64, including the field in
32-bit builds doesn't affect the size of kvm_mmu_page, and a future patch
can always wrap the field in the unlikely event KVM gains a 1-byte flag
that is 32-bit specific.

Note, the size of kvm_mmu_page is also unchanged on CONFIG_X86_64=y due
to it previously sharing an 8-byte chunk with write_flooding_count.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210901221023.1303578-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Revert "KVM: x86: mmu: Add guest physical address check in translate_gpa()"

Revert a misguided illegal GPA check when "translating" a non-nested GPA.
The check is woefully incomplete as it does not fill in @exception as
expected by all callers, which leads to KVM attempting to inject a bogus
exception, potentially exposing kernel stack information in the process.

WARNING: CPU: 0 PID: 8469 at arch/x86/kvm/x86.c:525 exception_type+0x98/0xb0 arch/x86/kvm/x86.c:525
CPU: 1 PID: 8469 Comm: syz-executor531 Not tainted 5.14.0-rc7-syzkaller #0
RIP: 0010:exception_type+0x98/0xb0 arch/x86/kvm/x86.c:525
Call Trace:
  x86_emulate_instruction+0xef6/0x1460 arch/x86/kvm/x86.c:7853
  kvm_mmu_page_fault+0x2f0/0x1810 arch/x86/kvm/mmu/mmu.c:5199
  handle_ept_misconfig+0xdf/0x3e0 arch/x86/kvm/vmx/vmx.c:5336
  __vmx_handle_exit arch/x86/kvm/vmx/vmx.c:6021 [inline]
  vmx_handle_exit+0x336/0x1800 arch/x86/kvm/vmx/vmx.c:6038
  vcpu_enter_guest+0x2a1c/0x4430 arch/x86/kvm/x86.c:9712
  vcpu_run arch/x86/kvm/x86.c:9779 [inline]
  kvm_arch_vcpu_ioctl_run+0x47d/0x1b20 arch/x86/kvm/x86.c:10010
  kvm_vcpu_ioctl+0x49e/0xe50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3652

The bug has escaped notice because practically speaking the GPA check is
useless.  The GPA check in question only comes into play when KVM is
walking guest page tables (or "translating" CR3), and KVM already handles
illegal GPA checks by setting reserved bits in rsvd_bits_mask for each
PxE, or in the case of CR3 for loading PTDPTRs, manually checks for an
illegal CR3.  This particular failure doesn't hit the existing reserved
bits checks because syzbot sets guest.MAXPHYADDR=1, and IA32 architecture
simply doesn't allow for such an absurd MAXPHYADDR, e.g. 32-bit paging
doesn't define any reserved PA bits checks, which KVM emulates by only
incorporating the reserved PA bits into the "high" bits, i.e. bits 63:32.

Simply remove the bogus check.  There is zero meaningful value and no
architectural justification for supporting guest.MAXPHYADDR < 32, and
properly filling the exception would introduce non-trivial complexity.

This reverts commit ec7771ab471ba6a945350353617e2e3385d0e013.

Fixes: ec7771ab471b ("KVM: x86: mmu: Add guest physical address check in translate_gpa()")
Cc: stable@vger.kernel.org
Reported-by: syzbot+200c08e88ae818f849ce@syzkaller.appspotmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210831164224.1119728-2-seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>