git.monstr.eu Git - linux-2.6-microblaze.git/log

Revert "dm: always call blk_queue_split() in dm_process_bio()"

This reverts commit effd58c95f277744f75d6e08819ac859dbcbd351.

blk_queue_split() is causing excessive IO splitting -- because
blk_max_size_offset() depends on 'chunk_sectors' limit being set and
if it isn't (as is the case for DM targets!) it falls back to
splitting on a 'max_sectors' boundary regardless of offset.

"Fix" this by reverting back to _not_ using blk_queue_split() in
dm_process_bio() for normal IO (reads and writes).  Long-term fix is
still TBD but it should focus on training blk_max_size_offset() to
call into a DM provided hook (to call DM's max_io_len()).

Test results from simple misaligned IO test on 4-way dm-striped device
with chunksize of 128K and stripesize of 512K:

xfs_io -d -c 'pread -b 2m 224s 4072s' /dev/mapper/stripe_dev

before this revert:

253,0   21        1     0.000000000  2206  Q   R 224 + 4072 [xfs_io]
253,0   21        2     0.000008267  2206  X   R 224 / 480 [xfs_io]
253,0   21        3     0.000010530  2206  X   R 224 / 256 [xfs_io]
253,0   21        4     0.000027022  2206  X   R 480 / 736 [xfs_io]
253,0   21        5     0.000028751  2206  X   R 480 / 512 [xfs_io]
253,0   21        6     0.000033323  2206  X   R 736 / 992 [xfs_io]
253,0   21        7     0.000035130  2206  X   R 736 / 768 [xfs_io]
253,0   21        8     0.000039146  2206  X   R 992 / 1248 [xfs_io]
253,0   21        9     0.000040734  2206  X   R 992 / 1024 [xfs_io]
253,0   21       10     0.000044694  2206  X   R 1248 / 1504 [xfs_io]
253,0   21       11     0.000046422  2206  X   R 1248 / 1280 [xfs_io]
253,0   21       12     0.000050376  2206  X   R 1504 / 1760 [xfs_io]
253,0   21       13     0.000051974  2206  X   R 1504 / 1536 [xfs_io]
253,0   21       14     0.000055881  2206  X   R 1760 / 2016 [xfs_io]
253,0   21       15     0.000057462  2206  X   R 1760 / 1792 [xfs_io]
253,0   21       16     0.000060999  2206  X   R 2016 / 2272 [xfs_io]
253,0   21       17     0.000062489  2206  X   R 2016 / 2048 [xfs_io]
253,0   21       18     0.000066133  2206  X   R 2272 / 2528 [xfs_io]
253,0   21       19     0.000067507  2206  X   R 2272 / 2304 [xfs_io]
253,0   21       20     0.000071136  2206  X   R 2528 / 2784 [xfs_io]
253,0   21       21     0.000072764  2206  X   R 2528 / 2560 [xfs_io]
253,0   21       22     0.000076185  2206  X   R 2784 / 3040 [xfs_io]
253,0   21       23     0.000077486  2206  X   R 2784 / 2816 [xfs_io]
253,0   21       24     0.000080885  2206  X   R 3040 / 3296 [xfs_io]
253,0   21       25     0.000082316  2206  X   R 3040 / 3072 [xfs_io]
253,0   21       26     0.000085788  2206  X   R 3296 / 3552 [xfs_io]
253,0   21       27     0.000087096  2206  X   R 3296 / 3328 [xfs_io]
253,0   21       28     0.000093469  2206  X   R 3552 / 3808 [xfs_io]
253,0   21       29     0.000095186  2206  X   R 3552 / 3584 [xfs_io]
253,0   21       30     0.000099228  2206  X   R 3808 / 4064 [xfs_io]
253,0   21       31     0.000101062  2206  X   R 3808 / 3840 [xfs_io]
253,0   21       32     0.000104956  2206  X   R 4064 / 4096 [xfs_io]
253,0   21       33     0.001138823     0  C   R 4096 + 200 [0]

after this revert:

253,0   18        1     0.000000000  4430  Q   R 224 + 3896 [xfs_io]
253,0   18        2     0.000018359  4430  X   R 224 / 256 [xfs_io]
253,0   18        3     0.000028898  4430  X   R 256 / 512 [xfs_io]
253,0   18        4     0.000033535  4430  X   R 512 / 768 [xfs_io]
253,0   18        5     0.000065684  4430  X   R 768 / 1024 [xfs_io]
253,0   18        6     0.000091695  4430  X   R 1024 / 1280 [xfs_io]
253,0   18        7     0.000098494  4430  X   R 1280 / 1536 [xfs_io]
253,0   18        8     0.000114069  4430  X   R 1536 / 1792 [xfs_io]
253,0   18        9     0.000129483  4430  X   R 1792 / 2048 [xfs_io]
253,0   18       10     0.000136759  4430  X   R 2048 / 2304 [xfs_io]
253,0   18       11     0.000152412  4430  X   R 2304 / 2560 [xfs_io]
253,0   18       12     0.000160758  4430  X   R 2560 / 2816 [xfs_io]
253,0   18       13     0.000183385  4430  X   R 2816 / 3072 [xfs_io]
253,0   18       14     0.000190797  4430  X   R 3072 / 3328 [xfs_io]
253,0   18       15     0.000197667  4430  X   R 3328 / 3584 [xfs_io]
253,0   18       16     0.000218751  4430  X   R 3584 / 3840 [xfs_io]
253,0   18       17     0.000226005  4430  X   R 3840 / 4096 [xfs_io]
253,0   18       18     0.000250404  4430  Q   R 4120 + 176 [xfs_io]
253,0   18       19     0.000847708     0  C   R 4096 + 24 [0]
253,0   18       20     0.000855783     0  C   R 4120 + 176 [0]

Fixes: effd58c95f27774 ("dm: always call blk_queue_split() in dm_process_bio()")
Cc: stable@vger.kernel.org
Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
Tested-by: Barry Marson <bmarson@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm integrity: fix ppc64le warning

Otherwise:

In file included from drivers/md/dm-integrity.c:13:
drivers/md/dm-integrity.c: In function 'dm_integrity_status':
drivers/md/dm-integrity.c:3061:10: error: format '%llu' expects
argument of type 'long long unsigned int', but argument 4 has type
'long int' [-Werror=format=]
   DMEMIT("%llu %llu",
          ^~~~~~~~~~~
    atomic64_read(&ic->number_of_mismatches),
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
./include/linux/device-mapper.h:550:46: note: in definition of macro 'DMEMIT'
      0 : scnprintf(result + sz, maxlen - sz, x))
                                              ^
cc1: all warnings being treated as errors

Fixes: 7649194a1636ab5 ("dm integrity: remove sector type casts")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm clone metadata: Fix return type of dm_clone_nr_of_hydrated_regions()

dm_clone_nr_of_hydrated_regions() returns the number of regions that
have been hydrated so far. In order to do so it employs bitmap_weight().

Until now, the return type of dm_clone_nr_of_hydrated_regions() was
unsigned long.

Because bitmap_weight() returns an int, in case BITS_PER_LONG == 64 and
the return value of bitmap_weight() is 2^31 (the maximum allowed number
of regions for a device), the result is sign extended from 32 bits to 64
bits and an incorrect value is displayed, in the status output of
dm-clone, as the number of hydrated regions.

Fix this by having dm_clone_nr_of_hydrated_regions() return an unsigned
int.

Fixes: 7431b7835f55 ("dm: add clone target")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm clone: Add missing casts to prevent overflows and data corruption

Add missing casts when converting from regions to sectors.

In case BITS_PER_LONG == 32, the lack of the appropriate casts can lead
to overflows and miscalculation of the device sector.

As a result, we could end up discarding and/or copying the wrong parts
of the device, thus corrupting the device's data.

Fixes: 7431b7835f55 ("dm: add clone target")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm clone: Add overflow check for number of regions

Add overflow check for clone->nr_regions variable, which holds the
number of regions of the target.

The overflow can occur with sufficiently large devices, if BITS_PER_LONG
== 32. E.g., if the region size is 8 sectors (4K), the overflow would
occur for device sizes > 34359738360 sectors (~16TB).

This could result in multiple device sectors wrongly mapping to the same
region number, due to the truncation from 64 bits to 32 bits, which
would lead to data corruption.

Fixes: 7431b7835f55 ("dm: add clone target")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm clone: Fix handling of partial region discards

There is a bug in the way dm-clone handles discards, which can lead to
discarding the wrong blocks or trying to discard blocks beyond the end
of the device.

This could lead to data corruption, if the destination device indeed
discards the underlying blocks, i.e., if the discard operation results
in the original contents of a block to be lost.

The root of the problem is the code that calculates the range of regions
covered by a discard request and decides which regions to discard.

Since dm-clone handles the device in units of regions, we don't discard
parts of a region, only whole regions.

The range is calculated as:

    rs = dm_sector_div_up(bio->bi_iter.bi_sector, clone->region_size);
    re = bio_end_sector(bio) >> clone->region_shift;

, where 'rs' is the first region to discard and (re - rs) is the number
of regions to discard.

The bug manifests when we try to discard part of a single region, i.e.,
when we try to discard a block with size < region_size, and the discard
request both starts at an offset with respect to the beginning of that
region and ends before the end of the region.

The root cause is the following comparison:

  if (rs == re)
    // skip discard and complete original bio immediately

, which doesn't take into account that 'rs' might be greater than 're'.

Thus, we then issue a discard request for the wrong blocks, instead of
skipping the discard all together.

Fix the check to also take into account the above case, so we don't end
up discarding the wrong blocks.

Also, add some range checks to dm_clone_set_region_hydrated() and
dm_clone_cond_set_range(), which update dm-clone's region bitmap.

Note that the aforementioned bug doesn't cause invalid memory accesses,
because dm_clone_is_range_hydrated() returns True for this case, so the
checks are just precautionary.

Fixes: 7431b7835f55 ("dm: add clone target")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm writecache: add cond_resched to avoid CPU hangs

Initializing a dm-writecache device can take a long time when the
persistent memory device is large. Add cond_resched() to a few loops
to avoid warnings that the CPU is stuck.

Cc: stable@vger.kernel.org # v4.18+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm integrity: improve discard in journal mode

When we discard something that is present in the journal, we flush the
journal first, so that discarded blocks are not overwritten by the journal
content.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm integrity: add optional discard support

Add an argument "allow_discards" that enables discard processing on
dm-integrity device. Discards are only allowed to devices using
internal hash.

When a block is discarded the integrity tag is filled with
DISCARD_FILLER (0xf6) bytes.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm integrity: allow resize of the integrity device

If the size of the underlying device changes, change the size of the
integrity device too.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm integrity: factor out get_provided_data_sectors()

Move code to a new function get_provided_data_sectors().

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm integrity: don't replay journal data past the end of the device

Following commits will make it possible to shrink or extend the device. If
the device was shrunk, we don't want to replay journal data pointing past
the end of the device.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm integrity: remove sector type casts

Since the commit 72deb455b5ec619ff043c30bc90025aa3de3cdda ("block:
remove CONFIG_LBDAF") sector_t is always defined as unsigned long
long.

Delete the needless type casts in printk and avoids some warnings if
DEBUG_PRINT is defined.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm integrity: fix a crash with unusually large tag size

If the user specifies tag size larger than HASH_MAX_DIGESTSIZE,
there's a crash in integrity_metadata().

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm zoned: remove duplicate nr_rnd_zones increase in dmz_init_zone()

zmd->nr_rnd_zones was increased twice by mistake. The other place it
is increased in dmz_init_zone() is the only one needed:

1131                 zmd->nr_useable_zones++;
1132                 if (dmz_is_rnd(zone)) {
1133                         zmd->nr_rnd_zones++;
^^^
Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm verity fec: fix memory leak in verity_fec_dtr

Fix below kmemleak detected in verity_fec_ctr. output_pool is
allocated for each dm-verity-fec device. But it is not freed when
dm-table for the verity target is removed. Hence free the output
mempool in destructor function verity_fec_dtr.

unreferenced object 0xffffffffa574d000 (size 4096):
  comm "init", pid 1667, jiffies 4294894890 (age 307.168s)
  hex dump (first 32 bytes):
    8e 36 00 98 66 a8 0b 9b 00 00 00 00 00 00 00 00  .6..f...........
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<0000000060e82407>] __kmalloc+0x2b4/0x340
    [<00000000dd99488f>] mempool_kmalloc+0x18/0x20
    [<000000002560172b>] mempool_init_node+0x98/0x118
    [<000000006c3574d2>] mempool_init+0x14/0x20
    [<0000000008cb266e>] verity_fec_ctr+0x388/0x3b0
    [<000000000887261b>] verity_ctr+0x87c/0x8d0
    [<000000002b1e1c62>] dm_table_add_target+0x174/0x348
    [<000000002ad89eda>] table_load+0xe4/0x328
    [<000000001f06f5e9>] dm_ctl_ioctl+0x3b4/0x5a0
    [<00000000bee5fbb7>] do_vfs_ioctl+0x5dc/0x928
    [<00000000b475b8f5>] __arm64_sys_ioctl+0x70/0x98
    [<000000005361e2e8>] el0_svc_common+0xa0/0x158
    [<000000001374818f>] el0_svc_handler+0x6c/0x88
    [<000000003364e9f4>] el0_svc+0x8/0xc
    [<000000009d84cec9>] 0xffffffffffffffff

Fixes: a739ff3f543af ("dm verity: add support for forward error correction")
Depends-on: 6f1c819c219f7 ("dm: convert to bioset_init()/mempool_init()")
Cc: stable@vger.kernel.org
Signed-off-by: Harshini Shetty <harshini.x.shetty@sony.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm writecache: optimize superblock write

If we write a superblock in writecache_flush, we don't need to set bit and
scan the bitmap for it - we can just write the superblock directly. Also,
we can set the flag REQ_FUA on the write bio, so that we don't need to
submit a flush bio afterwards.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm writecache: implement gradual cleanup

If a block is stored in the cache for too long, it will now be
written to the underlying device and cleaned up.

Add a new option "max_age" that specifies the maximum age of a block
in milliseconds.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm writecache: implement the "cleaner" policy

The "flush" or "flush_on_suspend" messages flush the whole cache. However,
these flushing methods can take some time and the process is left in
an interruptible state during the flush.

Implement a "cleaner" option that offers an alternate flushing method.
When this option is activated (either by a message or in the constructor
arguments), the cache will not promote new writes (however, writes to
already cached blocks are promoted, to avoid data corruption due to
misordered writes) and it will gradually writeback any cached data. The
userspace can then monitor the cleaning process with "dmsetup status".
When the number of cached bloks drops to zero, the userspace can unload
the dm-writecache target and replace it with dm-linear or other targets.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm writecache: do direct write if the cache is full

If the cache device is full, we do a direct write to the origin device.
Note that we must not do it if the written block is already in the cache.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm integrity: print device name in integrity_metadata() error message

Similar to f710126cfc89c8df477002a26dee8407eb0b4acd ("dm crypt: print
device name in integrity error message"), this message should also
better identify the device with the integrity failure.

Signed-off-by: Erich Eckner <git@eckner.net>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm crypt: use crypt_integrity_aead() helper

Replace test_bit(CRYPT_MODE_INTEGRITY_AEAD, XXX) with
crypt_integrity_aead().

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

Linux 5.6-rc6

Merge tag 'irq-urgent-2020-03-15' of git://git./linux/kernel/git/tip/tip

Pull irq fix from Thomas Gleixner:
"A single commit to handle an erratum in Cavium ThunderX to prevent
access to GIC registers which are broken in the implementation"

* tag 'irq-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-v3: Workaround Cavium erratum 38539 when reading GICD_TYPER2

Merge tag 'locking-urgent-2020-03-15' of git://git./linux/kernel/git/tip/tip

Pull futex fix from Thomas Gleixner:
"Fix for yet another subtle futex issue.

  The futex code used ihold() to prevent inodes from vanishing, but
  ihold() does not guarantee inode persistence. Replace the inode
  pointer with a per boot, machine wide, unique inode identifier.

  The second commit fixes the breakage of the hash mechanism which
  causes a 100% performance regression"

* tag 'locking-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Unbreak futex hashing
  futex: Fix inode life-time issue

Merge tag 'x86-urgent-2020-03-15' of git://git./linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:
"Two fixes for x86:

   - Map EFI runtime service data as encrypted when SEV is enabled.

     Otherwise e.g. SMBIOS data cannot be properly decoded by dmidecode.

   - Remove the warning in the vector management code which triggered
     when a managed interrupt affinity changed outside of a CPU hotplug
     operation.

     The warning was correct until the recent core code change that
     introduced a CPU isolation feature which needs to migrate managed
     interrupts away from online CPUs under certain conditions to
     achieve the isolation"

* tag 'x86-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/vector: Remove warning on managed interrupt migration
  x86/ioremap: Map EFI runtime services data as encrypted for SEV

Merge tag 'perf-urgent-2020-03-15' of git://git./linux/kernel/git/tip/tip

Pull perf fixes from Thomas Gleixner:
"A pile of perf fixes:

  Kernel side:

   - AMD uncore driver: Replace the open coded sanity check with the
     core variant, which provides the correct error code and also leaves
     a hint in dmesg

  Tooling:

   - Fix the stdio input handling with glibc versions >= 2.28

   - Unbreak the futex-wake benchmark which was reduced to 0 test
     threads due to the conversion to cpumaps

   - Initialize sigaction structs before invoking sys_sigactio()

   - Plug the mapfile memory leak in perf jevents

   - Fix off by one relative directory includes

   - Fix an undefined string comparison in perf diff"

* tag 'perf-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/amd/uncore: Replace manual sampling check with CAP_NO_INTERRUPT flag
  tools: Fix off-by 1 relative directory includes
  perf jevents: Fix leak of mapfile memory
  perf bench: Clear struct sigaction before sigaction() syscall
  perf bench futex-wake: Restore thread count default to online CPU count
  perf top: Fix stdio interface input handling with glibc 2.28+
  perf diff: Fix undefined string comparision spotted by clang's -Wstring-compare
  perf symbols: Don't try to find a vmlinux file when looking for kernel modules
  perf bench: Share some global variables to fix build with gcc 10
  perf parse-events: Use asprintf() instead of strncpy() to read tracepoint files
  perf env: Do not return pointers to local variables
  perf tests bp_account: Make global variable static

Merge tag 'timers-urgent-2020-03-15' of git://git./linux/kernel/git/tip/tip

Pull timer fix from Thomas Gleixner:
"A single fix adding the missing time namespace adjustment in
  sys/sysinfo which caused sys/sysinfo to be inconsistent with
  /proc/uptime when read from a task inside a time namespace"

* tag 'timers-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sys/sysinfo: Respect boottime inside time namespace

Merge tag 'ras-urgent-2020-03-15' of git://git./linux/kernel/git/tip/tip

Pull RAS fixes from Thomas Gleixner:
"Two RAS related fixes:

   - Shut down the per CPU thermal throttling poll work properly when a
     CPU goes offline.

     The missing shutdown caused the poll work to be migrated to a
     unbound worker which triggered warnings about the usage of
     smp_processor_id() in preemptible context

   - Fix the PPIN feature initialization which missed to enable the
     functionality when PPIN_CTL was enabled but the MSR locked against
     updates"

* tag 'ras-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mce: Fix logic and comments around MSR_PPIN_CTL
  x86/mce/therm_throt: Undo thermal polling properly on CPU offline

Merge tag 'efi-urgent-2020-03-15' of git://git./linux/kernel/git/tip/tip

Pull EFI fixes from Thomas Gleixner:
"Two EFI fixes:

   - Prevent a race and buffer overflow in the sysfs efivars interface
     which causes kernel memory corruption.

   - Add the missing NULL pointer checks in efivar_store_raw()"

* tag 'efi-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  efi: Add a sanity check to efivar_store_raw()
  efi: Fix a race and a buffer overflow while reading efivars via sysfs

Merge tag 'iommu-fixes-v5.6-rc5' of git://git./linux/kernel/git/joro/iommu

Pull IOMMU fixes from Joerg Roedel:

- Intel VT-d fixes:
    - RCU list handling fixes
    - Replace WARN_TAINT with pr_warn + add_taint for reporting firmware
      issues
    - DebugFS fixes
    - Fix for hugepage handling in iova_to_phys implementation
    - Fix for handling VMD devices, which have a domain number which
      doesn't fit into 16 bits
    - Warning message fix

- MSI allocation fix for iommu-dma code

- Sign-extension fix for io page-table code

- Fix for AMD-Vi to properly update the is-running bit when AVIC is
   used

* tag 'iommu-fixes-v5.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
  iommu/vt-d: Populate debugfs if IOMMUs are detected
  iommu/amd: Fix IOMMU AVIC not properly update the is_run bit in IRTE
  iommu/vt-d: Ignore devices with out-of-spec domain number
  iommu/vt-d: Fix the wrong printing in RHSA parsing
  iommu/vt-d: Fix debugfs register reads
  iommu/vt-d: quirk_ioat_snb_local_iommu: replace WARN_TAINT with pr_warn + add_taint
  iommu/vt-d: dmar_parse_one_rmrr: replace WARN_TAINT with pr_warn + add_taint
  iommu/vt-d: dmar: replace WARN_TAINT with pr_warn + add_taint
  iommu/vt-d: Silence RCU-list debugging warnings
  iommu/vt-d: Fix RCU-list bugs in intel_iommu_init()
  iommu/dma: Fix MSI reservation allocation
  iommu/io-pgtable-arm: Fix IOVA validation for 32-bit
  iommu/vt-d: Fix a bug in intel_iommu_iova_to_phys() for huge page
  iommu/vt-d: Fix RCU list debugging warnings

Merge tag 'irqchip-fixes-5.6-2' of git://git./linux/kernel/git/maz/arm-platforms into irq/urgent

Pull irqchip fixes from Marc Zyngier:

- Add workaround for Cavium/Marvell ThunderX unimplemented GIC registers

Merge branch 'i2c/for-current' of git://git./linux/kernel/git/wsa/linux

Pull i2c fixes from Wolfram Sang:
"I2C has quite some regression fixes this time.

  One is also related to watchdogs, we have proper acks from Guenter for
  them"

* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: acpi: put device when verifying client fails
  misc: eeprom: at24: fix regulator underflow
  i2c: gpio: suppress error on probe defer
  macintosh: windfarm: fix MODINFO regression
  i2c: designware-pci: Fix BUG_ON during device removal
  i2c: i801: Do not add ICH_RES_IO_SMI for the iTCO_wdt device
  watchdog: iTCO_wdt: Make ICH_RES_IO_SMI optional
  watchdog: iTCO_wdt: Export vendorsupport

Merge tag 'arc-5.6-rc6' of git://git./linux/kernel/git/vgupta/arc

Pull ARC fixes from Vineet Gupta:

- Fix __ALIGN_STR and __ALIGN to not use default junk padding

- Misc Kconfig cleanups, header updates

* tag 'arc-5.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
  ARC: define __ALIGN_STR and __ALIGN symbols for ARC
  ARC: show_regs: reduce lines of output
  ARC: Replace <linux/clk-provider.h> by <linux/of_clk.h>
  ARC: fpu: fix randconfig build error reported by 0-day test service
  ARC: fix some Kconfig typos
  ARC: Cleanup old Kconfig IO scheduler options

Merge tag 'for-linus' of git://git./virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
"Bugfixes for x86 and s390"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: nVMX: avoid NULL pointer dereference with incorrect EVMCS GPAs
  KVM: x86: Initializing all kvm_lapic_irq fields in ioapic_write_indirect
  KVM: VMX: Condition ENCLS-exiting enabling on CPU support for SGX1
  KVM: s390: Also reset registers in sync regs for initial cpu reset
  KVM: fix Kconfig menu text for -Werror
  KVM: x86: remove stale comment from struct x86_emulate_ctxt
  KVM: x86: clear stale x86_emulate_ctxt->intercept value
  KVM: SVM: Fix the svm vmexit code for WRMSR
  KVM: X86: Fix dereference null cpufreq policy

iommu/vt-d: Populate debugfs if IOMMUs are detected

Currently, the intel iommu debugfs directory(/sys/kernel/debug/iommu/intel)
gets populated only when DMA remapping is enabled (dmar_disabled = 0)
irrespective of whether interrupt remapping is enabled or not.

Instead, populate the intel iommu debugfs directory if any IOMMUs are
detected.

Cc: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: ee2636b8670b1 ("iommu/vt-d: Enable base Intel IOMMU debugfs support")
Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>

Merge tag 'clk-fixes-for-linus' of git://git./linux/kernel/git/clk/linux

Pull clk fixes from Stephen Boyd:
"A small collection of fixes. I'll make another sweep soon to look for
  more fixes for this -rc series.

   - Mark device node const in of_clk_get_parent APIs to ease landing
     changes in users later

   - Fix flag for Qualcomm SC7180 video clocks where we thought it would
     never turn off but actually hardware takes care of it

   - Remove disp_cc_mdss_rscc_ahb_clk on Qualcomm SC7180 SoCs because
     this clk is always on anyway

   - Correct some bad dt-binding numbers for i.MX8MN SoCs"

* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
  clk: imx8mn: Fix incorrect clock defines
  clk: qcom: dispcc: Remove support of disp_cc_mdss_rscc_ahb_clk
  clk: qcom: videocc: Update the clock flag for video_cc_vcodec0_core_clk
  of: clk: Make of_clk_get_parent_{count,name}() parameter const

Merge branch 'kvm-null-pointer-fix' into kvm-master

KVM: nVMX: avoid NULL pointer dereference with incorrect EVMCS GPAs

When an EVMCS enabled L1 guest on KVM will tries doing enlightened VMEnter
with EVMCS GPA = 0 the host crashes because the

evmcs_gpa != vmx->nested.hv_evmcs_vmptr

condition in nested_vmx_handle_enlightened_vmptrld() will evaluate to
false (as nested.hv_evmcs_vmptr is zeroed after init). The crash will
happen on vmx->nested.hv_evmcs pointer dereference.

Another problematic EVMCS ptr value is '-1' but it only causes host crash
after nested_release_evmcs() invocation. The problem is exactly the same as
with '0', we mistakenly think that the EVMCS pointer hasn't changed and
thus nested.hv_evmcs_vmptr is valid.

Resolve the issue by adding an additional !vmx->nested.hv_evmcs
check to nested_vmx_handle_enlightened_vmptrld(), this way we will
always be trying kvm_vcpu_map() when nested.hv_evmcs is NULL
and this is supposed to catch all invalid EVMCS GPAs.

Also, initialize hv_evmcs_vmptr to '0' in nested_release_evmcs()
to be consistent with initialization where we don't currently
set hv_evmcs_vmptr to '-1'.

Cc: stable@vger.kernel.org
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Merge tag 'kvm-s390-master-5.6-1' of git://git./linux/kernel/git/kvms390/linux into kvm-master

KVM: s390: Fully do the CPU resets as intended

With 7de3f1423ff9 ("KVM: s390: Add new reset vcpu API") we clarified
the meaning of the reset ioctl to fully reset the CPU and not only the
parts that can not be handled by userspace. Turns out that we missed
some parts.

irqchip/gic-v3: Workaround Cavium erratum 38539 when reading GICD_TYPER2

Despite the architecture spec requiring that reserved registers in the GIC
distributor memory map are RES0 (and thus are not allowed to generate
an exception), the Cavium ThunderX (aka TX1) SoC explodes as such:

[    0.000000] GICv3: GIC: Using split EOI/Deactivate mode
[    0.000000] GICv3: 128 SPIs implemented
[    0.000000] GICv3: 0 Extended SPIs implemented
[    0.000000] Internal error: synchronous external abort: 96000210 [#1] SMP
[    0.000000] Modules linked in:
[    0.000000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.0-rc4-00035-g3cf6a3d5725f #7956
[    0.000000] Hardware name: cavium,thunder-88xx (DT)
[    0.000000] pstate: 60000085 (nZCv daIf -PAN -UAO)
[    0.000000] pc : __raw_readl+0x0/0x8
[    0.000000] lr : gic_init_bases+0x110/0x560
[    0.000000] sp : ffff800011243d90
[    0.000000] x29: ffff800011243d90 x28: 0000000000000000
[    0.000000] x27: 0000000000000018 x26: 0000000000000002
[    0.000000] x25: ffff8000116f0000 x24: ffff000fbe6a2c80
[    0.000000] x23: 0000000000000000 x22: ffff010fdc322b68
[    0.000000] x21: ffff800010a7a208 x20: 00000000009b0404
[    0.000000] x19: ffff80001124dad0 x18: 0000000000000010
[    0.000000] x17: 000000004d8d492b x16: 00000000f67eb9af
[    0.000000] x15: ffffffffffffffff x14: ffff800011249908
[    0.000000] x13: ffff800091243ae7 x12: ffff800011243af4
[    0.000000] x11: ffff80001126e000 x10: ffff800011243a70
[    0.000000] x9 : 00000000ffffffd0 x8 : ffff80001069c828
[    0.000000] x7 : 0000000000000059 x6 : ffff8000113fb4d1
[    0.000000] x5 : 0000000000000001 x4 : 0000000000000000
[    0.000000] x3 : 0000000000000000 x2 : 0000000000000000
[    0.000000] x1 : 0000000000000000 x0 : ffff8000116f000c
[    0.000000] Call trace:
[    0.000000]  __raw_readl+0x0/0x8
[    0.000000]  gic_of_init+0x188/0x224
[    0.000000]  of_irq_init+0x200/0x3cc
[    0.000000]  irqchip_init+0x1c/0x40
[    0.000000]  init_IRQ+0x160/0x1d0
[    0.000000]  start_kernel+0x2ec/0x4b8
[    0.000000] Code: a8c47bfd d65f03c0 d538d080 d65f03c0 (b9400000)

when reading the GICv4.1 GICD_TYPER2 register, which is unexpected...

Work around it by adding a new quirk for the following variants:

ThunderX: CN88xx
OCTEON TX: CN83xx, CN81xx
OCTEON TX2: CN93xx, CN96xx, CN98xx, CNF95xx*

and use this flag to avoid accessing GICD_TYPER2. Note that all
reserved registers (including redistributors and ITS) are impacted
by this erratum, but that only GICD_TYPER2 has to be worked around
so far.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Robert Richter <rrichter@marvell.com>
Tested-by: Mark Salter <msalter@redhat.com>
Tested-by: Tim Harvey <tharvey@gateworks.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Robert Richter <rrichter@marvell.com>
Link: https://lore.kernel.org/r/20191027144234.8395-11-maz@kernel.org
Link: https://lore.kernel.org/r/20200311115649.26060-1-maz@kernel.org

KVM: x86: Initializing all kvm_lapic_irq fields in ioapic_write_indirect

Previously all fields of structure kvm_lapic_irq were not initialized
before it was passed to kvm_bitmap_or_dest_vcpus(). Which will cause
an issue when any of those fields are used for processing a request.
For example not initializing the msi_redir_hint field before passing
to the kvm_bitmap_or_dest_vcpus(), may lead to a misbehavior of
kvm_apic_map_get_dest_lapic(). This will specifically happen when the
kvm_lowest_prio_delivery() returns TRUE due to a non-zero garbage
value of msi_redir_hint, which should not happen as the request belongs
to APIC fixed delivery mode and we do not want to deliver the
interrupt only to the lowest priority candidate.

This patch initializes all the fields of kvm_lapic_irq based on the
values of ioapic redirect_entry object before passing it on to
kvm_bitmap_or_dest_vcpus().

Fixes: 7ee30bc132c6 ("KVM: x86: deliver KVM IOAPIC scan request to target vCPUs")
Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
[Set level to false since the value doesn't really matter. Suggested
by Vitaly Kuznetsov. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: VMX: Condition ENCLS-exiting enabling on CPU support for SGX1

Enable ENCLS-exiting (and thus set vmcs.ENCLS_EXITING_BITMAP) only if
the CPU supports SGX1.  Per Intel's SDM, all ENCLS leafs #UD if SGX1
is not supported[*], i.e. intercepting ENCLS to inject a #UD is
unnecessary.

Avoiding ENCLS-exiting even when it is reported as supported by the CPU
works around a reported issue where SGX is "hard" disabled after an S3
suspend/resume cycle, i.e. CPUID.0x7.SGX=0 and the VMCS field/control
are enumerated as unsupported.  While the root cause of the S3 issue is
unknown, it's definitely _not_ a KVM (or kernel) bug, i.e. this is a
workaround for what is most likely a hardware or firmware issue.  As a
bonus side effect, KVM saves a VMWRITE when first preparing vmcs01 and
vmcs02.

Note, SGX must be disabled in BIOS to take advantage of this workaround

[*] The additional ENCLS CPUID check on SGX1 exists so that SGX can be
    globally "soft" disabled post-reset, e.g. if #MC bits in MCi_CTL are
    cleared.  Soft disabled meaning disabling SGX without clearing the
    primary CPUID bit (in leaf 0x7) and without poking into non-SGX
    CPU paths, e.g. for the VMCS controls.

Fixes: 0b665d304028 ("KVM: vmx: Inject #UD for SGX ENCLS instruction in guest")
Reported-by: Toni Spets <toni.spets@iki.fi>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

iommu/amd: Fix IOMMU AVIC not properly update the is_run bit in IRTE

Commit b9c6ff94e43a ("iommu/amd: Re-factor guest virtual APIC
(de-)activation code") accidentally left out the ir_data pointer when
calling modity_irte_ga(), which causes the function amd_iommu_update_ga()
to return prematurely due to struct amd_ir_data.ref is NULL and
the "is_run" bit of IRTE does not get updated properly.

This results in bad I/O performance since IOMMU AVIC always generate GA Log
entry and notify IOMMU driver and KVM when it receives interrupt from the
PCI pass-through device instead of directly inject interrupt to the vCPU.

Fixes by passing ir_data when calling modify_irte_ga() as done previously.

Fixes: b9c6ff94e43a ("iommu/amd: Re-factor guest virtual APIC (de-)activation code")
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>

iommu/vt-d: Ignore devices with out-of-spec domain number

VMD subdevices are created with a PCI domain ID of 0x10000 or
higher.

These subdevices are also handled like all other PCI devices by
dmar_pci_bus_notifier().

However, when dmar_alloc_pci_notify_info() take records of such devices,
it will truncate the domain ID to a u16 value (in info->seg).
The device at (e.g.) 10000:00:02.0 is then treated by the DMAR code as if
it is 0000:00:02.0.

In the unlucky event that a real device also exists at 0000:00:02.0 and
also has a device-specific entry in the DMAR table,
dmar_insert_dev_scope() will crash on:
BUG_ON(i >= devices_cnt);

That's basically a sanity check that only one PCI device matches a
single DMAR entry; in this case we seem to have two matching devices.

Fix this by ignoring devices that have a domain number higher than
what can be looked up in the DMAR table.

This problem was carefully diagnosed by Jian-Hong Pan.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Daniel Drake <drake@endlessm.com>
Fixes: 59ce0515cdaf3 ("iommu/vt-d: Update DRHD/RMRR/ATSR device scope caches when PCI hotplug happens")
Signed-off-by: Joerg Roedel <jroedel@suse.de>

iommu/vt-d: Fix the wrong printing in RHSA parsing

When base address in RHSA structure doesn't match base address in
each DRHD structure, the base address in last DRHD is printed out.

This doesn't make sense when there are multiple DRHD units, fix it
by printing the buggy RHSA's base address.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@gmail.com>
Fixes: fd0c8894893cb ("intel-iommu: Set a more specific taint flag for invalid BIOS DMAR tables")
Signed-off-by: Joerg Roedel <jroedel@suse.de>

Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
"Two small fixes, both in drivers: ipr and ufs"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ipr: Fix softlockup when rescanning devices in petitboot
scsi: ufs: Fix possible unclocked access to auto hibern8 timer register

Merge tag 'nfs-for-5.6-3' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client bugfixes from Anna Schumaker:
"These are mostly fscontext fixes, but there is also one that fixes
  collisions seen in fscache:

   - Ensure the fs_context has the correct fs_type when mounting and
     submounting

   - Fix leaking of ctx->nfs_server.hostname

   - Add minor version to fscache key to prevent collisions"

* tag 'nfs-for-5.6-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  nfs: add minor version to nfs_server_key for fscache
  NFS: Fix leak of ctx->nfs_server.hostname
  NFS: Don't hard-code the fs_type when submounting
  NFS: Ensure the fs_context has the correct fs_type before mounting

Merge tag 'fuse-fixes-5.6-rc6' of git://git./linux/kernel/git/mszeredi/fuse

Pull fuse fix from Miklos Szeredi:
"Fix an Oops introduced in v5.4"

* tag 'fuse-fixes-5.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: fix stack use after return

Merge tag 'ovl-fixes-5.6-rc6' of git://git./linux/kernel/git/mszeredi/vfs

Pull overlayfs fixes from Miklos Szeredi:
"Fix three bugs introduced in this cycle"

* tag 'ovl-fixes-5.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: fix lockdep warning for async write
  ovl: fix some xino configurations
  ovl: fix lock in ovl_llseek()

Merge tag 'pm-5.6-rc6' of git://git./linux/kernel/git/rafael/linux-pm

Pull power management fix from Rafael Wysocki:
"Fix cpupower utility build failures with -fno-common enabled (Mike
Gilbert)"

* tag 'pm-5.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpupower: avoid multiple definition with gcc -fno-common

Merge tag 'io_uring-5.6-2020-03-13' of git://git.kernel.dk/linux-block

Pull io_uring fix from Jens Axboe:
"Just a single fix here, improving the RCU callback ordering from last
  week. After a bit more perusing by Paul, he poked a hole in the
  original"

* tag 'io_uring-5.6-2020-03-13' of git://git.kernel.dk/linux-block:
  io_uring: ensure RCU callback ordering with rcu_barrier()

Merge tag 'block-5.6-2020-03-13' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
"A few fixes that should go into this release. This contains:

   - Fix for a corruption issue with the s390 dasd driver (Stefan)

   - Fixup/improvement for the flush insertion change that we had in
     this series (Ming)

   - Fix for the partition suppor for host aware zoned devices
     (Shin'ichiro)

   - Fix incorrect blk-iocost comparison (Tejun)

  The diffstat looks large, but that's a) mostly dasd, and b) the flush
  fix from Ming adds a big comment"

* tag 'block-5.6-2020-03-13' of git://git.kernel.dk/linux-block:
  block: Fix partition support for host aware zoned block devices
  blk-mq: insert flush request to the front of dispatch queue
  s390/dasd: fix data corruption for thin provisioned devices
  blk-iocost: fix incorrect vtime comparison in iocg_is_idle()

Merge tag 'mmc-v5.6-rc1-2' of git://git./linux/kernel/git/ulfh/mmc

Pull MMC fixes from Ulf Hansson:
"MMC core:

   - Fix HW busy detection support for host controllers requiring the
     MMC_RSP_BUSY response flag (R1B) to be set for the command. In
     particular for CMD6 (eMMC), erase/trim/discard (SD/eMMC) and CMD5
     (eMMC sleep).

  MMC host:

   - sdhci-omap|tegra: Fix support for HW busy detection"

* tag 'mmc-v5.6-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
  mmc: core: Respect MMC_CAP_NEED_RSP_BUSY for eMMC sleep command
  mmc: sdhci-tegra: Fix busy detection by enabling MMC_CAP_NEED_RSP_BUSY
  mmc: sdhci-omap: Fix busy detection by enabling MMC_CAP_NEED_RSP_BUSY
  mmc: core: Respect MMC_CAP_NEED_RSP_BUSY for erase/trim/discard
  mmc: core: Allow host controllers to require R1B for CMD6

afs: Use kfree_rcu() instead of casting kfree() to rcu_callback_t

afs_put_addrlist() casts kfree() to rcu_callback_t. Apart from being wrong
in theory, this might also blow up when people start enforcing function
types via compiler instrumentation, and it means the rcu_head has to be
first in struct afs_addr_list.

Use kfree_rcu() instead, it's simpler and more correct.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'at24-fixes-for-v5.6-rc6' of git://git./linux/kernel/git/brgl/linux into i2c/for-current

at24 fixes for v5.6-rc6

- fix regulator underflow bug introduced during the v5.6 merge window

ovl: fix lockdep warning for async write

Lockdep reports "WARNING: lock held when returning to user space!" due to
async write holding freeze lock over the write. Apparently aio.c already
deals with this by lying to lockdep about the state of the lock.

Do the same here. No need to check for S_IFREG() here since these file ops
are regular-only.

Reported-by: syzbot+9331a354f4f624a52a55@syzkaller.appspotmail.com
Fixes: 2406a307ac7d ("ovl: implement async IO routines")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

ovl: fix some xino configurations

Fix up two bugs in the coversion to xino_mode:
1. xino=off does not always end up in disabled mode
2. xino=auto on 32bit arch should end up in disabled mode

Take a proactive approach to disabling xino on 32bit kernel:
1. Disable XINO_AUTO config during build time
2. Disable xino with a warning on mount time

As a by product, xino=on on 32bit arch also ends up in disabled mode.
We never intended to enable xino on 32bit arch and this will make the
rest of the logic simpler.

Fixes: 0f831ec85eda ("ovl: simplify ovl_same_sb() helper")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

x86/vector: Remove warning on managed interrupt migration

The vector management code assumes that managed interrupts cannot be
migrated away from an online CPU. free_moved_vector() has a WARN_ON_ONCE()
which triggers when a managed interrupt vector association on a online CPU
is cleared. The CPU offline code uses a different mechanism which cannot
trigger this.

This assumption is not longer correct because the new CPU isolation feature
which affects the placement of managed interrupts must be able to move a
managed interrupt away from an online CPU.

There are two reasons why this can happen:

  1) When the interrupt is activated the affinity mask which was
     established in irq_create_affinity_masks() is handed in to
     the vector allocation code. This mask contains all CPUs to which
     the interrupt can be made affine to, but this does not take the
     CPU isolation 'managed_irq' mask into account.

     When the interrupt is finally requested by the device driver then the
     affinity is checked again and the CPU isolation 'managed_irq' mask is
     taken into account, which moves the interrupt to a non-isolated CPU if
     possible.

  2) The interrupt can be affine to an isolated CPU because the
     non-isolated CPUs in the calculated affinity mask are not online.

     Once a non-isolated CPU which is in the mask comes online the
     interrupt is migrated to this non-isolated CPU

In both cases the regular online migration mechanism is used which triggers
the WARN_ON_ONCE() in free_moved_vector().

Case #1 could have been addressed by taking the isolation mask into
account, but that would require a massive code change in the activation
logic and the eventual migration event was accepted as a reasonable
tradeoff when the isolation feature was developed. But even if #1 would be
addressed, #2 would still trigger it.

Of course the warning in free_moved_vector() was overlooked at that time
and the above two cases which have been discussed during patch review have
obviously never been tested before the final submission.

So keep it simple and remove the warning.

[ tglx: Rewrote changelog and added a comment to free_moved_vector() ]

Fixes: 11ea68f553e2 ("genirq, sched/isolation: Isolate from handling managed interrupts")
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lkml.kernel.org/r/20200312205830.81796-1-peterx@redhat.com

i2c: acpi: put device when verifying client fails

i2c_verify_client() can fail, so we need to put the device when that
happens.

Fixes: 525e6fabeae2 ("i2c / ACPI: add support for ACPI reconfigure notifications")
Reported-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>

iommu/vt-d: Fix debugfs register reads

Commit 6825d3ea6cde ("iommu/vt-d: Add debugfs support to show register
contents") dumps the register contents for all IOMMU devices.

Currently, a 64 bit read(dmar_readq) is done for all the IOMMU registers,
even though some of the registers are 32 bits, which is incorrect.

Use the correct read function variant (dmar_readl/dmar_readq) while
reading the contents of 32/64 bit registers respectively.

Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
Link: https://lore.kernel.org/r/1583784587-26126-2-git-send-email-megha.dey@linux.intel.com
Acked-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>

iommu/vt-d: quirk_ioat_snb_local_iommu: replace WARN_TAINT with pr_warn + add_taint

Quoting from the comment describing the WARN functions in
include/asm-generic/bug.h:

* WARN(), WARN_ON(), WARN_ON_ONCE, and so on can be used to report
* significant kernel issues that need prompt attention if they should ever
* appear at runtime.
*
* Do not use these macros when checking for invalid external inputs

The (buggy) firmware tables which the dmar code was calling WARN_TAINT
for really are invalid external inputs. They are not under the kernel's
control and the issues in them cannot be fixed by a kernel update.
So logging a backtrace, which invites bug reports to be filed about this,
is not helpful.

Fixes: 556ab45f9a77 ("ioat2: catch and recover from broken vtd configurations v6")
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Acked-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20200309182510.373875-1-hdegoede@redhat.com
BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=701847
Signed-off-by: Joerg Roedel <jroedel@suse.de>

iommu/vt-d: dmar_parse_one_rmrr: replace WARN_TAINT with pr_warn + add_taint

Quoting from the comment describing the WARN functions in
include/asm-generic/bug.h:

* WARN(), WARN_ON(), WARN_ON_ONCE, and so on can be used to report
* significant kernel issues that need prompt attention if they should ever
* appear at runtime.
*
* Do not use these macros when checking for invalid external inputs

The (buggy) firmware tables which the dmar code was calling WARN_TAINT
for really are invalid external inputs. They are not under the kernel's
control and the issues in them cannot be fixed by a kernel update.
So logging a backtrace, which invites bug reports to be filed about this,
is not helpful.

Some distros, e.g. Fedora, have tools watching for the kernel backtraces
logged by the WARN macros and offer the user an option to file a bug for
this when these are encountered. The WARN_TAINT in dmar_parse_one_rmrr
+ another iommu WARN_TAINT, addressed in another patch, have lead to over
a 100 bugs being filed this way.

This commit replaces the WARN_TAINT("...") call, with a
pr_warn(FW_BUG "...") + add_taint(TAINT_FIRMWARE_WORKAROUND, ...) call
avoiding the backtrace and thus also avoiding bug-reports being filed
about this against the kernel.

Fixes: f5a68bb0752e ("iommu/vt-d: Mark firmware tainted if RMRR fails sanity check")
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Acked-by: Lu Baolu <baolu.lu@linux.intel.com>
Cc: stable@vger.kernel.org
Cc: Barret Rhoden <brho@google.com>
Link: https://lore.kernel.org/r/20200309140138.3753-3-hdegoede@redhat.com
BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1808874

iommu/vt-d: dmar: replace WARN_TAINT with pr_warn + add_taint

Quoting from the comment describing the WARN functions in
include/asm-generic/bug.h:

* WARN(), WARN_ON(), WARN_ON_ONCE, and so on can be used to report
* significant kernel issues that need prompt attention if they should ever
* appear at runtime.
*
* Do not use these macros when checking for invalid external inputs

The (buggy) firmware tables which the dmar code was calling WARN_TAINT
for really are invalid external inputs. They are not under the kernel's
control and the issues in them cannot be fixed by a kernel update.
So logging a backtrace, which invites bug reports to be filed about this,
is not helpful.

Some distros, e.g. Fedora, have tools watching for the kernel backtraces
logged by the WARN macros and offer the user an option to file a bug for
this when these are encountered. The WARN_TAINT in warn_invalid_dmar()
+ another iommu WARN_TAINT, addressed in another patch, have lead to over
a 100 bugs being filed this way.

This commit replaces the WARN_TAINT("...") calls, with
pr_warn(FW_BUG "...") + add_taint(TAINT_FIRMWARE_WORKAROUND, ...) calls
avoiding the backtrace and thus also avoiding bug-reports being filed
about this against the kernel.

Fixes: fd0c8894893c ("intel-iommu: Set a more specific taint flag for invalid BIOS DMAR tables")
Fixes: e625b4a95d50 ("iommu/vt-d: Parse ANDD records")
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Acked-by: Lu Baolu <baolu.lu@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200309140138.3753-2-hdegoede@redhat.com
BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1564895

Merge tag 'drm-fixes-2020-03-13' of git://anongit.freedesktop.org/drm/drm

Pull drm fixes from Dave Airlie:
"It's a bit quieter, probably not as much as it could be.

  There is on large regression fix in here from Lyude for displayport
  bandwidth calculations, there've been reports of multi-monitor in
  docks not working since -rc1 and this has been tested to fix those.

  Otherwise it's a bunch of i915 (with some GVT fixes), a set of amdgpu
  watermark + bios fixes, and an exynos iommu cleanup fix.

  core:
   - DP MST bandwidth regression fix.

  i915:
   - hard lockup fix
   - GVT fixes
   - 32-bit alignment issue fix
   - timeline wait fixes
   - cacheline_retire and free

  amdgpu:
   - Update the display watermark bounding box for navi14
   - Fix fetching vbios directly from rom on vega20/arcturus
   - Navi and renoir watermark fixes

  exynos:
   - iommu object cleanup fix"

`

* tag 'drm-fixes-2020-03-13' of git://anongit.freedesktop.org/drm/drm:
  drm/dp_mst: Rewrite and fix bandwidth limit checks
  drm/dp_mst: Reprobe path resources in CSN handler
  drm/dp_mst: Use full_pbn instead of available_pbn for bandwidth checks
  drm/dp_mst: Rename drm_dp_mst_is_dp_mst_end_device() to be less redundant
  drm/i915: Defer semaphore priority bumping to a workqueue
  drm/i915/gt: Close race between cacheline_retire and free
  drm/i915/execlists: Enable timeslice on partial virtual engine dequeue
  drm/i915: be more solid in checking the alignment
  drm/i915/gvt: Fix dma-buf display blur issue on CFL
  drm/i915: Return early for await_start on same timeline
  drm/i915: Actually emit the await_start
  drm/amdgpu/powerplay: nv1x, renior copy dcn clock settings of watermark to smu during boot up
  drm/exynos: Fix cleanup of IOMMU related objects
  drm/amdgpu: correct ROM_INDEX/DATA offset for VEGA20
  drm/amd/display: update soc bb for nv14
  drm/i915/gvt: Fix emulated vbt size issue
  drm/i915/gvt: Fix unnecessary schedule timer when no vGPU exits

Merge tag 'topic/mst-bw-check-fixes-for-airlied-2020-03-12-2' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes

UAPI Changes: None

Cross-subsystem Changes: None

Core Changes: Fixed regressions introduced by commit cd82d82cbc04
("drm/dp_mst: Add branch bandwidth validation to MST atomic check"),
which would cause us to:

* Calculate the available bandwidth on an MST topology incorrectly, and
  as a result reject most display configurations that would try to enable
  more then one sink on a topology
* Occasionally expose MST connectors to userspace before finishing
  probing their PBN capabilities, resulting in us rejecting display
  configurations because we assumed briefly that no bandwidth was
  available

Driver Changes: None

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Lyude Paul <lyude@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/bf16ee577567beed91c86b7d9cda3ec2e8c50a71.camel@redhat.com

Merge tag 'drm-intel-fixes-2020-03-12' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes

drm/i915 fixes for v5.6-rc6:
- hard lockup fix
- GVT fixes
- 32-bit alignment issue fix
- timeline wait fixes
- cacheline_retire and free

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/87lfo6ksvw.fsf@intel.com

Merge tag 'amd-drm-fixes-5.6-2020-03-11' of git://people.freedesktop.org/~agd5f/linux into drm-fixes

amd-drm-fixes-5.6-2020-03-11:

amdgpu:
- Update the display watermark bounding box for navi14
- Fix fetching vbios directly from rom on vega20/arcturus
- Navi and renoir watermark fixes

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexdeucher@gmail.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200312020924.4161-1-alexander.deucher@amd.com

Merge git://git./linux/kernel/git/netdev/net

Pull networking fixes from David Miller:
"It looks like a decent sized set of fixes, but a lot of these are one
  liner off-by-one and similar type changes:

   1) Fix netlink header pointer to calcular bad attribute offset
      reported to user. From Pablo Neira Ayuso.

   2) Don't double clear PHY interrupts when ->did_interrupt is set,
      from Heiner Kallweit.

   3) Add missing validation of various (devlink, nl802154, fib, etc.)
      attributes, from Jakub Kicinski.

   4) Missing *pos increments in various netfilter seq_next ops, from
      Vasily Averin.

   5) Missing break in of_mdiobus_register() loop, from Dajun Jin.

   6) Don't double bump tx_dropped in veth driver, from Jiang Lidong.

   7) Work around FMAN erratum A050385, from Madalin Bucur.

   8) Make sure ARP header is pulled early enough in bonding driver,
      from Eric Dumazet.

   9) Do a cond_resched() during multicast processing of ipvlan and
      macvlan, from Mahesh Bandewar.

  10) Don't attach cgroups to unrelated sockets when in interrupt
      context, from Shakeel Butt.

  11) Fix tpacket ring state management when encountering unknown GSO
      types. From Willem de Bruijn.

  12) Fix MDIO bus PHY resume by checking mdio_bus_phy_may_suspend()
      only in the suspend context. From Heiner Kallweit"

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (112 commits)
  net: systemport: fix index check to avoid an array out of bounds access
  tc-testing: add ETS scheduler to tdc build configuration
  net: phy: fix MDIO bus PM PHY resuming
  net: hns3: clear port base VLAN when unload PF
  net: hns3: fix RMW issue for VLAN filter switch
  net: hns3: fix VF VLAN table entries inconsistent issue
  net: hns3: fix "tc qdisc del" failed issue
  taprio: Fix sending packets without dequeueing them
  net: mvmdio: avoid error message for optional IRQ
  net: dsa: mv88e6xxx: Add missing mask of ATU occupancy register
  net: memcg: fix lockdep splat in inet_csk_accept()
  s390/qeth: implement smarter resizing of the RX buffer pool
  s390/qeth: refactor buffer pool code
  s390/qeth: use page pointers to manage RX buffer pool
  seg6: fix SRv6 L2 tunnels to use IANA-assigned protocol number
  net: dsa: Don't instantiate phylink for CPU/DSA ports unless needed
  net/packet: tpacket_rcv: do not increment ring index on drop
  sxgbe: Fix off by one in samsung driver strncpy size arg
  net: caif: Add lockdep expression to RCU traversal primitive
  MAINTAINERS: remove Sathya Perla as Emulex NIC maintainer
  ...

drm/dp_mst: Rewrite and fix bandwidth limit checks

Sigh, this is mostly my fault for not giving commit cd82d82cbc04
("drm/dp_mst: Add branch bandwidth validation to MST atomic check")
enough scrutiny during review. The way we're checking bandwidth
limitations here is mostly wrong:

For starters, drm_dp_mst_atomic_check_bw_limit() determines the
pbn_limit of a branch by simply scanning each port on the current branch
device, then uses the last non-zero full_pbn value that it finds. It
then counts the sum of the PBN used on each branch device for that
level, and compares against the full_pbn value it found before.

This is wrong because ports can and will have different PBN limitations
on many hubs, especially since a number of DisplayPort hubs out there
will be clever and only use the smallest link rate required for each
downstream sink - potentially giving every port a different full_pbn
value depending on what link rate it's trained at. This means with our
current code, which max PBN value we end up with is not well defined.

Additionally, we also need to remember when checking bandwidth
limitations that the top-most device in any MST topology is a branch
device, not a port. This means that the first level of a topology
doesn't technically have a full_pbn value that needs to be checked.
Instead, we should assume that so long as our VCPI allocations fit we're
within the bandwidth limitations of the primary MSTB.

We do however, want to check full_pbn on every port including those of
the primary MSTB. However, it's important to keep in mind that this
value represents the minimum link rate /between a port's sink or mstb,
and the mstb itself/. A quick diagram to explain:

                                MSTB #1
                               /       \
                              /         \
                           Port #1    Port #2
       full_pbn for Port #1 → |          | ← full_pbn for Port #2
                           Sink #1    MSTB #2
                                         |
                                       etc...

Note that in the above diagram, the combined PBN from all VCPI
allocations on said hub should not exceed the full_pbn value of port #2,
and the display configuration on sink #1 should not exceed the full_pbn
value of port #1. However, port #1 and port #2 can otherwise consume as
much bandwidth as they want so long as their VCPI allocations still fit.

And finally - our current bandwidth checking code also makes the mistake
of not checking whether something is an end device or not before trying
to traverse down it.

So, let's fix it by rewriting our bandwidth checking helpers. We split
the function into one part for handling branches which simply adds up
the total PBN on each branch and returns it, and one for checking each
port to ensure we're not going over its PBN limit. Phew.

This should fix regressions seen, where we erroneously reject display
configurations due to thinking they're going over our bandwidth limits
when they're not.

Changes since v1:
* Took an even closer look at how PBN limitations are supposed to be
  handled, and did some experimenting with Sean Paul. Ended up rewriting
  these helpers again, but this time they should actually be correct!
Changes since v2:
* Small indenting fix
* Fix pbn_used check in drm_dp_mst_atomic_check_port_bw_limit()

Signed-off-by: Lyude Paul <lyude@redhat.com>
Fixes: cd82d82cbc04 ("drm/dp_mst: Add branch bandwidth validation to MST atomic check")
Cc: Sean Paul <seanpaul@google.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Mikita Lipski <mikita.lipski@amd.com>
Tested-by: Hans de Goede <hdegoede@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200309210131.1497545-1-lyude@redhat.com

drm/dp_mst: Reprobe path resources in CSN handler

We used to punt off reprobing path resources to the link address probe
work, but now that we handle CSNs asynchronously from the driver's HPD
handling we can do whatever the heck we want from the CSN!

So, reprobe the path resources from drm_dp_mst_handle_conn_stat(). Also,
get rid of the path resource reprobing code in
drm_dp_check_and_send_link_address() since it's needlessly complicated
when we already reprobe path resources from
drm_dp_handle_link_address_port(). And finally, teach
drm_dp_send_enum_path_resources() to return 1 on PBN changes so we know
if we need to send another hotplug or not.

This fixes issues where we've indicated to userspace that a port has
just been connected, before we actually probed it's available PBN -
something that results in unexpected atomic check failures.

Signed-off-by: Lyude Paul <lyude@redhat.com>
Fixes: cd82d82cbc04 ("drm/dp_mst: Add branch bandwidth validation to MST atomic check")
Cc: Mikita Lipski <mikita.lipski@amd.com>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Sean Paul <sean@poorly.run>
Link: https://patchwork.freedesktop.org/patch/msgid/20200306234623.547525-4-lyude@redhat.com
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Tested-by: Hans de Goede <hdegoede@redhat.com>

drm/dp_mst: Use full_pbn instead of available_pbn for bandwidth checks

DisplayPort specifications are fun. For a while, it's been really
unclear to us what available_pbn actually does. There's a somewhat vague
explanation in the DisplayPort spec (starting from 1.2) that partially
explains it:

  The minimum payload bandwidth number supported by the path. Each node
  updates this number with its available payload bandwidth number if its
  payload bandwidth number is less than that in the Message Transaction
  reply.

So, it sounds like available_pbn represents the smallest link rate in
use between the source and the branch device. Cool, so full_pbn is just
the highest possible PBN that the branch device supports right?

Well, we assumed that for quite a while until Sean Paul noticed that on
some MST hubs, available_pbn will actually get set to 0 whenever there's
any active payloads on the respective branch device. This caused quite a
bit of confusion since clearing the payload ID table would end up fixing
the available_pbn value.

So, we just went with that until commit cd82d82cbc04 ("drm/dp_mst: Add
branch bandwidth validation to MST atomic check") started breaking
people's setups due to us getting erroneous available_pbn values. So, we
did some more digging and got confused until we finally looked at the
definition for full_pbn:

  The bandwidth of the link at the trained link rate and lane count
  between the DP Source device and the DP Sink device with no time slots
  allocated to VC Payloads, represented as a Payload Bandwidth Number. As
  with the Available_Payload_Bandwidth_Number, this number is determined
  by the link with the lowest lane count and link rate.

That's what we get for not reading specs closely enough, hehe. So, since
full_pbn is definitely what we want for doing bandwidth restriction
checks - let's start using that instead and ignore available_pbn
entirely.

Signed-off-by: Lyude Paul <lyude@redhat.com>
Fixes: cd82d82cbc04 ("drm/dp_mst: Add branch bandwidth validation to MST atomic check")
Cc: Mikita Lipski <mikita.lipski@amd.com>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Sean Paul <sean@poorly.run>
Reviewed-by: Mikita Lipski <mikita.lipski@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200306234623.547525-3-lyude@redhat.com
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Tested-by: Hans de Goede <hdegoede@redhat.com>

drm/dp_mst: Rename drm_dp_mst_is_dp_mst_end_device() to be less redundant

It's already prefixed by dp_mst, so we don't really need to repeat
ourselves here. One of the changes I should have picked up originally
when reviewing MST DSC support.

There should be no functional changes here

Cc: Mikita Lipski <mikita.lipski@amd.com>
Cc: Sean Paul <seanpaul@google.com>
Cc: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Lyude Paul <lyude@redhat.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Tested-by: Hans de Goede <hdegoede@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200306234623.547525-2-lyude@redhat.com

Merge branch 'fixes' of git://git./linux/kernel/git/viro/vfs

Pull vfs fixes from Al Viro:
"A couple of fixes for old crap in ->atomic_open() instances"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
cifs_atomic_open(): fix double-put on late allocation failure
gfs2_atomic_open(): fix O_EXCL|O_CREAT handling on cold dcache

net: systemport: fix index check to avoid an array out of bounds access

Currently the bounds check on index is off by one and can lead to
an out of bounds access on array priv->filters_loc when index is
RXCHK_BRCM_TAG_MAX.

Fixes: bb9051a2b230 ("net: systemport: Add support for WAKE_FILTER")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tc-testing: add ETS scheduler to tdc build configuration

add CONFIG_NET_SCH_ETS to 'config', otherwise test suites using this file
to perform a full tdc run will encounter the following warning:

ok 645 e90e - Add ETS qdisc using bands # skipped - "-----> teardown stage" did not complete successfully

Fixes: 82c664b69c8b ("selftests: qdiscs: Add test coverage for ETS Qdisc")
Reported-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: fix MDIO bus PM PHY resuming

So far we have the unfortunate situation that mdio_bus_phy_may_suspend()
is called in suspend AND resume path, assuming that function result is
the same. After the original change this is no longer the case,
resulting in broken resume as reported by Geert.

To fix this call mdio_bus_phy_may_suspend() in the suspend path only,
and let the phy_device store the info whether it was suspended by
MDIO bus PM.

Fixes: 503ba7c69610 ("net: phy: Avoid multiple suspends")
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cifs_atomic_open(): fix double-put on late allocation failure

several iterations of ->atomic_open() calling conventions ago, we
used to need fput() if ->atomic_open() failed at some point after
successful finish_open().  Now (since 2016) it's not needed -
struct file carries enough state to make fput() work regardless
of the point in struct file lifecycle and discarding it on
failure exits in open() got unified.  Unfortunately, I'd missed
the fact that we had an instance of ->atomic_open() (cifs one)
that used to need that fput(), as well as the stale comment in
finish_open() demanding such late failure handling.  Trivially
fixed...

Fixes: fe9ec8291fca "do_last(): take fput() on error after opening to out:"
Cc: stable@kernel.org # v4.7+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

gfs2_atomic_open(): fix O_EXCL|O_CREAT handling on cold dcache

with the way fs/namei.c:do_last() had been done, ->atomic_open()
instances needed to recognize the case when existing file got
found with O_EXCL|O_CREAT, either by falling back to finish_no_open()
or failing themselves. gfs2 one didn't.

Fixes: 6d4ade986f9c (GFS2: Add atomic_open support)
Cc: stable@kernel.org # v3.11
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Merge branch 'hns3-fixes'

Huazhong Tan says:

====================
net: hns3: fixes for -net

This series includes several bugfixes for the HNS3 ethernet driver.

[patch 1] fixes an "tc qdisc del" failure.
[patch 2] fixes SW & HW VLAN table not consistent issue.
[patch 3] fixes a RMW issue related to VLAN filter switch.
[patch 4] clears port based VLAN when uploading PF.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: clear port base VLAN when unload PF

Currently, PF missed to clear the port base VLAN for VF when
unload. In this case, the VLAN id will remain in the VLAN
table. This patch fixes it.

Fixes: 92f11ea177cd ("net: hns3: fix set port based VLAN issue for VF")
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: fix RMW issue for VLAN filter switch

According to the user manual, the ingress and egress VLAN filter
are configured at the same time. Currently, hclge_init_vlan_config()
and hclge_set_vlan_spoofchk() will both change the VLAN filter
switch. So it's necessary to read the old configuration before
modifying it.

Fixes: 22044f95faa0 ("net: hns3: add support for spoof check setting")
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: fix VF VLAN table entries inconsistent issue

Currently, if VF is loaded on the host side, the host doesn't
clear the VF's VLAN table entries when VF removing. In this
case, when doing reset and disabling sriov at the same time the
VLAN device over VF will be removed, but the VLAN table entries
in hardware are remained.

This patch fixes it by asking PF to clear the VLAN table entries for
VF when VF is removing. It also clears the VLAN table full bit
after VF VLAN table entries being cleared.

Fixes: c6075b193462 ("net: hns3: Record VF vlan tables")
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: fix "tc qdisc del" failed issue

The HNS3 driver supports to configure TC numbers and TC to priority
map via "tc" tool. But when delete the rule, will fail, because
the HNS3 driver needs at least one TC, but the "tc" tool sets TC
number to zero when delete.

This patch makes sure that the TC number is at least one.

Fixes: 30d240dfa2e8 ("net: hns3: Add mqprio hardware offload support in hns3 driver")
Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

taprio: Fix sending packets without dequeueing them

There was a bug that was causing packets to be sent to the driver
without first calling dequeue() on the "child" qdisc. And the KASAN
report below shows that sending a packet without calling dequeue()
leads to bad results.

The problem is that when checking the last qdisc "child" we do not set
the returned skb to NULL, which can cause it to be sent to the driver,
and so after the skb is sent, it may be freed, and in some situations a
reference to it may still be in the child qdisc, because it was never
dequeued.

The crash log looks like this:

[   19.937538] ==================================================================
[   19.938300] BUG: KASAN: use-after-free in taprio_dequeue_soft+0x620/0x780
[   19.938968] Read of size 4 at addr ffff8881128628cc by task swapper/1/0
[   19.939612]
[   19.939772] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc3+ #97
[   19.940397] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qe4
[   19.941523] Call Trace:
[   19.941774]  <IRQ>
[   19.941985]  dump_stack+0x97/0xe0
[   19.942323]  print_address_description.constprop.0+0x3b/0x60
[   19.942884]  ? taprio_dequeue_soft+0x620/0x780
[   19.943325]  ? taprio_dequeue_soft+0x620/0x780
[   19.943767]  __kasan_report.cold+0x1a/0x32
[   19.944173]  ? taprio_dequeue_soft+0x620/0x780
[   19.944612]  kasan_report+0xe/0x20
[   19.944954]  taprio_dequeue_soft+0x620/0x780
[   19.945380]  __qdisc_run+0x164/0x18d0
[   19.945749]  net_tx_action+0x2c4/0x730
[   19.946124]  __do_softirq+0x268/0x7bc
[   19.946491]  irq_exit+0x17d/0x1b0
[   19.946824]  smp_apic_timer_interrupt+0xeb/0x380
[   19.947280]  apic_timer_interrupt+0xf/0x20
[   19.947687]  </IRQ>
[   19.947912] RIP: 0010:default_idle+0x2d/0x2d0
[   19.948345] Code: 00 00 41 56 41 55 65 44 8b 2d 3f 8d 7c 7c 41 54 55 53 0f 1f 44 00 00 e8 b1 b2 c5 fd e9 07 00 3
[   19.950166] RSP: 0018:ffff88811a3efda0 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
[   19.950909] RAX: 0000000080000000 RBX: ffff88811a3a9600 RCX: ffffffff8385327e
[   19.951608] RDX: 1ffff110234752c0 RSI: 0000000000000000 RDI: ffffffff8385262f
[   19.952309] RBP: ffffed10234752c0 R08: 0000000000000001 R09: ffffed10234752c1
[   19.953009] R10: ffffed10234752c0 R11: ffff88811a3a9607 R12: 0000000000000001
[   19.953709] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[   19.954408]  ? default_idle_call+0x2e/0x70
[   19.954816]  ? default_idle+0x1f/0x2d0
[   19.955192]  default_idle_call+0x5e/0x70
[   19.955584]  do_idle+0x3d4/0x500
[   19.955909]  ? arch_cpu_idle_exit+0x40/0x40
[   19.956325]  ? _raw_spin_unlock_irqrestore+0x23/0x30
[   19.956829]  ? trace_hardirqs_on+0x30/0x160
[   19.957242]  cpu_startup_entry+0x19/0x20
[   19.957633]  start_secondary+0x2a6/0x380
[   19.958026]  ? set_cpu_sibling_map+0x18b0/0x18b0
[   19.958486]  secondary_startup_64+0xa4/0xb0
[   19.958921]
[   19.959078] Allocated by task 33:
[   19.959412]  save_stack+0x1b/0x80
[   19.959747]  __kasan_kmalloc.constprop.0+0xc2/0xd0
[   19.960222]  kmem_cache_alloc+0xe4/0x230
[   19.960617]  __alloc_skb+0x91/0x510
[   19.960967]  ndisc_alloc_skb+0x133/0x330
[   19.961358]  ndisc_send_ns+0x134/0x810
[   19.961735]  addrconf_dad_work+0xad5/0xf80
[   19.962144]  process_one_work+0x78e/0x13a0
[   19.962551]  worker_thread+0x8f/0xfa0
[   19.962919]  kthread+0x2ba/0x3b0
[   19.963242]  ret_from_fork+0x3a/0x50
[   19.963596]
[   19.963753] Freed by task 33:
[   19.964055]  save_stack+0x1b/0x80
[   19.964386]  __kasan_slab_free+0x12f/0x180
[   19.964830]  kmem_cache_free+0x80/0x290
[   19.965231]  ip6_mc_input+0x38a/0x4d0
[   19.965617]  ipv6_rcv+0x1a4/0x1d0
[   19.965948]  __netif_receive_skb_one_core+0xf2/0x180
[   19.966437]  netif_receive_skb+0x8c/0x3c0
[   19.966846]  br_handle_frame_finish+0x779/0x1310
[   19.967302]  br_handle_frame+0x42a/0x830
[   19.967694]  __netif_receive_skb_core+0xf0e/0x2a90
[   19.968167]  __netif_receive_skb_one_core+0x96/0x180
[   19.968658]  process_backlog+0x198/0x650
[   19.969047]  net_rx_action+0x2fa/0xaa0
[   19.969420]  __do_softirq+0x268/0x7bc
[   19.969785]
[   19.969940] The buggy address belongs to the object at ffff888112862840
[   19.969940]  which belongs to the cache skbuff_head_cache of size 224
[   19.971202] The buggy address is located 140 bytes inside of
[   19.971202]  224-byte region [ffff888112862840, ffff888112862920)
[   19.972344] The buggy address belongs to the page:
[   19.972820] page:ffffea00044a1800 refcount:1 mapcount:0 mapping:ffff88811a2bd1c0 index:0xffff8881128625c0 compo0
[   19.973930] flags: 0x8000000000010200(slab|head)
[   19.974388] raw: 8000000000010200 ffff88811a2ed650 ffff88811a2ed650 ffff88811a2bd1c0
[   19.975151] raw: ffff8881128625c0 0000000000190013 00000001ffffffff 0000000000000000
[   19.975915] page dumped because: kasan: bad access detected
[   19.976461] page_owner tracks the page as allocated
[   19.976946] page last allocated via order 2, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NO)
[   19.978332]  prep_new_page+0x24b/0x330
[   19.978707]  get_page_from_freelist+0x2057/0x2c90
[   19.979170]  __alloc_pages_nodemask+0x218/0x590
[   19.979619]  new_slab+0x9d/0x300
[   19.979948]  ___slab_alloc.constprop.0+0x2f9/0x6f0
[   19.980421]  __slab_alloc.constprop.0+0x30/0x60
[   19.980870]  kmem_cache_alloc+0x201/0x230
[   19.981269]  __alloc_skb+0x91/0x510
[   19.981620]  alloc_skb_with_frags+0x78/0x4a0
[   19.982043]  sock_alloc_send_pskb+0x5eb/0x750
[   19.982476]  unix_stream_sendmsg+0x399/0x7f0
[   19.982904]  sock_sendmsg+0xe2/0x110
[   19.983262]  ____sys_sendmsg+0x4de/0x6d0
[   19.983660]  ___sys_sendmsg+0xe4/0x160
[   19.984032]  __sys_sendmsg+0xab/0x130
[   19.984396]  do_syscall_64+0xe7/0xae0
[   19.984761] page last free stack trace:
[   19.985142]  __free_pages_ok+0x432/0xbc0
[   19.985533]  qlist_free_all+0x56/0xc0
[   19.985907]  quarantine_reduce+0x149/0x170
[   19.986315]  __kasan_kmalloc.constprop.0+0x9e/0xd0
[   19.986791]  kmem_cache_alloc+0xe4/0x230
[   19.987182]  prepare_creds+0x24/0x440
[   19.987548]  do_faccessat+0x80/0x590
[   19.987906]  do_syscall_64+0xe7/0xae0
[   19.988276]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   19.988775]
[   19.988930] Memory state around the buggy address:
[   19.989402]  ffff888112862780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   19.990111]  ffff888112862800: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
[   19.990822] >ffff888112862880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   19.991529]                                               ^
[   19.992081]  ffff888112862900: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
[   19.992796]  ffff888112862980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc

Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler")
Reported-by: Michael Schmidt <michael.schmidt@eti.uni-siegen.de>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Acked-by: Andre Guedes <andre.guedes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'for-linus-5.6-2' of git://github.com/cminyard/linux-ipmi

Pull IPMI fix from Corey Minyard:
"Fix a message spew on some system

  The call to platform_get_irq() was changed to print a log if the
  interrupt was not available, and that was causing bogus messages to
  spew out for the IPMI driver. People have requested that this get in
  to 5.6 so I'm sending it along"

* tag 'for-linus-5.6-2' of git://github.com/cminyard/linux-ipmi:
  ipmi_si: Avoid spurious errors for optional IRQs

Merge branch 'linus' of git://git./linux/kernel/git/herbert/crypto-2.6

Pull crypto fix from Herbert Xu:
"Fix a build problem with x86/curve25519"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: x86/curve25519 - support assemblers with no adx support

ovl: fix lock in ovl_llseek()

ovl_inode_lock() is interruptible. When inode_lock() in ovl_llseek()
was replaced with ovl_inode_lock(), we did not add a check for error.

Fix this by making ovl_inode_lock() uninterruptible and change the
existing call sites to use an _interruptible variant.

Reported-by: syzbot+66a9752fa927f745385e@syzkaller.appspotmail.com
Fixes: b1f9d3858f72 ("ovl: use ovl_inode_lock in ovl_llseek()")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

block: Fix partition support for host aware zoned block devices

Commit b72053072c0b ("block: allow partitions on host aware zone
devices") introduced the helper function disk_has_partitions() to check
if a given disk has valid partitions. However, since this function result
directly depends on the disk partition table length rather than the
actual existence of valid partitions in the table, it returns true even
after all partitions are removed from the disk. For host aware zoned
block devices, this results in zone management support to be kept
disabled even after removing all partitions.

Fix this by changing disk_has_partitions() to walk through the partition
table entries and return true if and only if a valid non-zero size
partition is found.

Fixes: b72053072c0b ("block: allow partitions on host aware zone devices")
Cc: stable@vger.kernel.org # 5.5
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-mq: insert flush request to the front of dispatch queue

commit 01e99aeca397 ("blk-mq: insert passthrough request into
hctx->dispatch directly") may change to add flush request to the tail
of dispatch by applying the 'add_head' parameter of
blk_mq_sched_insert_request.

Turns out this way causes performance regression on NCQ controller because
flush is non-NCQ command, which can't be queued when there is any in-flight
NCQ command. When adding flush rq to the front of hctx->dispatch, it is
easier to introduce extra time to flush rq's latency compared with adding
to the tail of dispatch queue because of S_SCHED_RESTART, then chance of
flush merge is increased, and less flush requests may be issued to
controller.

So always insert flush request to the front of dispatch queue just like
before applying commit 01e99aeca397 ("blk-mq: insert passthrough request
into hctx->dispatch directly").

Cc: Damien Le Moal <Damien.LeMoal@wdc.com>
Cc: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 01e99aeca397 ("blk-mq: insert passthrough request into hctx->dispatch directly")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

s390/dasd: fix data corruption for thin provisioned devices

Devices are formatted in multiple of tracks.
For an Extent Space Efficient (ESE) volume we get errors when accessing
unformatted tracks. In this case the driver either formats the track on
the flight for write requests or returns zero data for read requests.

In case a request spans multiple tracks, the indication of an unformatted
track presented for the first track is incorrectly applied to all tracks
covered by the request. As a result, tracks containing data will be handled
as empty, resulting in zero data being returned on read, or overwriting
existing data with zero on write.

Fix by determining the track that gets the NRF error.
For write requests only format the track that is surely not formatted.
For Read requests all tracks before have returned valid data and should not
be touched.
All tracks after the unformatted track might be formatted or not. Those are
returned to the blocklayer to build a new request.

When using alias devices there is a chance that multiple write requests
trigger a format of the same track which might lead to data loss. Ensure
that a track is formatted only once by maintaining a list of currently
processed tracks.

Fixes: 5e2b17e712cf ("s390/dasd: Add dynamic formatting support for ESE volumes")
Cc: stable@vger.kernel.org # 5.3+
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

perf/amd/uncore: Replace manual sampling check with CAP_NO_INTERRUPT flag

Enable the sampling check in kernel/events/core.c::perf_event_open(),
which returns the more appropriate -EOPNOTSUPP.

BEFORE:

  $ sudo perf record -a -e instructions,l3_request_g1.caching_l3_cache_accesses true
  Error:
  The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (l3_request_g1.caching_l3_cache_accesses).
  /bin/dmesg | grep -i perf may provide additional information.

With nothing relevant in dmesg.

AFTER:

  $ sudo perf record -a -e instructions,l3_request_g1.caching_l3_cache_accesses true
  Error:
  l3_request_g1.caching_l3_cache_accesses: PMU Hardware doesn't support sampling/overflow-interrupts. Try 'perf stat'

Fixes: c43ca5091a37 ("perf/x86/amd: Add support for AMD NB and L2I "uncore" counters")
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200311191323.13124-1-kim.phillips@amd.com

mmc: core: Respect MMC_CAP_NEED_RSP_BUSY for eMMC sleep command

The busy timeout for the CMD5 to put the eMMC into sleep state, is specific
to the card. Potentially the timeout may exceed the host->max_busy_timeout.
If that becomes the case, mmc_sleep() converts from using an R1B response
to an R1 response, as to prevent the host from doing HW busy detection.

However, it has turned out that some hosts requires an R1B response no
matter what, so let's respect that via checking MMC_CAP_NEED_RSP_BUSY. Note
that, if the R1B gets enforced, the host becomes fully responsible of
managing the needed busy timeout, in one way or the other.

Suggested-by: Sowjanya Komatineni <skomatineni@nvidia.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20200311092036.16084-1-ulf.hansson@linaro.org
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>

misc: eeprom: at24: fix regulator underflow

The at24 driver attempts to read a byte from the device to validate that
it's actually present, and if not, disables the vcc regulator and
returns -ENODEV. However, between the read and the error handling path,
pm_runtime_idle() is called and invokes the driver's suspend callback,
which also disables the vcc regulator. This leads to an underflow of the
regulator enable count if the EEPROM is not present.

Move the pm_runtime_suspend() call to be after the error handling path
to resolve this.

Fixes: cd5676db0574 ("misc: eeprom: at24: support pm_runtime control")
Signed-off-by: Michael Auchter <michael.auchter@ni.com>
Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>

net: mvmdio: avoid error message for optional IRQ

Per the dt-binding the interrupt is optional so use
platform_get_irq_optional() instead of platform_get_irq(). Since
commit 7723f4c5ecdb ("driver core: platform: Add an error message to
platform_get_irq*()") platform_get_irq() produces an error message

orion-mdio f1072004.mdio: IRQ index 0 not found

which is perfectly normal if one hasn't specified the optional property
in the device tree.

Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Add missing mask of ATU occupancy register

Only the bottom 12 bits contain the ATU bin occupancy statistics. The
upper bits need masking off.

Fixes: e0c69ca7dfbb ("net: dsa: mv88e6xxx: Add ATU occupancy via devlink resources")
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: memcg: fix lockdep splat in inet_csk_accept()

Locking newsk while still holding the listener lock triggered
a lockdep splat [1]

We can simply move the memcg code after we release the listener lock,
as this can also help if multiple threads are sharing a common listener.

Also fix a typo while reading socket sk_rmem_alloc.

[1]
WARNING: possible recursive locking detected
5.6.0-rc3-syzkaller #0 Not tainted
--------------------------------------------
syz-executor598/9524 is trying to acquire lock:
ffff88808b5b8b90 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline]
ffff88808b5b8b90 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x69f/0xd30 net/ipv4/inet_connection_sock.c:492

but task is already holding lock:
ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline]
ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x8d/0xd30 net/ipv4/inet_connection_sock.c:445

other info that might help us debug this:
Possible unsafe locking scenario:

       CPU0
       ----
  lock(sk_lock-AF_INET6);
  lock(sk_lock-AF_INET6);

*** DEADLOCK ***

May be due to missing lock nesting notation

1 lock held by syz-executor598/9524:
#0: ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline]
#0: ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x8d/0xd30 net/ipv4/inet_connection_sock.c:445

stack backtrace:
CPU: 0 PID: 9524 Comm: syz-executor598 Not tainted 5.6.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x188/0x20d lib/dump_stack.c:118
print_deadlock_bug kernel/locking/lockdep.c:2370 [inline]
check_deadlock kernel/locking/lockdep.c:2411 [inline]
validate_chain kernel/locking/lockdep.c:2954 [inline]
__lock_acquire.cold+0x114/0x288 kernel/locking/lockdep.c:3954
lock_acquire+0x197/0x420 kernel/locking/lockdep.c:4484
lock_sock_nested+0xc5/0x110 net/core/sock.c:2947
lock_sock include/net/sock.h:1541 [inline]
inet_csk_accept+0x69f/0xd30 net/ipv4/inet_connection_sock.c:492
inet_accept+0xe9/0x7c0 net/ipv4/af_inet.c:734
__sys_accept4_file+0x3ac/0x5b0 net/socket.c:1758
__sys_accept4+0x53/0x90 net/socket.c:1809
__do_sys_accept4 net/socket.c:1821 [inline]
__se_sys_accept4 net/socket.c:1818 [inline]
__x64_sys_accept4+0x93/0xf0 net/socket.c:1818
do_syscall_64+0xf6/0x790 arch/x86/entry/common.c:294
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4445c9
Code: e8 0c 0d 03 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffc35b37608 EFLAGS: 00000246 ORIG_RAX: 0000000000000120
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004445c9
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 0000000000000000 R08: 0000000000306777 R09: 0000000000306777
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00000000004053d0 R14: 0000000000000000 R15: 0000000000000000

Fixes: d752a4986532 ("net: memcg: late association of sock to memcg")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 's390-qeth-fixes'

Julian Wiedmann says:

====================
s390/qeth: fixes 2020-03-11

please apply the following patch series for qeth to netdev's net tree.

Just one fix to get the RX buffer pool resizing right, with two
preparatory cleanups.
This is on the larger side given where we are in the -rc cycle, but a
big chunk of the delta is just refactoring to make the fix look nice.

I intentionally split these off from yesterday's series. No objections
if you'd rather punt them to net-next, the series should apply cleanly.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

s390/qeth: implement smarter resizing of the RX buffer pool

The RX buffer pool is allocated in qeth_alloc_qdio_queues().
A subsequent pool resizing is then handled in a very simple way:
first free the current pool, then allocate a new pool of the requested
size.

There's two ways where this can go wrong:
1. if the resize action happens _before_ the initial pool was allocated,
   then a subsequent initialization will call qeth_alloc_qdio_queues()
   and fill the pool with a second(!) set of pages. We consume twice the
   planned amount of memory.
   This is easy to fix - just skip the resizing if the queues haven't
   been allocated yet.
2. if the initial pool was created by qeth_alloc_qdio_queues() but a
   subsequent resizing fails, then the device has no(!) RX buffer pool.
   The next initialization will _not_ call qeth_alloc_qdio_queues(), and
   attempting to back the RX buffers with pages in
   qeth_init_qdio_queues() will fail.
   Not very difficult to fix either - instead of re-allocating the whole
   pool, just allocate/free as many entries to match the desired size.

Fixes: 4a71df50047f ("qeth: new qeth device driver")
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

s390/qeth: refactor buffer pool code

In preparation for a subsequent fix, split out helpers to allocate/free
individual pool entries.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>