Mikulas Patocka [Tue, 7 May 2019 18:28:35 +0000 (14:28 -0400)]
dm integrity: correctly calculate the size of metadata area
When we use separate devices for data and metadata, dm-integrity would
incorrectly calculate the size of the metadata device as if it had
512-byte block size - and it would refuse activation with larger block
size and smaller metadata device.
Fix this so that it takes actual block size into account, which fixes
the following reported issue:
https://gitlab.com/cryptsetup/cryptsetup/issues/450
Fixes:
356d9d52e122 ("dm integrity: allow separate metadata device")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
YueHaibing [Sat, 4 May 2019 10:30:36 +0000 (18:30 +0800)]
dm dust: Make dm_dust_init and dm_dust_exit static
Fix sparse warnings:
drivers/md/dm-dust.c:495:12: warning: symbol 'dm_dust_init' was not declared. Should it be static?
drivers/md/dm-dust.c:505:13: warning: symbol 'dm_dust_exit' was not declared. Should it be static?
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Colin Ian King [Wed, 1 May 2019 12:57:17 +0000 (13:57 +0100)]
dm dust: remove redundant unsigned comparison to less than zero
Variable block is an unsigned long long hence the less than zero
comparison is always false, hence it is redundant and can be removed.
Addresses-Coverity: ("Unsigned compared against 0")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Bryan Gurney <bgurney@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Martin Wilck [Mon, 29 Apr 2019 09:48:15 +0000 (11:48 +0200)]
dm mpath: always free attached_handler_name in parse_path()
Commit
b592211c33f7 ("dm mpath: fix attached_handler_name leak and
dangling hw_handler_name pointer") fixed a memory leak for the case
where setup_scsi_dh() returns failure. But setup_scsi_dh may return
success and not "use" attached_handler_name if the
retain_attached_hwhandler flag is not set on the map. As setup_scsi_sh
properly "steals" the pointer by nullifying it, freeing it
unconditionally in parse_path() is safe.
Fixes:
b592211c33f7 ("dm mpath: fix attached_handler_name leak and dangling hw_handler_name pointer")
Cc: stable@vger.kernel.org
Reported-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Helen Koike [Fri, 26 Apr 2019 20:09:55 +0000 (17:09 -0300)]
dm init: fix max devices/targets checks
dm-init should allow up to DM_MAX_{DEVICES,TARGETS} for devices/targets,
and not DM_MAX_{DEVICES,TARGETS} - 1.
Fix the checks and also fix the error message when the number of devices
is surpassed.
Fixes:
6bbc923dfcf57d ("dm: add support to directly boot to a mapped device")
Cc: stable@vger.kernel.org
Signed-off-by: Helen Koike <helen.koike@collabora.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Bryan Gurney [Thu, 7 Mar 2019 20:42:39 +0000 (15:42 -0500)]
dm: add dust target
Add the dm-dust target, which simulates the behavior of bad sectors
at arbitrary locations, and the ability to enable the emulation of
the read failures at an arbitrary time.
This target behaves similarly to a linear target. At a given time,
the user can send a message to the target to start failing read
requests on specific blocks. When the failure behavior is enabled,
reads of blocks configured "bad" will fail with EIO.
Writes of blocks configured "bad" will result in the following:
1. Remove the block from the "bad block list".
2. Successfully complete the write.
After this point, the block will successfully contain the written
data, and will service reads and writes normally. This emulates the
behavior of a "remapped sector" on a hard disk drive.
dm-dust provides logging of which blocks have been added or removed
to the "bad block list", as well as logging when a block has been
removed from the bad block list. These messages can be used
alongside the messages from the driver using a dm-dust device to
analyze the driver's behavior when a read fails at a given time.
(This logging can be reduced via a "quiet" mode, if desired.)
NOTE: If the block size is larger than 512 bytes, only the first sector
of each "dust block" is detected. Placing a limiting layer above a dust
target, to limit the minimum I/O size to the dust block size, will
ensure proper emulation of the given large block size.
Signed-off-by: Bryan Gurney <bgurney@redhat.com>
Co-developed-by: Joe Shimkus <jshimkus@redhat.com>
Co-developed-by: John Dorminy <jdorminy@redhat.com>
Co-developed-by: John Pittman <jpittman@redhat.com>
Co-developed-by: Thomas Jaskiewicz <tjaskiew@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mikulas Patocka [Fri, 26 Apr 2019 13:59:24 +0000 (09:59 -0400)]
dm writecache: avoid unnecessary lookups in writecache_find_entry()
This is a small optimization in writecache_find_entry().
If we go past the condition "if (unlikely(!node))", we can be certain that
there is no entry in the tree that has the block equal to the "block"
variable.
Consequently, we can return the next entry directly, we don't need to go
to the second part of the function that finds the entry with lowest or
highest seq number that matches the "block" variable.
Also, add some whitespace and cleanup needless braces.
Suggested-by: Huaisheng Ye <yehs1@lenovo.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Huaisheng Ye [Thu, 25 Apr 2019 13:31:19 +0000 (21:31 +0800)]
dm writecache: remove unused member page_offset in writeback_struct
The stucture member page_offset in writeback_struct never has been
used actually. Remove it.
Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mikulas Patocka [Thu, 25 Apr 2019 16:07:54 +0000 (12:07 -0400)]
dm delay: fix a crash when invalid device is specified
When the target line contains an invalid device, delay_ctr() will call
delay_dtr() with NULL workqueue. Attempting to destroy the NULL
workqueue causes a crash.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Peng Wang [Thu, 18 Apr 2019 13:59:19 +0000 (21:59 +0800)]
dm: only initialize md->dax_dev if CONFIG_DAX_DRIVER is enabled
md->dax_dev defaults to NULL and there is no need to initialize it
if CONFIG_DAX_DRIVER is disabled.
Signed-off-by: Peng Wang <rocking@whu.edu.cn>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Yufen Yu [Wed, 24 Apr 2019 15:19:05 +0000 (23:19 +0800)]
dm mpath: fix missing call of path selector type->end_io
After commit
396eaf21ee17 ("blk-mq: improve DM's blk-mq IO merging via
blk_insert_cloned_request feedback"), map_request() will requeue the tio
when issued clone request return BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE.
Thus, if device driver status is error, a tio may be requeued multiple
times until the return value is not DM_MAPIO_REQUEUE. That means
type->start_io may be called multiple times, while type->end_io is only
called when IO complete.
In fact, even without commit
396eaf21ee17, setup_clone() failure can
also cause tio requeue and associated missed call to type->end_io.
The service-time path selector selects path based on in_flight_size,
which is increased by st_start_io() and decreased by st_end_io().
Missed calls to st_end_io() can lead to in_flight_size count error and
will cause the selector to make the wrong choice. In addition,
queue-length path selector will also be affected.
To fix the problem, call type->end_io in ->release_clone_rq before tio
requeue. map_info is passed to ->release_clone_rq() for map_request()
error path that result in requeue.
Fixes:
396eaf21ee17 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
Cc: stable@vger.kernl.org
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Thu, 18 Apr 2019 14:29:48 +0000 (10:29 -0400)]
dm thin metadata: do not write metadata if no changes occurred
Otherwise, just activating a thin-pool and thin device and then
deactivating them will cause the thin-pool metadata to be changed
(e.g. superblock written) -- even without any metadata being changed.
Add 'in_service' flag to struct dm_pool_metadata and set it in
pmd_write_lock() because all on-disk metadata changes must take a write
lock of pmd->root_lock. Once 'in_service' is set it is never cleared.
__commit_transaction() will return 0 if 'in_service' is not set.
dm_pool_commit_metadata() is updated to use __pmd_write_lock() so that
it isn't the sole reason for putting a thin-pool in service.
Also fix dm_pool_commit_metadata() to open the next transaction if the
return from __commit_transaction() is 0. Not seeing why the early
return ever made since for a return of 0 given that dm-io's async_io(),
as used by bufio, always returns 0.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Mon, 15 Apr 2019 20:54:36 +0000 (16:54 -0400)]
dm thin metadata: add wrappers for managing write locking of metadata
No functional change, but this prepares to hook off of pmd_write_lock()
with additional functionality (as provided in next commit).
Suggested-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Mon, 15 Apr 2019 20:40:08 +0000 (16:40 -0400)]
dm thin metadata: check __commit_transaction()'s return
Fix __reserve_metadata_snap() to return early if __commit_transaction()
fails.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Fri, 22 Mar 2019 20:12:22 +0000 (16:12 -0400)]
dm space map common: zero entire ll_disk
Otherwise, memory that is allocated (and potentially not previously
zeroed) will get written to disk as part of the space maps.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Huaisheng Ye [Fri, 12 Apr 2019 15:28:14 +0000 (11:28 -0400)]
dm writecache: add unlikely for returned value of rb_next/prev
In functions writecache_discard() and writecache_find_entry() there is a
high probablity that the pointer of structure rb_node won't equal NULL.
Add unlikely for the pointer node NULL.
Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Huaisheng Ye [Fri, 12 Apr 2019 15:27:18 +0000 (11:27 -0400)]
dm writecache: remove needless dereferences in __writecache_writeback_pmem()
bio is already available so there is no need to access it in terms of
the wb pointer.
Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Nikos Tsironis [Sun, 17 Mar 2019 12:22:58 +0000 (14:22 +0200)]
dm snapshot: Use fine-grained locking scheme
Substitute the global locking scheme with a fine grained one, employing
the read-write semaphore and the scalable exception tables with
per-bucket locks introduced by the previous two commits.
Summarizing, we now use a read-write semaphore to protect the mostly
read fields of the snapshot structure, e.g., valid, active, etc., and
per-bucket bit spinlocks to protect accesses to the complete and pending
exception tables.
Finally, we use an extra spinlock (pe_allocation_lock) to serialize the
allocation of new exceptions by the exception store. This allocation is
really fast, so the extra spinlock doesn't hurt the performance.
This scheme allows dm-snapshot to scale better, resulting in increased
IOPS and reduced latency.
Following are some benchmark results using the null_blk device:
modprobe null_blk gb=1024 bs=512 submit_queues=8 hw_queue_depth=4096 \
queue_mode=2 irqmode=1 completion_nsec=1 nr_devices=1
* Benchmark fio_origin_randwrite_throughput_N, from the device mapper
test suite [1] (direct IO, random 4K writes to origin device, IO
engine libaio):
+--------------+-------------+------------+
| # of workers | IOPS Before | IOPS After |
+--------------+-------------+------------+
| 1 | 57708 | 66421 |
| 2 | 63415 | 77589 |
| 4 | 67276 | 98839 |
| 8 | 60564 | 109258 |
+--------------+-------------+------------+
* Benchmark fio_origin_randwrite_latency_N, from the device mapper test
suite [1] (direct IO, random 4K writes to origin device, IO engine
psync):
+--------------+-----------------------+----------------------+
| # of workers | Latency (usec) Before | Latency (usec) After |
+--------------+-----------------------+----------------------+
| 1 | 16.25 | 13.27 |
| 2 | 31.65 | 25.08 |
| 4 | 55.28 | 41.08 |
| 8 | 121.47 | 74.44 |
+--------------+-----------------------+----------------------+
* Benchmark fio_snapshot_randwrite_throughput_N, from the device mapper
test suite [1] (direct IO, random 4K writes to snapshot device, IO
engine libaio):
+--------------+-------------+------------+
| # of workers | IOPS Before | IOPS After |
+--------------+-------------+------------+
| 1 | 72593 | 84938 |
| 2 | 97379 | 134973 |
| 4 | 90610 | 143077 |
| 8 | 90537 | 180085 |
+--------------+-------------+------------+
* Benchmark fio_snapshot_randwrite_latency_N, from the device mapper
test suite [1] (direct IO, random 4K writes to snapshot device, IO
engine psync):
+--------------+-----------------------+----------------------+
| # of workers | Latency (usec) Before | Latency (usec) After |
+--------------+-----------------------+----------------------+
| 1 | 12.53 | 10.6 |
| 2 | 19.78 | 14.89 |
| 4 | 40.37 | 23.47 |
| 8 | 89.32 | 48.48 |
+--------------+-----------------------+----------------------+
[1] https://github.com/jthornber/device-mapper-test-suite
Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Nikos Tsironis [Sun, 17 Mar 2019 12:22:57 +0000 (14:22 +0200)]
dm snapshot: Make exception tables scalable
Use list_bl to implement the exception hash tables' buckets. This change
permits concurrent access, to distinct buckets, by multiple threads.
Also, implement helper functions to lock and unlock the exception tables
based on the chunk number of the exception at hand.
We retain the global locking, by means of down_write(), which is
replaced by the next commit.
Still, we must acquire the per-bucket spinlocks when accessing the hash
tables, since list_bl does not allow modification on unlocked lists.
Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Nikos Tsironis [Sun, 17 Mar 2019 12:22:56 +0000 (14:22 +0200)]
dm snapshot: Replace mutex with rw semaphore
dm-snapshot uses a single mutex to serialize every access to the
snapshot state. This includes all accesses to the complete and pending
exception tables, which occur at every origin write, every snapshot
read/write and every exception completion.
The lock statistics indicate that this mutex is a bottleneck (average
wait time ~480 usecs for 8 processes doing random 4K writes to the
origin device) preventing dm-snapshot to scale as the number of threads
doing IO increases.
The major contention points are __origin_write()/snapshot_map() and
pending_complete(), i.e., the submission and completion of pending
exceptions.
Replace this mutex with a rw semaphore.
We essentially revert commit
ae1093be5a0ef9 ("dm snapshot: use mutex
instead of rw_semaphore") and together with the next two patches we
substitute the single mutex with a fine-grained locking scheme, where we
use a read-write semaphore to protect the mostly read fields of the
snapshot structure, e.g., valid, active, etc., and per-bucket bit
spinlocks to protect accesses to the complete and pending exception
tables.
Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Nikos Tsironis [Sun, 17 Mar 2019 12:22:55 +0000 (14:22 +0200)]
dm snapshot: Don't sleep holding the snapshot lock
When completing a pending exception, pending_complete() waits for all
conflicting reads to drain, before inserting the final, completed
exception. Conflicting reads are snapshot reads redirected to the
origin, because the relevant chunk is not remapped to the COW device the
moment we receive the read.
The completed exception must be inserted into the exception table after
all conflicting reads drain to ensure snapshot reads don't return
corrupted data. This is required because inserting the completed
exception into the exception table signals that the relevant chunk is
remapped and both origin writes and snapshot merging will now overwrite
the chunk in origin.
This wait is done holding the snapshot lock to ensure that
pending_complete() doesn't starve if new snapshot reads keep coming for
this chunk.
In preparation for the next commit, where we use a spinlock instead of a
mutex to protect the exception tables, we remove the need for holding
the lock while waiting for conflicting reads to drain.
We achieve this in two steps:
1. pending_complete() inserts the completed exception before waiting for
conflicting reads to drain and removes the pending exception after
all conflicting reads drain.
This ensures that new snapshot reads will be redirected to the COW
device, instead of the origin, and thus pending_complete() will not
starve. Moreover, we use the existence of both a completed and
a pending exception to signify that the COW is done but there are
conflicting reads in flight.
2. In __origin_write() we check first if there is a pending exception
and then if there is a completed exception. If there is a pending
exception any submitted BIO is delayed on the pe->origin_bios list and
DM_MAPIO_SUBMITTED is returned. This ensures that neither writes to the
origin nor snapshot merging can overwrite the origin chunk, until all
conflicting reads drain, and thus snapshot reads will not return
corrupted data.
Summarizing, we now have the following possible combinations of pending
and completed exceptions for a chunk, along with their meaning:
A. No exceptions exist: The chunk has not been remapped yet.
B. Only a pending exception exists: The chunk is currently being copied
to the COW device.
C. Both a pending and a completed exception exist: COW for this chunk
has completed but there are snapshot reads in flight which had been
redirected to the origin before the chunk was remapped.
D. Only the completed exception exists: COW has been completed and there
are no conflicting reads in flight.
Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Nikos Tsironis [Sun, 17 Mar 2019 12:22:54 +0000 (14:22 +0200)]
list_bl: Add hlist_bl_add_before/behind helpers
Add hlist_bl_add_before/behind helpers to add an element before/after an
existing element in a bl_list.
Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Nikos Tsironis [Sun, 17 Mar 2019 12:22:53 +0000 (14:22 +0200)]
list: Don't use WRITE_ONCE() in hlist_add_behind()
Commit
1c97be677f72b3 ("list: Use WRITE_ONCE() when adding to lists and
hlists") introduced the use of WRITE_ONCE() to atomically write the list
head's ->next pointer.
hlist_add_behind() doesn't touch the hlist head's ->first pointer so
there is no reason to use WRITE_ONCE() in this case.
Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Nikos Tsironis [Wed, 17 Apr 2019 14:19:18 +0000 (17:19 +0300)]
dm cache metadata: Fix loading discard bitset
Add missing dm_bitset_cursor_next() to properly advance the bitset
cursor.
Otherwise, the discarded state of all blocks is set according to the
discarded state of the first block.
Fixes:
ae4a46a1f6 ("dm cache metadata: use bitset cursor api to load discard bitset")
Cc: stable@vger.kernel.org
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Damien Le Moal [Thu, 18 Apr 2019 09:03:07 +0000 (18:03 +0900)]
dm zoned: Fix zone report handling
The function blkdev_report_zones() returns success even if no zone
information is reported (empty report). Empty zone reports can only
happen if the report start sector passed exceeds the device capacity.
The conditions for this to happen are either a bug in the caller code,
or, a change in the device that forced the low level driver to change
the device capacity to a value that is lower than the report start
sector. This situation includes a failed disk revalidation resulting in
the disk capacity being changed to 0.
If this change happens while dm-zoned is in its initialization phase
executing dmz_init_zones(), this function may enter an infinite loop
and hang the system. To avoid this, add a check to disallow empty zone
reports and bail out early. Also fix the function dmz_update_zone() to
make sure that the report for the requested zone was correctly obtained.
Fixes:
3b1a94c88b79 ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Shaun Tancheff <shaun@tancheff.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Dan Carpenter [Wed, 10 Apr 2019 08:12:31 +0000 (11:12 +0300)]
dm zoned: Silence a static checker warning
My static checker complains about this line from dmz_get_zoned_device()
aligned_capacity = dev->capacity & ~(blk_queue_zone_sectors(q) - 1);
The problem is that "aligned_capacity" and "dev->capacity" are sector_t
type (which is a u64 under most configs) but blk_queue_zone_sectors(q)
returns a u32 so the higher 32 bits in aligned_capacity are cleared to
zero. This patch adds a cast to address the issue.
Fixes:
114e025968b5 ("dm zoned: ignore last smaller runt zone")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Christoph Hellwig [Thu, 4 Apr 2019 16:33:34 +0000 (18:33 +0200)]
dm crypt: fix endianness annotations around org_sector_of_dmreq
The sector used here is a little endian value, so use the right
type for it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Linus Torvalds [Sun, 14 Apr 2019 22:17:41 +0000 (15:17 -0700)]
Linux 5.1-rc5
Linus Torvalds [Sun, 14 Apr 2019 22:09:40 +0000 (15:09 -0700)]
Merge branch 'page-refs' (page ref overflow)
Merge page ref overflow branch.
Jann Horn reported that he can overflow the page ref count with
sufficient memory (and a filesystem that is intentionally extremely
slow).
Admittedly it's not exactly easy. To have more than four billion
references to a page requires a minimum of 32GB of kernel memory just
for the pointers to the pages, much less any metadata to keep track of
those pointers. Jann needed a total of 140GB of memory and a specially
crafted filesystem that leaves all reads pending (in order to not ever
free the page references and just keep adding more).
Still, we have a fairly straightforward way to limit the two obvious
user-controllable sources of page references: direct-IO like page
references gotten through get_user_pages(), and the splice pipe page
duplication. So let's just do that.
* branch page-refs:
fs: prevent page refcount overflow in pipe_buf_get
mm: prevent get_user_pages() from overflowing page refcount
mm: add 'try_get_page()' helper function
mm: make page ref count overflow check tighter and more explicit
Matthew Wilcox [Fri, 5 Apr 2019 21:02:10 +0000 (14:02 -0700)]
fs: prevent page refcount overflow in pipe_buf_get
Change pipe_buf_get() to return a bool indicating whether it succeeded
in raising the refcount of the page (if the thing in the pipe is a page).
This removes another mechanism for overflowing the page refcount. All
callers converted to handle a failure.
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Thu, 11 Apr 2019 17:49:19 +0000 (10:49 -0700)]
mm: prevent get_user_pages() from overflowing page refcount
If the page refcount wraps around past zero, it will be freed while
there are still four billion references to it. One of the possible
avenues for an attacker to try to make this happen is by doing direct IO
on a page multiple times. This patch makes get_user_pages() refuse to
take a new page reference if there are already more than two billion
references to the page.
Reported-by: Jann Horn <jannh@google.com>
Acked-by: Matthew Wilcox <willy@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Thu, 11 Apr 2019 17:14:59 +0000 (10:14 -0700)]
mm: add 'try_get_page()' helper function
This is the same as the traditional 'get_page()' function, but instead
of unconditionally incrementing the reference count of the page, it only
does so if the count was "safe". It returns whether the reference count
was incremented (and is marked __must_check, since the caller obviously
has to be aware of it).
Also like 'get_page()', you can't use this function unless you already
had a reference to the page. The intent is that you can use this
exactly like get_page(), but in situations where you want to limit the
maximum reference count.
The code currently does an unconditional WARN_ON_ONCE() if we ever hit
the reference count issues (either zero or negative), as a notification
that the conditional non-increment actually happened.
NOTE! The count access for the "safety" check is inherently racy, but
that doesn't matter since the buffer we use is basically half the range
of the reference count (ie we look at the sign of the count).
Acked-by: Matthew Wilcox <willy@infradead.org>
Cc: Jann Horn <jannh@google.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Thu, 11 Apr 2019 17:06:20 +0000 (10:06 -0700)]
mm: make page ref count overflow check tighter and more explicit
We have a VM_BUG_ON() to check that the page reference count doesn't
underflow (or get close to overflow) by checking the sign of the count.
That's all fine, but we actually want to allow people to use a "get page
ref unless it's already very high" helper function, and we want that one
to use the sign of the page ref (without triggering this VM_BUG_ON).
Change the VM_BUG_ON to only check for small underflows (or _very_ close
to overflowing), and ignore overflows which have strayed into negative
territory.
Acked-by: Matthew Wilcox <willy@infradead.org>
Cc: Jann Horn <jannh@google.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 13 Apr 2019 23:23:16 +0000 (16:23 -0700)]
Merge tag 'for-linus-
20190412' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Set of fixes that should go into this round. This pull is larger than
I'd like at this time, but there's really no specific reason for that.
Some are fixes for issues that went into this merge window, others are
not. Anyway, this contains:
- Hardware queue limiting for virtio-blk/scsi (Dongli)
- Multi-page bvec fixes for lightnvm pblk
- Multi-bio dio error fix (Jason)
- Remove the cache hint from the io_uring tool side, since we didn't
move forward with that (me)
- Make io_uring SETUP_SQPOLL root restricted (me)
- Fix leak of page in error handling for pc requests (Jérôme)
- Fix BFQ regression introduced in this merge window (Paolo)
- Fix break logic for bio segment iteration (Ming)
- Fix NVMe cancel request error handling (Ming)
- NVMe pull request with two fixes (Christoph):
- fix the initial CSN for nvme-fc (James)
- handle log page offsets properly in the target (Keith)"
* tag 'for-linus-
20190412' of git://git.kernel.dk/linux-block:
block: fix the return errno for direct IO
nvmet: fix discover log page when offsets are used
nvme-fc: correct csn initialization and increments on error
block: do not leak memory in bio_copy_user_iov()
lightnvm: pblk: fix crash in pblk_end_partial_read due to multipage bvecs
nvme: cancel request synchronously
blk-mq: introduce blk_mq_complete_request_sync()
scsi: virtio_scsi: limit number of hw queues by nr_cpu_ids
virtio-blk: limit number of hw queues by nr_cpu_ids
block, bfq: fix use after free in bfq_bfqq_expire
io_uring: restrict IORING_SETUP_SQPOLL to root
tools/io_uring: remove IOCQE_FLAG_CACHEHIT
block: don't use for-inside-for in bio_for_each_segment_all
Linus Torvalds [Sat, 13 Apr 2019 21:47:06 +0000 (14:47 -0700)]
Merge tag 'nfs-for-5.1-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
Stable fix:
- Fix a deadlock in close() due to incorrect draining of RDMA queues
Bugfixes:
- Revert "SUNRPC: Micro-optimise when the task is known not to be
sleeping" as it is causing stack overflows
- Fix a regression where NFSv4 getacl and fs_locations stopped
working
- Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family.
- Fix xfstests failures due to incorrect copy_file_range() return
values"
* tag 'nfs-for-5.1-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
Revert "SUNRPC: Micro-optimise when the task is known not to be sleeping"
NFSv4.1 fix incorrect return value in copy_file_range
xprtrdma: Fix helper that drains the transport
NFS: Fix handling of reply page vector
NFS: Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family.
Linus Torvalds [Sat, 13 Apr 2019 21:37:49 +0000 (14:37 -0700)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI fix from James Bottomley:
"One obvious fix for a ciostor data corruption on error bug"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: csiostor: fix missing data copy in csio_scsi_err_handler()
Linus Torvalds [Sat, 13 Apr 2019 21:33:56 +0000 (14:33 -0700)]
Merge tag 'clk-fixes-for-linus' of git://git./linux/kernel/git/clk/linux
Pull clk fixes from Stephen Boyd:
"Here's more than a handful of clk driver fixes for changes that came
in during the merge window:
- Fix the AT91 sama5d2 programmable clk prescaler formula
- A bunch of Amlogic meson clk driver fixes for the VPU clks
- A DMI quirk for Intel's Bay Trail SoC's driver to properly mark pmc
clks as critical only when really needed
- Stop overwriting CLK_SET_RATE_PARENT flag in mediatek's clk gate
implementation
- Use the right structure to test for a frequency table in i.MX's
PLL_1416x driver"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: imx: Fix PLL_1416X not rounding rates
clk: mediatek: fix clk-gate flag setting
platform/x86: pmc_atom: Drop __initconst on dmi table
clk: x86: Add system specific quirk to mark clocks as critical
clk: meson: vid-pll-div: remove warning and return 0 on invalid config
clk: meson: pll: fix rounding and setting a rate that matches precisely
clk: meson-g12a: fix VPU clock parents
clk: meson: g12a: fix VPU clock muxes mask
clk: meson-gxbb: round the vdec dividers to closest
clk: at91: fix programmable clock for sama5d2
Linus Torvalds [Sat, 13 Apr 2019 21:29:21 +0000 (14:29 -0700)]
Merge tag 'pci-v5.1-fixes-2' of git://git./linux/kernel/git/helgaas/pci
Pull PCI fixes from Bjorn Helgaas:
- Add a DMA alias quirk for another Marvell SATA device (Andre
Przywara)
- Fix a pciehp regression that broke safe removal of devices (Sergey
Miroshnichenko)
* tag 'pci-v5.1-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI: pciehp: Ignore Link State Changes after powering off a slot
PCI: Add function 1 DMA alias quirk for Marvell 9170 SATA controller
Linus Torvalds [Sat, 13 Apr 2019 16:03:09 +0000 (09:03 -0700)]
Merge tag 'powerpc-5.1-5' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"A minor build fix for 64-bit FLATMEM configs.
A fix for a boot failure on 32-bit powermacs.
My commit to fix CLOCK_MONOTONIC across Y2038 broke the 32-bit VDSO on
64-bit kernels, ie. compat mode, which is only used on big endian.
The rewrite of the SLB code we merged in 4.20 missed the fact that the
0x380 exception is also used with the Radix MMU to report out of range
accesses. This could lead to an oops if userspace tried to read from
addresses outside the user or kernel range.
Thanks to: Aneesh Kumar K.V, Christophe Leroy, Larry Finger, Nicholas
Piggin"
* tag 'powerpc-5.1-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/mm: Define MAX_PHYSMEM_BITS for all 64-bit configs
powerpc/64s/radix: Fix radix segment exception handling
powerpc/vdso32: fix CLOCK_MONOTONIC on PPC64
powerpc/32: Fix early boot failure with RTAS built-in
Linus Torvalds [Sat, 13 Apr 2019 15:57:00 +0000 (08:57 -0700)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"The main thing is a fix to our FUTEX_WAKE_OP implementation which was
unbelievably broken, but did actually work for the one scenario that
GLIBC used to use.
Summary:
- Fix stack unwinding so we ignore user stacks
- Fix ftrace module PLT trampoline initialisation checks
- Fix terminally broken implementation of FUTEX_WAKE_OP atomics"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: futex: Fix FUTEX_WAKE_OP atomic ops with non-zero result value
arm64: backtrace: Don't bother trying to unwind the userspace stack
arm64/ftrace: fix inadvertent BUG() in trampoline check
Linus Torvalds [Sat, 13 Apr 2019 03:54:40 +0000 (20:54 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"Fix typos in user-visible resctrl parameters, and also fix assembly
constraint bugs that might result in miscompilation"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/asm: Use stricter assembly constraints in bitops
x86/resctrl: Fix typos in the mba_sc mount option
Linus Torvalds [Sat, 13 Apr 2019 03:52:28 +0000 (20:52 -0700)]
Merge branch 'timers-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
"Fix the alarm_timer_remaining() return value"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
alarmtimer: Return correct remaining time
Linus Torvalds [Sat, 13 Apr 2019 03:50:43 +0000 (20:50 -0700)]
Merge branch 'sched-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
"Fix a NULL pointer dereference crash in certain environments"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Do not re-read ->h_load_next during hierarchical load calculation
Linus Torvalds [Sat, 13 Apr 2019 03:42:30 +0000 (20:42 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"Six kernel side fixes: three related to NMI handling on AMD systems, a
race fix, a kexec initialization fix and a PEBS sampling fix"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Fix perf_event_disable_inatomic() race
x86/perf/amd: Remove need to check "running" bit in NMI handler
x86/perf/amd: Resolve NMI latency issues for active PMCs
x86/perf/amd: Resolve race condition when disabling PMC
perf/x86/intel: Initialize TFA MSR
perf/x86/intel: Fix handling of wakeup_events for multi-entry PEBS
Linus Torvalds [Sat, 13 Apr 2019 03:31:08 +0000 (20:31 -0700)]
Merge branch 'locking-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull locking fix from Ingo Molnar:
"Fixes a crash when accessing /proc/lockdep"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/lockdep: Zap lock classes even with lock debugging disabled
Linus Torvalds [Sat, 13 Apr 2019 03:21:59 +0000 (20:21 -0700)]
Merge branch 'irq-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull irq fixes from Ingo Molnar:
"Two genirq fixes, plus an irqchip driver error handling fix"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Respect IRQCHIP_SKIP_SET_WAKE in irq_chip_set_wake_parent()
genirq: Initialize request_mutex if CONFIG_SPARSE_IRQ=n
irqchip/irq-ls1x: Missing error code in ls1x_intc_of_init()
Linus Torvalds [Sat, 13 Apr 2019 03:13:13 +0000 (20:13 -0700)]
Merge branch 'core-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull core fixes from Ingo Molnar:
"Fix an objtool warning plus fix a u64_to_user_ptr() macro expansion
bug"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Add rewind_stack_do_exit() to the noreturn list
linux/kernel.h: Use parentheses around argument in u64_to_user_ptr()
Leonard Crestez [Fri, 12 Apr 2019 14:10:03 +0000 (14:10 +0000)]
clk: imx: Fix PLL_1416X not rounding rates
Code which initializes the "clk_init_data.ops" checks pll->rate_table
before that field is ever assigned to so it always picks
"clk_pll1416x_min_ops".
This breaks dynamic rate rounding for features such as cpufreq.
Fix by checking pll_clk->rate_table instead, here pll_clk refers to
the constant initialization data coming from per-soc clk driver.
Signed-off-by: Leonard Crestez <leonard.crestez@nxp.com>
Fixes:
8646d4dcc7fb ("clk: imx: Add PLLs driver for imx8mm soc")
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
Weiyi Lu [Fri, 12 Apr 2019 03:30:27 +0000 (11:30 +0800)]
clk: mediatek: fix clk-gate flag setting
CLK_SET_RATE_PARENT would be dropped.
Merge two flag setting together to correct the error.
Fixes:
5a1cc4c27ad2 ("clk: mediatek: Add flags to mtk_gate")
Cc: <stable@vger.kernel.org>
Signed-off-by: Weiyi Lu <weiyi.lu@mediatek.com>
Reviewed-by: Matthias Brugger <matthias.bgg@gmail.com>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
Linus Torvalds [Fri, 12 Apr 2019 15:25:16 +0000 (08:25 -0700)]
Merge tag 'dma-mapping-5.1-1' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:
"Fix a sparc64 sun4v_pci regression introduced in this merged window,
and a dma-debug stracktrace regression from the big refactor last
merge window"
* tag 'dma-mapping-5.1-1' of git://git.infradead.org/users/hch/dma-mapping:
dma-debug: only skip one stackframe entry
sparc64/pci_sun4v: fix ATU checks for large DMA masks
Linus Torvalds [Fri, 12 Apr 2019 15:21:15 +0000 (08:21 -0700)]
Merge tag 'iommu-fix-v5.1-rc5' of git://git./linux/kernel/git/joro/iommu
Pull IOMMU fix from Joerg Roedel:
"Fix an AMD IOMMU issue where the driver didn't correctly setup the
exclusion range in the hardware registers, resulting in exclusion
ranges being one page too big.
This can cause data corruption of the address of that last page is
used by DMA operations"
* tag 'iommu-fix-v5.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/amd: Set exclusion range correctly
Linus Torvalds [Fri, 12 Apr 2019 15:18:37 +0000 (08:18 -0700)]
Merge tag 'clang-format-for-linus-v5.1-rc5' of git://github.com/ojeda/linux
Pull clang-format update from Miguel Ojeda:
"The usual roughly-per-release .clang-format macro list update"
* tag 'clang-format-for-linus-v5.1-rc5' of git://github.com/ojeda/linux:
clang-format: Update with the latest for_each macro list
Linus Torvalds [Fri, 12 Apr 2019 15:16:40 +0000 (08:16 -0700)]
Merge tag 'mmc-v5.1-rc2' of git://git./linux/kernel/git/ulfh/mmc
Pull MMC host fixes from Ulf Hansson:
- alcor: Stabilize data write requests
- sdhci-omap: Fix command error path during tuning
* tag 'mmc-v5.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: sdhci-omap: Don't finish_mrq() on a command error during tuning
mmc: alcor: don't write data before command has completed
Linus Torvalds [Fri, 12 Apr 2019 15:11:59 +0000 (08:11 -0700)]
Merge tag 'sound-5.1-rc5' of git://git./linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Well, this one became unpleasantly larger than previous pull requests,
but it's a kind of usual pattern: now it contains a collection of ASoC
fixes, and nothing to worry too much.
The fixes for ASoC core (DAPM, DPCM, topology) are all small and just
covering corner cases. The rest changes are driver-specific, many of
which are for x86 platforms and new drivers like STM32, in addition to
the usual fixups for HD-audio"
* tag 'sound-5.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (66 commits)
ASoC: wcd9335: Fix missing regmap requirement
ALSA: hda: Fix racy display power access
ASoC: pcm: fix error handling when try_module_get() fails.
ASoC: stm32: sai: fix master clock management
ASoC: Intel: kbl: fix wrong number of channels
ALSA: hda - Add two more machines to the power_save_blacklist
ASoC: pcm: update module refcount if module_get_upon_open is set
ASoC: core: conditionally increase module refcount on component open
ASoC: stm32: fix sai driver name initialisation
ASoC: topology: Use the correct dobj to free enum control values and texts
ALSA: seq: Fix OOB-reads from strlcpy
ASoC: intel: skylake: add remove() callback for component driver
ASoC: cs35l35: Disable regulators on driver removal
ALSA: xen-front: Do not use stream buffer size before it is set
ASoC: rockchip: pdm: change dma burst to 8
ASoC: rockchip: pdm: fix regmap_ops hang issue
ASoC: simple-card: don't select DPCM via simple-audio-card
ASoC: audio-graph-card: don't select DPCM via audio-graph-card
ASoC: tlv320aic32x4: Change author's name
ALSA: hda/realtek - Add quirk for Tuxedo XC 1509
...
Linus Torvalds [Fri, 12 Apr 2019 15:07:46 +0000 (08:07 -0700)]
Merge tag 'acpi-5.1-rc5' of git://git./linux/kernel/git/rafael/linux-pm
Pull ACPI fix from Rafael Wysocki:
"Fix an ACPICA issue introduced during the 4.20 development cycle and
causing some systems to crash because of leftover operation region
data still maintained after the operation region in question has gone
away (Erik Schmauss)"
* tag 'acpi-5.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPICA: Namespace: remove address node from global list after method termination
Linus Torvalds [Fri, 12 Apr 2019 15:04:01 +0000 (08:04 -0700)]
Merge tag 'drm-fixes-2019-04-12' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"Fixes across the driver spectrum this week, the mediatek fbdev support
might be a bit late for this round, but I looked over it and it's not
very large and seems like a useful feature for them.
Otherwise the main thing is a regression fix for i915 5.0 bug that
caused black screens on a bunch of Dell XPS 15s I think, I know at
least Fedora is waiting for this to land, and the udl fix is also for
a regression since 5.0 where unplugging the device would end badly.
core:
- make atomic hooks optional
i915:
- Revert a 5.0 regression where some eDP panels stopped working
- DSI related fixes for platforms up to IceLake
- GVT (regression fix, warning fix, use-after free fix)
amdgpu:
- Cursor fixes
- missing PCI ID fix for KFD
- XGMI fix
- shadow buffer handling after reset fix
udl:
- fix unplugging device crashes.
mediatek:
- stabilise MT2701 HDMI support
- fbdev support
tegra:
- fix for build regression in rc1.
sun4i:
- Allwinner A6 max freq improvements
- null ptr deref fix
dw-hdmi:
- SCDC configuration improvements
omap:
- CEC clock management policy fix"
* tag 'drm-fixes-2019-04-12' of git://anongit.freedesktop.org/drm/drm: (32 commits)
gpu: host1x: Fix compile error when IOMMU API is not available
drm/i915/gvt: Roundup fb->height into tile's height at calucation fb->size
drm/i915/dp: revert back to max link rate and lane count on eDP
drm/i915/icl: Fix port disable sequence for mipi-dsi
drm/i915/icl: Ungate ddi clocks before IO enable
drm/mediatek: no change parent rate in round_rate() for MT2701 hdmi phy
drm/mediatek: using new factor for tvdpll for MT2701 hdmi phy
drm/mediatek: remove flag CLK_SET_RATE_PARENT for MT2701 hdmi phy
drm/mediatek: make implementation of recalc_rate() for MT2701 hdmi phy
drm/mediatek: fix the rate and divder of hdmi phy for MT2701
drm/mediatek: fix possible object reference leak
drm/i915: Get power refs in encoder->get_power_domains()
drm/i915: Fix pipe_bpp readout for BXT/GLK DSI
drm/amd/display: Fix negative cursor pos programming (v2)
drm/sun4i: tcon top: Fix NULL/invalid pointer dereference in sun8i_tcon_top_un/bind
drm/udl: add a release method and delay modeset teardown
drm/i915/gvt: Prevent use-after-free in ppgtt_free_all_spt()
drm/i915/gvt: Annotate iomem usage
drm/sun4i: DW HDMI: Lower max. supported rate for H6
Revert "Documentation/gpu/meson: Remove link to meson_canvas.c"
...
Will Deacon [Mon, 8 Apr 2019 11:45:09 +0000 (12:45 +0100)]
arm64: futex: Fix FUTEX_WAKE_OP atomic ops with non-zero result value
Rather embarrassingly, our futex() FUTEX_WAKE_OP implementation doesn't
explicitly set the return value on the non-faulting path and instead
leaves it holding the result of the underlying atomic operation. This
means that any FUTEX_WAKE_OP atomic operation which computes a non-zero
value will be reported as having failed. Regrettably, I wrote the buggy
code back in 2011 and it was upstreamed as part of the initial arm64
support in 2012.
The reasons we appear to get away with this are:
1. FUTEX_WAKE_OP is rarely used and therefore doesn't appear to get
exercised by futex() test applications
2. If the result of the atomic operation is zero, the system call
behaves correctly
3. Prior to version 2.25, the only operation used by GLIBC set the
futex to zero, and therefore worked as expected. From 2.25 onwards,
FUTEX_WAKE_OP is not used by GLIBC at all.
Fix the implementation by ensuring that the return value is either 0
to indicate that the atomic operation completed successfully, or -EFAULT
if we encountered a fault when accessing the user mapping.
Cc: <stable@kernel.org>
Fixes:
6170a97460db ("arm64: Atomic operations")
Signed-off-by: Will Deacon <will.deacon@arm.com>
Joerg Roedel [Fri, 12 Apr 2019 10:50:31 +0000 (12:50 +0200)]
iommu/amd: Set exclusion range correctly
The exlcusion range limit register needs to contain the
base-address of the last page that is part of the range, as
bits 0-11 of this register are treated as 0xfff by the
hardware for comparisons.
So correctly set the exclusion range in the hardware to the
last page which is _in_ the range.
Fixes:
b2026aa2dce44 ('x86, AMD IOMMU: add functions for programming IOMMU MMIO space')
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Miguel Ojeda [Sat, 30 Mar 2019 08:20:16 +0000 (09:20 +0100)]
clang-format: Update with the latest for_each macro list
Re-run the shell fragment that generated the original list now that
there are two dozens of new entries after v5.1's merge window.
Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Peter Zijlstra [Thu, 4 Apr 2019 13:03:00 +0000 (15:03 +0200)]
perf/core: Fix perf_event_disable_inatomic() race
Thomas-Mich Richter reported he triggered a WARN()ing from event_function_local()
on his s390. The problem boils down to:
CPU-A CPU-B
perf_event_overflow()
perf_event_disable_inatomic()
@pending_disable = 1
irq_work_queue();
sched-out
event_sched_out()
@pending_disable = 0
sched-in
perf_event_overflow()
perf_event_disable_inatomic()
@pending_disable = 1;
irq_work_queue(); // FAILS
irq_work_run()
perf_pending_event()
if (@pending_disable)
perf_event_disable_local(); // WHOOPS
The problem exists in generic, but s390 is particularly sensitive
because it doesn't implement arch_irq_work_raise(), nor does it call
irq_work_run() from it's PMU interrupt handler (nor would that be
sufficient in this case, because s390 also generates
perf_event_overflow() from pmu::stop). Add to that the fact that s390
is a virtual architecture and (virtual) CPU-A can stall long enough
for the above race to happen, even if it would self-IPI.
Adding a irq_work_sync() to event_sched_in() would work for all hardare
PMUs that properly use irq_work_run() but fails for software PMUs.
Instead encode the CPU number in @pending_disable, such that we can
tell which CPU requested the disable. This then allows us to detect
the above scenario and even redirect the IPI to make up for the failed
queue.
Reported-by: Thomas-Mich Richter <tmricht@linux.ibm.com>
Tested-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Hendrik Brueckner <brueckner@linux.ibm.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Dave Airlie [Fri, 12 Apr 2019 03:39:22 +0000 (13:39 +1000)]
Merge tag 'drm-intel-fixes-2019-04-11' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
- Revert back to max link rate and lane count on eDP.
- DSI related fixes for all platforms including Ice Lake.
- GVT Fixes including one vGPU display plane size regression fix,
one for preventing use-after-free in ppgtt shadow free function,
and another warning fix for iomem access annotation.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190411235832.GA6476@intel.com
Jason Yan [Fri, 12 Apr 2019 02:09:16 +0000 (10:09 +0800)]
block: fix the return errno for direct IO
If the last bio returned is not dio->bio, the status of the bio will
not assigned to dio->bio if it is error. This will cause the whole IO
status wrong.
ksoftirqd/21-117 [021] ..s. 4017.966090: 8,0 C N
4883648 [0]
<idle>-0 [018] ..s. 4017.970888: 8,0 C WS
4924800 + 1024 [0]
<idle>-0 [018] ..s. 4017.970909: 8,0 D WS
4935424 + 1024 [<idle>]
<idle>-0 [018] ..s. 4017.970924: 8,0 D WS
4936448 + 321 [<idle>]
ksoftirqd/21-117 [021] ..s. 4017.995033: 8,0 C R
4883648 + 336 [65475]
ksoftirqd/21-117 [021] d.s. 4018.001988: myprobe1: (blkdev_bio_end_io+0x0/0x168) bi_status=7
ksoftirqd/21-117 [021] d.s. 4018.001992: myprobe: (aio_complete_rw+0x0/0x148) x0=0xffff802f2595ad80 res=0x12a000 res2=0x0
We always have to assign bio->bi_status to dio->bio.bi_status because we
will only check dio->bio.bi_status when we return the whole IO to
the upper layer.
Fixes:
542ff7bf18c6 ("block: new direct I/O implementation")
Cc: stable@vger.kernel.org
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Thu, 11 Apr 2019 21:19:02 +0000 (14:19 -0700)]
Merge tag 'for-5.1-rc4-tag' of git://git./linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix parsing of compression algorithm when set as a inode property,
this could end up with eg. 'zst' or 'zli' in the value
- don't allow trim on a filesystem with unreplayed log, this could
cause data loss if there are pending updates to the block groups that
would not be subject to trim after replay
* tag 'for-5.1-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: prop: fix vanished compression property after failed set
btrfs: prop: fix zstd compression parameter validation
Btrfs: do not allow trimming when a fs is mounted with the nologreplay option
Dave Airlie [Thu, 11 Apr 2019 20:55:20 +0000 (06:55 +1000)]
Merge tag 'drm-misc-fixes-2019-04-11' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes
- core: Make atomic_enable and disable optional for CRTC
- dw-hdmi: Lower max frequency for the Allwinner H6, SCDC configuration
improvements for older controller versions
- omap: a fix for the CEC clock management policy
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maxime Ripard <maxime.ripard@bootlin.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190411151658.orm46ccd5zmrw27l@flea
Trond Myklebust [Thu, 11 Apr 2019 19:16:52 +0000 (15:16 -0400)]
Revert "SUNRPC: Micro-optimise when the task is known not to be sleeping"
This reverts commit
009a82f6437490c262584d65a14094a818bcb747.
The ability to optimise here relies on compiler being able to optimise
away tail calls to avoid stack overflows. Unfortunately, we are seeing
reports of problems, so let's just revert.
Reported-by: Daniel Mack <daniel@zonque.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Olga Kornievskaia [Thu, 11 Apr 2019 18:34:18 +0000 (14:34 -0400)]
NFSv4.1 fix incorrect return value in copy_file_range
According to the NFSv4.2 spec if the input and output file is the
same file, operation should fail with EINVAL. However, linux
copy_file_range() system call has no such restrictions. Therefore,
in such case let's return EOPNOTSUPP and allow VFS to fallback
to doing do_splice_direct(). Also when copy_file_range is called
on an NFSv4.0 or 4.1 mount (ie., a server that doesn't support
COPY functionality), we also need to return EOPNOTSUPP and
fallback to a regular copy.
Fixes xfstest generic/075, generic/091, generic/112, generic/263
for all NFSv4.x versions.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Chuck Lever [Tue, 9 Apr 2019 21:04:09 +0000 (17:04 -0400)]
xprtrdma: Fix helper that drains the transport
We want to drain only the RQ first. Otherwise the transport can
deadlock on ->close if there are outstanding Send completions.
Fixes:
6d2d0ee27c7a ("xprtrdma: Replace rpcrdma_receive_wq ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org # v5.0+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Chuck Lever [Tue, 9 Apr 2019 14:44:16 +0000 (10:44 -0400)]
NFS: Fix handling of reply page vector
NFSv4 GETACL and FS_LOCATIONS requests stopped working in v5.1-rc.
These two need the extra padding to be added directly to the reply
length.
Reported-by: Olga Kornievskaia <aglo@umich.edu>
Fixes:
02ef04e432ba ("NFS: Account for XDR pad of buf->pages")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Olga Kornievskaia <aglo@umich.edu>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Tetsuo Handa [Sat, 30 Mar 2019 01:21:07 +0000 (10:21 +0900)]
NFS: Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family.
syzbot is reporting uninitialized value at rpc_sockaddr2uaddr() [1]. This
is because syzbot is setting AF_INET6 to "struct sockaddr_in"->sin_family
(which is embedded into user-visible "struct nfs_mount_data" structure)
despite nfs23_validate_mount_data() cannot pass sizeof(struct sockaddr_in6)
bytes of AF_INET6 address to rpc_sockaddr2uaddr().
Since "struct nfs_mount_data" structure is user-visible, we can't change
"struct nfs_mount_data" to use "struct sockaddr_storage". Therefore,
assuming that everybody is using AF_INET family when passing address via
"struct nfs_mount_data"->addr, reject if its sin_family is not AF_INET.
[1] https://syzkaller.appspot.com/bug?id=
599993614e7cbbf66bc2656a919ab2a95fb5d75c
Reported-by: syzbot <syzbot+047a11c361b872896a4f@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Scott Wood [Wed, 10 Apr 2019 21:59:25 +0000 (16:59 -0500)]
dma-debug: only skip one stackframe entry
With skip set to 1, I get a traceback like this:
[ 106.867637] DMA-API: Mapped at:
[ 106.870784] afu_dma_map_region+0x2cd/0x4f0 [dfl_afu]
[ 106.875839] afu_ioctl+0x258/0x380 [dfl_afu]
[ 106.880108] do_vfs_ioctl+0xa9/0x720
[ 106.883688] ksys_ioctl+0x60/0x90
[ 106.887007] __x64_sys_ioctl+0x16/0x20
With the previous value of 2, afu_dma_map_region was being omitted. I
suspect that the code paths have simply changed since the value of 2 was
chosen a decade ago, but it's also possible that it varies based on which
mapping function was used, compiler inlining choices, etc. In any case,
it's best to err on the side of skipping less.
Signed-off-by: Scott Wood <swood@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Stephen Boyd [Thu, 11 Apr 2019 17:22:43 +0000 (10:22 -0700)]
platform/x86: pmc_atom: Drop __initconst on dmi table
It's used by probe and that isn't an init function. Drop this so that we
don't get a section mismatch.
Reported-by: kbuild test robot <lkp@intel.com>
Cc: David Müller <dave.mueller@gmx.ch>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Fixes:
7c2e07130090 ("clk: x86: Add system specific quirk to mark clocks as critical")
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
Rodrigo Vivi [Thu, 11 Apr 2019 16:18:13 +0000 (09:18 -0700)]
Merge tag 'gvt-fixes-2019-04-11' of https://github.com/intel/gvt-linux into drm-intel-fixes
gvt-fixes-2019-04-11
- Fix sparse warning on iomem usage (Chris)
- Prevent use-after-free for ppgtt shadow table free (Chris)
- Fix display plane size regression for tiled surface (Xiong)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
From: Zhenyu Wang <zhenyuw@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190411064910.GF17995@zhen-hp.sh.intel.com
Jens Axboe [Thu, 11 Apr 2019 15:36:41 +0000 (09:36 -0600)]
Merge branch 'nvme-5.1' of git://git.infradead.org/nvme into for-linus
Pull NVMe fixes from Christoph:
"Two nvme fixes for 5.1 - fixing the initial CSN for nvme-fc, and handle
log page offsets properly in the target."
* 'nvme-5.1' of git://git.infradead.org/nvme:
nvmet: fix discover log page when offsets are used
nvme-fc: correct csn initialization and increments on error
Keith Busch [Tue, 9 Apr 2019 16:03:59 +0000 (10:03 -0600)]
nvmet: fix discover log page when offsets are used
The nvme target hadn't been taking the Get Log Page offset parameter
into consideration, and so has been returning corrupted log pages when
offsets are used. Since many tools, including nvme-cli, split the log
request to 4k, we've been breaking discovery log responses when more
than 3 subsystems exist.
Fix the returned data by internally generating the entire discovery
log page and copying only the requested bytes into the user buffer. The
command log page offset type has been modified to a native __le64 to
make it easier to extract the value from a command.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Tested-by: Minwoo Im <minwoo.im@samsung.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
James Smart [Mon, 8 Apr 2019 18:15:19 +0000 (11:15 -0700)]
nvme-fc: correct csn initialization and increments on error
This patch fixes a long-standing bug that initialized the FC-NVME
cmnd iu CSN value to 1. Early FC-NVME specs had the connection starting
with CSN=1. By the time the spec reached approval, the language had
changed to state a connection should start with CSN=0. This patch
corrects the initialization value for FC-NVME connections.
Additionally, in reviewing the transport, the CSN value is assigned to
the new IU early in the start routine. It's possible that a later dma
map request may fail, causing the command to never be sent to the
controller. Change the location of the assignment so that it is
immediately prior to calling the lldd. Add a comment block to explain
the impacts if the lldd were to additionally fail sending the command.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Takashi Iwai [Thu, 11 Apr 2019 12:36:30 +0000 (14:36 +0200)]
Merge tag 'asoc-fix-v5.1-rc4' of git://git./linux/kernel/git/broonie/sound into for-linus
ASoC: Fixes for v5.1
A few core fixes along with the driver specific ones, mainly fixing
small issues that only affect x86 platforms for various reasons (their
unusual machine enumeration mechanisms mainly, plus a fix for error
handling in topology).
There's some of the driver fixes that look larger than they are, like
the hdmi-codec changes which resulted in an indentation change, and most
of the other large changes are for new drivers like the STM32 changes.
Faiz Abbas [Thu, 11 Apr 2019 08:59:37 +0000 (14:29 +0530)]
mmc: sdhci-omap: Don't finish_mrq() on a command error during tuning
commit
5b0d62108b46 ("mmc: sdhci-omap: Add platform specific reset
callback") skips data resets during tuning operation. Because of this,
a data error or data finish interrupt might still arrive after a command
error has been handled and the mrq ended. This ends up with a "mmc0: Got
data interrupt 0x00000002 even though no data operation was in progress"
error message.
Fix this by adding a platform specific callback for sdhci_irq. Mark the
mrq as a failure but wait for a data interrupt instead of calling
finish_mrq().
Fixes:
5b0d62108b46 ("mmc: sdhci-omap: Add platform specific reset
callback")
Signed-off-by: Faiz Abbas <faiz_abbas@ti.com>
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Dave Airlie [Thu, 11 Apr 2019 09:20:31 +0000 (19:20 +1000)]
Merge branch 'drm-fixes-5.1' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
A few fixes for 5.1:
- Cursor fixes
- Add missing picasso pci id to KFD
- XGMI fix
- Shadow buffer handling fix for GPU reset
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexdeucher@gmail.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190410183031.3710-1-alexander.deucher@amd.com
Dave Airlie [Thu, 11 Apr 2019 09:19:12 +0000 (19:19 +1000)]
Merge branch 'mediatek-drm-fixes-5.1' of https://github.com/ckhu-mediatek/linux.git-tags into drm-fixes
This include stable MT2701 HDMI, framebuffer device and some fixes for
mediatek drm driver.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: CK Hu <ck.hu@mediatek.com>
Link: https://patchwork.freedesktop.org/patch/msgid/1554860914.29842.4.camel@mtksdaap41
Dave Airlie [Thu, 11 Apr 2019 09:17:31 +0000 (19:17 +1000)]
Merge tag 'drm/tegra/for-5.1-rc5' of git://anongit.freedesktop.org/tegra/linux into drm-fixes
drm/tegra: Fixes for v5.1-rc5
A single, one-line fix for a build error introduced in v5.1-rc1.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Thierry Reding <thierry.reding@gmail.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190411084106.7552-1-thierry.reding@gmail.com
Stefan Agner [Wed, 10 Apr 2019 22:47:46 +0000 (00:47 +0200)]
gpu: host1x: Fix compile error when IOMMU API is not available
In case the IOMMU API is not available compiling host1x fails with
the following error:
In file included from drivers/gpu/host1x/hw/host1x06.c:27:
drivers/gpu/host1x/hw/channel_hw.c: In function ‘host1x_channel_set_streamid’:
drivers/gpu/host1x/hw/channel_hw.c:118:30: error: implicit declaration of function
‘dev_iommu_fwspec_get’; did you mean ‘iommu_fwspec_free’? [-Werror=implicit-function-declaration]
struct iommu_fwspec *spec = dev_iommu_fwspec_get(channel->dev->parent);
^~~~~~~~~~~~~~~~~~~~
iommu_fwspec_free
Fixes:
de5469c21ff9 ("gpu: host1x: Program the channel stream ID")
Signed-off-by: Stefan Agner <stefan@agner.ch>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Xiong Zhang [Wed, 10 Apr 2019 04:16:33 +0000 (12:16 +0800)]
drm/i915/gvt: Roundup fb->height into tile's height at calucation fb->size
When fb is tiled and fb->height isn't the multiple of tile's height,
the format fb->size = fb->stride * fb->height, will get a smaller size
than the actual size. As the memory height of tiled fb should be multiple
of tile's height.
Fixes:
7f1a93b1f1d1 ("drm/i915/gvt: Correct the calculation of plane size")
Reviewed-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
David Müller [Mon, 8 Apr 2019 13:33:54 +0000 (15:33 +0200)]
clk: x86: Add system specific quirk to mark clocks as critical
Since commit
648e921888ad ("clk: x86: Stop marking clocks as
CLK_IS_CRITICAL"), the pmc_plt_clocks of the Bay Trail SoC are
unconditionally gated off. Unfortunately this will break systems where these
clocks are used for external purposes beyond the kernel's knowledge. Fix it
by implementing a system specific quirk to mark the necessary pmc_plt_clks as
critical.
Fixes:
648e921888ad ("clk: x86: Stop marking clocks as CLK_IS_CRITICAL")
Signed-off-by: David Müller <dave.mueller@gmx.ch>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
Jérôme Glisse [Wed, 10 Apr 2019 20:27:51 +0000 (16:27 -0400)]
block: do not leak memory in bio_copy_user_iov()
When bio_add_pc_page() fails in bio_copy_user_iov() we should free
the page we just allocated otherwise we are leaking it.
Cc: linux-block@vger.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Sergey Miroshnichenko [Tue, 12 Mar 2019 12:05:48 +0000 (15:05 +0300)]
PCI: pciehp: Ignore Link State Changes after powering off a slot
During a safe hot remove, the OS powers off the slot, which may cause a
Data Link Layer State Changed event. The slot has already been set to
OFF_STATE, so that event results in re-enabling the device, making it
impossible to safely remove it.
Clear out the Presence Detect Changed and Data Link Layer State Changed
events when the disabled slot has settled down.
It is still possible to re-enable the device if it remains in the slot
after pressing the Attention Button by pressing it again.
Fixes the problem that Micah reported below: an NVMe drive power button may
not actually turn off the drive.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=203237
Reported-by: Micah Parrish <micah.parrish@hpe.com>
Tested-by: Micah Parrish <micah.parrish@hpe.com>
Signed-off-by: Sergey Miroshnichenko <s.miroshnichenko@yadro.com>
[bhelgaas: changelog, add bugzilla URL]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Lukas Wunner <lukas@wunner.de>
Cc: stable@vger.kernel.org # v4.19+
Christoph Hellwig [Wed, 3 Apr 2019 19:34:34 +0000 (21:34 +0200)]
sparc64/pci_sun4v: fix ATU checks for large DMA masks
Now that we allow drivers to always need to set larger than required
DMA masks we need to be a little more careful in the sun4v PCI iommu
driver to chose when to select the ATU support - a larger DMA mask
can be set even when the platform does not support ATU, so we always
have to check if it is avaiable before using it. Add a little helper
for that and use it in all the places where we make ATU usage decisions
based on the DMA mask.
Fixes:
24132a419c68 ("sparc64/pci_sun4v: allow large DMA masks")
Reported-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Meelis Roos <mroos@linux.ee>
Acked-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Wed, 10 Apr 2019 19:39:04 +0000 (09:39 -1000)]
Merge tag 'for-linus' of git://git./linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"Several driver bug fixes posted in the last several weeks
- Several bug fixes for the hfi1 driver 'TID RDMA' functionality
merged into 5.1. Since TID RDMA is on by default these all seem to
be regressions.
- Wrong software permission checks on memory in mlx5
- Memory leak in vmw_pvrdma during driver remove
- Several bug fixes for hns driver features merged into 5.1"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
IB/hfi1: Do not flush send queue in the TID RDMA second leg
RDMA/hns: Bugfix for SCC hem free
RDMA/hns: Fix bug that caused srq creation to fail
RDMA/vmw_pvrdma: Fix memory leak on pvrdma_pci_remove
IB/mlx5: Reset access mask when looping inside page fault handler
IB/hfi1: Fix the allocation of RSM table
IB/hfi1: Eliminate opcode tests on mr deref
IB/hfi1: Clear the IOWAIT pending bits when QP is put into error state
IB/hfi1: Failed to drain send queue when QP is put into error state
Hans Holmberg [Wed, 10 Apr 2019 17:56:43 +0000 (19:56 +0200)]
lightnvm: pblk: fix crash in pblk_end_partial_read due to multipage bvecs
The introduction of multipage bio vectors broke pblk's partial read
logic due to it not being prepared for multipage bio vectors.
Use bio vector iterators instead of direct bio vector indexing.
Fixes:
07173c3ec276 ("block: enable multipage bvecs")
Reported-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Updated description.
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kaike Wan [Wed, 10 Apr 2019 13:27:05 +0000 (06:27 -0700)]
IB/hfi1: Do not flush send queue in the TID RDMA second leg
When a QP is put into error state, the send queue will be flushed.
This mechanism is implemented in both the first and the second leg
of the send engine. Since the second leg is only responsible for
data transactions in the KDETH space for the TID RDMA WRITE request,
it should not perform the flushing of the send queue.
This patch removes the flushing function of the second leg, but
still keeps the bailing out of the QP if it is put into error state.
Fixes:
70dcb2e3dc6a ("IB/hfi1: Add the TID second leg send packet builder")
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Linus Torvalds [Wed, 10 Apr 2019 16:42:51 +0000 (06:42 -1000)]
Merge tag 'for_linus' of git://git./linux/kernel/git/mst/vhost
Pull virtio fixes from Michael Tsirkin:
"Several fixes, add more reviewers to the list"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio: Honour 'may_reduce_num' in vring_create_virtqueue
MAiNTAINERS: add Paolo, Stefan for virtio blk/scsi
virtio_pci: fix a NULL pointer reference in vp_del_vqs
Marc Gonzalez [Wed, 10 Apr 2019 14:23:38 +0000 (16:23 +0200)]
ASoC: wcd9335: Fix missing regmap requirement
wcd9335.c: undefined reference to 'devm_regmap_add_irq_chip'
Signed-off-by: Marc Gonzalez <marc.w.gonzalez@free.fr>
Signed-off-by: Mark Brown <broonie@kernel.org>
Jani Nikula [Fri, 5 Apr 2019 07:52:20 +0000 (10:52 +0300)]
drm/i915/dp: revert back to max link rate and lane count on eDP
Commit
7769db588384 ("drm/i915/dp: optimize eDP 1.4+ link config fast
and narrow") started to optize the eDP 1.4+ link config, both per spec
and as preparation for display stream compression support.
Sadly, we again face panels that flat out fail with parameters they
claim to support. Revert, and go back to the drawing board.
v2: Actually revert to max params instead of just wide-and-slow.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109959
Fixes:
7769db588384 ("drm/i915/dp: optimize eDP 1.4+ link config fast and narrow")
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Manasi Navare <manasi.d.navare@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Matt Atwood <matthew.s.atwood@intel.com>
Cc: "Lee, Shawn C" <shawn.c.lee@intel.com>
Cc: Dave Airlie <airlied@gmail.com>
Cc: intel-gfx@lists.freedesktop.org
Cc: <stable@vger.kernel.org> # v5.0+
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Manasi Navare <manasi.d.navare@intel.com>
Tested-by: Albert Astals Cid <aacid@kde.org> # v5.0 backport
Tested-by: Emanuele Panigati <ilpanich@gmail.com> # v5.0 backport
Tested-by: Matteo Iervasi <matteoiervasi@gmail.com> # v5.0 backport
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190405075220.9815-1-jani.nikula@intel.com
(cherry picked from commit
f11cb1c19ad0563b3c1ea5eb16a6bac0e401f428)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Vandita Kulkarni [Mon, 25 Mar 2019 11:26:42 +0000 (16:56 +0530)]
drm/i915/icl: Fix port disable sequence for mipi-dsi
Re-enable clock gating of DDI clocks.
v2: Fix the default ddi clk state for mipi-dsi (Imre)
Fixes:
1026bea00381 ("drm/i915/icl: Ungate DSI clocks")
Signed-off-by: Vandita Kulkarni <vandita.kulkarni@intel.com>
Reviewed-by: Uma Shankar <uma.shankar@intel.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/1553513202-13863-2-git-send-email-vandita.kulkarni@intel.com
(cherry picked from commit
942d1cf48eae3fcd7e973cfb708d5c4860f0c713)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Vandita Kulkarni [Mon, 25 Mar 2019 11:26:41 +0000 (16:56 +0530)]
drm/i915/icl: Ungate ddi clocks before IO enable
IO enable sequencing needs ddi clocks enabled.
These clocks will be gated at a later point in
the enable sequence.
v2: Fix the commit header (Uma)
v3: Remove the redundant read (Ville)
Fixes:
949fc52af19e ("drm/i915/icl: add pll mapping for DSI")
Signed-off-by: Vandita Kulkarni <vandita.kulkarni@intel.com>
Reviewed-by: Uma Shankar <uma.shankar@intel.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/1553513202-13863-1-git-send-email-vandita.kulkarni@intel.com
(cherry picked from commit
c5b81a325263a891d5811aabe938c87e03db4c37)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Ming Lei [Mon, 8 Apr 2019 22:31:22 +0000 (06:31 +0800)]
nvme: cancel request synchronously
nvme_cancel_request() is used in error handler, and it is always
reliable to cancel request synchronously, and avoids possible race
in which request may be completed after real hw queue is destroyed.
One issue is reported by our customer on NVMe RDMA, in which freed ib
queue pair may be used in nvme_rdma_complete_rq().
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: James Smart <james.smart@broadcom.com>
Cc: linux-nvme@lists.infradead.org
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 8 Apr 2019 22:31:21 +0000 (06:31 +0800)]
blk-mq: introduce blk_mq_complete_request_sync()
In NVMe's error handler, follows the typical steps of tearing down
hardware for recovering controller:
1) stop blk_mq hw queues
2) stop the real hw queues
3) cancel in-flight requests via
blk_mq_tagset_busy_iter(tags, cancel_request, ...)
cancel_request():
mark the request as abort
blk_mq_complete_request(req);
4) destroy real hw queues
However, there may be race between #3 and #4, because blk_mq_complete_request()
may run q->mq_ops->complete(rq) remotelly and asynchronously, and
->complete(rq) may be run after #4.
This patch introduces blk_mq_complete_request_sync() for fixing the
above race.
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: James Smart <james.smart@broadcom.com>
Cc: linux-nvme@lists.infradead.org
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dongli Zhang [Wed, 27 Mar 2019 10:36:35 +0000 (18:36 +0800)]
scsi: virtio_scsi: limit number of hw queues by nr_cpu_ids
When tag_set->nr_maps is 1, the block layer limits the number of hw queues
by nr_cpu_ids. No matter how many hw queues are used by virtio-scsi, as it
has (tag_set->nr_maps == 1), it can use at most nr_cpu_ids hw queues.
In addition, specifically for pci scenario, when the 'num_queues' specified
by qemu is more than maxcpus, virtio-scsi would not be able to allocate
more than maxcpus vectors in order to have a vector for each queue. As a
result, it falls back into MSI-X with one vector for config and one shared
for queues.
Considering above reasons, this patch limits the number of hw queues used
by virtio-scsi by nr_cpu_ids.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dongli Zhang [Wed, 27 Mar 2019 10:36:34 +0000 (18:36 +0800)]
virtio-blk: limit number of hw queues by nr_cpu_ids
When tag_set->nr_maps is 1, the block layer limits the number of hw queues
by nr_cpu_ids. No matter how many hw queues are used by virtio-blk, as it
has (tag_set->nr_maps == 1), it can use at most nr_cpu_ids hw queues.
In addition, specifically for pci scenario, when the 'num-queues' specified
by qemu is more than maxcpus, virtio-blk would not be able to allocate more
than maxcpus vectors in order to have a vector for each queue. As a result,
it falls back into MSI-X with one vector for config and one shared for
queues.
Considering above reasons, this patch limits the number of hw queues used
by virtio-blk by nr_cpu_ids.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Paolo Valente [Wed, 10 Apr 2019 08:38:33 +0000 (10:38 +0200)]
block, bfq: fix use after free in bfq_bfqq_expire
The function bfq_bfqq_expire() invokes the function
__bfq_bfqq_expire(), and the latter may free the in-service bfq-queue.
If this happens, then no other instruction of bfq_bfqq_expire() must
be executed, or a use-after-free will occur.
Basing on the assumption that __bfq_bfqq_expire() invokes
bfq_put_queue() on the in-service bfq-queue exactly once, the queue is
assumed to be freed if its refcounter is equal to one right before
invoking __bfq_bfqq_expire().
But, since commit
9dee8b3b057e ("block, bfq: fix queue removal from
weights tree") this assumption is false. __bfq_bfqq_expire() may also
invoke bfq_weights_tree_remove() and, since commit
9dee8b3b057e
("block, bfq: fix queue removal from weights tree"), also
the latter function may invoke bfq_put_queue(). So __bfq_bfqq_expire()
may invoke bfq_put_queue() twice, and this is the actual case where
the in-service queue may happen to be freed.
To address this issue, this commit moves the check on the refcounter
of the queue right around the last bfq_put_queue() that may be invoked
on the queue.
Fixes:
9dee8b3b057e ("block, bfq: fix queue removal from weights tree")
Reported-by: Dmitrii Tcvetkov <demfloro@demfloro.ru>
Reported-by: Douglas Anderson <dianders@chromium.org>
Tested-by: Dmitrii Tcvetkov <demfloro@demfloro.ru>
Tested-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Takashi Iwai [Wed, 10 Apr 2019 10:49:55 +0000 (12:49 +0200)]
ALSA: hda: Fix racy display power access
snd_hdac_display_power() doesn't handle the concurrent calls carefully
enough, and it may lead to the doubly get_power or put_power calls,
when a runtime PM and an async work get called in racy way.
This patch addresses it by reusing the bus->lock mutex that has been
used for protecting the link state change in ext bus code, so that it
can protect against racy display state changes. The initialization of
bus->lock was moved from snd_hdac_ext_bus_init() to
snd_hdac_bus_init() as well accordingly.
Testcase: igt/i915_pm_rpm/module-reload #glk-dsi
Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Imre Deak <imre.deak@intel.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>