linux-2.6-microblaze.git
3 years agoblk-mq: Rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED
Ming Lei [Wed, 19 Aug 2020 15:20:19 +0000 (23:20 +0800)]
blk-mq: Rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED

BLK_MQ_F_TAG_SHARED actually means that tags is shared among request
queues, all of which should belong to LUNs attached to same HBA.

So rename it to make the point explicitly.

[jpg: rebase a few times, add rnbd-clt.c change]

Suggested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: remove revalidate_disk()
Christoph Hellwig [Tue, 1 Sep 2020 15:57:48 +0000 (17:57 +0200)]
block: remove revalidate_disk()

Remove the now unused helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonvdimm: simplify revalidate_disk handling
Christoph Hellwig [Tue, 1 Sep 2020 15:57:47 +0000 (17:57 +0200)]
nvdimm: simplify revalidate_disk handling

The nvdimm block driver abuse revalidate_disk in a strange way, and
totally unrelated to what other drivers do.  Simplify this by just
calling nvdimm_revalidate_disk (which seems rather misnamed) from the
probe routines, as the additional bdev size revalidation is pointless
at this point, and remove the revalidate_disk methods given that
it can only be triggered from add_disk, which is right before the
manual calls.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agosd: open code revalidate_disk
Christoph Hellwig [Tue, 1 Sep 2020 15:57:46 +0000 (17:57 +0200)]
sd: open code revalidate_disk

Instead of calling revalidate_disk just do the work directly by
calling sd_revalidate_disk, and revalidate_disk_size where needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonvme: opencode revalidate_disk in nvme_validate_ns
Christoph Hellwig [Tue, 1 Sep 2020 15:57:45 +0000 (17:57 +0200)]
nvme: opencode revalidate_disk in nvme_validate_ns

Keep control in the NVMe driver instead of going through an indirect
call back into ->revalidate_disk.  Also reorder the function a bit to be
easier to follow with the additional code.

And now that we have removed all callers of revalidate_disk() in the nvme
code, ->revalidate_disk is only called from the open code when first
opening the device.  Which is of course totally pointless as we have
a valid size since the initial scan, and will get an updated view
through the asynchronous notifiation everytime the size changes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: use revalidate_disk_size in set_capacity_revalidate_and_notify
Christoph Hellwig [Tue, 1 Sep 2020 15:57:44 +0000 (17:57 +0200)]
block: use revalidate_disk_size in set_capacity_revalidate_and_notify

Only virtio_blk and xen-blkfront set the revalidate argument to true,
and both do not implement the ->revalidate_disk method.  So switch
to the helper that just updates the size instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: add a new revalidate_disk_size helper
Christoph Hellwig [Tue, 1 Sep 2020 15:57:43 +0000 (17:57 +0200)]
block: add a new revalidate_disk_size helper

revalidate_disk is a relative awkward helper for driver use, as it first
calls an optional driver method and then updates the block device size,
while most callers either don't need the method call at all, or want to
keep state between the caller and the called method.

Add a revalidate_disk_size helper that just performs the update of the
block device size from the gendisk one, and switch all drivers that do
not implement ->revalidate_disk to use the new helper instead of
revalidate_disk()

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: rename bd_invalidated
Christoph Hellwig [Tue, 1 Sep 2020 15:57:42 +0000 (17:57 +0200)]
block: rename bd_invalidated

Replace bd_invalidate with a new BDEV_NEED_PART_SCAN flag in a bd_flags
variable to better describe the condition.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: don't clear bd_invalidated in check_disk_size_change
Christoph Hellwig [Tue, 1 Sep 2020 15:57:41 +0000 (17:57 +0200)]
block: don't clear bd_invalidated in check_disk_size_change

bd_invalidated is set by check_disk_change or in add_disk to initiate a
partition scan.  Move it from check_disk_size_change which is called
from both revalidate_disk() and bdev_disk_changed() to only the latter,
as that is what is called from the block device open code (and nbd) to
deal with the bd_invalidated event.  revalidate_disk() on the other hand
is mostly used to propagate a size update from the gendisk to the block
device, which is entirely unrelated.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoDocumentation/filesystems/locking.rst: remove an incorrect sentence
Christoph Hellwig [Tue, 1 Sep 2020 15:57:40 +0000 (17:57 +0200)]
Documentation/filesystems/locking.rst: remove an incorrect sentence

unlock_native_capacity is never called from check_disk_change(), and
while revalidate_disk can be called from it, it can also be called
from two other places at the moment.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: Remove a duplicative condition
Baolin Wang [Wed, 2 Sep 2020 01:45:25 +0000 (09:45 +0800)]
block: Remove a duplicative condition

Remove a duplicative condition to remove below cppcheck warnings:

"warning: Redundant condition: sched_allow_merge. '!A || (A && B)' is
equivalent to '!A || B' [redundantCondition]"

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: better deal with the delayed not supported case in blk_cloned_rq_check_limits
Ritika Srivastava [Tue, 1 Sep 2020 20:17:31 +0000 (13:17 -0700)]
block: better deal with the delayed not supported case in blk_cloned_rq_check_limits

If WRITE_ZERO/WRITE_SAME operation is not supported by the storage,
blk_cloned_rq_check_limits() will return IO error which will cause
device-mapper to fail the paths.

Instead, if the queue limit is set to 0, return BLK_STS_NOTSUPP.
BLK_STS_NOTSUPP will be ignored by device-mapper and will not fail the
paths.

Suggested-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Ritika Srivastava <ritika.srivastava@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: Return blk_status_t instead of errno codes
Ritika Srivastava [Tue, 1 Sep 2020 20:17:30 +0000 (13:17 -0700)]
block: Return blk_status_t instead of errno codes

Replace returning legacy errno codes with blk_status_t in
blk_cloned_rq_check_limits().

Signed-off-by: Ritika Srivastava <ritika.srivastava@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: grant IOPRIO_CLASS_RT to CAP_SYS_NICE
Khazhismel Kumykov [Mon, 24 Aug 2020 22:10:34 +0000 (15:10 -0700)]
block: grant IOPRIO_CLASS_RT to CAP_SYS_NICE

CAP_SYS_ADMIN is too broad, and ionice fits into CAP_SYS_NICE's grouping.

Retain CAP_SYS_ADMIN permission for backwards compatibility.

Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-iocost: update iocost_monitor.py
Tejun Heo [Tue, 1 Sep 2020 18:52:57 +0000 (14:52 -0400)]
blk-iocost: update iocost_monitor.py

iocost went through significant internal changes. Update iocost_monitor.py
accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-iocost: add three debug stat - cost.wait, indebt and indelay
Tejun Heo [Tue, 1 Sep 2020 18:52:56 +0000 (14:52 -0400)]
blk-iocost: add three debug stat - cost.wait, indebt and indelay

These are really cheap to collect and can be useful in debugging iocost
behavior. Add them as debug stats for now.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-iocost: restore inuse update tracepoints
Tejun Heo [Tue, 1 Sep 2020 18:52:55 +0000 (14:52 -0400)]
blk-iocost: restore inuse update tracepoints

Update and restore the inuse update tracepoints.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-iocost: implement vtime loss compensation
Tejun Heo [Tue, 1 Sep 2020 18:52:54 +0000 (14:52 -0400)]
blk-iocost: implement vtime loss compensation

When an iocg accumulates too much vtime or gets deactivated, we throw away
some vtime, which lowers the overall device utilization. As the exact amount
which is being thrown away is known, we can compensate by accelerating the
vrate accordingly so that the extra vtime generated in the current period
matches what got lost.

This significantly improves work conservation when involving high weight
cgroups with intermittent and bursty IO patterns.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-iocost: halve debts if device stays idle
Tejun Heo [Tue, 1 Sep 2020 18:52:53 +0000 (14:52 -0400)]
blk-iocost: halve debts if device stays idle

A low weight iocg can amass a large amount of debt, for example, when
anonymous memory gets reclaimed aggressively. If the system has a lot of
memory paired with a slow IO device, the debt can span multiple seconds or
more. If there are no other subsequent IO issuers, the in-debt iocg may end
up blocked paying its debt while the IO device is idle.

This patch implements a mechanism to protect against such pathological
cases. If the device has been sufficiently idle for a substantial amount of
time, the debts are halved. The criteria are on the conservative side as we
want to resolve the rare extreme cases without impacting regular operation
by forgiving debts too readily.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-iocost: implement delay adjustment hysteresis
Tejun Heo [Tue, 1 Sep 2020 18:52:52 +0000 (14:52 -0400)]
blk-iocost: implement delay adjustment hysteresis

Curently, iocost syncs the delay duration to the outstanding debt amount,
which seemed enough to protect the system from anon memory hogs. However,
that was mostly because the delay calcuation was using hweight_inuse which
quickly converges towards zero under debt for delay duration calculation,
often pusnishing debtors overly harshly for longer than deserved.

The previous patch fixed the delay calcuation and now the protection against
anonymous memory hogs isn't enough because the effect of delay is indirect
and non-linear and a huge amount of future debt can accumulate abruptly
while unthrottled.

This patch implements delay hysteresis so that delay is decayed
exponentially over time instead of getting cleared immediately as debt is
paid off. While the overall behavior is similar to the blk-cgroup
implementation used by blk-iolatency, a lot of the details are different and
due to the empirical nature of the mechanism, it's challenging to adapt the
mechanism for one controller without negatively impacting the other.

As the delay is gradually decayed now, there's no point in running it from
its own hrtimer. Periodic updates are now performed from ioc_timer_fn() and
the dedicated hrtimer is removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-iocost: revamp debt handling
Tejun Heo [Tue, 1 Sep 2020 18:52:51 +0000 (14:52 -0400)]
blk-iocost: revamp debt handling

Debt handling had several issues.

* How much inuse a debtor carries wasn't clearly defined. inuse would be
  driven down over time from not issuing IOs but it'd be better to clamp it
  to minimum immediately once in debt.

* How much can be paid off was determined by hweight_inuse. As inuse was
  driven down, the payment amount would fall together regardless of the
  debtor's active weight. This means that the debtors were punished harshly.

* ioc_rqos_merge() wasn't calling blkcg_schedule_throttle() after
  iocg_kick_delay().

This patch revamps debt handling so that

* Debt handling owns inuse for iocgs in debt and keeps them at zero.

* Payment amount is determined by hweight_active. This is more deterministic
  and safer than hweight_inuse but still far from ideal in that it doesn't
  factor in possible donations from other iocgs for debt payments. This
  likely needs further improvements in the future.

* iocg_rqos_merge() now calls blkcg_schedule_throttle() as necessary.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-iocost: revamp in-period donation snapbacks
Tejun Heo [Tue, 1 Sep 2020 18:52:50 +0000 (14:52 -0400)]
blk-iocost: revamp in-period donation snapbacks

When the margin drops below the minimum on a donating iocg, donation is
immediately canceled in full. There are a couple shortcomings with the
current behavior.

* It's abrupt. A small temporary budget deficit can lead to a wide swing in
  weight allocation and a large surplus.

* It's open coded in the issue path but not implemented for the merge path.
  A series of merges at a low inuse can make the iocg incur debts and stall
  incorrectly.

This patch reimplements in-period donation snapbacks so that

* inuse adjustment and cost calculations are factored into
  adjust_inuse_and_calc_cost() which is called from both the issue and merge
  paths.

* Snapbacks are more gradual. It occurs in quarter steps.

* A snapback triggers if the margin goes below the low threshold and is
  lower than the budget at the time of the last adjustment.

* For the above, __propagate_weights() stores the margin in
  iocg->saved_margin. Move iocg->last_inuse storing together into
  __propagate_weights() for consistency.

* Full snapback is guaranteed when there are waiters.

* With precise donation and gradual snapbacks, inuse adjustments are now a
  lot more effective and the value of scaling inuse on weight changes isn't
  clear. Removed inuse scaling from weight_update().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-iocost: revamp donation amount determination
Tejun Heo [Tue, 1 Sep 2020 18:52:49 +0000 (14:52 -0400)]
blk-iocost: revamp donation amount determination

iocost has various safety nets to combat inuse adjustment calculation
inaccuracies. With Andy's method implemented in transfer_surpluses(), inuse
adjustment calculations are now accurate and we can make donation amount
determinations accurate too.

* Stop keeping track of past usage history and using the maximum. Act on the
  immediate usage information.

* Remove donation constraints defined by SURPLUS_* constants. Donate
  whatever isn't used.

* Determine the donation amount so that the iocg will end up with
  MARGIN_TARGET_PCT budget at the end of the coming period assuming the same
  usage as the previous period. TARGET is set at 50% of period, which is the
  previous maximum. This provides smooth convergence for most repetitive IO
  patterns.

* Apply donation logic early at 20% budget. There's no risk in doing so as
  the calculation is based on the delta between the current budget and the
  target budget at the end of the coming period.

* Remove preemptive iocg activation for zero cost IOs. As donation can reach
  near zero now, the mere activation doesn't provide any protection anymore.
  In the unlikely case that this becomes a problem, the right solution is
  assigning appropriate costs for such IOs.

This significantly improves the donation determination logic while also
simplifying it. Now all donations are immediate, exact and smooth.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-iocost: implement Andy's method for donation weight updates
Tejun Heo [Tue, 1 Sep 2020 18:52:48 +0000 (14:52 -0400)]
blk-iocost: implement Andy's method for donation weight updates

iocost implements work conservation by reducing iocg->inuse and propagating
the adjustment upwards proportionally. However, while I knew the target
absolute hierarchical proportion - adjusted hweight_inuse, I couldn't figure
out how to determine the iocg->inuse adjustment to achieve that and
approximated the adjustment by scaling iocg->inuse using the proportion of
the needed hweight_inuse changes.

When nested, these scalings aren't accurate even when adjusting a single
node as the donating node also receives the benefit of the donated portion.
When multiple nodes are donating as they often do, they can be wildly wrong.

iocost employed various safety nets to combat the inaccuracies. There are
ample buffers in determining how much to donate, the adjustments are
conservative and gradual. While it can achieve a reasonable level of work
conservation in simple scenarios, the inaccuracies can easily add up leading
to significant loss of total work. This in turn makes it difficult to
closely cap vrate as vrate adjustment is needed to compensate for the loss
of work. The combination of inaccurate donation calculations and vrate
adjustments can lead to wide fluctuations and clunky overall behaviors.

Andy Newell devised a method to calculate the needed ->inuse updates to
achieve the target hweight_inuse's. The method is compatible with the
proportional inuse adjustment propagation which allows all hot path
operations to be local to each iocg.

To roughly summarize, Andy's method divides the tree into donating and
non-donating parts, calculates global donation rate which is used to
determine the target hweight_inuse for each node, and then derives per-level
proportions. There's non-trivial amount of math involved. Please refer to
the following pdfs for detailed descriptions.

  https://drive.google.com/file/d/1PsJwxPFtjUnwOY1QJ5AeICCcsL7BM3bo
  https://drive.google.com/file/d/1vONz1-fzVO7oY5DXXsLjSxEtYYQbOvsE
  https://drive.google.com/file/d/1WcrltBOSPN0qXVdBgnKm4mdp9FhuEFQN

This patch implements Andy's method in transfer_surpluses(). This makes the
donation calculations accurate per cycle and enables further improvements in
other parts of the donation logic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-iocost: restructure surplus donation logic
Tejun Heo [Tue, 1 Sep 2020 18:52:47 +0000 (14:52 -0400)]
blk-iocost: restructure surplus donation logic

The way the surplus donation logic is structured isn't great. There are two
separate paths for starting/increasing donations and decreasing them making
the logic harder to follow and is prone to unnecessary behavior differences.

In preparation for improved donation handling, this patch restructures the
code so that

* All donors - new, increasing and decreasing - are funneled through the
  same code path.

* The target donation calculation is factored into hweight_after_donation()
  which is called once from the same spot for all possible donors.

* Actual inuse adjustment is factored into trasnfer_surpluses().

This change introduces a few behavior differences - e.g. donation amount
reduction now uses the max usage of the recent three periods just like new
and increasing donations, and inuse now gets adjusted upwards the same way
it gets downwards. These differences are unlikely to have severely negative
implications and the whole logic will be revamped soon.

This patch also removes two tracepoints. The existing TPs don't quite fit
the new implementation. A later patch will update and reinstate them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-iocost: decouple vrate adjustment from surplus transfers
Tejun Heo [Tue, 1 Sep 2020 18:52:46 +0000 (14:52 -0400)]
blk-iocost: decouple vrate adjustment from surplus transfers

Budget donations are inaccurate and could take multiple periods to converge.
To prevent triggering vrate adjustments while surplus transfers were
catching up, vrate adjustment was suppressed if donations were increasing,
which was indicated by non-zero nr_surpluses.

This entangling won't be necessary with the scheduled rewrite of donation
mechanism which will make it precise and immediate. Let's decouple the two
in preparation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-iocost: replace iocg->has_surplus with ->surplus_list
Tejun Heo [Tue, 1 Sep 2020 18:52:45 +0000 (14:52 -0400)]
blk-iocost: replace iocg->has_surplus with ->surplus_list

Instead of marking iocgs with surplus with a flag and filtering for them
while walking all active iocgs, build a surpluses list. This doesn't make
much difference now but will help implementing improved donation logic which
will iterate iocgs with surplus multiple times.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-iocost: calculate iocg->usages[] from iocg->local_stat.usage_us
Tejun Heo [Tue, 1 Sep 2020 18:52:44 +0000 (14:52 -0400)]
blk-iocost: calculate iocg->usages[] from iocg->local_stat.usage_us

Currently, iocg->usages[] which are used to guide inuse adjustments are
calculated from vtime deltas. This, however, assumes that the hierarchical
inuse weight at the time of calculation held for the entire period, which
often isn't true and can lead to significant errors.

Now that we have absolute usage information collected, we can derive
iocg->usages[] from iocg->local_stat.usage_us so that inuse adjustment
decisions are made based on actual absolute usage. The calculated usage is
clamped between 1 and WEIGHT_ONE and WEIGHT_ONE is also used to signal
saturation regardless of the current hierarchical inuse weight.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-iocost: add absolute usage stat
Tejun Heo [Tue, 1 Sep 2020 18:52:43 +0000 (14:52 -0400)]
blk-iocost: add absolute usage stat

Currently, iocost doesn't collect or expose any statistics punting off all
monitoring duties to drgn based iocost_monitor.py. While it works for some
scenarios, there are some usability and data availability challenges. For
example, accurate per-cgroup usage information can't be tracked by vtime
progression at all and the number available in iocg->usages[] are really
short-term snapshots used for control heuristics with possibly significant
errors.

This patch implements per-cgroup absolute usage stat counter and exposes it
through io.stat along with the current vrate. Usage stat collection and
flushing employ the same method as cgroup rstat on the active iocg's and the
only hot path overhead is preemption toggling and adding to a percpu
counter.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-iocost: grab ioc->lock for debt handling
Tejun Heo [Tue, 1 Sep 2020 18:52:42 +0000 (14:52 -0400)]
blk-iocost: grab ioc->lock for debt handling

Currently, debt handling requires only iocg->waitq.lock. In the future, we
want to adjust and propagate inuse changes depending on debt status. Let's
grab ioc->lock in debt handling paths in preparation.

* Because ioc->lock nests outside iocg->waitq.lock, the decision to grab
  ioc->lock needs to be made before entering the critical sections.

* Add and use iocg_[un]lock() which handles the conditional double locking.

* Add @pay_debt to iocg_kick_waitq() so that debt payment happens only when
  the caller grabbed both locks.

This patch is prepatory and the comments contain references to future
changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-iocost: streamline vtime margin and timer slack handling
Tejun Heo [Tue, 1 Sep 2020 18:52:41 +0000 (14:52 -0400)]
blk-iocost: streamline vtime margin and timer slack handling

The margin handling was pretty inconsistent.

* ioc->margin_us and ioc->inuse_margin_vtime were used as vtime margin
  thresholds. However, the two are in different units with the former
  requiring conversion to vtime on use.

* iocg_kick_waitq() was using a quarter of WAITQ_TIMER_MARGIN_PCT of
  period_us as the timer slack - ~1.2%. While iocg_kick_delay() was using a
  quarter of ioc->margin_us - ~12.5%. There aren't strong reasons to use
  different values for the two.

This patch cleans up margin and timer slack handling:

* vtime margins are now recorded in ioc->margins.{min, max} on period
  duration changes and used consistently.

* Timer slack is now 1% of period_us and recorded in ioc->timer_slack_ns and
  used consistently for iocg_kick_waitq() and iocg_kick_delay().

The only functional change is shortening of timer slack. No meaningful
visible change is expected.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-iocost: make ioc_now->now and ioc->period_at 64bit
Tejun Heo [Tue, 1 Sep 2020 18:52:40 +0000 (14:52 -0400)]
blk-iocost: make ioc_now->now and ioc->period_at 64bit

They are in microseconds and wrap in around 1.2 hours with u32. While
unlikely, confusions from wraparounds are still possible. We aren't saving
anything meaningful by keeping these u32. Let's make them u64.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-iocost: use WEIGHT_ONE based fixed point number for weights
Tejun Heo [Tue, 1 Sep 2020 18:52:39 +0000 (14:52 -0400)]
blk-iocost: use WEIGHT_ONE based fixed point number for weights

To improve weight donations, we want to able to scale inuse with a greater
accuracy and down below 1. Let's make non-hierarchical weights to use
WEIGHT_ONE based fixed point numbers too like hierarchical ones.

This doesn't cause any behavior changes yet.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-iocost: s/HWEIGHT_WHOLE/WEIGHT_ONE/g
Tejun Heo [Tue, 1 Sep 2020 18:52:38 +0000 (14:52 -0400)]
blk-iocost: s/HWEIGHT_WHOLE/WEIGHT_ONE/g

We're gonna use HWEIGHT_WHOLE for regular weights too. Let's rename it to
WEIGHT_ONE.

Pure rename.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-iocost: make iocg_kick_waitq() call iocg_kick_delay() after paying debt
Tejun Heo [Tue, 1 Sep 2020 18:52:37 +0000 (14:52 -0400)]
blk-iocost: make iocg_kick_waitq() call iocg_kick_delay() after paying debt

iocg_kick_waitq() is the function which pays debt and iocg_kick_delay()
updates the actual delay status accordingly. If iocg_kick_delay() is not
called after iocg_kick_delay() updated debt, unnecessarily large delays can
be applied temporarily.

Let's make sure such conditions don't occur by making iocg_kick_waitq()
always call iocg_kick_delay() after paying debt.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-iocost: move iocg_kick_delay() above iocg_kick_waitq()
Tejun Heo [Tue, 1 Sep 2020 18:52:36 +0000 (14:52 -0400)]
blk-iocost: move iocg_kick_delay() above iocg_kick_waitq()

We'll make iocg_kick_waitq() call iocg_kick_delay(). Reorder them in
preparation. This is pure code reorganization.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-iocost: clamp inuse and skip noops in __propagate_weights()
Tejun Heo [Tue, 1 Sep 2020 18:52:35 +0000 (14:52 -0400)]
blk-iocost: clamp inuse and skip noops in __propagate_weights()

__propagate_weights() currently expects the callers to clamp inuse within
[1, active], which is needlessly fragile. The inuse adjustment logic is
going to be revamped, in preparation, let's make __propagate_weights() clamp
inuse on entry.

Also, make it avoid weight updates altogether if neither active or inuse is
changed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-iocost: rename propagate_active_weights() to propagate_weights()
Tejun Heo [Tue, 1 Sep 2020 18:52:34 +0000 (14:52 -0400)]
blk-iocost: rename propagate_active_weights() to propagate_weights()

It already propagates two weights - active and inuse - and there will be
another soon. Let's drop the confusing misnomers. Rename
[__]propagate_active_weights() to [__]propagate_weights() and
commit_active_weights() to commit_weights().

This is pure rename.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-iocost: use local[64]_t for percpu stat
Tejun Heo [Tue, 1 Sep 2020 18:52:33 +0000 (14:52 -0400)]
blk-iocost: use local[64]_t for percpu stat

blk-iocost has been reading percpu stat counters from remote cpus which on
some archs can lead to torn reads in really rare occassions. Use local[64]_t
for those counters.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: remove the unused q argument to part_in_flight and part_in_flight_rw
Christoph Hellwig [Mon, 31 Aug 2020 18:02:39 +0000 (20:02 +0200)]
block: remove the unused q argument to part_in_flight and part_in_flight_rw

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: remove the disk argument to delete_partition
Christoph Hellwig [Mon, 31 Aug 2020 18:02:38 +0000 (20:02 +0200)]
block: remove the disk argument to delete_partition

We can trivially derive the gendisk from the hd_struct.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: cleanup __alloc_disk_node
Christoph Hellwig [Mon, 31 Aug 2020 18:02:37 +0000 (20:02 +0200)]
block: cleanup __alloc_disk_node

Use early returns and goto-based unwinding to simplify the flow a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: move the devcgroup_inode_permission call to blkdev_get
Christoph Hellwig [Mon, 31 Aug 2020 18:02:36 +0000 (20:02 +0200)]
block: move the devcgroup_inode_permission call to blkdev_get

devcgroup_inode_permission is never called for the recusive case, so
move it out into blkdev_get.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: remove an outdated comment on the bd_dev field
Christoph Hellwig [Mon, 31 Aug 2020 18:02:35 +0000 (20:02 +0200)]
block: remove an outdated comment on the bd_dev field

kdev_t is long gone, so we don't need to comment a field isn't one..

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: remove the discard_alignment field from struct hd_struct
Christoph Hellwig [Mon, 31 Aug 2020 18:02:34 +0000 (20:02 +0200)]
block: remove the discard_alignment field from struct hd_struct

The alignment offset is only used in slow path callers, so just calculate
it on the fly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: remove the alignment_offset field from struct hd_struct
Christoph Hellwig [Mon, 31 Aug 2020 18:02:33 +0000 (20:02 +0200)]
block: remove the alignment_offset field from struct hd_struct

The alignment offset is only used in slow path callers, so just calculate
it on the fly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-mq: use BLK_MQ_NO_TAG for no tag
Xianting Tian [Thu, 27 Aug 2020 06:34:17 +0000 (14:34 +0800)]
blk-mq: use BLK_MQ_NO_TAG for no tag

Replace various magic -1 constants for tags with BLK_MQ_NO_TAG.

Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: Remove blk_mq_attempt_merge() function
Baolin Wang [Fri, 28 Aug 2020 02:52:57 +0000 (10:52 +0800)]
block: Remove blk_mq_attempt_merge() function

The small blk_mq_attempt_merge() function is only called by
__blk_mq_sched_bio_merge(), just open code it.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: Add a new helper to attempt to merge a bio
Baolin Wang [Fri, 28 Aug 2020 02:52:56 +0000 (10:52 +0800)]
block: Add a new helper to attempt to merge a bio

There are lots of duplicated code when trying to merge a bio from
plug list and sw queue, we can introduce a new helper to attempt
to merge a bio, which can simplify the blk_bio_list_merge()
and blk_attempt_plug_merge().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: Move blk_mq_bio_list_merge() into blk-merge.c
Baolin Wang [Fri, 28 Aug 2020 02:52:55 +0000 (10:52 +0800)]
block: Move blk_mq_bio_list_merge() into blk-merge.c

Move the blk_mq_bio_list_merge() into blk-merge.c and
rename it as a generic name.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: Move bio merge related functions into blk-merge.c
Baolin Wang [Fri, 28 Aug 2020 02:52:54 +0000 (10:52 +0800)]
block: Move bio merge related functions into blk-merge.c

It's better to move bio merge related functions into blk-merge.c,
which contains all merge related functions.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-wbt: Remove obsolete multiqueue I/O scheduling comment
Danny Lin [Sat, 29 Aug 2020 09:13:53 +0000 (02:13 -0700)]
blk-wbt: Remove obsolete multiqueue I/O scheduling comment

This comment was added before the multiqueue I/O scheduler framework
was introduced; multiqueue has support for I/O scheduling now, so this
obsolete comment can be removed.

Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agovirtio-blk: Use kobj_to_dev() instead of container_of()
Tian Tao [Fri, 21 Aug 2020 01:19:15 +0000 (09:19 +0800)]
virtio-blk: Use kobj_to_dev() instead of container_of()

Use kobj_to_dev() instead of container_of()

Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoraw: deprecate the raw driver
Christoph Hellwig [Wed, 19 Aug 2020 07:35:33 +0000 (09:35 +0200)]
raw: deprecate the raw driver

The raw driver has been replaced by O_DIRECT support on the block device
in 2002.  Deprecate it to prepare for removal in a few kernel releases.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: remove the BIO_USER_MAPPED flag
Christoph Hellwig [Thu, 27 Aug 2020 15:37:48 +0000 (17:37 +0200)]
block: remove the BIO_USER_MAPPED flag

Just check if there is private data, in which case the bio must have
originated from bio_copy_user_iov.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: remove __blk_rq_map_user_iov
Christoph Hellwig [Thu, 27 Aug 2020 15:37:47 +0000 (17:37 +0200)]
block: remove __blk_rq_map_user_iov

Just duplicate a small amount of code in the low-level map into the bio
and copy to the bio routines, leading to much easier to follow and
maintain code, and better shared error handling.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: remove __blk_rq_unmap_user
Christoph Hellwig [Thu, 27 Aug 2020 15:37:46 +0000 (17:37 +0200)]
block: remove __blk_rq_unmap_user

Open code __blk_rq_unmap_user in the two callers.  Both never pass a NULL
bio, and one of them can use an existing local variable instead of the bio
flag.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: remove the BIO_NULL_MAPPED flag
Christoph Hellwig [Thu, 27 Aug 2020 15:37:45 +0000 (17:37 +0200)]
block: remove the BIO_NULL_MAPPED flag

We can simply use a boolean flag in the bio_map_data data structure
instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonvme: don't call revalidate_disk from nvme_set_queue_dying
Christoph Hellwig [Sun, 23 Aug 2020 09:10:43 +0000 (11:10 +0200)]
nvme: don't call revalidate_disk from nvme_set_queue_dying

In nvme_set_queue_dying we really just want to ensure the disk and bdev
sizes are set to zero.  Going through revalidate_disk leads to a somewhat
arcance and complex callchain relying on special behavior in a few
places.  Instead just lift the set_capacity directly to
nvme_set_queue_dying, and rename and move the nvme_mpath_update_disk_size
helper so that we can use it in nvme_set_queue_dying to propagate the
size to the bdev without detours.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: fix locking for struct block_device size updates
Christoph Hellwig [Sun, 23 Aug 2020 09:10:42 +0000 (11:10 +0200)]
block: fix locking for struct block_device size updates

Two different callers use two different mutexes for updating the
block device size, which obviously doesn't help to actually protect
against concurrent updates from the different callers.  In addition
one of the locks, bd_mutex is rather prone to deadlocks with other
parts of the block stack that use it for high level synchronization.

Switch to using a new spinlock protecting just the size updates, as
that is all we need, and make sure everyone does the update through
the proper helper.

This fixes a bug reported with the nvme revalidating disks during a
hot removal operation, which can currently deadlock on bd_mutex.

Reported-by: Xianting Tian <xianting_tian@126.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: replace bd_set_size with bd_set_nr_sectors
Christoph Hellwig [Sun, 23 Aug 2020 09:10:41 +0000 (11:10 +0200)]
block: replace bd_set_size with bd_set_nr_sectors

Replace bd_set_size with a version that takes the number of sectors
instead, as that fits most of the current and future callers much better.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: Make request_queue.rpm_status an enum
Geert Uytterhoeven [Wed, 19 Aug 2020 12:34:03 +0000 (14:34 +0200)]
block: Make request_queue.rpm_status an enum

request_queue.rpm_status is assigned values of the rpm_status enum only,
so reflect that in its type.

Note that including <linux/pm.h> is (currently) a no-op, as it is
already included through <linux/genhd.h> and <linux/device.h>, but it is
better to play it safe.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoMerge branch 'block-5.9' into for-5.10/block
Jens Axboe [Tue, 1 Sep 2020 22:49:20 +0000 (16:49 -0600)]
Merge branch 'block-5.9' into for-5.10/block

* block-5.9:
  blk-stat: make q->stats->lock irqsafe
  blk-iocost: ioc_pd_free() shouldn't assume irq disabled
  block: fix locking in bdev_del_partition
  block: release disk reference in hd_struct_free_work
  block: ensure bdi->io_pages is always initialized
  nvme-pci: cancel nvme device request before disabling
  nvme: only use power of two io boundaries
  nvme: fix controller instance leak
  nvmet-fc: Fix a missed _irqsave version of spin_lock in 'nvmet_fc_fod_op_done()'
  nvme: Fix NULL dereference for pci nvme controllers
  nvme-rdma: fix reset hang if controller died in the middle of a reset
  nvme-rdma: fix timeout handler
  nvme-rdma: serialize controller teardown sequences
  nvme-tcp: fix reset hang if controller died in the middle of a reset
  nvme-tcp: fix timeout handler
  nvme-tcp: serialize controller teardown sequences
  nvme: have nvme_wait_freeze_timeout return if it timed out
  nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance
  nvmet-tcp: Fix NULL dereference when a connect data comes in h2cdata pdu

3 years agoblk-stat: make q->stats->lock irqsafe
Tejun Heo [Tue, 1 Sep 2020 18:52:32 +0000 (14:52 -0400)]
blk-stat: make q->stats->lock irqsafe

blk-iocost calls blk_stat_enable_accounting() while holding an irqsafe lock
which triggers a lockdep splat because q->stats->lock isn't irqsafe. Let's
make it irqsafe.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: cd006509b0a9 ("blk-iocost: account for IO size when testing latencies")
Cc: stable@vger.kernel.org # v5.8+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-iocost: ioc_pd_free() shouldn't assume irq disabled
Tejun Heo [Tue, 1 Sep 2020 18:52:31 +0000 (14:52 -0400)]
blk-iocost: ioc_pd_free() shouldn't assume irq disabled

ioc_pd_free() grabs irq-safe ioc->lock without ensuring that irq is disabled
when it can be called with irq disabled or enabled. This has a small chance
of causing A-A deadlocks and triggers lockdep splats. Use irqsave operations
instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: fix locking in bdev_del_partition
Christoph Hellwig [Tue, 1 Sep 2020 09:59:41 +0000 (11:59 +0200)]
block: fix locking in bdev_del_partition

We need to hold the whole device bd_mutex to protect against
other thread concurrently deleting out partition before we get
to it, and thus causing a use after free.

Fixes: cddae808aeb7 ("block: pass a hd_struct to delete_partition")
Reported-by: syzbot+6448f3c229bc52b82f69@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: release disk reference in hd_struct_free_work
Ming Lei [Tue, 1 Sep 2020 10:07:38 +0000 (18:07 +0800)]
block: release disk reference in hd_struct_free_work

Commit e8c7d14ac6c3 ("block: revert back to synchronous request_queue removal")
stops to release request queue from wq context because that commit
supposed all blk_put_queue() is called in context which is allowed
to sleep. However, this assumption isn't true because we release disk's
reference in partition's percpu_ref's ->release() which doesn't allow
to sleep, because the ->release() is run via call_rcu().

Fixes this issue by moving put disk reference into hd_struct_free_work()

Fixes: e8c7d14ac6c3 ("block: revert back to synchronous request_queue removal")
Reported-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: ensure bdi->io_pages is always initialized
Jens Axboe [Mon, 31 Aug 2020 17:20:02 +0000 (11:20 -0600)]
block: ensure bdi->io_pages is always initialized

If a driver leaves the limit settings as the defaults, then we don't
initialize bdi->io_pages. This means that file systems may need to
work around bdi->io_pages == 0, which is somewhat messy.

Initialize the default value just like we do for ->ra_pages.

Cc: stable@vger.kernel.org
Fixes: 9491ae4aade6 ("mm: don't cap request size based on read-ahead setting")
Reported-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoLinux 5.9-rc3
Linus Torvalds [Sun, 30 Aug 2020 23:01:54 +0000 (16:01 -0700)]
Linux 5.9-rc3

3 years agoMerge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Linus Torvalds [Sun, 30 Aug 2020 22:53:44 +0000 (15:53 -0700)]
Merge branch 'linus' of git://git./linux/kernel/git/herbert/crypto-2.6

Pull crypto fixes from Herbert Xu:

 - fix regression in af_alg that affects iwd

 - restore polling delay in qat

 - fix double free in ingenic on error path

 - fix potential build failure in sa2ul due to missing Kconfig dependency

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: af_alg - Work around empty control messages without MSG_MORE
  crypto: sa2ul - add Kconfig selects to fix build error
  crypto: ingenic - Drop kfree for memory allocated with devm_kzalloc
  crypto: qat - add delay before polling mailbox

3 years agoMerge tag 'x86-urgent-2020-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 30 Aug 2020 19:01:23 +0000 (12:01 -0700)]
Merge tag 'x86-urgent-2020-08-30' of git://git./linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:
 "Three interrupt related fixes for X86:

   - Move disabling of the local APIC after invoking fixup_irqs() to
     ensure that interrupts which are incoming are noted in the IRR and
     not ignored.

   - Unbreak affinity setting.

     The rework of the entry code reused the regular exception entry
     code for device interrupts. The vector number is pushed into the
     errorcode slot on the stack which is then lifted into an argument
     and set to -1 because that's regs->orig_ax which is used in quite
     some places to check whether the entry came from a syscall.

     But it was overlooked that orig_ax is used in the affinity cleanup
     code to validate whether the interrupt has arrived on the new
     target. It turned out that this vector check is pointless because
     interrupts are never moved from one vector to another on the same
     CPU. That check is a historical leftover from the time where x86
     supported multi-CPU affinities, but not longer needed with the now
     strict single CPU affinity. Famous last words ...

   - Add a missing check for an empty cpumask into the matrix allocator.

     The affinity change added a warning to catch the case where an
     interrupt is moved on the same CPU to a different vector. This
     triggers because a condition with an empty cpumask returns an
     assignment from the allocator as the allocator uses for_each_cpu()
     without checking the cpumask for being empty. The historical
     inconsistent for_each_cpu() behaviour of ignoring the cpumask and
     unconditionally claiming that CPU0 is in the mask struck again.
     Sigh.

  plus a new entry into the MAINTAINER file for the HPE/UV platform"

* tag 'x86-urgent-2020-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq/matrix: Deal with the sillyness of for_each_cpu() on UP
  x86/irq: Unbreak interrupt affinity setting
  x86/hotplug: Silence APIC only after all interrupts are migrated
  MAINTAINERS: Add entry for HPE Superdome Flex (UV) maintainers

3 years agoMerge tag 'irq-urgent-2020-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 30 Aug 2020 18:56:54 +0000 (11:56 -0700)]
Merge tag 'irq-urgent-2020-08-30' of git://git./linux/kernel/git/tip/tip

Pull irq fixes from Thomas Gleixner:
 "A set of fixes for interrupt chip drivers:

   - Revert the platform driver conversion of interrupt chip drivers as
     it turned out to create more problems than it solves.

   - Fix a trivial typo in the new module helpers which made probing
     reliably fail.

   - Small fixes in the STM32 and MIPS Ingenic drivers

   - The TI firmware rework which had badly managed dependencies and had
     to wait post rc1"

* tag 'irq-urgent-2020-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/ingenic: Leave parent IRQ unmasked on suspend
  irqchip/stm32-exti: Avoid losing interrupts due to clearing pending bits by mistake
  irqchip: Revert modular support for drivers using IRQCHIP_PLATFORM_DRIVER helperse
  irqchip: Fix probing deferal when using IRQCHIP_PLATFORM_DRIVER helpers
  arm64: dts: k3-am65: Update the RM resource types
  arm64: dts: k3-am65: ti-sci-inta/intr: Update to latest bindings
  arm64: dts: k3-j721e: ti-sci-inta/intr: Update to latest bindings
  irqchip/ti-sci-inta: Add support for INTA directly connecting to GIC
  irqchip/ti-sci-inta: Do not store TISCI device id in platform device id field
  dt-bindings: irqchip: Convert ti, sci-inta bindings to yaml
  dt-bindings: irqchip: ti, sci-inta: Update docs to support different parent.
  irqchip/ti-sci-intr: Add support for INTR being a parent to INTR
  dt-bindings: irqchip: Convert ti, sci-intr bindings to yaml
  dt-bindings: irqchip: ti, sci-intr: Update bindings to drop the usage of gic as parent
  firmware: ti_sci: Add support for getting resource with subtype
  firmware: ti_sci: Drop unused structure ti_sci_rm_type_map
  firmware: ti_sci: Drop the device id to resource type translation

3 years agoMerge tag 'sched-urgent-2020-08-30' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 30 Aug 2020 18:50:53 +0000 (11:50 -0700)]
Merge tag 'sched-urgent-2020-08-30' of git://git./linux/kernel/git/tip/tip

Pull scheduler fix from Thomas Gleixner:
 "A single fix for the scheduler:

   - Make is_idle_task() __always_inline to prevent the compiler from
     putting it out of line into the wrong section because it's used
     inside noinstr sections"

* tag 'sched-urgent-2020-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Use __always_inline on is_idle_task()

3 years agoMerge tag 'locking-urgent-2020-08-30' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 30 Aug 2020 18:43:50 +0000 (11:43 -0700)]
Merge tag 'locking-urgent-2020-08-30' of git://git./linux/kernel/git/tip/tip

Pull locking fixes from Thomas Gleixner:
 "A set of fixes for lockdep, tracing and RCU:

   - Prevent recursion by using raw_cpu_* operations

   - Fixup the interrupt state in the cpu idle code to be consistent

   - Push rcu_idle_enter/exit() invocations deeper into the idle path so
     that the lock operations are inside the RCU watching sections

   - Move trace_cpu_idle() into generic code so it's called before RCU
     goes idle.

   - Handle raw_local_irq* vs. local_irq* operations correctly

   - Move the tracepoints out from under the lockdep recursion handling
     which turned out to be fragile and inconsistent"

* tag 'locking-urgent-2020-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lockdep,trace: Expose tracepoints
  lockdep: Only trace IRQ edges
  mips: Implement arch_irqs_disabled()
  arm64: Implement arch_irqs_disabled()
  nds32: Implement arch_irqs_disabled()
  locking/lockdep: Cleanup
  x86/entry: Remove unused THUNKs
  cpuidle: Move trace_cpu_idle() into generic code
  cpuidle: Make CPUIDLE_FLAG_TLB_FLUSHED generic
  sched,idle,rcu: Push rcu_idle deeper into the idle path
  cpuidle: Fixup IRQ state
  lockdep: Use raw_cpu_*() for per-cpu variables

3 years agoMerge tag '5.9-rc2-smb-fix' of git://git.samba.org/sfrench/cifs-2.6
Linus Torvalds [Sun, 30 Aug 2020 18:38:21 +0000 (11:38 -0700)]
Merge tag '5.9-rc2-smb-fix' of git://git.samba.org/sfrench/cifs-2.6

Pull cfis fix from Steve French:
 "DFS fix for referral problem when using SMB1"

* tag '5.9-rc2-smb-fix' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: fix check of tcon dfs in smb1

3 years agoMerge tag 'powerpc-5.9-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc...
Linus Torvalds [Sun, 30 Aug 2020 17:56:12 +0000 (10:56 -0700)]
Merge tag 'powerpc-5.9-4' of git://git./linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:

 - Revert our removal of PROT_SAO, at least one user expressed an
   interest in using it on Power9. Instead don't allow it to be used in
   guests unless enabled explicitly at compile time.

 - A fix for a crash introduced by a recent change to FP handling.

 - Revert a change to our idle code that left Power10 with no idle
   support.

 - One minor fix for the new scv system call path to set PPR.

 - Fix a crash in our "generic" PMU if branch stack events were enabled.

 - A fix for the IMC PMU, to correctly identify host kernel samples.

 - The ADB_PMU powermac code was found to be incompatible with
   VMAP_STACK, so make them incompatible in Kconfig until the code can
   be fixed.

 - A build fix in drivers/video/fbdev/controlfb.c, and a documentation
   fix.

Thanks to Alexey Kardashevskiy, Athira Rajeev, Christophe Leroy,
Giuseppe Sacco, Madhavan Srinivasan, Milton Miller, Nicholas Piggin,
Pratik Rajesh Sampat, Randy Dunlap, Shawn Anastasio, Vaidyanathan
Srinivasan.

* tag 'powerpc-5.9-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/32s: Disable VMAP stack which CONFIG_ADB_PMU
  Revert "powerpc/powernv/idle: Replace CPU feature check with PVR check"
  powerpc/perf: Fix reading of MSR[HV/PR] bits in trace-imc
  powerpc/perf: Fix crashes with generic_compat_pmu & BHRB
  powerpc/64s: Fix crash in load_fp_state() due to fpexc_mode
  powerpc/64s: scv entry should set PPR
  Documentation/powerpc: fix malformed table in syscall64-abi
  video: fbdev: controlfb: Fix build for COMPILE_TEST=y && PPC_PMAC=n
  selftests/powerpc: Update PROT_SAO test to skip ISA 3.1
  powerpc/64s: Disallow PROT_SAO in LPARs by default
  Revert "powerpc/64s: Remove PROT_SAO support"

3 years agoMerge tag 'usb-5.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Linus Torvalds [Sun, 30 Aug 2020 17:51:03 +0000 (10:51 -0700)]
Merge tag 'usb-5.9-rc3' of git://git./linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
 "Let's try this again...  Here are some USB fixes for 5.9-rc3.

  This differs from the previous pull request for this release in that
  the usb gadget patch now does not break some systems, and actually
  does what it was intended to do. Many thanks to Marek Szyprowski for
  quickly noticing and testing the patch from Andy Shevchenko to resolve
  this issue.

  Additionally, some more new USB quirks have been added to get some new
  devices to work properly based on user reports.

  Other than that, the patches are all here, and they contain:

   - usb gadget driver fixes

   - xhci driver fixes

   - typec fixes

   - new quirks and ids

   - fixes for USB patches that went into 5.9-rc1.

  All of these have been tested in linux-next with no reported issues"

* tag 'usb-5.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (33 commits)
  usb: storage: Add unusual_uas entry for Sony PSZ drives
  USB: Ignore UAS for JMicron JMS567 ATA/ATAPI Bridge
  usb: host: ohci-exynos: Fix error handling in exynos_ohci_probe()
  USB: gadget: u_f: Unbreak offset calculation in VLAs
  USB: quirks: Ignore duplicate endpoint on Sound Devices MixPre-D
  usb: typec: tcpm: Fix Fix source hard reset response for TDA 2.3.1.1 and TDA 2.3.1.2 failures
  USB: PHY: JZ4770: Fix static checker warning.
  USB: gadget: f_ncm: add bounds checks to ncm_unwrap_ntb()
  USB: gadget: u_f: add overflow checks to VLA macros
  xhci: Always restore EP_SOFT_CLEAR_TOGGLE even if ep reset failed
  xhci: Do warm-reset when both CAS and XDEV_RESUME are set
  usb: host: xhci: fix ep context print mismatch in debugfs
  usb: uas: Add quirk for PNY Pro Elite
  tools: usb: move to tools buildsystem
  USB: Fix device driver race
  USB: Also match device drivers using the ->match vfunc
  usb: host: xhci-tegra: fix tegra_xusb_get_phy()
  usb: host: xhci-tegra: otg usb2/usb3 port init
  usb: hcd: Fix use after free in usb_hcd_pci_remove()
  usb: typec: ucsi: Hold con->lock for the entire duration of ucsi_register_port()
  ...

3 years agoMerge tag 'edac_urgent_for_v5.9_rc3' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 30 Aug 2020 17:47:23 +0000 (10:47 -0700)]
Merge tag 'edac_urgent_for_v5.9_rc3' of git://git./linux/kernel/git/ras/ras

Pull EDAC fix from Borislav Petkov:
 "A fix to properly clear ghes_edac driver state on driver remove so
  that a subsequent load can probe the system properly (Shiju Jose)"

* tag 'edac_urgent_for_v5.9_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  EDAC/ghes: Fix NULL pointer dereference in ghes_edac_register()

3 years agoMerge tag 'dma-mapping-5.9-2' of git://git.infradead.org/users/hch/dma-mapping
Linus Torvalds [Sun, 30 Aug 2020 17:37:18 +0000 (10:37 -0700)]
Merge tag 'dma-mapping-5.9-2' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fix from Christoph Hellwig:
 "Fix a possibly uninitialized variable (Dan Carpenter)"

* tag 'dma-mapping-5.9-2' of git://git.infradead.org/users/hch/dma-mapping:
  dma-pool: Fix an uninitialized variable bug in atomic_pool_expand()

3 years agogenirq/matrix: Deal with the sillyness of for_each_cpu() on UP
Thomas Gleixner [Sun, 30 Aug 2020 17:07:53 +0000 (19:07 +0200)]
genirq/matrix: Deal with the sillyness of for_each_cpu() on UP

Most of the CPU mask operations behave the same way, but for_each_cpu() and
it's variants ignore the cpumask argument and claim that CPU0 is always in
the mask. This is historical, inconsistent and annoying behaviour.

The matrix allocator uses for_each_cpu() and can be called on UP with an
empty cpumask. The calling code does not expect that this succeeds but
until commit e027fffff799 ("x86/irq: Unbreak interrupt affinity setting")
this went unnoticed. That commit added a WARN_ON() to catch cases which
move an interrupt from one vector to another on the same CPU. The warning
triggers on UP.

Add a check for the cpumask being empty to prevent this.

Fixes: 2f75d9e1c905 ("genirq: Implement bitmap matrix allocator")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
3 years agoMerge tag 'fallthrough-fixes-5.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 29 Aug 2020 21:21:58 +0000 (14:21 -0700)]
Merge tag 'fallthrough-fixes-5.9-rc3' of git://git./linux/kernel/git/gustavoars/linux

Pull fallthrough fixes from Gustavo A. R. Silva:
 "Fix some minor issues introduced by the recent treewide fallthrough
  conversions:

   - Fix identation issue

   - Fix erroneous fallthrough annotation

   - Remove unnecessary fallthrough annotation

   - Fix code comment changed by fallthrough conversion"

* tag 'fallthrough-fixes-5.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux:
  arm64/cpuinfo: Remove unnecessary fallthrough annotation
  media: dib0700: Fix identation issue in dib8096_set_param_override()
  afs: Remove erroneous fallthough annotation
  iio: dpot-dac: fix code comment in dpot_dac_read_raw()

3 years agofsldma: fix very broken 32-bit ppc ioread64 functionality
Linus Torvalds [Sat, 29 Aug 2020 20:50:56 +0000 (13:50 -0700)]
fsldma: fix very broken 32-bit ppc ioread64 functionality

Commit ef91bb196b0d ("kernel.h: Silence sparse warning in
lower_32_bits") caused new warnings to show in the fsldma driver, but
that commit was not to blame: it only exposed some very incorrect code
that tried to take the low 32 bits of an address.

That made no sense for multiple reasons, the most notable one being that
that code was intentionally limited to only 32-bit ppc builds, so "only
low 32 bits of an address" was completely nonsensical.  There were no
high bits to mask off to begin with.

But even more importantly fropm a correctness standpoint, turning the
address into an integer then caused the subsequent address arithmetic to
be completely wrong too, and the "+1" actually incremented the address
by one, rather than by four.

Which again was incorrect, since the code was reading two 32-bit values
and trying to make a 64-bit end result of it all.  Surprisingly, the
iowrite64() did not suffer from the same odd and incorrect model.

This code has never worked, but it's questionable whether anybody cared:
of the two users that actually read the 64-bit value (by way of some C
preprocessor hackery and eventually the 'get_cdar()' inline function),
one of them explicitly ignored the value, and the other one might just
happen to work despite the incorrect value being read.

This patch at least makes it not fail the build any more, and makes the
logic superficially sane.  Whether it makes any difference to the code
_working_ or not shall remain a mystery.

Compile-tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoMerge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa...
Linus Torvalds [Sat, 29 Aug 2020 20:13:44 +0000 (13:13 -0700)]
Merge branch 'i2c/for-current' of git://git./linux/kernel/git/wsa/linux

Pull i2c fixes from Wolfram Sang:
 "A core fix for ACPI matching and two driver bugfixes"

* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: iproc: Fix shifting 31 bits
  i2c: rcar: in slave mode, clear NACK earlier
  i2c: acpi: Remove dead code, i.e. i2c_acpi_match_device()
  i2c: core: Don't fail PRP0001 enumeration when no ID table exist

3 years agoMerge tag 's390-5.9-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Linus Torvalds [Sat, 29 Aug 2020 19:51:58 +0000 (12:51 -0700)]
Merge tag 's390-5.9-4' of git://git./linux/kernel/git/s390/linux

Pull s390 fixes from Vasily Gorbik:

 - Disable preemption trace in percpu macros since the lockdep code
   itself uses percpu variables now and it causes recursions.

 - Fix kernel space 4-level paging broken by recent vmem rework.

* tag 's390-5.9-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/vmem: fix vmem_add_range for 4-level paging
  s390: don't trace preemption in percpu macros

3 years agoMerge tag 'for-linus-5.9-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sat, 29 Aug 2020 19:44:30 +0000 (12:44 -0700)]
Merge tag 'for-linus-5.9-rc3-tag' of git://git./linux/kernel/git/xen/tip

Pull xen fixes from Juergen Gross:
 "Two fixes for Xen: one needed for ongoing work to support virtio with
  Xen, and one for a corner case in IRQ handling with Xen"

* tag 'for-linus-5.9-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  arm/xen: Add misuse warning to virt_to_gfn
  xen/xenbus: Fix granting of vmalloc'd memory
  XEN uses irqdesc::irq_data_common::handler_data to store a per interrupt XEN data pointer which contains XEN specific information.

3 years agoMerge tag 'hwmon-for-v5.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/groec...
Linus Torvalds [Sat, 29 Aug 2020 19:37:00 +0000 (12:37 -0700)]
Merge tag 'hwmon-for-v5.9-rc3' of git://git./linux/kernel/git/groeck/linux-staging

Pull hwmon fixes from Guenter Roeck:

 - Fix tempeerature scale in gsc-hwmon driver

 - Fix divide by 0 error in nct7904 driver

 - Drop non-existing attribute from pmbus/isl68137 driver

 - Fix status check in applesmc driver

* tag 'hwmon-for-v5.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
  hwmon: (gsc-hwmon) Scale temperature to millidegrees
  hwmon: (applesmc) check status earlier.
  hwmon: (nct7904) Correct divide by 0
  hwmon: (pmbus/isl68137) remove READ_TEMPERATURE_1 telemetry for RAA228228

3 years agoMerge branch 'nvme-5.9-rc' of git://git.infradead.org/nvme into block-5.9
Jens Axboe [Sat, 29 Aug 2020 16:54:11 +0000 (10:54 -0600)]
Merge branch 'nvme-5.9-rc' of git://git.infradead.org/nvme into block-5.9

Pull NVMe fixes from Sagi:

"- instance leak and io boundary fixes from Keith
 - fc locking fix from Christophe
 - various tcp/rdma reset during traffic fixes from Me
 - pci use-after-free fix from Tong
 - tcp target null deref fix from Ziye"

* 'nvme-5.9-rc' of git://git.infradead.org/nvme:
  nvme-pci: cancel nvme device request before disabling
  nvme: only use power of two io boundaries
  nvme: fix controller instance leak
  nvmet-fc: Fix a missed _irqsave version of spin_lock in 'nvmet_fc_fod_op_done()'
  nvme: Fix NULL dereference for pci nvme controllers
  nvme-rdma: fix reset hang if controller died in the middle of a reset
  nvme-rdma: fix timeout handler
  nvme-rdma: serialize controller teardown sequences
  nvme-tcp: fix reset hang if controller died in the middle of a reset
  nvme-tcp: fix timeout handler
  nvme-tcp: serialize controller teardown sequences
  nvme: have nvme_wait_freeze_timeout return if it timed out
  nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance
  nvmet-tcp: Fix NULL dereference when a connect data comes in h2cdata pdu

3 years agonvme-pci: cancel nvme device request before disabling
Tong Zhang [Fri, 28 Aug 2020 14:17:08 +0000 (10:17 -0400)]
nvme-pci: cancel nvme device request before disabling

This patch addresses an irq free warning and null pointer dereference
error problem when nvme devices got timeout error during initialization.
This problem happens when nvme_timeout() function is called while
nvme_reset_work() is still in execution. This patch fixed the problem by
setting flag of the problematic request to NVME_REQ_CANCELLED before
calling nvme_dev_disable() to make sure __nvme_submit_sync_cmd() returns
an error code and let nvme_submit_sync_cmd() fail gracefully.
The following is console output.

[   62.472097] nvme nvme0: I/O 13 QID 0 timeout, disable controller
[   62.488796] nvme nvme0: could not set timestamp (881)
[   62.494888] ------------[ cut here ]------------
[   62.495142] Trying to free already-free IRQ 11
[   62.495366] WARNING: CPU: 0 PID: 7 at kernel/irq/manage.c:1751 free_irq+0x1f7/0x370
[   62.495742] Modules linked in:
[   62.495902] CPU: 0 PID: 7 Comm: kworker/u4:0 Not tainted 5.8.0+ #8
[   62.496206] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812dda519-p4
[   62.496772] Workqueue: nvme-reset-wq nvme_reset_work
[   62.497019] RIP: 0010:free_irq+0x1f7/0x370
[   62.497223] Code: e8 ce 49 11 00 48 83 c4 08 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 44 89 f6 48 c70
[   62.498133] RSP: 0000:ffffa96800043d40 EFLAGS: 00010086
[   62.498391] RAX: 0000000000000000 RBX: ffff9b87fc458400 RCX: 0000000000000000
[   62.498741] RDX: 0000000000000001 RSI: 0000000000000096 RDI: ffffffff9693d72c
[   62.499091] RBP: ffff9b87fd4c8f60 R08: ffffa96800043bfd R09: 0000000000000163
[   62.499440] R10: ffffa96800043bf8 R11: ffffa96800043bfd R12: ffff9b87fd4c8e00
[   62.499790] R13: ffff9b87fd4c8ea4 R14: 000000000000000b R15: ffff9b87fd76b000
[   62.500140] FS:  0000000000000000(0000) GS:ffff9b87fdc00000(0000) knlGS:0000000000000000
[   62.500534] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   62.500816] CR2: 0000000000000000 CR3: 000000003aa0a000 CR4: 00000000000006f0
[   62.501165] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   62.501515] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   62.501864] Call Trace:
[   62.501993]  pci_free_irq+0x13/0x20
[   62.502167]  nvme_reset_work+0x5d0/0x12a0
[   62.502369]  ? update_load_avg+0x59/0x580
[   62.502569]  ? ttwu_queue_wakelist+0xa8/0xc0
[   62.502780]  ? try_to_wake_up+0x1a2/0x450
[   62.502979]  process_one_work+0x1d2/0x390
[   62.503179]  worker_thread+0x45/0x3b0
[   62.503361]  ? process_one_work+0x390/0x390
[   62.503568]  kthread+0xf9/0x130
[   62.503726]  ? kthread_park+0x80/0x80
[   62.503911]  ret_from_fork+0x22/0x30
[   62.504090] ---[ end trace de9ed4a70f8d71e2 ]---
[  123.912275] nvme nvme0: I/O 12 QID 0 timeout, disable controller
[  123.914670] nvme nvme0: 1/0/0 default/read/poll queues
[  123.916310] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  123.917469] #PF: supervisor write access in kernel mode
[  123.917725] #PF: error_code(0x0002) - not-present page
[  123.917976] PGD 0 P4D 0
[  123.918109] Oops: 0002 [#1] SMP PTI
[  123.918283] CPU: 0 PID: 7 Comm: kworker/u4:0 Tainted: G        W         5.8.0+ #8
[  123.918650] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812dda519-p4
[  123.919219] Workqueue: nvme-reset-wq nvme_reset_work
[  123.919469] RIP: 0010:__blk_mq_alloc_map_and_request+0x21/0x80
[  123.919757] Code: 66 0f 1f 84 00 00 00 00 00 41 55 41 54 55 48 63 ee 53 48 8b 47 68 89 ee 48 89 fb 8b4
[  123.920657] RSP: 0000:ffffa96800043d40 EFLAGS: 00010286
[  123.920912] RAX: ffff9b87fc4fee40 RBX: ffff9b87fc8cb008 RCX: 0000000000000000
[  123.921258] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9b87fc618000
[  123.921602] RBP: 0000000000000000 R08: ffff9b87fdc2c4a0 R09: ffff9b87fc616000
[  123.921949] R10: 0000000000000000 R11: ffff9b87fffd1500 R12: 0000000000000000
[  123.922295] R13: 0000000000000000 R14: ffff9b87fc8cb200 R15: ffff9b87fc8cb000
[  123.922641] FS:  0000000000000000(0000) GS:ffff9b87fdc00000(0000) knlGS:0000000000000000
[  123.923032] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  123.923312] CR2: 0000000000000000 CR3: 000000003aa0a000 CR4: 00000000000006f0
[  123.923660] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  123.924007] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  123.924353] Call Trace:
[  123.924479]  blk_mq_alloc_tag_set+0x137/0x2a0
[  123.924694]  nvme_reset_work+0xed6/0x12a0
[  123.924898]  process_one_work+0x1d2/0x390
[  123.925099]  worker_thread+0x45/0x3b0
[  123.925280]  ? process_one_work+0x390/0x390
[  123.925486]  kthread+0xf9/0x130
[  123.925642]  ? kthread_park+0x80/0x80
[  123.925825]  ret_from_fork+0x22/0x30
[  123.926004] Modules linked in:
[  123.926158] CR2: 0000000000000000
[  123.926322] ---[ end trace de9ed4a70f8d71e3 ]---
[  123.926549] RIP: 0010:__blk_mq_alloc_map_and_request+0x21/0x80
[  123.926832] Code: 66 0f 1f 84 00 00 00 00 00 41 55 41 54 55 48 63 ee 53 48 8b 47 68 89 ee 48 89 fb 8b4
[  123.927734] RSP: 0000:ffffa96800043d40 EFLAGS: 00010286
[  123.927989] RAX: ffff9b87fc4fee40 RBX: ffff9b87fc8cb008 RCX: 0000000000000000
[  123.928336] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9b87fc618000
[  123.928679] RBP: 0000000000000000 R08: ffff9b87fdc2c4a0 R09: ffff9b87fc616000
[  123.929025] R10: 0000000000000000 R11: ffff9b87fffd1500 R12: 0000000000000000
[  123.929370] R13: 0000000000000000 R14: ffff9b87fc8cb200 R15: ffff9b87fc8cb000
[  123.929715] FS:  0000000000000000(0000) GS:ffff9b87fdc00000(0000) knlGS:0000000000000000
[  123.930106] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  123.930384] CR2: 0000000000000000 CR3: 000000003aa0a000 CR4: 00000000000006f0
[  123.930731] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  123.931077] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Co-developed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Tong Zhang <ztong0001@gmail.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
3 years agonvme: only use power of two io boundaries
Keith Busch [Thu, 27 Aug 2020 17:38:57 +0000 (10:38 -0700)]
nvme: only use power of two io boundaries

The kernel requires a power of two for boundaries because that's the
only way it can efficiently split commands that cross them. A
controller, however, may report a non-power of two boundary.

The driver had been rounding the controller's value to one the kernel
can use, but splitting on the wrong boundary provides no benefit on the
device side, and incurs additional submission overhead from non-optimal
splits.

Don't provide any boundary hint if the controller's value can't be used
and log a warning when first scanning a disk's unreported IO boundary.
Since the chunk sector logic has grown, move it to a separate function.

Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
3 years agonvme: fix controller instance leak
Keith Busch [Wed, 26 Aug 2020 17:53:04 +0000 (10:53 -0700)]
nvme: fix controller instance leak

If the driver has to unbind from the controller for an early failure
before the subsystem has been set up, there won't be a subsystem holding
the controller's instance, so the controller needs to free its own
instance in this case.

Fixes: 733e4b69d508d ("nvme: Assign subsys instance from first ctrl")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
3 years agonvmet-fc: Fix a missed _irqsave version of spin_lock in 'nvmet_fc_fod_op_done()'
Christophe JAILLET [Fri, 21 Aug 2020 07:58:19 +0000 (09:58 +0200)]
nvmet-fc: Fix a missed _irqsave version of spin_lock in 'nvmet_fc_fod_op_done()'

The way 'spin_lock()' and 'spin_lock_irqsave()' are used is not consistent
in this function.

Use 'spin_lock_irqsave()' also here, as there is no guarantee that
interruptions are disabled at that point, according to surrounding code.

Fixes: a97ec51b37ef ("nvmet_fc: Rework target side abort handling")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
3 years agonvme: Fix NULL dereference for pci nvme controllers
Sagi Grimberg [Mon, 24 Aug 2020 22:47:25 +0000 (15:47 -0700)]
nvme: Fix NULL dereference for pci nvme controllers

PCIe controllers do not have fabric opts, verify they exist before
showing ctrl_loss_tmo or reconnect_delay attributes.

Fixes: 764075fdcb2f ("nvme: expose reconnect_delay and ctrl_loss_tmo via sysfs")
Reported-by: Tobias Markus <tobias@markus-regensburg.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
3 years agonvme-rdma: fix reset hang if controller died in the middle of a reset
Sagi Grimberg [Thu, 30 Jul 2020 20:42:42 +0000 (13:42 -0700)]
nvme-rdma: fix reset hang if controller died in the middle of a reset

If the controller becomes unresponsive in the middle of a reset, we
will hang because we are waiting for the freeze to complete, but that
cannot happen since we have commands that are inflight holding the
q_usage_counter, and we can't blindly fail requests that times out.

So give a timeout and if we cannot wait for queue freeze before
unfreezing, fail and have the error handling take care how to
proceed (either schedule a reconnect of remove the controller).

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
3 years agonvme-rdma: fix timeout handler
Sagi Grimberg [Wed, 29 Jul 2020 09:36:03 +0000 (02:36 -0700)]
nvme-rdma: fix timeout handler

When a request times out in a LIVE state, we simply trigger error
recovery and let the error recovery handle the request cancellation,
however when a request times out in a non LIVE state, we make sure to
complete it immediately as it might block controller setup or teardown
and prevent forward progress.

However tearing down the entire set of I/O and admin queues causes
freeze/unfreeze imbalance (q->mq_freeze_depth) because and is really
an overkill to what we actually need, which is to just fence controller
teardown that may be running, stop the queue, and cancel the request if
it is not already completed.

Now that we have the controller teardown_lock, we can safely serialize
request cancellation. This addresses a hang caused by calling extra
queue freeze on controller namespaces, causing unfreeze to not complete
correctly.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
3 years agonvme-rdma: serialize controller teardown sequences
Sagi Grimberg [Thu, 6 Aug 2020 01:13:58 +0000 (18:13 -0700)]
nvme-rdma: serialize controller teardown sequences

In the timeout handler we may need to complete a request because the
request that timed out may be an I/O that is a part of a serial sequence
of controller teardown or initialization. In order to complete the
request, we need to fence any other context that may compete with us
and complete the request that is timing out.

In this case, we could have a potential double completion in case
a hard-irq or a different competing context triggered error recovery
and is running inflight request cancellation concurrently with the
timeout handler.

Protect using a ctrl teardown_lock to serialize contexts that may
complete a cancelled request due to error recovery or a reset.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
3 years agonvme-tcp: fix reset hang if controller died in the middle of a reset
Sagi Grimberg [Thu, 30 Jul 2020 20:25:34 +0000 (13:25 -0700)]
nvme-tcp: fix reset hang if controller died in the middle of a reset

If the controller becomes unresponsive in the middle of a reset, we will
hang because we are waiting for the freeze to complete, but that cannot
happen since we have commands that are inflight holding the
q_usage_counter, and we can't blindly fail requests that times out.

So give a timeout and if we cannot wait for queue freeze before
unfreezing, fail and have the error handling take care how to proceed
(either schedule a reconnect of remove the controller).

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
3 years agonvme-tcp: fix timeout handler
Sagi Grimberg [Tue, 28 Jul 2020 20:16:36 +0000 (13:16 -0700)]
nvme-tcp: fix timeout handler

When a request times out in a LIVE state, we simply trigger error
recovery and let the error recovery handle the request cancellation,
however when a request times out in a non LIVE state, we make sure to
complete it immediately as it might block controller setup or teardown
and prevent forward progress.

However tearing down the entire set of I/O and admin queues causes
freeze/unfreeze imbalance (q->mq_freeze_depth) because and is really
an overkill to what we actually need, which is to just fence controller
teardown that may be running, stop the queue, and cancel the request if
it is not already completed.

Now that we have the controller teardown_lock, we can safely serialize
request cancellation. This addresses a hang caused by calling extra
queue freeze on controller namespaces, causing unfreeze to not complete
correctly.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
3 years agonvme-tcp: serialize controller teardown sequences
Sagi Grimberg [Thu, 6 Aug 2020 01:13:48 +0000 (18:13 -0700)]
nvme-tcp: serialize controller teardown sequences

In the timeout handler we may need to complete a request because the
request that timed out may be an I/O that is a part of a serial sequence
of controller teardown or initialization. In order to complete the
request, we need to fence any other context that may compete with us
and complete the request that is timing out.

In this case, we could have a potential double completion in case
a hard-irq or a different competing context triggered error recovery
and is running inflight request cancellation concurrently with the
timeout handler.

Protect using a ctrl teardown_lock to serialize contexts that may
complete a cancelled request due to error recovery or a reset.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
3 years agonvme: have nvme_wait_freeze_timeout return if it timed out
Sagi Grimberg [Thu, 30 Jul 2020 20:24:45 +0000 (13:24 -0700)]
nvme: have nvme_wait_freeze_timeout return if it timed out

Users can detect if the wait has completed or not and take appropriate
actions based on this information (e.g. weather to continue
initialization or rather fail and schedule another initialization
attempt).

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
3 years agonvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance
Sagi Grimberg [Fri, 14 Aug 2020 18:46:51 +0000 (11:46 -0700)]
nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance

NVME_CTRL_NEW should never see any I/O, because in order to start
initialization it has to transition to NVME_CTRL_CONNECTING and from
there it will never return to this state.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>