Christoph Hellwig [Mon, 4 Mar 2024 14:04:52 +0000 (07:04 -0700)]
nvme: move a few things out of nvme_update_disk_info
Move setting up the integrity profile and setting the disk capacity out
of nvme_update_disk_info to get nvme_update_disk_info into a shape where
it just sets queue_limits eventually.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Christoph Hellwig [Mon, 4 Mar 2024 14:04:51 +0000 (07:04 -0700)]
nvme: don't use nvme_update_disk_info for the multipath disk
Currently nvme_update_ns_info_block calls nvme_update_disk_info both for
the namespace attached disk, and the multipath one (if it exists). This
is very different from how other stacking drivers work, and leads to
a lot of complexity.
Switch to setting the disk capacity and initializing the integrity
profile, and let blk_stack_limits, which is already called just below,
deal with updating the other limits.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Christoph Hellwig [Mon, 4 Mar 2024 14:04:50 +0000 (07:04 -0700)]
nvme: move blk_integrity_unregister into nvme_init_integrity
Move unregistering the existing integrity profile into the helper
dealing with all the other integrity / metadata setup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Christoph Hellwig [Mon, 4 Mar 2024 14:04:49 +0000 (07:04 -0700)]
nvme: cleanup the nvme_init_integrity calling conventions
Handle the no metadata support case in nvme_init_integrity as well to
simplify the calling convention and prepare for future changes in the
area.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Christoph Hellwig [Mon, 4 Mar 2024 14:04:48 +0000 (07:04 -0700)]
nvme: move max_integrity_segments handling out of nvme_init_integrity
max_integrity_segments is just a hardware limit and doesn't need to be
in nvme_init_integrity with the PI setup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Christoph Hellwig [Mon, 4 Mar 2024 14:04:47 +0000 (07:04 -0700)]
nvme: remove nvme_revalidate_zones
Handle setting the zone size / chunk_sectors and max_append_sectors
limits together with the other ZNS limits, and just open code the
call to blk_revalidate_zones in the current place.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Christoph Hellwig [Mon, 4 Mar 2024 14:04:46 +0000 (07:04 -0700)]
nvme: move NVME_QUIRK_DEALLOCATE_ZEROES out of nvme_config_discard
Move the handling of the NVME_QUIRK_DEALLOCATE_ZEROES quirk out of
nvme_config_discard so that it is combined with the normal write_zeroes
limit handling.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Christoph Hellwig [Mon, 4 Mar 2024 14:04:45 +0000 (07:04 -0700)]
nvme: set max_hw_sectors unconditionally
All transports set a max_hw_sectors value in the nvme_ctrl, so make
the code using it unconditional and clean it up using a little helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Guixin Liu [Wed, 28 Feb 2024 02:37:59 +0000 (10:37 +0800)]
nvme-fabrics: check max outstanding commands
Maxcmd is mandatory for fabrics, so check it early to identify the root
cause instead of waiting for it to propagate to "sqsize" and "allocating
queue".
Also rename nvme_check_ctrl_fabric_info() to
nvmf_validate_identify_ctrl().
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Max Gurtovoy [Tue, 23 Jan 2024 14:40:32 +0000 (16:40 +0200)]
nvmet-rdma: set max_queue_size for RDMA transport
A new port configuration was added to set max_queue_size. Clamp user
configuration to RDMA transport limits.
Increase the maximal queue size of RDMA controllers from 128 to 256
(the default size stays at 128, the same as before).
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Max Gurtovoy [Tue, 23 Jan 2024 14:40:31 +0000 (16:40 +0200)]
nvmet: introduce new max queue size configuration entry
Using this port configuration, one will be able to set the maximal queue
size to be used for any controller that is associated with the
configured port.
The default value stays 1024, but each transport will be able to set its
own value before enabling the port.
Introduce a lower limit of 16 for the minimal queue depth (the same as we
use in the host fabrics drivers).
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Max Gurtovoy [Tue, 23 Jan 2024 14:40:30 +0000 (16:40 +0200)]
nvme-rdma: clamp queue size according to ctrl cap
If a controller is configured with metadata support, clamp the maximal
queue size to 128, since more resources are needed for metadata
operations. Otherwise, clamp it to 256.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Max Gurtovoy [Tue, 23 Jan 2024 14:40:29 +0000 (16:40 +0200)]
nvme-rdma: introduce NVME_RDMA_MAX_METADATA_QUEUE_SIZE definition
This definition will be used by controllers that are configured with
metadata support. For now, both regular and metadata controllers have
the same maximal queue size, but a later commit will increase the maximal
queue size for regular RDMA controllers to 256.
We'll keep the maximal queue size for metadata controllers at 128,
since more resources are needed for metadata operations and 128 is the
optimal size found for metadata controllers based on testing.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Max Gurtovoy [Tue, 23 Jan 2024 14:40:28 +0000 (16:40 +0200)]
nvmet: set ctrl pi_support cap before initializing cap reg
This is a preparation for setting the maximal queue size of a controller
that supports PI.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Max Gurtovoy [Tue, 23 Jan 2024 14:40:27 +0000 (16:40 +0200)]
nvmet: set maxcmd to be per controller
This is a preparation for having a dynamic configuration of max queue
size for a controller. Make sure that the maxcmd field stays the same as
the MQES (+1) value as we do today.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Max Gurtovoy [Tue, 23 Jan 2024 14:40:26 +0000 (16:40 +0200)]
nvmet: compare mqes and sqsize only for IO SQ
According to the NVMe Spec:
"
MQES: This field indicates the maximum individual queue size that the
controller supports. For NVMe over PCIe implementations, this value
applies to the I/O Submission Queues and I/O Completion Queues that the
host creates. For NVMe over Fabrics implementations, this value applies
to only the I/O Submission Queues that the host creates.
"
Align the target code to compare mqes and sqsize as mentioned in the
NVMe Spec.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Max Gurtovoy [Tue, 23 Jan 2024 14:40:25 +0000 (16:40 +0200)]
nvme-rdma: move NVME_RDMA_IP_PORT from common file
The correct place for this definition is the nvme rdma header file and
not the common nvme header file.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Christoph Hellwig [Thu, 29 Feb 2024 14:38:46 +0000 (06:38 -0800)]
nbd: use the atomic queue limits API in nbd_set_size
Use queue_limits_start_update / queue_limits_commit_update to update
all the limits in one go and with proper sanity checking.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240229143846.1047223-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 29 Feb 2024 14:38:45 +0000 (06:38 -0800)]
nbd: freeze the queue for queue limits updates
nbd currently updates the logical and physical block sizes as well
as the discard_sectors on a live queue. Freeze the queue first to
make sure there are no commands in flight that can see torn or
inconsistent limits.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240229143846.1047223-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
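A minimal sketch of the pattern described in the two nbd entries above, assuming the queue_limits fields shown and a blksize variable; this is an illustration, not the actual nbd hunk:
    struct queue_limits lim;
    int error;
    /* freeze so no command in flight can observe half-updated limits */
    blk_mq_freeze_queue(nbd->disk->queue);
    lim = queue_limits_start_update(nbd->disk->queue);
    lim.logical_block_size = blksize;
    lim.physical_block_size = blksize;
    /* validate and publish all limits in one go */
    error = queue_limits_commit_update(nbd->disk->queue, &lim);
    blk_mq_unfreeze_queue(nbd->disk->queue);
    if (error)
        return error;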
Christoph Hellwig [Thu, 29 Feb 2024 14:38:44 +0000 (06:38 -0800)]
nbd: don't clear discard_sectors in nbd_config_put
nbd_config_put currently clears discard_sectors when unusing a device.
This is pretty odd behavior and different from the sector size
configuration, which is simply left in place and then reconfigured when
nbd_set_size is called as part of configuring the device. Change
nbd_set_size to clear discard_sectors if discard is not supported, so
that all the queue limits changes are handled in one place.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240229143846.1047223-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 29 Feb 2024 14:44:08 +0000 (06:44 -0800)]
pktcdvd: don't set max_hw_sectors on the underlying device
pktcdvd has set max_hw_sectors on the queue of the underlying device,
which it doesn't own (and never resets), ever since the driver was
merged. This can create all kinds of problems, as the underlying driver
doesn't even know about it changing the limit.
As the stated purpose is to not create I/Os larger than a single frame,
and pktcdvd never builds bios larger than that, just set REQ_NOMERGE
on the bios it submits so that larger I/Os never get built.
Note: I don't have packet writing hardware, so this is compile tested
only.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240229144408.1047967-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
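For illustration, the change described above boils down to marking each bio the driver submits as non-mergeable; a hedged one-liner sketch, not the exact pktcdvd hunk:
    /* never let the block layer merge this bio into a larger I/O */
    bio->bi_opf |= REQ_NOMERGE;
    submit_bio(bio);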
Christoph Hellwig [Wed, 28 Feb 2024 22:56:42 +0000 (14:56 -0800)]
dm: use queue_limits_set
Use queue_limits_set which validates the limits and takes care of
updating the readahead settings instead of directly assigning them to
the queue. For that make sure all limits are actually updated before
the assignment.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20240228225653.947152-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 28 Feb 2024 22:56:41 +0000 (14:56 -0800)]
block: add a queue_limits_stack_bdev helper
Add a small wrapper around blk_stack_limits that allows passing a bdev
for the bottom device and prints an error in case of misaligned
device. The name fits into the new queue limits API and the intent is
to eventually replace disk_stack_limits.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240228225653.947152-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
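A plausible shape for the new helper, assuming it simply forwards to blk_stack_limits() with the bottom device's limits and start sector and warns on misalignment (the details are an assumption, not the committed code):
    void queue_limits_stack_bdev(struct queue_limits *t, struct block_device *bdev,
                                 sector_t offset, const char *pfx)
    {
        /* stack the bottom device's limits into the top-level limits *t */
        if (blk_stack_limits(t, &bdev_get_queue(bdev)->limits,
                             get_start_sect(bdev) + offset))
            pr_notice("%s: Warning: Device %pg is misaligned\n", pfx, bdev);
    }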
Christoph Hellwig [Wed, 28 Feb 2024 22:56:40 +0000 (14:56 -0800)]
block: add a queue_limits_set helper
Add a small wrapper around queue_limits_commit_update for stacking
drivers that don't want to update existing limits, but set an
entirely new set.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240228225653.947152-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
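One plausible implementation of such a wrapper, assuming the limits_lock introduced with the atomic update API (a sketch, not necessarily the committed code):
    int queue_limits_set(struct request_queue *q, struct queue_limits *lim)
    {
        /* take the lock queue_limits_start_update() would normally take,
         * then commit the caller-provided limits wholesale */
        mutex_lock(&q->limits_lock);
        return queue_limits_commit_update(q, lim);
    }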
Jens Axboe [Fri, 1 Mar 2024 12:37:13 +0000 (05:37 -0700)]
Merge tag 'md-6.9-20240301' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.9/block
Pull MD updates from Song:
"The major changes are:
1. Refactor raid1 read_balance, by Yu Kuai and Paul Luse.
2. Clean up and fix for md_ioctl, by Li Nan.
3. Other small fixes, by Gui-Dong Han and Heming Zhao."
* tag 'md-6.9-20240301' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: (22 commits)
md/raid1: factor out helpers to choose the best rdev from read_balance()
md/raid1: factor out the code to manage sequential IO
md/raid1: factor out choose_bb_rdev() from read_balance()
md/raid1: factor out choose_slow_rdev() from read_balance()
md/raid1: factor out read_first_rdev() from read_balance()
md/raid1-10: factor out a new helper raid1_should_read_first()
md/raid1-10: add a helper raid1_check_read_range()
md/raid1: fix choose next idle in read_balance()
md/raid1: record nonrot rdevs while adding/removing rdevs to conf
md/raid1: factor out helpers to add rdev to conf
md: add a new helper rdev_has_badblock()
md/raid5: fix atomicity violation in raid5_cache_count
md/md-bitmap: fix incorrect usage for sb_index
md: check mddev->pers before calling md_set_readonly()
md: clean up openers check in do_md_stop() and md_set_readonly()
md: sync blockdev before stopping raid or setting readonly
md: factor out a helper to sync mddev
md: Don't clear MD_CLOSING when the raid is about to stop
md: return directly before setting did_set_md_closing
md: clean up invalid BUG_ON in md_ioctl
...
Song Liu [Fri, 1 Mar 2024 06:50:26 +0000 (22:50 -0800)]
Merge branch 'raid1-read_balance' into md-6.9
From: Yu Kuai <yukuai3@huawei.com>
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
The original idea is that Paul wanted to optimize raid1 read
performance ([1]); however, we think that the original code for
read_balance() is quite complex, and we don't want to add more
complexity. Hence we decided to refactor read_balance() first, to make
the code cleaner and easier for follow-up work.
Before this patchset, read_balance() had many local variables and many
branches, and it tried to consider all the scenarios in one iteration.
The idea of this patchset is to divide them into 4 different steps:
1) If resync is in progress, find the first usable disk, patch 5;
Otherwise:
2) Loop through all disks, skipping slow disks and disks with bad
blocks, and choose the best disk, patch 10. If no disk is found:
3) Look for disks with bad blocks and choose the one with the largest
number of sectors, patch 8. If no disk is found:
4) Choose the first slow disk found with no bad blocks, or the slow disk
with the largest number of sectors, patch 7.
Note that steps 3) and 4) are corner-case code paths, and performance
does not need to be considered there.
After this patchset, we'll continue to optimize read_balance for
step 2), specifically how to choose the best rdev to read from.
[1] https://lore.kernel.org/all/20240102125115.129261-1-paul.e.luse@linux.intel.com/
Yu Kuai (11):
md: add a new helper rdev_has_badblock()
md/raid1: factor out helpers to add rdev to conf
md/raid1: record nonrot rdevs while adding/removing rdevs to conf
md/raid1: fix choose next idle in read_balance()
md/raid1-10: add a helper raid1_check_read_range()
md/raid1-10: factor out a new helper raid1_should_read_first()
md/raid1: factor out read_first_rdev() from read_balance()
md/raid1: factor out choose_slow_rdev() from read_balance()
md/raid1: factor out choose_bb_rdev() from read_balance()
md/raid1: factor out the code to manage sequential IO
md/raid1: factor out helpers to choose the best rdev from
read_balance()
Yu Kuai [Thu, 29 Feb 2024 09:57:14 +0000 (17:57 +0800)]
md/raid1: factor out helpers to choose the best rdev from read_balance()
The way that the best rdev is chosen:
1) If the read is sequential from one rdev:
- if the rdev is rotational, use this rdev;
- if the rdev is non-rotational, use this rdev until the total read
length exceeds the disk's optimal I/O size;
2) If the read is not sequential:
- if there is an idle disk, use it; otherwise:
- if the array has a non-rotational disk, choose the rdev with minimal
inflight IO;
- if all the underlying disks are rotational, choose the rdev with the
closest IO;
There are no functional changes; this is just to make the code cleaner
and prepare for the following refactoring.
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-12-yukuai1@huaweicloud.com
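A deliberately simplified, self-contained illustration of the selection order described above; the struct and field names are invented for the example, and the optimal-I/O-size and bad-block details are omitted:
    struct fake_rdev {
        int nonrot;                 /* non-rotational device */
        int inflight;               /* in-flight I/O count */
        unsigned long long dist;    /* head distance from the read position */
        int sequential;             /* read continues where the last one ended */
    };
    static int choose_best(const struct fake_rdev *r, int nr)
    {
        int i, best = -1, has_nonrot = 0;
        for (i = 0; i < nr; i++) {
            if (r[i].sequential)    /* sequential read: keep using this rdev */
                return i;
            if (r[i].nonrot)
                has_nonrot = 1;
        }
        for (i = 0; i < nr; i++)
            if (r[i].inflight == 0) /* an idle disk is preferred */
                return i;
        for (i = 0; i < nr; i++) {  /* least loaded disk if an SSD is present,
                                       closest head position if all are rotational */
            if (best < 0 ||
                (has_nonrot ? r[i].inflight < r[best].inflight
                            : r[i].dist < r[best].dist))
                best = i;
        }
        return best;
    }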
Yu Kuai [Thu, 29 Feb 2024 09:57:13 +0000 (17:57 +0800)]
md/raid1: factor out the code to manage sequential IO
There is no functional change for now; this makes read_balance() cleaner
and prepares to fix problems and refactor the handling of sequential IO.
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-11-yukuai1@huaweicloud.com
Yu Kuai [Thu, 29 Feb 2024 09:57:12 +0000 (17:57 +0800)]
md/raid1: factor out choose_bb_rdev() from read_balance()
read_balance() is hard to understand because there are too many states
and branches, and it's overlong.
This patch factors the case of reading from an rdev with bad blocks out
of read_balance(); there are no functional changes.
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-10-yukuai1@huaweicloud.com
Yu Kuai [Thu, 29 Feb 2024 09:57:11 +0000 (17:57 +0800)]
md/raid1: factor out choose_slow_rdev() from read_balance()
read_balance() is hard to understand because there are too many states
and branches, and it's overlong.
This patch factors the case of reading from the slow rdev out of
read_balance(); there are no functional changes.
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-9-yukuai1@huaweicloud.com
Yu Kuai [Thu, 29 Feb 2024 09:57:10 +0000 (17:57 +0800)]
md/raid1: factor out read_first_rdev() from read_balance()
read_balance() is hard to understand because there are too many states
and branches, and it's overlong.
This patch factors the case of reading from the first rdev out of
read_balance(); there are no functional changes.
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-8-yukuai1@huaweicloud.com
Yu Kuai [Thu, 29 Feb 2024 09:57:09 +0000 (17:57 +0800)]
md/raid1-10: factor out a new helper raid1_should_read_first()
If resync is in progress, read_balance() should find the first usable
disk; otherwise, data could be inconsistent after resync is done. raid1
and raid10 implement the same check, hence factor out the check to make
the code cleaner.
Note that raid1 is using 'mddev->recovery_cp', which is updated after
all resync IO is done, while raid10 is using 'conf->next_resync', which
is inaccurate because raid10 updates it before submitting resync IO.
Fortunately, raid10 read IO can't be concurrent with resync IO, hence
there is no problem. This patch also switches raid10 to use
'mddev->recovery_cp'.
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-7-yukuai1@huaweicloud.com
Yu Kuai [Thu, 29 Feb 2024 09:57:08 +0000 (17:57 +0800)]
md/raid1-10: add a helper raid1_check_read_range()
The checking and handling of bad blocks appears many times during
read_balance() in raid1 and raid10. This helper will be used in later
patches to simplify read_balance() a lot.
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-6-yukuai1@huaweicloud.com
Yu Kuai [Thu, 29 Feb 2024 09:57:07 +0000 (17:57 +0800)]
md/raid1: fix choose next idle in read_balance()
Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") added
the 'choose next idle' case to read_balance():
read_balance:
 for_each_rdev
  if (next_seq_sect == this_sector || dist == 0)
   -> sequential reads
   best_disk = disk;
   if (...)
    choose_next_idle = 1
    continue;
 for_each_rdev
  -> iterate next rdev
  if (pending == 0)
   best_disk = disk;
   -> choose the next idle disk
   break;
  if (choose_next_idle)
   -> keep using this rdev if there is no other idle disk
   continue
However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
removed the code:
- /* If device is idle, use it */
- if (pending == 0) {
-  best_disk = disk;
-  break;
- }
Hence 'choose next idle' will never work now. Fix this problem as
follows:
1) don't set best_disk in this case; read_balance() will choose the best
disk after iterating over all the disks;
2) add 'pending' so that another idle disk will be chosen;
3) add a new local variable 'sequential_disk' to record the disk, and if
there is no other idle disk, 'sequential_disk' will be chosen;
Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-5-yukuai1@huaweicloud.com
Yu Kuai [Thu, 29 Feb 2024 09:57:06 +0000 (17:57 +0800)]
md/raid1: record nonrot rdevs while adding/removing rdevs to conf
For raid1, each read will iterate over all the rdevs from conf and check
if any rdev is non-rotational, then choose the rdev with minimal inflight
IO if so, or the rdev with the closest distance otherwise.
Disk nonrot info can be changed through the sysfs entry:
/sys/block/[disk_name]/queue/rotational
However, this should only be used for testing, and users really
shouldn't do this in real life. Record the number of non-rotational
disks in conf, to avoid checking each rdev in the IO fast path and to
simplify read_balance() a little bit.
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-4-yukuai1@huaweicloud.com
Yu Kuai [Thu, 29 Feb 2024 09:57:05 +0000 (17:57 +0800)]
md/raid1: factor out helpers to add rdev to conf
There are no functional changes; this just makes the code cleaner and
prepares to record disk non-rotational information while adding and
removing rdevs to/from conf.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-3-yukuai1@huaweicloud.com
Yu Kuai [Thu, 29 Feb 2024 09:57:04 +0000 (17:57 +0800)]
md: add a new helper rdev_has_badblock()
The current API is_badblock() must be passed 'first_bad' and
'bad_sectors'; however, many callers just want to know whether there are
bad blocks or not, and these callers must define two local variables that
will never be used.
Add a new helper rdev_has_badblock() that only returns whether there are
bad blocks or not, remove the unnecessary local variables, and replace
is_badblock() with the new helper in many places.
There are no functional changes, and the new helper will also be used
later to refactor read_balance().
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-2-yukuai1@huaweicloud.com
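A plausible shape of the helper, assuming it wraps is_badblock() and discards the out-parameters callers do not care about (a sketch based on the description above, not necessarily the exact committed code):
    static inline int rdev_has_badblock(struct md_rdev *rdev, sector_t s,
                                        int sectors)
    {
        sector_t first_bad;     /* filled in by is_badblock() but unused here */
        int bad_sectors;
        return is_badblock(rdev, s, sectors, &first_bad, &bad_sectors);
    }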
Ming Lei [Fri, 23 Feb 2024 07:55:39 +0000 (15:55 +0800)]
ublk: add UBLK_CMD_DEL_DEV_ASYNC
The current command UBLK_CMD_DEL_DEV won't return until the device is
released, this way looks more reliable, but makes userspace more
difficult to implement, especially about orders: unmap command
buffer(which holds one ublkc reference), ublkc close,
io_uring_file_unregister, ublkb close.
Add UBLK_CMD_DEL_DEV_ASYNC so that device deletion won't wait release,
then userspace needn't worry about the above order. Actually both loop
and nbd is deleted in this async way.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240223075539.89945-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 23 Feb 2024 07:55:38 +0000 (15:55 +0800)]
ublk: improve getting & putting ublk device
Firstly, convert get_device() and put_device() into ublk_get_device()
and ublk_put_device().
Secondly, annotate ublk_get_device() & ublk_put_device() as noinline
for tracing, especially since it is easy to trigger a device deletion
hang when an incorrect order is used for ublkc mmap, ublkc close,
io_uring_sqe_unregister_file, ublkb close.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240223075539.89945-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 28 Feb 2024 04:08:57 +0000 (12:08 +0800)]
blk-mq: don't change nr_hw_queues and nr_maps for kdump kernel
For most architectures, 'nr_cpus=1' is passed for the kdump kernel, so
nr_hw_queues for each mapping is supposed to be 1 already.
More importantly, overriding it may cause trouble for the driver, because
blk-mq and the driver would see different queue mappings, since the driver
has to set up its hardware queues before allocating the blk-mq tagset.
So do not override nr_hw_queues and nr_maps for the kdump kernel.
Cc: Wen Xiong <wenxiong@us.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240228040857.306483-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Gui-Dong Han [Fri, 12 Jan 2024 07:10:17 +0000 (15:10 +0800)]
md/raid5: fix atomicity violation in raid5_cache_count
In raid5_cache_count():
    if (conf->max_nr_stripes < conf->min_nr_stripes)
        return 0;
    return conf->max_nr_stripes - conf->min_nr_stripes;
The current check is ineffective, as the values could change immediately
after being checked.
In raid5_set_cache_size():
    ...
    conf->min_nr_stripes = size;
    ...
    while (size > conf->max_nr_stripes)
        conf->min_nr_stripes = conf->max_nr_stripes;
    ...
Due to intermediate value updates in raid5_set_cache_size(), concurrent
execution of raid5_cache_count() and raid5_set_cache_size() may lead to
inconsistent reads of conf->max_nr_stripes and conf->min_nr_stripes.
The current checks are ineffective as values could change immediately
after being checked, raising the risk of conf->min_nr_stripes exceeding
conf->max_nr_stripes and potentially causing an integer overflow.
This possible bug is found by an experimental static analysis tool
developed by our team. This tool analyzes the locking APIs to extract
function pairs that can be concurrently executed, and then analyzes the
instructions in the paired functions to identify possible concurrency bugs
including data races and atomicity violations. The above possible bug is
reported when our tool analyzes the source code of Linux 6.2.
To resolve this issue, it is suggested to introduce local variables
'min_stripes' and 'max_stripes' in raid5_cache_count() to ensure the
values remain stable throughout the check. Adding locks in
raid5_cache_count() fails to resolve atomicity violations, as
raid5_set_cache_size() may hold intermediate values of
conf->min_nr_stripes while unlocked. With this patch applied, our tool no
longer reports the bug, with the kernel configuration allyesconfig for
x86_64. Due to the lack of associated hardware, we cannot test the patch
in runtime testing, and just verify it according to the code logic.
Fixes: edbe83ab4c27 ("md/raid5: allow the stripe_cache to grow and shrink.")
Cc: stable@vger.kernel.org
Signed-off-by: Gui-Dong Han <2045gemini@gmail.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240112071017.16313-1-2045gemini@gmail.com
Signed-off-by: Song Liu <song@kernel.org>
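A hedged sketch of the suggested fix (the field access and READ_ONCE usage are assumptions): read both counters once into locals so the check and the subtraction operate on the same snapshot even while raid5_set_cache_size() is updating them concurrently:
    int max_stripes = READ_ONCE(conf->max_nr_stripes);
    int min_stripes = READ_ONCE(conf->min_nr_stripes);
    if (max_stripes < min_stripes)
        return 0;
    return max_stripes - min_stripes;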
Christoph Hellwig [Thu, 22 Feb 2024 07:24:17 +0000 (08:24 +0100)]
ubd: open the backing files in ubd_add
Opening the backing device only when the block device is opened is
a bit weird, as no one configures block devices to not use them.
Open them at add time, close them at remove time, and remove the
now superfluous opened counter, as remove can simply check
disk_openers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Richard Weinberger <richard@nod.at>
Link: https://lore.kernel.org/r/20240222072417.3773131-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 22 Feb 2024 07:24:16 +0000 (08:24 +0100)]
ubd: remove the queue pointer in struct ubd
No need for it now, everything goes through the gendisk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Richard Weinberger <richard@nod.at>
Link: https://lore.kernel.org/r/20240222072417.3773131-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 22 Feb 2024 07:24:15 +0000 (08:24 +0100)]
ubd: move set_disk_ro to ubd_add
No need to delay this until open time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Richard Weinberger <richard@nod.at>
Link: https://lore.kernel.org/r/20240222072417.3773131-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 22 Feb 2024 07:24:14 +0000 (08:24 +0100)]
ubd: move setting the variable queue limits to ubd_add
No reason to delay this until open time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Richard Weinberger <richard@nod.at>
Link: https://lore.kernel.org/r/20240222072417.3773131-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 22 Feb 2024 07:24:13 +0000 (08:24 +0100)]
ubd: move setting the nonrot flag to ubd_add
No reason to delay this until open time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Richard Weinberger <richard@nod.at>
Link: https://lore.kernel.org/r/20240222072417.3773131-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 22 Feb 2024 07:24:12 +0000 (08:24 +0100)]
ubd: remove ubd_disk_register
Fold it into the only caller to remove lots of references to the
global ubd_devs array.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Richard Weinberger <richard@nod.at>
Link: https://lore.kernel.org/r/20240222072417.3773131-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 22 Feb 2024 07:24:11 +0000 (08:24 +0100)]
ubd: remove the ubd_gendisk array
And add a disk pointer to the ubd structure instead to keep all
the per-device information together.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Richard Weinberger <richard@nod.at>
Link: https://lore.kernel.org/r/20240222072417.3773131-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 21 Feb 2024 12:58:45 +0000 (13:58 +0100)]
xen-blkfront: atomically update queue limits
Pass the initial queue limits to blk_mq_alloc_disk and use the
blkif_set_queue_limits API to update the limits on reconnect.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Link: https://lore.kernel.org/r/20240221125845.3610668-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 21 Feb 2024 12:58:44 +0000 (13:58 +0100)]
xen-blkfront: don't redundantly set max_segments in blkif_recover
blkif_set_queue_limits already sets the max_segments limits, so don't do
it a second time. Also remove a comment about a long-fixed bug in
blk_mq_update_nr_hw_queues.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Link: https://lore.kernel.org/r/20240221125845.3610668-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 21 Feb 2024 12:58:43 +0000 (13:58 +0100)]
xen-blkfront: rely on the default discard granularity
The block layer now sets the discard granularity to the physical
block size by default. Take advantage of that in xen-blkfront and only
set the discard granularity if explicitly specified.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Link: https://lore.kernel.org/r/20240221125845.3610668-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 21 Feb 2024 12:58:42 +0000 (13:58 +0100)]
xen-blkfront: set max_discard/secure erase limits to UINT_MAX
Currently xen-blkfront sets the max discard limit to the capacity of
the device, which is suboptimal when the capacity changes. Just set
it to UINT_MAX, which has the same effect and is simpler.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Link: https://lore.kernel.org/r/20240221125845.3610668-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Heming Zhao [Fri, 23 Feb 2024 12:11:28 +0000 (20:11 +0800)]
md/md-bitmap: fix incorrect usage for sb_index
Commit d7038f951828 ("md-bitmap: don't use ->index for pages backing the
bitmap file") removed page->index from the bitmap code, but left wrong
logic for clustered-md. The current code never sets the slot offset for
cluster nodes, which will sometimes cause a crash in a clustered
environment.
Call trace (partly):
md_bitmap_file_set_bit+0x110/0x1d8 [md_mod]
md_bitmap_startwrite+0x13c/0x240 [md_mod]
raid1_make_request+0x6b0/0x1c08 [raid1]
md_handle_request+0x1dc/0x368 [md_mod]
md_submit_bio+0x80/0xf8 [md_mod]
__submit_bio+0x178/0x300
submit_bio_noacct_nocheck+0x11c/0x338
submit_bio_noacct+0x134/0x614
submit_bio+0x28/0xdc
submit_bh_wbc+0x130/0x1cc
submit_bh+0x1c/0x28
Fixes: d7038f951828 ("md-bitmap: don't use ->index for pages backing the bitmap file")
Cc: stable@vger.kernel.org # v6.6+
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240223121128.28985-1-heming.zhao@suse.com
Li Nan [Mon, 26 Feb 2024 03:14:44 +0000 (11:14 +0800)]
md: check mddev->pers before calling md_set_readonly()
If 'mddev->pers' is NULL, there is nothing to do in md_set_readonly().
Except for md_ioctl(), the other two callers of md_set_readonly() have
already checked 'mddev->pers'. To simplify the code, move the check of
'mddev->pers' to the caller.
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-10-linan666@huaweicloud.com
Li Nan [Mon, 26 Feb 2024 03:14:43 +0000 (11:14 +0800)]
md: clean up openers check in do_md_stop() and md_set_readonly()
Before stopping or setting readonly, mddev_set_closing_and_sync_blockdev()
is always called to check the openers. So no longer need to check it again
in do_md_stop() and md_set_readonly(). Clean it up.
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-9-linan666@huaweicloud.com
Li Nan [Mon, 26 Feb 2024 03:14:42 +0000 (11:14 +0800)]
md: sync blockdev before stopping raid or setting readonly
Commit a05b7ea03d72 ("md: avoid crash when stopping md array races
with closing other open fds.") added a blockdev sync before stopping raid
and setting readonly. Later, commit 260fa034ef7a ("md: avoid deadlock when
dirty buffers during md_stop.") moved it to the ioctl path, but
array_state_store() was missed. Add a blockdev sync to
array_state_store() now.
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-8-linan666@huaweicloud.com
Li Nan [Mon, 26 Feb 2024 03:14:41 +0000 (11:14 +0800)]
md: factor out a helper to sync mddev
There are no functional changes, prepare to sync mddev in
array_state_store().
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-7-linan666@huaweicloud.com
Li Nan [Mon, 26 Feb 2024 03:14:40 +0000 (11:14 +0800)]
md: Don't clear MD_CLOSING when the raid is about to stop
The raid should not be opened anymore when it is about to be stopped.
However, other processes can open it again if the flag MD_CLOSING is
cleared before exiting. From now on, this flag will not be cleared when
the raid is about to be stopped.
Fixes: 065e519e71b2 ("md: MD_CLOSING needs to be cleared after called md_set_readonly or do_md_stop")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-6-linan666@huaweicloud.com
Li Nan [Mon, 26 Feb 2024 03:14:39 +0000 (11:14 +0800)]
md: return directly before setting did_set_md_closing
There is nothing to do at 'out' before setting 'did_set_md_closing'
in md_ioctl(). Return directly, and it will help us to remove
'did_set_md_closing' later.
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-5-linan666@huaweicloud.com
Li Nan [Mon, 26 Feb 2024 03:14:38 +0000 (11:14 +0800)]
md: clean up invalid BUG_ON in md_ioctl
'disk->private_data' is set to mddev in md_alloc() and never set to NULL,
and users need to open mddev before submitting an ioctl. So mddev must
not have been freed during the ioctl, and there is no need to check mddev
here. Clean it up.
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-4-linan666@huaweicloud.com
Li Nan [Mon, 26 Feb 2024 03:14:37 +0000 (11:14 +0800)]
md: changed the switch of RAID_VERSION to if
There is only one case of this 'switch'. Change it to 'if'.
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-3-linan666@huaweicloud.com
Li Nan [Mon, 26 Feb 2024 03:14:36 +0000 (11:14 +0800)]
md: merge the check of capabilities into md_ioctl_valid()
There is no functional change. Just to make code cleaner.
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-2-linan666@huaweicloud.com
Chengming Zhou [Sat, 24 Feb 2024 13:46:46 +0000 (13:46 +0000)]
bdev: remove SLAB_MEM_SPREAD flag usage
The SLAB_MEM_SPREAD flag is already a no-op as of 6.8-rc1, remove
its usage so we can delete it from slab. No functional change.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Link: https://lore.kernel.org/r/20240224134646.829105-1-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Qais Yousef [Fri, 23 Feb 2024 15:57:49 +0000 (15:57 +0000)]
block/blk-mq: Don't complete locally if capacities are different
The logic in blk_mq_complete_need_ipi() assumes SMP systems where all
CPUs have equal compute capacities and only the LLC cache can make
a difference in perceived performance. But this assumption falls apart on
HMP systems where LLC is shared, but the CPUs have different capacities.
Staying local then can have a big performance impact if the IO request
was done from a CPU with higher capacity but the interrupt is serviced
on a lower capacity CPU.
Use the new cpus_equal_capacity() function to check if we need to send
an IPI.
Without the patch I see the BLOCK softirq always running on little cores
(where the hardirq is serviced). With it I can see it running on all
cores.
This was noticed after the topology change [1] where now on a big.LITTLE
we truly get that the LLC is shared between all cores, whereas in the
past it was being misrepresented for historical reasons. The logic
exposed a missing dependency on capacities for such systems where there
can be a big performance difference between the CPUs.
This of course introduced a noticeable change in behavior depending on
how the topology is presented. Leading to regressions in some workloads
as the performance of the BLOCK softirq on littles can be noticeably
worse on some platforms.
Worth noting that we could have checked for capacities being greater
than or equal instead of for equality. That would lead to always favouring
higher performance. But we opted for equality instead to match the
performance of the requester without making an assumption that can lead
to power trade-offs which these systems tend to be sensitive about. If
the requester would like to run faster, it's better to rely on the
scheduler to give the IO requester via some facility to run on a faster
core; and then if the interrupt triggered on a CPU with different
capacity we'll make sure to match the performance the requester is
supposed to run at.
[1] https://lpc.events/event/16/contributions/1342/attachments/962/1883/LPC-2022-Android-MC-Phantom-Domains.pdf
Signed-off-by: Qais Yousef <qyousef@layalina.io>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240223155749.2958009-3-qyousef@layalina.io
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Qais Yousef [Fri, 23 Feb 2024 15:57:48 +0000 (15:57 +0000)]
sched: Add a new function to compare if two cpus have the same capacity
The new helper function is needed to help blk-mq check if it needs to
dispatch the softirq on another CPU to match the performance level the
IO requester is running at. This is important on HMP systems where not
all CPUs have the same compute capacity.
Signed-off-by: Qais Yousef <qyousef@layalina.io>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240223155749.2958009-2-qyousef@layalina.io
Signed-off-by: Jens Axboe <axboe@kernel.dk>
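A plausible shape for the helper, assuming it keys off arch_scale_cpu_capacity() and short-circuits when asymmetric-capacity scheduling is inactive (an illustrative sketch, not necessarily the committed code):
    bool cpus_equal_capacity(int this_cpu, int that_cpu)
    {
        if (!sched_asym_cpucap_active())
            return true;            /* all CPUs have the same capacity */
        if (this_cpu == that_cpu)
            return true;
        return arch_scale_cpu_capacity(this_cpu) ==
               arch_scale_cpu_capacity(that_cpu);
    }
The corresponding blk_mq_complete_need_ipi() check would then also require the completing CPU's capacity to match that of the submitting CPU (rq->mq_ctx->cpu) before completing locally.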
Keith Busch [Fri, 23 Feb 2024 15:59:10 +0000 (07:59 -0800)]
blk-lib: check for kill signal
Some of these block operations can access a significant capacity and
take longer than the user expected. A user may change their mind about
wanting to run that command and attempt to kill the process and do
something else with their device. But since the task is uninterruptible,
they have to wait for it to finish, which could be many hours.
Check for a fatal signal at each iteration so the user doesn't have to
wait for their regretted operation to complete naturally.
Reported-by: Conrad Meyer <conradmeyer@meta.com>
Tested-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20240223155910.3622666-5-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
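Illustratively, the check described above amounts to the following inside each submission loop (placement and cleanup details are assumptions, not the exact blk-lib hunks):
    while (nr_sects) {
        if (fatal_signal_pending(current)) {
            ret = -EINTR;   /* abandon the remainder of the range */
            break;
        }
        /* ... build and submit the next bio as before ... */
    }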
Keith Busch [Fri, 23 Feb 2024 15:59:09 +0000 (07:59 -0800)]
block: io wait hang check helper
This is the same in two places, and another will be added soon. Create a
helper for it.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20240223155910.3622666-4-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Fri, 23 Feb 2024 15:59:08 +0000 (07:59 -0800)]
block: cleanup __blkdev_issue_write_zeroes
Use min to calculate the next number of sectors like everyone else.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20240223155910.3622666-3-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Fri, 23 Feb 2024 15:59:07 +0000 (07:59 -0800)]
block: blkdev_issue_secure_erase loop style
Use consistent coding style in this file. All the other loops for the
same purpose use "while (nr_sects)", so they win.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20240223155910.3622666-2-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Li Nan [Wed, 21 Feb 2024 09:01:22 +0000 (17:01 +0800)]
block: fix deadlock between bd_link_disk_holder and partition scan
'open_mutex' of gendisk is used to protect opening/closing block devices.
But in bd_link_disk_holder(), it is used to protect the creation of the
symlink between the holding disk and the slave bdev, which introduces
some issues.
When bd_link_disk_holder() is called, the driver is usually in the process
of initialization/modification and may suspend submitting io. At this
time, any io holding 'open_mutex', such as scanning partitions, can cause
deadlocks. For example, in raid:
T1                              T2
bdev_open_by_dev
 lock open_mutex [1]
 ...
  efi_partition
  ...
   md_submit_bio
                                md_ioctl mddev_suspend
                                 -> suspend all io
                                md_add_new_disk
                                 bind_rdev_to_array
                                  bd_link_disk_holder
                                   try lock open_mutex [2]
   md_handle_request
    -> wait mddev_resume
T1 scans partitions and T2 adds a new device to the raid. T1 waits for T2
to resume mddev, but T2 waits for the open_mutex held by T1. Deadlock
occurs.
Fix it by introducing a local mutex 'blk_holder_mutex' to replace
'open_mutex'.
Fixes: 1b0a2d950ee2 ("md: use new apis to suspend array for ioctls involed array reconfiguration")
Reported-by: mgperkow@gmail.com
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218459
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240221090122.1281868-1-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
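A hedged sketch of the fix, assuming the symlink creation simply switches from the gendisk's open_mutex to a dedicated static mutex (not the exact committed hunk):
    static DEFINE_MUTEX(blk_holder_mutex);
    int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk)
    {
        int ret = 0;
        /* was: mutex_lock(&bdev->bd_disk->open_mutex) */
        mutex_lock(&blk_holder_mutex);
        /* ... create the holder/slave symlinks as before ... */
        mutex_unlock(&blk_holder_mutex);
        return ret;
    }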
Damien Le Moal [Thu, 22 Feb 2024 13:17:24 +0000 (22:17 +0900)]
block: Do not include rbtree.h in blk-zoned.c
The block zone code does not use RB-tree. So remove the include of
linux/rbtree.h as it is not needed.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240222131724.1803520-2-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Damien Le Moal [Thu, 22 Feb 2024 13:17:23 +0000 (22:17 +0900)]
block: Clear zone limits for a non-zoned stacked queue
Device mapper may create a non-zoned mapped device out of a zoned device
(e.g., the dm-zoned target). In such a case, some queue limits such as
max_zone_append_sectors and zone_write_granularity end up being non-zero
values for a block device that is not zoned. Avoid this by clearing
these limits in blk_stack_limits() when the stacked zoned limit is
false.
Fixes: 3093a479727b ("block: inherit the zoned characteristics in blk_stack_limits")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240222131724.1803520-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
John Garry [Thu, 22 Feb 2024 08:34:20 +0000 (08:34 +0000)]
null_blk: Delete nullb.{queue_depth, nr_queues}
Since commit 8b631f9cf0b8 ("null_blk: remove the bio based I/O path"),
struct nullb members queue_depth and nr_queues are only ever written, so
delete them.
With that, null_exit_hctx() can also be deleted.
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20240222083420.6026-1-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 22 Feb 2024 07:36:47 +0000 (08:36 +0100)]
pktcdvd: set queue limits at disk allocation time
Remove pkt_init_queue and just pass the two parameters directly to
blk_alloc_disk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240222073647.3776769-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 22 Feb 2024 07:36:46 +0000 (08:36 +0100)]
pktcdvd: stop setting q->queuedata
The two users can get the private data from the gendisk with one less
pointer dereference, and we can drop the useless q parameter from
pkt_make_request_write.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240222073647.3776769-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 21 Feb 2024 12:50:10 +0000 (13:50 +0100)]
block: fix virt_boundary handling in blk_validate_limits
Don't set the default max_segment_size value when a virt_boundary is
used.
Fixes: d690cb8ae14b ("block: add an API to atomically update queue limits")
Reported-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://lore.kernel.org/r/20240221125010.3609444-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 20 Feb 2024 09:32:48 +0000 (10:32 +0100)]
null_blk: pass queue_limits to blk_mq_alloc_disk
Pass the queue limits directly to blk_mq_alloc_disk instead of
setting them one at a time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240220093248.3290292-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
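The general shape of the conversion done throughout this series, with illustrative limit values and a hypothetical dev->tag_set (not taken from any specific driver): fill in a queue_limits on the stack and hand it to blk_mq_alloc_disk() instead of calling individual setters on the live queue afterwards:
    struct queue_limits lim = {
        .logical_block_size = 4096,
        .max_hw_sectors     = 1024,
    };
    struct gendisk *disk;
    disk = blk_mq_alloc_disk(&dev->tag_set, &lim, dev);
    if (IS_ERR(disk))
        return PTR_ERR(disk);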
Christoph Hellwig [Tue, 20 Feb 2024 09:32:47 +0000 (10:32 +0100)]
null_blk: remove null_gendisk_register
null_gendisk_register isn't a very useful abstraction given that it
doesn't even allocate the gendisk. Merge it into the only caller
instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240220093248.3290292-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 20 Feb 2024 09:32:46 +0000 (10:32 +0100)]
null_blk: refactor tag_set setup
Move the tagset initialization out of null_add_dev into a new
null_setup_tagset helper, and move the shared vs local differences
out of null_init_tag_set into the callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240220093248.3290292-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 20 Feb 2024 09:32:45 +0000 (10:32 +0100)]
null_blk: initialize the tag_set timeout in null_init_tag_set
Otherwise it will always be reset to the same value when initializing a
device using the shared tag_set.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240220093248.3290292-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 20 Feb 2024 09:32:44 +0000 (10:32 +0100)]
null_blk: remove the bio based I/O path
The bio based I/O path complicates null_blk and also makes various
data structures, including the per-command one, way bigger than
required for the main request based interface. As the bio-based
path is mostly used by stacking drivers and simple memory based
drivers, and brd is a good example driver for the latter, there is
no need to have a bio based path in null_blk. Remove the path
to simplify the driver and make future block layer API changes
simpler by not having to deal with the complex two-API setup in
null_blk.
Note that the queue_mode field in struct nullb_device is kept, as
that is simpler than having two different places to check the
value and fully open coding the debugfs helpers, as the existing
ones won't work without a named struct member.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240220093248.3290292-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 15 Feb 2024 07:03:00 +0000 (08:03 +0100)]
mmc: pass queue_limits to blk_mq_alloc_disk
Pass the queue limit set at initialization time directly to
blk_mq_alloc_disk instead of updating it right after the allocation.
This requires refactoring the code a bit so that what was mmc_setup_queue
before also allocates the gendisk now and actually sets all limits.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20240215070300.2200308-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 15 Feb 2024 07:02:59 +0000 (08:02 +0100)]
ublk: pass queue_limits to blk_mq_alloc_disk
Pass the limits ublk imposes directly to blk_mq_alloc_disk instead of
setting them one at a time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240215070300.2200308-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 15 Feb 2024 07:02:58 +0000 (08:02 +0100)]
scm_blk: pass queue_limits to blk_mq_alloc_disk
Pass the few limits scm_block imposes directly to blk_mq_alloc_disk
instead of setting them one at a time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240215070300.2200308-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 15 Feb 2024 07:02:57 +0000 (08:02 +0100)]
ubiblock: pass queue_limits to blk_mq_alloc_disk
Pass the few limits ubiblock imposes directly to blk_mq_alloc_disk
instead of setting them one at a time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Link: https://lore.kernel.org/r/20240215070300.2200308-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 15 Feb 2024 07:02:56 +0000 (08:02 +0100)]
mtd_blkdevs: pass queue_limits to blk_mq_alloc_disk
Pass the few limits mtd_blkdevs imposes directly to blk_mq_alloc_disk
instead of setting them one at a time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240215070300.2200308-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 15 Feb 2024 07:02:55 +0000 (08:02 +0100)]
mspro_block: pass queue_limits to blk_mq_alloc_disk
Pass the few limits mspro_block imposes directly to blk_mq_alloc_disk
instead of setting them one at a time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240215070300.2200308-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 15 Feb 2024 07:02:54 +0000 (08:02 +0100)]
ms_block: pass queue_limits to blk_mq_alloc_disk
Pass the few limits ms_block imposes directly to blk_mq_alloc_disk
instead of setting them one at a time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240215070300.2200308-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 15 Feb 2024 07:02:53 +0000 (08:02 +0100)]
gdrom: pass queue_limits to blk_mq_alloc_disk
Pass the few limits gdrom imposes directly to blk_mq_alloc_disk instead
of setting them one at a time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240215070300.2200308-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 15 Feb 2024 07:02:52 +0000 (08:02 +0100)]
sunvdc: pass queue_limits to blk_mq_alloc_disk
Pass the few limits sunvdc imposes directly to blk_mq_alloc_disk instead
of setting them one at a time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240215070300.2200308-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 15 Feb 2024 07:02:51 +0000 (08:02 +0100)]
rnbd-clt: pass queue_limits to blk_mq_alloc_disk
Pass the limits rnbd-clt imposes directly to blk_mq_alloc_disk instead
of setting them one at a time.
While at it don't set an explicit number of discard segments, as 1 is
the default (which most drivers rely on).
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20240215070300.2200308-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 15 Feb 2024 07:02:50 +0000 (08:02 +0100)]
rbd: pass queue_limits to blk_mq_alloc_disk
Pass the limits rbd imposes directly to blk_mq_alloc_disk instead
of setting them one at a time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240215070300.2200308-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 15 Feb 2024 07:02:49 +0000 (08:02 +0100)]
ps3disk: pass queue_limits to blk_mq_alloc_disk
Pass the few limits ps3disk imposes directly to blk_mq_alloc_disk instead
of setting them one at a time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240215070300.2200308-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 15 Feb 2024 07:02:48 +0000 (08:02 +0100)]
nbd: pass queue_limits to blk_mq_alloc_disk
Pass the few limits nbd imposes directly to blk_mq_alloc_disk instead
of setting them one at a time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240215070300.2200308-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 15 Feb 2024 07:02:47 +0000 (08:02 +0100)]
mtip: pass queue_limits to blk_mq_alloc_disk
Pass the few limits mtip imposes directly to blk_mq_alloc_disk instead
of setting them one at a time, and drop the pointless setting of an io_min
that is equal to the physical block size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240215070300.2200308-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 15 Feb 2024 07:02:46 +0000 (08:02 +0100)]
floppy: pass queue_limits to blk_mq_alloc_disk
Pass the few limits floppy imposes directly to blk_mq_alloc_disk instead
of setting them one at a time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Denis Efremov <efremov@linux.com>
Link: https://lore.kernel.org/r/20240215070300.2200308-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 15 Feb 2024 07:02:45 +0000 (08:02 +0100)]
aoe: pass queue_limits to blk_mq_alloc_disk
Pass the few limits aoe imposes directly to blk_mq_alloc_disk instead
of setting them one at a time and improve the way the default
max_hw_sectors is initialized while we're at it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240215070300.2200308-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 15 Feb 2024 07:02:44 +0000 (08:02 +0100)]
ubd: pass queue_limits to blk_mq_alloc_disk
Pass the few limits ubd imposes directly to blk_mq_alloc_disk instead
of setting them one at a time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240215070300.2200308-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 15 Feb 2024 07:10:55 +0000 (08:10 +0100)]
dcssblk: pass queue_limits to blk_mq_alloc_disk
Pass the queue limits directly to blk_alloc_disk instead of setting them
one at a time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Link: https://lore.kernel.org/r/20240215071055.2201424-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 15 Feb 2024 07:10:54 +0000 (08:10 +0100)]
pmem: pass queue_limits to blk_mq_alloc_disk
Pass the queue limits directly to blk_alloc_disk instead of setting them
one at a time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Link: https://lore.kernel.org/r/20240215071055.2201424-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>