Christian Loehle [Thu, 18 Jan 2024 09:29:56 +0000 (09:29 +0000)]
Documentation: block: ioprio: Update schedulers
This doc hasn't been touched in a while. In the meantime some
new I/O schedulers were added (e.g. all of mq), some with ioprio
support.
Also reword the introduction to remove reference to CFQ and the
limitation that io priorities only work on reads, which is no longer
true.
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/a86cfdc8-016f-40f1-8b58-0cb15d2a792c@arm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 17 Jan 2024 17:59:01 +0000 (18:59 +0100)]
loop: fix the direct I/O support check when used on top of block devices
__loop_update_dio only checks the alignment requirement for block backed
file systems, but misses them for the case where the loop device is
created directly on top of another block device. Because of this, creating
a loop device with default options plus the direct I/O flag on a > 512 byte
sector size file system leads to incorrect I/O being submitted to the
lower block device and a lot of errors from the block layer. This can
be seen with xfstests generic/563.
Fix the code in __loop_update_dio by factoring the alignment check into
a helper, and calling that also for the struct block_device of a block
device inode.
Also remove the TODO comment talking about dynamically switching between
buffered and direct I/O, which would be a recipe for horrible
performance and occasional data loss.
Fixes: 2e5ab5f379f9 ("block: loop: prepare for supporing direct IO")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240117175901.871796-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
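The alignment check that the loop commit above factors into a helper can be
sketched in userspace roughly as follows. This is only an illustration: the
function name and its plain integer parameters are hypothetical stand-ins,
not the kernel helper, and the block size is assumed to be a power of two.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the factored-out alignment check.
 * loop_bsize:    logical block size exposed by the loop device
 * backing_bsize: logical block size of the backing device (power of two)
 * offset:        byte offset of the loop data inside the backing device
 */
static bool can_use_dio(uint32_t loop_bsize, uint32_t backing_bsize,
			uint64_t offset)
{
	/* The loop device must not issue I/O smaller than the backing block. */
	if (loop_bsize < backing_bsize)
		return false;
	/* The data offset must be aligned to the backing block size. */
	if (offset & (backing_bsize - 1))
		return false;
	return true;
}
```

The point of the commit is that the same check has to run whether the
backing store is a file on a block-backed file system or a block device
inode, which is why a single helper is attractive.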
Bart Van Assche [Wed, 17 Jan 2024 20:36:09 +0000 (12:36 -0800)]
blk-mq: Remove the hctx 'run' debugfs attribute
Nobody uses the debugfs hctx 'run' attribute. Hence remove this
attribute and also the code that updates the corresponding member
variable.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Cc: Gabriel Ryan <gabe@cs.columbia.edu>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240117203609.4122520-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Eric Dumazet [Fri, 12 Jan 2024 13:26:57 +0000 (13:26 +0000)]
nbd: always initialize struct msghdr completely
syzbot complains that the msg->msg_get_inq value can be uninitialized [1].
struct msghdr got many new fields recently, so we should always make
sure their values are zero by default.
[1]
BUG: KMSAN: uninit-value in tcp_recvmsg+0x686/0xac0 net/ipv4/tcp.c:2571
tcp_recvmsg+0x686/0xac0 net/ipv4/tcp.c:2571
inet_recvmsg+0x131/0x580 net/ipv4/af_inet.c:879
sock_recvmsg_nosec net/socket.c:1044 [inline]
sock_recvmsg+0x12b/0x1e0 net/socket.c:1066
__sock_xmit+0x236/0x5c0 drivers/block/nbd.c:538
nbd_read_reply drivers/block/nbd.c:732 [inline]
recv_work+0x262/0x3100 drivers/block/nbd.c:863
process_one_work kernel/workqueue.c:2627 [inline]
process_scheduled_works+0x104e/0x1e70 kernel/workqueue.c:2700
worker_thread+0xf45/0x1490 kernel/workqueue.c:2781
kthread+0x3ed/0x540 kernel/kthread.c:388
ret_from_fork+0x66/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242
Local variable msg created at:
__sock_xmit+0x4c/0x5c0 drivers/block/nbd.c:513
nbd_read_reply drivers/block/nbd.c:732 [inline]
recv_work+0x262/0x3100 drivers/block/nbd.c:863
CPU: 1 PID: 7465 Comm: kworker/u5:1 Not tainted 6.7.0-rc7-syzkaller-00041-gf016f7547aee #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
Workqueue: nbd5-recv recv_work
Fixes: f94fd25cb0aa ("tcp: pass back data left in socket after receive")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: stable@vger.kernel.org
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: nbd@other.debian.org
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240112132657.647112-1-edumazet@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
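The pattern behind the nbd fix above, zeroing the whole structure before
filling in the fields you care about, can be sketched in userspace like
this. Note the assumptions: this uses the userspace struct msghdr from
<sys/socket.h>, which has different fields from the kernel structure (it
has no msg_get_inq), and make_recv_msg() is a hypothetical helper name.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: build a receive msghdr with every field zeroed
 * first, so fields added to the structure later never hold stack garbage.
 */
static struct msghdr make_recv_msg(struct iovec *iov, size_t iovcnt)
{
	struct msghdr msg;

	memset(&msg, 0, sizeof(msg));	/* zero everything up front */
	msg.msg_iov = iov;		/* then set only what is needed */
	msg.msg_iovlen = iovcnt;
	return msg;
}
```

Zeroing the full structure costs a few stores per call but keeps the code
correct when new members are added, which is the situation the commit
message describes.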
Matthew Wilcox (Oracle) [Tue, 16 Jan 2024 21:29:59 +0000 (21:29 +0000)]
block: Fix iterating over an empty bio with bio_for_each_folio_all
If the bio contains no data, bio_first_folio() calls page_folio() on a
NULL pointer and oopses. Move the test that we've reached the end of
the bio from bio_next_folio() to bio_first_folio().
Reported-by: syzbot+8b23309d5788a79d3eea@syzkaller.appspotmail.com
Reported-by: syzbot+004c1e0fced2b4bc3dcc@syzkaller.appspotmail.com
Fixes: 640d1930bef4 ("block: Add bio_for_each_folio_all()")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20240116212959.3413014-1-willy@infradead.org
[axboe: add unlikely() to error case]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
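The idea in the bio_for_each_folio_all commit above, testing for the end of
the container in the "first" step rather than only in the "next" step, can
be illustrated with a small userspace iterator. Everything here (seg_iter,
seg_first, seg_next, sum_all) is a hypothetical stand-in over an int array,
not the kernel's folio iterator.

```c
#include <stddef.h>

struct seg_iter {
	const int *cur;	/* NULL once iteration is finished */
	size_t idx;	/* index of the current element */
};

/*
 * Position the iterator at element idx. The end-of-container test lives
 * here, so an empty array is never dereferenced.
 */
static void seg_first(struct seg_iter *it, const int *segs, size_t nr,
		      size_t idx)
{
	it->idx = idx;
	if (idx >= nr) {
		it->cur = NULL;
		return;
	}
	it->cur = &segs[idx];
}

/* Advancing simply re-runs the checked positioning step. */
static void seg_next(struct seg_iter *it, const int *segs, size_t nr)
{
	seg_first(it, segs, nr, it->idx + 1);
}

/* Example consumer: sum all elements, including the empty case. */
static int sum_all(const int *segs, size_t nr)
{
	struct seg_iter it;
	int sum = 0;

	for (seg_first(&it, segs, nr, 0); it.cur; seg_next(&it, segs, nr))
		sum += *it.cur;
	return sum;
}
```

With the check only in the "next" step, the first iteration would touch
segs[0] even when nr is zero, which is the failure mode the commit fixes.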
Dmitry Antipov [Tue, 16 Jan 2024 14:34:31 +0000 (17:34 +0300)]
block: bio-integrity: fix kcalloc() arguments order
When compiling with gcc version 14.0.1 20240116 (experimental)
and W=1, I've noticed the following warning:
block/bio-integrity.c: In function 'bio_integrity_map_user':
block/bio-integrity.c:339:38: warning: 'kcalloc' sizes specified with 'sizeof'
in the earlier argument and not in the later argument [-Wcalloc-transposed-args]
339 | bvec = kcalloc(sizeof(*bvec), nr_vecs, GFP_KERNEL);
| ^
block/bio-integrity.c:339:38: note: earlier argument should specify number of
elements, later size of each element
Since 'n' and 'size' arguments of 'kcalloc()' are multiplied to
calculate the final size, their actual order doesn't affect the
result and so this is not a bug. But it's still worth fixing.
Fixes: 492c5d455969 ("block: bio-integrity: directly map user buffers")
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240116143437.89060-1-dmantipov@yandex.ru
Signed-off-by: Jens Axboe <axboe@kernel.dk>
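The argument order that the kcalloc() commit above restores mirrors the
standard calloc() convention: element count first, element size second. A
userspace sketch, where struct vec and alloc_vecs() are hypothetical names
standing in for the kernel's bvec allocation:

```c
#include <stdlib.h>

/* Hypothetical element type standing in for a bio vector. */
struct vec {
	void *page;
	unsigned int len;
	unsigned int offset;
};

/* Count first, size second, matching the documented calloc() order. */
static struct vec *alloc_vecs(size_t nr_vecs)
{
	return calloc(nr_vecs, sizeof(struct vec));
}
```

As the commit message notes, swapping the two arguments yields the same
total size, so the change is about silencing the warning and matching the
documented convention rather than fixing behavior.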
Li RongQing [Sat, 13 Jan 2024 04:09:47 +0000 (12:09 +0800)]
virtio_blk: remove duplicate check if queue is broken in virtblk_done
virtqueue_enable_cb() calls virtqueue_poll(), which checks whether the
queue is broken at the beginning, so remove the virtqueue_is_broken() call.
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Mon, 15 Jan 2024 14:56:26 +0000 (22:56 +0800)]
sbitmap: remove stale comment in sbq_calc_wake_batch
After commit 106397376c036 ("sbitmap: fix batching wakeup"), we may wake
up more than one queue for each batch. Just remove the stale comment that
says we wake up only one queue for each batch.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20240115145626.665562-1-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Nicky Chorley [Sun, 14 Jan 2024 19:10:56 +0000 (19:10 +0000)]
block: Correct a documentation comment in blk-cgroup.c
Commit 99e603874366 ("blk-cgroup: pass a gendisk to the blkg allocation helpers") changed
blkg_alloc() to take a struct gendisk instead of a struct request_queue,
but the documentation comment still referred to q.
So, update that comment to refer to disk instead and fix a typo.
Signed-off-by: Nicky Chorley <ndchorley@gmail.com>
Link: https://lore.kernel.org/r/20240114191056.6992-1-ndchorley@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christophe JAILLET [Sun, 14 Jan 2024 09:00:59 +0000 (10:00 +0100)]
null_blk: Remove usage of the deprecated ida_simple_xx() API
ida_alloc() and ida_free() should be preferred to the deprecated
ida_simple_get() and ida_simple_remove().
This is less verbose.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/bf257b1078475a415cdc3344c6a750842946e367.1705222845.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 12 Jan 2024 16:12:20 +0000 (09:12 -0700)]
block: ensure we hold a queue reference when using queue limits
q_usage_counter is the only thing preventing the limits from changing
under us in __bio_split_to_limits, but blk_mq_submit_bio doesn't hold
it while calling into it.
Move the splitting inside the region where we know we've got a queue
reference. Ideally this could still remain a shared section of code, but
let's keep the fix simple and defer any refactoring here to later.
Reported-by: Christoph Hellwig <hch@lst.de>
Fixes: 900e08075202 ("block: move queue enter logic into blk_mq_submit_bio()")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 11 Jan 2024 13:57:04 +0000 (14:57 +0100)]
blk-mq: rename blk_mq_can_use_cached_rq
blk_mq_can_use_cached_rq doesn't just check if we can use the request,
but also performs the work to actually use it. Remove the _can in the
naming, and improve the comment describing the function.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240111135705.2155518-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christian Heusel [Thu, 11 Jan 2024 23:15:18 +0000 (00:15 +0100)]
block: print symbolic error name instead of error code
Utilize the %pe print specifier to get the symbolic error name as a
string (e.g. "-ENOMEM") in the log message instead of the error code to
increase its readability.
This change was suggested in
https://lore.kernel.org/all/92972476-0b1f-4d0a-9951-af3fc8bc6e65@suswa.mountain/
Signed-off-by: Christian Heusel <christian@heusel.eu>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20240111231521.1596838-1-christian@heusel.eu
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 12 Jan 2024 12:26:26 +0000 (20:26 +0800)]
blk-mq: fix IO hang from sbitmap wakeup race
In blk_mq_mark_tag_wait(), __add_wait_queue() may be reordered
with the following blk_mq_get_driver_tag() when getting the driver
tag fails.
Then in __sbitmap_queue_wake_up(), waitqueue_active() may not observe
the waiter added in blk_mq_mark_tag_wait() and wake up nothing, while
blk_mq_mark_tag_wait() still fails to get a driver tag.
This issue can be reproduced by running the following test in a loop; an
fio hang can be observed in < 30min when running it on a test VM on my
laptop.
modprobe -r scsi_debug
modprobe scsi_debug delay=0 dev_size_mb=4096 max_queue=1 host_max_queue=1 submit_queues=4
dev=`ls -d /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/* | head -1 | xargs basename`
fio --filename=/dev/"$dev" --direct=1 --rw=randrw --bs=4k --iodepth=1 \
--runtime=100 --numjobs=40 --time_based --name=test \
--ioengine=libaio
Fix the issue by adding one explicit barrier in blk_mq_mark_tag_wait(), which
is acceptable because it only runs when we have run out of tags.
Cc: Jan Kara <jack@suse.cz>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Reported-by: Changhui Zhong <czhong@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240112122626.4181044-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
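The race described in the sbitmap wakeup commit above has the shape of the
classic store-buffering pattern: each side stores one flag and then loads
the other. A minimal userspace sketch with C11 atomics follows. It is not
the kernel code; the two flags and both function names are hypothetical
stand-ins for the wait queue and the tag pool, and the waiter does not
actually consume a tag.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int waiter_queued;	/* stand-in: wait queue is non-empty */
static atomic_int tags_free;		/* stand-in: available driver tags */

/* Waiter side: publish the waiter, fence, then retry the tag. */
static bool waiter_must_sleep(void)
{
	atomic_store_explicit(&waiter_queued, 1, memory_order_relaxed);
	/* Full fence: the store above is ordered before the load below. */
	atomic_thread_fence(memory_order_seq_cst);
	if (atomic_load_explicit(&tags_free, memory_order_relaxed) > 0) {
		/* A tag showed up after all, so no need to wait. */
		atomic_store_explicit(&waiter_queued, 0, memory_order_relaxed);
		return false;
	}
	return true;
}

/* Waker side: release a tag, fence, then look for a waiter to wake. */
static bool waker_sees_waiter(void)
{
	atomic_fetch_add_explicit(&tags_free, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&waiter_queued, memory_order_relaxed) != 0;
}
```

With a full fence on both sides, it cannot happen that the waiter misses
the freed tag and the waker also misses the queued waiter. Without the
waiter-side fence that outcome is allowed, which corresponds to the hang
the commit describes.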
Jens Axboe [Wed, 10 Jan 2024 22:26:42 +0000 (15:26 -0700)]
Merge tag 'nvme-6.8-2024-1-10' of git://git.infradead.org/nvme into for-6.8/block
Pull NVMe changes from Keith:
"nvme follow-up updates for Linux 6.8
- tcp, fc, and rdma target fixes (Maurizio, Daniel, Hannes, Christoph)
- discard fixes and improvements (Christoph)
- timeout debug improvements (Keith, Max)
- various cleanups (Daniel, Max, Guixin)
- trace event string fixes (Arnd)
- shadow doorbell setup on reset fix (William)
- a write zeroes quirk for SK Hynix (Jim)"
* tag 'nvme-6.8-2024-1-10' of git://git.infradead.org/nvme: (25 commits)
nvmet-rdma: avoid circular locking dependency on install_queue()
nvmet-tcp: avoid circular locking dependency on install_queue()
nvme-pci: set doorbell config before unquiescing
nvmet-tcp: Fix the H2C expected PDU len calculation
nvme-tcp: enhance timeout kernel log
nvme-rdma: enhance timeout kernel log
nvme-pci: enhance timeout kernel log
nvme: trace: avoid memcpy overflow warning
nvmet: re-fix tracing strncpy() warning
nvme: introduce nvme_disk_is_ns_head helper
nvme-pci: disable write zeroes for SK Hynix BC901
nvmet-fcloop: Remove remote port from list when unlinking
nvmet-trace: avoid dereferencing pointer too early
nvmet-fc: remove unnecessary bracket
nvme: simplify the max_discard_segments calculation
nvme: fix max_discard_sectors calculation
nvme: also skip discard granularity updates in nvme_config_discard
nvme: update the explanation for not updating the limits in nvme_config_discard
nvmet-tcp: fix a missing endianess conversion in nvmet_tcp_try_peek_pdu
nvme-common: mark nvme_tls_psk_prio static
...
Hannes Reinecke [Fri, 8 Dec 2023 12:53:21 +0000 (13:53 +0100)]
nvmet-rdma: avoid circular locking dependency on install_queue()
nvmet_rdma_install_queue() is driven from the ->io_work workqueue
function, but will call flush_workqueue() which might trigger
->release_work() which in itself calls flush_work on ->io_work.
To avoid that, check for pending queues in disconnecting state
and return 'controller busy' once a certain threshold is reached.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Hannes Reinecke [Fri, 8 Dec 2023 12:53:20 +0000 (13:53 +0100)]
nvmet-tcp: avoid circular locking dependency on install_queue()
nvmet_tcp_install_queue() is driven from the ->io_work workqueue
function, but will call flush_workqueue() which might trigger
->release_work() which in itself calls flush_work on ->io_work.
To avoid that, check for pending queues in disconnecting state
and return 'controller busy' once a certain threshold is reached.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
William Butler [Wed, 10 Jan 2024 18:28:55 +0000 (18:28 +0000)]
nvme-pci: set doorbell config before unquiescing
During resets, if queues are unquiesced first, then the host can submit
IOs to the controller using shadow doorbell logic but the controller
won't be aware. This can lead to necessary MMIO doorbell writes not
being issued, causing requests to be delayed and to time out.
Signed-off-by: William Butler <wab@google.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Damien Le Moal [Wed, 10 Jan 2024 09:29:42 +0000 (18:29 +0900)]
block: fix partial zone append completion handling in req_bio_endio()
Partial completion of a zone append request is not allowed, but if a zone
append completion indicates a number of completed bytes different from
the original BIO size, only the BIO status is set to error. This leads
to bio_advance() not setting the BIO size to 0, and thus to bio_endio()
not being called at the end of req_bio_endio().
Make sure a partially completed zone append is failed and completed
immediately by forcing the completed number of bytes (nbytes) to be
equal to the BIO size, thus ensuring that bio_endio() is called.
Fixes: 297db731847e ("block: fix req_bio_endio append error handling")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240110092942.442334-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
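The completion logic described in the zone append commit above can be
modelled in a few lines of userspace C. This is a simplified sketch, not
the kernel function: struct fake_bio and zone_append_endio() are
hypothetical, and the error value is just -5 standing in for an I/O error
status.

```c
#include <stdbool.h>

/* Hypothetical, heavily reduced model of a BIO. */
struct fake_bio {
	unsigned int size;	/* bytes still outstanding */
	int status;		/* 0 = OK, negative = error */
	bool ended;		/* set where the completion callback would run */
};

/* Complete nbytes of a zone append; partial completions are failed whole. */
static void zone_append_endio(struct fake_bio *bio, unsigned int nbytes)
{
	if (bio->size != nbytes) {
		bio->status = -5;	/* stand-in for an I/O error */
		nbytes = bio->size;	/* force the whole BIO to complete */
	}
	bio->size -= nbytes;		/* models advancing the BIO */
	if (bio->size == 0)
		bio->ended = true;	/* models calling the end-I/O hook */
}
```

Without the nbytes override, a short completion would leave size non-zero,
so the end-I/O step would never run and the BIO would never be completed.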
Jens Axboe [Wed, 10 Jan 2024 15:29:53 +0000 (08:29 -0700)]
block/iocost: silence warning on 'last_period' potentially being unused
If CONFIG_TRACEPOINTS isn't enabled, we assign this variable but then
never use it. This can cause the compiler to complain about that:
block/blk-iocost.c:1264:6: warning: variable 'last_period' set but not used [-Wunused-but-set-variable]
1264 | u64 last_period, cur_period;
| ^
Rather than add ifdefs to guard this, just mark it __maybe_unused.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202401102335.GiWdeIo9-lkp@intel.com/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 10 Jan 2024 01:28:43 +0000 (18:28 -0700)]
Merge tag 'md-6.8-20240109' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.8/block
Pull MD fixes from Song:
"1. Sparse warning since v6.0, by Bart;
2. /proc/mdstat regression since v6.7, by Yu Kuai."
* tag 'md-6.8-20240109' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
md/raid1: Use blk_opf_t for read and write operations
md: Fix md_seq_ops() regressions
Bart Van Assche [Mon, 8 Jan 2024 00:12:23 +0000 (16:12 -0800)]
md/raid1: Use blk_opf_t for read and write operations
Use the type blk_opf_t for read and write operations instead of int. This
patch does not affect the generated code but fixes the following sparse
warning:
drivers/md/raid1.c:1993:60: sparse: sparse: incorrect type in argument 5 (different base types)
expected restricted blk_opf_t [usertype] opf
got int rw
Cc: Song Liu <song@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Fixes: 3c5e514db58f ("md/raid1: Use the new blk_opf_t type")
Cc: stable@vger.kernel.org # v6.0+
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202401080657.UjFnvQgX-lkp@intel.com/
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240108001223.23835-1-bvanassche@acm.org
Yu Kuai [Tue, 9 Jan 2024 13:39:57 +0000 (21:39 +0800)]
md: Fix md_seq_ops() regressions
Commit cf1b6d4441ff ("md: simplify md_seq_ops") introduced the following
regressions:
1) If the list all_mddevs is empty, personalities and unused devices are
no longer shown to the user.
2) If the seq_file buffer overflows in md_seq_show(), md_seq_start()
is called again, so personalities are shown to the user again.
3) If the seq_file buffer overflows in md_seq_stop(), seq_read_iter()
doesn't handle this, so unused devices are not shown to the user.
Fix the above problems by printing personalities and unused devices in
md_seq_show().
Fixes: cf1b6d4441ff ("md: simplify md_seq_ops")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240109133957.2975272-1-yukuai1@huaweicloud.com
Jens Axboe [Mon, 8 Jan 2024 18:51:57 +0000 (11:51 -0700)]
block: make __get_task_ioprio() easier to read
We don't need to do any gymnastics if we don't have an io_context
assigned at all, so just return early with our default priority.
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
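The early-return shape described in the __get_task_ioprio() commit above
can be sketched as follows. This is a userspace illustration under stated
assumptions: the structures, the macros, the 13-bit class shift, and the
fallback to a best-effort class are all hypothetical stand-ins chosen for
the example, not the kernel definitions.

```c
#include <stddef.h>

/* Assumed encoding for this sketch: class in the high bits, level below. */
#define PRIO_CLASS_SHIFT	13
#define PRIO_VALUE(cls, level)	(((cls) << PRIO_CLASS_SHIFT) | (level))
#define PRIO_CLASS(prio)	((prio) >> PRIO_CLASS_SHIFT)
#define PRIO_CLASS_NONE		0
#define PRIO_CLASS_BE		2
#define PRIO_DEFAULT		PRIO_VALUE(PRIO_CLASS_NONE, 0)

struct fake_ioc {
	int ioprio;
};

struct fake_task {
	struct fake_ioc *ioc;	/* NULL when no I/O context was ever set up */
	int fallback_level;	/* hypothetical level derived from nice */
};

static int task_ioprio(const struct fake_task *p)
{
	/* Common case: no I/O context at all, so return the default early. */
	if (!p->ioc)
		return PRIO_DEFAULT;

	/* Context exists but no class was set: derive one from the task. */
	if (PRIO_CLASS(p->ioc->ioprio) == PRIO_CLASS_NONE)
		return PRIO_VALUE(PRIO_CLASS_BE, p->fallback_level);

	return p->ioc->ioprio;
}
```

Putting the NULL check first keeps the hot path to a single load and
compare, which matters for a function called once per I/O.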
Jens Axboe [Mon, 8 Jan 2024 18:50:16 +0000 (11:50 -0700)]
block: move __get_task_ioprio() into header file
We call this once per IO, which can be millions of times per second.
Since nobody really uses io priorities, or at least it isn't very
common, this is all wasted time and can amount to as much as 3% of
the total kernel time.
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Maurizio Lombardi [Fri, 5 Jan 2024 08:14:44 +0000 (09:14 +0100)]
nvmet-tcp: Fix the H2C expected PDU len calculation
The nvmet_tcp_handle_h2c_data_pdu() function should take into
consideration the possibility that the header digest and/or the data
digests are enabled when calculating the expected PDU length, before
comparing it to the value stored in cmd->pdu_len.
Fixes: efa56305908b ("nvmet-tcp: Fix a kernel panic when host sends an invalid H2C PDU length")
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
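The length arithmetic described in the H2C PDU commit above can be sketched
like this. The helper name and parameters are hypothetical, the 4-byte
digest length is an assumption (CRC32C digests), and clamping to zero on a
too-short PDU is a choice made for this example rather than the driver's
behavior.

```c
#include <stdbool.h>
#include <stdint.h>

#define DIGEST_LEN 4u	/* assumed: header and data digests are 4 bytes each */

/*
 * Payload bytes carried by a data PDU of total length plen, given the
 * header length and which digests are enabled on the queue.
 */
static uint32_t h2c_payload_len(uint32_t plen, uint32_t hdr_len,
				bool hdgst, bool ddgst)
{
	uint32_t overhead = hdr_len;

	if (hdgst)
		overhead += DIGEST_LEN;
	if (ddgst)
		overhead += DIGEST_LEN;

	/* Example policy: a PDU shorter than its overhead carries no data. */
	return plen > overhead ? plen - overhead : 0;
}
```

The result is what would then be compared against the length the target
expects for the command; forgetting the digests makes a valid PDU look 4 or
8 bytes too long.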
Max Gurtovoy [Sun, 7 Jan 2024 00:29:50 +0000 (02:29 +0200)]
nvme-tcp: enhance timeout kernel log
Print the command_id alongside blk-mq's tag to help match commands with
protocol wire traces and logs.
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Max Gurtovoy [Sun, 7 Jan 2024 00:29:49 +0000 (02:29 +0200)]
nvme-rdma: enhance timeout kernel log
Print the command_id alongside blk-mq's tag to help match commands with
protocol wire traces and logs.
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Keith Busch [Wed, 6 Dec 2023 18:48:30 +0000 (10:48 -0800)]
nvme-pci: enhance timeout kernel log
Kernel configs don't necessarily have opcode decoding, and some opcodes
are not even decodable. It is still interesting for debugging SSD issues
to know what opcode is timing out, what request type it came from, and
the data size (if applicable).
Also print the command_id alongside blk-mq's tag to help match commands
with protocol wire traces and firmware logs.
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Damien Le Moal [Sun, 7 Jan 2024 07:22:12 +0000 (16:22 +0900)]
block: Treat sequential write preferred zone type as invalid
With the removal of the support for host-aware zoned devices,
blk_revalidate_zone_cb() should never see the zone type
BLK_ZONE_TYPE_SEQWRITE_PREF (sequential write preferred zones). Treat
this zone type as being invalid.
Fixes: 7437bb73f087 ("block: remove support for the host aware zone model")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240107072212.1071080-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 28 Dec 2023 07:51:41 +0000 (07:51 +0000)]
block: remove disk_clear_zoned
disk_clear_zoned is unused now that the last warts of the host-aware
model support in sd are gone.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20231228075141.362560-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 28 Dec 2023 07:51:40 +0000 (07:51 +0000)]
sd: remove the !ZBC && blk_queue_is_zoned case in sd_read_block_characteristics
Now that host-aware devices are always treated as conventional this case
can't happen.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20231228075141.362560-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Arnd Bergmann [Wed, 3 Jan 2024 15:56:56 +0000 (16:56 +0100)]
nvme: trace: avoid memcpy overflow warning
A previous patch introduced a struct_group() in nvme_common_command to help
stringop fortification figure out the length of the fields, but one function
is not currently using them:
In file included from drivers/nvme/target/core.c:7:
In file included from include/linux/string.h:254:
include/linux/fortify-string.h:592:4: error: call to '__read_overflow2_field' declared with 'warning' attribute: detected read beyond size of field (2nd parameter); maybe use struct_group()? [-Werror,-Wattribute-warning]
__read_overflow2_field(q_size_field, size);
^
Change this one to use the correct field name to avoid the warning.
Fixes: 5c629dc9609dc ("nvme: use struct group for generic command dwords")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Arnd Bergmann [Wed, 3 Jan 2024 15:56:55 +0000 (16:56 +0100)]
nvmet: re-fix tracing strncpy() warning
An earlier patch had tried to address a warning about a string copy with
missing zero termination:
drivers/nvme/target/trace.h:52:3: warning: ‘strncpy’ specified bound 32 equals destination size [-Wstringop-truncation]
The new version causes a different warning with some compiler versions, notably
gcc-9 and gcc-10, and also misses the zero padding that was apparently done
intentionally in the original code:
drivers/nvme/target/trace.h:56:2: error: 'strncpy' specified bound depends on the length of the source argument [-Werror=stringop-overflow=]
Change it to use strscpy_pad() with the original length, which will give
a properly padded and zero-terminated string as well as avoiding the warning.
Fixes: d86481e924a7 ("nvmet: use min of device_path and disk len")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
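The behavior the strncpy() commit above relies on, a copy that is always
zero-terminated and zero-pads the rest of the destination, can be
approximated in userspace as below. copy_pad() is a hypothetical helper
written for this example; unlike the kernel helper it does not report
truncation, it only returns the number of characters copied.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy src into dst (size bytes), always terminating the result and
 * filling any unused tail of dst with zero bytes.
 */
static size_t copy_pad(char *dst, const char *src, size_t size)
{
	size_t len = 0;

	if (size == 0)
		return 0;

	/* Copy at most size - 1 characters so a terminator always fits. */
	while (len < size - 1 && src[len] != '\0')
		len++;
	memcpy(dst, src, len);

	/* Terminator plus padding in one step. */
	memset(dst + len, 0, size - len);
	return len;
}
```

The padding matters when the whole fixed-size buffer is later copied out,
as in a trace record, because stale bytes after the terminator would
otherwise leak into the output.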
Guixin Liu [Wed, 27 Dec 2023 09:31:06 +0000 (17:31 +0800)]
nvme: introduce nvme_disk_is_ns_head helper
We currently rely on gendisk's file operations (fops) to distinguish
between a namespace head (ns_head) and a regular namespace. To enhance
code readability, introduce a helper function.
Additionally, we must ensure that the device is not an ns_head before
calling nvme_get_ns_from_dev(). To enforce this, add a WARN_ON check
within the nvme_get_ns_from_dev().
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Liu Song <liusong@linux.alibaba.com>
[include fix: https://lore.kernel.org/oe-kbuild-all/202401031943.0N72Tkji-lkp@intel.com/]
Signed-off-by: Keith Busch <kbusch@kernel.org>
Jim.Lin [Tue, 28 Nov 2023 02:57:37 +0000 (10:57 +0800)]
nvme-pci: disable write zeroes for SK Hynix BC901
The write zeroes command on the SK Hynix BC901 drive causes a Chromebook to
take more than 20 minutes to switch to developer mode.
Disabling write zeroes fixes this issue, and this has been verified with SK Hynix.
Signed-off-by: Jim.Lin <jim.lin@siliconmotion.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Daniel Wagner [Mon, 18 Dec 2023 15:30:53 +0000 (16:30 +0100)]
nvmet-fcloop: Remove remote port from list when unlinking
The remote port is removed too late from fcloop_nports list. Remove it
when port is unregistered.
This prevents a busy loop in fcloop_exit, because it is possible the
remote port is found in the list and thus we will never progress.
The kernel log will be spammed with
nvme_fcloop: fcloop_exit: Failed deleting remote port
nvme_fcloop: fcloop_exit: Failed deleting target port
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
liyouhong [Tue, 26 Dec 2023 09:57:01 +0000 (17:57 +0800)]
drivers/block/xen-blkback/common.h: Fix spelling typo in comment
Fix spelling typo in comment.
Reported-by: k2ci <kernel-bot@kylinos.cn>
Signed-off-by: liyouhong <liyouhong@kylinos.cn>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20231226095701.172080-1-liyouhong@kylinos.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 19 Dec 2023 01:28:33 +0000 (09:28 +0800)]
blk-cgroup: fix rcu lockdep warning in blkg_lookup()
blkg_lookup() is called with either queue_lock or the RCU read lock held, so
use rcu_dereference_check(lockdep_is_held(&q->queue_lock)) for
retrieving 'blkg'. This models the check exactly, covering both the
queue lock and the RCU read lock.
Fix the lockdep warning "block/blk-cgroup.h:254 suspicious rcu_dereference_check() usage!"
from blkg_lookup().
Tested-by: Changhui Zhong <czhong@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Fixes: 83462a6c971c ("blkcg: Drop unnecessary RCU read [un]locks from blkg_conf_prep/finish()")
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20231219012833.2129540-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Daniel Vacek [Thu, 4 Jan 2024 18:00:30 +0000 (19:00 +0100)]
blk-cgroup: don't use removal safe list iterators
Commit f1c006f1c685 moved deletion of the list blkg->q_node from
blkg_destroy() to blkg_free_workfn(). Switch to using the plain list
iterators, as we don't need removal protection anymore.
Signed-off-by: Daniel Vacek <neelx@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20240104180031.148148-1-neelx@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 3 Jan 2024 08:16:22 +0000 (08:16 +0000)]
block: floor the discard granularity to the physical block size
Discarding less than a physical block doesn't make sense. This fixes
the existing behavior for zram before the recent changes to default
the discard granularity to the logical block size, and is also a
generally useful sanity check.
Fixes: 3753039def5d ("zram: use the default discard granularity")
Reported-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240103081622.508754-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
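The flooring rule from the discard granularity commit above amounts to a
simple clamp. A userspace sketch, where struct fake_limits and
set_physical_block_size() are hypothetical stand-ins for the queue limits
and their setter, and where placing the clamp in the setter is an
assumption of this example:

```c
/* Hypothetical, reduced model of a queue's limits. */
struct fake_limits {
	unsigned int logical_block_size;
	unsigned int physical_block_size;
	unsigned int discard_granularity;
};

static void set_physical_block_size(struct fake_limits *lim,
				    unsigned int size)
{
	lim->physical_block_size = size;

	/* A physical block can never be smaller than a logical block. */
	if (lim->physical_block_size < lim->logical_block_size)
		lim->physical_block_size = lim->logical_block_size;

	/* Floor: discarding less than a physical block makes no sense. */
	if (lim->discard_granularity < lim->physical_block_size)
		lim->discard_granularity = lim->physical_block_size;
}
```

A device that defaults its discard granularity to the logical block size
(512 bytes here) but reports 4096-byte physical blocks ends up with a
4096-byte granularity, while a larger pre-set granularity is left alone.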
Daniel Wagner [Mon, 18 Dec 2023 15:30:51 +0000 (16:30 +0100)]
nvmet-trace: avoid dereferencing pointer too early
The first command issued from the host to the target is the fabrics
connect command. At this point, neither the target queue nor the
controller have been allocated. But we already try to trace this command
in nvmet_req_init.
Reported by KASAN.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Daniel Wagner [Mon, 18 Dec 2023 15:30:50 +0000 (16:30 +0100)]
nvmet-fc: remove unnecessary bracket
There is no need for the bracket around the identifier. Remove it.
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Christoph Hellwig [Tue, 26 Dec 2023 08:58:44 +0000 (08:58 +0000)]
nvme: simplify the max_discard_segments calculation
Just stash away the DMRL value in the nvme_ctrl structure, and leave
all interpretation to nvme_config_discard, where we know DSM is
supported by the time we're configuring the number of segments.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Christoph Hellwig [Tue, 26 Dec 2023 08:58:43 +0000 (08:58 +0000)]
nvme: fix max_discard_sectors calculation
ctrl->max_discard_sectors stores a value that is potentially based on
the DMRSL field in Identify Controller, which is in units of LBAs and
thus dependent on the Format of a namespace.
Fix this by moving the calculation of max_discard_sectors entirely
into nvme_config_discard and replacing the ctrl->max_discard_sectors
value with a local variable so that the calculation is always
namespace-specific.
Fixes: 1a86924e4f46 ("nvme: fix interpretation of DMRSL")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
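Why the max_discard_sectors commit above makes the calculation
namespace-specific becomes clear from the unit conversion: a limit expressed
in LBAs maps to a different number of 512-byte sectors for each LBA format.
A sketch with a hypothetical helper name, assuming lba_shift is the log2 of
the namespace's LBA size and is at least 9:

```c
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors */

/*
 * Convert a controller limit given in LBAs (such as DMRSL) into 512-byte
 * sectors for one namespace. Assumes lba_shift >= SECTOR_SHIFT.
 */
static uint64_t dmrsl_to_sectors(uint32_t dmrsl, unsigned int lba_shift)
{
	return (uint64_t)dmrsl << (lba_shift - SECTOR_SHIFT);
}
```

The same DMRSL of 8 means 8 sectors on a namespace formatted with 512-byte
LBAs but 64 sectors on one formatted with 4096-byte LBAs, so a single value
cached per controller cannot be right for both.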
Christoph Hellwig [Tue, 26 Dec 2023 08:58:42 +0000 (08:58 +0000)]
nvme: also skip discard granularity updates in nvme_config_discard
Don't just skip the discard sectors and segments but also the granularity
if a value was already set before.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Christoph Hellwig [Tue, 26 Dec 2023 08:58:41 +0000 (08:58 +0000)]
nvme: update the explanation for not updating the limits in nvme_config_discard
Expand the comment a bit to explain what is going on.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Christoph Hellwig [Tue, 26 Dec 2023 08:13:29 +0000 (08:13 +0000)]
nvmet-tcp: fix a missing endianess conversion in nvmet_tcp_try_peek_pdu
No, a __le32 cast doesn't magically byteswap on big-endian systems.
Fixes: 70525e5d82f6 ("nvmet-tcp: peek icreq before starting TLS")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
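The point of the endianness commit above is that a cast only changes the
type the compiler sees, while a real conversion has to assemble the value
from its bytes. A portable userspace sketch (get_le32() is a hypothetical
helper name for this example, not a kernel function):

```c
#include <stdint.h>

/*
 * Decode a 32-bit little-endian value from wire bytes. This gives the
 * same result on little-endian and big-endian hosts, because it never
 * reinterprets memory in host byte order.
 */
static uint32_t get_le32(const unsigned char *p)
{
	return (uint32_t)p[0] |
	       ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) |
	       ((uint32_t)p[3] << 24);
}
```

On a little-endian host a plain cast happens to produce the same number,
which is why this class of bug typically goes unnoticed until the code runs
on a big-endian machine.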
Christoph Hellwig [Tue, 26 Dec 2023 08:14:12 +0000 (08:14 +0000)]
nvme-common: mark nvme_tls_psk_prio static
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Max Gurtovoy [Mon, 1 Jan 2024 10:35:27 +0000 (12:35 +0200)]
nvme: remove unused definition
There are no users of the NVMF_AUTH_HASH_LEN macro.
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Guixin Liu [Sun, 31 Dec 2023 06:56:44 +0000 (14:56 +0800)]
nvme: tcp: remove unnecessary goto statement
There is no requirement to call nvme_tcp_free_queue() for queue
deallocation if the pskid is null or the queue allocation fails, as
the NVME_TCP_Q_ALLOCATED flag would not be set in such scenarios.
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Maurizio Lombardi [Fri, 22 Dec 2023 15:17:50 +0000 (16:17 +0100)]
nvmet-tcp: remove boilerplate code
Simplify the nvmet_tcp_handle_h2c_data_pdu() function by removing
boilerplate code.
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Maurizio Lombardi [Fri, 22 Dec 2023 15:17:49 +0000 (16:17 +0100)]
nvmet-tcp: fix a crash in nvmet_req_complete()
In nvmet_tcp_handle_h2c_data_pdu(), if the host sends a data_offset
different from rbytes_done, the driver ends up calling nvmet_req_complete()
passing a status error.
The problem is that at this point cmd->req is not yet initialized, so
the kernel will crash after dereferencing a NULL pointer.
Fix the bug by replacing the call to nvmet_req_complete() with
nvmet_tcp_fatal_error().
Fixes: 872d26a391da ("nvmet-tcp: add NVMe over TCP target driver")
Reviewed-by: Keith Busch <kbsuch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Maurizio Lombardi [Fri, 22 Dec 2023 15:17:48 +0000 (16:17 +0100)]
nvmet-tcp: Fix a kernel panic when host sends an invalid H2C PDU length
If the host sends an H2CData command with an invalid DATAL,
the kernel may crash in nvmet_tcp_build_pdu_iovec().
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
lr : nvmet_tcp_io_work+0x6ac/0x718 [nvmet_tcp]
Call trace:
process_one_work+0x174/0x3c8
worker_thread+0x2d0/0x3e8
kthread+0x104/0x110
Fix the bug by raising a fatal error if DATAL isn't coherent
with the packet size.
Also, the PDU length should never exceed the MAXH2CDATA parameter which
has been communicated to the host in nvmet_tcp_handle_icreq().
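The two conditions described above can be sketched as a small validation function. This is a simplified Python model, not the nvmet-tcp code: the function and parameter names are hypothetical, and it assumes that "coherent with the packet size" means DATAL equals the PDU length minus the header and digest lengths.

```python
def h2c_data_pdu_is_valid(plen, hdr_len, hdgst_len, ddgst_len, datal, maxh2cdata):
    """Return True if DATAL matches the PDU length and respects MAXH2CDATA.

    All parameter names are illustrative; this is not the nvmet-tcp code.
    """
    expected_data_len = plen - hdr_len - hdgst_len - ddgst_len
    if datal != expected_data_len:
        return False  # DATAL is not coherent with the packet size
    if datal > maxh2cdata:
        return False  # larger than the limit advertised to the host
    return True
```

In this model a failed check would correspond to raising a fatal error on the queue instead of continuing to build the iovec.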
Fixes: 872d26a391da ("nvmet-tcp: add NVMe over TCP target driver")
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Christoph Hellwig [Thu, 28 Dec 2023 07:55:45 +0000 (07:55 +0000)]
mtd_blkdevs: use the default discard granularity
The discard granularity now defaults to a single sector, so don't set
that value explicitly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Richard Weinberger <richard@nod.at>
Link: https://lore.kernel.org/r/20231228075545.362768-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 28 Dec 2023 07:55:44 +0000 (07:55 +0000)]
bcache: use the default discard granularity
The discard granularity now defaults to a single sector, so don't set
that value explicitly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231228075545.362768-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 28 Dec 2023 07:55:43 +0000 (07:55 +0000)]
zram: use the default discard granularity
The discard granularity now defaults to a single sector, so don't set
that value explicitly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231228075545.362768-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 28 Dec 2023 07:55:42 +0000 (07:55 +0000)]
null_blk: use the default discard granularity
The discard granularity now defaults to a single sector, so don't set
that value explicitly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231228075545.362768-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 28 Dec 2023 07:55:41 +0000 (07:55 +0000)]
nbd: use the default discard granularity
The discard granularity now defaults to a single sector, so don't set
that value explicitly. Also don't bother clearing it as a discard
granularity without discard_sectors doesn't mean anything.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231228075545.362768-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 28 Dec 2023 07:55:40 +0000 (07:55 +0000)]
ubd: use the default discard granularity
The discard granularity now defaults to a single sector, so don't set
that value explicitly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Richard Weinberger <richard@nod.at>
Link: https://lore.kernel.org/r/20231228075545.362768-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 28 Dec 2023 07:55:39 +0000 (07:55 +0000)]
block: default the discard granularity to sector size
Currently the discard granularity defaults to 0 and must be initialized by
any driver that wants to support discard. Default to the sector size
instead, which is the smallest possible value, and a very useful default.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231228075545.362768-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 28 Dec 2023 07:55:38 +0000 (07:55 +0000)]
bcache: discard_granularity should not be smaller than a sector
Just like all block I/O, discards are in units of sectors. Thus setting a
smaller than sector size discard limit in case of > 512 byte sectors in
bcache doesn't make sense. Always set the discard granularity to 512
bytes instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231228075545.362768-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 28 Dec 2023 07:55:37 +0000 (07:55 +0000)]
block: remove two comments in bio_split_discard
A zero discard_granularity is not treated the same as a single-block one,
and not having any segments after taking alignment is perfectly fine
and does not need a warning.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231228075545.362768-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 27 Dec 2023 09:23:05 +0000 (09:23 +0000)]
block: rename and document BLK_DEF_MAX_SECTORS
Give BLK_DEF_MAX_SECTORS a _CAP postfix and document what it is used for.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231227092305.279567-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 27 Dec 2023 09:23:04 +0000 (09:23 +0000)]
loop: don't abuse BLK_DEF_MAX_SECTORS
BLK_DEF_MAX_SECTORS despite the confusing name is the default cap for
the max_sectors limits. Don't use it to initialize max_hw_sectors, which
is a hardware / driver capability.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231227092305.279567-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 27 Dec 2023 09:23:03 +0000 (09:23 +0000)]
aoe: don't abuse BLK_DEF_MAX_SECTORS
BLK_DEF_MAX_SECTORS despite the confusing name is the default cap for
the max_sectors limits. Don't use it to initialize max_hw_sectors, which
is a hardware / driver capability.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231227092305.279567-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 27 Dec 2023 09:23:02 +0000 (09:23 +0000)]
null_blk: don't cap max_hw_sectors to BLK_DEF_MAX_SECTORS
null_blk has some rather odd capping of the max_hw_sectors value to
BLK_DEF_MAX_SECTORS, which doesn't make sense - max_hw_sectors is the
hardware limit, and BLK_DEF_MAX_SECTORS despite the confusing name is the
default cap for the max_sectors field used for normal file system I/O.
Remove all the capping, and simply leave it to the block layer or
user to take up or not all of that for file system I/O.
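The relationship between the two limits can be sketched as follows. This is a simplified Python model rather than the block layer code; the function name is hypothetical and the cap value is only an illustrative stand-in for BLK_DEF_MAX_SECTORS.

```python
DEFAULT_CAP = 2560  # illustrative stand-in for BLK_DEF_MAX_SECTORS (assumption)


def effective_max_sectors(max_hw_sectors: int, default_cap: int = DEFAULT_CAP) -> int:
    """max_sectors for file system I/O: the hardware limit, capped by a default."""
    return min(max_hw_sectors, default_cap)


# The driver reports its real hardware limit; only max_sectors is capped.
print(effective_max_sectors(65536))  # large hardware limit, capped
print(effective_max_sectors(1024))   # small hardware limit, used as-is
```

Capping max_hw_sectors itself, as null_blk did, would hide the hardware limit from anything that wants to raise max_sectors later.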
Fixes: ea17fd354ca8 ("null_blk: Allow controlling max_hw_sectors limit")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231227092305.279567-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 27 Dec 2023 08:20:20 +0000 (08:20 +0000)]
loop: don't update discard limits from loop_set_status
loop_set_status doesn't change anything relevant to the discard and
write_zeroes setting, so don't bother calling loop_config_discard.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231227082020.249427-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 26 Dec 2023 09:07:47 +0000 (09:07 +0000)]
blk-wbt: remove the separate write cache tracking
Use the queue wide write back cache tracking instead of duplicating the
value in struct rq_wb.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231226090747.204969-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 21 Dec 2023 07:05:38 +0000 (08:05 +0100)]
block: reject invalid operation in submit_bio_noacct
submit_bio_noacct allows completely invalid operations, or operations
that are not supported in the bio path. Extend the existing switch
statement to reject all invalid types.
Move the code point for REQ_OP_ZONE_APPEND so that it's not right in the
middle of the zone management operations and the switch statement can
follow the numerical order of the operations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231221070538.1112446-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Randy Dunlap [Fri, 22 Dec 2023 06:19:08 +0000 (22:19 -0800)]
drbd: actlog: fix kernel-doc warnings and spelling
Fix all kernel-doc warnings in drbd_actlog.c:
drbd_actlog.c:963: warning: No description found for return value of 'drbd_rs_begin_io'
drbd_actlog.c:1015: warning: Function parameter or member 'peer_device' not described in 'drbd_try_rs_begin_io'
drbd_actlog.c:1015: warning: Excess function parameter 'device' description in 'drbd_try_rs_begin_io'
drbd_actlog.c:1015: warning: No description found for return value of 'drbd_try_rs_begin_io'
drbd_actlog.c:1197: warning: No description found for return value of 'drbd_rs_del_all'
Fix one spelling error (s/ore/or/).
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Cc: <drbd-dev@lists.linbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <linux-block@vger.kernel.org>
Link: https://lore.kernel.org/r/20231222061909.8791-1-rdunlap@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kundan Kumar [Fri, 22 Dec 2023 10:17:07 +0000 (15:47 +0530)]
block: skip start/end time stamping for passthrough IO
Commit 41fa722239b4 ("blk-mq: do not include passthrough requests in I/O
accounting") disables I/O accounting for passthrough requests. Since tools
like 'iostat' do not show anything useful for passthrough I/O, it's
wasteful to do start/end time-stamping. So do away with that.
Avoiding the time-stamping improves the I/O performance by ~7%.
Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20231222101707.6921-1-kundan.kumar@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 21 Dec 2023 21:44:17 +0000 (14:44 -0700)]
Merge tag 'nvme-6.8-2023-12-21' of git://git.infradead.org/nvme into for-6.8/block
Pull NVMe updates from Keith:
"nvme updates for Linux 6.8
- nvme fabrics spec updates (Guixin, Max)
- nvme target updates (Guixin, Evan)
- nvme attribute refactoring (Daniel)
- nvme-fc numa fix (Keith)"
* tag 'nvme-6.8-2023-12-21' of git://git.infradead.org/nvme:
nvme-fc: set numa_node after nvme_init_ctrl
nvme-fabrics: don't check discovery ioccsz/iorcsz
nvmet: configfs: use ctrl->instance to track passthru subsystems
nvme: repack struct nvme_ns_head
nvme: add csi, ms and nuse to sysfs
nvme: rename ns attribute group
nvme: refactor ns info setup function
nvme: refactor ns info helpers
nvme: move ns id info to struct nvme_ns_head
nvmet: remove cntlid_min and cntlid_max check in nvmet_alloc_ctrl
nvmet: allow identical cntlid_min and cntlid_max settings
nvme-fabrics: check ioccsz and iorcsz
nvme: introduce nvme_check_ctrl_fabric_info helper
Keith Busch [Mon, 18 Dec 2023 23:22:24 +0000 (15:22 -0800)]
nvme-fc: set numa_node after nvme_init_ctrl
nvme_init_ctrl() resets numa_node to NUMA_NO_NODE, so be sure to set the
desired value after that function call so it won't be overwritten.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Max Gurtovoy [Wed, 20 Dec 2023 09:27:45 +0000 (11:27 +0200)]
nvme-fabrics: don't check discovery ioccsz/iorcsz
IOCCSZ and IORCSZ are reserved for discovery controllers. Avoid checking
their values during identify controller phase.
Fixes: 2fcd3ab39826 ("nvme-fabrics: check ioccsz and iorcsz")
Reported-by: Daniel Wagner <dwagner@suse.de>
Tested-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Jens Axboe [Thu, 21 Dec 2023 03:32:12 +0000 (20:32 -0700)]
block: export disk_clear_zoned()
A previous commit split disk_set_zoned(..., bool) into not taking an
argument for whether to set or clear, and instead added
disk_clear_zoned() as the counterpart. However, that commit neglected
to export the new symbol, causing failures for modular drivers that
used it.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: d73e93b4dfab ("block: simplify disk_set_zoned")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Sun, 17 Dec 2023 16:53:59 +0000 (17:53 +0100)]
sd: only call disk_clear_zoned when needed
disk_clear_zoned only needs to be called when a device reported zone
managed mode first and we clear it. Add a check so that disk_clear_zoned
isn't called on devices that were never zoned.
This avoids a fairly expensive queue freezing when revalidating
conventional devices.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20231217165359.604246-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Sun, 17 Dec 2023 16:53:58 +0000 (17:53 +0100)]
block: simplify disk_set_zoned
Only use disk_set_zoned to actually enable zoned device support.
For clearing it, call disk_clear_zoned, which is renamed from
disk_clear_zone_settings and now directly clears the zoned flag as
well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20231217165359.604246-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Sun, 17 Dec 2023 16:53:57 +0000 (17:53 +0100)]
block: remove support for the host aware zone model
When zones were first added to the SCSI and ATA specs, two different
models were supported (in addition to the drive managed one that
is invisible to the host):
- host managed, where for non-conventional zones there is a strict
requirement to write at the write pointer, or else an error is returned
- host aware, where a write pointer is maintained if writes always happen
at it, otherwise it is left in an under-defined state and the
sequential write preferred zones behave like conventional zones
(probably very badly performing ones, though)
Not surprisingly this lukewarm model didn't prove to be very useful and
was finally removed from the ZBC and SBC specs (NVMe never implemented
it). Due to the easily disappearing write pointer host software
could never rely on the write pointer to actually be useful for say
recovery.
Fortunately only a few HDD prototypes shipped using this model which
never made it to mass production. Drop the support before it is too
late. Note that any such host aware prototype HDD can still be used
with Linux as we'll now treat it as a conventional HDD.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20231217165359.604246-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Sun, 17 Dec 2023 16:53:56 +0000 (17:53 +0100)]
virtio_blk: remove the broken zone revalidation support
virtblk_revalidate_zones is called unconditionally from
virtblk_config_changed_work from the virtio config_changed callback.
virtblk_revalidate_zones is a bit odd in that it re-clears the zoned
state for host aware or non-zoned devices, which isn't needed unless the
zoned mode changed - but a zone mode change to a host managed model isn't
handled at all, and virtio_blk also doesn't handle any config change
other than a capacity change (and even if it did, the upper layers
above virtio_blk wouldn't handle it very well).
But even the useful case of a size change that would add or remove
zones isn't handled properly as blk_revalidate_disk_zones expects the
device capacity to cover all zones, but the capacity is only updated
after virtblk_revalidate_zones.
As this code appears to be entirely untested and is getting in the way
remove it for now, but it can be readded in a fixed version with
proper test coverage if needed.
Fixes: 95bfec41bd3d ("virtio-blk: add support for zoned block devices")
Fixes: f1ba4e674feb ("virtio-blk: fix to match virtio spec")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20231217165359.604246-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Sun, 17 Dec 2023 16:53:55 +0000 (17:53 +0100)]
virtio_blk: cleanup zoned device probing
Move reading and checking the zoned model from virtblk_probe_zoned_device
into the caller, leaving only the code to perform the actual setup for
host managed zoned devices in virtblk_probe_zoned_device.
This allows sharing the model reading and checking between builds with
and without CONFIG_BLK_DEV_ZONED, and improve it for the
!CONFIG_BLK_DEV_ZONED case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20231217165359.604246-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 19 Dec 2023 22:49:23 +0000 (15:49 -0700)]
Merge tag 'md-next-20231219' of https://git./linux/kernel/git/song/md into for-6.8/block
Pull MD updates from Song:
"1. Remove deprecated flavors, by Song Liu;
2. raid1 read error check support, by Li Nan;
3. Better handle events off-by-1 case, by Alex Lyakas."
* tag 'md-next-20231219' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
md: Remove deprecated CONFIG_MD_FAULTY
md: Remove deprecated CONFIG_MD_MULTIPATH
md: Remove deprecated CONFIG_MD_LINEAR
md/raid1: support read error check
md: factor out a helper exceed_read_errors() to check read_errors
md: When assembling the array, consult the superblock of the freshest device
md/raid1: remove unnecessary null checking
Song Liu [Thu, 14 Dec 2023 22:21:07 +0000 (14:21 -0800)]
md: Remove deprecated CONFIG_MD_FAULTY
md-faulty has been marked as deprecated for 2.5 years. Remove it.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Neil Brown <neilb@suse.de>
Cc: Guoqing Jiang <guoqing.jiang@linux.dev>
Cc: Mateusz Grzonka <mateusz.grzonka@intel.com>
Cc: Jes Sorensen <jes@trained-monkey.org>
Signed-off-by: Song Liu <song@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20231214222107.2016042-4-song@kernel.org
Song Liu [Thu, 14 Dec 2023 22:21:06 +0000 (14:21 -0800)]
md: Remove deprecated CONFIG_MD_MULTIPATH
md-multipath has been marked as deprecated for 2.5 years. Remove it.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Neil Brown <neilb@suse.de>
Cc: Guoqing Jiang <guoqing.jiang@linux.dev>
Cc: Mateusz Grzonka <mateusz.grzonka@intel.com>
Cc: Jes Sorensen <jes@trained-monkey.org>
Signed-off-by: Song Liu <song@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20231214222107.2016042-3-song@kernel.org
Song Liu [Thu, 14 Dec 2023 22:21:05 +0000 (14:21 -0800)]
md: Remove deprecated CONFIG_MD_LINEAR
md-linear has been marked as deprecated for 2.5 years. Remove it.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Neil Brown <neilb@suse.de>
Cc: Guoqing Jiang <guoqing.jiang@linux.dev>
Cc: Mateusz Grzonka <mateusz.grzonka@intel.com>
Cc: Jes Sorensen <jes@trained-monkey.org>
Signed-off-by: Song Liu <song@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20231214222107.2016042-2-song@kernel.org
Evan Burgess [Mon, 18 Dec 2023 19:03:32 +0000 (19:03 +0000)]
nvmet: configfs: use ctrl->instance to track passthru subsystems
To prevent enabling more than one passthrough subsystem per NVMe
controller, passthru.c maintains an xarray indexed by cntlid values.
Passthrough for a given nvmet subsystem cannot be enabled by configfs
if the subsystem's passthru_ctrl->cntlid value is already accounted
for in the xarray.
However, according to the NVMe spec (rev 2.0c, p.145), "The Controller
ID (CNTLID) value returned in the Identify Controller data structure
may be used to uniquely identify a controller within an NVM subsystem,"
meaning that cntlid values are not guaranteed to be globally unique
across multiple subsystems. Instead, the cntlid only uniquely
identifies multiple controllers _within_ a subsystem.
As a result, multiple unique & valid NVMe targets can be blocked from
enabling passthrough at the same time if their controllers share cntlid
values, a behavior allowed by the spec. Fix this by indexing the xarray
with passthru_ctrl->instance values, which are allocated per
controller by IDA and thus should be truly unique.
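The collision can be modelled in a few lines of Python. This is a toy model, not the nvmet code: a dict stands in for the xarray, and the subsystem names and numbers are invented for the example.

```python
# Each tuple is (subsystem, cntlid, instance). cntlid is unique only within a
# subsystem; instance is modelled as a globally unique number.
controllers = [("subsys-a", 1, 0), ("subsys-b", 1, 1)]


def enable_passthru(controllers, key_index):
    """Try to register every controller, keyed by tuple field key_index.

    Returns the list of subsystems that were rejected as duplicates.
    """
    registered = {}
    rejected = []
    for ctrl in controllers:
        key = ctrl[key_index]
        if key in registered:
            rejected.append(ctrl[0])
        else:
            registered[key] = ctrl[0]
    return rejected


print(enable_passthru(controllers, 1))  # keyed by cntlid
print(enable_passthru(controllers, 2))  # keyed by instance
```

Keyed by cntlid, the second subsystem is wrongly rejected; keyed by instance, both are accepted.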
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Evan Burgess <evan.burgess@seagate.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Daniel Wagner [Mon, 18 Dec 2023 16:59:54 +0000 (17:59 +0100)]
nvme: repack struct nvme_ns_head
ns_id, lba_shift and ms are always accessed for every read/write I/O in
nvme_setup_rw. By grouping these variables into one cacheline we can
save some cycles.
4k sequential reads:
            baseline   patched
Bandwidth:      1620      1634
IOPs        66345579  66910939
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Daniel Wagner [Mon, 18 Dec 2023 16:59:53 +0000 (17:59 +0100)]
nvme: add csi, ms and nuse to sysfs
libnvme uses sysfs to enumerate the nvme resources. However, a few
attributes are missing from sysfs, and for these libnvme issues
commands during discovery.
As the kernel already knows all these attributes and we would like to
avoid having libnvme issue commands all the time, expose these missing
attributes.
The nuse value is updated on request because the nuse is a volatile
value. Since any user can read the sysfs attribute, a very simple rate
limit is added (update once every 5 seconds). A more sophisticated
update strategy can be added later if there is actually a need for it.
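A rate limit of this kind can be sketched as below. This is a generic Python illustration of "refresh at most once every 5 seconds", not the nvme sysfs code; the class name and the injected clock value are assumptions made so the example is easy to test.

```python
class RateLimitedValue:
    """Cache a value and refresh it at most once per interval (in seconds)."""

    def __init__(self, fetch, interval=5.0):
        self._fetch = fetch          # callable that performs the expensive query
        self._interval = interval
        self._last_update = None     # time of the last refresh, None if never
        self._value = None

    def read(self, now):
        """Return the cached value, refreshing it if the interval has elapsed."""
        if self._last_update is None or now - self._last_update >= self._interval:
            self._value = self._fetch()
            self._last_update = now
        return self._value
```

Readers that arrive within the interval get the cached value, so an unprivileged user polling the attribute cannot make the device handle a command on every read.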
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Daniel Wagner [Mon, 18 Dec 2023 16:59:52 +0000 (17:59 +0100)]
nvme: rename ns attribute group
Drop the 'id' part of the attribute group name because we want to expose
non 'id' related attributes via the ns attribute group.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Daniel Wagner [Mon, 18 Dec 2023 16:59:51 +0000 (17:59 +0100)]
nvme: refactor ns info setup function
Use nvme_ns_head instead of nvme_ns where possible. This reduces the
coupling between the different data structures.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Daniel Wagner [Mon, 18 Dec 2023 16:59:50 +0000 (17:59 +0100)]
nvme: refactor ns info helpers
Pass in the nvme_ns_head pointer directly. This reduces the need for
the caller to have the nvme_ns data structure present. Thus we can
refactor the caller side in the next step as well.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Daniel Wagner [Mon, 18 Dec 2023 16:59:49 +0000 (17:59 +0100)]
nvme: move ns id info to struct nvme_ns_head
Move the namespace info to struct nvme_ns_head, because it's the same
for all associated namespaces.
Note: with multipathing enabled the PI information is shared between all
paths. If a path is using a different PI configuration it will overwrite
the previous settings. This is obviously not correct and such a
configuration will be rejected in the future. For the time being we expect
a correctly configured storage.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Li Nan [Tue, 19 Dec 2023 07:59:42 +0000 (15:59 +0800)]
block: add check of 'minors' and 'first_minor' in device_add_disk()
'first_minor' represents the starting minor number of disks, and
'minors' represents the number of partitions in the device. Neither
of them can be greater than MINORMASK + 1.
Commit e338924bd05d ("block: check minor range in device_add_disk()")
only added the check of 'first_minor + minors'. However, their sum might
be less than MINORMASK even though the individual values are invalid.
Complete the checks now.
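The gap left by a sum-only check can be illustrated with a small Python model. It is not the device_add_disk() code: the function names are hypothetical, and the sum is wrapped to 32 bits to mimic a C int (strictly speaking, signed overflow in C is undefined, so the wraparound is an assumption of this sketch).

```python
MINORBITS = 20
MINORMASK = (1 << MINORBITS) - 1


def to_int32(value):
    """Wrap value into the signed 32-bit range, modelling a C int."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


def sum_check_only(first_minor, minors):
    """Only the (wrapped) sum is checked."""
    return to_int32(first_minor + minors) <= MINORMASK + 1


def full_check(first_minor, minors):
    """Each value is range-checked in addition to the sum."""
    return (first_minor <= MINORMASK + 1
            and minors <= MINORMASK + 1
            and first_minor + minors <= MINORMASK + 1)


print(sum_check_only(0x7FFFFFFF, 0x7FFFFFFF), full_check(0x7FFFFFFF, 0x7FFFFFFF))
```

Two huge values wrap to a small sum and slip through the sum-only check, while the per-value checks reject them.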
Fixes: e338924bd05d ("block: check minor range in device_add_disk()")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231219075942.840255-1-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kundan Kumar [Mon, 18 Dec 2023 15:27:22 +0000 (20:57 +0530)]
block: skip cgroups for passthrough io
Even if BLK_CGROUP is enabled, it does not work for passthrough io.
So skip setting up blkg for passthrough bio.
Reduced processing gives ~5% hike in peak-performance workload.
Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20231218152722.1768-1-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Li Nan [Fri, 15 Dec 2023 02:38:52 +0000 (10:38 +0800)]
md/raid1: support read error check
After commit 1e50915fe0bb ("raid: improve MD/raid10 handling of correctable
read errors."), an rdev is set to faulty in raid10 if it hits read errors too
many times. Add this mechanism to raid1 now.
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231215023852.3478228-3-linan666@huaweicloud.com
Li Nan [Fri, 15 Dec 2023 02:38:51 +0000 (10:38 +0800)]
md: factor out a helper exceed_read_errors() to check read_errors
Move check_decay_read_errors() to raid1-10.c and factor out a helper
exceed_read_errors() to check if read_errors exceeds the limit, so that
raid1 can also use it. There are no functional changes.
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231215023852.3478228-2-linan666@huaweicloud.com
Alex Lyakas [Wed, 13 Dec 2023 12:24:31 +0000 (14:24 +0200)]
md: When assembling the array, consult the superblock of the freshest device
Upon assembling the array, both kernel and mdadm allow the devices to have an event
counter difference of 1, and still consider them up-to-date.
However, a device whose event count is behind by 1 may in fact not be up-to-date,
and array resync with such a device may cause data corruption.
To avoid this, consult the superblock of the freshest device about the status
of a device whose event counter is behind by 1.
Signed-off-by: Alex Lyakas <alex.lyakas@zadara.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/1702470271-16073-1-git-send-email-alex.lyakas@zadara.com
Jens Axboe [Thu, 14 Dec 2023 18:08:15 +0000 (11:08 -0700)]
block: improve struct request_queue layout
It's clearly been a while since someone looked at this, so I gave it a
quick shot. There are a few issues in here:
- Random bundling of members that are mostly read-only and often written
- Random holes that need not be there
This moves the most frequently used bits into cacheline 1 and 2, with
the 2nd one being more write intensive than the first one, which is
basically read-only.
Outside of making this work a bit more efficiently, it also reduces the
size of struct request_queue for my test setup from 864 bytes (spanning
14 cachelines!) to 832 bytes and 13 cachelines.
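The cacheline counts follow from a ceiling division, sketched here in Python. The 64-byte cacheline size is an assumption (it matches the numbers quoted above) and the function name is made up.

```python
CACHELINE_SIZE = 64  # bytes; assumed, consistent with the figures quoted above


def cachelines_spanned(struct_size, cacheline=CACHELINE_SIZE):
    """Number of cachelines touched by a cacheline-aligned struct."""
    return -(-struct_size // cacheline)  # ceiling division


print(cachelines_spanned(864), cachelines_spanned(832))
```

864 bytes is 13.5 cachelines and therefore touches a 14th, while 832 bytes is exactly 13.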
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/d2b7b61c-4868-45c0-9060-4f9c73de9d7e@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 4 Dec 2023 17:34:19 +0000 (18:34 +0100)]
block: support adding less than len in bio_add_hw_page
bio_add_hw_page currently always fails or succeeds. This is fine for
the existing callers that always add PAGE_SIZE worth given that the
max_segment_size and max_sectors must always allow at least a page
worth of data. But when we want to use it for bigger amounts of data
this means it can also fail when adding the data to a bio, and creating
a fallback for that becomes really annoying in the callers.
Make use of the existing API design that allows returning a smaller
length than the one passed in and add up to max_segment_size worth
of data from a larger input. All the existing callers are fine with
this - not because they handle this return correctly, but because they
never pass more than a page in.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20231204173419.782378-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 4 Dec 2023 17:34:18 +0000 (18:34 +0100)]
block: prevent an integer overflow in bvec_try_merge_hw_page
Reordered a check to avoid a possible overflow when adding len to bv_len.
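The general pattern behind such a reordering can be shown with a Python model that wraps the sum to 32 bits, like a C unsigned int. The message above does not spell out the exact form of the check, so the two functions below are an assumed illustration rather than the kernel code, and their names are hypothetical.

```python
U32_MASK = 0xFFFFFFFF


def fits_overflowing(bv_len, length, max_segment_size):
    """Add first, then compare; the sum wraps to 32 bits like a C unsigned int."""
    return ((bv_len + length) & U32_MASK) <= max_segment_size


def fits_safe(bv_len, length, max_segment_size):
    """Compare against the remaining room, so nothing is added before the check.

    Assumes bv_len <= max_segment_size, so the subtraction cannot go negative.
    """
    return length <= max_segment_size - bv_len


# A nearly full segment plus a small addition wraps around to a tiny sum.
print(fits_overflowing(0xFFFFF000, 0x2000, 0xFFFFFFFF))
print(fits_safe(0xFFFFF000, 0x2000, 0xFFFFFFFF))
```

The add-then-compare form accepts the merge because the sum wrapped, while the subtract-then-compare form rejects it.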
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20231204173419.782378-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>