linux-2.6-microblaze.git
21 months agoMerge branch 'for-5.20/io_uring' into for-5.20/io_uring-zerocopy-send
Jens Axboe [Mon, 25 Jul 2022 00:41:03 +0000 (18:41 -0600)]
Merge branch 'for-5.20/io_uring' into for-5.20/io_uring-zerocopy-send

* for-5.20/io_uring: (716 commits)
  io_uring: ensure REQ_F_ISREG is set async offload
  net: fix compat pointer in get_compat_msghdr()
  io_uring: Don't require reinitable percpu_ref
  io_uring: fix types in io_recvmsg_multishot_overflow
  io_uring: Use atomic_long_try_cmpxchg in __io_account_mem
  io_uring: support multishot in recvmsg
  net: copy from user before calling __get_compat_msghdr
  net: copy from user before calling __copy_msghdr
  io_uring: support 0 length iov in buffer select in compat
  io_uring: fix multishot ending when not polled
  io_uring: add netmsg cache
  io_uring: impose max limit on apoll cache
  io_uring: add abstraction around apoll cache
  io_uring: move apoll cache to poll.c
  io_uring: consolidate hash_locked io-wq handling
  io_uring: clear REQ_F_HASH_LOCKED on hash removal
  io_uring: don't race double poll setting REQ_F_ASYNC_DATA
  io_uring: don't miss setting REQ_F_DOUBLE_POLL
  io_uring: disable multishot recvmsg
  io_uring: only trace one of complete or overflow
  ...

Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoMerge branch 'io_uring-zerocopy-send' of git://git.kernel.org/pub/scm/linux/kernel...
Jens Axboe [Mon, 25 Jul 2022 00:40:31 +0000 (18:40 -0600)]
Merge branch 'io_uring-zerocopy-send' of git://git./linux/kernel/git/kuba/linux into for-5.20/io_uring-zerocopy-send

Merge prep net series for io_uring tx zc from the Jakub's tree.

* 'io_uring-zerocopy-send' of git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux:
  net: fix uninitialised msghdr->sg_from_iter
  tcp: support externally provided ubufs
  ipv6/udp: support externally provided ubufs
  ipv4/udp: support externally provided ubufs
  net: introduce __skb_fill_page_desc_noacc
  net: introduce managed frags infrastructure
  net: Allow custom iter handler in msghdr
  skbuff: carry external ubuf_info in msghdr
  skbuff: add SKBFL_DONT_ORPHAN flag
  skbuff: don't mix ubuf_info from different sources
  ipv6: avoid partial copy for zc
  ipv4: avoid partial copy for zc

21 months agoio_uring: ensure REQ_F_ISREG is set async offload
Jens Axboe [Thu, 21 Jul 2022 15:06:47 +0000 (09:06 -0600)]
io_uring: ensure REQ_F_ISREG is set async offload

If we're offloading requests directly to io-wq because IOSQE_ASYNC was
set in the sqe, we can miss hashing writes appropriately because we
haven't set REQ_F_ISREG yet. This can cause a performance regression
with buffered writes, as io-wq then no longer correctly serializes writes
to that file.

Ensure that we set the flags in io_prep_async_work(), which will cause
the io-wq work item to be hashed appropriately.

Fixes: 584b0180f0f4 ("io_uring: move read/write file prep state into actual opcode handler")
Link: https://lore.kernel.org/io-uring/20220608080054.GB22428@xsang-OptiPlex-9020/
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Yin Fengwei <fengwei.yin@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agonet: fix compat pointer in get_compat_msghdr()
Jens Axboe [Fri, 15 Jul 2022 21:54:47 +0000 (15:54 -0600)]
net: fix compat pointer in get_compat_msghdr()

A previous change enabled external users to copy the data before
calling __get_compat_msghdr(), but didn't modify get_compat_msghdr() or
__io_compat_recvmsg_copy_hdr() to take that into account. They are both
stil passing in the __user pointer rather than the copied version.

Ensure we pass in the kernel struct, not the pointer to the user data.

Link: https://lore.kernel.org/all/46439555-644d-08a1-7d66-16f8f9a320f0@samsung.com/
Fixes: 1a3e4e94a1b9 ("net: copy from user before calling __get_compat_msghdr")
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: Don't require reinitable percpu_ref
Michal Koutný [Fri, 15 Jul 2022 17:45:01 +0000 (19:45 +0200)]
io_uring: Don't require reinitable percpu_ref

The commit 8bb649ee1da3 ("io_uring: remove ring quiesce for
io_uring_register") removed the worklow relying on reinit/resurrection
of the percpu_ref, hence, initialization with that requested is a relic.

This is based on code review, this causes no real bug (and theoretically
can't). Technically it's a revert of commit 214828962dea ("io_uring:
initialize percpu refcounters using PERCU_REF_ALLOW_REINIT") but since
the flag omission is now justified, I'm not making this a revert.

Fixes: 8bb649ee1da3 ("io_uring: remove ring quiesce for io_uring_register")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: fix types in io_recvmsg_multishot_overflow
Dylan Yudaken [Fri, 15 Jul 2022 13:02:52 +0000 (06:02 -0700)]
io_uring: fix types in io_recvmsg_multishot_overflow

io_recvmsg_multishot_overflow had incorrect types on non x64 system.
But also it had an unnecessary INT_MAX check, which could just be done
by changing the type of the accumulator to int (also simplifying the
casts).

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: a8b38c4ce724 ("io_uring: support multishot in recvmsg")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220715130252.610639-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: Use atomic_long_try_cmpxchg in __io_account_mem
Uros Bizjak [Thu, 14 Jul 2022 16:33:01 +0000 (18:33 +0200)]
io_uring: Use atomic_long_try_cmpxchg in __io_account_mem

Use atomic_long_try_cmpxchg instead of
atomic_long_cmpxchg (*ptr, old, new) == old in __io_account_mem.
x86 CMPXCHG instruction returns success in ZF flag, so this
change saves a compare after cmpxchg (and related move
instruction in front of cmpxchg).

Also, atomic_long_try_cmpxchg implicitly assigns old *ptr value
to "old" when cmpxchg fails, enabling further code simplifications.

No functional change intended.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: support multishot in recvmsg
Dylan Yudaken [Thu, 14 Jul 2022 11:02:58 +0000 (04:02 -0700)]
io_uring: support multishot in recvmsg

Similar to multishot recv, this will require provided buffers to be
used. However recvmsg is much more complex than recv as it has multiple
outputs. Specifically flags, name, and control messages.

Support this by introducing a new struct io_uring_recvmsg_out with 4
fields. namelen, controllen and flags match the similar out fields in
msghdr from standard recvmsg(2), payloadlen is the length of the payload
following the header.
This struct is placed at the start of the returned buffer. Based on what
the user specifies in struct msghdr, the next bytes of the buffer will be
name (the next msg_namelen bytes), and then control (the next
msg_controllen bytes). The payload will come at the end. The return value
in the CQE is the total used size of the provided buffer.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220714110258.1336200-4-dylany@fb.com
[axboe: style fixups, see link]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agonet: copy from user before calling __get_compat_msghdr
Dylan Yudaken [Thu, 14 Jul 2022 11:02:57 +0000 (04:02 -0700)]
net: copy from user before calling __get_compat_msghdr

this is in preparation for multishot receive from io_uring, where it needs
to have access to the original struct user_msghdr.

functionally this should be a no-op.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220714110258.1336200-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agonet: copy from user before calling __copy_msghdr
Dylan Yudaken [Thu, 14 Jul 2022 11:02:56 +0000 (04:02 -0700)]
net: copy from user before calling __copy_msghdr

this is in preparation for multishot receive from io_uring, where it needs
to have access to the original struct user_msghdr.

functionally this should be a no-op.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220714110258.1336200-2-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: support 0 length iov in buffer select in compat
Dylan Yudaken [Fri, 8 Jul 2022 18:18:36 +0000 (11:18 -0700)]
io_uring: support 0 length iov in buffer select in compat

Match up work done in "io_uring: allow iov_len = 0 for recvmsg and buffer
select", but for compat code path.

Fixes: a68caad69ce5 ("io_uring: allow iov_len = 0 for recvmsg and buffer select")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220708181838.1495428-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: fix multishot ending when not polled
Dylan Yudaken [Fri, 8 Jul 2022 18:18:35 +0000 (11:18 -0700)]
io_uring: fix multishot ending when not polled

If multishot is not actually polling then return IOU_OK rather than the
result.
If the result was > 0 this will confuse things further up the callstack
which expect a return <= 0.

Fixes: 1300ebb20286 ("io_uring: multishot recv")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220708181838.1495428-2-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: add netmsg cache
Jens Axboe [Thu, 7 Jul 2022 20:30:09 +0000 (14:30 -0600)]
io_uring: add netmsg cache

For recvmsg/sendmsg, if they don't complete inline, we currently need
to allocate a struct io_async_msghdr for each request. This is a
somewhat large struct.

Hook up sendmsg/recvmsg to use the io_alloc_cache. This reduces the
alloc + free overhead considerably, yielding 4-5% of extra performance
running netbench.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: impose max limit on apoll cache
Jens Axboe [Thu, 7 Jul 2022 20:20:54 +0000 (14:20 -0600)]
io_uring: impose max limit on apoll cache

Caches like this tend to grow to the peak size, and then never get any
smaller. Impose a max limit on the size, to prevent it from growing too
big.

A somewhat randomly chosen 512 is the max size we'll allow the cache
to get. If a batch of frees come in and would bring it over that, we
simply start kfree'ing the surplus.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: add abstraction around apoll cache
Jens Axboe [Thu, 7 Jul 2022 20:16:20 +0000 (14:16 -0600)]
io_uring: add abstraction around apoll cache

In preparation for adding limits, and one more user, abstract out the
core bits of the allocation+free cache.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: move apoll cache to poll.c
Jens Axboe [Thu, 7 Jul 2022 17:18:33 +0000 (11:18 -0600)]
io_uring: move apoll cache to poll.c

This is where it's used, move the flush handler in there.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: consolidate hash_locked io-wq handling
Pavel Begunkov [Thu, 7 Jul 2022 14:13:17 +0000 (15:13 +0100)]
io_uring: consolidate hash_locked io-wq handling

Don't duplicate code disabling REQ_F_HASH_LOCKED for IO_URING_F_UNLOCKED
(i.e. io-wq), move the handling into __io_arm_poll_handler().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0ff0ffdfaa65b3d536131535c3dad3c63d9b7bb0.1657203020.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: clear REQ_F_HASH_LOCKED on hash removal
Pavel Begunkov [Thu, 7 Jul 2022 14:13:16 +0000 (15:13 +0100)]
io_uring: clear REQ_F_HASH_LOCKED on hash removal

Instead of clearing REQ_F_HASH_LOCKED while arming a poll, unset the bit
when we're removing the entry from the table in io_poll_tw_hash_eject().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/02e48bb88d6f1480c94ac2924c43ad1fbd48e92a.1657203020.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: don't race double poll setting REQ_F_ASYNC_DATA
Pavel Begunkov [Thu, 7 Jul 2022 14:13:15 +0000 (15:13 +0100)]
io_uring: don't race double poll setting REQ_F_ASYNC_DATA

Just as with io_poll_double_prepare() setting REQ_F_DOUBLE_POLL, we can
race with the first poll entry when setting REQ_F_ASYNC_DATA. Move it
under io_poll_double_prepare().

Fixes: a18427bb2d9b ("io_uring: optimise submission side poll_refs")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/df6920f509c11115aa2bce8b34dc5fdb0eb98920.1657203020.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: don't miss setting REQ_F_DOUBLE_POLL
Pavel Begunkov [Thu, 7 Jul 2022 14:13:14 +0000 (15:13 +0100)]
io_uring: don't miss setting REQ_F_DOUBLE_POLL

When adding a second poll entry we should set REQ_F_DOUBLE_POLL
unconditionally. We might race with the first entry removal but that
doesn't change the rule.

Fixes: a18427bb2d9b ("io_uring: optimise submission side poll_refs")
Reported-and-tested-by: syzbot+49950ba66096b1f0209b@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8b680d83ded07424db83e8745585e7a6d72826ef.1657203020.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: disable multishot recvmsg
Dylan Yudaken [Mon, 4 Jul 2022 14:01:06 +0000 (07:01 -0700)]
io_uring: disable multishot recvmsg

recvmsg has semantics that do not make it trivial to extend to
multishot. Specifically it has user pointers and returns data in the
original parameter. In order to make this API useful these will need to be
somehow included with the provided buffers.

For now remove multishot for recvmsg as it is not useful.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220704140106.200167-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: only trace one of complete or overflow
Dylan Yudaken [Thu, 30 Jun 2022 09:12:31 +0000 (02:12 -0700)]
io_uring: only trace one of complete or overflow

In overflow we see a duplcate line in the trace, and in some cases 3
lines (if initial io_post_aux_cqe fails).
Instead just trace once for each CQE

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-13-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: fix io_uring_cqe_overflow trace format
Dylan Yudaken [Thu, 30 Jun 2022 09:12:30 +0000 (02:12 -0700)]
io_uring: fix io_uring_cqe_overflow trace format

Make the trace format consistent with io_uring_complete for cflags

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-12-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: multishot recv
Dylan Yudaken [Thu, 30 Jun 2022 09:12:29 +0000 (02:12 -0700)]
io_uring: multishot recv

Support multishot receive for io_uring.
Typical server applications will run a loop where for each recv CQE it
requeues another recv/recvmsg.

This can be simplified by using the existing multishot functionality
combined with io_uring's provided buffers.
The API is to add the IORING_RECV_MULTISHOT flag to the SQE. CQEs will
then be posted (with IORING_CQE_F_MORE flag set) when data is available
and is read. Once an error occurs or the socket ends, the multishot will
be removed and a completion without IORING_CQE_F_MORE will be posted.

The benefit to this is that the recv is much more performant.
 * Subsequent receives are queued up straight away without requiring the
   application to finish a processing loop.
 * If there are more data in the socket (sat the provided buffer size is
   smaller than the socket buffer) then the data is immediately
   returned, improving batching.
 * Poll is only armed once and reused, saving CPU cycles

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-11-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: fix multishot accept ordering
Dylan Yudaken [Thu, 30 Jun 2022 09:12:28 +0000 (02:12 -0700)]
io_uring: fix multishot accept ordering

Similar to multishot poll, drop multishot accept when CQE overflow occurs.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-10-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: fix multishot poll on overflow
Dylan Yudaken [Thu, 30 Jun 2022 09:12:27 +0000 (02:12 -0700)]
io_uring: fix multishot poll on overflow

On overflow, multishot poll can still complete with the IORING_CQE_F_MORE
flag set.
If in the meantime the user clears a CQE and a the poll was cancelled then
the poll will post a CQE without the IORING_CQE_F_MORE (and likely result
-ECANCELED).

However when processing the application will encounter the non-overflow
CQE which indicates that there will be no more events posted. Typical
userspace applications would free memory associated with the poll in this
case.
It will then subsequently receive the earlier CQE which has overflowed,
which breaks the contract given by the IORING_CQE_F_MORE flag.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-9-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: add allow_overflow to io_post_aux_cqe
Dylan Yudaken [Thu, 30 Jun 2022 09:12:26 +0000 (02:12 -0700)]
io_uring: add allow_overflow to io_post_aux_cqe

Some use cases of io_post_aux_cqe would not want to overflow as is, but
might want to change the flags/result. For example multishot receive
requires in order CQE, and so if there is an overflow it would need to
stop receiving until the overflow is taken care of.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-8-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: add IOU_STOP_MULTISHOT return code
Dylan Yudaken [Thu, 30 Jun 2022 09:12:25 +0000 (02:12 -0700)]
io_uring: add IOU_STOP_MULTISHOT return code

For multishot we want a way to signal the caller that multishot has ended
but also this might not be an error return.

For example sockets return 0 when closed, which should end a multishot
recv, but still have a CQE with result 0

Introduce IOU_STOP_MULTISHOT which does this and indicates that the return
code is stored inside req->cqe

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-7-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: clean up io_poll_check_events return values
Dylan Yudaken [Thu, 30 Jun 2022 09:12:24 +0000 (02:12 -0700)]
io_uring: clean up io_poll_check_events return values

The values returned are a bit confusing, where 0 and 1 have implied
meaning, so add some definitions for them.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-6-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: recycle buffers on error
Dylan Yudaken [Thu, 30 Jun 2022 09:12:23 +0000 (02:12 -0700)]
io_uring: recycle buffers on error

Rather than passing an error back to the user with a buffer attached,
recycle the buffer immediately.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-5-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: allow iov_len = 0 for recvmsg and buffer select
Dylan Yudaken [Thu, 30 Jun 2022 09:12:22 +0000 (02:12 -0700)]
io_uring: allow iov_len = 0 for recvmsg and buffer select

When using BUFFER_SELECT there is no technical requirement that the user
actually provides iov, and this removes one copy_from_user call.

So allow iov_len to be 0.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-4-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: restore bgid in io_put_kbuf
Dylan Yudaken [Thu, 30 Jun 2022 09:12:21 +0000 (02:12 -0700)]
io_uring: restore bgid in io_put_kbuf

Attempt to restore bgid. This is needed when recycling unused buffers as
the next time around it will want the correct bgid.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: allow 0 length for buffer select
Dylan Yudaken [Thu, 30 Jun 2022 09:12:20 +0000 (02:12 -0700)]
io_uring: allow 0 length for buffer select

If user gives 0 for length, we can set it from the available buffer size.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-2-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: let to set a range for file slot allocation
Pavel Begunkov [Sat, 25 Jun 2022 10:55:38 +0000 (11:55 +0100)]
io_uring: let to set a range for file slot allocation

From recently io_uring provides an option to allocate a file index for
operation registering fixed files. However, it's utterly unusable with
mixed approaches when for a part of files the userspace knows better
where to place it, as it may race and users don't have any sane way to
pick a slot and hoping it will not be taken.

Let the userspace to register a range of fixed file slots in which the
auto-allocation happens. The use case is splittting the fixed table in
two parts, where on of them is used for auto-allocation and another for
slot-specified operations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/66ab0394e436f38437cf7c44676e1920d09687ad.1656154403.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: add support for passing fixed file descriptors
Jens Axboe [Mon, 13 Jun 2022 10:47:02 +0000 (04:47 -0600)]
io_uring: add support for passing fixed file descriptors

With IORING_OP_MSG_RING, one ring can send a message to another ring.
Extend that support to also allow sending a fixed file descriptor to
that ring, enabling one ring to pass a registered descriptor to another
one.

Arguments are extended to pass in:

sqe->addr3 fixed file slot in source ring
sqe->file_index fixed file slot in destination ring

IORING_OP_MSG_RING is extended to take a command argument in sqe->addr.
If set to zero (or IORING_MSG_DATA), it sends just a message like before.
If set to IORING_MSG_SEND_FD, a fixed file descriptor is sent according
to the above arguments.

Two common use cases for this are:

1) Server needs to be shutdown or restarted, pass file descriptors to
   another onei

2) Backend is split, and one accepts connections, while others then get
  the fd passed and handle the actual connection.

Both of those are classic SCM_RIGHTS use cases, and it's not possible to
support them with direct descriptors today.

By default, this will post a CQE to the target ring, similarly to how
IORING_MSG_DATA does it. If IORING_MSG_RING_CQE_SKIP is set, no message
is posted to the target ring. The issuer is expected to notify the
receiver side separately.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: split out fixed file installation and removal
Jens Axboe [Mon, 13 Jun 2022 10:42:56 +0000 (04:42 -0600)]
io_uring: split out fixed file installation and removal

Put it with the filetable code, which is where it belongs. While doing
so, have the helpers take a ctx rather than an io_kiocb. It doesn't make
sense to use a request, as it's not an operation on the request itself.
It applies to the ring itself.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: replace zero-length array with flexible-array member
Gustavo A. R. Silva [Tue, 28 Jun 2022 19:33:20 +0000 (21:33 +0200)]
io_uring: replace zero-length array with flexible-array member

There is a regular need in the kernel to provide a way to declare
having a dynamically sized set of trailing elements in a structure.
Kernel code should always use “flexible array members”[1] for these
cases. The older style of one-element or zero-length arrays should
no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.16/process/deprecated.html#zero-length-and-one-element-arrays

Link: https://github.com/KSPP/linux/issues/78
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: remove ctx->refs pinning on enter
Pavel Begunkov [Sat, 25 Jun 2022 10:53:02 +0000 (11:53 +0100)]
io_uring: remove ctx->refs pinning on enter

io_uring_enter() takes ctx->refs, which was previously preventing racing
with register quiesce. However, as register now doesn't touch the refs,
we can freely kill extra ctx pinning and rely on the fact that we're
holding a file reference preventing the ring from being destroyed.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a11c57ad33a1be53541fce90669c1b79cf4d8940.1656153286.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: don't check file ops of registered rings
Pavel Begunkov [Sat, 25 Jun 2022 10:53:01 +0000 (11:53 +0100)]
io_uring: don't check file ops of registered rings

Registered rings are per definitions io_uring files, so we don't need to
additionally verify them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/425cd64fd885b8e329a46c205ee811987691baaf.1656153286.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: remove extra TIF_NOTIFY_SIGNAL check
Pavel Begunkov [Sat, 25 Jun 2022 10:53:00 +0000 (11:53 +0100)]
io_uring: remove extra TIF_NOTIFY_SIGNAL check

io_run_task_work() accounts for TIF_NOTIFY_SIGNAL, so no need to have an
second check in io_run_task_work_sig().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/52ce41a592ad904511697f432141e5690fd4b968.1656153285.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: fuse fallback_node and normal tw node
Pavel Begunkov [Sat, 25 Jun 2022 10:52:59 +0000 (11:52 +0100)]
io_uring: fuse fallback_node and normal tw node

Now as both normal and fallback paths use llist, just keep one node head
in struct io_task_work and kill off ->fallback_node.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d04ebde409f7b162fe247b361b4486b193293e46.1656153285.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: improve io_fail_links()
Pavel Begunkov [Sat, 25 Jun 2022 10:52:58 +0000 (11:52 +0100)]
io_uring: improve io_fail_links()

io_fail_links() is called with ->completion_lock held and for that
reason we'd want to keep it as small as we can. Instead of doing
__io_req_complete_post() for each linked request under the lock, fail
them in a task_work handler under ->uring_lock.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a2f68708b970a21f4e84ddfa7b3abd67a8fffb27.1656153285.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: move POLLFREE handling to separate function
Jens Axboe [Tue, 21 Jun 2022 20:34:15 +0000 (14:34 -0600)]
io_uring: move POLLFREE handling to separate function

We really don't care about this at all in terms of performance. Outside
of having it already be marked unlikely(), shove it into a separate
__cold function.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: kbuf: inline io_kbuf_recycle_ring()
Hao Xu [Thu, 23 Jun 2022 13:01:26 +0000 (21:01 +0800)]
io_uring: kbuf: inline io_kbuf_recycle_ring()

Make io_kbuf_recycle_ring() inline since it is the fast path of
provided buffer.

Signed-off-by: Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220623130126.179232-1-hao.xu@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: optimise submission side poll_refs
Pavel Begunkov [Thu, 23 Jun 2022 13:24:49 +0000 (14:24 +0100)]
io_uring: optimise submission side poll_refs

The final poll_refs put in __io_arm_poll_handler() takes quite some
cycles. When we're arming from the original task context task_work won't
be run, so in this case we can assume that we won't race with task_works
and so not take the initial ownership ref.

One caveat is that after arming a poll we may race with it, so we have
to add a bunch of io_poll_get_ownership() hidden inside of
io_poll_can_finish_inline() whenever we want to complete arming inline.
For the same reason we can't just set REQ_F_DOUBLE_POLL in
__io_queue_proc() and so need to sync with the first poll entry by
taking its wq head lock.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8825315d7f5e182ac1578a031e546f79b1c97d01.1655990418.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: refactor poll arm error handling
Pavel Begunkov [Thu, 23 Jun 2022 13:24:48 +0000 (14:24 +0100)]
io_uring: refactor poll arm error handling

__io_arm_poll_handler() errors parsing is a horror, in case it failed it
returns 0 and the caller is expected to look at ipt.error, which already
led us to a number of problems before.

When it returns a valid mask, leave it as it's not, i.e. return 1 and
store the mask in ipt.result_mask. In case of a failure that can be
handled inline return an error code (negative value), and return 0 if
__io_arm_poll_handler() took ownership of the request and will complete
it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/018cacdaef5fe95d7dc56b32e85d752cab7607f6.1655990418.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: change arm poll return values
Pavel Begunkov [Thu, 23 Jun 2022 13:24:47 +0000 (14:24 +0100)]
io_uring: change arm poll return values

The rules for __io_arm_poll_handler()'s result parsing are complicated,
as the first step don't pass return a mask but pass back a positive
return code and fill ipt->result_mask.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/529e29e9f97f2e6e383ccd44234d8b576a83a921.1655990418.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: add a helper for apoll alloc
Pavel Begunkov [Thu, 23 Jun 2022 13:24:46 +0000 (14:24 +0100)]
io_uring: add a helper for apoll alloc

Extract a helper function for apoll allocation, makes the code easier to
read.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2f93282b47dd678e805dd0d7097f66968ced495c.1655990418.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: remove events caching atavisms
Pavel Begunkov [Thu, 23 Jun 2022 13:24:45 +0000 (14:24 +0100)]
io_uring: remove events caching atavisms

Remove events argument from *io_poll_execute(), it's not needed and not
used.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/12efd4e15c6a90cf9e5b59807cfcb57852b51dc7.1655990418.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: clean poll ->private flagging
Pavel Begunkov [Thu, 23 Jun 2022 13:24:44 +0000 (14:24 +0100)]
io_uring: clean poll ->private flagging

We store a req pointer in wqe->private but also take one bit to mark
double poll entries. Replace macro helpers with inline functions for
better type checking and also name the double flag.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9a61240555c64ac0b7a9b0eb59a9efeb638a35a4.1655990418.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: add sync cancelation API through io_uring_register()
Jens Axboe [Sat, 18 Jun 2022 16:00:50 +0000 (10:00 -0600)]
io_uring: add sync cancelation API through io_uring_register()

The io_uring cancelation API is async, like any other API that we expose
there. For the case of finding a request to cancel, or not finding one,
it is fully sync in that when submission returns, the CQE for both the
cancelation request and the targeted request have been posted to the
CQ ring.

However, if the targeted work is being executed by io-wq, the API can
only start the act of canceling it. This makes it difficult to use in
some circumstances, as the caller then has to wait for the CQEs to come
in and match on the same cancelation data there.

Provide a IORING_REGISTER_SYNC_CANCEL command for io_uring_register()
that does sync cancelations, always. For the io-wq case, it'll wait
for the cancelation to come in before returning. The only expected
returns from this API is:

0 Request found and canceled fine.
> 0 Requests found and canceled. Only happens if asked to
cancel multiple requests, and if the work wasn't in
progress.
-ENOENT Request not found.
-ETIME A timeout on the operation was requested, but the timeout
expired before we could cancel.

and we won't get -EALREADY via this API.

If the timeout value passed in is -1 (tv_sec and tv_nsec), then that
means that no timeout is requested. Otherwise, the timespec passed in
is the amount of time the sync cancel will wait for a successful
cancelation.

Link: https://github.com/axboe/liburing/discussions/608
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: add IORING_ASYNC_CANCEL_FD_FIXED cancel flag
Jens Axboe [Sat, 18 Jun 2022 15:47:04 +0000 (09:47 -0600)]
io_uring: add IORING_ASYNC_CANCEL_FD_FIXED cancel flag

In preparation for not having a request to pass in that carries this
state, add a separate cancelation flag that allows the caller to ask
for a fixed file for cancelation.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: have cancelation API accept io_uring_task directly
Jens Axboe [Sat, 18 Jun 2022 15:23:54 +0000 (09:23 -0600)]
io_uring: have cancelation API accept io_uring_task directly

We just use the io_kiocb passed in to find the io_uring_task, and we
already pass in the ctx via cd->ctx anyway.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: kbuf: kill __io_kbuf_recycle()
Hao Xu [Wed, 22 Jun 2022 05:55:51 +0000 (13:55 +0800)]
io_uring: kbuf: kill __io_kbuf_recycle()

__io_kbuf_recycle() is only called in io_kbuf_recycle(). Kill it and
tweak the code so that the legacy pbuf and ring pbuf code become clear

Signed-off-by: Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220622055551.642370-1-hao.xu@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: trace task_work_run
Dylan Yudaken [Wed, 22 Jun 2022 13:40:28 +0000 (06:40 -0700)]
io_uring: trace task_work_run

trace task_work_run to help provide stats on how often task work is run
and what batch sizes are coming through.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-9-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: add trace event for running task work
Dylan Yudaken [Wed, 22 Jun 2022 13:40:27 +0000 (06:40 -0700)]
io_uring: add trace event for running task work

This is useful for investigating if task_work is batching

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-8-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: batch task_work
Dylan Yudaken [Wed, 22 Jun 2022 13:40:25 +0000 (06:40 -0700)]
io_uring: batch task_work

Batching task work up is an important performance optimisation, as
task_work_add is expensive.

In order to keep the semantics replace the task_list with a fake node
while processing the old list, and then do a cmpxchg at the end to see if
there is more work.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-6-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: introduce llist helpers
Dylan Yudaken [Wed, 22 Jun 2022 13:40:24 +0000 (06:40 -0700)]
io_uring: introduce llist helpers

Introduce helpers to atomically switch llist.

Will later move this into common code

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-5-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: lockless task list
Dylan Yudaken [Wed, 22 Jun 2022 13:40:23 +0000 (06:40 -0700)]
io_uring: lockless task list

With networking use cases we see contention on the spinlock used to
protect the task_list when multiple threads try and add completions at once.
Instead we can use a lockless list, and assume that the first caller to
add to the list is responsible for kicking off task work.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-4-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: remove __io_req_task_work_add
Dylan Yudaken [Wed, 22 Jun 2022 13:40:22 +0000 (06:40 -0700)]
io_uring: remove __io_req_task_work_add

this is no longer needed as there is only one caller

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: remove priority tw list optimisation
Dylan Yudaken [Wed, 22 Jun 2022 13:40:21 +0000 (06:40 -0700)]
io_uring: remove priority tw list optimisation

This optimisation has some built in assumptions that make it easy to
introduce bugs. It also does not have clear wins that make it worth keeping.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-2-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: dedup io_run_task_work
Pavel Begunkov [Tue, 21 Jun 2022 09:09:02 +0000 (10:09 +0100)]
io_uring: dedup io_run_task_work

We have an identical copy of io_run_task_work() for io-wq called
io_flush_signals(), deduplicate them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a157a4df5fa217b8bd03c73494f2fd0e24e44fbc.1655802465.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: move list helpers to a separate file
Pavel Begunkov [Tue, 21 Jun 2022 09:09:01 +0000 (10:09 +0100)]
io_uring: move list helpers to a separate file

It's annoying to have io-wq.h as a dependency every time we want some of
struct io_wq_work_list helpers, move them into a separate file.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c1d891ce12b30767d1d2a3b7db2ca3abc1ecc4a2.1655802465.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: improve io_run_task_work()
Pavel Begunkov [Tue, 21 Jun 2022 09:09:00 +0000 (10:09 +0100)]
io_uring: improve io_run_task_work()

Since SQPOLL now uses TWA_SIGNAL_NO_IPI, there won't be task work items
without TIF_NOTIFY_SIGNAL. Simplify io_run_task_work() by removing
task->task_works check. Even though looks it doesn't cause extra cache
bouncing, it's still nice to not touch it an extra time when it might be
not cached.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/75d4f34b0c671075892821a409e28da6cb1d64fe.1655802465.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: optimize io_uring_task layout
Pavel Begunkov [Mon, 20 Jun 2022 14:27:35 +0000 (15:27 +0100)]
io_uring: optimize io_uring_task layout

task_work bits of io_uring_task are split into two cache lines causing
extra cache bouncing, place them into a separate cache line. Also move
the most used submission path fields closer together, so there are hot.

Cc: stable@vger.kernel.org # 5.15+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: add a warn_once for poll_find
Pavel Begunkov [Mon, 20 Jun 2022 00:26:01 +0000 (01:26 +0100)]
io_uring: add a warn_once for poll_find

io_poll_remove() expects poll_find() to search only for poll requests and
passes a flag for this. Just be a little bit extra cautious considering
lots of recent poll/cancellation changes and add a WARN_ON_ONCE checking
that we don't get an apoll'ed request.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ec9a66f1e22f99dcd02288d4e42f3cc6bb357804.1655684496.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: consistent naming for inline completion
Pavel Begunkov [Mon, 20 Jun 2022 00:26:00 +0000 (01:26 +0100)]
io_uring: consistent naming for inline completion

Improve naming of the inline/deferred completion helper so it's
consistent with it's *_post counterpart. Add some comments and extra
lockdeps to ensure the locking is done right.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/797c619943dac06529e9d3fcb16e4c3cde6ad1a3.1655684496.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: move io_import_fixed()
Pavel Begunkov [Mon, 20 Jun 2022 00:25:59 +0000 (01:25 +0100)]
io_uring: move io_import_fixed()

Move io_import_fixed() into rsrc.c where it belongs.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4d5becb21f332b4fef6a7cedd6a50e65e2371630.1655684496.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: opcode independent fixed buf import
Pavel Begunkov [Mon, 20 Jun 2022 00:25:58 +0000 (01:25 +0100)]
io_uring: opcode independent fixed buf import

Fixed buffers are generic infrastructure, make io_import_fixed() opcode
agnostic.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b1e765c8a1c2c913a05a28d2399fc53e1d3cf37a.1655684496.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: add io_commit_cqring_flush()
Pavel Begunkov [Mon, 20 Jun 2022 00:25:57 +0000 (01:25 +0100)]
io_uring: add io_commit_cqring_flush()

Since __io_commit_cqring_flush users moved to different files, introduce
io_commit_cqring_flush() helper and encapsulate all flags testing details
inside.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0da03887435dd9869ffe46dcd3962bf104afcca3.1655684496.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: introduce locking helpers for CQE posting
Pavel Begunkov [Mon, 20 Jun 2022 00:25:56 +0000 (01:25 +0100)]
io_uring: introduce locking helpers for CQE posting

spin_lock(&ctx->completion_lock);
/* post CQEs */
io_commit_cqring(ctx);
spin_unlock(&ctx->completion_lock);
io_cqring_ev_posted(ctx);

We have many places repeating this sequence, and the three function
unlock section is not perfect from the maintainance perspective and also
makes it harder to add new locking/sync trick.

Introduce two helpers. io_cq_lock(), which is simple and only grabs
->completion_lock, and io_cq_unlock_post() encapsulating the three call
section.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fe0c682bf7f7b55d9be55b0d034be9c1949277dc.1655684496.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: hide eventfd assumptions in eventfd paths
Pavel Begunkov [Mon, 20 Jun 2022 00:25:55 +0000 (01:25 +0100)]
io_uring: hide eventfd assumptions in eventfd paths

Some io_uring-eventfd users assume that there won't be spurious wakeups.
That assumption has to be honoured by all io_cqring_ev_posted() callers,
which is inconvenient and from time to time leads to problems but should
be maintained to not break the userspace.

Instead of making the callers track whether a CQE was posted or not, hide
it inside io_eventfd_signal(). It saves ->cached_cq_tail it saw last time
and triggers the eventfd only when ->cached_cq_tail changed since then.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0ffc66bae37a2513080b601e4370e147faaa72c5.1655684496.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: fix io_poll_remove_all clang warnings
Pavel Begunkov [Mon, 20 Jun 2022 00:25:54 +0000 (01:25 +0100)]
io_uring: fix io_poll_remove_all clang warnings

clang complains on bitwise operations with bools, add a bit more
verbosity to better show that we want to call io_poll_remove_all_table()
twice but with different arguments.

Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f11d21dcdf9233e0eeb15fa13b858a05a78eb310.1655684496.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: improve task exit timeout cancellations
Pavel Begunkov [Mon, 20 Jun 2022 00:25:53 +0000 (01:25 +0100)]
io_uring: improve task exit timeout cancellations

Don't spin trying to cancel timeouts that are reachable but not
cancellable, e.g. already executing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ab8a7440a60bbdf69ae514f672ad050e43dd1b03.1655684496.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: fix multi ctx cancellation
Pavel Begunkov [Mon, 20 Jun 2022 00:25:52 +0000 (01:25 +0100)]
io_uring: fix multi ctx cancellation

io_uring_try_cancel_requests() loops until there is nothing left to do
with the ring, however there might be several rings and they might have
dependencies between them, e.g. via poll requests.

Instead of cancelling rings one by one, try to cancel them all and only
then loop over if we still potenially some work to do.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8d491fe02d8ac4c77ff38061cf86b9a827e8845c.1655684496.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: remove ->flush_cqes optimisation
Pavel Begunkov [Sun, 19 Jun 2022 11:26:08 +0000 (12:26 +0100)]
io_uring: remove ->flush_cqes optimisation

It's not clear how widely used IOSQE_CQE_SKIP_SUCCESS is, and how often
->flush_cqes flag prevents from completion being flushed. Sometimes it's
high level of concurrency that enables it at least for one CQE, but
sometimes it doesn't save much because nobody waiting on the CQ.

Remove ->flush_cqes flag and the optimisation, it should benefit the
normal use case. Note, that there is no spurious eventfd problem with
that as checks for spuriousness were incorporated into
io_eventfd_signal().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/692e81eeddccc096f449a7960365fa7b4a18f8e6.1655637157.git.asml.silence@gmail.com
[axboe: remove now dead state->flush_cqes variable]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: move io_eventfd_signal()
Pavel Begunkov [Sun, 19 Jun 2022 11:26:06 +0000 (12:26 +0100)]
io_uring: move io_eventfd_signal()

Move io_eventfd_signal() in the sources without any changes and kill its
forward declaration.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9ebebb3f6f56f5a5448a621e0b6a537720c43334.1655637157.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: reshuffle io_uring/io_uring.h
Pavel Begunkov [Sun, 19 Jun 2022 11:26:05 +0000 (12:26 +0100)]
io_uring: reshuffle io_uring/io_uring.h

It's a good idea to first do forward declarations and then inline
helpers, otherwise there will be keep stumbling on dependencies
between them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1d7fa6672ed43f20ccc0c54ae201369ebc3ebfab.1655637157.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: remove extra io_commit_cqring()
Pavel Begunkov [Sun, 19 Jun 2022 11:26:04 +0000 (12:26 +0100)]
io_uring: remove extra io_commit_cqring()

We don't post events in __io_commit_cqring_flush() anymore but send all
requests to tw, so no need to do io_commit_cqring() there.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f2481e32375e749be89c42e4804268b608722cef.1655637157.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: move a few private types to local headers
Jens Axboe [Sun, 19 Jun 2022 01:44:33 +0000 (19:44 -0600)]
io_uring: move a few private types to local headers

Commit 3a3d47fa9cfd ("io_uring: make io_uring_types.h public") moved
a bunch of io_uring types to a kernel wide header, so we could make
tracing a bit saner rather than pass in a ton of arguments.

However, there are a few types in there that are not really needed to
be system wide. Move the cancel data and mapped buffers back to the
appropriate io_uring local headers.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: clean up tracing events
Pavel Begunkov [Thu, 16 Jun 2022 12:57:20 +0000 (13:57 +0100)]
io_uring: clean up tracing events

We have lots of trace events accepting an io_uring request and wanting
to print some of its fields like user_data, opcode, flags and so on.
However, as trace points were unaware of io_uring structures, we had to
pass all the fields as arguments. Teach trace/events/io_uring.h about
struct io_kiocb and stop the misery of passing a horde of arguments to
trace helpers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/40ff72f92798114e56d400f2b003beb6cde6ef53.1655384063.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: make io_uring_types.h public
Pavel Begunkov [Thu, 16 Jun 2022 12:57:19 +0000 (13:57 +0100)]
io_uring: make io_uring_types.h public

Move io_uring types to linux/include, need them public so tracing can
see the definitions and we can clean trace/events/io_uring.h

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a15f12e8cb7289b2de0deaddcc7518d98a132d17.1655384063.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: kill extra io_uring_types.h includes
Pavel Begunkov [Thu, 16 Jun 2022 12:57:18 +0000 (13:57 +0100)]
io_uring: kill extra io_uring_types.h includes

io_uring/io_uring.h already includes io_uring_types.h, no need to
include it every time. Kill it in a bunch of places, it prepares us for
following patches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/94d8c943fbe0ef949981c508ddcee7fc1c18850f.1655384063.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: change ->cqe_cached invariant for CQE32
Pavel Begunkov [Fri, 17 Jun 2022 08:48:05 +0000 (09:48 +0100)]
io_uring: change ->cqe_cached invariant for CQE32

With IORING_SETUP_CQE32 ->cqe_cached doesn't store a real address but
rather an implicit offset into cqes. Store the real cqe pointer and
increment it accordingly if CQE32.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1ee1838cba16bed96381a006950b36ba640d998c.1655455613.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: deduplicate io_get_cqe() calls
Pavel Begunkov [Fri, 17 Jun 2022 08:48:04 +0000 (09:48 +0100)]
io_uring: deduplicate io_get_cqe() calls

Deduplicate calls to io_get_cqe() from __io_fill_cqe_req().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4fa077986cc3abab7c59ff4e7c390c783885465f.1655455613.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: deduplicate __io_fill_cqe_req tracing
Pavel Begunkov [Fri, 17 Jun 2022 08:48:03 +0000 (09:48 +0100)]
io_uring: deduplicate __io_fill_cqe_req tracing

Deduplicate two trace_io_uring_complete() calls in __io_fill_cqe_req().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/277ed85dba5189ab7d932164b314013a0f0b0fdc.1655455613.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: introduce io_req_cqe_overflow()
Pavel Begunkov [Fri, 17 Jun 2022 08:48:02 +0000 (09:48 +0100)]
io_uring: introduce io_req_cqe_overflow()

__io_fill_cqe_req() is hot and inlined, we want it to be as small as
possible. Add io_req_cqe_overflow() accepting only a request and doing
all overflow accounting, and replace with it two calls to 6 argument
io_cqring_event_overflow().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/048b9fbcce56814d77a1a540409c98c3d383edcb.1655455613.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: don't inline __io_get_cqe()
Pavel Begunkov [Fri, 17 Jun 2022 08:48:01 +0000 (09:48 +0100)]
io_uring: don't inline __io_get_cqe()

__io_get_cqe() is not as hot as io_get_cqe(), no need to inline it, it
sheds ~500B from the binary.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c1ac829198a881b7af8710926f99a3559b9f24c0.1655455613.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: don't expose io_fill_cqe_aux()
Pavel Begunkov [Fri, 17 Jun 2022 08:48:00 +0000 (09:48 +0100)]
io_uring: don't expose io_fill_cqe_aux()

Deduplicate some code and add a helper for filling an aux CQE, locking
and notification.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b7c6557c8f9dc5c4cfb01292116c682a0ff61081.1655455613.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: kbuf: add comments for some tricky code
Hao Xu [Fri, 17 Jun 2022 05:04:29 +0000 (13:04 +0800)]
io_uring: kbuf: add comments for some tricky code

Add comments to explain why it is always under uring lock when
incrementing head in __io_kbuf_recycle. And rectify one comemnt about
kbuf consuming in iowq case.

Signed-off-by: Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220617050429.94293-1-hao.xu@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: mutex locked poll hashing
Pavel Begunkov [Thu, 16 Jun 2022 09:22:12 +0000 (10:22 +0100)]
io_uring: mutex locked poll hashing

Currently we do two extra spin lock/unlock pairs to add a poll/apoll
request to the cancellation hash table and remove it from there.

On the submission side we often already hold ->uring_lock and tw
completion is likely to hold it as well. Add a second cancellation hash
table protected by ->uring_lock. In concerns for latency because of a
need to have the mutex locked on the completion side, use the new table
only in following cases:

1) IORING_SETUP_SINGLE_ISSUER: only one task grabs uring_lock, so there
   is little to no contention and so the main tw hander will almost
   always end up grabbing it before calling callbacks.

2) IORING_SETUP_SQPOLL: same as with single issuer, only one task is
   a major user of ->uring_lock.

3) apoll: we normally grab the lock on the completion side anyway to
   execute the request, so it's free.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1bbad9c78c454b7b92f100bbf46730a37df7194f.1655371007.git.asml.silence@gmail.com
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: propagate locking state to poll cancel
Pavel Begunkov [Thu, 16 Jun 2022 09:22:11 +0000 (10:22 +0100)]
io_uring: propagate locking state to poll cancel

Poll cancellation will be soon need to grab ->uring_lock inside, pass
the locking state, i.e. issue_flags, inside the cancellation functions.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b86781d047727c07163443b57551a3fa57c7c5e1.1655371007.git.asml.silence@gmail.com
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: introduce a struct for hash table
Pavel Begunkov [Thu, 16 Jun 2022 09:22:10 +0000 (10:22 +0100)]
io_uring: introduce a struct for hash table

Instead of passing around a pointer to hash buckets, add a bit of type
safety and wrap it into a structure.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d65bc3faba537ec2aca9eabf334394936d44bd28.1655371007.git.asml.silence@gmail.com
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: pass hash table into poll_find
Pavel Begunkov [Thu, 16 Jun 2022 09:22:09 +0000 (10:22 +0100)]
io_uring: pass hash table into poll_find

In preparation for having multiple cancellation hash tables, pass a
table pointer into io_poll_find() and other poll cancel functions.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a31c88502463dce09254240fa037352927d7ecc3.1655371007.git.asml.silence@gmail.com
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: add IORING_SETUP_SINGLE_ISSUER
Pavel Begunkov [Thu, 16 Jun 2022 09:22:08 +0000 (10:22 +0100)]
io_uring: add IORING_SETUP_SINGLE_ISSUER

Add a new IORING_SETUP_SINGLE_ISSUER flag and the userspace visible part
of it, i.e. put limitations of submitters. Also, don't allow it together
with IOPOLL as we're not going to put it to good use.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4bcc41ee467fdf04c8aab8baf6ce3ba21858c3d4.1655371007.git.asml.silence@gmail.com
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: use state completion infra for poll reqs
Pavel Begunkov [Thu, 16 Jun 2022 09:22:07 +0000 (10:22 +0100)]
io_uring: use state completion infra for poll reqs

Use io_req_task_complete() for poll request completions, so it can
utilise state completions and save lots of unnecessary locking.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ced94cb5a728d8e386c640d052fd3da3f5d6891a.1655371007.git.asml.silence@gmail.com
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: clean up io_ring_ctx_alloc
Pavel Begunkov [Thu, 16 Jun 2022 09:22:06 +0000 (10:22 +0100)]
io_uring: clean up io_ring_ctx_alloc

Add a variable for the number of hash buckets in io_ring_ctx_alloc(),
makes it more readable.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/993926ed0d614ba9a76b2a85bebae2babcb13983.1655371007.git.asml.silence@gmail.com
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: limit the number of cancellation buckets
Pavel Begunkov [Thu, 16 Jun 2022 09:22:05 +0000 (10:22 +0100)]
io_uring: limit the number of cancellation buckets

Don't allocate to many hash/cancellation buckets, there might be too
many, clamp it to 8 bits, or 256 * 64B = 16KB. We don't usually have too
many requests, and 256 buckets should be enough, especially since we
do hash search only in the cancellation path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b9620c8072ba61a2d50eba894b89bd93a94a9abd.1655371007.git.asml.silence@gmail.com
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: clean up io_try_cancel
Pavel Begunkov [Thu, 16 Jun 2022 09:22:04 +0000 (10:22 +0100)]
io_uring: clean up io_try_cancel

Get rid of an unnecessary extra goto in io_try_cancel() and simplify the
function.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/48cf5417b43a8386c6c364dba1ad9b4c7382d158.1655371007.git.asml.silence@gmail.com
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21 months agoio_uring: pass poll_find lock back
Pavel Begunkov [Thu, 16 Jun 2022 09:22:03 +0000 (10:22 +0100)]
io_uring: pass poll_find lock back

Instead of using implicit knowledge of what is locked or not after
io_poll_find() and co returns, pass back a pointer to the locked
bucket if any. If set the user must to unlock the spinlock.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/dae1dc5749aa34367812ecf62f82fd3f053aae44.1655371007.git.asml.silence@gmail.com
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>