[v6,0/8] ublk: decouple server threads from hctxs

Message ID 20250507-ublk_task_per_io-v6-0-a2a298783c01@purestorage.com

Message

Uday Shankar May 7, 2025, 9:49 p.m. UTC
This patch set aims to allow ublk server threads to better balance load
amongst themselves by decoupling server threads from ublk queues/hctxs,
so that multiple threads can service I/Os from a single hctx.

The first patch is the functional change in the driver: it switches
from per-queue daemons to per-io daemons, allowing ublk servers to
balance load that is distributed unevenly across queues. The second
patch fixes an off-by-one in tag allocation (in the sbitmap layer) that
was observed while developing a test for this feature. The next five
patches add support for this feature in the selftests ublk server
(kublk) and add a test which shows the new feature working as intended.
The last patch documents the new feature.
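
For illustration only (this is not code from the series): a server that
wants the new behavior can gate it on the feature flag and request it at
device creation time. Only UBLK_F_RR_TAGS and ublksrv_ctrl_dev_info::flags
come from these patches; the helper name and the surrounding control-path
code below are assumed.

#include <linux/ublk_cmd.h>

/*
 * Illustrative only: request round-robin tag allocation when the driver
 * advertises UBLK_F_RR_TAGS (the feature mask is assumed to have been
 * read via UBLK_U_CMD_GET_FEATURES). The flag must be set before the
 * ADD_DEV command is issued.
 */
static void ublk_maybe_enable_rr_tags(struct ublksrv_ctrl_dev_info *info,
				      __u64 driver_features)
{
	if (driver_features & UBLK_F_RR_TAGS)
		info->flags |= UBLK_F_RR_TAGS;
}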

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
---
Changes in v6:
- Add a feature flag for this feature, called UBLK_F_RR_TAGS (Ming Lei)
- Add test for this feature (Ming Lei)
- Add documentation for this feature (Ming Lei)
- Link to v5: https://lore.kernel.org/r/20250416-ublk_task_per_io-v5-0-9261ad7bff20@purestorage.com

Changes in v5:
- Set io->task before ublk_mark_io_ready (Caleb Sander Mateos)
- Set io->task atomically, read it atomically when needed
- Return 0 on success from command-specific helpers in
  __ublk_ch_uring_cmd (Caleb Sander Mateos)
- Rename ublk_handle_need_get_data to ublk_get_data (Caleb Sander
  Mateos)
- Link to v4: https://lore.kernel.org/r/20250415-ublk_task_per_io-v4-0-54210b91a46f@purestorage.com

Changes in v4:
- Drop "ublk: properly serialize all FETCH_REQs" since Ming is taking it
  in another set
- Prevent data races by marking data structures which should be
  read-only in the I/O path as const (Ming Lei)
- Link to v3: https://lore.kernel.org/r/20250410-ublk_task_per_io-v3-0-b811e8f4554a@purestorage.com

Changes in v3:
- Check for UBLK_IO_FLAG_ACTIVE on the I/O again after taking the lock to
  ensure that two concurrent FETCH_REQs on the same I/O can't succeed (Caleb
  Sander Mateos)
- Link to v2: https://lore.kernel.org/r/20250408-ublk_task_per_io-v2-0-b97877e6fd50@purestorage.com

Changes in v2:
- Remove changes split into other patches
- To ease error handling/synchronization, associate each I/O (instead of
  each queue) with the last task that issues a FETCH_REQ against it. Only
  that task is allowed to operate on the I/O.
- Link to v1: https://lore.kernel.org/r/20241002224437.3088981-1-ushankar@purestorage.com

---
Uday Shankar (8):
      ublk: have a per-io daemon instead of a per-queue daemon
      sbitmap: fix off-by-one when wrapping hint
      selftests: ublk: kublk: plumb q_id in io_uring user_data
      selftests: ublk: kublk: tie sqe allocation to io instead of queue
      selftests: ublk: kublk: lift queue initialization out of thread
      selftests: ublk: kublk: move per-thread data out of ublk_queue
      selftests: ublk: kublk: decouple ublk_queues from ublk server threads
      Documentation: ublk: document UBLK_F_RR_TAGS

 Documentation/block/ublk.rst                       |  83 +++++-
 drivers/block/ublk_drv.c                           |  82 ++---
 include/uapi/linux/ublk_cmd.h                      |   8 +
 lib/sbitmap.c                                      |   4 +-
 tools/testing/selftests/ublk/Makefile              |   1 +
 tools/testing/selftests/ublk/fault_inject.c        |   4 +-
 tools/testing/selftests/ublk/file_backed.c         |  20 +-
 tools/testing/selftests/ublk/kublk.c               | 329 ++++++++++++++-------
 tools/testing/selftests/ublk/kublk.h               |  73 +++--
 tools/testing/selftests/ublk/null.c                |  12 +-
 tools/testing/selftests/ublk/stripe.c              |  17 +-
 tools/testing/selftests/ublk/test_generic_08.sh    |  61 ++++
 .../selftests/ublk/trace/count_ios_per_tid.bt      |   9 +
 13 files changed, 488 insertions(+), 215 deletions(-)
---
base-commit: 037af793557ed192b2c10cf2379ac97abacedf55
change-id: 20250408-ublk_task_per_io-c693cf608d7a

Best regards,

Comments

Randy Dunlap May 7, 2025, 11:08 p.m. UTC | #1
Hi,

On 5/7/25 2:49 PM, Uday Shankar wrote:
> Document the new flag UBLK_F_RR_TAGS along with its intended use case.
> Also describe the new restrictions on threading model imposed by
> ublk_drv (one (qid,tag) pair can be served by only one thread), and
> remove references to ubq_daemon/per-queue threads, since such a concept
> no longer exists.
> 
> Signed-off-by: Uday Shankar <ushankar@purestorage.com>
> ---
>  Documentation/block/ublk.rst | 83 ++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 72 insertions(+), 11 deletions(-)
> 
> diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
> index 854f823b46c2add01d0b65ba36aecd26c45bb65d..e9cbabdd69c5539a02463780ba5e51de0416c3f6 100644
> --- a/Documentation/block/ublk.rst
> +++ b/Documentation/block/ublk.rst
> @@ -115,15 +115,15 @@ managing and controlling ublk devices with help of several control commands:
>  
>  - ``UBLK_CMD_START_DEV``
>  
> -  After the server prepares userspace resources (such as creating per-queue
> -  pthread & io_uring for handling ublk IO), this command is sent to the
> +  After the server prepares userspace resources (such as creating I/O handler
> +  threads & io_uring for handling ublk IO), this command is sent to the
>    driver for allocating & exposing ``/dev/ublkb*``. Parameters set via
>    ``UBLK_CMD_SET_PARAMS`` are applied for creating the device.
>  
>  - ``UBLK_CMD_STOP_DEV``
>  
>    Halt IO on ``/dev/ublkb*`` and remove the device. When this command returns,
> -  ublk server will release resources (such as destroying per-queue pthread &
> +  ublk server will release resources (such as destroying I/O handler threads &
>    io_uring).
>  
>  - ``UBLK_CMD_DEL_DEV``
> @@ -208,15 +208,15 @@ managing and controlling ublk devices with help of several control commands:
>    modify how I/O is handled while the ublk server is dying/dead (this is called
>    the ``nosrv`` case in the driver code).
>  
> -  With just ``UBLK_F_USER_RECOVERY`` set, after one ubq_daemon(ublk server's io
> -  handler) is dying, ublk does not delete ``/dev/ublkb*`` during the whole
> +  With just ``UBLK_F_USER_RECOVERY`` set, after the ublk server exits,
> +  ublk does not delete ``/dev/ublkb*`` during the whole
>    recovery stage and ublk device ID is kept. It is ublk server's
>    responsibility to recover the device context by its own knowledge.
>    Requests which have not been issued to userspace are requeued. Requests
>    which have been issued to userspace are aborted.
>  
> -  With ``UBLK_F_USER_RECOVERY_REISSUE`` additionally set, after one ubq_daemon
> -  (ublk server's io handler) is dying, contrary to ``UBLK_F_USER_RECOVERY``,
> +  With ``UBLK_F_USER_RECOVERY_REISSUE`` additionally set, after the ublk server
> +  exits, contrary to ``UBLK_F_USER_RECOVERY``,
>    requests which have been issued to userspace are requeued and will be
>    re-issued to the new process after handling ``UBLK_CMD_END_USER_RECOVERY``.
>    ``UBLK_F_USER_RECOVERY_REISSUE`` is designed for backends who tolerate
> @@ -241,10 +241,11 @@ can be controlled/accessed just inside this container.
>  Data plane
>  ----------
>  
> -ublk server needs to create per-queue IO pthread & io_uring for handling IO
> -commands via io_uring passthrough. The per-queue IO pthread
> -focuses on IO handling and shouldn't handle any control & management
> -tasks.
> +The ublk server should create dedicated threads for handling I/O. Each
> +thread should have its own io_uring through which it is notified of new
> +I/O, and through which it can complete I/O. These dedicated threads
> +should focus on IO handling and shouldn't handle any control &
> +management tasks.
>  
>  The's IO is assigned by a unique tag, which is 1:1 mapping with IO

   ???

>  request of ``/dev/ublkb*``.
> @@ -265,6 +266,13 @@ with specified IO tag in the command data:
>    destined to ``/dev/ublkb*``. This command is sent only once from the server
>    IO pthread for ublk driver to setup IO forward environment.
>  
> +  Once a thread issues this command against a given (qid,tag) pair, the thread
> +  registers itself as that I/O's daemon. In the future, only that I/O's daemon
> +  is allowed to issue commands against the I/O. If any other thread attempts
> +  to issue a command against a (qid,tag) pair for which the thread is not the
> +  daemon, the command will fail. Daemons can be reset only by going through
> +  recovery.
> +
>  - ``UBLK_IO_COMMIT_AND_FETCH_REQ``
>  
>    When an IO request is destined to ``/dev/ublkb*``, the driver stores
> @@ -309,6 +317,59 @@ with specified IO tag in the command data:
>    ``UBLK_IO_COMMIT_AND_FETCH_REQ`` to the server, ublkdrv needs to copy
>    the server buffer (pages) read to the IO request pages.
>  
> +Load balancing
> +--------------
> +
> +A simple approach to designing a ublk server might involve selecting a
> +number of I/O handler threads N, creating devices with N queues, and
> +pairing up I/O handler threads with queues, so that each thread gets a
> +unique qid, and it issues ``FETCH_REQ``s against all tags for that qid.
> +Indeed, before the introduction of the ``UBLK_F_RR_TAGS`` feature, this
> +was essentially the only option (*)

Add ending period (full stop), please.

> +
> +This approach can run into performance issues under imbalanced load.
> +This architecture taken together with the `blk-mq architecture
> +<https://docs.kernel.org/block/blk-mq.html>`_ implies that there is a
> +fixed mapping from I/O submission CPU to the ublk server thread that
> +handles it. If the workload is CPU-bottlenecked, only allowing one ublk
> +server thread to handle all the I/O generated from a single CPU can
> +limit peak bandwidth.
> +
> +To address this issue, two changes were made:
> +
> +- ublk servers can now pair up threads with I/Os (i.e. (qid,tag) pairs)
> +  arbitrarily. In particular, the preexisting restriction that all I/Os
> +  in one queue must be served by the same thread is lifted.
> +- ublk servers can now specify ``UBLK_F_RR_TAGS`` when creating a ublk
> +  device to get round-robin tag allocation on each queue

Add ending period (full stop), please.

> +
> +The ublk server can check for the presence of these changes by testing
> +for the ``UBLK_F_RR_TAGS`` feature.
> +
> +With these changes, a ublk server can balance load as follows:
> +
> +- create the device with ``UBLK_F_RR_TAGS`` set in
> +  ``ublksrv_ctrl_dev_info::flags`` when issuing the ``ADD_DEV`` command
> +- issue ``FETCH_REQ``s from ublk server threads to (qid,tag) pairs in
> +  a round-robin manner. For example, for a device configured with
> +  ``nr_hw_queues=2`` and ``queue_depth=4``, and a ublk server having 4
> +  I/O handling threads, ``FETCH_REQ``s could be issued as follows, where
> +  each entry in the table is the pair (``ublksrv_io_cmd::q_id``,
> +  ``ublksrv_io_cmd::tag``) in the payload of the ``FETCH_REQ``.
> +
> +  ======== ======== ======== ========
> +  thread 0 thread 1 thread 2 thread 3
> +  ======== ======== ======== ========
> +  (0, 0)   (0, 1)   (0, 2)   (0, 3)
> +  (1, 3)   (1, 0)   (1, 1)   (1, 2)
> +
> +With this setup, I/O submitted on a CPU which maps to queue 0 will be
> +balanced across all threads instead of all landing on the same thread.
> +Thus, a potential bottleneck is avoided.
> +
> +(*) technically, one I/O handling thread could service multiple queues

       Technically,

> +if it wanted to, but that doesn't help with imbalanced load

Add ending period (full stop), please.


> +
>  Zero copy
>  ---------
>  
>
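
The daemon-registration rule added above (a thread becomes the daemon for a
(qid,tag) pair by issuing the first ``FETCH_REQ`` against it from its own
per-thread io_uring) can be sketched roughly as below. This is illustrative
only, not code from the series; it assumes the ublk char device was
registered as fixed file index 0, as the selftests do, and that 'buf' is the
per-io buffer address (0 when zero copy is used).

#include <string.h>
#include <liburing.h>
#include <linux/ublk_cmd.h>

/*
 * Illustrative sketch: issue UBLK_U_IO_FETCH_REQ for (q_id, tag) on this
 * thread's ring. The thread whose ring carries this SQE becomes the
 * daemon for the pair.
 */
static int fetch_one(struct io_uring *ring, __u16 q_id, __u16 tag, __u64 buf)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct ublksrv_io_cmd *cmd;

	if (!sqe)
		return -1;

	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode	= IORING_OP_URING_CMD;
	sqe->fd		= 0;			/* registered ublk char dev */
	sqe->flags	= IOSQE_FIXED_FILE;
	sqe->cmd_op	= UBLK_U_IO_FETCH_REQ;

	cmd = (struct ublksrv_io_cmd *)sqe->cmd;	/* 16-byte inline payload */
	cmd->q_id	= q_id;
	cmd->tag	= tag;
	cmd->addr	= buf;

	/* simplified user_data encoding; kublk packs more fields, see kublk.h */
	io_uring_sqe_set_data64(sqe, ((__u64)q_id << 16) | tag);
	return 0;
}
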
Bagas Sanjaya May 8, 2025, 2:08 a.m. UTC | #2
On Wed, May 07, 2025 at 03:49:42PM -0600, Uday Shankar wrote:
> +Load balancing
> +--------------
> +
> +A simple approach to designing a ublk server might involve selecting a
> +number of I/O handler threads N, creating devices with N queues, and
> +pairing up I/O handler threads with queues, so that each thread gets a
> +unique qid, and it issues ``FETCH_REQ``s against all tags for that qid.
                             ``FETCH_REQ``\s (escape s)
> +Indeed, before the introduction of the ``UBLK_F_RR_TAGS`` feature, this
> +was essentially the only option (*)

Use reST footnotes syntax, i.e.:

---- >8 ----
diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
index 440b63be4ea8b6..b1d29fceff4e80 100644
--- a/Documentation/block/ublk.rst
+++ b/Documentation/block/ublk.rst
@@ -325,7 +325,7 @@ number of I/O handler threads N, creating devices with N queues, and
 pairing up I/O handler threads with queues, so that each thread gets a
 unique qid, and it issues ``FETCH_REQ``\s against all tags for that qid.
 Indeed, before the introduction of the ``UBLK_F_RR_TAGS`` feature, this
-was essentially the only option (*)
+was essentially the only option [#]_

 This approach can run into performance issues under imbalanced load.
 This architecture taken together with the `blk-mq architecture
@@ -368,8 +368,8 @@ With this setup, I/O submitted on a CPU which maps to queue 0 will be
 balanced across all threads instead of all landing on the same thread.
 Thus, a potential bottleneck is avoided.

-(*) technically, one I/O handling thread could service multiple queues
-if it wanted to, but that doesn't help with imbalanced load
+.. [#] Technically, one I/O handling thread could service multiple queues
+       if it wanted to, but that doesn't help with imbalanced load

 Zero copy
 ---------

> +
> +This approach can run into performance issues under imbalanced load.
> +This architecture taken together with the `blk-mq architecture
> +<https://docs.kernel.org/block/blk-mq.html>`_ implies that there is a
This architecture, taken together with the
:doc:`blk-mq architecture </block/blk-mq>`, implies that ...
> +fixed mapping from I/O submission CPU to the ublk server thread that
> +handles it. If the workload is CPU-bottlenecked, only allowing one ublk
> +server thread to handle all the I/O generated from a single CPU can
> +limit peak bandwidth.
> +
> <snipped>...
> +With these changes, a ublk server can balance load as follows:
> +
> +- create the device with ``UBLK_F_RR_TAGS`` set in
> +  ``ublksrv_ctrl_dev_info::flags`` when issuing the ``ADD_DEV`` command
> +- issue ``FETCH_REQ``s from ublk server threads to (qid,tag) pairs in
> +  a round-robin manner. For example, for a device configured with
> +  ``nr_hw_queues=2`` and ``queue_depth=4``, and a ublk server having 4
> +  I/O handling threads, ``FETCH_REQ``s could be issued as follows, where
> +  each entry in the table is the pair (``ublksrv_io_cmd::q_id``,
> +  ``ublksrv_io_cmd::tag``) in the payload of the ``FETCH_REQ``.

s/``FETCH_REQ``/``FETCH_REQ``\s/ (escape s after FETCH_REQ).

> +
> +  ======== ======== ======== ========
> +  thread 0 thread 1 thread 2 thread 3
> +  ======== ======== ======== ========
> +  (0, 0)   (0, 1)   (0, 2)   (0, 3)
> +  (1, 3)   (1, 0)   (1, 1)   (1, 2)

Add table border in the bottom, i.e.:

---- >8 ----
diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
index e9cbabdd69c553..dc6fdfedba9ab4 100644
--- a/Documentation/block/ublk.rst
+++ b/Documentation/block/ublk.rst
@@ -362,6 +362,7 @@ With these changes, a ublk server can balance load as follows:
   ======== ======== ======== ========
   (0, 0)   (0, 1)   (0, 2)   (0, 3)
   (1, 3)   (1, 0)   (1, 1)   (1, 2)
+  ======== ======== ======== ========

 With this setup, I/O submitted on a CPU which maps to queue 0 will be
 balanced across all threads instead of all landing on the same thread.

Thanks.
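
The table quoted above corresponds to assigning thread = (q_id + tag) %
nr_threads. A minimal sketch of a thread claiming its share of (qid,tag)
pairs under that rule might look as follows; this is illustrative only,
fetch_one() refers to the sketch shown earlier in this thread, and the
parameter names are placeholders rather than kublk code.

#include <liburing.h>

int fetch_one(struct io_uring *ring, __u16 q_id, __u16 tag, __u64 buf);

/*
 * Illustrative sketch: thread 'thread_idx' issues FETCH_REQ for every
 * (q_id, tag) pair with (q_id + tag) % nthreads == thread_idx, matching
 * the round-robin table above.
 */
static void fetch_round_robin(struct io_uring *ring, unsigned int thread_idx,
			      unsigned int nthreads, unsigned int nr_queues,
			      unsigned int depth)
{
	for (unsigned int q = 0; q < nr_queues; q++)
		for (unsigned int tag = 0; tag < depth; tag++)
			if ((q + tag) % nthreads == thread_idx)
				fetch_one(ring, q, tag, 0);
}
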
Ming Lei May 9, 2025, 7:28 a.m. UTC | #3
On Wed, May 07, 2025 at 03:49:37PM -0600, Uday Shankar wrote:
> Currently, when we process CQEs, we know which ublk_queue we are working
> on because we know which ring we are working on, and ublk_queues and
> rings are in 1:1 correspondence. However, as we decouple ublk_queues
> from ublk server threads, ublk_queues and rings will no longer be in 1:1
> correspondence - each ublk server thread will have a ring, and each
> thread may issue commands against more than one ublk_queue. So in order
> to know which ublk_queue a CQE refers to, plumb that information in the
> associated SQE's user_data.
> 
> Signed-off-by: Uday Shankar <ushankar@purestorage.com>

Looks fine,

Reviewed-by: Ming Lei <ming.lei@redhat.com>



Thanks,
Ming
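
For readers following along: since a ring no longer maps 1:1 to a queue, the
CQE handler must recover the queue from the completion itself, which is why
q_id is packed into the SQE's 64-bit user_data alongside the tag and opcode.
The sketch below shows one possible encoding; it mirrors the intent of
kublk's build_user_data()/user_data_to_*() helpers, but the field layout
shown here is an assumption, not the one the selftests actually use.

#include <linux/types.h>

/* Illustrative layout only; the real helpers live in kublk.h. */
static inline __u64 example_build_user_data(__u16 tag, __u8 op, __u16 q_id)
{
	return (__u64)tag | ((__u64)op << 16) | ((__u64)q_id << 32);
}

static inline __u16 example_user_data_to_tag(__u64 data)
{
	return data & 0xffff;
}

static inline __u16 example_user_data_to_q_id(__u64 data)
{
	return (data >> 32) & 0xffff;
}
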
Ming Lei May 9, 2025, 8:14 a.m. UTC | #4
On Wed, May 07, 2025 at 03:49:40PM -0600, Uday Shankar wrote:
> Towards the goal of decoupling ublk_queues from ublk server threads,
> move resources/data that should be per-thread rather than per-queue out
> of ublk_queue and into a new struct ublk_thread.
> 
> Signed-off-by: Uday Shankar <ushankar@purestorage.com>
> ---
>  tools/testing/selftests/ublk/kublk.c | 225 ++++++++++++++++++-----------------
>  tools/testing/selftests/ublk/kublk.h |  38 ++++--
>  2 files changed, 145 insertions(+), 118 deletions(-)
> 
> diff --git a/tools/testing/selftests/ublk/kublk.c b/tools/testing/selftests/ublk/kublk.c
> index 3ad9e162816c3a10e9928f9d530908cda7595530..313689f94cd6361a9a0f4b9257085b2a62bc8b8c 100644
> --- a/tools/testing/selftests/ublk/kublk.c
> +++ b/tools/testing/selftests/ublk/kublk.c
> @@ -324,8 +324,8 @@ static void ublk_ctrl_dump(struct ublk_dev *dev)
>  
>  		for (i = 0; i < info->nr_hw_queues; i++) {
>  			ublk_print_cpu_set(&affinity[i], buf, sizeof(buf));
> -			printf("\tqueue %u: tid %d affinity(%s)\n",
> -					i, dev->q[i].tid, buf);
> +			printf("\tqueue %u: affinity(%s)\n",
> +					i, buf);
>  		}
>  		free(affinity);
>  	}
> @@ -395,18 +395,16 @@ static void ublk_queue_deinit(struct ublk_queue *q)
>  		free(q->ios[i].buf_addr);
>  }
>  
> -static void ublk_thread_deinit(struct ublk_queue *q)
> +static void ublk_thread_deinit(struct ublk_thread *t)
>  {
> -	q->tid = 0;
> +	io_uring_unregister_buffers(&t->ring);
>  
> -	io_uring_unregister_buffers(&q->ring);
> +	io_uring_unregister_ring_fd(&t->ring);
>  
> -	io_uring_unregister_ring_fd(&q->ring);
> -
> -	if (q->ring.ring_fd > 0) {
> -		io_uring_unregister_files(&q->ring);
> -		close(q->ring.ring_fd);
> -		q->ring.ring_fd = -1;
> +	if (t->ring.ring_fd > 0) {
> +		io_uring_unregister_files(&t->ring);
> +		close(t->ring.ring_fd);
> +		t->ring.ring_fd = -1;
>  	}
>  }
>  
> @@ -421,7 +419,6 @@ static int ublk_queue_init(struct ublk_queue *q)
>  	q->tgt_ops = dev->tgt.ops;
>  	q->state = 0;
>  	q->q_depth = depth;
> -	q->cmd_inflight = 0;
>  
>  	if (dev->dev_info.flags & UBLK_F_SUPPORT_ZERO_COPY) {
>  		q->state |= UBLKSRV_NO_BUF;
> @@ -443,6 +440,7 @@ static int ublk_queue_init(struct ublk_queue *q)
>  		q->ios[i].buf_addr = NULL;
>  		q->ios[i].flags = UBLKSRV_NEED_FETCH_RQ | UBLKSRV_IO_FREE;
>  		q->ios[i].q = q;
> +		q->ios[i].tag = i;
>  
>  		if (q->state & UBLKSRV_NO_BUF)
>  			continue;
> @@ -463,47 +461,46 @@ static int ublk_queue_init(struct ublk_queue *q)
>  	return -ENOMEM;
>  }
>  
> -static int ublk_thread_init(struct ublk_queue *q)
> +static int ublk_thread_init(struct ublk_thread *t)
>  {
> -	struct ublk_dev *dev = q->dev;
> +	struct ublk_dev *dev = t->dev;
>  	int ring_depth = dev->tgt.sq_depth, cq_depth = dev->tgt.cq_depth;
>  	int ret;
>  
> -	q->tid = gettid();
> -
> -	ret = ublk_setup_ring(&q->ring, ring_depth, cq_depth,
> +	ret = ublk_setup_ring(&t->ring, ring_depth, cq_depth,
>  			IORING_SETUP_COOP_TASKRUN |
>  			IORING_SETUP_SINGLE_ISSUER |
>  			IORING_SETUP_DEFER_TASKRUN);
>  	if (ret < 0) {
> -		ublk_err("ublk dev %d queue %d setup io_uring failed %d\n",
> -				q->dev->dev_info.dev_id, q->q_id, ret);
> +		ublk_err("ublk dev %d thread %d setup io_uring failed %d\n",
> +				dev->dev_info.dev_id, t->idx, ret);
>  		goto fail;
>  	}
>  
>  	if (dev->dev_info.flags & UBLK_F_SUPPORT_ZERO_COPY) {
> -		ret = io_uring_register_buffers_sparse(&q->ring, q->q_depth);
> +		ret = io_uring_register_buffers_sparse(
> +			&t->ring, dev->dev_info.queue_depth);
>  		if (ret) {
> -			ublk_err("ublk dev %d queue %d register spare buffers failed %d",
> -					dev->dev_info.dev_id, q->q_id, ret);
> +			ublk_err("ublk dev %d thread %d register spare buffers failed %d",
> +					dev->dev_info.dev_id, t->idx, ret);
>  			goto fail;
>  		}
>  	}
>  
> -	io_uring_register_ring_fd(&q->ring);
> +	io_uring_register_ring_fd(&t->ring);
>  
> -	ret = io_uring_register_files(&q->ring, dev->fds, dev->nr_fds);
> +	ret = io_uring_register_files(&t->ring, dev->fds, dev->nr_fds);
>  	if (ret) {
> -		ublk_err("ublk dev %d queue %d register files failed %d\n",
> -				q->dev->dev_info.dev_id, q->q_id, ret);
> +		ublk_err("ublk dev %d thread %d register files failed %d\n",
> +				t->dev->dev_info.dev_id, t->idx, ret);
>  		goto fail;
>  	}
>  
>  	return 0;
>  fail:
> -	ublk_thread_deinit(q);
> -	ublk_err("ublk dev %d queue %d thread init failed\n",
> -			dev->dev_info.dev_id, q->q_id);
> +	ublk_thread_deinit(t);
> +	ublk_err("ublk dev %d thread %d init failed\n",
> +			dev->dev_info.dev_id, t->idx);
>  	return -ENOMEM;
>  }
>  
> @@ -545,8 +542,9 @@ static void ublk_dev_unprep(struct ublk_dev *dev)
>  	close(dev->fds[0]);
>  }
>  
> -int ublk_queue_io_cmd(struct ublk_queue *q, struct ublk_io *io, unsigned tag)
> +int ublk_queue_io_cmd(struct ublk_io *io)
>  {
> +	struct ublk_thread *t = io->t;
>  	struct ublksrv_io_cmd *cmd;
>  	struct io_uring_sqe *sqe[1];
>  	unsigned int cmd_op = 0;
> @@ -571,13 +569,13 @@ int ublk_queue_io_cmd(struct ublk_queue *q, struct ublk_io *io, unsigned tag)
>  	else if (io->flags & UBLKSRV_NEED_FETCH_RQ)
>  		cmd_op = UBLK_U_IO_FETCH_REQ;
>  
> -	if (io_uring_sq_space_left(&q->ring) < 1)
> -		io_uring_submit(&q->ring);
> +	if (io_uring_sq_space_left(&t->ring) < 1)
> +		io_uring_submit(&t->ring);
>  
> -	ublk_io_alloc_sqes(ublk_get_io(q, tag), sqe, 1);
> +	ublk_io_alloc_sqes(io, sqe, 1);
>  	if (!sqe[0]) {
> -		ublk_err("%s: run out of sqe %d, tag %d\n",
> -				__func__, q->q_id, tag);
> +		ublk_err("%s: run out of sqe. thread %u, tag %d\n",
> +				__func__, t->idx, io->tag);
>  		return -1;
>  	}
>  
> @@ -592,42 +590,51 @@ int ublk_queue_io_cmd(struct ublk_queue *q, struct ublk_io *io, unsigned tag)
>  	sqe[0]->opcode	= IORING_OP_URING_CMD;
>  	sqe[0]->flags	= IOSQE_FIXED_FILE;
>  	sqe[0]->rw_flags	= 0;
> -	cmd->tag	= tag;
> -	cmd->q_id	= q->q_id;
> -	if (!(q->state & UBLKSRV_NO_BUF))
> +	cmd->tag	= io->tag;
> +	cmd->q_id	= io->q->q_id;
> +	if (!(io->q->state & UBLKSRV_NO_BUF))
>  		cmd->addr	= (__u64) (uintptr_t) io->buf_addr;
>  	else
>  		cmd->addr	= 0;
>  
> -	user_data = build_user_data(tag, _IOC_NR(cmd_op), 0, q->q_id, 0);
> +	user_data = build_user_data(io->tag, _IOC_NR(cmd_op), 0, io->q->q_id, 0);
>  	io_uring_sqe_set_data64(sqe[0], user_data);
>  
>  	io->flags = 0;
>  
> -	q->cmd_inflight += 1;
> +	t->cmd_inflight += 1;
>  
> -	ublk_dbg(UBLK_DBG_IO_CMD, "%s: (qid %d tag %u cmd_op %u) iof %x stopping %d\n",
> -			__func__, q->q_id, tag, cmd_op,
> -			io->flags, !!(q->state & UBLKSRV_QUEUE_STOPPING));
> +	ublk_dbg(UBLK_DBG_IO_CMD, "%s: (thread %u qid %d tag %u cmd_op %u) iof %x stopping %d\n",
> +			__func__, t->idx, io->q->q_id, io->tag, cmd_op,
> +			io->flags, !!(t->state & UBLKSRV_THREAD_STOPPING));
>  	return 1;
>  }
>  
> -static void ublk_submit_fetch_commands(struct ublk_queue *q)
> +static void ublk_submit_fetch_commands(struct ublk_thread *t)
>  {
> +	/*
> +	 * Service exclusively the queue whose q_id matches our thread
> +	 * index. This may change in the future.
> +	 */
> +	struct ublk_queue *q = &t->dev->q[t->idx];
> +	struct ublk_io *io;
>  	int i = 0;
>  
> -	for (i = 0; i < q->q_depth; i++)
> -		ublk_queue_io_cmd(q, &q->ios[i], i);
> +	for (i = 0; i < q->q_depth; i++) {
> +		io = &q->ios[i];
> +		io->t = t;
> +		ublk_queue_io_cmd(io);
> +	}
>  }
>  
> -static int ublk_queue_is_idle(struct ublk_queue *q)
> +static int ublk_thread_is_idle(struct ublk_thread *t)
>  {
> -	return !io_uring_sq_ready(&q->ring) && !q->io_inflight;
> +	return !io_uring_sq_ready(&t->ring) && !t->io_inflight;
>  }
>  
> -static int ublk_queue_is_done(struct ublk_queue *q)
> +static int ublk_thread_is_done(struct ublk_thread *t)
>  {
> -	return (q->state & UBLKSRV_QUEUE_STOPPING) && ublk_queue_is_idle(q);
> +	return (t->state & UBLKSRV_THREAD_STOPPING) && ublk_thread_is_idle(t);
>  }
>  
>  static inline void ublksrv_handle_tgt_cqe(struct ublk_queue *q,
> @@ -645,15 +652,16 @@ static inline void ublksrv_handle_tgt_cqe(struct ublk_queue *q,
>  		q->tgt_ops->tgt_io_done(q, tag, cqe);
>  }
>  
> -static void ublk_handle_cqe(struct ublk_dev *dev,
> +static void ublk_handle_cqe(struct ublk_thread *t,
>  		struct io_uring_cqe *cqe, void *data)
>  {
> +	struct ublk_dev *dev = t->dev;
>  	unsigned q_id = user_data_to_q_id(cqe->user_data);
>  	struct ublk_queue *q = &dev->q[q_id];
>  	unsigned tag = user_data_to_tag(cqe->user_data);
>  	unsigned cmd_op = user_data_to_op(cqe->user_data);
>  	int fetch = (cqe->res != UBLK_IO_RES_ABORT) &&
> -		!(q->state & UBLKSRV_QUEUE_STOPPING);
> +		!(t->state & UBLKSRV_THREAD_STOPPING);
>  	struct ublk_io *io;
>  
>  	if (cqe->res < 0 && cqe->res != -ENODEV)
> @@ -664,7 +672,7 @@ static void ublk_handle_cqe(struct ublk_dev *dev,
>  			__func__, cqe->res, q->q_id, tag, cmd_op,
>  			is_target_io(cqe->user_data),
>  			user_data_to_tgt_data(cqe->user_data),
> -			(q->state & UBLKSRV_QUEUE_STOPPING));
> +			(t->state & UBLKSRV_THREAD_STOPPING));
>  
>  	/* Don't retrieve io in case of target io */
>  	if (is_target_io(cqe->user_data)) {
> @@ -673,10 +681,10 @@ static void ublk_handle_cqe(struct ublk_dev *dev,
>  	}
>  
>  	io = &q->ios[tag];
> -	q->cmd_inflight--;
> +	t->cmd_inflight--;
>  
>  	if (!fetch) {
> -		q->state |= UBLKSRV_QUEUE_STOPPING;
> +		t->state |= UBLKSRV_THREAD_STOPPING;
>  		io->flags &= ~UBLKSRV_NEED_FETCH_RQ;
>  	}
>  
> @@ -686,7 +694,7 @@ static void ublk_handle_cqe(struct ublk_dev *dev,
>  			q->tgt_ops->queue_io(q, tag);
>  	} else if (cqe->res == UBLK_IO_RES_NEED_GET_DATA) {
>  		io->flags |= UBLKSRV_NEED_GET_DATA | UBLKSRV_IO_FREE;
> -		ublk_queue_io_cmd(q, io, tag);
> +		ublk_queue_io_cmd(io);
>  	} else {
>  		/*
>  		 * COMMIT_REQ will be completed immediately since no fetching
> @@ -700,87 +708,92 @@ static void ublk_handle_cqe(struct ublk_dev *dev,
>  	}
>  }
>  
> -static int ublk_reap_events_uring(struct ublk_queue *q)
> +static int ublk_reap_events_uring(struct ublk_thread *t)
>  {
>  	struct io_uring_cqe *cqe;
>  	unsigned head;
>  	int count = 0;
>  
> -	io_uring_for_each_cqe(&q->ring, head, cqe) {
> -		ublk_handle_cqe(q->dev, cqe, NULL);
> +	io_uring_for_each_cqe(&t->ring, head, cqe) {
> +		ublk_handle_cqe(t, cqe, NULL);
>  		count += 1;
>  	}
> -	io_uring_cq_advance(&q->ring, count);
> +	io_uring_cq_advance(&t->ring, count);
>  
>  	return count;
>  }
>  
> -static int ublk_process_io(struct ublk_queue *q)
> +static int ublk_process_io(struct ublk_thread *t)
>  {
>  	int ret, reapped;
>  
> -	ublk_dbg(UBLK_DBG_QUEUE, "dev%d-q%d: to_submit %d inflight cmd %u stopping %d\n",
> -				q->dev->dev_info.dev_id,
> -				q->q_id, io_uring_sq_ready(&q->ring),
> -				q->cmd_inflight,
> -				(q->state & UBLKSRV_QUEUE_STOPPING));
> +	ublk_dbg(UBLK_DBG_THREAD, "dev%d-t%u: to_submit %d inflight cmd %u stopping %d\n",
> +				t->dev->dev_info.dev_id,
> +				t->idx, io_uring_sq_ready(&t->ring),
> +				t->cmd_inflight,
> +				(t->state & UBLKSRV_THREAD_STOPPING));
>  
> -	if (ublk_queue_is_done(q))
> +	if (ublk_thread_is_done(t))
>  		return -ENODEV;
>  
> -	ret = io_uring_submit_and_wait(&q->ring, 1);
> -	reapped = ublk_reap_events_uring(q);
> +	ret = io_uring_submit_and_wait(&t->ring, 1);
> +	reapped = ublk_reap_events_uring(t);
>  
> -	ublk_dbg(UBLK_DBG_QUEUE, "submit result %d, reapped %d stop %d idle %d\n",
> -			ret, reapped, (q->state & UBLKSRV_QUEUE_STOPPING),
> -			(q->state & UBLKSRV_QUEUE_IDLE));
> +	ublk_dbg(UBLK_DBG_THREAD, "submit result %d, reapped %d stop %d idle %d\n",
> +			ret, reapped, (t->state & UBLKSRV_THREAD_STOPPING),
> +			(t->state & UBLKSRV_THREAD_IDLE));
>  
>  	return reapped;
>  }
>  
> -static void ublk_queue_set_sched_affinity(const struct ublk_queue *q,
> +static void ublk_thread_set_sched_affinity(const struct ublk_thread *t,
>  		cpu_set_t *cpuset)
>  {
>          if (sched_setaffinity(0, sizeof(*cpuset), cpuset) < 0)
> -                ublk_err("ublk dev %u queue %u set affinity failed",
> -                                q->dev->dev_info.dev_id, q->q_id);
> +		ublk_err("ublk dev %u thread %u set affinity failed",
> +				t->dev->dev_info.dev_id, t->idx);
>  }
>  
> -struct ublk_queue_info {
> -	struct ublk_queue 	*q;
> -	sem_t 			*queue_sem;
> +struct ublk_thread_info {
> +	struct ublk_dev 	*dev;
> +	unsigned		idx;
> +	sem_t 			*ready;
>  	cpu_set_t 		*affinity;
>  };
>  
>  static void *ublk_io_handler_fn(void *data)
>  {
> -	struct ublk_queue_info *info = data;
> -	struct ublk_queue *q = info->q;
> -	int dev_id = q->dev->dev_info.dev_id;
> +	struct ublk_thread_info *info = data;
> +	struct ublk_thread *t = &info->dev->threads[info->idx];
> +	int dev_id = info->dev->dev_info.dev_id;
>  	int ret;
>  
> -	ret = ublk_thread_init(q);
> +	t->dev = info->dev;
> +	t->idx = info->idx;
> +
> +	ret = ublk_thread_init(t);
>  	if (ret) {
> -		ublk_err("ublk dev %d queue %d thread init failed\n",
> -				dev_id, q->q_id);
> +		ublk_err("ublk dev %d thread %u init failed\n",
> +				dev_id, t->idx);
>  		return NULL;
>  	}
>  	/* IO perf is sensitive with queue pthread affinity on NUMA machine*/
> -	ublk_queue_set_sched_affinity(q, info->affinity);
> -	sem_post(info->queue_sem);
> +	ublk_thread_set_sched_affinity(t, info->affinity);
> +	sem_post(info->ready);
>  
> -	ublk_dbg(UBLK_DBG_QUEUE, "tid %d: ublk dev %d queue %d started\n",
> -			q->tid, dev_id, q->q_id);
> +	ublk_dbg(UBLK_DBG_THREAD, "tid %d: ublk dev %d thread %u started\n",
> +			gettid(), dev_id, t->idx);
>  
>  	/* submit all io commands to ublk driver */
> -	ublk_submit_fetch_commands(q);
> +	ublk_submit_fetch_commands(t);
>  	do {
> -		if (ublk_process_io(q) < 0)
> +		if (ublk_process_io(t) < 0)
>  			break;
>  	} while (1);
>  
> -	ublk_dbg(UBLK_DBG_QUEUE, "ublk dev %d queue %d exited\n", dev_id, q->q_id);
> -	ublk_thread_deinit(q);
> +	ublk_dbg(UBLK_DBG_THREAD, "tid %d: ublk dev %d thread %d exiting\n",
> +		 gettid(), dev_id, t->idx);
> +	ublk_thread_deinit(t);
>  	return NULL;
>  }
>  
> @@ -823,20 +836,19 @@ static int ublk_send_dev_event(const struct dev_ctx *ctx, struct ublk_dev *dev,
>  static int ublk_start_daemon(const struct dev_ctx *ctx, struct ublk_dev *dev)
>  {
>  	const struct ublksrv_ctrl_dev_info *dinfo = &dev->dev_info;
> -	struct ublk_queue_info *qinfo;
> +	struct ublk_thread_info *tinfo;
>  	cpu_set_t *affinity_buf;
>  	void *thread_ret;
> -	sem_t queue_sem;
> +	sem_t ready;
>  	int ret, i;
>  
>  	ublk_dbg(UBLK_DBG_DEV, "%s enter\n", __func__);
>  
> -	qinfo = (struct ublk_queue_info *)calloc(sizeof(struct ublk_queue_info),
> -			dinfo->nr_hw_queues);
> -	if (!qinfo)
> +	tinfo = calloc(sizeof(struct ublk_thread_info), dinfo->nr_hw_queues);
> +	if (!tinfo)
>  		return -ENOMEM;
>  
> -	sem_init(&queue_sem, 0, 0);
> +	sem_init(&ready, 0, 0);
>  	ret = ublk_dev_prep(ctx, dev);
>  	if (ret)
>  		return ret;
> @@ -856,17 +868,18 @@ static int ublk_start_daemon(const struct dev_ctx *ctx, struct ublk_dev *dev)
>  			goto fail;
>  		}
>  
> -		qinfo[i].q = &dev->q[i];
> -		qinfo[i].queue_sem = &queue_sem;
> -		qinfo[i].affinity = &affinity_buf[i];
> -		pthread_create(&dev->q[i].thread, NULL,
> +		tinfo[i].dev = dev;
> +		tinfo[i].idx = i;
> +		tinfo[i].ready = &ready;
> +		tinfo[i].affinity = &affinity_buf[i];
> +		pthread_create(&dev->threads[i].thread, NULL,
>  				ublk_io_handler_fn,
> -				&qinfo[i]);
> +				&tinfo[i]);
>  	}
>  
>  	for (i = 0; i < dinfo->nr_hw_queues; i++)
> -		sem_wait(&queue_sem);
> -	free(qinfo);
> +		sem_wait(&ready);
> +	free(tinfo);
>  	free(affinity_buf);
>  
>  	/* everything is fine now, start us */
> @@ -889,7 +902,7 @@ static int ublk_start_daemon(const struct dev_ctx *ctx, struct ublk_dev *dev)
>  
>  	/* wait until we are terminated */
>  	for (i = 0; i < dinfo->nr_hw_queues; i++)
> -		pthread_join(dev->q[i].thread, &thread_ret);
> +		pthread_join(dev->threads[i].thread, &thread_ret);
>   fail:
>  	for (i = 0; i < dinfo->nr_hw_queues; i++)
>  		ublk_queue_deinit(&dev->q[i]);
> diff --git a/tools/testing/selftests/ublk/kublk.h b/tools/testing/selftests/ublk/kublk.h
> index 7c912116606429215af7dbc2a8ce6b40ef89bfbd..9eb2207fcebe96d34488d057c881db262b9767b3 100644
> --- a/tools/testing/selftests/ublk/kublk.h
> +++ b/tools/testing/selftests/ublk/kublk.h
> @@ -51,10 +51,12 @@
>  #define UBLK_IO_MAX_BYTES               (1 << 20)
>  #define UBLK_MAX_QUEUES_SHIFT		5
>  #define UBLK_MAX_QUEUES                 (1 << UBLK_MAX_QUEUES_SHIFT)
> +#define UBLK_MAX_THREADS_SHIFT		5
> +#define UBLK_MAX_THREADS		(1 << UBLK_MAX_THREADS_SHIFT)
>  #define UBLK_QUEUE_DEPTH                1024
>  
>  #define UBLK_DBG_DEV            (1U << 0)
> -#define UBLK_DBG_QUEUE          (1U << 1)
> +#define UBLK_DBG_THREAD         (1U << 1)
>  #define UBLK_DBG_IO_CMD         (1U << 2)
>  #define UBLK_DBG_IO             (1U << 3)
>  #define UBLK_DBG_CTRL_CMD       (1U << 4)
> @@ -62,6 +64,7 @@
>  
>  struct ublk_dev;
>  struct ublk_queue;
> +struct ublk_thread;
>  
>  struct stripe_ctx {
>  	/* stripe */
> @@ -120,6 +123,8 @@ struct ublk_io {
>  	unsigned short refs;		/* used by target code only */
>  
>  	struct ublk_queue *q;
> +	struct ublk_thread *t;

Given that there is a static mapping between queue/tag and thread,
'struct ublk_thread' could easily be figured out at runtime; then we can
save 8 bytes and also avoid an indirect memory dereference.

sizeof(struct ublk_io) needs to fit in a single L1 cacheline.

But that can be done as a followup.

Reviewed-by: Ming Lei <ming.lei@redhat.com>


thanks,
Ming
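
A possible shape for that followup, assuming the static round-robin
assignment thread = (q_id + tag) % nr_threads used by the test; this is a
hypothetical sketch, not part of the series, and the 'threads'/'nthreads'
field names are assumptions.

#include "kublk.h"	/* selftests header; struct layouts per this series */

/*
 * Hypothetical: recompute the owning thread from (q_id, tag) instead of
 * caching it in struct ublk_io, saving the 8-byte 'io->t' pointer.
 */
static inline struct ublk_thread *io_to_thread(struct ublk_dev *dev,
					       const struct ublk_io *io)
{
	return &dev->threads[(io->q->q_id + io->tag) % dev->nthreads];
}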