[00/11] target: fix cmd plugging and completion

Message ID 20210204113513.93204-1-michael.christie@oracle.com
Series target: fix cmd plugging and completion

Message

Mike Christie Feb. 4, 2021, 11:35 a.m. UTC
The following patches made over Martin's 5.12 branches fix two
issues:

1. target_core_iblock plugs and unplugs the queue for every
command. To fix this, and to handle an issue that vhost-scsi
and loop were avoiding by adding their own workqueues, I added
a new submission workqueue to LIO. Drivers can pass cmds to it,
and we can then submit batches of cmds.

2. vhost-scsi and loop were doing a work per cmd on the
submission side, and the lio completion side was also doing a
work per cmd. The cap on running works is 512 (max_active), so we
can end up using a lot of threads when submissions start blocking
because they hit the block tag limit or the completion side blocks
trying to send the cmd. In this patchset I just use a cmd list
per session to avoid abusing the workqueue layer.
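
To make the batching concrete, here is a condensed sketch of the
per-session queue pattern this set uses (the struct name matches the
patches; "submission_wq" and the drain step are simplified
placeholders, not the exact code):

    struct se_sess_cmd_queue {
            struct llist_head cmd_list;
            struct work_struct work;
    };

    /* submission side: lock-free add, then kick the work for this CPU */
    static void queue_cmd(struct se_sess_cmd_queue *q,
                          struct llist_node *cmd, int cpu)
    {
            llist_add(cmd, &q->cmd_list);
            queue_work_on(cpu, submission_wq, &q->work);
    }

    /* work fn: drain the whole list in one call, so N queued cmds cost
     * one running work item instead of N
     */
    static void drain_cmds(struct work_struct *work)
    {
            struct se_sess_cmd_queue *q =
                    container_of(work, struct se_sess_cmd_queue, work);
            struct llist_node *node = llist_del_all(&q->cmd_list);

            /* llist_add() builds a LIFO; restore submission order */
            node = llist_reverse_order(node);
            /* ... submit each cmd on the list ... */
    }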

The combined patchset fixes a major perf issue we've been hitting
where IOPs is stuck at 230K when running:

    fio --filename=/dev/sda  --direct=1 --rw=randrw --bs=4k
    --ioengine=libaio --iodepth=128  --numjobs=8 --time_based
    --group_reporting --runtime=60

The patches in this set get me to 350K when using devices that
have native IOPs of around 400-500K.

Note that 5.12 has some interrupt changes that my patches
collide with. Martin's 5.12 branches already had those changes,
so I based my patches on them.

Comments

Chaitanya Kulkarni Feb. 4, 2021, 11:13 p.m. UTC | #1
On 2/4/21 03:41, Mike Christie wrote:
> loop and vhost-scsi do their target cmd submission from driver
> workqueues. This allows them to avoid an issue where the backend may
> block waiting for resources like tags/requests, mem/locks, etc
> and that ends up blocking their entire submission path and for the
> case of vhost-scsi both the submission and completion path.
>
> This patch adds a helper these drivers can use to submit from the
> lio workqueue. This code will then be extended in the next patches
> to fix the plugging of backend devices.
>
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> ---
>  drivers/target/target_core_transport.c | 102 ++++++++++++++++++++++++-
>  include/target/target_core_base.h      |  10 ++-
>  include/target/target_core_fabric.h    |   3 +
>  3 files changed, 111 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
> index 7c5d37bac561..dec89e911348 100644
> --- a/drivers/target/target_core_transport.c
> +++ b/drivers/target/target_core_transport.c
> @@ -41,6 +41,7 @@
>  #include <trace/events/target.h>
>  
>  static struct workqueue_struct *target_completion_wq;
> +static struct workqueue_struct *target_submission_wq;
>  static struct kmem_cache *se_sess_cache;
>  struct kmem_cache *se_ua_cache;
>  struct kmem_cache *t10_pr_reg_cache;
> @@ -129,8 +130,15 @@ int init_se_kmem_caches(void)
>  	if (!target_completion_wq)
>  		goto out_free_lba_map_mem_cache;
>  
> +	target_submission_wq = alloc_workqueue("target_submission",
> +					       WQ_MEM_RECLAIM, 0);
> +	if (!target_submission_wq)
> +		goto out_free_completion_wq;
> +
>  	return 0;
>  
> +out_free_completion_wq:
> +	destroy_workqueue(target_completion_wq);
>  out_free_lba_map_mem_cache:
>  	kmem_cache_destroy(t10_alua_lba_map_mem_cache);
>  out_free_lba_map_cache:
> @@ -153,6 +161,7 @@ int init_se_kmem_caches(void)
>  
>  void release_se_kmem_caches(void)
>  {
> +	destroy_workqueue(target_submission_wq);
>  	destroy_workqueue(target_completion_wq);
>  	kmem_cache_destroy(se_sess_cache);
>  	kmem_cache_destroy(se_ua_cache);
> @@ -218,6 +227,69 @@ static void target_release_sess_cmd_refcnt(struct percpu_ref *ref)
>  	wake_up(&sess->cmd_count_wq);
>  }
>  
> +static void target_queued_submit_work(struct work_struct *work)
> +{
> +	struct se_sess_cmd_queue *sq =
> +				container_of(work, struct se_sess_cmd_queue,
> +					     work);
> +	struct se_session *se_sess = sq->se_sess;
> +	struct se_cmd *se_cmd, *next_cmd;
> +	struct llist_node *cmd_list;
> +
> +	cmd_list = llist_del_all(&sq->cmd_list);
> +	if (!cmd_list)
> +		/* Previous call took what we were queued to submit */
> +		return;
> +
> +	cmd_list = llist_reverse_order(cmd_list);
> +	llist_for_each_entry_safe(se_cmd, next_cmd, cmd_list, se_cmd_list)
> +		se_sess->tfo->submit_queued_cmd(se_cmd);
> +}
> +
> +static void target_queue_cmd_work(struct se_sess_cmd_queue *q,
> +				  struct se_cmd *se_cmd, int cpu)
> +{
> +	llist_add(&se_cmd->se_cmd_list, &q->cmd_list);
> +	queue_work_on(cpu, target_submission_wq, &q->work);
> +}
> +
> +/**
> + * target_queue_cmd_submit - queue a se_cmd to be executed from the lio wq
> + * @se_sess: cmd's session
> + * @se_cmd: cmd to queue
> + */
> +void target_queue_cmd_submit(struct se_session *se_sess, struct se_cmd *se_cmd)
> +{
> +	int cpu = smp_processor_id();
> +
> +	target_queue_cmd_work(&se_sess->sq[cpu], se_cmd, cpu);
> +}
> +EXPORT_SYMBOL_GPL(target_queue_cmd_submit);
> +
> +static void target_flush_queued_cmds(struct se_session *se_sess)
> +{
> +	int i;
> +
> +	if (!se_sess->sq)
> +		return;
> +
> +	for (i = 0; i < se_sess->q_cnt; i++)
> +		cancel_work_sync(&se_sess->sq[i].work);
> +}
> +
> +static void target_init_sess_cmd_queues(struct se_session *se_sess,
> +					struct se_sess_cmd_queue *q,
> +					void (*work_fn)(struct work_struct *work))
> +{
> +	int i;
> +
> +	for (i = 0; i < se_sess->q_cnt; i++) {
> +		init_llist_head(&q[i].cmd_list);
> +		INIT_WORK(&q[i].work, work_fn);
> +		q[i].se_sess = se_sess;
> +	}
> +}
> +
Can we open code the above function if there is only one caller?
Unless there is a specific reason to have it on its own, which I failed to
understand.
>  /**
>   * transport_init_session - initialize a session object
>   * @tfo: target core fabric ops
> @@ -228,6 +300,8 @@ static void target_release_sess_cmd_refcnt(struct percpu_ref *ref)
>  int transport_init_session(const struct target_core_fabric_ops *tfo,
>  			   struct se_session *se_sess)
>  {
> +	int rc;
> +
>  	INIT_LIST_HEAD(&se_sess->sess_list);
>  	INIT_LIST_HEAD(&se_sess->sess_acl_list);
>  	spin_lock_init(&se_sess->sess_cmd_lock);
> @@ -235,13 +309,34 @@ int transport_init_session(const struct target_core_fabric_ops *tfo,
>  	init_completion(&se_sess->stop_done);
>  	atomic_set(&se_sess->stopped, 0);
>  	se_sess->tfo = tfo;
> -	return percpu_ref_init(&se_sess->cmd_count,
> -			       target_release_sess_cmd_refcnt, 0, GFP_KERNEL);
> +
> +	if (tfo->submit_queued_cmd) {
> +		se_sess->sq = kcalloc(nr_cpu_ids, sizeof(*se_sess->sq),
> +				      GFP_KERNEL);
> +		if (!se_sess->sq)
> +			return -ENOMEM;
> +
> +		se_sess->q_cnt = nr_cpu_ids;
> +		target_init_sess_cmd_queues(se_sess, se_sess->sq,
> +					    target_queued_submit_work);
> +	}
> +
> +	rc = percpu_ref_init(&se_sess->cmd_count,
> +			     target_release_sess_cmd_refcnt, 0, GFP_KERNEL);
> +	if (rc)
> +		goto free_sq;
> +
> +	return 0;
> +
> +free_sq:
> +	kfree(se_sess->sq);
> +	return rc;
>  }
>  EXPORT_SYMBOL(transport_init_session);
>  
>  void transport_uninit_session(struct se_session *se_sess)
>  {
> +	kfree(se_sess->sq);
>  	/*
>  	 * Drivers like iscsi and loop do not call target_stop_session
>  	 * during session shutdown so we have to drop the ref taken at init
> @@ -1385,7 +1480,6 @@ void transport_init_se_cmd(
>  {
>  	INIT_LIST_HEAD(&cmd->se_delayed_node);
>  	INIT_LIST_HEAD(&cmd->se_qf_node);
> -	INIT_LIST_HEAD(&cmd->se_cmd_list);
>  	INIT_LIST_HEAD(&cmd->state_list);
>  	init_completion(&cmd->t_transport_stop_comp);
>  	cmd->free_compl = NULL;
> @@ -2968,6 +3062,8 @@ void target_wait_for_sess_cmds(struct se_session *se_sess)
>  {
>  	int ret;
>  
> +	target_flush_queued_cmds(se_sess);
> +
>  	WARN_ON_ONCE(!atomic_read(&se_sess->stopped));
>  
>  	do {
> diff --git a/include/target/target_core_base.h b/include/target/target_core_base.h
> index 50103a22b0e2..97138bff14d1 100644
> --- a/include/target/target_core_base.h
> +++ b/include/target/target_core_base.h
> @@ -488,7 +488,7 @@ struct se_cmd {
>  	/* Only used for internal passthrough and legacy TCM fabric modules */
>  	struct se_session	*se_sess;
>  	struct se_tmr_req	*se_tmr_req;
> -	struct list_head	se_cmd_list;
> +	struct llist_node	se_cmd_list;
>  	struct completion	*free_compl;
>  	struct completion	*abrt_compl;
>  	const struct target_core_fabric_ops *se_tfo;
> @@ -612,6 +612,12 @@ static inline struct se_node_acl *fabric_stat_to_nacl(struct config_item *item)
>  			acl_fabric_stat_group);
>  }
>  
> +struct se_sess_cmd_queue {
> +	struct llist_head	cmd_list;
> +	struct work_struct	work;
> +	struct se_session	*se_sess;
> +};
> +
>  struct se_session {
>  	atomic_t		stopped;
>  	u64			sess_bin_isid;
> @@ -629,6 +635,8 @@ struct se_session {
>  	void			*sess_cmd_map;
>  	struct sbitmap_queue	sess_tag_pool;
>  	const struct target_core_fabric_ops *tfo;
> +	struct se_sess_cmd_queue *sq;
> +	int			q_cnt;
>  };
>  
>  struct se_device;
> diff --git a/include/target/target_core_fabric.h b/include/target/target_core_fabric.h
> index cdf610838ba5..899948967a65 100644
> --- a/include/target/target_core_fabric.h
> +++ b/include/target/target_core_fabric.h
> @@ -80,6 +80,7 @@ struct target_core_fabric_ops {
>  	int (*queue_status)(struct se_cmd *);
>  	void (*queue_tm_rsp)(struct se_cmd *);
>  	void (*aborted_task)(struct se_cmd *);
> +	void (*submit_queued_cmd)(struct se_cmd *);
>  	/*
>  	 * fabric module calls for target_core_fabric_configfs.c
>  	 */
> @@ -166,6 +167,8 @@ int	target_submit_tmr(struct se_cmd *se_cmd, struct se_session *se_sess,
>  		unsigned char *sense, u64 unpacked_lun,
>  		void *fabric_tmr_ptr, unsigned char tm_type,
>  		gfp_t, u64, int);
> +void	target_queue_cmd_submit(struct se_session *se_sess,
> +				struct se_cmd *se_cmd);
>  int	transport_handle_cdb_direct(struct se_cmd *);
>  sense_reason_t	transport_generic_new_cmd(struct se_cmd *);
>
Chaitanya Kulkarni Feb. 4, 2021, 11:23 p.m. UTC | #2
On 2/4/21 03:40, Mike Christie wrote:
> This patch adds plug/unplug callouts for iblock. For initiator drivers
> like iscsi, which want to pass multiple cmds to their xmit thread instead
> of one cmd at a time, this increases IOPs by around 10% with vhost-scsi
> (combined with the last patches we can see a total 40-50% increase). For
> driver combos like tcm_loop and faster drivers like the iser initiator, we
> can still see IOPs increase by 20-30% when tcm_loop's nr_hw_queues setting
> is also increased.
>
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> ---
>  drivers/target/target_core_iblock.c | 41 ++++++++++++++++++++++++++++-
>  drivers/target/target_core_iblock.h | 10 +++++++
>  2 files changed, 50 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
> index 8ed93fd205c7..a4951e662615 100644
> --- a/drivers/target/target_core_iblock.c
> +++ b/drivers/target/target_core_iblock.c
> @@ -61,9 +61,18 @@ static struct se_device *iblock_alloc_device(struct se_hba *hba, const char *nam
>  		return NULL;
>  	}
>  
> +	ib_dev->ibd_plug = kcalloc(nr_cpu_ids, sizeof(*ib_dev->ibd_plug),
> +				   GFP_KERNEL);
> +	if (!ib_dev->ibd_plug)
> +		goto free_dev;
> +
>  	pr_debug( "IBLOCK: Allocated ib_dev for %s\n", name);
>  
>  	return &ib_dev->dev;
> +
> +free_dev:
> +	kfree(ib_dev);
> +	return NULL;
>  }
>  
>  static int iblock_configure_device(struct se_device *dev)
> @@ -171,6 +180,7 @@ static void iblock_dev_call_rcu(struct rcu_head *p)
>  	struct se_device *dev = container_of(p, struct se_device, rcu_head);
>  	struct iblock_dev *ib_dev = IBLOCK_DEV(dev);
>  
> +	kfree(ib_dev->ibd_plug);
>  	kfree(ib_dev);
>  }
>  
> @@ -188,6 +198,30 @@ static void iblock_destroy_device(struct se_device *dev)
>  	bioset_exit(&ib_dev->ibd_bio_set);
>  }
>  
> +static struct se_dev_plug *iblock_plug_device(struct se_cmd *se_cmd)
> +{
> +	struct se_device *se_dev = se_cmd->se_dev;
> +	struct iblock_dev *ib_dev = IBLOCK_DEV(se_dev);
> +	struct iblock_dev_plug *ib_dev_plug;
> +
> +	ib_dev_plug = &ib_dev->ibd_plug[se_cmd->cpuid];
> +	if (test_and_set_bit(IBD_PLUGF_PLUGGED, &ib_dev_plug->flags))
> +		return NULL;
> +
> +	blk_start_plug(&ib_dev_plug->blk_plug);
> +	return &ib_dev_plug->se_plug;
> +}
> +
> +static void iblock_unplug_device(struct se_dev_plug *se_plug)
> +{
> +	struct iblock_dev_plug *ib_dev_plug =
> +				container_of(se_plug, struct iblock_dev_plug,
> +					     se_plug);
I think something like the following, with the assignment on its own line,
reads much easier, at least for me:

        ib_dev_plug = container_of(se_plug, struct iblock_dev_plug,
                                   se_plug);
> +
> +	blk_finish_plug(&ib_dev_plug->blk_plug);
> +	clear_bit(IBD_PLUGF_PLUGGED, &ib_dev_plug->flags);
> +}
> +
>  static unsigned long long iblock_emulate_read_cap_with_block_size(
>  	struct se_device *dev,
>  	struct block_device *bd,
> @@ -337,7 +371,10 @@ static void iblock_submit_bios(struct bio_list *list)
>  {
>  	struct blk_plug plug;
>  	struct bio *bio;
> -
> +	/*
> +	 * The block layer handles nested plugs, so just plug/unplug to handle
> +	 * fabric drivers that didn't support batching and multi bio cmds.
> +	 */
>  	blk_start_plug(&plug);
>  	while ((bio = bio_list_pop(list)))
>  		submit_bio(bio);
> @@ -870,6 +907,8 @@ static const struct target_backend_ops iblock_ops = {
>  	.configure_device	= iblock_configure_device,
>  	.destroy_device		= iblock_destroy_device,
>  	.free_device		= iblock_free_device,
> +	.plug_device		= iblock_plug_device,
> +	.unplug_device		= iblock_unplug_device,
>  	.parse_cdb		= iblock_parse_cdb,
>  	.set_configfs_dev_params = iblock_set_configfs_dev_params,
>  	.show_configfs_dev_params = iblock_show_configfs_dev_params,
> diff --git a/drivers/target/target_core_iblock.h b/drivers/target/target_core_iblock.h
> index cefc641145b3..8c55375d2f75 100644
> --- a/drivers/target/target_core_iblock.h
> +++ b/drivers/target/target_core_iblock.h
> @@ -4,6 +4,7 @@
>  
>  #include <linux/atomic.h>
>  #include <linux/refcount.h>
> +#include <linux/blkdev.h>
>  #include <target/target_core_base.h>
>  
>  #define IBLOCK_VERSION		"4.0"
> @@ -17,6 +18,14 @@ struct iblock_req {
>  
>  #define IBDF_HAS_UDEV_PATH		0x01
>  
> +#define IBD_PLUGF_PLUGGED		0x01
> +
> +struct iblock_dev_plug {
> +	struct se_dev_plug se_plug;
> +	struct blk_plug blk_plug;
> +	unsigned long flags;
> +};
> +
>  struct iblock_dev {
>  	struct se_device dev;
>  	unsigned char ibd_udev_path[SE_UDEV_PATH_LEN];
> @@ -24,6 +33,7 @@ struct iblock_dev {
>  	struct bio_set	ibd_bio_set;
>  	struct block_device *ibd_bd;
>  	bool ibd_readonly;
> +	struct iblock_dev_plug *ibd_plug;
>  } ____cacheline_aligned;
>  
>  #endif /* TARGET_CORE_IBLOCK_H */
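
A side note on the nested-plug comment in iblock_submit_bios() above:
plugs are per-task, so blk_start_plug() is a no-op when the task
already has an active plug, and blk_finish_plug() only flushes when
called on that outermost plug. A rough sketch of the behavior
(illustrative, not patch code):

    #include <linux/blkdev.h>

    static void submit_with_nested_plug(struct bio *bio)
    {
            struct blk_plug outer, inner;

            blk_start_plug(&outer);         /* becomes current->plug */

            /* e.g. a backend plugging again under the caller's plug */
            blk_start_plug(&inner);         /* no-op: a plug is already active */
            submit_bio(bio);                /* batched on the outer plug */
            blk_finish_plug(&inner);        /* no-op: &inner != current->plug */

            blk_finish_plug(&outer);        /* flushes the batched bio */
    }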
Chaitanya Kulkarni Feb. 4, 2021, 11:26 p.m. UTC | #3
On 2/4/21 03:41, Mike Christie wrote:
> +static void target_queued_compl_work(struct work_struct *work)
> +{
> +	struct se_sess_cmd_queue *cq =
> +				container_of(work, struct se_sess_cmd_queue,
> +					     work);
Same here as previously mentioned.
Mike Christie Feb. 5, 2021, 12:43 a.m. UTC | #4
On 2/4/21 5:13 PM, Chaitanya Kulkarni wrote:
> On 2/4/21 03:41, Mike Christie wrote:
>> [...]
>> +static void target_init_sess_cmd_queues(struct se_session *se_sess,
>> +					struct se_sess_cmd_queue *q,
>> +					void (*work_fn)(struct work_struct *work))
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < se_sess->q_cnt; i++) {
>> +		init_llist_head(&q[i].cmd_list);
>> +		INIT_WORK(&q[i].work, work_fn);
>> +		q[i].se_sess = se_sess;
>> +	}
>> +}
>> +
> Can we open code the above function if there is only one caller?
> Unless there is a specific reason to have it on its own, which I failed to
> understand.

Patch 10 also calls it. I tried to say that at the end of the patch
description, but it was not too clear now that I read it again.

I couldn't decide if I should do it now or later. I selected now since
it made the 10th patch smaller.
Mike Christie Feb. 5, 2021, 12:45 a.m. UTC | #5
On 2/4/21 5:23 PM, Chaitanya Kulkarni wrote:
> On 2/4/21 03:40, Mike Christie wrote:
>> [...]
>> +static void iblock_unplug_device(struct se_dev_plug *se_plug)
>> +{
>> +	struct iblock_dev_plug *ib_dev_plug =
>> +				container_of(se_plug, struct iblock_dev_plug,
>> +					     se_plug);
> I think something like the following, with the assignment on its own line,
> reads much easier, at least for me:
>
>         ib_dev_plug = container_of(se_plug, struct iblock_dev_plug,
>                                    se_plug);

Yeah, nicer. Will change this and the other one.
Chaitanya Kulkarni Feb. 5, 2021, 1:50 a.m. UTC | #6
On 2/4/21 16:44, michael.christie@oracle.com wrote:
>>> +
>> Can we open code the above function if there is only one caller?
>> Unless there is a specific reason to have it on its own, which I failed to
>> understand.
> Patch 10 also calls it. I tried to say that at the end of the patch
> description, but it was not too clear now that I read it again.
>
> I couldn't decide if I should do it now or later. I selected now since
> it made the 10th patch smaller.
>
If it does, then fine.
Michael S. Tsirkin Feb. 5, 2021, 4:17 p.m. UTC | #7
On Thu, Feb 04, 2021 at 05:35:07AM -0600, Mike Christie wrote:
> @@ -1132,14 +1127,8 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
>  		 * vhost_scsi_queue_data_in() and vhost_scsi_queue_status()
>  		 */
>  		cmd->tvc_vq_desc = vc.head;
> -		/*
> -		 * Dispatch cmd descriptor for cmwq execution in process
> -		 * context provided by vhost_scsi_workqueue.  This also ensures
> -		 * cmd is executed on the same kworker CPU as this vhost
> -		 * thread to gain positive L2 cache locality effects.
> -		 */
> -		INIT_WORK(&cmd->work, vhost_scsi_submission_work);
> -		queue_work(vhost_scsi_workqueue, &cmd->work);
> +		target_queue_cmd_submit(tpg->tpg_nexus->tvn_se_sess,
> +					&cmd->tvc_se_cmd);
>  		ret = 0;
>  err:
>  		/*

What about this aspect? Will things still stay on the same CPU?

-- 
MST
Mike Christie Feb. 5, 2021, 5:38 p.m. UTC | #8
On 2/5/21 10:17 AM, Michael S. Tsirkin wrote:
> On Thu, Feb 04, 2021 at 05:35:07AM -0600, Mike Christie wrote:
>> @@ -1132,14 +1127,8 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
>>  		 * vhost_scsi_queue_data_in() and vhost_scsi_queue_status()
>>  		 */
>>  		cmd->tvc_vq_desc = vc.head;
>> -		/*
>> -		 * Dispatch cmd descriptor for cmwq execution in process
>> -		 * context provided by vhost_scsi_workqueue.  This also ensures
>> -		 * cmd is executed on the same kworker CPU as this vhost
>> -		 * thread to gain positive L2 cache locality effects.
>> -		 */
>> -		INIT_WORK(&cmd->work, vhost_scsi_submission_work);
>> -		queue_work(vhost_scsi_workqueue, &cmd->work);
>> +		target_queue_cmd_submit(tpg->tpg_nexus->tvn_se_sess,
>> +					&cmd->tvc_se_cmd);
>>  		ret = 0;
>>  err:
>>  		/*
>
> What about this aspect? Will things still stay on the same CPU?

Yes, if that is what it's configured to do.

On the submission path there is no change in behavior. target_queue_cmd_submit
does queue_work_on, so it executes the cmd on the same CPU in LIO. Once
LIO passes it to the block layer, that layer does whatever is set up.

On the completion path the low level works the same. The low level
driver goes by its ISR/softirq/completion-thread settings, and the block layer
then goes by the queue settings like rq_affinity.

The change in behavior is that in LIO we will do what was configured
in the layer below us instead of always trying to complete on the same
CPU it was submitted on.
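
For reference, rq_affinity here is the per-queue block-layer sysfs
knob: roughly, 1 completes a request on a CPU in the same group as the
submitter, while 2 forces completion on the exact submitting CPU. For
example:

    echo 2 > /sys/block/sda/queue/rq_affinity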
Mike Christie Feb. 5, 2021, 6:04 p.m. UTC | #9
On 2/5/21 11:38 AM, Mike Christie wrote:
> On 2/5/21 10:17 AM, Michael S. Tsirkin wrote:
>> On Thu, Feb 04, 2021 at 05:35:07AM -0600, Mike Christie wrote:
>>> @@ -1132,14 +1127,8 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
>>>  		 * vhost_scsi_queue_data_in() and vhost_scsi_queue_status()
>>>  		 */
>>>  		cmd->tvc_vq_desc = vc.head;
>>> -		/*
>>> -		 * Dispatch cmd descriptor for cmwq execution in process
>>> -		 * context provided by vhost_scsi_workqueue.  This also ensures
>>> -		 * cmd is executed on the same kworker CPU as this vhost
>>> -		 * thread to gain positive L2 cache locality effects.
>>> -		 */
>>> -		INIT_WORK(&cmd->work, vhost_scsi_submission_work);
>>> -		queue_work(vhost_scsi_workqueue, &cmd->work);
>>> +		target_queue_cmd_submit(tpg->tpg_nexus->tvn_se_sess,
>>> +					&cmd->tvc_se_cmd);
>>>  		ret = 0;
>>>  err:
>>>  		/*
>>
>> What about this aspect? Will things still stay on the same CPU?
> Yes, if that is what it's configured to do.

Oh yeah, I wasn't sure if you were only asking about the code in this
patch or the combined patchset. The above chunk modifies the submission
code. Like I wrote below, there are no changes, CPU-wise, in that path.

Patch:

[PATCH 11/11] target, vhost-scsi: don't switch cpus on completion

modifies the completion path in LIO so we can complete the cmd on the
CPU that the layers below LIO were configured to complete on. The user
can then configure those layers to complete on the specific CPU it was
submitted on, just one that shares a cache, or what the layer below the
block layer completed it on.

> On the submission path there is no change in behavior. target_queue_cmd_submit
> does queue_work_on, so it executes the cmd on the same CPU in LIO. Once
> LIO passes it to the block layer, that layer does whatever is set up.
>
> On the completion path the low level works the same. The low level
> driver goes by its ISR/softirq/completion-thread settings, and the block layer
> then goes by the queue settings like rq_affinity.
>
> The change in behavior is that in LIO we will do what was configured
> in the layer below us instead of always trying to complete on the same
> CPU it was submitted on.
Chaitanya Kulkarni Feb. 7, 2021, 1:06 a.m. UTC | #10
On 2/4/21 03:40, Mike Christie wrote:
>  
> +	ib_dev->ibd_plug = kcalloc(nr_cpu_ids, sizeof(*ib_dev->ibd_plug),
> +				   GFP_KERNEL);
I'd actually prefer struct xxx in sizeof, but maybe that is just my
preference. Not sure what the standard practice is in target code.
Bart Van Assche Feb. 7, 2021, 2:21 a.m. UTC | #11
On 2/6/21 5:06 PM, Chaitanya Kulkarni wrote:
> On 2/4/21 03:40, Mike Christie wrote:
>> +	ib_dev->ibd_plug = kcalloc(nr_cpu_ids, sizeof(*ib_dev->ibd_plug),
>> +				   GFP_KERNEL);
> I'd actually prefer struct xxx in sizeof, but maybe that is just my
> preference. Not sure what the standard practice is in target code.

The above code is easier to verify than the suggested alternative. With
the alternative one has to look up the definition of ibd_plug to verify
correctness of the code. The above code can be verified without having
to look up the definition of the ibd_plug member.

Bart.
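
Concretely, the two styles under discussion, side by side (a sketch
using this patch's member):

    /* sizeof(*ptr): stays correct if ibd_plug's type ever changes */
    ib_dev->ibd_plug = kcalloc(nr_cpu_ids, sizeof(*ib_dev->ibd_plug),
                               GFP_KERNEL);

    /* sizeof(struct xxx): the type is visible at the call site, but
     * this line becomes silently wrong if the member's type changes
     * without it being updated
     */
    ib_dev->ibd_plug = kcalloc(nr_cpu_ids, sizeof(struct iblock_dev_plug),
                               GFP_KERNEL);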
Stefan Hajnoczi Feb. 8, 2021, 10:48 a.m. UTC | #12
On Thu, Feb 04, 2021 at 05:35:02AM -0600, Mike Christie wrote:
> The following patches made over Martin's 5.12 branches fix two
> issues:
> [...]

For vhost-scsi:

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Michael S. Tsirkin Feb. 8, 2021, 12:01 p.m. UTC | #13
On Thu, Feb 04, 2021 at 05:35:02AM -0600, Mike Christie wrote:
> The following patches made over Martin's 5.12 branches fix two
> issues:
> [...]

OK so feel free to merge through that branch.

Acked-by: Michael S. Tsirkin <mst@redhat.com>

-- 
MST