[V2,0/6] blk-mq: fix blk_mq_alloc_request_hctx

Message ID 20210702150555.2401722-1-ming.lei@redhat.com

Message

Ming Lei July 2, 2021, 3:05 p.m. UTC
Hi,

blk_mq_alloc_request_hctx() is used by NVMe fc/rdma/tcp/loop to connect
io queues, and the sw ctx is chosen as the 1st online cpu in hctx->cpumask.
However, all cpus in hctx->cpumask may be offline.

This usage model isn't well supported by blk-mq, which assumes that
allocation is always done on an online CPU in hctx->cpumask. This
assumption is tied to managed irq, which also requires blk-mq to drain
inflight requests in the hctx when the last cpu in hctx->cpumask is
going offline.

However, NVMe fc/rdma/tcp/loop don't use managed irq, so we should allow
them to ask for request allocation when the specified hctx is inactive
(all cpus in hctx->cpumask are offline).

Fix blk_mq_alloc_request_hctx() by adding and passing a new flag, BLK_MQ_F_MANAGED_IRQ.

Meanwhile, optimize blk-mq cpu hotplug handling for the non-managed irq case.
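
For reference, the driver-side change boils down to something like the
sketch below (simplified, not the literal patches; 'use_managed_irq' here
is just a placeholder for whatever bookkeeping the driver keeps):

	/*
	 * A driver that allocates its queue interrupts with
	 * PCI_IRQ_AFFINITY tells blk-mq so by setting the new flag on
	 * its tag_set before blk_mq_alloc_tag_set().
	 */
	set->nr_hw_queues = nr_io_queues;
	set->flags = BLK_MQ_F_SHOULD_MERGE;
	if (use_managed_irq)	/* true when PCI_IRQ_AFFINITY was used */
		set->flags |= BLK_MQ_F_MANAGED_IRQ;
	ret = blk_mq_alloc_tag_set(set);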

V2:
	- use flag of BLK_MQ_F_MANAGED_IRQ
	- pass BLK_MQ_F_MANAGED_IRQ from driver explicitly
	- kill BLK_MQ_F_STACKING

Ming Lei (6):
  blk-mq: prepare for not deactivating hctx if managed irq isn't used
  nvme: pci: pass BLK_MQ_F_MANAGED_IRQ to blk-mq
  scsi: add flag of .use_managed_irq to 'struct Scsi_Host'
  scsi: set shost->use_managed_irq if driver uses managed irq
  virtio: add one field into virtio_device for recording if device uses
    managed irq
  blk-mq: don't deactivate hctx if managed irq isn't used

 block/blk-mq-debugfs.c                    |  2 +-
 block/blk-mq.c                            | 27 +++++++++++++----------
 drivers/block/loop.c                      |  2 +-
 drivers/block/virtio_blk.c                |  2 ++
 drivers/md/dm-rq.c                        |  2 +-
 drivers/nvme/host/pci.c                   |  3 ++-
 drivers/scsi/aacraid/linit.c              |  3 +++
 drivers/scsi/be2iscsi/be_main.c           |  3 +++
 drivers/scsi/csiostor/csio_init.c         |  3 +++
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c    |  1 +
 drivers/scsi/hpsa.c                       |  3 +++
 drivers/scsi/lpfc/lpfc.h                  |  1 +
 drivers/scsi/lpfc/lpfc_init.c             |  4 ++++
 drivers/scsi/megaraid/megaraid_sas_base.c |  3 +++
 drivers/scsi/mpt3sas/mpt3sas_scsih.c      |  3 +++
 drivers/scsi/qla2xxx/qla_isr.c            |  5 ++++-
 drivers/scsi/scsi_lib.c                   | 12 +++++-----
 drivers/scsi/smartpqi/smartpqi_init.c     |  3 +++
 drivers/scsi/virtio_scsi.c                |  1 +
 drivers/virtio/virtio_pci_common.c        |  1 +
 include/linux/blk-mq.h                    |  6 +----
 include/linux/virtio.h                    |  1 +
 include/scsi/scsi_host.h                  |  3 +++
 23 files changed, 67 insertions(+), 27 deletions(-)

Comments

Ming Lei July 5, 2021, 2:48 a.m. UTC | #1
On Fri, Jul 02, 2021 at 11:55:40AM -0400, Michael S. Tsirkin wrote:
> On Fri, Jul 02, 2021 at 11:05:54PM +0800, Ming Lei wrote:
> > blk-mq needs to know if the device uses managed irq, so add one field
> > to virtio_device for recording if device uses managed irq.
> >
> > If the driver use managed irq, this flag has to be set so it can be
> > passed to blk-mq.
> >
> > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > Cc: Jason Wang <jasowang@redhat.com>
> > Cc: virtualization@lists.linux-foundation.org
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
>
> The API seems somewhat confusing. virtio does not request
> a managed irq as such, does it? I think it's a decision taken
> by the irq core.

vp_request_msix_vectors():

        if (desc) {
                flags |= PCI_IRQ_AFFINITY;
                desc->pre_vectors++; /* virtio config vector */
                vdev->use_managed_irq = true;
        }

        err = pci_alloc_irq_vectors_affinity(vp_dev->pci_dev, nvectors,
                                             nvectors, flags, desc);

Managed irq is used once PCI_IRQ_AFFINITY is passed to
pci_alloc_irq_vectors_affinity().

> Any way to query the irq to find out if it's managed?

So far the managed info isn't exported via /proc/irq, but if one irq is
managed, its smp_affinity & smp_affinity_list attributes can't be
changed successfully.

If irq's debugfs is enabled, this info can be retrieved.
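
From kernel code the irq core does expose this directly; a small sketch
of how a caller holding a Linux irq number could check (the helper below
is illustrative only, not something this series adds):

	#include <linux/irq.h>

	/* Sketch: query whether a given Linux irq is affinity-managed. */
	static bool irq_is_managed(unsigned int irq)
	{
		struct irq_data *d = irq_get_irq_data(irq);

		return d && irqd_affinity_is_managed(d);
	}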


Thanks,
Ming
Jason Wang July 5, 2021, 3:59 a.m. UTC | #2
On 2021/7/2 11:05 PM, Ming Lei wrote:
> blk-mq needs to know if the device uses managed irq, so add one field
> to virtio_device for recording if device uses managed irq.
>
> If the driver use managed irq, this flag has to be set so it can be
> passed to blk-mq.
>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Jason Wang <jasowang@redhat.com>
> Cc: virtualization@lists.linux-foundation.org
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   drivers/block/virtio_blk.c         | 2 ++
>   drivers/scsi/virtio_scsi.c         | 1 +
>   drivers/virtio/virtio_pci_common.c | 1 +
>   include/linux/virtio.h             | 1 +
>   4 files changed, 5 insertions(+)
>
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index e4bd3b1fc3c2..33b9c80ac475 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -764,6 +764,8 @@ static int virtblk_probe(struct virtio_device *vdev)
>   	vblk->tag_set.queue_depth = queue_depth;
>   	vblk->tag_set.numa_node = NUMA_NO_NODE;
>   	vblk->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
> +	if (vdev->use_managed_irq)
> +		vblk->tag_set.flags |= BLK_MQ_F_MANAGED_IRQ;

I'm not familiar with blk-mq.

But the name is kind of confusing; I guess "BLK_MQ_F_AFFINITY_MANAGED_IRQ"
would be better? (Consider we already have "IRQD_AFFINITY_MANAGED".)

This at least helps me to distinguish it from the devres (device-managed IRQ) case.

Thanks
John Garry July 5, 2021, 9:25 a.m. UTC | #3
On 02/07/2021 16:05, Ming Lei wrote:
> blk-mq needs this information of using managed irq for improving
> deactivating hctx, so add such flag to 'struct Scsi_Host', then
> drivers can pass such flag to blk-mq via scsi_mq_setup_tags().
>
> The rule is that driver has to tell blk-mq if managed irq is used.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

As was said before, can we have something like this instead of relying 
on the LLDs to do the setting:

--------->8------------

diff --git a/block/blk-mq-pci.c b/block/blk-mq-pci.c
index b595a94c4d16..2037a5b69fe1 100644
--- a/block/blk-mq-pci.c
+++ b/block/blk-mq-pci.c
@@ -37,7 +37,7 @@ int blk_mq_pci_map_queues(struct blk_mq_queue_map *qmap, struct pci_dev *pdev,
  		for_each_cpu(cpu, mask)
  			qmap->mq_map[cpu] = qmap->queue_offset + queue;
  	}
-
+	qmap->drain_hwq = 1;
  	return 0;

  fallback:
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 46fe5b45a8b8..7b610e680e08 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3408,7 +3408,8 @@ static int blk_mq_update_queue_map(struct blk_mq_tag_set *set)
  		set->map[HCTX_TYPE_DEFAULT].nr_queues = set->nr_hw_queues;

  	if (set->ops->map_queues && !is_kdump_kernel()) {
-		int i;
+		struct blk_mq_queue_map *qmap = &set->map[HCTX_TYPE_DEFAULT];
+		int i, ret;

  		/*
  		 * transport .map_queues is usually done in the following
@@ -3427,7 +3428,12 @@ static int blk_mq_update_queue_map(struct blk_mq_tag_set *set)
  		for (i = 0; i < set->nr_maps; i++)
  			blk_mq_clear_mq_map(&set->map[i]);

-		return set->ops->map_queues(set);
+		ret = set->ops->map_queues(set);
+		if (ret)
+			return ret;
+		if (qmap->drain_hwq)
+			set->flags |= BLK_MQ_F_MANAGED_IRQ;
+		return 0;
  	} else {
  		BUG_ON(set->nr_maps > 1);
  		return blk_mq_map_queues(&set->map[HCTX_TYPE_DEFAULT]);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 65ed99b3f2f0..2b2e71ff43e0 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -193,6 +193,7 @@ struct blk_mq_queue_map {
  	unsigned int *mq_map;
  	unsigned int nr_queues;
  	unsigned int queue_offset;
+	bool drain_hwq;
  };
Ming Lei July 5, 2021, 9:55 a.m. UTC | #4
On Mon, Jul 05, 2021 at 10:25:38AM +0100, John Garry wrote:
> On 02/07/2021 16:05, Ming Lei wrote:
> > blk-mq needs this information of using managed irq for improving
> > deactivating hctx, so add such flag to 'struct Scsi_Host', then
> > drivers can pass such flag to blk-mq via scsi_mq_setup_tags().
> >
> > The rule is that driver has to tell blk-mq if managed irq is used.
> >
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
>
> As was said before, can we have something like this instead of relying on
> the LLDs to do the setting:
>
> --------->8------------
>
> diff --git a/block/blk-mq-pci.c b/block/blk-mq-pci.c
> index b595a94c4d16..2037a5b69fe1 100644
> --- a/block/blk-mq-pci.c
> +++ b/block/blk-mq-pci.c
> @@ -37,7 +37,7 @@ int blk_mq_pci_map_queues(struct blk_mq_queue_map *qmap, struct pci_dev *pdev,
>  		for_each_cpu(cpu, mask)
>  			qmap->mq_map[cpu] = qmap->queue_offset + queue;
>  	}
> -
> +	qmap->drain_hwq = 1;

The thing is that blk_mq_pci_map_queues() is allowed to be called for
non-managed irqs. Also some managed irq consumers don't use blk_mq_pci_map_queues().

So this way just provides hint about managed irq uses, but we really
need to get this flag set if driver uses managed irq.


Thanks,
Ming
Christoph Hellwig July 6, 2021, 5:37 a.m. UTC | #5
On Mon, Jul 05, 2021 at 05:55:49PM +0800, Ming Lei wrote:
> The thing is that blk_mq_pci_map_queues() is allowed to be called for
> non-managed irqs. Also some managed irq consumers don't use blk_mq_pci_map_queues().
>
> So this way just provides hint about managed irq uses, but we really
> need to get this flag set if driver uses managed irq.

blk_mq_pci_map_queues is absolutely intended to only be used by
managed irqs.  I wonder if we can enforce that somehow?
Christoph Hellwig July 6, 2021, 5:42 a.m. UTC | #6
On Fri, Jul 02, 2021 at 11:05:54PM +0800, Ming Lei wrote:
> blk-mq needs to know if the device uses managed irq, so add one field
> to virtio_device for recording if device uses managed irq.
>
> If the driver use managed irq, this flag has to be set so it can be
> passed to blk-mq.

I don't think all this boilerplate code makes a whole lot of sense.
I think we need to record this information deep down in the irq code by
setting a flag in struct device only if pci_alloc_irq_vectors_affinity
actually managed to allocate multiple vectors and the PCI_IRQ_AFFINITY
flag was set.  Then blk-mq can look at that flag, and also check that
more than one queue is in use and work based on that.
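
A minimal sketch of that idea (the field name and hook points below are
hypothetical, not existing kernel API):

	/* hypothetical new field in struct device: */
	/*	bool	irq_affinity_managed; */

	/* set in pci_alloc_irq_vectors_affinity() on successful allocation: */
	if ((flags & PCI_IRQ_AFFINITY) && nvecs > 1)
		dev->dev.irq_affinity_managed = true;

	/* consulted by blk-mq when setting up the tag set: */
	if (ctrl_dev->irq_affinity_managed && set->nr_hw_queues > 1)
		set->flags |= BLK_MQ_F_MANAGED_IRQ;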
Ming Lei July 6, 2021, 7:41 a.m. UTC | #7
On Tue, Jul 06, 2021 at 07:37:19AM +0200, Christoph Hellwig wrote:
> On Mon, Jul 05, 2021 at 05:55:49PM +0800, Ming Lei wrote:
> > The thing is that blk_mq_pci_map_queues() is allowed to be called for
> > non-managed irqs. Also some managed irq consumers don't use blk_mq_pci_map_queues().
> >
> > So this way just provides hint about managed irq uses, but we really
> > need to get this flag set if driver uses managed irq.
>
> blk_mq_pci_map_queues is absolutely intended to only be used by
> managed irqs.  I wonder if we can enforce that somehow?

It may break some scsi drivers.

And blk_mq_pci_map_queues() just calls pci_irq_get_affinity() to
retrieve the irq's affinity, and the irq can be one non-managed irq,
which affinity is set via either irq_set_affinity_hint() from kernel
or /proc/irq/.


Thanks,
Ming
Ming Lei July 6, 2021, 7:53 a.m. UTC | #8
On Tue, Jul 06, 2021 at 07:42:03AM +0200, Christoph Hellwig wrote:
> On Fri, Jul 02, 2021 at 11:05:54PM +0800, Ming Lei wrote:
> > blk-mq needs to know if the device uses managed irq, so add one field
> > to virtio_device for recording if device uses managed irq.
> >
> > If the driver use managed irq, this flag has to be set so it can be
> > passed to blk-mq.
>
> I don't think all this boilerplate code makes a whole lot of sense.
> I think we need to record this information deep down in the irq code by
> setting a flag in struct device only if pci_alloc_irq_vectors_affinity
> actually managed to allocate multiple vectors and the PCI_IRQ_AFFINITY
> flag was set.  Then blk-mq can look at that flag, and also check that
> more than one queue is in use and work based on that.

How can blk-mq look at that flag? Usually blk-mq doesn't deal with the
physical device (HBA) directly.

So far almost all physical properties (segments, max_hw_sectors, ...)
are not provided to blk-mq directly, but via the queue limits abstraction.

Thanks,
Ming
Hannes Reinecke July 6, 2021, 10:32 a.m. UTC | #9
On 7/6/21 9:41 AM, Ming Lei wrote:
> On Tue, Jul 06, 2021 at 07:37:19AM +0200, Christoph Hellwig wrote:
>> On Mon, Jul 05, 2021 at 05:55:49PM +0800, Ming Lei wrote:
>>> The thing is that blk_mq_pci_map_queues() is allowed to be called for
>>> non-managed irqs. Also some managed irq consumers don't use blk_mq_pci_map_queues().
>>>
>>> So this way just provides hint about managed irq uses, but we really
>>> need to get this flag set if driver uses managed irq.
>>
>> blk_mq_pci_map_queues is absolutely intended to only be used by
>> managed irqs.  I wonder if we can enforce that somehow?
>
> It may break some scsi drivers.
>
> And blk_mq_pci_map_queues() just calls pci_irq_get_affinity() to
> retrieve the irq's affinity, and the irq can be one non-managed irq,
> which affinity is set via either irq_set_affinity_hint() from kernel
> or /proc/irq/.

But that's static, right? IE blk_mq_pci_map_queues() will be called once 
during module init; so what happens if the user changes the mapping 
later on? How will that be transferred to the driver?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
Thomas Gleixner July 7, 2021, 9:06 a.m. UTC | #10
On Tue, Jul 06 2021 at 07:42, Christoph Hellwig wrote:
> On Fri, Jul 02, 2021 at 11:05:54PM +0800, Ming Lei wrote:
>> blk-mq needs to know if the device uses managed irq, so add one field
>> to virtio_device for recording if device uses managed irq.
>>
>> If the driver use managed irq, this flag has to be set so it can be
>> passed to blk-mq.
>
> I don't think all this boilerplate code makes a whole lot of sense.
> I think we need to record this information deep down in the irq code by
> setting a flag in struct device only if pci_alloc_irq_vectors_affinity
> actually managed to allocate multiple vectors and the PCI_IRQ_AFFINITY
> flag was set.  Then blk-mq can look at that flag, and also check that
> more than one queue is in use and work based on that.

Ack.
Ming Lei July 7, 2021, 9:42 a.m. UTC | #11
On Wed, Jul 07, 2021 at 11:06:25AM +0200, Thomas Gleixner wrote:
> On Tue, Jul 06 2021 at 07:42, Christoph Hellwig wrote:
> > On Fri, Jul 02, 2021 at 11:05:54PM +0800, Ming Lei wrote:
> >> blk-mq needs to know if the device uses managed irq, so add one field
> >> to virtio_device for recording if device uses managed irq.
> >>
> >> If the driver use managed irq, this flag has to be set so it can be
> >> passed to blk-mq.
> >
> > I don't think all this boilerplate code makes a whole lot of sense.
> > I think we need to record this information deep down in the irq code by
> > setting a flag in struct device only if pci_alloc_irq_vectors_affinity
> > actually managed to allocate multiple vectors and the PCI_IRQ_AFFINITY
> > flag was set.  Then blk-mq can look at that flag, and also check that
> > more than one queue is in use and work based on that.
>
> Ack.

The problem is that how blk-mq looks at that flag, since the device
representing the controller which allocates irq vectors isn't visible
to blk-mq.

Usually blk-mq (the block layer) provides the queue limits abstraction for
drivers to tell the block layer about any physical property (dma segments,
max sectors, ...); that is why this patch uses a very similar mechanism to
indicate whether the HBA uses managed irq or not.
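
Concretely, the SCSI plumbing in this series looks roughly like the
following (simplified sketch of patches 3 and 4, not the literal hunks):

	/* in the LLD, when it allocates its vectors with PCI_IRQ_AFFINITY: */
	shost->use_managed_irq = 1;

	/* in scsi_mq_setup_tags(): */
	tag_set->flags = BLK_MQ_F_SHOULD_MERGE;
	if (shost->use_managed_irq)
		tag_set->flags |= BLK_MQ_F_MANAGED_IRQ;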


Thanks, 
Ming
Ming Lei July 7, 2021, 10:53 a.m. UTC | #12
On Tue, Jul 06, 2021 at 12:32:27PM +0200, Hannes Reinecke wrote:
> On 7/6/21 9:41 AM, Ming Lei wrote:
> > On Tue, Jul 06, 2021 at 07:37:19AM +0200, Christoph Hellwig wrote:
> > > On Mon, Jul 05, 2021 at 05:55:49PM +0800, Ming Lei wrote:
> > > > The thing is that blk_mq_pci_map_queues() is allowed to be called for
> > > > non-managed irqs. Also some managed irq consumers don't use blk_mq_pci_map_queues().
> > > >
> > > > So this way just provides hint about managed irq uses, but we really
> > > > need to get this flag set if driver uses managed irq.
> > >
> > > blk_mq_pci_map_queues is absolutely intended to only be used by
> > > managed irqs.  I wonder if we can enforce that somehow?
> >
> > It may break some scsi drivers.
> >
> > And blk_mq_pci_map_queues() just calls pci_irq_get_affinity() to
> > retrieve the irq's affinity, and the irq can be one non-managed irq,
> > which affinity is set via either irq_set_affinity_hint() from kernel
> > or /proc/irq/.
>
> But that's static, right? IE blk_mq_pci_map_queues() will be called once
> during module init; so what happens if the user changes the mapping later
> on? How will that be transferred to the driver?

Yeah, that may not work well enough, but it still works since non-managed
irq supports migration.

And there are several SCSI drivers which provide a module parameter to
enable/disable managed irq, while blk_mq_pci_map_queues() is always
called for mapping queues.


Thanks,
Ming
Christoph Hellwig July 7, 2021, 2:05 p.m. UTC | #13
On Wed, Jul 07, 2021 at 05:42:54PM +0800, Ming Lei wrote:
> The problem is that how blk-mq looks at that flag, since the device
> representing the controller which allocates irq vectors isn't visible
> to blk-mq.

In blk_mq_pci_map_queues and similar helpers.
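
That is, something along these lines inside the helper, assuming the
hypothetical per-device marker discussed above (sketch only; drain_hwq is
the field from John's proposal, not an existing member):

	int blk_mq_pci_map_queues(struct blk_mq_queue_map *qmap,
				  struct pci_dev *pdev, int offset)
	{
		/* ... existing cpu -> hw queue mapping via pci_irq_get_affinity() ... */

		if (pdev->dev.irq_affinity_managed)	/* hypothetical flag */
			qmap->drain_hwq = 1;
		return 0;
	}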
Ming Lei July 8, 2021, 6:34 a.m. UTC | #14
On Wed, Jul 07, 2021 at 04:05:29PM +0200, Christoph Hellwig wrote:
> On Wed, Jul 07, 2021 at 05:42:54PM +0800, Ming Lei wrote:
> > The problem is that how blk-mq looks at that flag, since the device
> > representing the controller which allocates irq vectors isn't visible
> > to blk-mq.
>
> In blk_mq_pci_map_queues and similar helpers.

Firstly, it depends on whether drivers call into these helpers, so this way
is fragile.

Secondly, I don't think it is good to expose specific physical devices to
blk-mq, which shouldn't deal with physical devices directly; also, all
three helpers just duplicate the same logic except for retrieving each
vector's affinity from the specific physical device.

I will think about how to clean them up.


Thanks,
Ming