
[v1] scsi: storvsc: Cap cmd_per_lun at can_queue

Message ID 20210305232151.1531-1-melanieplageman@gmail.com
State New
Series [v1] scsi: storvsc: Cap cmd_per_lun at can_queue

Commit Message

Melanie Plageman (Microsoft) March 5, 2021, 11:21 p.m. UTC
The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during
allocation.

Cap cmd_per_lun at can_queue to avoid dispatch errors.

Signed-off-by: Melanie Plageman (Microsoft) <melanieplageman@gmail.com>
---
 drivers/scsi/storvsc_drv.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Michael Kelley March 8, 2021, 2:37 p.m. UTC | #1
From: Melanie Plageman (Microsoft) <melanieplageman@gmail.com> Sent: Friday, March 5, 2021 3:22 PM

> 

> The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during

> allocation.

> 

> Cap cmd_per_lun at can_queue to avoid dispatch errors.

> 

> Signed-off-by: Melanie Plageman (Microsoft) <melanieplageman@gmail.com>

> ---

>  drivers/scsi/storvsc_drv.c | 2 ++

>  1 file changed, 2 insertions(+)

> 

> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c

> index 6bc5453cea8a..d7953a6e00e6 100644

> --- a/drivers/scsi/storvsc_drv.c

> +++ b/drivers/scsi/storvsc_drv.c

> @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device,

>  				(max_sub_channels + 1) *

>  				(100 - ring_avail_percent_lowater) / 100;

> 

> +	scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun, scsi_driver.can_queue);

> +


I'm not sure what you mean by "avoid dispatch errors".  Can you elaborate?
Be aware that the calculation of "can_queue" in this driver is somewhat
flawed -- it should not be based on the size of the ring buffer, but instead on
the maximum number of requests Hyper-V will queue.  And even then,
can_queue doesn't provide the cap you might expect because the blk-mq layer
allocates can_queue tags for each HW queue, not as a total.

I agree that the cmd_per_lun setting is also too big, but we should fix that in
the context of getting all of these different settings working together correctly,
and not piecemeal.

Michael

>  	host = scsi_host_alloc(&scsi_driver,

>  			       sizeof(struct hv_host_device));

>  	if (!host)

> --

> 2.20.1
Melanie Plageman (Microsoft) March 8, 2021, 5:56 p.m. UTC | #2
On Mon, Mar 08, 2021 at 02:37:40PM +0000, Michael Kelley wrote:
> From: Melanie Plageman (Microsoft) <melanieplageman@gmail.com> Sent: Friday, March 5, 2021 3:22 PM

> > 

> > The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during

> > allocation.

> > 

> > Cap cmd_per_lun at can_queue to avoid dispatch errors.

> > 

> > Signed-off-by: Melanie Plageman (Microsoft) <melanieplageman@gmail.com>

> > ---

> >  drivers/scsi/storvsc_drv.c | 2 ++

> >  1 file changed, 2 insertions(+)

> > 

> > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c

> > index 6bc5453cea8a..d7953a6e00e6 100644

> > --- a/drivers/scsi/storvsc_drv.c

> > +++ b/drivers/scsi/storvsc_drv.c

> > @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device,

> >  				(max_sub_channels + 1) *

> >  				(100 - ring_avail_percent_lowater) / 100;

> > 

> > +	scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun, scsi_driver.can_queue);

> > +

> 

> I'm not sure what you mean by "avoid dispatch errors".  Can you elaborate?


scsi_driver.cmd_per_lun is set to 2048, which is then used to set
Scsi_Host->cmd_per_lun in storvsc_probe().

In storvsc_probe(), when scsi_scan_host() runs, scsi_alloc_sdev() is
called and sets scsi_device->queue_depth to the Scsi_Host's
cmd_per_lun with this code:

    scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ?
                            sdev->host->cmd_per_lun : 1);

During dispatch, the scsi_device->queue_depth is used in
scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine
whether or not the device can queue another command.    

On some machines, with the 2048 value of cmd_per_lun that was used to
set the initial scsi_device->queue_depth, commands can be queued that
are later not able to be dispatched after running out of space in the
ringbuffer.

On an 8-core Azure VM with 16 GB of memory and a single 1 TiB SSD
(running an fio workload that I can provide if needed), storvsc_do_io()
often ends up returning SCSI_MLQUEUE_DEVICE_BUSY.

This is the call stack:

hv_get_bytes_to_write
hv_ringbuffer_write
vmbus_sendpacket
storvsc_do_io
storvsc_queuecommand
scsi_dispatch_cmd
scsi_queue_rq
dispatch_rq_list
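
Roughly, the failure path looks like this (a simplified sketch of the
storvsc_queuecommand() / storvsc_do_io() interaction, not the literal
driver code; arguments are elided):

    /* in storvsc_queuecommand(), simplified: */
    ret = storvsc_do_io(dev, cmd_request, ...);
    if (ret == -EAGAIN) {
            /*
             * vmbus_sendpacket() -> hv_ringbuffer_write() found too
             * little free space in the ring buffer, so the command is
             * bounced back to the midlayer as a busy condition.
             */
            return SCSI_MLQUEUE_DEVICE_BUSY;
    }

So a queue_depth of 2048 lets the block layer hand the device far more
commands than the ring buffer can absorb, and the excess comes back as
SCSI_MLQUEUE_DEVICE_BUSY.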

> Be aware that the calculation of "can_queue" in this driver is somewhat

> flawed -- it should not be based on the size of the ring buffer, but instead on

> the maximum number of requests Hyper-V will queue.  And even then,

> can_queue doesn't provide the cap you might expect because the blk-mq layer

> allocates can_queue tags for each HW queue, not as a total.



The docs for scsi_mid_low_api document Scsi_Host can_queue this way:

  can_queue    
  - must be greater than 0; do not send more than can_queue    
    commands to the adapter. 
        
I did notice that in scsi_host.h, the comment for can_queue does say
can_queue is the "maximum number of simultaneous commands a single hw
queue in HBA will accept." However, I don't see it being used this way
in the code.
                                                                            
During dispatch, in scsi_target_queue_ready(), there is this code:

        if (busy >= starget->can_queue)
                goto starved;

And the scsi_target->can_queue value should be coming from Scsi_Host, as
mentioned in the scsi_target definition in scsi_device.h:

    /*
     * LLDs should set this in the slave_alloc host template callout.
     * If set to zero then there is not limit.
     */
    unsigned int            can_queue;

So, I don't really see how this would be per hardware queue.

> 

> I agree that the cmd_per_lun setting is also too big, but we should fix that in

> the context of getting all of these different settings working together correctly,

> and not piecemeal.

> 


Capping Scsi_Host->cmd_per_lun at scsi_driver.can_queue during probe
will also prevent the initial LUN queue_depth from being set to a value
higher than the user can ever set it to again via
storvsc_change_queue_depth().

Also, in scsi_sysfs.c, sdev_store_queue_depth() has this check:

        if (depth < 1 || depth > sdev->host->can_queue)
                return -EINVAL;

I would also note that in virtio_scsi's virtscsi_probe(),
Scsi_Host->cmd_per_lun is set to the min of the configured cmd_per_lun
and Scsi_Host->can_queue:

    shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);

Best,
Melanie
John Garry March 9, 2021, 10:09 a.m. UTC | #3
On 08/03/2021 17:56, Melanie Plageman wrote:
> On Mon, Mar 08, 2021 at 02:37:40PM +0000, Michael Kelley wrote:

>> From: Melanie Plageman (Microsoft) <melanieplageman@gmail.com> Sent: Friday, March 5, 2021 3:22 PM

>>>

>>> The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during

>>> allocation.

>>>

>>> Cap cmd_per_lun at can_queue to avoid dispatch errors.

>>>

>>> Signed-off-by: Melanie Plageman (Microsoft) <melanieplageman@gmail.com>

>>> ---

>>>   drivers/scsi/storvsc_drv.c | 2 ++

>>>   1 file changed, 2 insertions(+)

>>>

>>> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c

>>> index 6bc5453cea8a..d7953a6e00e6 100644

>>> --- a/drivers/scsi/storvsc_drv.c

>>> +++ b/drivers/scsi/storvsc_drv.c

>>> @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device,

>>>   				(max_sub_channels + 1) *

>>>   				(100 - ring_avail_percent_lowater) / 100;

>>>

>>> +	scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun, scsi_driver.can_queue);

>>> +

>>

>> I'm not sure what you mean by "avoid dispatch errors".  Can you elaborate?

> 

> The scsi_driver.cmd_per_lun is set to 2048. Which is then used to set

> Scsi_Host->cmd_per_lun in storvsc_probe().

> 

> In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is

> called and sets the scsi_device->queue_depth to the Scsi_Host's

> cmd_per_lun with this code:

>      

> scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ?

>                                          sdev->host->cmd_per_lun : 1);

> 

> During dispatch, the scsi_device->queue_depth is used in

> scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine

> whether or not the device can queue another command.

> 

> On some machines, with the 2048 value of cmd_per_lun that was used to

> set the initial scsi_device->queue_depth, commands can be queued that

> are later not able to be dispatched after running out of space in the

> ringbuffer.

> 

> On an 8 core Azure VM with 16GB of memory with a single 1 TiB SSD

> (running an fio workload that I can provide if needed), storvsc_do_io()

> ends up often returning SCSI_MLQUEUE_DEVICE_BUSY.

>                                                                        

> This is the call stack:

>      

> hv_get_bytes_to_write

> hv_ringbuffer_write

> vmbus_send_packet

> storvsc_dio_io

> storvsc_queuecommand

> scsi_dispatch_cmd

> scsi_queue_rq

> dispatch_rq_list

> 

>> Be aware that the calculation of "can_queue" in this driver is somewhat

>> flawed -- it should not be based on the size of the ring buffer, but instead on

>> the maximum number of requests Hyper-V will queue.  And even then,

>> can_queue doesn't provide the cap you might expect because the blk-mq layer

>> allocates can_queue tags for each HW queue, not as a total.

> 

> 

> The docs for scsi_mid_low_api document Scsi_Host can_queue this way:

> 

>    can_queue

>    - must be greater than 0; do not send more than can_queue

>      commands to the adapter.

>          

> I did notice that in scsi_host.h, the comment for can_queue does say

> can_queue is the "maximum number of simultaneous commands a single hw

> queue in HBA will accept." However, I don't see it being used this way

> in the code.

>                            


JFYI, the block layer ensures that no more than can_queue requests are 
sent to the host. See scsi_mq_setup_tags(), and how the tagset queue 
depth is set to shost->can_queue.
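
For reference, the relevant part of scsi_mq_setup_tags() looks roughly
like this (paraphrased, with most of the tag set setup elided):

    int scsi_mq_setup_tags(struct Scsi_Host *shost)
    {
            struct blk_mq_tag_set *tag_set = &shost->tag_set;

            /* ... sgl/cmd size setup elided ... */
            tag_set->nr_hw_queues = shost->nr_hw_queues ? : 1;
            tag_set->queue_depth = shost->can_queue;
            /* ... ops, numa_node, flags, driver_data elided ... */
            return blk_mq_alloc_tag_set(tag_set);
    }

i.e. the tag set's queue_depth comes straight from shost->can_queue.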

Thanks,
John


> During dispatch, In scsi_target_queue_ready(), there is this code:

> 

>          if (busy >= starget->can_queue)

>                  goto starved;

> 

> And the scsi_target->can_queue value should be coming from Scsi_host as

> mentioned in the scsi_target definition in scsi_device.h

>      /*

>        * LLDs should set this in the slave_alloc host template callout.

>        * If set to zero then there is not limit.

>        */

>      unsigned int            can_queue;

> 

> So, I don't really see how this would be per hardware queue.

> 

>>

>> I agree that the cmd_per_lun setting is also too big, but we should fix that in

>> the context of getting all of these different settings working together correctly,

>> and not piecemeal.

>>

> 

> Capping Scsi_Host->cmd_per_lun to scsi_driver.can_queue during probe

> will also prevent the LUN queue_depth from being set to a value that is

> higher than it can ever be set to again by the user when

> storvsc_change_queue_depth() is invoked.

> 

> Also in scsi_sysfs sdev_store_queue_depth() there is this check:

> 

>            if (depth < 1 || depth > sdev->host->can_queue)

>                  return -EINVAL;

> 

> I would also note that VirtIO SCSI in virtscsi_probe(), Scsi_Host->cmd_per_lun

> is set to the min of the configured cmd_per_lun and

> Scsi_Host->can_queue:

> 

>      shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);

> 

> Best,

> Melanie

> .

>
Michael Kelley March 9, 2021, 3:45 p.m. UTC | #4
From: Melanie Plageman <melanieplageman@gmail.com> Sent: Monday, March 8, 2021 9:56 AM

> 

> On Mon, Mar 08, 2021 at 02:37:40PM +0000, Michael Kelley wrote:

> > From: Melanie Plageman (Microsoft) <melanieplageman@gmail.com> Sent: Friday, March

> 5, 2021 3:22 PM

> > >

> > > The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during

> > > allocation.

> > >

> > > Cap cmd_per_lun at can_queue to avoid dispatch errors.

> > >

> > > Signed-off-by: Melanie Plageman (Microsoft) <melanieplageman@gmail.com>

> > > ---

> > >  drivers/scsi/storvsc_drv.c | 2 ++

> > >  1 file changed, 2 insertions(+)

> > >

> > > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c

> > > index 6bc5453cea8a..d7953a6e00e6 100644

> > > --- a/drivers/scsi/storvsc_drv.c

> > > +++ b/drivers/scsi/storvsc_drv.c

> > > @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device,

> > >  				(max_sub_channels + 1) *

> > >  				(100 - ring_avail_percent_lowater) / 100;

> > >

> > > +	scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun,

> scsi_driver.can_queue);

> > > +

> >

> > I'm not sure what you mean by "avoid dispatch errors".  Can you elaborate?

> 

> The scsi_driver.cmd_per_lun is set to 2048. Which is then used to set

> Scsi_Host->cmd_per_lun in storvsc_probe().

> 

> In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is

> called and sets the scsi_device->queue_depth to the Scsi_Host's

> cmd_per_lun with this code:

> 

> scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ?

>                                         sdev->host->cmd_per_lun : 1);

> 

> During dispatch, the scsi_device->queue_depth is used in

> scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine

> whether or not the device can queue another command.

> 

> On some machines, with the 2048 value of cmd_per_lun that was used to

> set the initial scsi_device->queue_depth, commands can be queued that

> are later not able to be dispatched after running out of space in the

> ringbuffer.

> 

> On an 8 core Azure VM with 16GB of memory with a single 1 TiB SSD

> (running an fio workload that I can provide if needed), storvsc_do_io()

> ends up often returning SCSI_MLQUEUE_DEVICE_BUSY.

> 

> This is the call stack:

> 

> hv_get_bytes_to_write

> hv_ringbuffer_write

> vmbus_send_packet

> storvsc_dio_io

> storvsc_queuecommand

> scsi_dispatch_cmd

> scsi_queue_rq

> dispatch_rq_list

> 

> > Be aware that the calculation of "can_queue" in this driver is somewhat

> > flawed -- it should not be based on the size of the ring buffer, but instead on

> > the maximum number of requests Hyper-V will queue.  And even then,

> > can_queue doesn't provide the cap you might expect because the blk-mq layer

> > allocates can_queue tags for each HW queue, not as a total.

> 

> 

> The docs for scsi_mid_low_api document Scsi_Host can_queue this way:

> 

>   can_queue

>   - must be greater than 0; do not send more than can_queue

>     commands to the adapter.

> 

> I did notice that in scsi_host.h, the comment for can_queue does say

> can_queue is the "maximum number of simultaneous commands a single hw

> queue in HBA will accept." 


Yes, this comment is correct.  The can_queue value is per HW queue.

> However, I don't see it being used this way

> in the code.

> 

> During dispatch, In scsi_target_queue_ready(), there is this code:

> 

>         if (busy >= starget->can_queue)

>                 goto starved;

> 

> And the scsi_target->can_queue value should be coming from Scsi_host as

> mentioned in the scsi_target definition in scsi_device.h

>     /*

>       * LLDs should set this in the slave_alloc host template callout.

>       * If set to zero then there is not limit.

>       */

>     unsigned int            can_queue;

> 

> So, I don't really see how this would be per hardware queue.


For the storvsc driver, the can_queue value in the scsi_target is initialized
to zero in scsi_alloc_target() and it remains unchanged.  Maybe I'm missing
something, but the only place I see that sets starget->can_queue to a
non-zero value is iscsi_target_alloc().  The storvsc slave_alloc() function does
not set it.  So the test in scsi_target_queue_ready() for exceeding can_queue
never executes.
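
For illustration, an LLD that wanted that per-target cap to take effect
would have to set it in its slave_alloc callback, roughly like this
(hypothetical code and values, not something storvsc does today):

    static int example_slave_alloc(struct scsi_device *sdev)
    {
            struct scsi_target *starget = scsi_target(sdev);

            /* Cap simultaneous commands sent to this target. */
            starget->can_queue = 128;   /* made-up value */
            return 0;
    }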

We've run live tests, and can see that the number of requests sent to the
storvsc driver exceeds the can_queue value when the # of HW queues is
greater than 1.  That result is consistent with what I see in the code.

Michael

> 

> >

> > I agree that the cmd_per_lun setting is also too big, but we should fix that in

> > the context of getting all of these different settings working together correctly,

> > and not piecemeal.

> >

> 

> Capping Scsi_Host->cmd_per_lun to scsi_driver.can_queue during probe

> will also prevent the LUN queue_depth from being set to a value that is

> higher than it can ever be set to again by the user when

> storvsc_change_queue_depth() is invoked.

> 

> Also in scsi_sysfs sdev_store_queue_depth() there is this check:

> 

>           if (depth < 1 || depth > sdev->host->can_queue)

>                 return -EINVAL;

> 

> I would also note that VirtIO SCSI in virtscsi_probe(), Scsi_Host->cmd_per_lun

> is set to the min of the configured cmd_per_lun and

> Scsi_Host->can_queue:

> 

>     shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);

> 

> Best,

> Melanie
Michael Kelley March 9, 2021, 3:57 p.m. UTC | #5
From: John Garry <john.garry@huawei.com> Sent: Tuesday, March 9, 2021 2:10 AM

> 

> On 08/03/2021 17:56, Melanie Plageman wrote:

> > On Mon, Mar 08, 2021 at 02:37:40PM +0000, Michael Kelley wrote:

> >> From: Melanie Plageman (Microsoft) <melanieplageman@gmail.com> Sent: Friday,

> March 5, 2021 3:22 PM

> >>>

> >>> The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during

> >>> allocation.

> >>>

> >>> Cap cmd_per_lun at can_queue to avoid dispatch errors.

> >>>

> >>> Signed-off-by: Melanie Plageman (Microsoft) <melanieplageman@gmail.com>

> >>> ---

> >>>   drivers/scsi/storvsc_drv.c | 2 ++

> >>>   1 file changed, 2 insertions(+)

> >>>

> >>> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c

> >>> index 6bc5453cea8a..d7953a6e00e6 100644

> >>> --- a/drivers/scsi/storvsc_drv.c

> >>> +++ b/drivers/scsi/storvsc_drv.c

> >>> @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device,

> >>>   				(max_sub_channels + 1) *

> >>>   				(100 - ring_avail_percent_lowater) / 100;

> >>>

> >>> +	scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun,

> scsi_driver.can_queue);

> >>> +

> >>

> >> I'm not sure what you mean by "avoid dispatch errors".  Can you elaborate?

> >

> > The scsi_driver.cmd_per_lun is set to 2048. Which is then used to set

> > Scsi_Host->cmd_per_lun in storvsc_probe().

> >

> > In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is

> > called and sets the scsi_device->queue_depth to the Scsi_Host's

> > cmd_per_lun with this code:

> >

> > scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ?

> >                                          sdev->host->cmd_per_lun : 1);

> >

> > During dispatch, the scsi_device->queue_depth is used in

> > scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine

> > whether or not the device can queue another command.

> >

> > On some machines, with the 2048 value of cmd_per_lun that was used to

> > set the initial scsi_device->queue_depth, commands can be queued that

> > are later not able to be dispatched after running out of space in the

> > ringbuffer.

> >

> > On an 8 core Azure VM with 16GB of memory with a single 1 TiB SSD

> > (running an fio workload that I can provide if needed), storvsc_do_io()

> > ends up often returning SCSI_MLQUEUE_DEVICE_BUSY.

> >

> > This is the call stack:

> >

> > hv_get_bytes_to_write

> > hv_ringbuffer_write

> > vmbus_send_packet

> > storvsc_dio_io

> > storvsc_queuecommand

> > scsi_dispatch_cmd

> > scsi_queue_rq

> > dispatch_rq_list

> >

> >> Be aware that the calculation of "can_queue" in this driver is somewhat

> >> flawed -- it should not be based on the size of the ring buffer, but instead on

> >> the maximum number of requests Hyper-V will queue.  And even then,

> >> can_queue doesn't provide the cap you might expect because the blk-mq layer

> >> allocates can_queue tags for each HW queue, not as a total.

> >

> >

> > The docs for scsi_mid_low_api document Scsi_Host can_queue this way:

> >

> >    can_queue

> >    - must be greater than 0; do not send more than can_queue

> >      commands to the adapter.

> >

> > I did notice that in scsi_host.h, the comment for can_queue does say

> > can_queue is the "maximum number of simultaneous commands a single hw

> > queue in HBA will accept." However, I don't see it being used this way

> > in the code.

> >

> 

> JFYI, the block layer ensures that no more than can_queue requests are

> sent to the host. See scsi_mq_setup_tags(), and how the tagset queue

> depth is set to shost->can_queue.

> 

> Thanks,

> John


Agree on what's in scsi_mq_setup_tags().  But scsi_mq_setup_tags() calls
blk_mq_alloc_tag_set(), which in turn calls blk_mq_alloc_map_and_requests(),
which calls __blk_mq_alloc_rq_maps() repeatedly, reducing the tag
set queue_depth as needed until it succeeds.

The key thing is that __blk_mq_alloc_rq_maps() iterates over the
number of HW queues calling __blk_mq_alloc_map_and_request().
The latter function allocates the map and the requests with a count
of the tag set's queue_depth.   There's no logic to apportion the
can_queue value across multiple HW queues. So each HW queue gets
can_queue tags allocated, and the SCSI host driver may see up to
(can_queue * # HW queues) simultaneous requests.

I'm certainly not an expert in this area, but that's what I see in the
code.  We've run live experiments and can see the number of
simultaneous requests sent to the storvsc driver exceed can_queue when
the # of HW queues is greater than 1, which seems to be consistent with
the code.
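
As a purely hypothetical illustration of the arithmetic: if can_queue
were 1024 and the host exposed 8 HW queues, blk-mq would allocate
8 * 1024 = 8192 tags in total, so the driver could see up to 8192
simultaneous requests rather than the 1024 that can_queue suggests.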

Michael

> 

> 

> > During dispatch, In scsi_target_queue_ready(), there is this code:

> >

> >          if (busy >= starget->can_queue)

> >                  goto starved;

> >

> > And the scsi_target->can_queue value should be coming from Scsi_host as

> > mentioned in the scsi_target definition in scsi_device.h

> >      /*

> >        * LLDs should set this in the slave_alloc host template callout.

> >        * If set to zero then there is not limit.

> >        */

> >      unsigned int            can_queue;

> >

> > So, I don't really see how this would be per hardware queue.

> >

> >>

> >> I agree that the cmd_per_lun setting is also too big, but we should fix that in

> >> the context of getting all of these different settings working together correctly,

> >> and not piecemeal.

> >>

> >

> > Capping Scsi_Host->cmd_per_lun to scsi_driver.can_queue during probe

> > will also prevent the LUN queue_depth from being set to a value that is

> > higher than it can ever be set to again by the user when

> > storvsc_change_queue_depth() is invoked.

> >

> > Also in scsi_sysfs sdev_store_queue_depth() there is this check:

> >

> >            if (depth < 1 || depth > sdev->host->can_queue)

> >                  return -EINVAL;

> >

> > I would also note that VirtIO SCSI in virtscsi_probe(), Scsi_Host->cmd_per_lun

> > is set to the min of the configured cmd_per_lun and

> > Scsi_Host->can_queue:

> >

> >      shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);

> >

> > Best,

> > Melanie

> > .

> >
John Garry March 9, 2021, 4:35 p.m. UTC | #6
On 09/03/2021 15:57, Michael Kelley wrote:
> From: John Garry <john.garry@huawei.com> Sent: Tuesday, March 9, 2021 2:10 AM

>>

>> On 08/03/2021 17:56, Melanie Plageman wrote:

>>> On Mon, Mar 08, 2021 at 02:37:40PM +0000, Michael Kelley wrote:

>>>> From: Melanie Plageman (Microsoft) <melanieplageman@gmail.com> Sent: Friday,

>> March 5, 2021 3:22 PM

>>>>>

>>>>> The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during

>>>>> allocation.

>>>>>

>>>>> Cap cmd_per_lun at can_queue to avoid dispatch errors.

>>>>>

>>>>> Signed-off-by: Melanie Plageman (Microsoft) <melanieplageman@gmail.com>

>>>>> ---

>>>>>    drivers/scsi/storvsc_drv.c | 2 ++

>>>>>    1 file changed, 2 insertions(+)

>>>>>

>>>>> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c

>>>>> index 6bc5453cea8a..d7953a6e00e6 100644

>>>>> --- a/drivers/scsi/storvsc_drv.c

>>>>> +++ b/drivers/scsi/storvsc_drv.c

>>>>> @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device,

>>>>>    				(max_sub_channels + 1) *

>>>>>    				(100 - ring_avail_percent_lowater) / 100;

>>>>>

>>>>> +	scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun,

>> scsi_driver.can_queue);

>>>>> +

>>>>

>>>> I'm not sure what you mean by "avoid dispatch errors".  Can you elaborate?

>>>

>>> The scsi_driver.cmd_per_lun is set to 2048. Which is then used to set

>>> Scsi_Host->cmd_per_lun in storvsc_probe().

>>>

>>> In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is

>>> called and sets the scsi_device->queue_depth to the Scsi_Host's

>>> cmd_per_lun with this code:

>>>

>>> scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ?

>>>                                           sdev->host->cmd_per_lun : 1);

>>>

>>> During dispatch, the scsi_device->queue_depth is used in

>>> scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine

>>> whether or not the device can queue another command.

>>>

>>> On some machines, with the 2048 value of cmd_per_lun that was used to

>>> set the initial scsi_device->queue_depth, commands can be queued that

>>> are later not able to be dispatched after running out of space in the

>>> ringbuffer.

>>>

>>> On an 8 core Azure VM with 16GB of memory with a single 1 TiB SSD

>>> (running an fio workload that I can provide if needed), storvsc_do_io()

>>> ends up often returning SCSI_MLQUEUE_DEVICE_BUSY.

>>>

>>> This is the call stack:

>>>

>>> hv_get_bytes_to_write

>>> hv_ringbuffer_write

>>> vmbus_send_packet

>>> storvsc_dio_io

>>> storvsc_queuecommand

>>> scsi_dispatch_cmd

>>> scsi_queue_rq

>>> dispatch_rq_list

>>>

>>>> Be aware that the calculation of "can_queue" in this driver is somewhat

>>>> flawed -- it should not be based on the size of the ring buffer, but instead on

>>>> the maximum number of requests Hyper-V will queue.  And even then,

>>>> can_queue doesn't provide the cap you might expect because the blk-mq layer

>>>> allocates can_queue tags for each HW queue, not as a total.

>>>

>>>

>>> The docs for scsi_mid_low_api document Scsi_Host can_queue this way:

>>>

>>>     can_queue

>>>     - must be greater than 0; do not send more than can_queue

>>>       commands to the adapter.

>>>

>>> I did notice that in scsi_host.h, the comment for can_queue does say

>>> can_queue is the "maximum number of simultaneous commands a single hw

>>> queue in HBA will accept." However, I don't see it being used this way

>>> in the code.

>>>

>>

>> JFYI, the block layer ensures that no more than can_queue requests are

>> sent to the host. See scsi_mq_setup_tags(), and how the tagset queue

>> depth is set to shost->can_queue.

>>

>> Thanks,

>> John

> 

> Agree on what's in scsi_mq_setup_tags().  But scsi_mq_setup_tags() calls

> blk_mq_alloc_tag_set(), which in turn calls blk_mq_alloc_map_and_requests(),

> which calls __blk_mq_alloc_rq_maps() repeatedly, reducing the tag

> set queue_depth as needed until it succeeds.

> 

> The key thing is that __blk_mq_alloc_rq_maps() iterates over the

> number of HW queues calling __blk_mq_alloc_map_and_request().

> The latter function allocates the map and the requests with a count

> of the tag set's queue_depth.   There's no logic to apportion the

> can_queue value across multiple HW queues. So each HW queue gets

> can_queue tags allocated, and the SCSI host driver may see up to

> (can_queue * # HW queues) simultaneous requests.

> 

> I'm certainly not an expert in this area, but that's what I see in the

> code.  We've run live experiments, and can see the number

> simultaneous requests sent to the storvsc driver be greater than

> can_queue when the # of HW queues is greater than 1, which seems

> to be consistent with the code.


ah, ok. I assumed that # of HW queues = 1 here. So you're describing a 
problem similar to 
https://lore.kernel.org/linux-scsi/b3e4e597-779b-7c1e-0d3c-07bc3dab1bb5@huawei.com/

So if you check the nr_hw_queues comment in include/scsi/scsi_host.h, it
reads: the total queue depth per host is nr_hw_queues * can_queue.
However, when host_tagset is set, the total queue depth is can_queue.

Setting .host_tagset will ensure at most can_queue requests will be sent 
over all HW queues at any given time. A few SCSI MQ drivers set this now.
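
A minimal sketch of what that would look like at probe time, before
scsi_add_host() is called (the nr_hw_queues policy shown here is just a
placeholder):

    shost->nr_hw_queues = num_present_cpus();   /* placeholder policy */
    shost->host_tagset = 1;                     /* one shared tag set of
                                                 * size can_queue across
                                                 * all HW queues
                                                 */

With host_tagset set, blk-mq marks the tag set BLK_MQ_F_TAG_HCTX_SHARED
and the total number of tags is can_queue, not can_queue * nr_hw_queues.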

Thanks,
John

> 

> Michael

> 

>>

>>

>>> During dispatch, In scsi_target_queue_ready(), there is this code:

>>>

>>>           if (busy >= starget->can_queue)

>>>                   goto starved;

>>>

>>> And the scsi_target->can_queue value should be coming from Scsi_host as

>>> mentioned in the scsi_target definition in scsi_device.h

>>>       /*

>>>         * LLDs should set this in the slave_alloc host template callout.

>>>         * If set to zero then there is not limit.

>>>         */

>>>       unsigned int            can_queue;

>>>

>>> So, I don't really see how this would be per hardware queue.

>>>

>>>>

>>>> I agree that the cmd_per_lun setting is also too big, but we should fix that in

>>>> the context of getting all of these different settings working together correctly,

>>>> and not piecemeal.

>>>>

>>>

>>> Capping Scsi_Host->cmd_per_lun to scsi_driver.can_queue during probe

>>> will also prevent the LUN queue_depth from being set to a value that is

>>> higher than it can ever be set to again by the user when

>>> storvsc_change_queue_depth() is invoked.

>>>

>>> Also in scsi_sysfs sdev_store_queue_depth() there is this check:

>>>

>>>             if (depth < 1 || depth > sdev->host->can_queue)

>>>                   return -EINVAL;

>>>

>>> I would also note that VirtIO SCSI in virtscsi_probe(), Scsi_Host->cmd_per_lun

>>> is set to the min of the configured cmd_per_lun and

>>> Scsi_Host->can_queue:

>>>

>>>       shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);

>>>

>>> Best,

>>> Melanie

>>> .

>>>

>
Michael Kelley March 9, 2021, 5:09 p.m. UTC | #7
From: John Garry <john.garry@huawei.com> Sent: Tuesday, March 9, 2021 8:36 AM

> 

> On 09/03/2021 15:57, Michael Kelley wrote:

> > From: John Garry <john.garry@huawei.com> Sent: Tuesday, March 9, 2021 2:10 AM

> >>

> >> On 08/03/2021 17:56, Melanie Plageman wrote:

> >>> On Mon, Mar 08, 2021 at 02:37:40PM +0000, Michael Kelley wrote:

> >>>> From: Melanie Plageman (Microsoft) <melanieplageman@gmail.com> Sent: Friday,

> >> March 5, 2021 3:22 PM

> >>>>>

> >>>>> The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during

> >>>>> allocation.

> >>>>>

> >>>>> Cap cmd_per_lun at can_queue to avoid dispatch errors.

> >>>>>

> >>>>> Signed-off-by: Melanie Plageman (Microsoft) <melanieplageman@gmail.com>

> >>>>> ---

> >>>>>    drivers/scsi/storvsc_drv.c | 2 ++

> >>>>>    1 file changed, 2 insertions(+)

> >>>>>

> >>>>> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c

> >>>>> index 6bc5453cea8a..d7953a6e00e6 100644

> >>>>> --- a/drivers/scsi/storvsc_drv.c

> >>>>> +++ b/drivers/scsi/storvsc_drv.c

> >>>>> @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device,

> >>>>>    				(max_sub_channels + 1) *

> >>>>>    				(100 - ring_avail_percent_lowater) / 100;

> >>>>>

> >>>>> +	scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun,

> >> scsi_driver.can_queue);

> >>>>> +

> >>>>

> >>>> I'm not sure what you mean by "avoid dispatch errors".  Can you elaborate?

> >>>

> >>> The scsi_driver.cmd_per_lun is set to 2048. Which is then used to set

> >>> Scsi_Host->cmd_per_lun in storvsc_probe().

> >>>

> >>> In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is

> >>> called and sets the scsi_device->queue_depth to the Scsi_Host's

> >>> cmd_per_lun with this code:

> >>>

> >>> scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ?

> >>>                                           sdev->host->cmd_per_lun : 1);

> >>>

> >>> During dispatch, the scsi_device->queue_depth is used in

> >>> scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine

> >>> whether or not the device can queue another command.

> >>>

> >>> On some machines, with the 2048 value of cmd_per_lun that was used to

> >>> set the initial scsi_device->queue_depth, commands can be queued that

> >>> are later not able to be dispatched after running out of space in the

> >>> ringbuffer.

> >>>

> >>> On an 8 core Azure VM with 16GB of memory with a single 1 TiB SSD

> >>> (running an fio workload that I can provide if needed), storvsc_do_io()

> >>> ends up often returning SCSI_MLQUEUE_DEVICE_BUSY.

> >>>

> >>> This is the call stack:

> >>>

> >>> hv_get_bytes_to_write

> >>> hv_ringbuffer_write

> >>> vmbus_send_packet

> >>> storvsc_dio_io

> >>> storvsc_queuecommand

> >>> scsi_dispatch_cmd

> >>> scsi_queue_rq

> >>> dispatch_rq_list

> >>>

> >>>> Be aware that the calculation of "can_queue" in this driver is somewhat

> >>>> flawed -- it should not be based on the size of the ring buffer, but instead on

> >>>> the maximum number of requests Hyper-V will queue.  And even then,

> >>>> can_queue doesn't provide the cap you might expect because the blk-mq layer

> >>>> allocates can_queue tags for each HW queue, not as a total.

> >>>

> >>>

> >>> The docs for scsi_mid_low_api document Scsi_Host can_queue this way:

> >>>

> >>>     can_queue

> >>>     - must be greater than 0; do not send more than can_queue

> >>>       commands to the adapter.

> >>>

> >>> I did notice that in scsi_host.h, the comment for can_queue does say

> >>> can_queue is the "maximum number of simultaneous commands a single hw

> >>> queue in HBA will accept." However, I don't see it being used this way

> >>> in the code.

> >>>

> >>

> >> JFYI, the block layer ensures that no more than can_queue requests are

> >> sent to the host. See scsi_mq_setup_tags(), and how the tagset queue

> >> depth is set to shost->can_queue.

> >>

> >> Thanks,

> >> John

> >

> > Agree on what's in scsi_mq_setup_tags().  But scsi_mq_setup_tags() calls

> > blk_mq_alloc_tag_set(), which in turn calls blk_mq_alloc_map_and_requests(),

> > which calls __blk_mq_alloc_rq_maps() repeatedly, reducing the tag

> > set queue_depth as needed until it succeeds.

> >

> > The key thing is that __blk_mq_alloc_rq_maps() iterates over the

> > number of HW queues calling __blk_mq_alloc_map_and_request().

> > The latter function allocates the map and the requests with a count

> > of the tag set's queue_depth.   There's no logic to apportion the

> > can_queue value across multiple HW queues. So each HW queue gets

> > can_queue tags allocated, and the SCSI host driver may see up to

> > (can_queue * # HW queues) simultaneous requests.

> >

> > I'm certainly not an expert in this area, but that's what I see in the

> > code.  We've run live experiments, and can see the number

> > simultaneous requests sent to the storvsc driver be greater than

> > can_queue when the # of HW queues is greater than 1, which seems

> > to be consistent with the code.

> 

> ah, ok. I assumed that # of HW queues = 1 here. So you're describing a

> problem similar to

> https://lore.kernel.org/linux-scsi/b3e4e597-779b-7c1e-0d3c-07bc3dab1bb5@huawei.com/


Yes.

> 

> So if you check nr_hw_queues comment in include/scsi/scsi_host.h, it reads:

> the total queue depth per host is nr_hw_queues * can_queue. However, for

> when host_tagset is set, the total queue depth is can_queue.

> 

> Setting .host_tagset will ensure at most can_queue requests will be sent

> over all HW queues at any given time. A few SCSI MQ drivers set this now.

> 


I had seen the "host_tagset" option in some recent investigations, and it
looks like it better models what the storvsc driver wants.  But given that
only a few drivers were using it, it seemed like it might be far enough off the
beaten path to be a bit risky.

The storvsc driver is the driver for the synthetic SCSI controller provided
by Microsoft's Hyper-V hypervisor, and as such it is used heavily in the
Azure public cloud.  The settings in storvsc have changed incrementally
over the years, and there are now some inconsistencies as Melanie has
pointed out.  Storvsc and the underlying Hyper-V run in VMs with
relatively low IOPS limits and also in VMs that can hit hundreds of
thousands of IOPS.  Getting enough parallelism to drive the high IOPS
while not overloading the low-IOPS environments is the challenge.  As a
purely synthetic device, there aren't really multiple HW queues, but
configuring multiple queues helps drive the higher-end IOPS numbers in
VMs with sizable vCPU counts.

So setting "host_tagset" while allowing multiple HW queues for
higher parallelism might be a good approach.  We will experiment.

Michael

Patch

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 6bc5453cea8a..d7953a6e00e6 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1946,6 +1946,8 @@  static int storvsc_probe(struct hv_device *device,
 				(max_sub_channels + 1) *
 				(100 - ring_avail_percent_lowater) / 100;
 
+	scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun, scsi_driver.can_queue);
+
 	host = scsi_host_alloc(&scsi_driver,
 			       sizeof(struct hv_host_device));
 	if (!host)