[v5,0/3] SCSI: Fix issues between removing device and error handle

Message ID 20240605091731.3111195-1-haowenchao22@gmail.com

Wenchao Hao June 5, 2024, 9:17 a.m. UTC
Two issues are triggered because shost_for_each_device() skips devices
that are being removed. Both occur mainly in the error recovery path:

1. The statistics printed at the beginning of scsi_error_handler() are
   wrong.
2. Device reset is not triggered. Drivers such as smartpqi implement only
   eh_device_reset_handler, so if device reset is skipped, the commands
   that have already been sent to the firmware or device hardware are
   not cleared. The error handler then flushes all of these commands in
   scsi_unjam_host(). When the hardware later completes them, a
   use-after-free is triggered.
   The issue was first seen with smartpqi devices and can be reproduced
   with scsi_debug. I did not find any description stating that a device
   in the SDEV_DEL state cannot perform a device reset, so this should
   be addressed.

A new macro shost_for_each_device_include_deleted() is added to address
these issues. Unlike shost_for_each_device(), it does not skip a
scsi_device that is being removed when iterating over the host's devices.
It is used when collecting the host's error statistics and when trying to
reset a scsi_device in the error recovery path.
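
The effect on the printed statistics can be illustrated with a small
userspace model. This is only a sketch, not kernel code: struct model_sdev,
its fields and model_count_failed() are hypothetical names invented for the
illustration, and the real iteration also takes and drops device references.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct scsi_device; not the kernel definition. */
struct model_sdev {
	bool removing;		/* device is being removed */
	int failed_cmds;	/* commands waiting for error recovery */
};

/*
 * Sum the failed commands over a host's devices. When include_removing
 * is false, devices being removed are skipped, which is the behaviour
 * the cover letter attributes to shost_for_each_device().
 */
int model_count_failed(const struct model_sdev *devs, size_t n,
		       bool include_removing)
{
	int total = 0;
	size_t i;

	assert(devs != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (devs[i].removing && !include_removing)
			continue;
		total += devs[i].failed_cmds;
	}
	return total;
}
```

With one healthy device holding two failed commands and one device in
removal holding three, the skipping iteration reports two while the
non-skipping one reports five, which is the kind of mismatch described in
issue 1.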

V5:
 - Rewrite cover letter and add fixes tag to each patch

V4:
 - Remove the fourth patch, which fixed an IO hang during device removal,
   because the issue is fixed by commit '6df0e077d76bd (scsi: core:
   Kick the requeue list after inserting when flushing)'

V3:
  - Update patch description
  - Update comments of functions added

V2:
  - Fix IO hang by running all devices' queues after the error handler
  - Do not modify shost_for_each_device() directly; instead add a new
    helper that iterates devices without skipping those being removed

Wenchao Hao (3):
  scsi: core: Add new helper to iterate all devices of host
  scsi: scsi_error: Fix wrong statistic when print error info
  scsi: scsi_error: Fix device reset is not triggered

 drivers/scsi/scsi.c        | 46 ++++++++++++++++++++++++++------------
 drivers/scsi/scsi_error.c  |  4 ++--
 include/scsi/scsi_device.h | 25 ++++++++++++++++++---
 3 files changed, 56 insertions(+), 19 deletions(-)

Comments

Wenchao Hao June 12, 2024, 3:06 p.m. UTC | #1
On 6/12/24 4:33 PM, Hannes Reinecke wrote:
> On 6/5/24 11:17, Wenchao Hao wrote:
>> shost_for_each_device() skips devices which are in the SDEV_CANCEL or
>> SDEV_DEL state. In some scenarios we do not want to skip these devices,
>> so add a new macro shost_for_each_device_include_deleted() to handle it.
>>
>> Following changes are introduced:
>>
>> 1. Rework scsi_device_get(): add a new helper __scsi_device_get() whose
>>     "skip_deleted" parameter determines whether deleted devices are skipped.
>> 2. Add a new parameter "skip_deleted" to __scsi_iterate_devices(), which
>>     is passed on to __scsi_device_get().
>> 3. Update shost_for_each_device() to call __scsi_iterate_devices() with
>>     "skip_deleted" set to true.
>> 4. Add a new macro shost_for_each_device_include_deleted() which calls
>>     __scsi_iterate_devices() with "skip_deleted" set to false.
>>
>> Signed-off-by: Wenchao Hao <haowenchao22@gmail.com>
>> ---
>>   drivers/scsi/scsi.c        | 46 ++++++++++++++++++++++++++------------
>>   include/scsi/scsi_device.h | 25 ++++++++++++++++++---
>>   2 files changed, 54 insertions(+), 17 deletions(-)
>>
>> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
>> index 3e0c0381277a..5913de543d93 100644
>> --- a/drivers/scsi/scsi.c
>> +++ b/drivers/scsi/scsi.c
>> @@ -735,20 +735,18 @@ int scsi_cdl_enable(struct scsi_device *sdev, bool enable)
>>       return 0;
>>   }
>>   -/**
>> - * scsi_device_get  -  get an additional reference to a scsi_device
>> +/*
>> + * __scsi_device_get  -  get an additional reference to a scsi_device
>>    * @sdev:    device to get a reference to
>> - *
>> - * Description: Gets a reference to the scsi_device and increments the use count
>> - * of the underlying LLDD module.  You must hold host_lock of the
>> - * parent Scsi_Host or already have a reference when calling this.
>> - *
>> - * This will fail if a device is deleted or cancelled, or when the LLD module
>> - * is in the process of being unloaded.
>> + * @skip_deleted: when true, would return failed if device is deleted
>>    */
>> -int scsi_device_get(struct scsi_device *sdev)
>> +static int __scsi_device_get(struct scsi_device *sdev, bool skip_deleted)
>>   {
>> -    if (sdev->sdev_state == SDEV_DEL || sdev->sdev_state == SDEV_CANCEL)
>> +    /*
>> +     * if skip_deleted is true and device is in removing, return failed
>> +     */
>> +    if (skip_deleted &&
>> +        (sdev->sdev_state == SDEV_DEL || sdev->sdev_state == SDEV_CANCEL))
>>           goto fail;
> 
> Nack.
> SDEV_DEL means the device is about to be deleted, so we _must not_ access it at all.
> 

Sorry, I added SDEV_DEL here by hand without understanding what it means.
Actually, including only scsi_devices that are in SDEV_CANCEL would fix
the issues I described.

The issues occur because device removal runs concurrently with error
handling. Normally, error handling would not be triggered when a
scsi_device is in SDEV_DEL. Below is my analysis; if it is wrong, please
correct me.

If scsi_cmnds remain unfinished when a scsi_device is being removed,
the removal process waits for all commands to finish.
If a command error occurs and triggers error handling, the removal
process is blocked until error handling finishes, because
__scsi_remove_device() calls del_gendisk(), which waits for all
requests to finish. At this point the scsi_device is in SDEV_CANCEL.

If the scsi_device is already in SDEV_DEL, then no scsi_cmnd has been
dispatched to this scsi_device, so error handling would never be triggered.

I want to change the new function __scsi_device_get() as follows;
please help review.

/*
 * __scsi_device_get  -  get an additional reference to a scsi_device
 * @sdev:	device to get a reference to
 * @skip_canceled: when true, fail if the device is in SDEV_CANCEL
 */
static int __scsi_device_get(struct scsi_device *sdev, bool skip_canceled)
{
	/*
	 * Never take a reference to a device in SDEV_DEL. A device in
	 * SDEV_CANCEL is refused only when skip_canceled is true.
	 */
	if (sdev->sdev_state == SDEV_DEL ||
	    (sdev->sdev_state == SDEV_CANCEL && skip_canceled))
		goto fail;
	if (!try_module_get(sdev->host->hostt->module))
		goto fail;
	if (!get_device(&sdev->sdev_gendev))
		goto fail_put_module;
	return 0;

fail_put_module:
	module_put(sdev->host->hostt->module);
fail:
	return -ENXIO;
}
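
To make the intended behaviour of the state check easier to review, here is
a userspace sketch of just that condition. It is an illustration only:
enum model_state and model_may_get() are hypothetical names, and the module
and device reference counting of the real function is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the three sdev_state values discussed here. */
enum model_state { MODEL_RUNNING, MODEL_CANCEL, MODEL_DEL };

/*
 * Return true when the proposed check would let the caller go on to take
 * a reference: SDEV_DEL is always refused, SDEV_CANCEL is refused only
 * when skip_canceled is true.
 */
bool model_may_get(enum model_state state, bool skip_canceled)
{
	assert(state == MODEL_RUNNING || state == MODEL_CANCEL ||
	       state == MODEL_DEL);
	if (state == MODEL_DEL)
		return false;
	if (state == MODEL_CANCEL && skip_canceled)
		return false;
	return true;
}
```

Under this sketch the only combination that differs from the current
scsi_device_get() check is SDEV_CANCEL with skip_canceled false, which is
the case the error recovery path needs.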

> Cheers,
> 
> Hannes