[v12,00/20] scsi: Allow scsi_execute users to request retries

Message ID	20231114013750.76609-1-michael.christie@oracle.com
Headers	show Return-Path: <linux-scsi-owner@vger.kernel.org> From: Mike Christie <michael.christie@oracle.com> To: mwilck@suse.com, john.g.garry@oracle.com, bvanassche@acm.org, hch@lst.de, martin.petersen@oracle.com, linux-scsi@vger.kernel.org, james.bottomley@hansenpartnership.com Subject: [PATCH v12 00/20] scsi: Allow scsi_execute users to request retries Date: Mon, 13 Nov 2023 19:37:30 -0600 Message-Id: <20231114013750.76609-1-michael.christie@oracle.com> Content-Transfer-Encoding: 8bit Content-Type: text/plain MIME-Version: 1.0 Precedence: bulk
Series	scsi: Allow scsi_execute users to request retries \| expand [v12,00/20] scsi: Allow scsi_execute users to request retries [v12,01/20] scsi: Allow passthrough to request scsi-ml retries [v12,02/20] scsi: Have scsi-ml retry scsi_probe_lun errors [v12,03/20] scsi: retry INQUIRY after timeout [v12,04/20] scsi: sd: Have scsi-ml retry read_capacity_16 errors [v12,05/20] scsi: Use separate buf for START_STOP in sd_spinup_disk [v12,06/20] scsi: Have scsi-ml retry sd_spinup_disk errors [v12,07/20] scsi: hp_sw: Have scsi-ml retry scsi_execute_cmd errors [v12,08/20] scsi: rdac: Have scsi-ml retry send_mode_select errors [v12,09/20] scsi: spi: Have scsi-ml retry spi_execute UAs [v12,10/20] scsi: sd: Have scsi-ml retry sd_sync_cache errors [v12,11/20] scsi: ch: Remove unit_attention [v12,12/20] scsi: ch: Have scsi-ml retry ch_do_scsi UAs [v12,13/20] scsi: Have scsi-ml retry scsi_mode_sense UAs [v12,14/20] scsi: Have scsi-ml retry scsi_report_lun_scan errors [v12,15/20] scsi: sd: Have pr commands retry UAs [v12,16/20] scsi: sd: Have scsi-ml retry read_capacity_10 errors [v12,17/20] scsi: ses: Have scsi-ml retry scsi_execute_cmd errors [v12,18/20] scsi: sr: Have scsi-ml retry get_sectorsize errors [v12,19/20] scsi: ufs: Have scsi-ml retry start stop errors [v12,20/20] scsi: Add kunit tests for scsi_check_passthrough

Mike Christie Nov. 14, 2023, 1:37 a.m. UTC

The following patches were made over Martin's 6.7 staging branch
which contains a sd change that this patch requires.

The patches allow scsi_execute_cmd users to have scsi-ml retry the
cmd for it instead of the caller having to parse the error and loop
itself.

Note that I dropped most reviewed-by tags, because the structs changed
where before we only had struct scsi_failure we now have the struct
scsi_failure that is in:

struct scsi_failures {
        /*
         * If the failure does not have a specific limit in the scsi_failure
         * then this limit is followed.
         */
        int total_allowed;
        int total_retries;
        struct scsi_failure *failure_definitions;
};

so we can limit the total number of retries. The setup for this is
then different and I'm not sure if we everyone will like it. The
other parts of the conversions patches have not changed.


 drivers/scsi/Kconfig                        |   9 +
 drivers/scsi/ch.c                           |  27 ++-
 drivers/scsi/device_handler/scsi_dh_hp_sw.c |  49 +++--
 drivers/scsi/device_handler/scsi_dh_rdac.c  |  84 +++----
 drivers/scsi/scsi_lib.c                     |  26 ++-
 drivers/scsi/scsi_lib_test.c                | 330 ++++++++++++++++++++++++++++
 drivers/scsi/scsi_scan.c                    | 107 +++++----
 drivers/scsi/scsi_transport_spi.c           |  35 +--
 drivers/scsi/sd.c                           | 326 +++++++++++++++++----------
 drivers/scsi/ses.c                          |  66 ++++--
 drivers/scsi/sr.c                           |  38 ++--
 drivers/ufs/core/ufshcd.c                   |  22 +-
 12 files changed, 822 insertions(+), 297 deletions(-)

v12:
- Fix bug where a user could request no retries and skip going
through the scsi-eh.
- Drop support for allowing caller to have scsi-ml not retry
failures (we only allow caller to request retries).
- Fix formatting.
- Add support to control total number of retries.
- Fix kunit tests to add a missing test and comments.
- Fix missing SCMD_FAILURE_ASCQ_ANY in sd_spinup_disk.

v11:
- Document scsi_failure.result special values
- Fix sshdr fix git commit message where there was a missing word
- Use designated initializers for cdb setup
- Fix up various coding style comments from John like redoing if/else
error/success checks.
- Add patch to fix rdac issue where stale SCSH_DH values were returned
- Remove old comment from:
"[PATCH v10 16/33] scsi: spi: Have scsi-ml retry spi_execute errors"
- Drop EOPNOTSUPP use from:
"[PATCH v10 17/33] scsi: sd: Fix sshdr use in sd_suspend_common"
- Init errno to 0 when declared in:
"[PATCH v10 20/33] scsi: ch: Have scsi-ml retry ch_do_scsi errors"
- Add diffstat below

v10:
- Drop "," after {}.
- Hopefully fix outlook issues.

v9:
- Drop spi_execute changes from [PATCH] scsi: spi: Fix sshdr use
- Change git commit message for sshdr use fixes.

v8:
- Rebase.
- Saw the discussion about the possible bug where callers are
accessing the sshdr when it's not setup, so I added some patches
for that since I was going over the same code.

v7:
- Rebase against scsi_execute_cmd patchset.

v6:
- Fix kunit build when it's built as a module but scsi is not.
- Drop [PATCH 17/35] scsi: ufshcd: Convert to scsi_exec_req because
  that driver no longer uses scsi_execute.
- Convert ufshcd to use the scsi_failures struct because it now just does
  direct retries and does not do it's own deadline timing.
- Add back memset in read_capacity_16.
- Remove memset in get_sectorsize and replace with { } to init buf.

v5:
- Fix spelling (made sure I ran checkpatch strict)
- Drop SCMD_FAILURE_NONE
- Rename SCMD_FAILURE_ANY
- Fix media_not_present handling where it was being retried instead of
failed.
- Fix ILLEGAL_REQUEST handling in read_capacity_16 so it was not retried.
- Fix coding style, spelling and and naming convention in kunit and added
  more tests to handle cases like the media_not_present one where we want
  to force failures instead of retries.
- Drop cxlflash patch because it actually checked it's internal state before
  performing a retry which we currently do not support.

v4:
- Redefine cmd definitions if the cmd is touched.
- Fix up coding style issues.
- Use sam_status enum.
- Move failures initialization to scsi_initialize_rq
(also fixes KASAN error).
- Add kunit test.
- Add function comments.

v3:
- Use a for loop in scsi_check_passthrough
- Fix result handling/testing.
- Fix scsi_status_is_good handling.
- make __scsi_exec_req take a const arg
- Fix formatting in patch 24

v2:
- Rename scsi_prep_sense
- Change scsi_check_passthrough's loop and added some fixes
- Modified scsi_execute* so it uses a struct to pass in args

John Garry Nov. 16, 2023, 10:53 a.m. UTC | #1

On 14/11/2023 01:37, Mike Christie wrote:
> For passthrough we don't retry any error we get a check condition for.

nit: For passthrough we don't retry any error which we get a check 
condition for.

> This results in a lot of callers driving their own retries for all UAs,
> specific UAs, NOT_READY, specific sense values or any type of failure.
> 
> This adds the core code to allow passthrough users to specify what errors
> they want scsi-ml to retry for them. We can then convert users to drop
> a lot of their sense parsing and retry handling.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>

Looks ok to me, so feel free to pick up:
Reviewed-by: John Garry <john.g.garry@oracle.com>

> ---
>   drivers/scsi/scsi_lib.c    | 92 ++++++++++++++++++++++++++++++++++++++
>   include/scsi/scsi_device.h | 48 ++++++++++++++++++++
>   2 files changed, 140 insertions(+)
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index cf3864f72093..dee43c6f7ad0 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -184,6 +184,92 @@ void scsi_queue_insert(struct scsi_cmnd *cmd, int reason)
>   	__scsi_queue_insert(cmd, reason, true);
>   }
>   
> +void scsi_reset_failures(struct scsi_failures *failures)

nit: maybe scsi_reset_failures_retries as the name, to be more specific

> +{
> +	struct scsi_failure *failure;
> +
> +	failures->total_retries = 0;
> +
> +	for (failure = failures->failure_definitions; failure->result;
> +	     failure++)
> +		failure->retries = 0;
> +}
> +EXPORT_SYMBOL_GPL(scsi_reset_failures);
> +
> +/**
> + * scsi_check_passthrough - Determine if passthrough scsi_cmnd needs a retry.
> + * @scmd: scsi_cmnd to check.
> + * @failures: scsi_failures struct that lists failures to check for.
> + *
> + * Returns -EAGAIN if the caller should retry else 0.
> + */
> +static int scsi_check_passthrough(struct scsi_cmnd *scmd,
> +				  struct scsi_failures *failures)
> +{
> +	struct scsi_failure *failure;
> +	struct scsi_sense_hdr sshdr;
> +	enum sam_status status;
> +
> +	if (!failures)
> +		return 0;
> +
> +	for (failure = failures->failure_definitions; failure->result;
> +	     failure++) {
> +		if (failure->result == SCMD_FAILURE_RESULT_ANY)
> +			goto maybe_retry;
> +
> +		if (host_byte(scmd->result) &&
> +		    host_byte(scmd->result) == host_byte(failure->result))
> +			goto maybe_retry;
> +
> +		status = status_byte(scmd->result);
> +		if (!status)
> +			continue;
> +
> +		if (failure->result == SCMD_FAILURE_STAT_ANY &&
> +		    !scsi_status_is_good(scmd->result))
> +			goto maybe_retry;
> +
> +		if (status != status_byte(failure->result))
> +			continue;
> +
> +		if (status_byte(failure->result) != SAM_STAT_CHECK_CONDITION ||
> +		    failure->sense == SCMD_FAILURE_SENSE_ANY)
> +			goto maybe_retry;
> +
> +		if (!scsi_command_normalize_sense(scmd, &sshdr))
> +			return 0;
> +
> +		if (failure->sense != sshdr.sense_key)
> +			continue;
> +
> +		if (failure->asc == SCMD_FAILURE_ASC_ANY)
> +			goto maybe_retry;
> +
> +		if (failure->asc != sshdr.asc)
> +			continue;
> +
> +		if (failure->ascq == SCMD_FAILURE_ASCQ_ANY ||
> +		    failure->ascq == sshdr.ascq)
> +			goto maybe_retry;
> +	}
> +
> +	return 0;
> +
> +maybe_retry:
> +	if (failure->allowed) {
> +		if (failure->allowed == SCMD_FAILURE_NO_LIMIT ||
> +		    ++failure->retries <= failure->allowed)
> +			return -EAGAIN;
> +	} else {
> +		if (failures->total_allowed == SCMD_FAILURE_NO_LIMIT ||
> +		    ++failures->total_retries <= failures->total_allowed)
> +			return -EAGAIN;
> +	}
> +
> +	return 0;
> +}
> +
>   /**
>    * scsi_execute_cmd - insert request and wait for the result
>    * @sdev:	scsi_device
> @@ -214,6 +300,7 @@ int scsi_execute_cmd(struct scsi_device *sdev, const unsigned char *cmd,
>   			      args->sense_len != SCSI_SENSE_BUFFERSIZE))
>   		return -EINVAL;
>   
> +retry:
>   	req = scsi_alloc_request(sdev->request_queue, opf, args->req_flags);
>   	if (IS_ERR(req))
>   		return PTR_ERR(req);
> @@ -237,6 +324,11 @@ int scsi_execute_cmd(struct scsi_device *sdev, const unsigned char *cmd,
>   	 */
>   	blk_execute_rq(req, true);
>   
> +	if (scsi_check_passthrough(scmd, args->failures) == -EAGAIN) {
> +		blk_mq_free_request(req);
> +		goto retry;
> +	}
> +
>   	/*
>   	 * Some devices (USB mass-storage in particular) may transfer
>   	 * garbage data together with a residue indicating that the data
> diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
> index 10480eb582b2..c92d6d9e644e 100644
> --- a/include/scsi/scsi_device.h
> +++ b/include/scsi/scsi_device.h
> @@ -483,6 +483,52 @@ extern int scsi_is_sdev_device(const struct device *);
>   extern int scsi_is_target_device(const struct device *);
>   extern void scsi_sanitize_inquiry_string(unsigned char *s, int len);
>   
> +/*
> + * scsi_execute_cmd users can set scsi_failure.result to have
> + * scsi_check_passthrough fail/retry a command. scsi_failure.result can be a
> + * specific host byte or message code, or SCMD_FAILURE_RESULT_ANY can be used
> + * to match any host or message code.
> + */
> +#define SCMD_FAILURE_RESULT_ANY	0x7fffffff
> +/*
> + * Set scsi_failure.result to SCMD_FAILURE_STAT_ANY to fail/retry any failure
> + * scsi_status_is_good returns false for.
> + */
> +#define SCMD_FAILURE_STAT_ANY	0xff
> +/*
> + * The following can be set to the scsi_failure sense, asc and ascq fields to
> + * match on any sense, ASC, or ASCQ value.
> + */
> +#define SCMD_FAILURE_SENSE_ANY	0xff
> +#define SCMD_FAILURE_ASC_ANY	0xff
> +#define SCMD_FAILURE_ASCQ_ANY	0xff
> +/* Always retry a matching failure. */
> +#define SCMD_FAILURE_NO_LIMIT	-1
> +
> +struct scsi_failure {
> +	int result;
> +	u8 sense;
> +	u8 asc;
> +	u8 ascq;
> +
> +	/*
> +	 * Number of attempts allowed for this failure. It does not count
> +	 * for the total_allowed.
> +	 */
> +	s8 allowed;
> +	s8 retries;
> +};
> +
> +struct scsi_failures {
> +	/*
> +	 * If the failure does not have a specific limit in the scsi_failure
> +	 * then this limit is followed.
> +	 */
> +	int total_allowed;
> +	int total_retries;
> +	struct scsi_failure *failure_definitions;
> +};
> +
>   /* Optional arguments to scsi_execute_cmd */
>   struct scsi_exec_args {
>   	unsigned char *sense;		/* sense buffer */
> @@ -491,12 +537,14 @@ struct scsi_exec_args {
>   	blk_mq_req_flags_t req_flags;	/* BLK_MQ_REQ flags */
>   	int scmd_flags;			/* SCMD flags */
>   	int *resid;			/* residual length */
> +	struct scsi_failures *failures;	/* failures to retry */
>   };
>   
>   int scsi_execute_cmd(struct scsi_device *sdev, const unsigned char *cmd,
>   		     blk_opf_t opf, void *buffer, unsigned int bufflen,
>   		     int timeout, int retries,
>   		     const struct scsi_exec_args *args);
> +void scsi_reset_failures(struct scsi_failures *failures);
>   
>   extern void sdev_disable_disk_events(struct scsi_device *sdev);
>   extern void sdev_enable_disk_events(struct scsi_device *sdev);

John Garry Nov. 16, 2023, 11:02 a.m. UTC | #2

On 14/11/2023 01:37, Mike Christie wrote:
> The following patches were made over Martin's 6.7 staging branch
> which contains a sd change that this patch requires.

Hi Mike,

It would be nice if we could apply to 6.8 staging - was the sd.c change 
not in v6.7-rc1?

Cheers,
John

John Garry Nov. 16, 2023, 11:39 a.m. UTC | #3

On 14/11/2023 01:37, Mike Christie wrote:

Feel free to pick up:
Reviewed-by: John Garry <john.g.garry@oracle.com>

> +	the_result = scsi_execute_cmd(sdp, cmd, REQ_OP_DRV_IN, buffer,
> +				      RC16_LEN, SD_TIMEOUT, sdkp->max_retries,

BTW, some might find it a little confusing that we have 
scsi_execute_cmd() retries arg as well as being able to pass 
exec_args.failure, and exec_args.failure may allow us to retry. Could 
this be improved, even in terms of arg or struct member naming/comments?

Thanks,
John

> +				      &exec_args);
>   
> -	} while (the_result && retries);
> +	if (the_result > 0) {
> +		if (media_not_present(sdkp, &sshdr))
> +			return -ENODEV;
> +
> +		sense_valid = scsi_sense_valid(&sshdr);
> +		if (sense_valid && sshdr.sense_key == ILLEGAL_REQUEST &&
> +		    (sshdr.asc == 0x20 || sshdr.asc == 0x24) &&
> +		     sshdr.ascq == 0x00) {
> +			/*
> +			 * Invalid Command Operation Code or Invalid Field in
> +			 * CDB, just retry silently with RC10
> +			 */
> +			return -EINVAL;
> +		}
> +	}

Mike Christie Nov. 16, 2023, 5:15 p.m. UTC | #4

On 11/16/23 5:39 AM, John Garry wrote:
> On 14/11/2023 01:37, Mike Christie wrote:
> 
> Feel free to pick up:
> Reviewed-by: John Garry <john.g.garry@oracle.com>
> 
>> +    the_result = scsi_execute_cmd(sdp, cmd, REQ_OP_DRV_IN, buffer,
>> +                      RC16_LEN, SD_TIMEOUT, sdkp->max_retries,
> 
> BTW, some might find it a little confusing that we have scsi_execute_cmd() retries arg as well as being able to pass exec_args.failure, and exec_args.failure may allow us to retry. Could this be improved, even in terms of arg or struct member naming/comments?

Will add some comments.

Martin W, if you are going to ask about this again, the answer is that
in a future patchset, we can kill the reties passed directly into
scsi_execute_cmd.

We could make scsi_decide_disposition's failures an array of
scsi_failure structs that gets checked in scsi_decide_disposition
and scsi_check_passthrough. We would need to add a function callout
to the scsi_failure struct for some of the more complicated failure
handling. That could then be used for some of the other scsi_execute_cmd
cases as well. For the normal/fast path case though we would need
something to avoid function pointers.

Martin Wilck Nov. 16, 2023, 5:57 p.m. UTC | #5

On Thu, 2023-11-16 at 11:15 -0600, Mike Christie wrote:
> On 11/16/23 5:39 AM, John Garry wrote:
> > On 14/11/2023 01:37, Mike Christie wrote:
> > 
> > Feel free to pick up:
> > Reviewed-by: John Garry <john.g.garry@oracle.com>
> > 
> > > +    the_result = scsi_execute_cmd(sdp, cmd, REQ_OP_DRV_IN,
> > > buffer,
> > > +                      RC16_LEN, SD_TIMEOUT, sdkp->max_retries,
> > 
> > BTW, some might find it a little confusing that we have
> > scsi_execute_cmd() retries arg as well as being able to pass
> > exec_args.failure, and exec_args.failure may allow us to retry.
> > Could this be improved, even in terms of arg or struct member
> > naming/comments?
> 
> Will add some comments.
> 
> Martin W, if you are going to ask about this again, the answer is
> that
> in a future patchset, we can kill the reties passed directly into
> scsi_execute_cmd.
> 

Did I ask about this? ... I dimly remember ;-)

> We could make scsi_decide_disposition's failures an array of
> scsi_failure structs that gets checked in scsi_decide_disposition
> and scsi_check_passthrough. We would need to add a function callout
> to the scsi_failure struct for some of the more complicated failure
> handling. That could then be used for some of the other
> scsi_execute_cmd
> cases as well. For the normal/fast path case though we would need
> something to avoid function pointers.

I am not sure if I like this idea. scsi_decide_disposition() is
challenging to read already. I am not convinced that converting it into
a loop over "struct scsi_failure" would make it easier to understand.
I guess we'd need to see how it would look like. 

To my understanding, the two retry counts are currently orthogonal, the
one in scsi_execute_cmd() representing the standard mid-layer
"maybe_retry" case in scsi_decide_disposition, and the the other one
representing the additional "high level" passthrough retry. We need to
be careful when merging them.

Wrt the callout, I'd actually thought about that when looking at your
previous set, but I didn't want to interfere too much. I thought that
using callouts might simplify the scsi_failure handling, and might
actually make the new code easier to read. I can't judge possible
effects on performance.

Cheers,
Martin

[v12,00/20] scsi: Allow scsi_execute users to request retries

Message

Comments