Message ID | b0f92045-d10f-4038-a746-e3d87e5830e8@acm.org |
---|---|
State | New |
Headers | show |
Series | RFC: Retrying SCSI pass-through commands | expand |
On 8/1/24 5:22 AM, Bart Van Assche wrote: > Hi, > > Recently I noticed that a particular UFS-based device does not resume > correctly. The logs of the device show that sd_start_stop_device() does > not retry the START STOP UNIT command if the device reports a unit > attention. I think that's a bug in the SCSI core. The following hack > makes resume work again. I think this confirms my understanding of this > issue (sd_start_stop_device() sets RQF_PM): > > diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c > index da7dac77f8cd..e21becc5bcf9 100644 > --- a/drivers/scsi/scsi_error.c > +++ b/drivers/scsi/scsi_error.c > @@ -1816,6 +1816,8 @@ bool scsi_noretry_cmd(struct scsi_cmnd *scmd) > * assume caller has checked sense and determined > * the check condition was retryable. > */ > + if (req->rq_flags & RQF_PM) > + return false; > if (req->cmd_flags & REQ_FAILFAST_DEV || blk_rq_is_passthrough(req)) > return true; > > My understanding is that SCSI pass-through commands submitted from > user space must not be retried. Are there any objections against > modifying the behavior of the SCSI core such that it retries > REQ_OP_DRV_* operations submitted by the SCSI core, as illustrated > by the pseudo-code below? Looking at the code, e.g. sd_start_stop_device(): res = scsi_execute_cmd(sdp, cmd, REQ_OP_DRV_IN, NULL, 0, SD_TIMEOUT, sdkp->max_retries, &exec_args); It seems that it is expected that the retry count will be honored. But that indeed is not the case as scsi_noretry_cmd() will always return false for REQ_OP_DRV_* commands. So may be we should have a RQF_USER_OP_DRV flag to differentiate user REQ_OP_DRV_* passthrough commands from internally issued REQ_OP_DRV_* commands. Or the reverse flag, e.g. RQF_INTERNAL_OP_DRV, that we can set in e.g. scsi_execute_cmnd(). > > diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c > index da7dac77f8cd..e21becc5bcf9 100644 > --- a/drivers/scsi/scsi_error.c > +++ b/drivers/scsi/scsi_error.c > @@ -1816,6 +1816,12 @@ bool scsi_noretry_cmd(struct scsi_cmnd *scmd) > * assume caller has checked sense and determined > * the check condition was retryable. > */ > - if (req->cmd_flags & REQ_FAILFAST_DEV || blk_rq_is_passthrough(req)) > - return true; > + if (req->cmd_flags & REQ_FAILFAST_DEV) > + return true; > + if (/* submitted by the SCSI core */) > + return false; > + if (blk_rq_is_passthrough(req)) > + return true; > > Thanks, > > Bart.
On 8/1/24 5:22 AM, Bart Van Assche wrote: > Hi, > > Recently I noticed that a particular UFS-based device does not resume > correctly. The logs of the device show that sd_start_stop_device() does > not retry the START STOP UNIT command if the device reports a unit > attention. I think that's a bug in the SCSI core. The following hack > makes resume work again. I think this confirms my understanding of this > issue (sd_start_stop_device() sets RQF_PM): > > diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c > index da7dac77f8cd..e21becc5bcf9 100644 > --- a/drivers/scsi/scsi_error.c > +++ b/drivers/scsi/scsi_error.c > @@ -1816,6 +1816,8 @@ bool scsi_noretry_cmd(struct scsi_cmnd *scmd) > * assume caller has checked sense and determined > * the check condition was retryable. > */ > + if (req->rq_flags & RQF_PM) > + return false; > if (req->cmd_flags & REQ_FAILFAST_DEV || blk_rq_is_passthrough(req)) > return true; > > My understanding is that SCSI pass-through commands submitted from > user space must not be retried. Are there any objections against > modifying the behavior of the SCSI core such that it retries > REQ_OP_DRV_* operations submitted by the SCSI core, as illustrated > by the pseudo-code below? Another thing: may be check scsi_check_passthrough(). See the call to that in scsi_execute_cmd() and how it is used to drive retry of the command. Tweaking that may be what you need for your UFS device problem ? > > diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c > index da7dac77f8cd..e21becc5bcf9 100644 > --- a/drivers/scsi/scsi_error.c > +++ b/drivers/scsi/scsi_error.c > @@ -1816,6 +1816,12 @@ bool scsi_noretry_cmd(struct scsi_cmnd *scmd) > * assume caller has checked sense and determined > * the check condition was retryable. > */ > - if (req->cmd_flags & REQ_FAILFAST_DEV || blk_rq_is_passthrough(req)) > - return true; > + if (req->cmd_flags & REQ_FAILFAST_DEV) > + return true; > + if (/* submitted by the SCSI core */) > + return false; > + if (blk_rq_is_passthrough(req)) > + return true; > > Thanks, > > Bart.
On 8/1/24 12:04 AM, Damien Le Moal wrote: > Another thing: may be check scsi_check_passthrough(). See the call to that in > scsi_execute_cmd() and how it is used to drive retry of the command. Tweaking > that may be what you need for your UFS device problem ? My understanding is that making scsi_check_passthrough() retry commands would require modifying all scsi_execute_cmd() callers. If the scsi_exec_args. failures pointer is NULL, the scsi_check_passthrough() function does not trigger a retry. If scsi_noretry_cmd() is modified, none of the scsi_execute_cmd() callers have to be modified. Thanks, Bart.
On 7/31/24 8:33 PM, Damien Le Moal wrote: > Looking at the code, e.g. sd_start_stop_device(): > > res = scsi_execute_cmd(sdp, cmd, REQ_OP_DRV_IN, NULL, 0, SD_TIMEOUT, > sdkp->max_retries, &exec_args); > > It seems that it is expected that the retry count will be honored. But that > indeed is not the case as scsi_noretry_cmd() will always return false for > REQ_OP_DRV_* commands. > > So may be we should have a RQF_USER_OP_DRV flag to differentiate user > REQ_OP_DRV_* passthrough commands from internally issued REQ_OP_DRV_* commands. > Or the reverse flag, e.g. RQF_INTERNAL_OP_DRV, that we can set in e.g. > scsi_execute_cmnd(). Hmm ... how about using the simplest possible patch? The patch below seems to work fine. Thanks, Bart. [PATCH] scsi: core: Retry commands submitted by the SCSI core Pass-through commands either come from user space (SG_IO ioctl, SCSI_IOCTL_*, BSG), from the SCSI core or from a SCSI driver (ch, cxlflash, pktcdvd, scsi_scan, scsi_dh_*, scsi_transport_spi, sd, sg, st, virtio_scsi). All this code sets scsi_cmnd.allowed to the maximum number of allowed retries. Hence, removing the blk_rq_is_passthrough() check from scsi_noretry_cmd() has the effect that scsi_cmnd.allowed is respected for pass-through commands. diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index 612489afe8d2..c09f65cfa898 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -1855,7 +1855,7 @@ bool scsi_noretry_cmd(struct scsi_cmnd *scmd) * assume caller has checked sense and determined * the check condition was retryable. */ - if (req->cmd_flags & REQ_FAILFAST_DEV || blk_rq_is_passthrough(req)) + if (req->cmd_flags & REQ_FAILFAST_DEV) return true; return false;
On 8/2/24 05:12, Bart Van Assche wrote: > On 7/31/24 8:33 PM, Damien Le Moal wrote: >> Looking at the code, e.g. sd_start_stop_device(): >> >> res = scsi_execute_cmd(sdp, cmd, REQ_OP_DRV_IN, NULL, 0, SD_TIMEOUT, >> sdkp->max_retries, &exec_args); >> >> It seems that it is expected that the retry count will be honored. But that >> indeed is not the case as scsi_noretry_cmd() will always return false for >> REQ_OP_DRV_* commands. >> >> So may be we should have a RQF_USER_OP_DRV flag to differentiate user >> REQ_OP_DRV_* passthrough commands from internally issued REQ_OP_DRV_* commands. >> Or the reverse flag, e.g. RQF_INTERNAL_OP_DRV, that we can set in e.g. >> scsi_execute_cmnd(). > > Hmm ... how about using the simplest possible patch? The patch below seems > to work fine. > > Thanks, > > Bart. > > [PATCH] scsi: core: Retry commands submitted by the SCSI core > > Pass-through commands either come from user space (SG_IO ioctl, > SCSI_IOCTL_*, BSG), from the SCSI core or from a SCSI driver (ch, > cxlflash, pktcdvd, scsi_scan, scsi_dh_*, scsi_transport_spi, sd, > sg, st, virtio_scsi). All this code sets scsi_cmnd.allowed to the > maximum number of allowed retries. Hence, removing the > blk_rq_is_passthrough() check from scsi_noretry_cmd() has the > effect that scsi_cmnd.allowed is respected for pass-through > commands. > > diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c > index 612489afe8d2..c09f65cfa898 100644 > --- a/drivers/scsi/scsi_error.c > +++ b/drivers/scsi/scsi_error.c > @@ -1855,7 +1855,7 @@ bool scsi_noretry_cmd(struct scsi_cmnd *scmd) > * assume caller has checked sense and determined > * the check condition was retryable. > */ > - if (req->cmd_flags & REQ_FAILFAST_DEV || blk_rq_is_passthrough(req)) > + if (req->cmd_flags & REQ_FAILFAST_DEV) > return true; But that will cause *all* passthrough requests to be retried, including those issued from userspace, no ? I do not think that such change would be acceptable as that can break userspace using passthrough and expecting to deal with errors and handling them however they see fit, which may not be retrying commands. > > return false; >
On 8/1/24 5:47 PM, Damien Le Moal wrote: > On 8/2/24 05:12, Bart Van Assche wrote: >> [PATCH] scsi: core: Retry commands submitted by the SCSI core >> >> Pass-through commands either come from user space (SG_IO ioctl, >> SCSI_IOCTL_*, BSG), from the SCSI core or from a SCSI driver (ch, >> cxlflash, pktcdvd, scsi_scan, scsi_dh_*, scsi_transport_spi, sd, >> sg, st, virtio_scsi). All this code sets scsi_cmnd.allowed to the >> maximum number of allowed retries. Hence, removing the >> blk_rq_is_passthrough() check from scsi_noretry_cmd() has the >> effect that scsi_cmnd.allowed is respected for pass-through >> commands. >> >> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c >> index 612489afe8d2..c09f65cfa898 100644 >> --- a/drivers/scsi/scsi_error.c >> +++ b/drivers/scsi/scsi_error.c >> @@ -1855,7 +1855,7 @@ bool scsi_noretry_cmd(struct scsi_cmnd *scmd) >> * assume caller has checked sense and determined >> * the check condition was retryable. >> */ >> - if (req->cmd_flags & REQ_FAILFAST_DEV || blk_rq_is_passthrough(req)) >> + if (req->cmd_flags & REQ_FAILFAST_DEV) >> return true; > > But that will cause *all* passthrough requests to be retried, including those > issued from userspace, no ? I do not think that such change would be acceptable > as that can break userspace using passthrough and expecting to deal with errors > and handling them however they see fit, which may not be retrying commands. In the SG_IO ioctl implementation I found the following assignment: scmd->allowed = 0; In drivers/scsi/sg.c I found the following (SG_DEFAULT_RETRIES == 0): scmd->allowed = SG_DEFAULT_RETRIES; Isn't that sufficient to prevent retries in the scsi_noretry_cmd() callers that check scmd->allowed? The only scsi_noretry_cmd() that does not check scmd->allowed is scsi_io_completion(). I propose to add an explicit blk_rq_is_passthrough() check in that function next to the scsi_noretry_cmd() call. Thanks, Bart.
On 7/31/24 3:22 PM, Bart Van Assche wrote: > > > diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c > index da7dac77f8cd..e21becc5bcf9 100644 > --- a/drivers/scsi/scsi_error.c > +++ b/drivers/scsi/scsi_error.c > @@ -1816,6 +1816,12 @@ bool scsi_noretry_cmd(struct scsi_cmnd *scmd) > * assume caller has checked sense and determined > * the check condition was retryable. > */ > - if (req->cmd_flags & REQ_FAILFAST_DEV || blk_rq_is_passthrough(req)) > - return true; > + if (req->cmd_flags & REQ_FAILFAST_DEV) > + return true; > + if (/* submitted by the SCSI core */) > + return false; > + if (blk_rq_is_passthrough(req)) > + return true; Do you just want to retry UAs or all internal passthrough command errors that go through here? For just the specific UA or NOT_READY/0x04/0x01 case for an example. Does every scsi core passthrough caller want that retried? If so, then I can see where you are coming from where you feel it's a bug. I would agree we would normally want to retry that in general. Maybe others know about some specific old case though. However, I'm not sure for MEDIUM_ERROR or ABORTED_COMMAND. I think MEDIUM_ERROR probably would not come up for the cases we are talking about though. I don't think we want to always retry DID_TIME_OUT though. The funny thing is that I think Martin just wanted to retry one specific case for that error. We had to do the scsi_check_passthrough patches so we could retry just for scsi_probe_lun.
On 2024/08/03 13:42, Mike Christie wrote: > On 7/31/24 3:22 PM, Bart Van Assche wrote: >> >> >> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c >> index da7dac77f8cd..e21becc5bcf9 100644 >> --- a/drivers/scsi/scsi_error.c >> +++ b/drivers/scsi/scsi_error.c >> @@ -1816,6 +1816,12 @@ bool scsi_noretry_cmd(struct scsi_cmnd *scmd) >> * assume caller has checked sense and determined >> * the check condition was retryable. >> */ >> - if (req->cmd_flags & REQ_FAILFAST_DEV || blk_rq_is_passthrough(req)) >> - return true; >> + if (req->cmd_flags & REQ_FAILFAST_DEV) >> + return true; >> + if (/* submitted by the SCSI core */) >> + return false; >> + if (blk_rq_is_passthrough(req)) >> + return true; > > Do you just want to retry UAs or all internal passthrough command > errors that go through here? I think that we definitely do *not* want to retry all internal commands. E.g. I do not see the point in retrying INQUIRY or VPD page accesses failing on device scan. > > For just the specific UA or NOT_READY/0x04/0x01 case for an example. > Does every scsi core passthrough caller want that retried? If so, then > I can see where you are coming from where you feel it's a bug. I would > agree we would normally want to retry that in general. Maybe others know > about some specific old case though. I think the command Bart had a problem with is START_STOP_UNIT. If that is the only case causing problems, then we should probably improve sd_start_stop_device() to allow retries. But if there are more commands that needs the same, then the patch that Bart is proposing makes sense I think, but it will require an audit of all REQ_OP_DRV_XXX internal users to make sure that they all use that command with -> allowed set to 0. But hopefully, most cases go through scsi_execute_cmnd(), which should simplify this audit. > > However, I'm not sure for MEDIUM_ERROR or ABORTED_COMMAND. > I think MEDIUM_ERROR probably would not come up for the cases we are > talking about though. > > I don't think we want to always retry DID_TIME_OUT though. The funny > thing is that I think Martin just wanted to retry one specific case > for that error. We had to do the scsi_check_passthrough patches so > we could retry just for scsi_probe_lun.
On 8/5/24 10:47 AM, Damien Le Moal wrote: > On 2024/08/03 13:42, Mike Christie wrote: >> Do you just want to retry UAs or all internal passthrough command >> errors that go through here? > > I think that we definitely do *not* want to retry all internal commands. E.g. I > do not see the point in retrying INQUIRY or VPD page accesses failing on device > scan. > >> >> For just the specific UA or NOT_READY/0x04/0x01 case for an example. >> Does every scsi core passthrough caller want that retried? If so, then >> I can see where you are coming from where you feel it's a bug. I would >> agree we would normally want to retry that in general. Maybe others know >> about some specific old case though. > > I think the command Bart had a problem with is START_STOP_UNIT. If that is the > only case causing problems, then we should probably improve > sd_start_stop_device() to allow retries. But if there are more commands that > needs the same, then the patch that Bart is proposing makes sense I think, but > it will require an audit of all REQ_OP_DRV_XXX internal users to make sure that > they all use that command with -> allowed set to 0. But hopefully, most cases go > through scsi_execute_cmnd(), which should simplify this audit. Thanks Mike and Damien for having shared your thoughts. I only need the START STOP UNIT command to be retried. So the lowest risk approach is to introduce a mechanism that only causes that command to be retried. I plan to look into this and prepare a patch. Bart.
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index da7dac77f8cd..e21becc5bcf9 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -1816,6 +1816,8 @@ bool scsi_noretry_cmd(struct scsi_cmnd *scmd) * assume caller has checked sense and determined * the check condition was retryable. */ + if (req->rq_flags & RQF_PM) + return false; if (req->cmd_flags & REQ_FAILFAST_DEV || blk_rq_is_passthrough(req)) return true; My understanding is that SCSI pass-through commands submitted from user space must not be retried. Are there any objections against modifying the behavior of the SCSI core such that it retries REQ_OP_DRV_* operations submitted by the SCSI core, as illustrated by the pseudo-code below? diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index da7dac77f8cd..e21becc5bcf9 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -1816,6 +1816,12 @@ bool scsi_noretry_cmd(struct scsi_cmnd *scmd) * assume caller has checked sense and determined * the check condition was retryable. */ - if (req->cmd_flags & REQ_FAILFAST_DEV || blk_rq_is_passthrough(req)) - return true; + if (req->cmd_flags & REQ_FAILFAST_DEV) + return true; + if (/* submitted by the SCSI core */) + return false; + if (blk_rq_is_passthrough(req)) + return true; Thanks,