Message ID | 20230804154821.3232094-4-bvanassche@acm.org |
---|---|
State | Superseded |
Headers | show |
Series | Improve performance for zoned UFS devices | expand |
Hi Bart! > If zoned writes (REQ_OP_WRITE) for a sequential write required zone > have a starting LBA that differs from the write pointer, e.g. because > zoned writes have been reordered, then the storage device will respond > with an UNALIGNED WRITE COMMAND error. Send commands that failed with > an unaligned write error to the SCSI error handler if zone write > locking is disabled. Let the SCSI error handler sort SCSI commands per > LBA before resubmitting these. > > If zone write locking is disabled, increase the number of retries for > write commands sent to a sequential zone to the maximum number of > outstanding commands because in the worst case the number of times > reordered zoned writes have to be retried is (number of outstanding > writes per sequential zone) - 1. I am afraid that I find falling back to rely on the error handler pretty kludgy. It seems like there would be a more straightforward way ensure that request ordering is preserved for devices that are known not to reorder internally. I probably missed the finer details of what was discussed while I was away. But why can't we address the specific corner cases that cause the unexpected reordering at the block layer? Sorting requests in the SCSI error handler after a reported failure just seems like papering over the fact that there's a problem elsewhere.
On 8/7/23 19:24, Martin K. Petersen wrote: >> If zoned writes (REQ_OP_WRITE) for a sequential write required zone >> have a starting LBA that differs from the write pointer, e.g. because >> zoned writes have been reordered, then the storage device will respond >> with an UNALIGNED WRITE COMMAND error. Send commands that failed with >> an unaligned write error to the SCSI error handler if zone write >> locking is disabled. Let the SCSI error handler sort SCSI commands per >> LBA before resubmitting these. >> >> If zone write locking is disabled, increase the number of retries for >> write commands sent to a sequential zone to the maximum number of >> outstanding commands because in the worst case the number of times >> reordered zoned writes have to be retried is (number of outstanding >> writes per sequential zone) - 1. > > I am afraid that I find falling back to rely on the error handler pretty > kludgy. It seems like there would be a more straightforward way ensure > that request ordering is preserved for devices that are known not to > reorder internally. > > I probably missed the finer details of what was discussed while I was > away. But why can't we address the specific corner cases that cause the > unexpected reordering at the block layer? Sorting requests in the SCSI > error handler after a reported failure just seems like papering over the > fact that there's a problem elsewhere. Hi Martin, An important question is whether it is possible to preserve the write order all the time. The software layers and hardware components that are involved in this context are: * The filesystem. * The block layer. * The I/O scheduler if an I/O scheduler is present. * The SCSI core. * The SCSI LLD. * The storage controller (UFSHCI in this case). * The link between storage controller and storage device. * The storage device (UFS in this case). The SCSI protocol allows SCSI devices, including UFS devices, to respond with a unit attention or the SCSI BUSY status at any time. If multiple write commands are pending and some of the pending SCSI commands are not executed because of a unit attention or because of another reason, this causes command reordering. The link between UFS controller and UFS device has a low but non-zero BER. If a SCSI command is lost by this link and has to be resent, this can cause reordering. Although I agree that the code in this patch that sorts and resubmits requests should be triggered infrequently, I don't think that such code can be avoided entirely. You may have noticed that a significant effort has been undertaken to eliminate certain causes of command reordering. See also: * [PATCH v4 0/5] ufs: Do not requeue while ungating the clock (https://lore.kernel.org/linux-scsi/20230529202640.11883-1-bvanassche@acm.org/). * [PATCH v6 00/11] mq-deadline: Improve support for zoned block devices (https://lore.kernel.org/linux-block/20230517174230.897144-1-bvanassche@acm.org/) * less special casing for flush requests v2 (https://lore.kernel.org/linux-block/20230519044050.107790-1-hch@lst.de/) As you may be aware performance matters for UFS devices and performance of UFS devices increases gradually over time. It is important that the code added by this patch is triggered infrequently to achieve good performance so I have an interest myself in making sure that this code is triggered infrequently in current and also in future kernels. Since I think that it is not possible to avoid sorting and resubmitting requests entirely, I propose to proceed with the approach of this patch series. Thanks, Bart.
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index c67cdcdc3ba8..766a576b9be6 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -27,6 +27,7 @@ #include <linux/blkdev.h> #include <linux/delay.h> #include <linux/jiffies.h> +#include <linux/list_sort.h> #include <scsi/scsi.h> #include <scsi/scsi_cmnd.h> @@ -698,6 +699,16 @@ enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd) fallthrough; case ILLEGAL_REQUEST: + /* + * Unaligned write command. This may indicate that zoned writes + * have been received by the device in the wrong order. If zone + * write locking is disabled, retry after all pending commands + * have completed. + */ + if (sshdr.asc == 0x21 && sshdr.ascq == 0x04 && + blk_queue_no_zone_write_lock(scsi_cmd_to_rq(scmd)->q)) + return NEEDS_DELAYED_RETRY; + if (sshdr.asc == 0x20 || /* Invalid command operation code */ sshdr.asc == 0x21 || /* Logical block address out of range */ sshdr.asc == 0x22 || /* Invalid function */ @@ -2186,6 +2197,25 @@ void scsi_eh_ready_devs(struct Scsi_Host *shost, } EXPORT_SYMBOL_GPL(scsi_eh_ready_devs); +/* + * Returns a negative value if @_a has a lower starting sector than @_b, zero if + * both have the same starting sector and a positive value otherwise. + */ +static int scsi_cmp_sector(void *priv, const struct list_head *_a, + const struct list_head *_b) +{ + struct scsi_cmnd *a = list_entry(_a, typeof(*a), eh_entry); + struct scsi_cmnd *b = list_entry(_b, typeof(*b), eh_entry); + const sector_t pos_a = blk_rq_pos(scsi_cmd_to_rq(a)); + const sector_t pos_b = blk_rq_pos(scsi_cmd_to_rq(b)); + + if (pos_a < pos_b) + return -1; + if (pos_a > pos_b) + return 1; + return 0; +} + /** * scsi_eh_flush_done_q - finish processed commands or retry them. * @done_q: list_head of processed commands. @@ -2194,6 +2224,13 @@ void scsi_eh_flush_done_q(struct list_head *done_q) { struct scsi_cmnd *scmd, *next; + /* + * Sort pending SCSI commands in starting sector order. This is + * important if one of the SCSI devices associated with @shost is a + * zoned block device for which zone write locking is disabled. + */ + list_sort(NULL, done_q, scsi_cmp_sector); + list_for_each_entry_safe(scmd, next, done_q, eh_entry) { list_del_init(&scmd->eh_entry); if (scsi_device_online(scmd->device) && diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 59176946ab56..69da8aee13df 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -1443,6 +1443,7 @@ static void scsi_complete(struct request *rq) case ADD_TO_MLQUEUE: scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY); break; + case NEEDS_DELAYED_RETRY: default: scsi_eh_scmd_add(cmd); break; diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 68b12afa0721..27b9ebe05b90 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -1235,6 +1235,9 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd) cmd->transfersize = sdp->sector_size; cmd->underflow = nr_blocks << 9; cmd->allowed = sdkp->max_retries; + if (blk_queue_no_zone_write_lock(rq->q) && + blk_rq_is_seq_zoned_write(rq)) + cmd->allowed += rq->q->nr_requests; cmd->sdb.length = nr_blocks * sdp->sector_size; SCSI_LOG_HLQUEUE(1, diff --git a/include/scsi/scsi.h b/include/scsi/scsi.h index ec093594ba53..6600db046227 100644 --- a/include/scsi/scsi.h +++ b/include/scsi/scsi.h @@ -93,6 +93,7 @@ static inline int scsi_status_is_check_condition(int status) * Internal return values. */ enum scsi_disposition { + NEEDS_DELAYED_RETRY = 0x2000, NEEDS_RETRY = 0x2001, SUCCESS = 0x2002, FAILED = 0x2003,