diff mbox series

[v11,04/16] scsi: core: Introduce a mechanism for reordering requests in the error handler

Message ID 20230822191822.337080-5-bvanassche@acm.org
State Superseded
Headers show
Series Improve write performance for zoned UFS devices | expand

Commit Message

Bart Van Assche Aug. 22, 2023, 7:16 p.m. UTC
Introduce the .eh_needs_prepare_resubmit and the .eh_prepare_resubmit
function pointers in struct scsi_driver. Make the error handler call
.eh_prepare_resubmit() before resubmitting commands if any of the
.eh_needs_prepare_resubmit() invocations return true. A later patch
will use this functionality to sort SCSI commands by LBA from inside
the SCSI disk driver before these are resubmitted by the error handler.

Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/scsi/scsi_error.c  | 65 ++++++++++++++++++++++++++++++++++++++
 drivers/scsi/scsi_priv.h   |  1 +
 include/scsi/scsi_driver.h |  2 ++
 3 files changed, 68 insertions(+)

Comments

Damien Le Moal Aug. 23, 2023, 11:22 p.m. UTC | #1
On 8/24/23 00:15, Bart Van Assche wrote:
> On 8/22/23 23:26, Hannes Reinecke wrote:
>> On 8/22/23 21:16, Bart Van Assche wrote:
>>> +/*
>>> + * Comparison function that allows to sort SCSI commands by ULD driver.
>>> + */
>>> +static int scsi_cmp_uld(void *priv, const struct list_head *_a,
>>> +            const struct list_head *_b)
>>> +{
>>> +    struct scsi_cmnd *a = list_entry(_a, typeof(*a), eh_entry);
>>> +    struct scsi_cmnd *b = list_entry(_b, typeof(*b), eh_entry);
>>> +
>>> +    /* See also the comment above the list_sort() definition. */
>>> +    return scsi_cmd_to_driver(a) > scsi_cmd_to_driver(b);
>>
>> I have to agree with Christoph here.
>> Comparing LBA numbers at the SCSI level is really the wrong place.
>> SCSI commands might be anything, and quite some of these commands don't
>> even have LBA numbers. So trying to order them will be pointless.
>>
>> The reordering mechanism really has to go into the block layer, with
>> the driver failing the request and the block layer resubmitting in-order.
> 
> Hi Hannes,
> 
> Please take another look at patches 04/16 and 05/16. As one can see no
> LBA numbers are being compared in the SCSI core - comparing LBA numbers
> happens in the sd (SCSI disk) driver. The code that you replied to
> compares ULD pointers, a well-defined concept in the SCSI core.
> 
> Your request to move the functionality from patches 04/16 and 05/16 into
> the block layer would involve the following:
> * Report the unaligned write errors (because a write did not happen at the
>    write pointer) to the block layer (BLK_STS_WP_MISMATCH?).
> * Introduce a mechanism in the block layer for postponing error handling
>    until all outstanding commands have failed. The approach from the SCSI
>    core (tracking the number of failed and the number of busy commands
>    and only waking up the error handler after these counters are equal)
>    would be unacceptable because of the runtime overhead this mechanism
>    would introduce in the block layer hot path. Additionally, I strongly
>    doubt that it is possible to introduce any mechanism for postponing
>    error handling in the block layer without introducing additional
>    overhead in the hot path.
> * Christoph's opinion is that NVMe software should use zone append
>    (REQ_OP_ZONE_APPEND) instead of regular writes (REQ_OP_WRITE) when
>    writing to a zoned namespace. So the SCSI subsystem would be the only
>    user of the new mechanism introduced in the block layer. The reason we
>    chose REQ_OP_WRITE for zoned UFS devices is because the SCSI standard
>    does not support a zone append command and introducing a zone append
>    command in the SCSI standards is not something that can be realized in
>    time for the first generation of zoned UFS devices.

The sd driver does have zone append emulation using regular writes. The
emulation relies on zone write locking to avoid issues with adapters that do not
have strong ordering guarantees, but that could be adapted to be removed for UFS
devices with write ordering guarantees. This solution would greatly simplify
your series since zone append requests are not subject to zone write locking at
the block layer. So no changes would be needed at that layer.

However, that implies that your preferred use case (f2fs) must be adapted to use
zone append instead of regular writes. That in itself may be a bigger-ish
change, but in the long run, likely a better one I think as that would be
compatible with NVMe ZNS and also future UFS standards defining a zone append
command.

> 
> Because I assume that both Jens and Christoph disagree strongly with your
> request: I have no plans to move the code for sorting zoned writes into
> the block layer core.
> 
> Jens and Christoph, please correct me if I misunderstood something.
> 
> Thanks,
> 
> Bart.
Bart Van Assche Aug. 24, 2023, 2:47 p.m. UTC | #2
On 8/23/23 16:22, Damien Le Moal wrote:
> The sd driver does have zone append emulation using regular writes. The
> emulation relies on zone write locking to avoid issues with adapters that do not
> have strong ordering guarantees, but that could be adapted to be removed for UFS
> devices with write ordering guarantees. This solution would greatly simplify
> your series since zone append requests are not subject to zone write locking at
> the block layer. So no changes would be needed at that layer.
> 
> However, that implies that your preferred use case (f2fs) must be adapted to use
> zone append instead of regular writes. That in itself may be a bigger-ish
> change, but in the long run, likely a better one I think as that would be
> compatible with NVMe ZNS and also future UFS standards defining a zone append
> command.

Hi Damien,

Thanks for the feedback. I agree that it would be great to have zone append
support in F2FS. However, I do not agree that switching from regular writes
to zone append in F2FS would remove the need for sorting SCSI commands by LBA
in the SCSI error handler. Even if F2FS would submit zoned writes then the
following mechanisms could still cause reordering of the zoned writes after
these have been translated into regular writes:
* The SCSI protocol allows SCSI devices, including UFS devices, to respond
with a unit attention or the SCSI BUSY status at any time. If multiple write
commands are pending and some of the pending SCSI writes are not executed
because of a unit attention or because of another reason, this causes
command reordering.
* Although the link between the UFS controller and the UFS device is pretty
reliable, there is a non-zero chance that a SCSI command is lost. If this
happens the SCSI timeout and error handlers are activated. This can cause
reordering of write commands.

In other words, whether F2FS submits regular writes (REQ_OP_WRITE) or zone
appends (REQ_OP_ZONE_APPEND), I think we need the entire patch series.

Thanks,

Bart.
Hannes Reinecke Aug. 24, 2023, 4:44 p.m. UTC | #3
On 8/24/23 16:47, Bart Van Assche wrote:
> On 8/23/23 16:22, Damien Le Moal wrote:
>> The sd driver does have zone append emulation using regular writes. The
>> emulation relies on zone write locking to avoid issues with adapters 
>> that do not
>> have strong ordering guarantees, but that could be adapted to be 
>> removed for UFS
>> devices with write ordering guarantees. This solution would greatly 
>> simplify
>> your series since zone append requests are not subject to zone write 
>> locking at
>> the block layer. So no changes would be needed at that layer.
>>
>> However, that implies that your preferred use case (f2fs) must be 
>> adapted to use
>> zone append instead of regular writes. That in itself may be a bigger-ish
>> change, but in the long run, likely a better one I think as that would be
>> compatible with NVMe ZNS and also future UFS standards defining a zone 
>> append
>> command.
> 
> Hi Damien,
> 
> Thanks for the feedback. I agree that it would be great to have zone append
> support in F2FS. However, I do not agree that switching from regular writes
> to zone append in F2FS would remove the need for sorting SCSI commands 
> by LBA in the SCSI error handler. Even if F2FS would submit zoned writes
> then the following mechanisms could still cause reordering of the zoned
> write after these have been translated into regular writes:
> * The SCSI protocol allows SCSI devices, including UFS devices, to respond
> with a unit attention or the SCSI BUSY status at any time. If multiple 
> write commands are pending and some of the pending SCSI writes are not
> executed because of a unit attention or because of another reason, this
> causes command reordering.

Yes. But the important thing to remember is that with 'zone append' the 
resulting LBA will be returned on completion, they will _not_ be 
specified in the submission. So any command reordering doesn't affect 
the zone append commands as they heven't been written yet.

> * Although the link between the UFS controller and the UFS device is pretty
> reliable, there is a non-zero chance that a SCSI command is lost. If this
> happens the SCSI timeout and error handlers are activated. This can cause
> reordering of write commands.
> 
Again, reordering is not an issue with zone append. With zone append you 
specify in which zone the command should land, and upon completion the 
LBA where the data is written will be returned.

So if there is an error the command has not been written, consequently 
there is no LBA to worry about, and you can reorder at will.

Cheers,

Hannes
Bart Van Assche Aug. 24, 2023, 4:52 p.m. UTC | #4
On 8/24/23 09:44, Hannes Reinecke wrote:
> On 8/24/23 16:47, Bart Van Assche wrote:
>> Thanks for the feedback. I agree that it would be great to have zone 
>> append
>> support in F2FS. However, I do not agree that switching from regular 
>> writes
>> to zone append in F2FS would remove the need for sorting SCSI commands 
>> by LBA in the SCSI error handler. Even if F2FS would submit zoned writes
>> then the following mechanisms could still cause reordering of the zoned
>> write after these have been translated into regular writes:
>> * The SCSI protocol allows SCSI devices, including UFS devices, to 
>> respond
>> with a unit attention or the SCSI BUSY status at any time. If multiple 
>> write commands are pending and some of the pending SCSI writes are not
>> executed because of a unit attention or because of another reason, this
>> causes command reordering.
> 
> Yes. But the important thing to remember is that with 'zone append' the 
> resulting LBA will be returned on completion, they will _not_ be 
> specified in the submission. So any command reordering doesn't affect 
> the zone append commands as they heven't been written yet.
> 
>> * Although the link between the UFS controller and the UFS device is 
>> pretty
>> reliable, there is a non-zero chance that a SCSI command is lost. If this
>> happens the SCSI timeout and error handlers are activated. This can cause
>> reordering of write commands.
>>
> Again, reordering is not an issue with zone append. With zone append you 
> specify in which zone the command should land, and upon completion the 
> LBA where the data is written will be returned.
> 
> So if there is an error the command has not been written, consequently 
> there is no LBA to worry about, and you can reorder at will.

Hi Hannes,

I agree that reordering is not an issue for NVMe zone append commands.
It is an issue however with SCSI devices because there is no zone append
command in the SCSI command set. The sd_zbc.c code translates zone
appends (REQ_OP_ZONE_APPEND) into regular WRITE commands. If these WRITE
commands are reordered, the ZBC standard requires that these commands
fail with an UNALIGNED WRITE error. So I think for SCSI devices what you
wrote is wrong.

Bart.
Damien Le Moal Aug. 24, 2023, 11:53 p.m. UTC | #5
On 8/25/23 01:52, Bart Van Assche wrote:
> On 8/24/23 09:44, Hannes Reinecke wrote:
>> On 8/24/23 16:47, Bart Van Assche wrote:
>>> Thanks for the feedback. I agree that it would be great to have zone 
>>> append
>>> support in F2FS. However, I do not agree that switching from regular 
>>> writes
>>> to zone append in F2FS would remove the need for sorting SCSI commands 
>>> by LBA in the SCSI error handler. Even if F2FS would submit zoned writes
>>> then the following mechanisms could still cause reordering of the zoned
>>> write after these have been translated into regular writes:
>>> * The SCSI protocol allows SCSI devices, including UFS devices, to 
>>> respond
>>> with a unit attention or the SCSI BUSY status at any time. If multiple 
>>> write commands are pending and some of the pending SCSI writes are not
>>> executed because of a unit attention or because of another reason, this
>>> causes command reordering.
>>
>> Yes. But the important thing to remember is that with 'zone append' the 
>> resulting LBA will be returned on completion, they will _not_ be 
>> specified in the submission. So any command reordering doesn't affect 
>> the zone append commands as they heven't been written yet.
>>
>>> * Although the link between the UFS controller and the UFS device is 
>>> pretty
>>> reliable, there is a non-zero chance that a SCSI command is lost. If this
>>> happens the SCSI timeout and error handlers are activated. This can cause
>>> reordering of write commands.
>>>
>> Again, reordering is not an issue with zone append. With zone append you 
>> specify in which zone the command should land, and upon completion the 
>> LBA where the data is written will be returned.
>>
>> So if there is an error the command has not been written, consequently 
>> there is no LBA to worry about, and you can reorder at will.
> 
> Hi Hannes,
> 
> I agree that reordering is not an issue for NVMe zone append commands.
> It is an issue however with SCSI devices because there is no zone append
> command in the SCSI command set. The sd_zbc.c code translates zone
> appends (REQ_OP_ZONE_APPEND) into regular WRITE commands. If these WRITE
> commands are reordered, the ZBC standard requires that these commands
> fail with an UNALIGNED WRITE error. So I think for SCSI devices what you
> wrote is wrong.

I think that Hannes point was that if you ensure that the rejected regular write
commands used to emulate zone append when requeued go through the sd driver
again when resubmitted, they will be changed again to emulate the original zone
append using the latest wp location, which is assumed correct. And that does not
depend on the ordering. So requeuing these regular writes does not need sorting.
It can be in any order. The constraint is of course that they must be re-preped
from the original REQ_OP_ZONE_APPEND every time they are requeued.
Bart Van Assche Aug. 25, 2023, 1 a.m. UTC | #6
On 8/24/23 16:53, Damien Le Moal wrote:
> I think that Hannes point was that if you ensure that the rejected regular write
> commands used to emulate zone append when requeued go through the sd driver
> again when resubmitted, they will be changed again to emulate the original zone
> append using the latest wp location, which is assumed correct. And that does not
> depend on the ordering. So requeuing these regular writes does not need sorting.
> It can be in any order. The constraint is of course that they must be re-preped
> from the original REQ_OP_ZONE_APPEND every time they are requeued.

Hi Damien,

Although this is a possible approach, I do not agree that this is a better
approach. Compared to the patch series that I posted, the disadvantages of
this approach are as follows:
* It relies on the zone append emulation. Although that emulation is not too
   complicated, it adds more code in the hot path. The spin lock in the zone
   append emulation is expected to cause performance issues at the performance
   levels supported by UFS devices. Current UFS devices already support more
   than 300 K IOPS and future UFS devices are expected to support even higher
   performance.
* It would be challenging to derive the write pointer in the error handler
   without submitting a REPORT ZONES command to the storage device. This
   means that the error handler becomes more complicated.

In summary: the feedback is appreciated but I do not agree that the zone
append approach is a better approach for UFS devices than the approach of
this patch series.

Thanks,

Bart.
diff mbox series

Patch

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index c67cdcdc3ba8..c4d817f044a0 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -27,6 +27,7 @@ 
 #include <linux/blkdev.h>
 #include <linux/delay.h>
 #include <linux/jiffies.h>
+#include <linux/list_sort.h>
 
 #include <scsi/scsi.h>
 #include <scsi/scsi_cmnd.h>
@@ -2186,6 +2187,68 @@  void scsi_eh_ready_devs(struct Scsi_Host *shost,
 }
 EXPORT_SYMBOL_GPL(scsi_eh_ready_devs);
 
+/*
+ * Returns true if .eh_prepare_resubmit should be called for the commands in
+ * @done_q.
+ */
+static bool scsi_needs_preparation(struct list_head *done_q)
+{
+	struct scsi_cmnd *scmd;
+
+	list_for_each_entry(scmd, done_q, eh_entry) {
+		struct scsi_driver *uld = scsi_cmd_to_driver(scmd);
+		bool (*npr)(struct scsi_cmnd *) = uld->eh_needs_prepare_resubmit;
+
+		if (npr && npr(scmd))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Comparison function that allows to sort SCSI commands by ULD driver.
+ */
+static int scsi_cmp_uld(void *priv, const struct list_head *_a,
+			const struct list_head *_b)
+{
+	struct scsi_cmnd *a = list_entry(_a, typeof(*a), eh_entry);
+	struct scsi_cmnd *b = list_entry(_b, typeof(*b), eh_entry);
+
+	/* See also the comment above the list_sort() definition. */
+	return scsi_cmd_to_driver(a) > scsi_cmd_to_driver(b);
+}
+
+void scsi_call_prepare_resubmit(struct list_head *done_q)
+{
+	struct scsi_cmnd *scmd, *next;
+
+	if (!scsi_needs_preparation(done_q))
+		return;
+
+	/* Sort pending SCSI commands by ULD. */
+	list_sort(NULL, done_q, scsi_cmp_uld);
+
+	/*
+	 * Call .eh_prepare_resubmit for each range of commands with identical
+	 * ULD driver pointer.
+	 */
+	list_for_each_entry_safe(scmd, next, done_q, eh_entry) {
+		struct scsi_driver *uld = scsi_cmd_to_driver(scmd);
+		struct list_head *prev, uld_cmd_list;
+
+		while (&next->eh_entry != done_q &&
+		       scsi_cmd_to_driver(next) == uld)
+			next = list_next_entry(next, eh_entry);
+		if (!uld->eh_prepare_resubmit)
+			continue;
+		prev = scmd->eh_entry.prev;
+		list_cut_position(&uld_cmd_list, prev, next->eh_entry.prev);
+		uld->eh_prepare_resubmit(&uld_cmd_list);
+		list_splice(&uld_cmd_list, prev);
+	}
+}
+
 /**
  * scsi_eh_flush_done_q - finish processed commands or retry them.
  * @done_q:	list_head of processed commands.
@@ -2194,6 +2257,8 @@  void scsi_eh_flush_done_q(struct list_head *done_q)
 {
 	struct scsi_cmnd *scmd, *next;
 
+	scsi_call_prepare_resubmit(done_q);
+
 	list_for_each_entry_safe(scmd, next, done_q, eh_entry) {
 		list_del_init(&scmd->eh_entry);
 		if (scsi_device_online(scmd->device) &&
diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h
index f42388ecb024..df4af4645430 100644
--- a/drivers/scsi/scsi_priv.h
+++ b/drivers/scsi/scsi_priv.h
@@ -101,6 +101,7 @@  int scsi_eh_get_sense(struct list_head *work_q,
 		      struct list_head *done_q);
 bool scsi_noretry_cmd(struct scsi_cmnd *scmd);
 void scsi_eh_done(struct scsi_cmnd *scmd);
+void scsi_call_prepare_resubmit(struct list_head *done_q);
 
 /* scsi_lib.c */
 extern int scsi_maybe_unblock_host(struct scsi_device *sdev);
diff --git a/include/scsi/scsi_driver.h b/include/scsi/scsi_driver.h
index 4ce1988b2ba0..00ffa470724a 100644
--- a/include/scsi/scsi_driver.h
+++ b/include/scsi/scsi_driver.h
@@ -18,6 +18,8 @@  struct scsi_driver {
 	int (*done)(struct scsi_cmnd *);
 	int (*eh_action)(struct scsi_cmnd *, int);
 	void (*eh_reset)(struct scsi_cmnd *);
+	bool (*eh_needs_prepare_resubmit)(struct scsi_cmnd *cmd);
+	void (*eh_prepare_resubmit)(struct list_head *cmd_list);
 };
 #define to_scsi_driver(drv) \
 	container_of((drv), struct scsi_driver, gendrv)