mbox series

[v7,0/7] Improve performance for zoned UFS devices

Message ID 20230809202355.1171455-1-bvanassche@acm.org
Headers show
Series Improve performance for zoned UFS devices | expand

Message

Bart Van Assche Aug. 9, 2023, 8:23 p.m. UTC
Hi Jens,

This patch series improves small write IOPS by a factor of four (+300%) for
zoned UFS devices on my test setup with a UFSHCI 3.0 controller. Please
consider this patch series for the next merge window.

Thank you,

Bart.

Changes compared to v6:
 - Removed QUEUE_FLAG_NO_ZONE_WRITE_LOCK and instead introduced a flag in
   the request queue limits data structure.

Changes compared to v5:
 - Renamed scsi_cmp_lba() into scsi_cmp_sector().
 - Improved several source code comments.

Changes compared to v4:
 - Dropped the patch that introduces the REQ_NO_ZONE_WRITE_LOCK flag.
 - Dropped the null_blk patch and added two scsi_debug patches instead.
 - Dropped the f2fs patch.
 - Split the patch for the UFS driver into two patches.
 - Modified several patch descriptions and source code comments.
 - Renamed dd_use_write_locking() into dd_use_zone_write_locking().
 - Moved the list_sort() call from scsi_unjam_host() into scsi_eh_flush_done_q()
   such that sorting happens just before reinserting.
 - Removed the scsi_cmd_retry_allowed() call from scsi_check_sense() to make
   sure that the retry counter is adjusted once per retry instead of twice.

Changes compared to v3:
 - Restored the patch that introduces QUEUE_FLAG_NO_ZONE_WRITE_LOCK. That patch
   had accidentally been left out from v2.
 - In patch "block: Introduce the flag REQ_NO_ZONE_WRITE_LOCK", improved the
   patch description and added the function blk_no_zone_write_lock().
 - In patch "block/mq-deadline: Only use zone locking if necessary", moved the
   blk_queue_is_zoned() call into dd_use_write_locking().
 - In patch "fs/f2fs: Disable zone write locking", set REQ_NO_ZONE_WRITE_LOCK
   from inside __bio_alloc() instead of in f2fs_submit_write_bio().

Changes compared to v2:
 - Renamed the request queue flag for disabling zone write locking.
 - Introduced a new request flag for disabling zone write locking.
 - Modified the mq-deadline scheduler such that zone write locking is only
   disabled if both flags are set.
 - Added an F2FS patch that sets the request flag for disabling zone write
   locking.
 - Only disable zone write locking in the UFS driver if auto-hibernation is
   disabled.

Changes compared to v1:
 - Left out the patches that are already upstream.
 - Switched the approach in patch "scsi: Retry unaligned zoned writes" from
   retrying immediately to sending unaligned write commands to the SCSI error
   handler.

Bart Van Assche (7):
  block: Introduce the use_zone_write_lock member variable
  block/mq-deadline: Only use zone locking if necessary
  scsi: core: Retry unaligned zoned writes
  scsi: scsi_debug: Support disabling zone write locking
  scsi: scsi_debug: Support injecting unaligned write errors
  scsi: ufs: Split an if-condition
  scsi: ufs: Disable zone write locking

 block/blk-settings.c      |  6 ++++++
 block/mq-deadline.c       | 24 ++++++++++++++++------
 drivers/scsi/scsi_debug.c | 20 ++++++++++++++++++-
 drivers/scsi/scsi_error.c | 37 ++++++++++++++++++++++++++++++++++
 drivers/scsi/scsi_lib.c   |  1 +
 drivers/scsi/sd.c         |  3 +++
 drivers/ufs/core/ufshcd.c | 42 ++++++++++++++++++++++++++++++++++++---
 include/linux/blkdev.h    |  1 +
 include/scsi/scsi.h       |  1 +
 9 files changed, 125 insertions(+), 10 deletions(-)

Comments

Damien Le Moal Aug. 10, 2023, 1:33 a.m. UTC | #1
On 8/10/23 05:23, Bart Van Assche wrote:
> Writes in sequential write required zones must happen at the write
> pointer. Even if the submitter of the write commands (e.g. a filesystem)
> submits writes for sequential write required zones in order, the block
> layer or the storage controller may reorder these write commands.
> 
> The zone locking mechanism in the mq-deadline I/O scheduler serializes
> write commands for sequential zones. Some but not all storage controllers
> require this serialization. Introduce a new request queue limit member
> variable to allow block drivers to indicate that they preserve the order
> of write commands and thus do not require serialization of writes per
> zone.
> 
> Cc: Damien Le Moal <dlemoal@kernel.org>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/blk-settings.c   | 6 ++++++
>  include/linux/blkdev.h | 1 +
>  2 files changed, 7 insertions(+)
> 
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 0046b447268f..b75c97971860 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -56,6 +56,7 @@ void blk_set_default_limits(struct queue_limits *lim)
>  	lim->alignment_offset = 0;
>  	lim->io_opt = 0;
>  	lim->misaligned = 0;
> +	lim->use_zone_write_lock = true;
>  	lim->zoned = BLK_ZONED_NONE;

Given that the default for zoned is BLK_ZONED_NONE, having use_zone_write_lock
default to true is strange. It would be better to set the default to false and
have disk_set_zoned() set it to true if needed, with an additional argument to
specify if it should be the case or not. E.g., for SMR drives, sd.c would call
something like:

disk_set_zoned(sdkp->disk, BLK_ZONED_HM, sdp->use_zone_write_lock);

sd.c would default to sdp->use_zone_write_lock == true and UFS driver can set
it to false. That would be cleaner I think.

>  	lim->zone_write_granularity = 0;
>  	lim->dma_alignment = 511;
> @@ -685,6 +686,11 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>  						   b->max_secure_erase_sectors);
>  	t->zone_write_granularity = max(t->zone_write_granularity,
>  					b->zone_write_granularity);
> +	/*
> +	 * Whether or not the zone write lock should be used depends on the
> +	 * bottom driver only.
> +	 */
> +	t->use_zone_write_lock = b->use_zone_write_lock;

Given that DM bio targets do not have a scheduler and do not have a zone lock
bitmap allocated, I do not think this is necessary at all. This can remain to
false, thus in sync with the fact that there is no IO scheduler.

>  	t->zoned = max(t->zoned, b->zoned);
>  	return ret;
>  }
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 2f5371b8482c..deffa1f13444 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -316,6 +316,7 @@ struct queue_limits {
>  	unsigned char		misaligned;
>  	unsigned char		discard_misaligned;
>  	unsigned char		raid_partial_stripes_expensive;
> +	bool			use_zone_write_lock;
>  	enum blk_zoned_model	zoned;
>  
>  	/*
Bart Van Assche Aug. 10, 2023, 2:02 p.m. UTC | #2
On 8/9/23 18:33, Damien Le Moal wrote:
> On 8/10/23 05:23, Bart Van Assche wrote:
>> Writes in sequential write required zones must happen at the write
>> pointer. Even if the submitter of the write commands (e.g. a filesystem)
>> submits writes for sequential write required zones in order, the block
>> layer or the storage controller may reorder these write commands.
>>
>> The zone locking mechanism in the mq-deadline I/O scheduler serializes
>> write commands for sequential zones. Some but not all storage controllers
>> require this serialization. Introduce a new request queue limit member
>> variable to allow block drivers to indicate that they preserve the order
>> of write commands and thus do not require serialization of writes per
>> zone.
>>
>> Cc: Damien Le Moal <dlemoal@kernel.org>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Ming Lei <ming.lei@redhat.com>
>> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
>> ---
>>   block/blk-settings.c   | 6 ++++++
>>   include/linux/blkdev.h | 1 +
>>   2 files changed, 7 insertions(+)
>>
>> diff --git a/block/blk-settings.c b/block/blk-settings.c
>> index 0046b447268f..b75c97971860 100644
>> --- a/block/blk-settings.c
>> +++ b/block/blk-settings.c
>> @@ -56,6 +56,7 @@ void blk_set_default_limits(struct queue_limits *lim)
>>   	lim->alignment_offset = 0;
>>   	lim->io_opt = 0;
>>   	lim->misaligned = 0;
>> +	lim->use_zone_write_lock = true;
>>   	lim->zoned = BLK_ZONED_NONE;
> 
> Given that the default for zoned is BLK_ZONED_NONE, having use_zone_write_lock
> default to true is strange. It would be better to set the default to false and
> have disk_set_zoned() set it to true if needed, with an additional argument to
> specify if it should be the case or not. E.g., for SMR drives, sd.c would call
> something like:
> 
> disk_set_zoned(sdkp->disk, BLK_ZONED_HM, sdp->use_zone_write_lock);
> 
> sd.c would default to sdp->use_zone_write_lock == true and UFS driver can set
> it to false. That would be cleaner I think.

Hi Damien,

Thanks for the detailed feedback.

My concerns about the above proposal are as follows:
* The information about whether or not the zone write lock should be used comes
   from the block driver, e.g. a SCSI LLD.
* sdp, the SCSI disk pointer, is owned by the ULD.
* An ULD may be attached and detached multiple times during the lifetime of a
   logical unit without the LLD being informed about this. So how to set
   sdp->use_zone_write_lock without introducing a new callback or member variable
   in a data structure owned by the LLD?

Hence my preference to store use_zone_write_lock in a data structure that has the
same lifetime as the logical unit and not in any data structure controlled by the
ULD.

>>   	lim->zone_write_granularity = 0;
>>   	lim->dma_alignment = 511;
>> @@ -685,6 +686,11 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>>   						   b->max_secure_erase_sectors);
>>   	t->zone_write_granularity = max(t->zone_write_granularity,
>>   					b->zone_write_granularity);
>> +	/*
>> +	 * Whether or not the zone write lock should be used depends on the
>> +	 * bottom driver only.
>> +	 */
>> +	t->use_zone_write_lock = b->use_zone_write_lock;
> 
> Given that DM bio targets do not have a scheduler and do not have a zone lock
> bitmap allocated, I do not think this is necessary at all. This can remain to
> false, thus in sync with the fact that there is no IO scheduler.

How about the request-based dm drivers (dm-mpath and dm-target)? Isn't the dm-mpath
driver request based because that allows the I/O scheduler to be configured on top
of the dm-mpath driver? From https://lwn.net/Articles/274292/: "The basic idea to
resolve the issue is to move multipathing layer down below the I/O scheduler, and it
was proposed from Mike Christie as the block layer (request-based) multipath".

Thanks,

Bart.
Christoph Hellwig Aug. 11, 2023, 1:29 p.m. UTC | #3
Just as last time:  comparing the lbas really has no business in the
SCSI core.
Bart Van Assche Aug. 11, 2023, 3:41 p.m. UTC | #4
On 8/10/23 17:39, Damien Le Moal wrote:
> On 8/10/23 23:02, Bart Van Assche wrote:
>> My concerns about the above proposal are as follows:
>> * The information about whether or not the zone write lock should be used comes
>>     from the block driver, e.g. a SCSI LLD.
> 
> Yes.
> 
>> * sdp, the SCSI disk pointer, is owned by the ULD.
>> * An ULD may be attached and detached multiple times during the lifetime of a
>>     logical unit without the LLD being informed about this. So how to set
>>     sdp->use_zone_write_lock without introducing a new callback or member variable
>>     in a data structure owned by the LLD?
> 
> That would be set during device scan and device revalidate. And if the value
> changes, then disk_set_zoned() should be called again to update the queue limit.
> That is already what is done for the zoned limit indicating the type of the
> drive. My point is that the zoned limit should dictate if use_zone_write_lock
> can be true. The default should be be:
> 
> 	q->limits.use_zone_write_lock = q->limits.zoned != BLK_ZONED_NONE.
> 
> And as proposed, if the UFS driver wants to disable zone write locking, all it
> needs to do is call "disk_set_zoned(disk, BLK_ZONED_HM, false)". I did not try
> to actually code that, and the scsi disk driver may be in the way and that may
> need to be done there, hence the suggestion of having a use_zone_write_lock flag
> in the scsi device structure so that the UFS driver can set it as needed as well
> (and detect changes when revalidating). That should work, but I may be missing
> something.

Hi Damien,

Adding a use_zone_write_lock argument to disk_set_zoned() seems wrong to 
me. The information about whether or not to use the zone write lock 
comes from the LLD. The zone model is retrieved by the ULD. Since both 
pieces of information come from different drivers, both properties 
should be modified independently.

Moving the use_zone_write_lock member variable into a data structure 
owned by the ULD seems wrong to me because that member variable is set 
by the LLD.

Reviewers are allowed to request changes for well-designed and working 
code but should be able to explain why they request these changes. Can 
you please explain why you care about the value of 
q->limits.use_zone_write_lock for the BLK_ZONED_NONE case? If 
dd_use_zone_write_locking() would be renamed and would be moved into 
include/linux/blkdev.h and if reads of q->limits.use_zone_write_lock 
would be changed into blk_use_zone_write_locking() calls then the value 
of q->limits.use_zone_write_lock for the BLK_ZONED_NONE case wouldn't 
matter at all.

Thanks,

Bart.
Damien Le Moal Aug. 12, 2023, 2:44 a.m. UTC | #5
On 8/12/23 00:41, Bart Van Assche wrote:
> On 8/10/23 17:39, Damien Le Moal wrote:
>> On 8/10/23 23:02, Bart Van Assche wrote:
>>> My concerns about the above proposal are as follows:
>>> * The information about whether or not the zone write lock should be used comes
>>>     from the block driver, e.g. a SCSI LLD.
>>
>> Yes.
>>
>>> * sdp, the SCSI disk pointer, is owned by the ULD.
>>> * An ULD may be attached and detached multiple times during the lifetime of a
>>>     logical unit without the LLD being informed about this. So how to set
>>>     sdp->use_zone_write_lock without introducing a new callback or member variable
>>>     in a data structure owned by the LLD?
>>
>> That would be set during device scan and device revalidate. And if the value
>> changes, then disk_set_zoned() should be called again to update the queue limit.
>> That is already what is done for the zoned limit indicating the type of the
>> drive. My point is that the zoned limit should dictate if use_zone_write_lock
>> can be true. The default should be be:
>>
>> 	q->limits.use_zone_write_lock = q->limits.zoned != BLK_ZONED_NONE.
>>
>> And as proposed, if the UFS driver wants to disable zone write locking, all it
>> needs to do is call "disk_set_zoned(disk, BLK_ZONED_HM, false)". I did not try
>> to actually code that, and the scsi disk driver may be in the way and that may
>> need to be done there, hence the suggestion of having a use_zone_write_lock flag
>> in the scsi device structure so that the UFS driver can set it as needed as well
>> (and detect changes when revalidating). That should work, but I may be missing
>> something.
> 
> Hi Damien,
> 
> Adding a use_zone_write_lock argument to disk_set_zoned() seems wrong to 
> me. The information about whether or not to use the zone write lock 
> comes from the LLD. The zone model is retrieved by the ULD. Since both 
> pieces of information come from different drivers, both properties 
> should be modified independently.

OK. But I still think that disk_set_zoned() is the place where the default for
use_zone_write_lock should be set.

And we need a clean way for the LLD to change use_zone_write_lock.

> 
> Moving the use_zone_write_lock member variable into a data structure 
> owned by the ULD seems wrong to me because that member variable is set 
> by the LLD.
> 
> Reviewers are allowed to request changes for well-designed and working 
> code but should be able to explain why they request these changes. Can 
> you please explain why you care about the value of 
> q->limits.use_zone_write_lock for the BLK_ZONED_NONE case? If 

Because use_zone_write_lock should ever be true for !BLK_ZONED_NONE case.
Allowing use_zone_write_lock to be true even with zoned == BLK_ZONED_NONE forces
to always check that the device is zoned to ensure that the value of
use_zone_write_lock is valid. This is awckward and uselessly complicate things.

> dd_use_zone_write_locking() would be renamed and would be moved into 
> include/linux/blkdev.h and if reads of q->limits.use_zone_write_lock 
> would be changed into blk_use_zone_write_locking() calls then the value 
> of q->limits.use_zone_write_lock for the BLK_ZONED_NONE case wouldn't 
> matter at all.

Sure, that would work. But again, why the need to check both model and actual
value of use_zone_write_lock ? I do not think it is that hard to keep
use_zone_write_lock to false for the BLK_ZONED_NONE case. Then
blk_use_zone_write_locking() is reduced to returning the value of
use_zone_write_lock.

> 
> Thanks,
> 
> Bart.