diff mbox series

[01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features

Message ID 20250604020850.1304633-2-yi.zhang@huaweicloud.com
State New
Headers show
Series fallocate: introduce FALLOC_FL_WRITE_ZEROES flag | expand

Commit Message

Zhang Yi June 4, 2025, 2:08 a.m. UTC
From: Zhang Yi <yi.zhang@huawei.com>

Currently, disks primarily implement the write zeroes command (aka
REQ_OP_WRITE_ZEROES) through two mechanisms: the first involves
physically writing zeros to the disk media (e.g., HDDs), while the
second performs an unmap operation on the logical blocks, effectively
putting them into a deallocated state (e.g., SSDs). The first method is
generally slow, while the second method is typically very fast.

For example, on certain NVMe SSDs that support NVME_NS_DEAC, submitting
REQ_OP_WRITE_ZEROES requests with the NVME_WZ_DEAC bit can accelerate
the write zeros operation by placing disk blocks into a deallocated
state, which opportunistically avoids writing zeroes to media while
still guaranteeing that subsequent reads from the specified block range
will return zeroed data. This is a best-effort optimization, not a
mandatory requirement, some devices may partially fall back to writing
physical zeroes due to factors such as misalignment or being asked to
clear a block range smaller than the device's internal allocation unit.
Therefore, the speed of this operation is not guaranteed.

It is difficult to determine whether the storage device supports unmap
write zeroes operation. We cannot determine this by only querying
bdev_limits(bdev)->max_write_zeroes_sectors. First, add a new queue
limit feature, BLK_FEAT_WRITE_ZEROES_UNMAP, to indicate whether a device
supports this unmap write zeroes operation. Then, add a new counterpart
flag, BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED and a sysfs entry, which
allow users to disable this operation if the speed is very slow on some
sepcial devices.

Finally, for the stacked devices cases, the BLK_FEAT_WRITE_ZEROES_UNMAP
should be supported both by the stacking driver and all underlying
devices.

Thanks to Martin K. Petersen for optimizing the documentation of the
write_zeroes_unmap sysfs interface.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 Documentation/ABI/stable/sysfs-block | 20 ++++++++++++++++++++
 block/blk-settings.c                 |  6 ++++++
 block/blk-sysfs.c                    | 25 +++++++++++++++++++++++++
 include/linux/blkdev.h               | 18 ++++++++++++++++++
 4 files changed, 69 insertions(+)

Comments

Christoph Hellwig June 11, 2025, 6:09 a.m. UTC | #1
On Wed, Jun 04, 2025 at 10:08:41AM +0800, Zhang Yi wrote:
> +static ssize_t queue_write_zeroes_unmap_show(struct gendisk *disk, char *page)

..

> +static int queue_write_zeroes_unmap_store(struct gendisk *disk,
> +		const char *page, size_t count, struct queue_limits *lim)

We're probably getting close to wanting macros for the sysfs
flags, similar to the one for the features (QUEUE_SYSFS_FEATURE).

No need to do this now, just thinking along.

> +/* supports unmap write zeroes command */
> +#define BLK_FEAT_WRITE_ZEROES_UNMAP	((__force blk_features_t)(1u << 17))


Should this be exposed through sysfs as a read-only value?

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
Zhang Yi June 11, 2025, 7:31 a.m. UTC | #2
On 2025/6/11 14:09, Christoph Hellwig wrote:
> On Wed, Jun 04, 2025 at 10:08:41AM +0800, Zhang Yi wrote:
>> +static ssize_t queue_write_zeroes_unmap_show(struct gendisk *disk, char *page)
> 
> ..
> 
>> +static int queue_write_zeroes_unmap_store(struct gendisk *disk,
>> +		const char *page, size_t count, struct queue_limits *lim)
> 
> We're probably getting close to wanting macros for the sysfs
> flags, similar to the one for the features (QUEUE_SYSFS_FEATURE).
> 
> No need to do this now, just thinking along.

Yes.

> 
>> +/* supports unmap write zeroes command */
>> +#define BLK_FEAT_WRITE_ZEROES_UNMAP	((__force blk_features_t)(1u << 17))
> 
> 
> Should this be exposed through sysfs as a read-only value?

Uh, are you suggesting adding another sysfs interface to expose
this feature?

> 
> Otherwise looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Thanks,
Yi.
Christoph Hellwig June 12, 2025, 4:47 a.m. UTC | #3
On Wed, Jun 11, 2025 at 03:31:21PM +0800, Zhang Yi wrote:
> >> +/* supports unmap write zeroes command */
> >> +#define BLK_FEAT_WRITE_ZEROES_UNMAP	((__force blk_features_t)(1u << 17))
> > 
> > 
> > Should this be exposed through sysfs as a read-only value?
> 
> Uh, are you suggesting adding another sysfs interface to expose
> this feature?

That was the idea.  Or do we have another way to report this capability?
Zhang Yi June 12, 2025, 11:20 a.m. UTC | #4
On 2025/6/12 12:47, Christoph Hellwig wrote:
> On Wed, Jun 11, 2025 at 03:31:21PM +0800, Zhang Yi wrote:
>>>> +/* supports unmap write zeroes command */
>>>> +#define BLK_FEAT_WRITE_ZEROES_UNMAP	((__force blk_features_t)(1u << 17))
>>>
>>>
>>> Should this be exposed through sysfs as a read-only value?
>>
>> Uh, are you suggesting adding another sysfs interface to expose
>> this feature?
> 
> That was the idea.  Or do we have another way to report this capability?
> 

Exposing this feature looks useful, but I think adding a new interface
might be somewhat redundant, and it's also difficult to name the new
interface. What about extend this interface to include 3 types? When
read, it exposes the following:

 - none     : the device doesn't support BLK_FEAT_WRITE_ZEROES_UNMAP.
 - enabled  : the device supports BLK_FEAT_WRITE_ZEROES_UNMAP, but the
              BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED is not set.
 - disabled : the device supports BLK_FEAT_WRITE_ZEROES_UNMAP, and the
              BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED is set.

Users can write '0' and '1' to disable and enable this operation if it
is not 'none', thoughts?

Best regards,
Yi.
Darrick J. Wong June 12, 2025, 3:03 p.m. UTC | #5
On Thu, Jun 12, 2025 at 07:20:45PM +0800, Zhang Yi wrote:
> On 2025/6/12 12:47, Christoph Hellwig wrote:
> > On Wed, Jun 11, 2025 at 03:31:21PM +0800, Zhang Yi wrote:
> >>>> +/* supports unmap write zeroes command */
> >>>> +#define BLK_FEAT_WRITE_ZEROES_UNMAP	((__force blk_features_t)(1u << 17))
> >>>
> >>>
> >>> Should this be exposed through sysfs as a read-only value?
> >>
> >> Uh, are you suggesting adding another sysfs interface to expose
> >> this feature?
> > 
> > That was the idea.  Or do we have another way to report this capability?
> > 
> 
> Exposing this feature looks useful, but I think adding a new interface
> might be somewhat redundant, and it's also difficult to name the new
> interface. What about extend this interface to include 3 types? When
> read, it exposes the following:
> 
>  - none     : the device doesn't support BLK_FEAT_WRITE_ZEROES_UNMAP.
>  - enabled  : the device supports BLK_FEAT_WRITE_ZEROES_UNMAP, but the
>               BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED is not set.
>  - disabled : the device supports BLK_FEAT_WRITE_ZEROES_UNMAP, and the
>               BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED is set.
> 
> Users can write '0' and '1' to disable and enable this operation if it
> is not 'none', thoughts?

Perhaps it should reuse the enumeration pattern elsewhere in sysfs?
For example,

# cat /sys/block/sda/queue/scheduler
none [mq-deadline]
# echo none > /sys/block/sda/queue/scheduler
# cat /sys/block/sda/queue/scheduler
[none] mq-deadline

(Annoying that this seems to be opencoded wherever it appears...)

--D

> Best regards,
> Yi.
> 
>
Zhang Yi June 13, 2025, 3:15 a.m. UTC | #6
On 2025/6/12 23:03, Darrick J. Wong wrote:
> On Thu, Jun 12, 2025 at 07:20:45PM +0800, Zhang Yi wrote:
>> On 2025/6/12 12:47, Christoph Hellwig wrote:
>>> On Wed, Jun 11, 2025 at 03:31:21PM +0800, Zhang Yi wrote:
>>>>>> +/* supports unmap write zeroes command */
>>>>>> +#define BLK_FEAT_WRITE_ZEROES_UNMAP	((__force blk_features_t)(1u << 17))
>>>>>
>>>>>
>>>>> Should this be exposed through sysfs as a read-only value?
>>>>
>>>> Uh, are you suggesting adding another sysfs interface to expose
>>>> this feature?
>>>
>>> That was the idea.  Or do we have another way to report this capability?
>>>
>>
>> Exposing this feature looks useful, but I think adding a new interface
>> might be somewhat redundant, and it's also difficult to name the new
>> interface. What about extend this interface to include 3 types? When
>> read, it exposes the following:
>>
>>  - none     : the device doesn't support BLK_FEAT_WRITE_ZEROES_UNMAP.
>>  - enabled  : the device supports BLK_FEAT_WRITE_ZEROES_UNMAP, but the
>>               BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED is not set.
>>  - disabled : the device supports BLK_FEAT_WRITE_ZEROES_UNMAP, and the
>>               BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED is set.
>>
>> Users can write '0' and '1' to disable and enable this operation if it
>> is not 'none', thoughts?
> 
> Perhaps it should reuse the enumeration pattern elsewhere in sysfs?
> For example,
> 
> # cat /sys/block/sda/queue/scheduler
> none [mq-deadline]
> # echo none > /sys/block/sda/queue/scheduler
> # cat /sys/block/sda/queue/scheduler
> [none] mq-deadline
> 
> (Annoying that this seems to be opencoded wherever it appears...)
> 

Yeah, this solution looks good to me. However, we currently have only
two selections (none and unmap). What if we keep it as is and simply
hide this interface if BLK_FEAT_WRITE_ZEROES_UNMAP is not set, making
it visible only when the device supports this feature? Something like
below:

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index e918b2c93aed..204ee4d5f63f 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -747,6 +747,9 @@ static umode_t queue_attr_visible(struct kobject *kobj, struct attribute *attr,
             attr == &queue_max_active_zones_entry.attr) &&
            !blk_queue_is_zoned(q))
                return 0;
+       if (attr == &queue_write_zeroes_unmap_entry.attr &&
+           !(q->limits.features & BLK_FEAT_WRITE_ZEROES_UNMAP))
+               return 0;

        return attr->mode;
 }

Thanks,
Yi.
Christoph Hellwig June 13, 2025, 5:56 a.m. UTC | #7
On Fri, Jun 13, 2025 at 11:15:41AM +0800, Zhang Yi wrote:
> Yeah, this solution looks good to me. However, we currently have only
> two selections (none and unmap). What if we keep it as is and simply
> hide this interface if BLK_FEAT_WRITE_ZEROES_UNMAP is not set, making
> it visible only when the device supports this feature? Something like
> below:

I really hate having all kinds of different interfaces for configurations.
Maybe we should redo this similar to the other hardware/software interfaces
and have a hw_ limit that is exposed by the driver and re-only in
sysfs, and then the user configurable one without _hw.  Setting it to
zero disables the feature.
Darrick J. Wong June 13, 2025, 2:54 p.m. UTC | #8
On Fri, Jun 13, 2025 at 07:56:30AM +0200, Christoph Hellwig wrote:
> On Fri, Jun 13, 2025 at 11:15:41AM +0800, Zhang Yi wrote:
> > Yeah, this solution looks good to me. However, we currently have only
> > two selections (none and unmap). What if we keep it as is and simply
> > hide this interface if BLK_FEAT_WRITE_ZEROES_UNMAP is not set, making
> > it visible only when the device supports this feature? Something like
> > below:
> 
> I really hate having all kinds of different interfaces for configurations.

I really hate the open-coded string parsing nonsense that is sysfs. ;)

> Maybe we should redo this similar to the other hardware/software interfaces
> and have a hw_ limit that is exposed by the driver and re-only in
> sysfs, and then the user configurable one without _hw.  Setting it to
> zero disables the feature.

Yeah, that fits the /sys/block/foo/queue model better.

--D
Zhang Yi June 14, 2025, 4:48 a.m. UTC | #9
On 2025/6/13 22:54, Darrick J. Wong wrote:
> On Fri, Jun 13, 2025 at 07:56:30AM +0200, Christoph Hellwig wrote:
>> On Fri, Jun 13, 2025 at 11:15:41AM +0800, Zhang Yi wrote:
>>> Yeah, this solution looks good to me. However, we currently have only
>>> two selections (none and unmap). What if we keep it as is and simply
>>> hide this interface if BLK_FEAT_WRITE_ZEROES_UNMAP is not set, making
>>> it visible only when the device supports this feature? Something like
>>> below:
>>
>> I really hate having all kinds of different interfaces for configurations.
> 
> I really hate the open-coded string parsing nonsense that is sysfs. ;)
> 
>> Maybe we should redo this similar to the other hardware/software interfaces
>> and have a hw_ limit that is exposed by the driver and re-only in
>> sysfs, and then the user configurable one without _hw.  Setting it to
>> zero disables the feature.
> 
> Yeah, that fits the /sys/block/foo/queue model better.
> 

OK, well. Please let me confirm, are you both suggesting adding
max_hw_write_zeores_unmap_sectors and max_write_zeroes_unmap_sectors to
the queue_limits instead of adding BLK_FEAT_WRITE_ZEROES_UNMAP to the
queue_limits->features. Something like the following.

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 378d3a1a22fc..14394850863c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -376,7 +376,9 @@ struct queue_limits {
        unsigned int            max_hw_discard_sectors;
        unsigned int            max_user_discard_sectors;
        unsigned int            max_secure_erase_sectors;
-       unsigned int            max_write_zeroes_sectors;
+       unsigned int            max_hw_write_zeroes_sectors;
+       unsigned int            max_hw_write_zeores_unmap_sectors;
+       unsigned int            max_write_zeroes_unmap_sectors;
        unsigned int            max_hw_zone_append_sectors;
        unsigned int            max_zone_append_sectors;
        unsigned int            discard_granularity;

Besides, we should also rename max_write_zeroes_sectors to
max_hw_write_zeroes_sectors since it is a hardware limitation reported
by the driver.  If the device supports unmap write zeroes,
max_hw_write_zeores_unmap_sectors should be equal to
max_hw_write_zeroes_sectors, otherwise it should be 0.

Right?

Best regards,
Yi.
Christoph Hellwig June 16, 2025, 5:39 a.m. UTC | #10
On Sat, Jun 14, 2025 at 12:48:26PM +0800, Zhang Yi wrote:
> >> Maybe we should redo this similar to the other hardware/software interfaces
> >> and have a hw_ limit that is exposed by the driver and re-only in
> >> sysfs, and then the user configurable one without _hw.  Setting it to
> >> zero disables the feature.
> > 
> > Yeah, that fits the /sys/block/foo/queue model better.
> > 
> 
> OK, well. Please let me confirm, are you both suggesting adding
> max_hw_write_zeores_unmap_sectors and max_write_zeroes_unmap_sectors to
> the queue_limits instead of adding BLK_FEAT_WRITE_ZEROES_UNMAP to the
> queue_limits->features. Something like the following.

Yes.

> Besides, we should also rename max_write_zeroes_sectors to
> max_hw_write_zeroes_sectors since it is a hardware limitation reported
> by the driver.  If the device supports unmap write zeroes,
> max_hw_write_zeores_unmap_sectors should be equal to
> max_hw_write_zeroes_sectors, otherwise it should be 0.

We've only done the hw names when we allow and overwrite or cap based
on other values.  So far we've not done any of that to
max_write_zeroes_sectors.
diff mbox series

Patch

diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 4ba771b56b3b..8e7d513286c4 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -778,6 +778,26 @@  Description:
 		0, write zeroes is not supported by the device.
 
 
+What:		/sys/block/<disk>/queue/write_zeroes_unmap
+Date:		January 2025
+Contact:	Zhang Yi <yi.zhang@huawei.com>
+Description:
+		[RW] When read, this file will display whether the device has
+		enabled the unmap write zeroes operation. This operation
+		indicates that the device supports zeroing data in a specified
+		block range without incurring the cost of physically writing
+		zeroes to media for each individual block. It implements a
+		zeroing operation which opportunistically avoids writing zeroes
+		to media while still guaranteeing that subsequent reads from the
+		specified block range will return zeroed data. This operation is
+		a best-effort optimization, a device may fall back to physically
+		writing zeroes to media due to other factors such as
+		misalignment or being asked to clear a block range smaller than
+		the device's internal allocation unit. So the speed of this
+		operation is not guaranteed. Writing a value of '0' to this file
+		disables this operation.
+
+
 What:		/sys/block/<disk>/queue/zone_append_max_bytes
 Date:		May 2020
 Contact:	linux-block@vger.kernel.org
diff --git a/block/blk-settings.c b/block/blk-settings.c
index a000daafbfb4..de99763fd668 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -698,6 +698,8 @@  int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 		t->features &= ~BLK_FEAT_NOWAIT;
 	if (!(b->features & BLK_FEAT_POLL))
 		t->features &= ~BLK_FEAT_POLL;
+	if (!(b->features & BLK_FEAT_WRITE_ZEROES_UNMAP))
+		t->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
 
 	t->flags |= (b->flags & BLK_FLAG_MISALIGNED);
 
@@ -820,6 +822,10 @@  int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 		t->zone_write_granularity = 0;
 		t->max_zone_append_sectors = 0;
 	}
+
+	if (!t->max_write_zeroes_sectors)
+		t->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
+
 	blk_stack_atomic_writes_limits(t, b, start);
 
 	return ret;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index b2b9b89d6967..e918b2c93aed 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -457,6 +457,29 @@  static int queue_wc_store(struct gendisk *disk, const char *page,
 	return 0;
 }
 
+static ssize_t queue_write_zeroes_unmap_show(struct gendisk *disk, char *page)
+{
+	return sysfs_emit(page, "%u\n",
+			  blk_queue_write_zeroes_unmap(disk->queue));
+}
+
+static int queue_write_zeroes_unmap_store(struct gendisk *disk,
+		const char *page, size_t count, struct queue_limits *lim)
+{
+	unsigned long val;
+	ssize_t ret;
+
+	ret = queue_var_store(&val, page, count);
+	if (ret < 0)
+		return ret;
+
+	if (val)
+		lim->flags &= ~BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED;
+	else
+		lim->flags |= BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED;
+	return 0;
+}
+
 #define QUEUE_RO_ENTRY(_prefix, _name)			\
 static struct queue_sysfs_entry _prefix##_entry = {	\
 	.attr	= { .name = _name, .mode = 0444 },	\
@@ -514,6 +537,7 @@  QUEUE_LIM_RO_ENTRY(queue_atomic_write_unit_min, "atomic_write_unit_min_bytes");
 
 QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes");
 QUEUE_LIM_RO_ENTRY(queue_max_write_zeroes_sectors, "write_zeroes_max_bytes");
+QUEUE_LIM_RW_ENTRY(queue_write_zeroes_unmap, "write_zeroes_unmap");
 QUEUE_LIM_RO_ENTRY(queue_max_zone_append_sectors, "zone_append_max_bytes");
 QUEUE_LIM_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity");
 
@@ -662,6 +686,7 @@  static struct attribute *queue_attrs[] = {
 	&queue_atomic_write_unit_min_entry.attr,
 	&queue_atomic_write_unit_max_entry.attr,
 	&queue_max_write_zeroes_sectors_entry.attr,
+	&queue_write_zeroes_unmap_entry.attr,
 	&queue_max_zone_append_sectors_entry.attr,
 	&queue_zone_write_granularity_entry.attr,
 	&queue_rotational_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 332b56f323d9..6f1cf97b1f00 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -340,6 +340,9 @@  typedef unsigned int __bitwise blk_features_t;
 #define BLK_FEAT_ATOMIC_WRITES \
 	((__force blk_features_t)(1u << 16))
 
+/* supports unmap write zeroes command */
+#define BLK_FEAT_WRITE_ZEROES_UNMAP	((__force blk_features_t)(1u << 17))
+
 /*
  * Flags automatically inherited when stacking limits.
  */
@@ -360,6 +363,10 @@  typedef unsigned int __bitwise blk_flags_t;
 /* passthrough command IO accounting */
 #define BLK_FLAG_IOSTATS_PASSTHROUGH	((__force blk_flags_t)(1u << 2))
 
+/* disable the unmap write zeroes operation */
+#define BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED \
+					((__force blk_flags_t)(1u << 3))
+
 struct queue_limits {
 	blk_features_t		features;
 	blk_flags_t		flags;
@@ -1378,6 +1385,17 @@  static inline unsigned int bdev_write_zeroes_sectors(struct block_device *bdev)
 	return bdev_limits(bdev)->max_write_zeroes_sectors;
 }
 
+static inline bool blk_queue_write_zeroes_unmap(struct request_queue *q)
+{
+	return (q->limits.features & BLK_FEAT_WRITE_ZEROES_UNMAP) &&
+		!(q->limits.flags & BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED);
+}
+
+static inline bool bdev_write_zeroes_unmap(struct block_device *bdev)
+{
+	return blk_queue_write_zeroes_unmap(bdev_get_queue(bdev));
+}
+
 static inline bool bdev_nonrot(struct block_device *bdev)
 {
 	return blk_queue_nonrot(bdev_get_queue(bdev));