
[RFC,v4,01/11] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features

Message ID 20250421021509.2366003-2-yi.zhang@huaweicloud.com
State New
Series fallocate: introduce FALLOC_FL_WRITE_ZEROES flag

Commit Message

Zhang Yi April 21, 2025, 2:14 a.m. UTC
From: Zhang Yi <yi.zhang@huawei.com>

Currently, disks primarily implement the write zeroes command (aka
REQ_OP_WRITE_ZEROES) through two mechanisms: the first involves
physically writing zeroes to the disk media (e.g., HDDs), while the
second performs an unmap operation on the logical blocks, effectively
putting them into a deallocated state (e.g., SSDs). The first method is
generally slow, while the second method is typically very fast.

For example, on certain NVMe SSDs that support NVME_NS_DEAC, submitting
REQ_OP_WRITE_ZEROES requests with the NVME_WZ_DEAC bit set can accelerate
the write zeroes operation by placing disk blocks into a deallocated
state (this is a best-effort optimization, not a mandatory requirement;
some devices may partially fall back to writing physical zeroes due to
factors such as receiving unaligned commands). However, it is difficult
to determine whether a storage device supports unmap write zeroes; this
cannot be determined by querying bdev_limits(bdev)->max_write_zeroes_sectors.

Therefore, add a new queue limit feature, BLK_FEAT_WRITE_ZEROES_UNMAP,
and a corresponding sysfs entry, to indicate whether the block device
explicitly supports the unmap write zeroes command. Each device driver
should set this bit if it is certain that the attached disk supports
this command. If the bit is not set, the disk either does not support
it, or its support status is unknown.

For stacked devices, BLK_FEAT_WRITE_ZEROES_UNMAP is kept only if it is
supported by both the stacking driver and all underlying devices.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 Documentation/ABI/stable/sysfs-block | 18 ++++++++++++++++++
 block/blk-settings.c                 |  6 ++++++
 block/blk-sysfs.c                    |  3 +++
 include/linux/blkdev.h               |  8 ++++++++
 4 files changed, 35 insertions(+)
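
As a quick illustration of how the feature is meant to be used on the
driver side (this snippet is not part of the patch; the function name and
the supports_deac parameter are made up for the example), a driver that
knows the attached disk deallocates blocks on write zeroes could set the
bit when filling in its queue limits:

/*
 * Illustrative only, not part of this series: advertise the capability
 * when the driver knows the disk deallocates on write zeroes (e.g. an
 * NVMe namespace with NVME_NS_DEAC).
 */
static void example_set_write_zeroes_unmap(struct queue_limits *lim,
                                           bool supports_deac)
{
        if (supports_deac && lim->max_write_zeroes_sectors)
                lim->features |= BLK_FEAT_WRITE_ZEROES_UNMAP;
        else
                lim->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
}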

Comments

Christoph Hellwig May 5, 2025, 11:54 a.m. UTC | #1
Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
Martin K. Petersen May 6, 2025, 4:21 a.m. UTC | #2
Hi Zhang!

> +		[RO] Devices that explicitly support the unmap write zeroes
> +		operation in which a single write zeroes request with the unmap
> +		bit set to zero out the range of contiguous blocks on storage
> +		by freeing blocks, rather than writing physical zeroes to the
> +		media. If the write_zeroes_unmap is set to 1, this indicates
> +		that the device explicitly supports the write zero command.
> +		However, this may be a best-effort optimization rather than a
> +		mandatory requirement, some devices may partially fall back to
> +		writing physical zeroes due to factors such as receiving
> +		unaligned commands. If the parameter is set to 0, the device
> +		either does not support this operation, or its support status is
> +		unknown.

I am not so keen on mixing Write Zeroes (which is NVMe-speak) and Unmap
(which is SCSI). Also, Deallocate and Unmap reflect block provisioning
state on the device but don't really convey what is semantically
important for your proposed change (zeroing speed and/or media wear
reduction).

That said, I'm having a hard time coming up with a better term.
WRITE_ZEROES_OPTIMIZED, maybe? Naming is hard...

For the description, perhaps something like the following which tries to
focus on the block layer semantics without using protocol-specific
terminology?

[RO] This parameter indicates whether a device supports zeroing data in
a specified block range without incurring the cost of physically writing
zeroes to media for each individual block. This operation is a
best-effort optimization, a device may fall back to physically writing
zeroes to media due to other factors such as misalignment or being asked
to clear a block range smaller than the device's internal allocation
unit. If write_zeroes_unmap is set to 1, the device implements a zeroing
operation which opportunistically avoids writing zeroes to media while
still guaranteeing that subsequent reads from the specified block range
will return zeroed data. If write_zeroes_unmap is set to 0, the device
may have to write zeroes to each logical block of the media during a
zeroing operation.
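
For reference, a userspace consumer of the attribute described above
could perform a check like the following (illustrative only; the helper
name is an example, and the attribute path comes from the patch below):

/* Illustrative check of /sys/block/<disk>/queue/write_zeroes_unmap. */
#include <stdio.h>

static int write_zeroes_unmap_supported(const char *disk)
{
        char path[256];
        FILE *f;
        int val = 0;

        snprintf(path, sizeof(path),
                 "/sys/block/%s/queue/write_zeroes_unmap", disk);
        f = fopen(path, "r");
        if (!f)
                return -1;      /* attribute not present (older kernel) */
        if (fscanf(f, "%d", &val) != 1)
                val = 0;
        fclose(f);
        return val;             /* 1: unmap write zeroes advertised, 0: not/unknown */
}
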
Zhang Yi May 6, 2025, 7:51 a.m. UTC | #3
Hi, Martin!

On 2025/5/6 12:21, Martin K. Petersen wrote:
> 
> Hi Zhang!
> 
>> +		[RO] Devices that explicitly support the unmap write zeroes
>> +		operation in which a single write zeroes request with the unmap
>> +		bit set to zero out the range of contiguous blocks on storage
>> +		by freeing blocks, rather than writing physical zeroes to the
>> +		media. If the write_zeroes_unmap is set to 1, this indicates
>> +		that the device explicitly supports the write zero command.
>> +		However, this may be a best-effort optimization rather than a
>> +		mandatory requirement, some devices may partially fall back to
>> +		writing physical zeroes due to factors such as receiving
>> +		unaligned commands. If the parameter is set to 0, the device
>> +		either does not support this operation, or its support status is
>> +		unknown.
> 
> I am not so keen on mixing Write Zeroes (which is NVMe-speak) and Unmap
> (which is SCSI). Also, Deallocate and Unmap reflect block provisioning
> state on the device but don't really convey what is semantically
> important for your proposed change (zeroing speed and/or media wear
> reduction).
> 

This flag doesn't strictly guarantee zeroing speed or media wear
reduction optimizations; rather, it reflects the typical optimization
behavior across most supporting devices and cases. Therefore, I propose
using a name that accurately indicates the function of the block device.
However, I can't think of a better name either. Using the name
WRITE_ZEROES_UNMAP seems appropriate to convey that the block device
supports putting blocks into this kind of deallocated/unmapped state.

> That said, I'm having a hard time coming up with a better term.
> WRITE_ZEROES_OPTIMIZED, maybe? Naming is hard...

Using WRITE_ZEROES_OPTIMIZED feels somewhat too generic to me; users
may not grasp from the name which specific optimization it entails.

> 
> For the description, perhaps something like the following which tries to
> focus on the block layer semantics without using protocol-specific
> terminology?
> 
> [RO] This parameter indicates whether a device supports zeroing data in
> a specified block range without incurring the cost of physically writing
> zeroes to media for each individual block. This operation is a
> best-effort optimization, a device may fall back to physically writing
> zeroes to media due to other factors such as misalignment or being asked
> to clear a block range smaller than the device's internal allocation
> unit. If write_zeroes_unmap is set to 1, the device implements a zeroing
> operation which opportunistically avoids writing zeroes to media while
> still guaranteeing that subsequent reads from the specified block range
> will return zeroed data. If write_zeroes_unmap is set to 0, the device
> may have to write zeroes to each logical block of the media during a
> zeroing operation.
> 

Thank you for improving the description, it looks good to me. I'd like
to use this one in my next iteration. :)

Thanks,
Yi.
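
Before the patch itself, here is a minimal sketch of how an in-kernel
caller (for instance, a filesystem implementing the FALLOC_FL_WRITE_ZEROES
mode introduced later in this series) might consult the new helper; the
function below is hypothetical and not part of the patch:

/*
 * Illustrative only: decide whether offloading a zeroing request is
 * likely to avoid writing physical zeroes to the media.
 */
static bool example_can_unmap_zero(struct block_device *bdev)
{
        /* The device must support write zeroes at all. */
        if (!bdev_write_zeroes_sectors(bdev))
                return false;

        /* It must also explicitly advertise deallocating write zeroes. */
        return bdev_write_zeroes_unmap(bdev);
}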

Patch

diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 3879963f0f01..6531cdfcaacf 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -763,6 +763,24 @@  Description:
 		0, write zeroes is not supported by the device.
 
 
+What:		/sys/block/<disk>/queue/write_zeroes_unmap
+Date:		January 2025
+Contact:	Zhang Yi <yi.zhang@huawei.com>
+Description:
+		[RO] Devices that explicitly support the unmap write zeroes
+		operation in which a single write zeroes request with the unmap
+		bit set to zero out the range of contiguous blocks on storage
+		by freeing blocks, rather than writing physical zeroes to the
+		media. If the write_zeroes_unmap is set to 1, this indicates
+		that the device explicitly supports the write zero command.
+		However, this may be a best-effort optimization rather than a
+		mandatory requirement, some devices may partially fall back to
+		writing physical zeroes due to factors such as receiving
+		unaligned commands. If the parameter is set to 0, the device
+		either does not support this operation, or its support status is
+		unknown.
+
+
 What:		/sys/block/<disk>/queue/zone_append_max_bytes
 Date:		May 2020
 Contact:	linux-block@vger.kernel.org
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 6b2dbe645d23..3331d07bd5d9 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -697,6 +697,8 @@  int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 		t->features &= ~BLK_FEAT_NOWAIT;
 	if (!(b->features & BLK_FEAT_POLL))
 		t->features &= ~BLK_FEAT_POLL;
+	if (!(b->features & BLK_FEAT_WRITE_ZEROES_UNMAP))
+		t->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
 
 	t->flags |= (b->flags & BLK_FLAG_MISALIGNED);
 
@@ -819,6 +821,10 @@  int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 		t->zone_write_granularity = 0;
 		t->max_zone_append_sectors = 0;
 	}
+
+	if (!t->max_write_zeroes_sectors)
+		t->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
+
 	blk_stack_atomic_writes_limits(t, b, start);
 
 	return ret;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index a2882751f0d2..7a9c20bd3779 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -261,6 +261,7 @@  static ssize_t queue_##_name##_show(struct gendisk *disk, char *page)	\
 
 QUEUE_SYSFS_FEATURE_SHOW(fua, BLK_FEAT_FUA);
 QUEUE_SYSFS_FEATURE_SHOW(dax, BLK_FEAT_DAX);
+QUEUE_SYSFS_FEATURE_SHOW(write_zeroes_unmap, BLK_FEAT_WRITE_ZEROES_UNMAP);
 
 static ssize_t queue_poll_show(struct gendisk *disk, char *page)
 {
@@ -510,6 +511,7 @@  QUEUE_LIM_RO_ENTRY(queue_atomic_write_unit_min, "atomic_write_unit_min_bytes");
 
 QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes");
 QUEUE_LIM_RO_ENTRY(queue_max_write_zeroes_sectors, "write_zeroes_max_bytes");
+QUEUE_LIM_RO_ENTRY(queue_write_zeroes_unmap, "write_zeroes_unmap");
 QUEUE_LIM_RO_ENTRY(queue_max_zone_append_sectors, "zone_append_max_bytes");
 QUEUE_LIM_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity");
 
@@ -656,6 +658,7 @@  static struct attribute *queue_attrs[] = {
 	&queue_atomic_write_unit_min_entry.attr,
 	&queue_atomic_write_unit_max_entry.attr,
 	&queue_max_write_zeroes_sectors_entry.attr,
+	&queue_write_zeroes_unmap_entry.attr,
 	&queue_max_zone_append_sectors_entry.attr,
 	&queue_zone_write_granularity_entry.attr,
 	&queue_rotational_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e39c45bc0a97..7c8752578e36 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -342,6 +342,9 @@  typedef unsigned int __bitwise blk_features_t;
 #define BLK_FEAT_ATOMIC_WRITES \
 	((__force blk_features_t)(1u << 16))
 
+/* supports unmap write zeroes command */
+#define BLK_FEAT_WRITE_ZEROES_UNMAP	((__force blk_features_t)(1u << 17))
+
 /*
  * Flags automatically inherited when stacking limits.
  */
@@ -1341,6 +1344,11 @@  static inline unsigned int bdev_write_zeroes_sectors(struct block_device *bdev)
 	return bdev_limits(bdev)->max_write_zeroes_sectors;
 }
 
+static inline bool bdev_write_zeroes_unmap(struct block_device *bdev)
+{
+	return bdev_limits(bdev)->features & BLK_FEAT_WRITE_ZEROES_UNMAP;
+}
+
 static inline bool bdev_nonrot(struct block_device *bdev)
 {
 	return blk_queue_nonrot(bdev_get_queue(bdev));