Message ID | 20250421021509.2366003-8-yi.zhang@huaweicloud.com |
---|---|
State | New |
Headers | show |
Series | fallocate: introduce FALLOC_FL_WRITE_ZEROES flag | expand |
On Mon, Apr 21, 2025 at 10:15:05AM +0800, Zhang Yi wrote: > From: Zhang Yi <yi.zhang@huawei.com> > > Add a new attribute flag to statx to determine whether a bdev or a file > supports the unmap write zeroes command. > > Signed-off-by: Zhang Yi <yi.zhang@huawei.com> > --- > block/bdev.c | 4 ++++ > fs/ext4/inode.c | 9 ++++++--- > include/uapi/linux/stat.h | 1 + > 3 files changed, 11 insertions(+), 3 deletions(-) > > diff --git a/block/bdev.c b/block/bdev.c > index 4844d1e27b6f..29b0e5feb138 100644 > --- a/block/bdev.c > +++ b/block/bdev.c > @@ -1304,6 +1304,10 @@ void bdev_statx(struct path *path, struct kstat *stat, > queue_atomic_write_unit_max_bytes(bd_queue)); > } > > + if (bdev_write_zeroes_unmap(bdev)) > + stat->attributes |= STATX_ATTR_WRITE_ZEROES_UNMAP; > + stat->attributes_mask |= STATX_ATTR_WRITE_ZEROES_UNMAP; Hmm, shouldn't this always be set by stat? But I might just be really confused what attributes_mask is, and might in fact have misapplied it in past patches of my own.. Also shouldn't the patches to report the flag go into the bdev/ext4 patches that actually implement the feature for the respective files to keep bisectability?
On Mon, May 05, 2025 at 03:22:08PM +0200, Christoph Hellwig wrote: > On Mon, Apr 21, 2025 at 10:15:05AM +0800, Zhang Yi wrote: > > From: Zhang Yi <yi.zhang@huawei.com> > > > > Add a new attribute flag to statx to determine whether a bdev or a file > > supports the unmap write zeroes command. > > > > Signed-off-by: Zhang Yi <yi.zhang@huawei.com> > > --- > > block/bdev.c | 4 ++++ > > fs/ext4/inode.c | 9 ++++++--- > > include/uapi/linux/stat.h | 1 + > > 3 files changed, 11 insertions(+), 3 deletions(-) > > > > diff --git a/block/bdev.c b/block/bdev.c > > index 4844d1e27b6f..29b0e5feb138 100644 > > --- a/block/bdev.c > > +++ b/block/bdev.c > > @@ -1304,6 +1304,10 @@ void bdev_statx(struct path *path, struct kstat *stat, > > queue_atomic_write_unit_max_bytes(bd_queue)); > > } > > > > + if (bdev_write_zeroes_unmap(bdev)) > > + stat->attributes |= STATX_ATTR_WRITE_ZEROES_UNMAP; > > + stat->attributes_mask |= STATX_ATTR_WRITE_ZEROES_UNMAP; > > Hmm, shouldn't this always be set by stat? But I might just be > really confused what attributes_mask is, and might in fact have > misapplied it in past patches of my own.. attributes_mask contains attribute flags known to the filesystem, whereas attributes contains flags actually set on the file. "known_attributes" would have been a better name, but that's water under the bridge. :P > Also shouldn't the patches to report the flag go into the bdev/ext4 > patches that actually implement the feature for the respective files > to keep bisectability? /I/ think so... --D
On 2025/5/5 22:29, Darrick J. Wong wrote: > On Mon, May 05, 2025 at 03:22:08PM +0200, Christoph Hellwig wrote: >> On Mon, Apr 21, 2025 at 10:15:05AM +0800, Zhang Yi wrote: >>> From: Zhang Yi <yi.zhang@huawei.com> >>> >>> Add a new attribute flag to statx to determine whether a bdev or a file >>> supports the unmap write zeroes command. >>> >>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> >>> --- >>> block/bdev.c | 4 ++++ >>> fs/ext4/inode.c | 9 ++++++--- >>> include/uapi/linux/stat.h | 1 + >>> 3 files changed, 11 insertions(+), 3 deletions(-) >>> >>> diff --git a/block/bdev.c b/block/bdev.c >>> index 4844d1e27b6f..29b0e5feb138 100644 >>> --- a/block/bdev.c >>> +++ b/block/bdev.c >>> @@ -1304,6 +1304,10 @@ void bdev_statx(struct path *path, struct kstat *stat, >>> queue_atomic_write_unit_max_bytes(bd_queue)); >>> } >>> >>> + if (bdev_write_zeroes_unmap(bdev)) >>> + stat->attributes |= STATX_ATTR_WRITE_ZEROES_UNMAP; >>> + stat->attributes_mask |= STATX_ATTR_WRITE_ZEROES_UNMAP; >> >> Hmm, shouldn't this always be set by stat? But I might just be >> really confused what attributes_mask is, and might in fact have >> misapplied it in past patches of my own.. > > attributes_mask contains attribute flags known to the filesystem, > whereas attributes contains flags actually set on the file. > "known_attributes" would have been a better name, but that's water under > the bridge. :P > >> Also shouldn't the patches to report the flag go into the bdev/ext4 >> patches that actually implement the feature for the respective files >> to keep bisectability? > > /I/ think so... > OK, since this statx reporting flag is not strongly tied to FALLOC_FL_WRITE_ZEROES in vfs_fallocate(), I'll split this patch into three separate patches. Thanks, Yi.
On Tue, May 06, 2025 at 12:28:54PM +0800, Zhang Yi wrote: > OK, since this statx reporting flag is not strongly tied to > FALLOC_FL_WRITE_ZEROES in vfs_fallocate(), I'll split this patch into > three separate patches. I don't think that is the right thing to do do. Keep the flag addition here, and then report it in the ext4 and bdev patches adding FALLOC_FL_WRITE_ZEROES as the reporting should be consistent with the added support.
On Mon, May 05, 2025 at 07:29:45AM -0700, Darrick J. Wong wrote: > attributes_mask contains attribute flags known to the filesystem, > whereas attributes contains flags actually set on the file. > "known_attributes" would have been a better name, but that's water under > the bridge. :P Oooh. I think I was very confused at what this patch does, and what it does seems confused as well. The patch adds a new flag to the STATX_ATTR_* namespace, which historically was used for persistent on-disk flags like immutable, not the STATX_* namespace where I assumed it, and which has no support mask. Which seems really odd for a pure kernel feature. Then again it seems to follow STATX_ATTR_WRITE_ATOMIC which seems just as wrongly place unless I'm missing something?
On Tue, May 06, 2025 at 07:02:39AM +0200, Christoph Hellwig wrote: > On Mon, May 05, 2025 at 07:29:45AM -0700, Darrick J. Wong wrote: > > attributes_mask contains attribute flags known to the filesystem, > > whereas attributes contains flags actually set on the file. > > "known_attributes" would have been a better name, but that's water under > > the bridge. :P > > Oooh. I think I was very confused at what this patch does, and what > it does seems confused as well. > > The patch adds a new flag to the STATX_ATTR_* namespace, which > historically was used for persistent on-disk flags like immutable, > not the STATX_* namespace where I assumed it, and which has no > support mask. Which seems really odd for a pure kernel feature. > Then again it seems to follow STATX_ATTR_WRITE_ATOMIC which seems > just as wrongly place unless I'm missing something? I think STATX_* (i.e. not STATX_ATTR_*) flags have two purposes: 1) to declare that specific fields in struct statx actually have meaning, most notably in scenarios where zeroes are valid field contents; and 2) if filling out the field is expensive, userspace can elect not to have it filled by leaving the bit unset. I don't know how userspace is supposed to figure out which fields are expensive. STATX_ATTR_* are supposed to be reflect persistent inode state. I think STATX_ATTR_WRITE_ATOMIC is a (now unremovable) artifact of the era when we were going to have a new iflag and feature bit for all the new forcealign functionality. For XFS it's not necessary anymore because we always have software fallback and the statx::atomic_write_* fields being nonzero is sufficient to detect the functionality. (I'm confused about the whole premise of /this/ patch -- it's a "fast zeroing" fallocate flag that causes the *device* to unmap, so that the filesystem can preallocate and avoid unwritten extent conversions? What happens if the block device is thinp and it runs out of space? That seems antithetical to fallocate...) --D
On Mon, May 05, 2025 at 10:36:54PM -0700, Darrick J. Wong wrote: > I think STATX_* (i.e. not STATX_ATTR_*) flags have two purposes: 1) to > declare that specific fields in struct statx actually have meaning, most > notably in scenarios where zeroes are valid field contents; and 2) if > filling out the field is expensive, userspace can elect not to have it > filled by leaving the bit unset. I don't know how userspace is supposed > to figure out which fields are expensive. Yes. > (I'm confused about the whole premise of /this/ patch -- it's a "fast > zeroing" fallocate flag that causes the *device* to unmap, so that the > filesystem can preallocate and avoid unwritten extent conversions? Yes. > What happens if the block device is thinp and it runs out of space? > That seems antithetical to fallocate...) While the origin posix_fallocate was about space preallocatіon, these days fallocate seems to be more about extent layout and/or fast zeroing. I'm not a huge fan of either this or the hardware atomics as they force a FTL layer world view which is quite ingrained but also rather stupid, but some folks really want to go down there full throttle, so..
On 2025/5/6 12:39, Christoph Hellwig wrote: > On Tue, May 06, 2025 at 12:28:54PM +0800, Zhang Yi wrote: >> OK, since this statx reporting flag is not strongly tied to >> FALLOC_FL_WRITE_ZEROES in vfs_fallocate(), I'll split this patch into >> three separate patches. > > I don't think that is the right thing to do do. Keep the flag addition > here, and then report it in the ext4 and bdev patches adding > FALLOC_FL_WRITE_ZEROES as the reporting should be consistent with > the added support. > Sorry, but I don't understand your suggestion. The STATX_ATTR_WRITE_ZEROES_UNMAP attribute only indicate whether the bdev and the block device that under the specified file support unmap write zeroes commoand. It does not reflect whether the bdev and the filesystems support FALLOC_FL_WRITE_ZEROES. The implementation of FALLOC_FL_WRITE_ZEROES doesn't fully rely on the unmap write zeroes commoand now, users simply refer to this attribute flag to determine whether to use FALLOC_FL_WRITE_ZEROES when preallocating a file. So, STATX_ATTR_WRITE_ZEROES_UNMAP and FALLOC_FL_WRITE_ZEROES doesn't have strong relations, why do you suggested to put this into the ext4 and bdev patches that adding FALLOC_FL_WRITE_ZEROES? Thanks, Yi.
On 2025/5/6 13:47, Christoph Hellwig wrote: > On Mon, May 05, 2025 at 10:36:54PM -0700, Darrick J. Wong wrote: >> I think STATX_* (i.e. not STATX_ATTR_*) flags have two purposes: 1) to >> declare that specific fields in struct statx actually have meaning, most >> notably in scenarios where zeroes are valid field contents; and 2) if >> filling out the field is expensive, userspace can elect not to have it >> filled by leaving the bit unset. I don't know how userspace is supposed >> to figure out which fields are expensive. > > Yes. > IIUC, it seems I was misled by STATX_ATTR_WRITE_ATOMIC, adding this STATX_ATTR_WRITE_ZEROES_UNMAP attribute flag is incorrect. The right approach should be to add STATX_WRITE_ZEROES_UNMAP, setting it in the result_mask if the request_mask includes this flag and bdev_write_zeroes_unmap(bdev) returns true. Something like below. Is my understanding right? diff --git a/block/bdev.c b/block/bdev.c index 4ba48b8735e7..e1367f30dbce 100644 --- a/block/bdev.c +++ b/block/bdev.c @@ -1303,9 +1303,9 @@ void bdev_statx(const struct path *path, struct kstat *stat, u32 request_mask) queue_atomic_write_unit_max_bytes(bd_queue)); } + if (request_mask & STATX_WRITE_ZEROES_UNMAP && + bdev_write_zeroes_unmap(bdev)) + stat->result_mask |= STATX_WRITE_ZEROES_UNMAP; stat->blksize = bdev_io_min(bdev); Thanks, Yi.
On Tue, May 06, 2025 at 07:25:06PM +0800, Zhang Yi wrote: > + if (request_mask & STATX_WRITE_ZEROES_UNMAP && > + bdev_write_zeroes_unmap(bdev)) > + stat->result_mask |= STATX_WRITE_ZEROES_UNMAP; That would be my expectation. But then again this area seems to confuse me a lot, so maybe we'll get Christian or Dave to chim in.
On Tue, May 06, 2025 at 07:16:56PM +0800, Zhang Yi wrote: > Sorry, but I don't understand your suggestion. The > STATX_ATTR_WRITE_ZEROES_UNMAP attribute only indicate whether the bdev > and the block device that under the specified file support unmap write > zeroes commoand. It does not reflect whether the bdev and the > filesystems support FALLOC_FL_WRITE_ZEROES. The implementation of > FALLOC_FL_WRITE_ZEROES doesn't fully rely on the unmap write zeroes > commoand now, users simply refer to this attribute flag to determine > whether to use FALLOC_FL_WRITE_ZEROES when preallocating a file. > So, STATX_ATTR_WRITE_ZEROES_UNMAP and FALLOC_FL_WRITE_ZEROES doesn't > have strong relations, why do you suggested to put this into the ext4 > and bdev patches that adding FALLOC_FL_WRITE_ZEROES? So what is the point of STATX_ATTR_WRITE_ZEROES_UNMAP?
On Tue, May 06, 2025 at 02:10:12PM +0200, Christoph Hellwig wrote: > On Tue, May 06, 2025 at 07:25:06PM +0800, Zhang Yi wrote: > > + if (request_mask & STATX_WRITE_ZEROES_UNMAP && > > + bdev_write_zeroes_unmap(bdev)) > > + stat->result_mask |= STATX_WRITE_ZEROES_UNMAP; > > That would be my expectation. But then again this area seems to > confuse me a lot, so maybe we'll get Christian or Dave to chim in. Um... does STATX_WRITE_ZEROES_UNMAP protect a field somewhere? It might be nice to expose the request alignment granularity/max size/etc. Or does this flag exist solely to support discovering that FALLOC_FL_WRITE_ZEROES is supported? In which case, why not discover its existence by calling fallocate(fd, WRITE_ZEROES, 0, 0) like the other modes? --D
On 2025/5/6 20:11, Christoph Hellwig wrote: > On Tue, May 06, 2025 at 07:16:56PM +0800, Zhang Yi wrote: >> Sorry, but I don't understand your suggestion. The >> STATX_ATTR_WRITE_ZEROES_UNMAP attribute only indicate whether the bdev >> and the block device that under the specified file support unmap write >> zeroes commoand. It does not reflect whether the bdev and the >> filesystems support FALLOC_FL_WRITE_ZEROES. The implementation of >> FALLOC_FL_WRITE_ZEROES doesn't fully rely on the unmap write zeroes >> commoand now, users simply refer to this attribute flag to determine >> whether to use FALLOC_FL_WRITE_ZEROES when preallocating a file. >> So, STATX_ATTR_WRITE_ZEROES_UNMAP and FALLOC_FL_WRITE_ZEROES doesn't >> have strong relations, why do you suggested to put this into the ext4 >> and bdev patches that adding FALLOC_FL_WRITE_ZEROES? > > So what is the point of STATX_ATTR_WRITE_ZEROES_UNMAP? My idea is not to strictly limiting the use of FALLOC_FL_WRITE_ZEROES to only bdev or files where bdev_unmap_write_zeroes() returns true. In other words, STATX_ATTR_WRITE_ZEROES_UNMAP and FALLOC_FL_WRITE_ZEROES are not consistent, they are two independent features. Even if some devices STATX_ATTR_WRITE_ZEROES_UNMAP are not set, users should still be allowed to call fallcoate(FALLOC_FL_WRITE_ZEROES). This is because some devices and drivers currently cannot reliably ascertain whether they support the unmap write zero command; however, certain devices, such as specific cloud storage devices, do support it. Users of these devices may also wish to use FALLOC_FL_WRITE_ZEROES to expedite the zeroing process. Therefore, I think that the current point of STATX_ATTR_WRITE_ZEROES_UNMAP (possibly STATX_WRITE_ZEROES_UNMAP) should be to just indicate whether a bdev or file supports the unmap write zero command (i.e., whether bdev_unmap_write_zeroes() returns true). If we use standard SCSI and NVMe storage devices, and the STATX_ATTR_WRITE_ZEROES_UNMAP attribute is set, users can be assured that FALLOC_FL_WRITE_ZEROES is fast and can choose to use fallocate(FALLOC_FL_WRITE_ZEROES) immediately. Would you prefer to make STATX_ATTR_WRITE_ZEROES_UNMAP and FALLOC_FL_WRITE_ZEROES consistent, which means fallcoate(FALLOC_FL_WRITE_ZEROES) will return -EOPNOTSUPP if the block device doesn't set STATX_ATTR_WRITE_ZEROES_UNMAP ? If so, I'd suggested we need to: 1) Remove STATX_ATTR_WRITE_ZEROES_UNMAP since users can check the existence by calling fallocate(FALLOC_FL_WRITE_ZEROES) directly, this statx flag seems useless. 2) Make the BLK_FEAT_WRITE_ZEROES_UNMAP sysfs interface to RW, allowing users to adjust the block device's support state according to the real situation. Thanks, Yi.
On 2025/5/6 23:55, Darrick J. Wong wrote: > On Tue, May 06, 2025 at 02:10:12PM +0200, Christoph Hellwig wrote: >> On Tue, May 06, 2025 at 07:25:06PM +0800, Zhang Yi wrote: >>> + if (request_mask & STATX_WRITE_ZEROES_UNMAP && >>> + bdev_write_zeroes_unmap(bdev)) >>> + stat->result_mask |= STATX_WRITE_ZEROES_UNMAP; >> >> That would be my expectation. But then again this area seems to >> confuse me a lot, so maybe we'll get Christian or Dave to chim in. > > Um... does STATX_WRITE_ZEROES_UNMAP protect a field somewhere? > It might be nice to expose the request alignment granularity/max > size/etc. I think that simply returning the support state is sufficient at the moment. __blkdev_issue_write_zeroes() will send write zeroes through multiple iterations, and there are no specific restrictions on the parameters provided by users. > Or does this flag exist solely to support discovering that > FALLOC_FL_WRITE_ZEROES is supported? In which case, why not discover > its existence by calling fallocate(fd, WRITE_ZEROES, 0, 0) like the > other modes? > Current STATX_WRITE_ZEROES_UNMAP and FALLOC_FL_WRITE_ZEROES are inconsistent, we allow users to call fallocate(FALLOC_FL_WRITE_ZEROES) on files that STATX_WRITE_ZEROES_UNMAP is not set. Users can check whether the device supports unmap write zeroes through STATX_WRITE_ZEROES_UNMAP and then decide to call fallocate(FALLOC_FL_WRITE_ZEROES) if it is supported. Please see this explanation for details. https://lore.kernel.org/linux-fsdevel/20250421021509.2366003-1-yi.zhang@huaweicloud.com/T/#mc1618822bc27d486296216fc1643d5531fee03e1 However, removing STATX_WRITE_ZEROES_UNMAP also seems good to me(Perhaps it would be better.).It means we do not allow to call fallocate(FALLOC_FL_WRITE_ZEROES) if the device does not explicitly support unmap write zeroes. Thanks, Yi.
On Wed, May 07, 2025 at 03:33:23PM +0800, Zhang Yi wrote: > On 2025/5/6 20:11, Christoph Hellwig wrote: > > On Tue, May 06, 2025 at 07:16:56PM +0800, Zhang Yi wrote: > >> Sorry, but I don't understand your suggestion. The > >> STATX_ATTR_WRITE_ZEROES_UNMAP attribute only indicate whether the bdev > >> and the block device that under the specified file support unmap write > >> zeroes commoand. It does not reflect whether the bdev and the > >> filesystems support FALLOC_FL_WRITE_ZEROES. The implementation of > >> FALLOC_FL_WRITE_ZEROES doesn't fully rely on the unmap write zeroes > >> commoand now, users simply refer to this attribute flag to determine > >> whether to use FALLOC_FL_WRITE_ZEROES when preallocating a file. > >> So, STATX_ATTR_WRITE_ZEROES_UNMAP and FALLOC_FL_WRITE_ZEROES doesn't > >> have strong relations, why do you suggested to put this into the ext4 > >> and bdev patches that adding FALLOC_FL_WRITE_ZEROES? > > > > So what is the point of STATX_ATTR_WRITE_ZEROES_UNMAP? > > My idea is not to strictly limiting the use of FALLOC_FL_WRITE_ZEROES to > only bdev or files where bdev_unmap_write_zeroes() returns true. In > other words, STATX_ATTR_WRITE_ZEROES_UNMAP and FALLOC_FL_WRITE_ZEROES > are not consistent, they are two independent features. Even if some > devices STATX_ATTR_WRITE_ZEROES_UNMAP are not set, users should still be > allowed to call fallcoate(FALLOC_FL_WRITE_ZEROES). This is because some > devices and drivers currently cannot reliably ascertain whether they > support the unmap write zero command; however, certain devices, such as > specific cloud storage devices, do support it. Users of these devices > may also wish to use FALLOC_FL_WRITE_ZEROES to expedite the zeroing > process. > > Therefore, I think that the current point of > STATX_ATTR_WRITE_ZEROES_UNMAP (possibly STATX_WRITE_ZEROES_UNMAP) should > be to just indicate whether a bdev or file supports the unmap write zero > command (i.e., whether bdev_unmap_write_zeroes() returns true). If we > use standard SCSI and NVMe storage devices, and the > STATX_ATTR_WRITE_ZEROES_UNMAP attribute is set, users can be assured > that FALLOC_FL_WRITE_ZEROES is fast and can choose to use > fallocate(FALLOC_FL_WRITE_ZEROES) immediately. > > Would you prefer to make STATX_ATTR_WRITE_ZEROES_UNMAP and > FALLOC_FL_WRITE_ZEROES consistent, which means > fallcoate(FALLOC_FL_WRITE_ZEROES) will return -EOPNOTSUPP if the block > device doesn't set STATX_ATTR_WRITE_ZEROES_UNMAP ? > > If so, I'd suggested we need to: > 1) Remove STATX_ATTR_WRITE_ZEROES_UNMAP since users can check the > existence by calling fallocate(FALLOC_FL_WRITE_ZEROES) directly, this > statx flag seems useless. > 2) Make the BLK_FEAT_WRITE_ZEROES_UNMAP sysfs interface to RW, allowing > users to adjust the block device's support state according to the > real situation. Sounds fine to me... ;) --D > Thanks, > Yi. > >
On Wed, May 07, 2025 at 03:33:23PM +0800, Zhang Yi wrote: > On 2025/5/6 20:11, Christoph Hellwig wrote: > > On Tue, May 06, 2025 at 07:16:56PM +0800, Zhang Yi wrote: > >> Sorry, but I don't understand your suggestion. The > >> STATX_ATTR_WRITE_ZEROES_UNMAP attribute only indicate whether the bdev > >> and the block device that under the specified file support unmap write > >> zeroes commoand. It does not reflect whether the bdev and the > >> filesystems support FALLOC_FL_WRITE_ZEROES. The implementation of > >> FALLOC_FL_WRITE_ZEROES doesn't fully rely on the unmap write zeroes > >> commoand now, users simply refer to this attribute flag to determine > >> whether to use FALLOC_FL_WRITE_ZEROES when preallocating a file. > >> So, STATX_ATTR_WRITE_ZEROES_UNMAP and FALLOC_FL_WRITE_ZEROES doesn't > >> have strong relations, why do you suggested to put this into the ext4 > >> and bdev patches that adding FALLOC_FL_WRITE_ZEROES? > > > > So what is the point of STATX_ATTR_WRITE_ZEROES_UNMAP? > > My idea is not to strictly limiting the use of FALLOC_FL_WRITE_ZEROES to > only bdev or files where bdev_unmap_write_zeroes() returns true. In > other words, STATX_ATTR_WRITE_ZEROES_UNMAP and FALLOC_FL_WRITE_ZEROES > are not consistent, they are two independent features. Even if some > devices STATX_ATTR_WRITE_ZEROES_UNMAP are not set, users should still be > allowed to call fallcoate(FALLOC_FL_WRITE_ZEROES). This is because some > devices and drivers currently cannot reliably ascertain whether they > support the unmap write zero command; however, certain devices, such as > specific cloud storage devices, do support it. Users of these devices > may also wish to use FALLOC_FL_WRITE_ZEROES to expedite the zeroing > process. What are those "cloud storage devices" where you set it reliably, i.e.g what drivers? > Therefore, I think that the current point of > STATX_ATTR_WRITE_ZEROES_UNMAP (possibly STATX_WRITE_ZEROES_UNMAP) should > be to just indicate whether a bdev or file supports the unmap write zero > command (i.e., whether bdev_unmap_write_zeroes() returns true). If we > use standard SCSI and NVMe storage devices, and the > STATX_ATTR_WRITE_ZEROES_UNMAP attribute is set, users can be assured > that FALLOC_FL_WRITE_ZEROES is fast and can choose to use > fallocate(FALLOC_FL_WRITE_ZEROES) immediately. That's breaking the abstracton again. An attribute must say something about the specific file, not about some underlying semi-related feature. > Would you prefer to make STATX_ATTR_WRITE_ZEROES_UNMAP and > FALLOC_FL_WRITE_ZEROES consistent, which means > fallcoate(FALLOC_FL_WRITE_ZEROES) will return -EOPNOTSUPP if the block > device doesn't set STATX_ATTR_WRITE_ZEROES_UNMAP ? Not sure where the block device comes from here, both of these operate on a file. > If so, I'd suggested we need to: > 1) Remove STATX_ATTR_WRITE_ZEROES_UNMAP since users can check the > existence by calling fallocate(FALLOC_FL_WRITE_ZEROES) directly, this > statx flag seems useless. Yes, that was my inital thought. > 2) Make the BLK_FEAT_WRITE_ZEROES_UNMAP sysfs interface to RW, allowing > users to adjust the block device's support state according to the > real situation. No, it's a feature and not a flag.
On 2025/5/8 13:01, Christoph Hellwig wrote: > On Wed, May 07, 2025 at 03:33:23PM +0800, Zhang Yi wrote: >> On 2025/5/6 20:11, Christoph Hellwig wrote: >>> On Tue, May 06, 2025 at 07:16:56PM +0800, Zhang Yi wrote: >>>> Sorry, but I don't understand your suggestion. The >>>> STATX_ATTR_WRITE_ZEROES_UNMAP attribute only indicate whether the bdev >>>> and the block device that under the specified file support unmap write >>>> zeroes commoand. It does not reflect whether the bdev and the >>>> filesystems support FALLOC_FL_WRITE_ZEROES. The implementation of >>>> FALLOC_FL_WRITE_ZEROES doesn't fully rely on the unmap write zeroes >>>> commoand now, users simply refer to this attribute flag to determine >>>> whether to use FALLOC_FL_WRITE_ZEROES when preallocating a file. >>>> So, STATX_ATTR_WRITE_ZEROES_UNMAP and FALLOC_FL_WRITE_ZEROES doesn't >>>> have strong relations, why do you suggested to put this into the ext4 >>>> and bdev patches that adding FALLOC_FL_WRITE_ZEROES? >>> >>> So what is the point of STATX_ATTR_WRITE_ZEROES_UNMAP? >> >> My idea is not to strictly limiting the use of FALLOC_FL_WRITE_ZEROES to >> only bdev or files where bdev_unmap_write_zeroes() returns true. In >> other words, STATX_ATTR_WRITE_ZEROES_UNMAP and FALLOC_FL_WRITE_ZEROES >> are not consistent, they are two independent features. Even if some >> devices STATX_ATTR_WRITE_ZEROES_UNMAP are not set, users should still be >> allowed to call fallcoate(FALLOC_FL_WRITE_ZEROES). This is because some >> devices and drivers currently cannot reliably ascertain whether they >> support the unmap write zero command; however, certain devices, such as >> specific cloud storage devices, do support it. Users of these devices >> may also wish to use FALLOC_FL_WRITE_ZEROES to expedite the zeroing >> process. > > What are those "cloud storage devices" where you set it reliably, > i.e.g what drivers? I don't have these 'cloud storage devices' now, but Ted had mentioned those cloud-emulated block devices such as Google's Persistent Desk or Amazon's Elastic Block Device in. I'm not sure if they can accurately report the BLK_FEAT_WRITE_ZEROES_UNMAP feature, maybe Ted can give more details. https://lore.kernel.org/linux-fsdevel/20250106161732.GG1284777@mit.edu/ > >> Therefore, I think that the current point of >> STATX_ATTR_WRITE_ZEROES_UNMAP (possibly STATX_WRITE_ZEROES_UNMAP) should >> be to just indicate whether a bdev or file supports the unmap write zero >> command (i.e., whether bdev_unmap_write_zeroes() returns true). If we >> use standard SCSI and NVMe storage devices, and the >> STATX_ATTR_WRITE_ZEROES_UNMAP attribute is set, users can be assured >> that FALLOC_FL_WRITE_ZEROES is fast and can choose to use >> fallocate(FALLOC_FL_WRITE_ZEROES) immediately. > > That's breaking the abstracton again. An attribute must say something > about the specific file, not about some underlying semi-related feature. OK. > >> Would you prefer to make STATX_ATTR_WRITE_ZEROES_UNMAP and >> FALLOC_FL_WRITE_ZEROES consistent, which means >> fallcoate(FALLOC_FL_WRITE_ZEROES) will return -EOPNOTSUPP if the block >> device doesn't set STATX_ATTR_WRITE_ZEROES_UNMAP ? > > Not sure where the block device comes from here, both of these operate > on a file. I am referring to the block device on which the filesystem is mounted. The support status of the file is directly dependent on this block device. > >> If so, I'd suggested we need to: >> 1) Remove STATX_ATTR_WRITE_ZEROES_UNMAP since users can check the >> existence by calling fallocate(FALLOC_FL_WRITE_ZEROES) directly, this >> statx flag seems useless. > > Yes, that was my inital thought. > >> 2) Make the BLK_FEAT_WRITE_ZEROES_UNMAP sysfs interface to RW, allowing >> users to adjust the block device's support state according to the >> real situation. > > No, it's a feature and not a flag. > I am a bit confused about the feature and the flag, I checked the other features, and it appears that features such as BLK_FEAT_ROTATIONAL allow to be modified, is this flexibility due to historical reasons or for the convenience of testing? Think about this again, I suppose we should keep the BLK_FEAT_WRITE_ZEROES_UNMAP as read-only and add a new flag, BLK_FALG_WRITE_ZEROES_UNMAP_DISABLED, to disable the FALLOC_FL_WRITE_ZEROES. Since the Write Zeroes does not guarantee performance, and some devices may claim to support **UNMAP** Write Zeroes but exhibit extremely slow write-zeroes speeds. Users may want be able to disable it. Thoughts? Thanks, Yi.
On Thu, May 08, 2025 at 08:17:14PM +0800, Zhang Yi wrote: > On 2025/5/8 13:01, Christoph Hellwig wrote: > >> > >> My idea is not to strictly limiting the use of FALLOC_FL_WRITE_ZEROES to > >> only bdev or files where bdev_unmap_write_zeroes() returns true. In > >> other words, STATX_ATTR_WRITE_ZEROES_UNMAP and FALLOC_FL_WRITE_ZEROES > >> are not consistent, they are two independent features. Even if some > >> devices STATX_ATTR_WRITE_ZEROES_UNMAP are not set, users should still be > >> allowed to call fallcoate(FALLOC_FL_WRITE_ZEROES). This is because some > >> devices and drivers currently cannot reliably ascertain whether they > >> support the unmap write zero command; however, certain devices, such as > >> specific cloud storage devices, do support it. Users of these devices > >> may also wish to use FALLOC_FL_WRITE_ZEROES to expedite the zeroing > >> process. > > > > What are those "cloud storage devices" where you set it reliably, > > i.e.g what drivers? > > I don't have these 'cloud storage devices' now, but Ted had mentioned > those cloud-emulated block devices such as Google's Persistent Desk or > Amazon's Elastic Block Device in. I'm not sure if they can accurately > report the BLK_FEAT_WRITE_ZEROES_UNMAP feature, maybe Ted can give more > details. > > https://lore.kernel.org/linux-fsdevel/20250106161732.GG1284777@mit.edu/ There's nothing really exotic about what I was referring to in terms of "cloud storage devices". Perhaps a better way of describing them is to consider devices such as dm-thin, or a Ceph Block Device, which is being exposed as a SCSI or NVME device. The distinction I was trying to make is performance-related. Suppose you call WRITE_ZEROS on a 14TB region. After the WRITES_ZEROS complete, a read anywhere on that 14TB region will return zeros. That's easy. But the question is when you call WRITE_ZEROS, will the storage device (a) go away for a day or more before it completes (which would be the case if it is a traditional spinning rust platter), or (b) will it be basically instaneous, because all dm-thin or a Ceph Block Device needs to do is to delete one or more entries in its mapping table. The problem is two-fold. First, there's no way for the kernel to know whether a storage device will behave as (a) or (b), because SCSI and other storage specifications say that performance is out of scope. They only talk about the functional results (afterwards, if yout try to read from the region, you will get zeros), and are utterly silent about how long it migt take. The second problem is that if you are an application program, there is no way you will be willing to call fallocate(WRITE_ZEROS, 14TB) if you don't know whether the disk will go away for a day or whether it will be instaneous. But because there is no way for the kernel to know whether WRITE_ZEROS will be fast or not, how would you expect the kernel to expose STATX_ATTR_WRITE_ZEROES_UNMAP? Cristoph's formulation "breaking the abstraction" perfectly encapsulate the SCSI specification's position on the matter, and I agree it's a valid position. It's just not terribly useful for the application programmer. Things which some programs/users might want to know or rely upon, but which is normally quite impossible are: * Will the write zero / discard operation take a "reasonable" amount of time? (Yes, not necessarilly well defined, but we know it when we see it, and hours or days is generally not reasonable.) * Is the operation reliable --- i.e., is the device allowed to randomly decide that it won't actually zero the requested blocks (as is the case of discard) whenever it feels like it. * Is the operation guaranteed to make the data irretreviable even in face of an attacker with low-level access to the device. (And this is also not necessarily well defined; does the attacker have access to a scanning electronic microscope, or can do a liquid nitrogen destructive access of the flash device?) The UFS (Universal Flash Storage) spec comes the closest to providing commands that distinguish between these various cases, but for most storage specifications, like SCSI, it is absolutely requires peaking behind the abstraction barrier defined by the specification, and so ultimately, the kernel can't know. About the best you can do is to require manual configuration; perhaps a config file at the database or userspace cluster file system level because the system adminsitrator knows --- maybe because the hyperscale cloud provider has leaned on the storage vendor to tell them under NDA, storage specs be damned or they won't spend $$$ millions with that storage vendor --- or because the database administrator discovers that using fallocate(WRITE_ZEROS) causes performance to tank, so they manually disable the use of WRITE_ZEROS. Could this be done in the kernel? Sure. We could have a file, say, /sys/block/sdXX/queue/write_zeros where the write_zeros file is writeable, and so the administrator can force-disable WRITES_ZERO by writing 0 into the file. And could this be queried via a STATX attribute? I suppose, although to be honest, I'm used to doing this by looking at the sysfs files. For example, just recently I coded up the following: static int is_rotational (const char *device_name EXT2FS_ATTR((unused))) { int rotational = -1; #ifdef __linux__ char path[1024]; struct stat st; FILE *f; if ((stat(device_name, &st) < 0) || !S_ISBLK(st.st_mode)) return -1; snprintf(path, sizeof(path), "/sys/dev/block/%d:%d/queue/rotational", major(st.st_rdev), minor(st.st_rdev)); f = fopen(path, "r"); if (!f) { snprintf(path, sizeof(path), "/sys/dev/block/%d:%d/../queue/rotational", major(st.st_rdev), minor(st.st_rdev)); f = fopen(path, "r"); } if (f) { if (fscanf(f, "%d", &rotational) != 1) rotational = -1; fclose(f); } #endif return rotational; } Easy-peasy! Who needs statx? :-) - Ted
On 2025/5/9 4:24, Theodore Ts'o wrote: > On Thu, May 08, 2025 at 08:17:14PM +0800, Zhang Yi wrote: >> On 2025/5/8 13:01, Christoph Hellwig wrote: >>>> >>>> My idea is not to strictly limiting the use of FALLOC_FL_WRITE_ZEROES to >>>> only bdev or files where bdev_unmap_write_zeroes() returns true. In >>>> other words, STATX_ATTR_WRITE_ZEROES_UNMAP and FALLOC_FL_WRITE_ZEROES >>>> are not consistent, they are two independent features. Even if some >>>> devices STATX_ATTR_WRITE_ZEROES_UNMAP are not set, users should still be >>>> allowed to call fallcoate(FALLOC_FL_WRITE_ZEROES). This is because some >>>> devices and drivers currently cannot reliably ascertain whether they >>>> support the unmap write zero command; however, certain devices, such as >>>> specific cloud storage devices, do support it. Users of these devices >>>> may also wish to use FALLOC_FL_WRITE_ZEROES to expedite the zeroing >>>> process. >>> >>> What are those "cloud storage devices" where you set it reliably, >>> i.e.g what drivers? >> >> I don't have these 'cloud storage devices' now, but Ted had mentioned >> those cloud-emulated block devices such as Google's Persistent Desk or >> Amazon's Elastic Block Device in. I'm not sure if they can accurately >> report the BLK_FEAT_WRITE_ZEROES_UNMAP feature, maybe Ted can give more >> details. >> >> https://lore.kernel.org/linux-fsdevel/20250106161732.GG1284777@mit.edu/ > > There's nothing really exotic about what I was referring to in terms > of "cloud storage devices". Perhaps a better way of describing them > is to consider devices such as dm-thin, or a Ceph Block Device, which > is being exposed as a SCSI or NVME device. OK, then correctly reporting the BLK_FEAT_WRITE_ZEROES_UNMAP feature should no longer be a major problem. It seems that we do not need to pay much attention to enabling this feature manually. > > The distinction I was trying to make is performance-related. Suppose > you call WRITE_ZEROS on a 14TB region. After the WRITES_ZEROS > complete, a read anywhere on that 14TB region will return zeros. > That's easy. But the question is when you call WRITE_ZEROS, will the > storage device (a) go away for a day or more before it completes (which > would be the case if it is a traditional spinning rust platter), or > (b) will it be basically instaneous, because all dm-thin or a Ceph Block > Device needs to do is to delete one or more entries in its mapping > table. Yes. > > The problem is two-fold. First, there's no way for the kernel to know > whether a storage device will behave as (a) or (b), because SCSI and > other storage specifications say that performance is out of scope. > They only talk about the functional results (afterwards, if yout try > to read from the region, you will get zeros), and are utterly silent > about how long it migt take. The second problem is that if you are an > application program, there is no way you will be willing to call > fallocate(WRITE_ZEROS, 14TB) if you don't know whether the disk will > go away for a day or whether it will be instaneous. > > But because there is no way for the kernel to know whether WRITE_ZEROS > will be fast or not, how would you expect the kernel to expose > STATX_ATTR_WRITE_ZEROES_UNMAP? Cristoph's formulation "breaking the > abstraction" perfectly encapsulate the SCSI specification's position > on the matter, and I agree it's a valid position. It's just not > terribly useful for the application programmer. Yes. > > Things which some programs/users might want to know or rely upon, but which is normally quite impossible are: > > * Will the write zero / discard operation take a "reasonable" amount > of time? (Yes, not necessarilly well defined, but we know it when > we see it, and hours or days is generally not reasonable.) > > * Is the operation reliable --- i.e., is the device allowed to > randomly decide that it won't actually zero the requested blocks (as > is the case of discard) whenever it feels like it. > > * Is the operation guaranteed to make the data irretreviable even in > face of an attacker with low-level access to the device. (And this > is also not necessarily well defined; does the attacker have access > to a scanning electronic microscope, or can do a liquid nitrogen > destructive access of the flash device?) Yes. > > The UFS (Universal Flash Storage) spec comes the closest to providing > commands that distinguish between these various cases, but for most > storage specifications, like SCSI, it is absolutely requires peaking > behind the abstraction barrier defined by the specification, and so > ultimately, the kernel can't know. > > About the best you can do is to require manual configuration; perhaps a > config file at the database or userspace cluster file system level > because the system adminsitrator knows --- maybe because the hyperscale > cloud provider has leaned on the storage vendor to tell them under > NDA, storage specs be damned or they won't spend $$$ millions with > that storage vendor --- or because the database administrator discovers > that using fallocate(WRITE_ZEROS) causes performance to tank, so they > manually disable the use of WRITE_ZEROS. Yes, this is indeed what we should consider. > > Could this be done in the kernel? Sure. We could have a file, say, > /sys/block/sdXX/queue/write_zeros where the write_zeros file is > writeable, and so the administrator can force-disable WRITES_ZERO by > writing 0 into the file. And could this be queried via a STATX > attribute? I suppose, although to be honest, I'm used to doing this > by looking at the sysfs files. For example, just recently I coded up > the following: > > static int is_rotational (const char *device_name EXT2FS_ATTR((unused))) > { > int rotational = -1; > #ifdef __linux__ > char path[1024]; > struct stat st; > FILE *f; > > if ((stat(device_name, &st) < 0) || !S_ISBLK(st.st_mode)) > return -1; > > snprintf(path, sizeof(path), "/sys/dev/block/%d:%d/queue/rotational", > major(st.st_rdev), minor(st.st_rdev)); > f = fopen(path, "r"); > if (!f) { > snprintf(path, sizeof(path), > "/sys/dev/block/%d:%d/../queue/rotational", > major(st.st_rdev), minor(st.st_rdev)); > f = fopen(path, "r"); > } > if (f) { > if (fscanf(f, "%d", &rotational) != 1) > rotational = -1; > fclose(f); > } > #endif > return rotational; > } > > Easy-peasy! Who needs statx? :-) > Yes. as I replied earlier, I'm going to implement this with a new flag, BLK_FALG_WRITE_ZEROES_UNMAP_DISABLED, similar to the existing BLK_FLAG_WRITE_CACHE_DISABLED. Make /sys/block/<disk>/queue/write_zeroes_unmap to read-write. Regarding whether to rename it to 'write_zeroes', I need to reconsider, as the naming aligns perfectly with FALLOC_FL_WRITE_ZEROES, but the **UNMAP** semantics cannot be adequately expressed. Thank you for your detailed explanation and suggestions! Best regards. Yi.
diff --git a/block/bdev.c b/block/bdev.c index 4844d1e27b6f..29b0e5feb138 100644 --- a/block/bdev.c +++ b/block/bdev.c @@ -1304,6 +1304,10 @@ void bdev_statx(struct path *path, struct kstat *stat, queue_atomic_write_unit_max_bytes(bd_queue)); } + if (bdev_write_zeroes_unmap(bdev)) + stat->attributes |= STATX_ATTR_WRITE_ZEROES_UNMAP; + stat->attributes_mask |= STATX_ATTR_WRITE_ZEROES_UNMAP; + stat->blksize = bdev_io_min(bdev); blkdev_put_no_open(bdev); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 94c7d2d828a6..38caf2f39c6d 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -5653,6 +5653,7 @@ int ext4_getattr(struct mnt_idmap *idmap, const struct path *path, struct inode *inode = d_inode(path->dentry); struct ext4_inode *raw_inode; struct ext4_inode_info *ei = EXT4_I(inode); + struct block_device *bdev = inode->i_sb->s_bdev; unsigned int flags; if ((request_mask & STATX_BTIME) && @@ -5672,8 +5673,6 @@ int ext4_getattr(struct mnt_idmap *idmap, const struct path *path, stat->result_mask |= STATX_DIOALIGN; if (dio_align == 1) { - struct block_device *bdev = inode->i_sb->s_bdev; - /* iomap defaults */ stat->dio_mem_align = bdev_dma_alignment(bdev) + 1; stat->dio_offset_align = bdev_logical_block_size(bdev); @@ -5695,6 +5694,9 @@ int ext4_getattr(struct mnt_idmap *idmap, const struct path *path, generic_fill_statx_atomic_writes(stat, awu_min, awu_max); } + if (S_ISREG(inode->i_mode) && bdev_write_zeroes_unmap(bdev)) + stat->attributes |= STATX_ATTR_WRITE_ZEROES_UNMAP; + flags = ei->i_flags & EXT4_FL_USER_VISIBLE; if (flags & EXT4_APPEND_FL) stat->attributes |= STATX_ATTR_APPEND; @@ -5714,7 +5716,8 @@ int ext4_getattr(struct mnt_idmap *idmap, const struct path *path, STATX_ATTR_ENCRYPTED | STATX_ATTR_IMMUTABLE | STATX_ATTR_NODUMP | - STATX_ATTR_VERITY); + STATX_ATTR_VERITY | + STATX_ATTR_WRITE_ZEROES_UNMAP); generic_fillattr(idmap, request_mask, inode, stat); return 0; diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h index f78ee3670dd5..279ce7e34df7 100644 --- a/include/uapi/linux/stat.h +++ b/include/uapi/linux/stat.h @@ -251,6 +251,7 @@ struct statx { #define STATX_ATTR_VERITY 0x00100000 /* [I] Verity protected file */ #define STATX_ATTR_DAX 0x00200000 /* File is currently in DAX state */ #define STATX_ATTR_WRITE_ATOMIC 0x00400000 /* File supports atomic write operations */ +#define STATX_ATTR_WRITE_ZEROES_UNMAP 0x00800000 /* File supports unmap write zeroes */ #endif /* _UAPI_LINUX_STAT_H */