[v3,00/30] Zone write plugging

Message ID	20240328004409.594888-1-dlemoal@kernel.org
Headers	show Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id F01EB1BDEB; Thu, 28 Mar 2024 00:44:12 +0000 (UTC) From: Damien Le Moal <dlemoal@kernel.org> To: linux-block@vger.kernel.org, Jens Axboe <axboe@kernel.dk>, linux-scsi@vger.kernel.org, "Martin K . Petersen" <martin.petersen@oracle.com>, dm-devel@lists.linux.dev, Mike Snitzer <snitzer@redhat.com>, linux-nvme@lists.infradead.org, Keith Busch <kbusch@kernel.org>, Christoph Hellwig <hch@lst.de> Subject: [PATCH v3 00/30] Zone write plugging Date: Thu, 28 Mar 2024 09:43:39 +0900 Message-ID: <20240328004409.594888-1-dlemoal@kernel.org> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	Zone write plugging \| expand [v3,00/30] Zone write plugging [v3,02/30] block: Restore sector of flush requests [v3,04/30] block: Introduce blk_zone_update_request_bio() [v3,06/30] block: Allow using bio_attempt_back_merge() internally [v3,08/30] block: Introduce zone write plugging [v3,10/30] block: Fake max open zones limit when there is no limit [v3,12/30] block: Implement zone append emulation [v3,14/30] dm: Use the block layer zone append emulation [v3,16/30] ublk_drv: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature [v3,18/30] null_blk: Introduce zone_append_max_sectors attribute [v3,20/30] nvmet: zns: Do not reference the gendisk conv_zones_bitmap [v3,22/30] block: Simplify blk_revalidate_disk_zones() interface [v3,24/30] block: Remove elevator required features [v3,26/30] block: Move zone related debugfs attribute to blk-zoned.c [v3,28/30] block: Remove zone write locking [v3,30/30] block: Do not special-case plugging of zone write operations

Damien Le Moal March 28, 2024, 12:43 a.m. UTC

The patch series introduces zone write plugging (ZWP) as the new
mechanism to control the ordering of writes to zoned block devices.
ZWP replaces zone write locking (ZWL) which is implemented only by
mq-deadline today. ZWP also allows emulating zone append operations
using regular writes for zoned devices that do not natively support this
operation (e.g. SMR HDDs). This patch series removes the scsi disk
driver and device mapper zone append emulation to use ZWP emulation.

Unlike ZWL which operates on requests, ZWP operates on BIOs. A zone
write plug is simply a BIO list that is atomically manipulated using a
spinlock and a kblockd submission work. A write BIO to a zone is
"plugged" to delay its execution if a write BIO for the same zone was
already issued, that is, if a write request for the same zone is being
executed. The next plugged BIO is unplugged and issued once the write
request completes.

This mechanism allows to:
 - Untangle zone write ordering from the block IO schedulers. This
   allows removing the restriction on using only mq-deadline for zoned
   block devices. Any block IO scheduler, including "none" can be used.
 - Zone write plugging operates on BIOs instead of requests. Plugged
   BIOs waiting for execution thus do not hold scheduling tags and thus
   do not prevent other BIOs from being submitted to the device (reads
   or writes to other zones). Depending on the workload, this can
   significantly improve the device use and the performance.
 - Both blk-mq (request) based zoned devices and BIO-based devices (e.g.
   device mapper) can use ZWP. It is mandatory for the
   former but optional for the latter: BIO-based driver can use zone
   write plugging to implement write ordering guarantees, or the drivers
   can implement their own if needed.
 - The code is less invasive in the block layer and in device drivers.
   ZWP implementation is mostly limited to blk-zoned.c, with some small
   changes in blk-mq.c, blk-merge.c and bio.c.

Performance evaluation results are shown below.

The series is based on 6.9.0-rc1 and organized as follows:

 - Patch 1 to 7 are preparatory changes for patch 8.
 - Patch 8, 9 and 10 introduce ZWP
 - Patch 11 and 12 add zone append emulation to ZWP.
 - Patch 13 to 20 modify zoned block device drivers to use ZWP and
   prepare for the removal of ZWL.
 - Patch 21 to 30 remove zone write locking

Overall, these changes do not significantly increase the amount of code
(the higher number of addition shown by diff-stat is in fact due to a
larger amount of comments in the code).

Many thanks must go to Christoph Hellwig for comments and suggestions
he provided on earlier versions of these patches.

Performance evaluation results
==============================

Environments:
 - Intel Xeon 16-cores/32-threads, 128GB of RAM
 - Kernel:
   - ZWL (baseline): 6.9.0-rc1
   - ZWP: 6.9.0-rc1 patched kernel to add zone write plugging
   (both kernels were compiled with the same configuration turning off
   most heavy debug features)

Workoads:
 - seqw4K1: 4KB sequential write, qd=1
 - seqw4K16: 4KB sequential write, qd=16
 - seqw1M16: 1MB sequential write, qd=16
 - rndw4K16: 4KB random write, qd=16
 - rndw128K16: 128KB random write, qd=16
 - btrfs workoad: Single fio job writing 128 MB files using 128 KB
   direct IOs at qd=16.

Devices:
 - nullblk (zoned): 4096 zones of 256 MB, 128 max open zones.
 - NVMe ZNS drive: 1 TB ZNS drive with 2GB zone size, 14 max open and
   active zones.
 - SMR HDD: 20 TB disk with 256MB zone size, 128 max open zones.

For ZWP, the result show the performance percentage increase (or
decrease) against ZWL (baseline) case.

1) null_blk zoned device:

             +--------+--------+-------+--------+---------+---------+
             |seqw4K1 |seqw4K16|seqw1M1|seqw1M16|rndw4K16|rndw128K16|
             |(MB/s)  | (MB/s) |(MB/s) | (MB/s) | (KIOPS)| (KIOPS)  |
 +-----------+--------+--------+-------+--------+--------+----------+
 |    ZWL    | 961    | 835    | 18640 | 14480  | 415    | 167      |
 |mq-deadline|        |        |       |        |        |          |
 +-----------+--------+--------+-------+--------+--------+----------+
 |    ZWP    | 963    | 845    | 18640 | 14630  | 452    | 165      |
 |mq-deadline| (+0%)  | (+1%)  | (+0%) | (+1%)  | (+8%)  | (-1%)    |
 +-----------+--------+--------+-------+--------+--------+----------+
 |    ZWP    | 731    | 651    | 15780 | 12710  | 129    | 108      |
 |    bfq    | (-23%) | (-22%) | (-15%)| (-12%) | (-68%) | (-15%)   |
 +-----------+--------+--------+-------+--------+--------+----------+
 |    ZWP    | 2511   | 1632   | 27470 | 19340  | 336    | 150      |
 |   none    | (+161%)| (+95%) | (+47%)| (+33%) | (-19%) | (-19%)   |
 +-----------+--------+--------+-------+--------+--------+----------+

ZWP with mq-deadline gives performance very similar to zone write
locking, showing that zone write plugging overhead is acceptable.
But ZWP ability to run fast block devices with the none scheduler
shows brings all the benefits of zone write plugging and results in
significant performance increase for all workloads. The exception to
this are random write workloads with multiple jobs: for these, the
faster request submission rate achieved by zone write plugging results
in higher contention on null-blk zone spinlock, which degrades
performance.

2) NVMe ZNS drive:

             +--------+--------+-------+--------+--------+----------+
             |seqw4K1 |seqw4K16|seqw1M1|seqw1M16|rndw4K16|rndw128K16|
             |(MB/s)  | (MB/s) |(MB/s) | (MB/s) | (KIOPS)|  (KIOPS) |
 +-----------+--------+--------+-------+--------+--------+----------+
 |    ZWL    | 183    | 707    | 1083  | 1101   | 53.6   | 14.1     |
 |mq-deadline|        |        |       |        |        |          |
 +-----------+--------+--------+-------+--------+--------+----------+
 |    ZWP    | 183    | 719    | 1082  | 1103   |55.5    | 14.1     |
 |mq-deadline|(-0%)   | (+1%)  | (+0%) | (+0%)  |(+3%)   | (+0%)    |
 +-----------+--------+--------+-------+--------+--------+----------+
 |    ZWP    | 175    | 691    | 1078  | 1097   | 28.3   | 11.2     |
 |    bfq    | (-4%)  | (-2%)  | (-0%) | (-0%)  | (-47%) | (-20%)   |
 +-----------+--------+--------+-------+--------+--------+----------+
 |    ZWP    | 190    | 665    | 1083  | 1105   | 51.4   | 14.1     |
 |   none    | (+4%)  | (-5%)  | (+0%) | (+0%)  | (-4%)  | (+0%)    |
 +-----------+--------+--------+-------+--------+--------+----------+

Zone write plugging overhead does not significantly impact performance.
Similar to nullblk, using the none scheduler leads to performance
increase for most workloads.

3) SMR SATA HDD:

             +-------+--------+-------+--------+--------+----------+
             |seqw4K1|seqw4K16|seqw1M1|seqw1M16|rndw4K16|rndw128K16|
             |(MB/s) | (MB/s) |(MB/s) | (MB/s) | (KIOPS)|  (KIOPS) |
 +-----------+-------+--------+-------+--------+--------+----------+
 |    ZWL    | 107   | 243    | 246   | 246    | 2.2    | 0.769    |
 |mq-deadline|       |        |       |        |        |          |
 +-----------+-------+--------+-------+--------+--------+----------+
 |    ZWP    | 109   | 240    | 246   | 244    | 2.2    | 0.767    |
 |mq-deadline|(+1%)  | (-1%)  | (-0%) | (-0%)  | (+0%)  | (+0%)    |
 +-----------+-------+--------+-------+--------+--------+----------+
 |    ZWP    | 104   | 240    | 247   | 244    | 2.3    | 0.765    |
 |    bfq    | (-2%) | (-1%)  | (+0%) | (-0%)  | (+0%)  | (+0%)    |
 +-----------+-------+--------+-------+--------+--------+----------+
 |    ZWP    | 115   | 235    | 246   | 243    | 2.2    | 0.771    |
 |   none    | (+7%) | (-3%)  | (+0%) | (-1%)  | (+0%)  | (+0%)    |
 +-----------+-------+--------+-------+--------+--------+----------+

Performance with purely sequential write workloads at high queue depth
somewhat decrease a little when using zone write plugging. This is due
to the different IO pattern that ZWP generates where the first writes to
a zone start being issued when the end of the previous zone are still
being written. Depending on how the disk handles queued commands, seek
may be generated, slightly impacting the throughput achieved. Such pure
sequential write workloads are however rare with SMR drives.

4) Zone append tests using btrfs:

             +-------------+-------------+-----------+-------------+
             |  null-blk   |  null_blk   |    ZNS    |     SMR     |
             |  native ZA  | emulated ZA | native ZA | emulated ZA |
             |    (MB/s)   |   (MB/s)    |   (MB/s)  |    (MB/s)   |
 +-----------+-------------+-------------+-----------+-------------+
 |    ZWL    | 2434        | N/A         | 1083      | 244         |
 |mq-deadline|             |             |           |             |
 +-----------+-------------+-------------+-----------+-------------+
 |    ZWP    | 2361        | 3111        | 1087      | 239         |
 |mq-deadline| (+1%)       |             | (+0%)     | (-2%)       |
 +-----------+-------------+-------------+-----------+-------------+
 |    ZWP    | 2299        | 2840        | 1082      | 239         |
 |    bfq    | (-4%)       |             | (+0%)     | (-2%)       |
 +-----------+-------------+-------------+-----------+-------------+
 |    ZWP    | 2443        | 3152        | 1078      | 238         |
 |    none   | (+0%)       |             | (-0%)     | (-2%)       |
 +-----------+-------------+-------------+-----------+-------------+

With a more realistic use of the device though a file system, ZWP does
not introduce significant performance differences, except for SMR for
the same reason as with the fio sequential workloads at high queue
depth.

Changes from v2:
 - Added Patch 1 (Christoph's comment)
 - Fixed error code setup in Patch 3 (Bart's comment)
 - Split former patch 26 into patches 27 and 28
 - Modified patch 8 (zone write plugging) introduction to remove the
   kmem_cache use and address Bart's and Christoph comments.
 - Changed from using a mempool of zone write plugs to using a simple
   free-list (patch 9)
 - Simplified patch 10 as suggested by Christoph
 - Moved common code to a helper in patch 13 as suggested by Christoph

Changes from v1:
 - Added patch 6
 - Rewrite of patch 7 to use a hash table of dynamically allocated zone
   write plugs. This results in changes in patch 11 and the addition of
   patch 8 and 9.
 - Rebased everything on 6.9.0-rc1
 - Added review tags for patches that did not change

Damien Le Moal (30):
  block: Do not force full zone append completion in req_bio_endio()
  block: Restore sector of flush requests
  block: Remove req_bio_endio()
  block: Introduce blk_zone_update_request_bio()
  block: Introduce bio_straddles_zones() and
    bio_offset_from_zone_start()
  block: Allow using bio_attempt_back_merge() internally
  block: Remember zone capacity when revalidating zones
  block: Introduce zone write plugging
  block: Pre-allocate zone write plugs
  block: Fake max open zones limit when there is no limit
  block: Allow zero value of max_zone_append_sectors queue limit
  block: Implement zone append emulation
  block: Allow BIO-based drivers to use blk_revalidate_disk_zones()
  dm: Use the block layer zone append emulation
  scsi: sd: Use the block layer zone append emulation
  ublk_drv: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature
  null_blk: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature
  null_blk: Introduce zone_append_max_sectors attribute
  null_blk: Introduce fua attribute
  nvmet: zns: Do not reference the gendisk conv_zones_bitmap
  block: Remove BLK_STS_ZONE_RESOURCE
  block: Simplify blk_revalidate_disk_zones() interface
  block: mq-deadline: Remove support for zone write locking
  block: Remove elevator required features
  block: Do not check zone type in blk_check_zone_append()
  block: Move zone related debugfs attribute to blk-zoned.c
  block: Replace zone_wlock debugfs entry with zone_wplugs entry
  block: Remove zone write locking
  block: Do not force select mq-deadline with CONFIG_BLK_DEV_ZONED
  block: Do not special-case plugging of zone write operations

 block/Kconfig                     |    5 -
 block/Makefile                    |    1 -
 block/bio.c                       |    7 +
 block/blk-core.c                  |   11 +-
 block/blk-flush.c                 |    1 +
 block/blk-merge.c                 |   22 +-
 block/blk-mq-debugfs-zoned.c      |   22 -
 block/blk-mq-debugfs.c            |    3 +-
 block/blk-mq-debugfs.h            |    6 +-
 block/blk-mq.c                    |  140 ++-
 block/blk-mq.h                    |   31 -
 block/blk-settings.c              |   46 +-
 block/blk-sysfs.c                 |    2 +-
 block/blk-zoned.c                 | 1412 +++++++++++++++++++++++++++--
 block/blk.h                       |   69 +-
 block/elevator.c                  |   46 +-
 block/elevator.h                  |    1 -
 block/genhd.c                     |    3 +-
 block/mq-deadline.c               |  176 +---
 drivers/block/null_blk/main.c     |   24 +-
 drivers/block/null_blk/null_blk.h |    2 +
 drivers/block/null_blk/zoned.c    |   23 +-
 drivers/block/ublk_drv.c          |    5 +-
 drivers/block/virtio_blk.c        |    2 +-
 drivers/md/dm-core.h              |    2 +-
 drivers/md/dm-zone.c              |  476 +---------
 drivers/md/dm.c                   |   75 +-
 drivers/md/dm.h                   |    4 +-
 drivers/nvme/host/core.c          |    2 +-
 drivers/nvme/target/zns.c         |   10 +-
 drivers/scsi/scsi_lib.c           |    1 -
 drivers/scsi/sd.c                 |    8 -
 drivers/scsi/sd.h                 |   19 -
 drivers/scsi/sd_zbc.c             |  335 +------
 include/linux/blk-mq.h            |   85 +-
 include/linux/blk_types.h         |   30 +-
 include/linux/blkdev.h            |  104 +--
 37 files changed, 1745 insertions(+), 1466 deletions(-)
 delete mode 100644 block/blk-mq-debugfs-zoned.c

Bart Van Assche March 28, 2024, 9:38 p.m. UTC | #1

On 3/27/24 5:43 PM, Damien Le Moal wrote:
> +		/*
> +		 * Remember the capacity of the first sequential zone and check
> +		 * if it is constant for all zones.
> +		 */
> +		if (!args->zone_capacity)
> +			args->zone_capacity = zone->capacity;
> +		if (zone->capacity != args->zone_capacity) {
> +			pr_warn("%s: Invalid variable zone capacity\n",
> +				disk->disk_name);
> +			return -ENODEV;
> +		}

SMR disks may have a smaller last zone. Does the above code handle such
SMR disks correctly?

Thanks,

Bart.

Bart Van Assche March 28, 2024, 10:25 p.m. UTC | #2

On 3/27/24 9:30 PM, Christoph Hellwig wrote:
> Please verify my idea carefully, but I think we can do without the
> RCU grace period and thus the rcu_head in struct blk_zone_wplug:
> 
> When the zwplug is removed from the hash, we set the
> BLK_ZONE_WPLUG_UNHASHED flag under disk->zone_wplugs_lock.  Once
> caller see that flag any lookup that modifies the structure
> will fail/wait.  If we then just clear BLK_ZONE_WPLUG_UNHASHED after
> the final put in disk_put_zone_wplug when we know the bio list is
> empty and no other state is kept (if there might be flags left
> we should clear them before), it is perfectly fine for the
> zwplug to get reused for another zone at this point.

Hi Christoph,

I don't think this is allowed without grace period between kfree()
and reusing a zwplug because another thread might be iterating over
the hlist while only holding an RCU reader lock.

Thanks,

Bart.

Damien Le Moal March 28, 2024, 10:33 p.m. UTC | #3

On 3/29/24 07:29, Bart Van Assche wrote:
> On 3/27/24 5:43 PM, Damien Le Moal wrote:
>> Allocating zone write plugs using kmalloc() does not guarantee that
>> enough write plugs can be allocated to simultaneously write up to
>> the maximum number of active zones or maximum number of open zones of
>> a zoned block device.
>>
>> Avoid any issue with memory allocation by pre-allocating zone write
>> plugs up to the disk maximum number of open zones or maximum number of
>> active zones, whichever is larger. For zoned devices that do not have
>> open or active zone limits, the default 128 is used as the number of
>> write plugs to pre-allocate.
>>
>> Pre-allocated zone write plugs are managed using a free list. If a
>> change to the device zone limits is detected, the disk free list is
>> grown if needed when blk_revalidate_disk_zones() is executed.
> 
> Is there a way to retry bio submission if allocating a zone write plug
> fails? Would that make it possible to drop this patch?

This patch is merged into the main zone write plugging patch in v4 (about to
post it) and the free list is replaced with a mempool.
Note that for BIOs that do not have REQ_NOWAIT, the allocation is done with
GFP_NIO. If that fails, the OOM killer is probably already wreaking the system...

> 
> Thanks,
> 
> Bart.
>

Damien Le Moal March 28, 2024, 10:40 p.m. UTC | #4

On 3/29/24 06:38, Bart Van Assche wrote:
> On 3/27/24 5:43 PM, Damien Le Moal wrote:
>> +		/*
>> +		 * Remember the capacity of the first sequential zone and check
>> +		 * if it is constant for all zones.
>> +		 */
>> +		if (!args->zone_capacity)
>> +			args->zone_capacity = zone->capacity;
>> +		if (zone->capacity != args->zone_capacity) {
>> +			pr_warn("%s: Invalid variable zone capacity\n",
>> +				disk->disk_name);
>> +			return -ENODEV;
>> +		}
> 
> SMR disks may have a smaller last zone. Does the above code handle such
> SMR disks correctly?

SMR drives known to have a smaller last zone have a smaller conventional zone,
not a sequential zone. But good point, I will handle that on the check for
conventional zones.

> 
> Thanks,
> 
> Bart.

Damien Le Moal March 28, 2024, 10:42 p.m. UTC | #5

On 3/29/24 06:28, Bart Van Assche wrote:
> On 3/27/24 5:43 PM, Damien Le Moal wrote:
>> Moving req_bio_endio() code into its only caller, blk_update_request(),
>> allows reducing accesses to and tests of bio and request fields. Also,
>> given that partial completions of zone append operations is not
>> possible and that zone append operations cannot be merged, the update
>> of the BIO sector using the request sector for these operations can be
>> moved directly before the call to bio_endio().
> 
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> 
>> -	if (unlikely(error && !blk_rq_is_passthrough(req) &&
>> -		     !(req->rq_flags & RQF_QUIET)) &&
>> -		     !test_bit(GD_DEAD, &req->q->disk->state)) {
>> +	if (unlikely(error && !blk_rq_is_passthrough(req) && !quiet) &&
>> +	    !test_bit(GD_DEAD, &req->q->disk->state)) {
> 
> A question that is independent of this patch series: is it a bug or is
> it a feature that the GD_DEAD bit test is not marked as "unlikely"?

likely/unlikely are optimizations... I guess that bit test could be under
unlikely() as well. Though if we are dealing with a removable media device, this
may not be appropriate, which may be why it is not under unlikely(). Not sure.

> 
> Thanks,
> 
> Bart.

Jens Axboe March 28, 2024, 11:03 p.m. UTC | #6

On 3/28/24 4:43 PM, Damien Le Moal wrote:
> On 3/29/24 03:14, Bart Van Assche wrote:
>> On 3/27/24 17:43, Damien Le Moal wrote:
>>> This reverts commit 748dc0b65ec2b4b7b3dbd7befcc4a54fdcac7988.
>>>
>>> Partial zone append completions cannot be supported as there is no
>>> guarantees that the fragmented data will be written sequentially in the
>>> same manner as with a full command. Commit 748dc0b65ec2 ("block: fix
>>> partial zone append completion handling in req_bio_endio()") changed
>>> req_bio_endio() to always advance a partially failed BIO by its full
>>> length, but this can lead to incorrect accounting. So revert this
>>> change and let low level device drivers handle this case by always
>>> failing completely zone append operations. With this revert, users will
>>> still see an IO error for a partially completed zone append BIO.
>>>
>>> Fixes: 748dc0b65ec2 ("block: fix partial zone append completion handling in req_bio_endio()")
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
>>> ---
>>>   block/blk-mq.c | 9 ++-------
>>>   1 file changed, 2 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 555ada922cf0..32afb87efbd0 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -770,16 +770,11 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
>>>   		/*
>>>   		 * Partial zone append completions cannot be supported as the
>>>   		 * BIO fragments may end up not being written sequentially.
>>> -		 * For such case, force the completed nbytes to be equal to
>>> -		 * the BIO size so that bio_advance() sets the BIO remaining
>>> -		 * size to 0 and we end up calling bio_endio() before returning.
>>>   		 */
>>> -		if (bio->bi_iter.bi_size != nbytes) {
>>> +		if (bio->bi_iter.bi_size != nbytes)
>>>   			bio->bi_status = BLK_STS_IOERR;
>>> -			nbytes = bio->bi_iter.bi_size;
>>> -		} else {
>>> +		else
>>>   			bio->bi_iter.bi_sector = rq->__sector;
>>> -		}
>>>   	}
>>>   
>>>   	bio_advance(bio, nbytes);
>>
>> Hi Damien,
>>
>> This patch looks good to me but shouldn't it be separated from this
>> patch series? I think that will help this patch to get merged sooner.
> 
> Yes, it can go on its own. But patch 3 depends on it so I kept it in the series.
> 
> Jens,
> 
> How would you like to proceed with this one ?

I can just pick it up separately.

Damien Le Moal March 28, 2024, 11:13 p.m. UTC | #7

On 3/29/24 08:05, Jens Axboe wrote:
> 
> On Thu, 28 Mar 2024 09:43:39 +0900, Damien Le Moal wrote:
>> The patch series introduces zone write plugging (ZWP) as the new
>> mechanism to control the ordering of writes to zoned block devices.
>> ZWP replaces zone write locking (ZWL) which is implemented only by
>> mq-deadline today. ZWP also allows emulating zone append operations
>> using regular writes for zoned devices that do not natively support this
>> operation (e.g. SMR HDDs). This patch series removes the scsi disk
>> driver and device mapper zone append emulation to use ZWP emulation.
>>
>> [...]
> 
> Applied, thanks!
> 
> [01/30] block: Do not force full zone append completion in req_bio_endio()
>         commit: 55251fbdf0146c252ceff146a1bb145546f3e034
> 
> Best regards,

Thanks Jens. Will this also be in your block/for-next branch ?
Otherwise, the series will have a conflict in patch 3.

Damien Le Moal March 28, 2024, 11:33 p.m. UTC | #8

On 3/29/24 08:27, Jens Axboe wrote:
> On 3/28/24 5:13 PM, Damien Le Moal wrote:
>> On 3/29/24 08:05, Jens Axboe wrote:
>>>
>>> On Thu, 28 Mar 2024 09:43:39 +0900, Damien Le Moal wrote:
>>>> The patch series introduces zone write plugging (ZWP) as the new
>>>> mechanism to control the ordering of writes to zoned block devices.
>>>> ZWP replaces zone write locking (ZWL) which is implemented only by
>>>> mq-deadline today. ZWP also allows emulating zone append operations
>>>> using regular writes for zoned devices that do not natively support this
>>>> operation (e.g. SMR HDDs). This patch series removes the scsi disk
>>>> driver and device mapper zone append emulation to use ZWP emulation.
>>>>
>>>> [...]
>>>
>>> Applied, thanks!
>>>
>>> [01/30] block: Do not force full zone append completion in req_bio_endio()
>>>         commit: 55251fbdf0146c252ceff146a1bb145546f3e034
>>>
>>> Best regards,
>>
>> Thanks Jens. Will this also be in your block/for-next branch ?
>> Otherwise, the series will have a conflict in patch 3.
> 
> It'll go into 6.9, and I'll rebase the for-6.10/block branch once -rc2
> is out. That should take care of the dependency.

OK. Thanks. I will wait for next week to send out v4 then.

Bart Van Assche March 29, 2024, 6:54 p.m. UTC | #9

On 3/27/24 5:44 PM, Damien Le Moal wrote:
> In preparation to completely remove zone write locking, replace the
> "zone_wlock" mq-debugfs entry that was listing zones that are
> write-locked with the zone_wplugs entry which lists the zones that
> currently have a write plug allocated.
> 
> The write plug information provided is: the zone number, the zone write
> plug flags, the zone write plug write pointer offset and the number of
> BIOs currently waiting for execution in the zone write plug BIO list.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

Bart Van Assche March 29, 2024, 8:50 p.m. UTC | #10

On 3/27/24 5:43 PM, Damien Le Moal wrote:
> In preparation for adding a generic zone append emulation using zone
> write plugging, allow device drivers supporting zoned block device to
> set a the max_zone_append_sectors queue limit of a device to 0 to
> indicate the lack of native support for zone append operations and that
> the block layer should emulate these operations using regular write
> operations.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

Bart Van Assche March 29, 2024, 9:27 p.m. UTC | #11

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

Bart Van Assche March 29, 2024, 9:29 p.m. UTC | #12

On 3/27/24 5:43 PM, Damien Le Moal wrote:
> With zone write plugging enabled at the block layer level, any zone

zone -> zoned

Anyway:

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

Bart Van Assche March 29, 2024, 9:36 p.m. UTC | #13

On 3/27/24 5:43 PM, Damien Le Moal wrote:
> Add the fua configfs attribute and module parameter to allow
> configuring if the device supports FUA or not. Using this attribute
> has an effect on the null_blk device only if memory backing is enabled
> together with a write cache (cache_size option).
> 
> This new attribute allows configuring a null_blk device with a write
> cache but without FUA support. This is convenient to test the block
> layer flush machinery.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

Chaitanya Kulkarni April 2, 2024, 6:42 a.m. UTC | #14

On 3/27/24 17:43, Damien Le Moal wrote:
> Add the fua configfs attribute and module parameter to allow
> configuring if the device supports FUA or not. Using this attribute
> has an effect on the null_blk device only if memory backing is enabled
> together with a write cache (cache_size option).
> 
> This new attribute allows configuring a null_blk device with a write
> cache but without FUA support. This is convenient to test the block
> layer flush machinery.
> 
> Signed-off-by: Damien Le Moal<dlemoal@kernel.org>
> Reviewed-by: Hannes Reinecke<hare@suse.de>


Looks good.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>

-ck

Chaitanya Kulkarni April 2, 2024, 6:43 a.m. UTC | #15

On 3/27/24 17:43, Damien Le Moal wrote:
> With zone write plugging enabled at the block layer level, any zone
> device can only ever see at most a single write operation per zone.
> There is thus no need to request a block scheduler with strick per-zone
> sequential write ordering control through the ELEVATOR_F_ZBD_SEQ_WRITE
> feature. Removing this allows using a zoned null_blk device with any
> scheduler, including "none".
> 
> Signed-off-by: Damien Le Moal<dlemoal@kernel.org>
> Reviewed-by: Hannes Reinecke<hare@suse.de>


Looks good.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>

-ck

[v3,00/30] Zone write plugging

Message

Comments