Message ID: 20240403084247.856481-1-dlemoal@kernel.org
Series: Zone write plugging
On 2024-04-03 10:43, Damien Le Moal wrote:
> The patch series introduces zone write plugging (ZWP) as the new
> mechanism to control the ordering of writes to zoned block devices.
> ZWP replaces zone write locking (ZWL), which today is implemented only
> by mq-deadline. ZWP also allows emulating zone append operations using
> regular writes for zoned devices that do not natively support this
> operation (e.g. SMR HDDs). This patch series removes the zone append
> emulation of the scsi disk driver and of device mapper in favor of the
> ZWP emulation.
>
> Unlike ZWL, which operates on requests, ZWP operates on BIOs. A zone
> write plug is simply a BIO list that is atomically manipulated using a
> spinlock and a kblockd submission work. A write BIO to a zone is
> "plugged" to delay its execution if a write BIO for the same zone was
> already issued, that is, if a write request for the same zone is being
> executed. The next plugged BIO is unplugged and issued once the write
> request completes.
>
> This mechanism makes it possible to:
> - Untangle zone write ordering from the block IO schedulers. This
>   allows removing the restriction on using only mq-deadline for zoned
>   block devices. Any block IO scheduler, including "none", can be
>   used.
> - Operate on BIOs instead of requests. Plugged BIOs waiting for
>   execution do not hold scheduling tags and thus do not prevent other
>   BIOs (reads or writes to other zones) from being submitted to the
>   device. Depending on the workload, this can significantly improve
>   device utilization and performance.
> - Use ZWP for both blk-mq (request) based zoned devices and BIO-based
>   devices (e.g. device mapper). It is mandatory for the former but
>   optional for the latter: BIO-based drivers can use zone write
>   plugging to implement write ordering guarantees, or implement their
>   own mechanism if needed.
> - Keep the code less invasive in the block layer and in device
>   drivers: the ZWP implementation is mostly limited to blk-zoned.c,
>   with some small changes in blk-mq.c, blk-merge.c and bio.c.
>
> Performance evaluation results are shown below.
>
> The series is based on block/for-next and organized as follows:
>
> - Patches 1 to 6 are preparatory changes for patch 7.
> - Patches 7 and 8 introduce ZWP.
> - Patches 9 and 10 add zone append emulation to ZWP.
> - Patches 11 to 18 modify zoned block device drivers to use ZWP and
>   prepare for the removal of ZWL.
> - Patches 19 to 28 remove zone write locking.
>
> Overall, these changes do not significantly increase the amount of
> code (the higher number of additions shown by diffstat is in fact due
> to a larger amount of comments in the code).
>
> Many thanks must go to Christoph Hellwig for the comments and
> suggestions he provided on earlier versions of these patches.
>
> Performance evaluation results
> ==============================
>
> Environments:
> - Intel Xeon 16-cores/32-threads, 128GB of RAM
> - Kernels:
>   - ZWL (baseline): block/for-next (based on 6.9.0-rc2)
>   - ZWP: block/for-next patched kernel to add zone write plugging
>     (both kernels were compiled with the same configuration, turning
>     off most heavy debug features)
>
> Workloads:
> - seqw4K1: 4KB sequential write, qd=1
> - seqw4K16: 4KB sequential write, qd=16
> - seqw1M1: 1MB sequential write, qd=1
> - seqw1M16: 1MB sequential write, qd=16
> - rndw4K16: 4KB random write, qd=16
> - rndw128K16: 128KB random write, qd=16
> - btrfs workload: single fio job writing 128 MB files using 128 KB
>   direct IOs at qd=16
>
> Devices:
> - nullblk (zoned): 4096 zones of 256 MB, 128 max open zones.
> - NVMe ZNS drive: 1 TB ZNS drive with 2GB zone size, 14 max open and
>   active zones.
> - SMR HDD: 20 TB disk with 256MB zone size, 128 max open zones.
>
> For ZWP, the results show the performance percentage increase (or
> decrease) against the ZWL (baseline) case.
>
> 1) null_blk zoned device:
>
> +-----------+--------+--------+-------+--------+--------+----------+
> |           |seqw4K1 |seqw4K16|seqw1M1|seqw1M16|rndw4K16|rndw128K16|
> |           | (MB/s) | (MB/s) |(MB/s) | (MB/s) |(KIOPS) | (KIOPS)  |
> +-----------+--------+--------+-------+--------+--------+----------+
> | ZWL       |  940   |  840   | 18550 | 14400  |  424   |   167    |
> |mq-deadline|        |        |       |        |        |          |
> +-----------+--------+--------+-------+--------+--------+----------+
> | ZWP       |  943   |  845   | 18660 | 14770  |  461   |   165    |
> |mq-deadline| (+0%)  | (+0%)  | (+0%) | (+1%)  | (+8%)  |  (-1%)   |
> +-----------+--------+--------+-------+--------+--------+----------+
> | ZWP       |  756   |  668   | 16020 | 12980  |  135   |   101    |
> |    bfq    | (-19%) | (-20%) |(-13%) | (-9%)  | (-68%) |  (-39%)  |
> +-----------+--------+--------+-------+--------+--------+----------+
> | ZWP       |  2639  |  1715  | 28190 | 19760  |  344   |   150    |
> |   none    | (+180%)| (+104%)|(+51%) | (+37%) | (-18%) |  (-10%)  |
> +-----------+--------+--------+-------+--------+--------+----------+
>
> ZWP with mq-deadline gives performance very similar to zone write
> locking, showing that the zone write plugging overhead is acceptable.
> But the ability of ZWP to run fast block devices with the none
> scheduler brings all the benefits of zone write plugging and results
> in a significant performance increase for most workloads. The
> exception is random write workloads with multiple jobs: for these,
> the faster request submission rate achieved by zone write plugging
> results in higher contention on the null_blk zone spinlock, which
> degrades performance.
>
> 2) NVMe ZNS drive:
>
> +-----------+--------+--------+-------+--------+--------+----------+
> |           |seqw4K1 |seqw4K16|seqw1M1|seqw1M16|rndw4K16|rndw128K16|
> |           | (MB/s) | (MB/s) |(MB/s) | (MB/s) |(KIOPS) | (KIOPS)  |
> +-----------+--------+--------+-------+--------+--------+----------+
> | ZWL       |  183   |  702   | 1066  | 1103   |  53.5  |   14.5   |
> |mq-deadline|        |        |       |        |        |          |
> +-----------+--------+--------+-------+--------+--------+----------+
> | ZWP       |  183   |  719   | 1086  | 1108   |  55.6  |   14.7   |
> |mq-deadline| (-0%)  | (+1%)  | (+0%) | (+0%)  | (+3%)  |  (+1%)   |
> +-----------+--------+--------+-------+--------+--------+----------+
> | ZWP       |  178   |  691   | 1082  | 1106   |  30.8  |   11.5   |
> |    bfq    | (-3%)  | (-2%)  | (-0%) | (+0%)  | (-42%) |  (-20%)  |
> +-----------+--------+--------+-------+--------+--------+----------+
> | ZWP       |  190   |  666   | 1083  | 1108   |  51.4  |   14.7   |
> |   none    | (+4%)  | (-5%)  | (+0%) | (+0%)  | (-4%)  |  (+0%)   |
> +-----------+--------+--------+-------+--------+--------+----------+
>
> Zone write plugging overhead does not significantly impact
> performance. Similar to null_blk, using the none scheduler leads to a
> performance increase for most workloads.
>
> 3) SMR SATA HDD:
>
> +-----------+-------+--------+-------+--------+--------+----------+
> |           |seqw4K1|seqw4K16|seqw1M1|seqw1M16|rndw4K16|rndw128K16|
> |           |(MB/s) | (MB/s) |(MB/s) | (MB/s) |(KIOPS) | (KIOPS)  |
> +-----------+-------+--------+-------+--------+--------+----------+
> | ZWL       |  107  |  243   |  245  |  246   |  2.2   |  0.763   |
> |mq-deadline|       |        |       |        |        |          |
> +-----------+-------+--------+-------+--------+--------+----------+
> | ZWP       |  107  |  242   |  245  |  245   |  2.2   |  0.772   |
> |mq-deadline| (+0%) | (-0%)  | (+0%) | (-0%)  | (+0%)  |  (+0%)   |
> +-----------+-------+--------+-------+--------+--------+----------+
> | ZWP       |  104  |  241   |  246  |  242   |  2.2   |  0.765   |
> |    bfq    | (-2%) | (-0%)  | (+0%) | (-0%)  | (+0%)  |  (+0%)   |
> +-----------+-------+--------+-------+--------+--------+----------+
> | ZWP       |  115  |  235   |  249  |  242   |  2.2   |  0.763   |
> |   none    | (+7%) | (-3%)  | (+1%) | (-1%)  | (+0%)  |  (+0%)   |
> +-----------+-------+--------+-------+--------+--------+----------+
>
> Performance with purely sequential write workloads at high queue
> depth decreases slightly when using zone write plugging. This is due
> to the different IO pattern that ZWP generates, where the first
> writes to a zone start being issued while the end of the previous
> zone is still being written. Depending on how the disk handles queued
> commands, seeks may be generated, slightly impacting the throughput
> achieved. Such purely sequential write workloads are however rare
> with SMR drives.
>
> 4) Zone append tests using btrfs:
>
> +-----------+-------------+-------------+-----------+-------------+
> |           |  null_blk   |  null_blk   |    ZNS    |     SMR     |
> |           |  native ZA  | emulated ZA | native ZA | emulated ZA |
> |           |   (MB/s)    |   (MB/s)    |  (MB/s)   |   (MB/s)    |
> +-----------+-------------+-------------+-----------+-------------+
> | ZWL       |    2441     |     N/A     |   1081    |     243     |
> |mq-deadline|             |             |           |             |
> +-----------+-------------+-------------+-----------+-------------+
> | ZWP       |    2361     |    2999     |   1085    |     239     |
> |mq-deadline|    (-1%)    |             |   (+0%)   |    (-2%)    |
> +-----------+-------------+-------------+-----------+-------------+
> | ZWP       |    2299     |    2730     |   1080    |     240     |
> |    bfq    |    (-4%)    |             |   (+0%)   |    (-2%)    |
> +-----------+-------------+-------------+-----------+-------------+
> | ZWP       |    2443     |    3152     |   1083    |     240     |
> |   none    |    (+0%)    |             |   (+0%)   |    (-1%)    |
> +-----------+-------------+-------------+-----------+-------------+
>
> With a more realistic use of the device through a file system, ZWP
> does not introduce significant performance differences, except for
> SMR, for the same reason as with the fio sequential workloads at high
> queue depth.

I re-ran the same RocksDB testing I did for v4 on this patch series.
Same results.

Tested-by: Hans Holmberg <hans.holmberg@wdc.com>

Setup
-----

Workloads: filluniquerandom followed by three iterations of overwrites,
readrandom, fwdrange, revrange and varieties of readwhilewriting.
Config: Direct IO, 400M key/values
Userspace: db_bench / RocksDB 7.7.3, using ZenFS with libzbd as backend
IO scheduler: [none]
NVMe ZNS drive: 1 TB ZNS drive with 2GB zone size, 14 max open and
active zones
System: 16 cores (AMD Epyc 7302P) with 128 GB of DRAM

Results
-------

iostat: 33.6T read, 6.5T written. No errors/failures.

Comparing performance with the baseline requiring [mq-deadline] to
avoid write reordering, we can see a ~8-10% throughput improvement for
read-heavy workloads. No regressions were seen for these workloads;
db_bench write throughput and read and write tail latencies look
unaffected.
> Changes from v4:
> - Modified patch 7 to:
>   - Fix a compilation warning
>   - Integrate disk_lookup_zone_wplug() into disk_get_zone_wplug() so
>     that all users of a zone write plug do a get+put when referencing
>     the plug.
>   - Introduce the inline functions blk_zone_write_complete_request()
>     and blk_zone_write_bio_endio() to integrate the "if (plugging)"
>     test and avoid repeating it for all calls.
> - Added review tags
>
> Changes from v3:
> - Rebased on block/for-next
> - Removed old patch 1 as it is already applied
> - Addressed Bart's and Christoph's comments on patch 4
> - Merged former patches 8 and 9 together and changed the zone write
>   plug allocation to use a mempool
> - Removed the zone_wplugs_mempool_size field from patch 8 and instead
>   directly reference mempool->min_nr
> - Added review tags
>
> Changes from v2:
> - Added patch 1 (Christoph's comment)
> - Fixed error code setup in patch 3 (Bart's comment)
> - Split former patch 26 into patches 27 and 28
> - Modified the patch 8 (zone write plugging) introduction to remove
>   the kmem_cache use and address Bart's and Christoph's comments
> - Changed from using a mempool of zone write plugs to using a simple
>   free-list (patch 9)
> - Simplified patch 10 as suggested by Christoph
> - Moved common code to a helper in patch 13 as suggested by Christoph
>
> Changes from v1:
> - Added patch 6
> - Rewrote patch 7 to use a hash table of dynamically allocated zone
>   write plugs. This results in changes in patch 11 and the addition
>   of patches 8 and 9.
> - Rebased everything on 6.9.0-rc1
> - Added review tags for patches that did not change
>
> Damien Le Moal (28):
>   block: Restore sector of flush requests
>   block: Remove req_bio_endio()
>   block: Introduce blk_zone_update_request_bio()
>   block: Introduce bio_straddles_zones() and
>     bio_offset_from_zone_start()
>   block: Allow using bio_attempt_back_merge() internally
>   block: Remember zone capacity when revalidating zones
>   block: Introduce zone write plugging
>   block: Fake max open zones limit when there is no limit
>   block: Allow zero value of max_zone_append_sectors queue limit
>   block: Implement zone append emulation
>   block: Allow BIO-based drivers to use blk_revalidate_disk_zones()
>   dm: Use the block layer zone append emulation
>   scsi: sd: Use the block layer zone append emulation
>   ublk_drv: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature
>   null_blk: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature
>   null_blk: Introduce zone_append_max_sectors attribute
>   null_blk: Introduce fua attribute
>   nvmet: zns: Do not reference the gendisk conv_zones_bitmap
>   block: Remove BLK_STS_ZONE_RESOURCE
>   block: Simplify blk_revalidate_disk_zones() interface
>   block: mq-deadline: Remove support for zone write locking
>   block: Remove elevator required features
>   block: Do not check zone type in blk_check_zone_append()
>   block: Move zone related debugfs attribute to blk-zoned.c
>   block: Replace zone_wlock debugfs entry with zone_wplugs entry
>   block: Remove zone write locking
>   block: Do not force select mq-deadline with CONFIG_BLK_DEV_ZONED
>   block: Do not special-case plugging of zone write operations
>
>  block/Kconfig                     |    5 -
>  block/Makefile                    |    1 -
>  block/bio.c                       |    6 +
>  block/blk-core.c                  |   11 +-
>  block/blk-flush.c                 |    1 +
>  block/blk-merge.c                 |   22 +-
>  block/blk-mq-debugfs-zoned.c      |   22 -
>  block/blk-mq-debugfs.c            |    3 +-
>  block/blk-mq-debugfs.h            |    6 +-
>  block/blk-mq.c                    |  125 ++-
>  block/blk-mq.h                    |   31 -
>  block/blk-settings.c              |   46 +-
>  block/blk-sysfs.c                 |    2 +-
>  block/blk-zoned.c                 | 1336 +++++++++++++++++++++++++++--
>  block/blk.h                       |   80 +-
>  block/elevator.c                  |   46 +-
>  block/elevator.h                  |    1 -
>  block/genhd.c                     |    3 +-
>  block/mq-deadline.c               |  176 +---
>  drivers/block/null_blk/main.c     |   22 +-
>  drivers/block/null_blk/null_blk.h |    2 +
>  drivers/block/null_blk/zoned.c    |   23 +-
>  drivers/block/ublk_drv.c          |    5 +-
>  drivers/block/virtio_blk.c        |    2 +-
>  drivers/md/dm-core.h              |    2 +-
>  drivers/md/dm-zone.c              |  476 +---
>  drivers/md/dm.c                   |   75 +-
>  drivers/md/dm.h                   |    4 +-
>  drivers/nvme/host/core.c          |    2 +-
>  drivers/nvme/target/zns.c         |   10 +-
>  drivers/scsi/scsi_lib.c           |    1 -
>  drivers/scsi/sd.c                 |    8 -
>  drivers/scsi/sd.h                 |   19 -
>  drivers/scsi/sd_zbc.c             |  335 +-------
>  include/linux/blk-mq.h            |   85 +-
>  include/linux/blk_types.h         |   30 +-
>  include/linux/blkdev.h            |  103 ++-
>  37 files changed, 1663 insertions(+), 1464 deletions(-)
>  delete mode 100644 block/blk-mq-debugfs-zoned.c
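For readers new to the series, the plugging mechanism described in the
cover letter can be summarized in a few lines of C. This is a minimal
sketch only, with simplified names (zone_wplug_sketch,
zwplug_handle_write(), zwplug_complete_write()) that are illustrative
assumptions; it is not the code from patch 7, which additionally deals
with hashing, reference counting, zone append emulation and error
recovery.

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

/* One plug per zone: a BIO list protected by a spinlock. */
struct zone_wplug_sketch {
	spinlock_t lock;
	unsigned int flags;		/* PLUGGED: a write is in flight */
	struct bio_list bio_list;	/* writes waiting for this zone */
	struct work_struct bio_work;	/* kblockd work resubmitting BIOs */
};

#define ZWPLUG_PLUGGED	(1U << 0)

/*
 * At submission time: if a write to this zone is already in flight,
 * delay the BIO by adding it to the plug BIO list.
 */
static bool zwplug_handle_write(struct zone_wplug_sketch *zwplug,
				struct bio *bio)
{
	unsigned long flags;
	bool plugged = false;

	spin_lock_irqsave(&zwplug->lock, flags);
	if (zwplug->flags & ZWPLUG_PLUGGED) {
		bio_list_add(&zwplug->bio_list, bio);
		plugged = true;
	} else {
		zwplug->flags |= ZWPLUG_PLUGGED;
	}
	spin_unlock_irqrestore(&zwplug->lock, flags);

	/* true: the BIO is held back and will be issued from bio_work */
	return plugged;
}

/*
 * At completion of a write to the zone: schedule the submission of the
 * next plugged BIO, or mark the zone as no longer plugged.
 */
static void zwplug_complete_write(struct zone_wplug_sketch *zwplug)
{
	unsigned long flags;

	spin_lock_irqsave(&zwplug->lock, flags);
	if (bio_list_empty(&zwplug->bio_list))
		zwplug->flags &= ~ZWPLUG_PLUGGED;
	else
		kblockd_schedule_work(&zwplug->bio_work);
	spin_unlock_irqrestore(&zwplug->lock, flags);
}

Patch 7 builds on this basic scheme with a per-disk hash table of
dynamically allocated plugs, which is what the review discussion below
examines.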
On 4/5/24 03:31, Bart Van Assche wrote:
> On 4/3/24 01:42, Damien Le Moal wrote:
>> diff --git a/block/bio.c b/block/bio.c
>> index d24420ed1c4c..127c06443bca 100644
>> --- a/block/bio.c
>> +++ b/block/bio.c
>> @@ -1576,6 +1576,12 @@ void bio_endio(struct bio *bio)
>>  	if (!bio_integrity_endio(bio))
>>  		return;
>>
>> +	/*
>> +	 * For write BIOs to zoned devices, signal the completion of the BIO
>> +	 * so that the next write BIO can be submitted by zone write
>> +	 * plugging.
>> +	 */
>> +	blk_zone_write_bio_endio(bio);
>> +
>>  	rq_qos_done_bio(bio);
>
> The function name "blk_zone_write_bio_endio()" is misleading since
> that function is called for all bio operation types and not only for
> zoned write bios. How about renaming blk_zone_write_bio_endio() into
> blk_zone_bio_endio()? If that function is renamed, the above comment
> is no longer necessary in bio_endio() and can be moved above the
> blk_zone_bio_endio() definition.

OK.

>> @@ -1003,6 +1007,13 @@ static enum bio_merge_status bio_attempt_front_merge(struct request *req,
>> +	/*
>> +	 * A front merge for zone writes can happen only if the user
>> +	 * submitted writes out of order. Do not attempt this to let the
>> +	 * write fail.
>> +	 */
>
> "zone writes" -> "zoned writes"?

Well, "writes to zones of a zoned block device" would be the correct
way to describe this. The shortcut is "zone writes", but it can be
"zoned writes" too.

>> +/*
>> + * Zone write plug flags bits:
>
> Zone write -> zoned write

Nope. These are flags for zone write plugs (as in struct
blk_zone_wplug), not flags for writes to zones.

>> +static bool disk_insert_zone_wplug(struct gendisk *disk,
>> +				   struct blk_zone_wplug *zwplug)
>> +{
>> +	struct blk_zone_wplug *zwplg;
>> +	unsigned long flags;
>> +	unsigned int idx =
>> +		hash_32(zwplug->zone_no, disk->zone_wplugs_hash_bits);
>> +
>> +	/*
>> +	 * Add the new zone write plug to the hash table, but carefully as
>> +	 * we are racing with other submission context, so we may already
>> +	 * have a zone write plug for the same zone.
>> +	 */
>> +	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
>> +	hlist_for_each_entry_rcu(zwplg, &disk->zone_wplugs_hash[idx], node) {
>> +		if (zwplg->zone_no == zwplug->zone_no) {
>> +			spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
>> +			return false;
>> +		}
>> +	}
>> +	hlist_add_head_rcu(&zwplug->node, &disk->zone_wplugs_hash[idx]);
>> +	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
>> +
>> +	return true;
>> +}
>
> The above code can be made easier to read and more compact by using
> guard(spinlock_irqsave)(...) instead of spin_lock_irqsave() +
> spin_unlock_irqrestore().

I am not familiar with this annotation and I do not see it used in the
block layer, so I would rather not use it, to keep the
locking/unlocking explicit. Such a change can come as a cleanup later.

>> +static struct blk_zone_wplug *disk_get_zone_wplug(struct gendisk *disk,
>> +						  sector_t sector)
>> +{
>> +	unsigned int zno = disk_zone_no(disk, sector);
>> +	unsigned int idx = hash_32(zno, disk->zone_wplugs_hash_bits);
>> +	struct blk_zone_wplug *zwplug;
>> +
>> +	rcu_read_lock();
>> +
>> +	hlist_for_each_entry_rcu(zwplug, &disk->zone_wplugs_hash[idx], node) {
>> +		if (zwplug->zone_no == zno &&
>> +		    atomic_inc_not_zero(&zwplug->ref)) {
>> +			rcu_read_unlock();
>> +			return zwplug;
>> +		}
>> +	}
>> +
>> +	rcu_read_unlock();
>> +
>> +	return NULL;
>> +}
>
> The above code can also be made more compact by using guard(rcu)()
> instead of rcu_read_lock() + rcu_read_unlock().

Same comment as above.
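For reference, Bart's suggestion relies on the scope-based cleanup
helpers from <linux/cleanup.h>. A sketch of what the two functions
above could look like with guard() — not applied in this series, per
Damien's reply — is shown below; the surrounding types and helpers are
the ones from the quoted patch.

/* Sketch only: guard() releases the spinlock when the scope is left. */
static bool disk_insert_zone_wplug(struct gendisk *disk,
				   struct blk_zone_wplug *zwplug)
{
	struct blk_zone_wplug *zwplg;
	unsigned int idx =
		hash_32(zwplug->zone_no, disk->zone_wplugs_hash_bits);

	guard(spinlock_irqsave)(&disk->zone_wplugs_lock);
	hlist_for_each_entry_rcu(zwplg, &disk->zone_wplugs_hash[idx], node) {
		if (zwplg->zone_no == zwplug->zone_no)
			return false;
	}
	hlist_add_head_rcu(&zwplug->node, &disk->zone_wplugs_hash[idx]);
	return true;
}

/* Sketch only: guard(rcu)() ends the RCU read-side section on return. */
static struct blk_zone_wplug *disk_get_zone_wplug(struct gendisk *disk,
						  sector_t sector)
{
	unsigned int zno = disk_zone_no(disk, sector);
	unsigned int idx = hash_32(zno, disk->zone_wplugs_hash_bits);
	struct blk_zone_wplug *zwplug;

	guard(rcu)();
	hlist_for_each_entry_rcu(zwplug, &disk->zone_wplugs_hash[idx], node) {
		if (zwplug->zone_no == zno &&
		    atomic_inc_not_zero(&zwplug->ref))
			return zwplug;
	}
	return NULL;
}

Whether guard() improves readability here is exactly the judgment call
being discussed; the series keeps the explicit lock/unlock pairs.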
>> +/*
>> + * Get a reference on the write plug for the zone containing @sector.
>> + * If the plug does not exist, it is allocated and hashed.
>> + * Return a pointer to the zone write plug with the plug spinlock held.
>> + */
>
> Holding a spinlock upon return is not my favorite approach. Has the
> following alternative been considered?
> - Remove the spinlock flags argument from this function and instead
>   add two other arguments: prev_plug_flags and new_plug_flags.
> - If a zone plug is found or allocated, copy the existing zone plug
>   flags into *prev_plug_flags and set the zone plug flags that have
>   been passed in new_plug_flags (logical or).
> - From blk_revalidate_zone_cb(), pass 0 as the new_plug_flags
>   argument.
> - From blk_zone_wplug_handle_write(), pass BLK_ZONE_WPLUG_PLUGGED as
>   the new_plug_flags argument.

I would rather not do this because it is a lot more complicated on the
caller side: all callers would need to check for the UNHASHED case and
potentially retry. As it is, zone write plugging is already not
trivial, so I am trying to keep things as simple as I can for now. The
function name makes it *very* clear that it takes the plug spinlock,
and that avoids needing the caller to worry about the state of the
plug they will get.

> Thanks,
>
> Bart.
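To illustrate the convention being debated, here is a hypothetical
caller-side sketch of the "return with the plug spinlock held" pattern.
The helper names disk_get_and_lock_zone_wplug() and
disk_put_zone_wplug(), and the plug's "lock" member, are assumptions
for illustration, not the exact patch API.

static void example_submit_zoned_write(struct gendisk *disk,
				       struct bio *bio)
{
	struct blk_zone_wplug *zwplug;
	unsigned long flags;

	/*
	 * On success, the plug is returned with a reference taken and
	 * its spinlock held, so its state cannot change under us.
	 */
	zwplug = disk_get_and_lock_zone_wplug(disk, bio->bi_iter.bi_sector,
					      GFP_NOIO, &flags);
	if (!zwplug) {
		bio_io_error(bio);
		return;
	}

	/* ... inspect and update the plug state here ... */

	spin_unlock_irqrestore(&zwplug->lock, flags);
	disk_put_zone_wplug(zwplug);
}

The flags-based alternative Bart outlines would instead return an
unlocked plug plus its previous flags, trading the held lock for
caller-side rechecking of the UNHASHED case.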