
[06/26] block: Introduce zone write plugging

Message ID 20240202073104.2418230-7-dlemoal@kernel.org
State New
Series Zone write plugging

Commit Message

Damien Le Moal Feb. 2, 2024, 7:30 a.m. UTC
Zone write plugging implements a per-zone "plug" for write operations to
tightly control the submission and execution order of writes to
sequential write required zones of a zoned block device. Per-zone
plugging of writes guarantees that at any time at most one write request
per zone is in flight. This mechanism is intended to replace zone write
locking which is controlled at the scheduler level and implemented only
by mq-deadline.

Unlike zone write locking which operates on requests, zone write
plugging operates on BIOs. A zone write plug is simply a BIO list that
is atomically manipulated using a spinlock and a kblockd submission
work. A write BIO to a zone is "plugged" to delay its execution if a
write BIO for the same zone was already issued, that is, if a write
request for the same zone is being executed. The next plugged BIO is
unplugged and issued once the write request completes.

This mechanism provides the following benefits:
 - Zone write ordering is untangled from block IO schedulers. This allows
   removing the restriction on using only mq-deadline for zoned block
   devices. Any block IO scheduler, including "none", can be used.
 - Zone write plugging operates on BIOs instead of requests. Plugged
   BIOs waiting for execution thus do not hold scheduling tags and
   therefore do not prevent other BIOs from proceeding (reads, or writes
   to other zones). Depending on the workload, this can significantly
   improve device utilization and performance.
 - Both blk-mq (request-based) zoned devices and BIO-based devices (e.g.
   device mapper) can use zone write plugging. It is mandatory for the
   former but optional for the latter: BIO-based drivers can use zone
   write plugging to implement write ordering guarantees, or implement
   their own mechanism if needed.
 - The code is less invasive in the block layer and is mostly limited to
   blk-zoned.c, with some small changes in blk-mq.c, blk-merge.c and
   bio.c.

Zone write plugging is implemented using struct blk_zone_wplug. This
structure includes a spinlock, a BIO list and a work structure to
handle the submission of plugged BIOs.
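
For reference, the structure as added to block/blk-zoned.c by this patch
(see the blk-zoned.c hunk below):

struct blk_zone_wplug {
	spinlock_t		lock;
	unsigned int		flags;
	struct bio_list		bio_list;
	struct work_struct	bio_work;
};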

Plugging of zone write BIOs is done using the function
blk_zone_write_plug_bio(), which returns false if a BIO execution does
not need to be delayed and true otherwise. This function is called
from blk_mq_submit_bio() after a BIO is split, to avoid large BIOs
spanning multiple zones which would cause mishandling of zone write
plugging. This enables zone write plugging by default for any blk-mq
(request-based) block device. BIO-based device drivers can also use zone
write plugging by explicitly calling blk_zone_write_plug_bio() in their
->submit_bio method. For such devices, the driver must ensure that a
BIO passed to blk_zone_write_plug_bio() is already split and does not
straddle zone boundaries.
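
As an illustration, a BIO-based driver could hook into zone write plugging
from its ->submit_bio method roughly as follows. This is only a sketch: the
foo_submit_bio() and foo_do_io() helpers are hypothetical, and passing 0 for
the number of segments is an assumption since BIO-based drivers do not track
segment counts.

static void foo_submit_bio(struct bio *bio)
{
	struct request_queue *q = bio->bi_bdev->bd_disk->queue;

	/*
	 * The BIO must already be split so that it does not straddle a zone
	 * boundary. If it gets plugged here, it will be resubmitted later by
	 * the zone write plug BIO work once the in-flight write for its zone
	 * completes.
	 */
	if (blk_queue_is_zoned(q) && blk_zone_write_plug_bio(bio, 0))
		return;

	foo_do_io(bio);
}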

Only write and write zeroes BIOs are plugged. Zone write plugging does
not introduce any significant overhead for other operations. A BIO that
is being handled through zone write plugging is flagged using the new
BIO flag BIO_ZONE_WRITE_PLUGGING. A request handling a BIO flagged with
this new flag is flagged with the new RQF_ZONE_WRITE_PLUGGING flag.
The completion processing of flagged BIOs and requests triggers calls to
the functions blk_zone_write_plug_bio_endio() and
blk_zone_write_plug_complete_request(), respectively. The latter function
is used to trigger submission of the next plugged BIO using the zone plug
work. blk_zone_write_plug_bio_endio() does the same for BIO-based devices.
This ensures that at any time, at most one request (blk-mq devices) or
one BIO (BIO-based devices) is being executed for any zone. The handling
of zone write plugs using a per-zone spinlock maximizes parallelism and
device usage by allowing multiple zones to be written simultaneously
without lock contention.

Zone write plugging ignores flush BIOs without data. However, any flush
BIO that has data is always plugged so that the write part of the flush
sequence is serialized with other regular writes.

Given that any BIO handled through zone write plugging will be the only
BIO in flight for the target zone when it is executed, the unplugging
and submission of a BIO has no chance of successfully merging with
plugged requests or requests in the scheduler. To overcome this
potential performance loss, blk_mq_submit_bio() calls the function
blk_zone_write_plug_attempt_merge() to try to merge other plugged BIOs
with the one just unplugged. Successful merging is signaled using
blk_zone_write_plug_bio_merged(), called from bio_attempt_back_merge().
Furthermore, to avoid recalculating the number of segments of plugged
BIOs when attempting merges, the number of segments of a plugged BIO is
saved in the new struct bio field __bi_nr_segments. To avoid growing
the size of struct bio, this field is added as a union with the
bi_cookie field. This is safe to do as polling is always disabled for
plugged BIOs.
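
A sketch of the resulting union in struct bio (assumption: the exact
placement is defined by the include/linux/blk_types.h hunk of this patch,
which is not quoted here):

	union {
		/* Polling cookie, only valid for BIOs that may be polled. */
		blk_qc_t	bi_cookie;
		/*
		 * Number of segments of a plugged BIO, only valid for BIOs
		 * handled by zone write plugging (polling is disabled for
		 * these BIOs).
		 */
		unsigned int	__bi_nr_segments;
	};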

When a BIO is added to a zone write plug, the device request queue
usage counter is always incremented. This reference is kept and reused
when the plugged BIO is unplugged and submitted again using
submit_bio_noacct_nocheck(). In this case, the unplugged BIO is already
flagged with BIO_ZONE_WRITE_PLUGGING and blk_mq_submit_bio() proceeds
directly to allocating a new request for the BIO, reusing the usage
reference count taken when the BIO was plugged. This extra reference
count is dropped in blk_zone_write_plug_attempt_merge() for any plugged
BIO that is successfully merged. Given that BIO-based devices will not
take this path, for these devices the extra reference is dropped when a
plugged BIO is unplugged and submitted.
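
The reference lifecycle described above can be summarized as follows
(a summary of this patch's behavior, not new code):

/*
 * Plug time (blk_zone_wplug_add_bio):
 *	percpu_ref_get(&q->q_usage_counter);
 *
 * Unplug time:
 *	blk-mq device:    the reference is reused by blk_mq_submit_bio() for
 *			  the new request allocated for the unplugged BIO.
 *	BIO-based device: blk_queue_exit() is called before resubmission,
 *			  since the BIO-based submission path takes its own
 *			  reference.
 *
 * Back merge into an existing request (blk_zone_write_plug_attempt_merge):
 *	blk_queue_exit() drops the extra reference of the merged BIO.
 *
 * BIO aborted (blk_zone_wplug_abort):
 *	bio_io_error() then blk_queue_exit().
 */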

To match the new data structures used for zoned disks, the function
disk_free_zone_bitmaps() is renamed to the more generic
disk_free_zone_resources().

This commit contains contributions from Christoph Hellwig <hch@lst.de>.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/bio.c               |   7 +
 block/blk-merge.c         |  11 +
 block/blk-mq.c            |  28 +++
 block/blk-zoned.c         | 408 +++++++++++++++++++++++++++++++++++++-
 block/blk.h               |  32 ++-
 block/genhd.c             |   2 +-
 include/linux/blk-mq.h    |   2 +
 include/linux/blk_types.h |   8 +-
 include/linux/blkdev.h    |   8 +
 9 files changed, 496 insertions(+), 10 deletions(-)

Comments

Ming Lei Feb. 4, 2024, 3:56 a.m. UTC | #1
On Fri, Feb 02, 2024 at 04:30:44PM +0900, Damien Le Moal wrote:
> Zone write plugging implements a per-zone "plug" for write operations to
> tightly control the submission and execution order of writes to
> sequential write required zones of a zoned block device. Per-zone
> plugging of writes guarantees that at any time at most one write request
> per zone is in flight. This mechanism is intended to replace zone write
> locking which is controlled at the scheduler level and implemented only
> by mq-deadline.
> 
> Unlike zone write locking which operates on requests, zone write
> plugging operates on BIOs. A zone write plug is simply a BIO list that
> is atomically manipulated using a spinlock and a kblockd submission
> work. A write BIO to a zone is "plugged" to delay its execution if a
> write BIO for the same zone was already issued, that is, if a write
> request for the same zone is being executed. The next plugged BIO is
> unplugged and issued once the write request completes.
> 
> This mechanism allows to:
>  - Untangles zone write ordering from block IO schedulers. This allows
>    removing the restriction on using only mq-deadline for zoned block
>    devices. Any block IO scheduler, including "none" can be used.
>  - Zone write plugging operates on BIOs instead of requests. Plugged
>    BIOs waiting for execution thus do not hold scheduling tags and thus
>    are not preventing other BIOs to proceed (reads or writes to other
>    zones). Depending on the workload, this can significantly improve
>    the device use and performance.
>  - Both blk-mq (request) based zoned devices and BIO-based devices (e.g.
>    device mapper) can use zone write plugging. It is mandatory for the
>    former but optional for the latter: BIO-based driver can use zone
>    write plugging to implement write ordering guarantees, or the drivers
>    can implement their own if needed.
>  - The code is less invasive in the block layer and is mostly limited to
>    blk-zoned.c with some small changes in blk-mq.c, blk-merge.c and
>    bio.c.
> 
> Zone write plugging is implemented using struct blk_zone_wplug. This
> structurei includes a spinlock, a BIO list and a work structure to
> handle the submission of plugged BIOs.
> 
> Plugging of zone write BIOs is done using the function
> blk_zone_write_plug_bio() which returns false if a BIO execution does
> not need to be delayed and true otherwise. This function is called
> from blk_mq_submit_bio() after a BIO is split to avoid large BIOs
> spanning multiple zones which would cause mishandling of zone write
> plugging. This enables by default zone write plugging for any mq
> request-based block device. BIO-based device drivers can also use zone
> write plugging by expliclty calling blk_zone_write_plug_bio() in their
> ->submit_bio method. For such devices, the driver must ensure that a
> BIO passed to blk_zone_write_plug_bio() is already split and not
> straddling zone boundaries.
> 
> Only write and write zeroes BIOs are plugged. Zone write plugging does
> not introduce any significant overhead for other operations. A BIO that
> is being handled through zone write plugging is flagged using the new
> BIO flag BIO_ZONE_WRITE_PLUGGING. A request handling a BIO flagged with
> this new flag is flagged with the new RQF_ZONE_WRITE_PLUGGING flag.
> The completion processing of BIOs and requests flagged trigger
> respectively calls to the functions blk_zone_write_plug_bio_endio() and
> blk_zone_write_plug_complete_request(). The latter function is used to
> trigger submission of the next plugged BIO using the zone plug work.
> blk_zone_write_plug_bio_endio() does the same for BIO-based devices.
> This ensures that at any time, at most one request (blk-mq devices) or
> one BIO (BIO-based devices) are being executed for any zone. The
> handling of zone write plug using a per-zone plug spinlock maximizes
> parrallelism and device usage by allowing multiple zones to be writen
> simultaneously without lock contention.
> 
> Zone write plugging ignores flush BIOs without data. Hovever, any flush
> BIO that has data is always plugged so that the write part of the flush
> sequence is serialized with other regular writes.
> 
> Given that any BIO handled through zone write plugging will be the only
> BIO in flight for the target zone when it is executed, the unplugging
> and submission of a BIO will have no chance of successfully merging with
> plugged requests or requests in the scheduler. To overcome this
> potential performance loss, blk_mq_submit_bio() calls the function
> blk_zone_write_plug_attempt_merge() to try to merge other plugged BIOs
> with the one just unplugged. Successful merging is signaled using
> blk_zone_write_plug_bio_merged(), called from bio_attempt_back_merge().
> Furthermore, to avoid recalculating the number of segments of plugged
> BIOs to attempt merging, the number of segments of a plugged BIO is
> saved using the new struct bio field __bi_nr_segments. To avoid growing
> the size of struct bio, this field is added as a union with the
> bio_cookie field. This is safe to do as polling is always disabled for
> plugged BIOs.
> 
> When BIOs are plugged in a zone write plug, the device request queue
> usage counter is always incremented. This kept and reused when the
> plugged BIO is unplugged and submitted again using
> submit_bio_noacct_nocheck(). For this case, the unplugged BIO is already
> flagged with BIO_ZONE_WRITE_PLUGGING and blk_mq_submit_bio() proceeds
> directly to allocating a new request for the BIO, re-using the usage
> reference count taken when the BIO was plugged. This extra reference
> count is dropped in blk_zone_write_plug_attempt_merge() for any plugged
> BIO that is successfully merged. Given that BIO-based devices will not
> take this path, the extra reference is dropped when a plugged BIO is
> unplugged and submitted.
> 
> To match the new data structures used for zoned disks, the function
> disk_free_zone_bitmaps() is renamed to the more generic
> disk_free_zone_resources().
> 
> This commit contains contributions from Christoph Hellwig <hch@lst.de>.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>  block/bio.c               |   7 +
>  block/blk-merge.c         |  11 +
>  block/blk-mq.c            |  28 +++
>  block/blk-zoned.c         | 408 +++++++++++++++++++++++++++++++++++++-
>  block/blk.h               |  32 ++-
>  block/genhd.c             |   2 +-
>  include/linux/blk-mq.h    |   2 +
>  include/linux/blk_types.h |   8 +-
>  include/linux/blkdev.h    |   8 +
>  9 files changed, 496 insertions(+), 10 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index b9642a41f286..c8b0f7e8c713 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1581,6 +1581,13 @@ void bio_endio(struct bio *bio)
>  	if (!bio_integrity_endio(bio))
>  		return;
>  
> +	/*
> +	 * For BIOs handled through a zone write plugs, signal the end of the
> +	 * BIO to the zone write plug to submit the next plugged BIO.
> +	 */
> +	if (bio_zone_write_plugging(bio))
> +		blk_zone_write_plug_bio_endio(bio);
> +
>  	rq_qos_done_bio(bio);
>  
>  	if (bio->bi_bdev && bio_flagged(bio, BIO_TRACE_COMPLETION)) {
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index a1ef61b03e31..2b5489cd9c65 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -377,6 +377,7 @@ struct bio *__bio_split_to_limits(struct bio *bio,
>  		blkcg_bio_issue_init(split);
>  		bio_chain(split, bio);
>  		trace_block_split(split, bio->bi_iter.bi_sector);
> +		WARN_ON_ONCE(bio_zone_write_plugging(bio));
>  		submit_bio_noacct(bio);
>  		return split;
>  	}
> @@ -980,6 +981,9 @@ enum bio_merge_status bio_attempt_back_merge(struct request *req,
>  
>  	blk_update_mixed_merge(req, bio, false);
>  
> +	if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
> +		blk_zone_write_plug_bio_merged(bio);
> +
>  	req->biotail->bi_next = bio;
>  	req->biotail = bio;
>  	req->__data_len += bio->bi_iter.bi_size;
> @@ -995,6 +999,13 @@ static enum bio_merge_status bio_attempt_front_merge(struct request *req,
>  {
>  	const blk_opf_t ff = bio_failfast(bio);
>  
> +	/*
> +	 * A front merge for zone writes can happen only if the user submitted
> +	 * writes out of order. Do not attempt this to let the write fail.
> +	 */
> +	if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
> +		return BIO_MERGE_FAILED;
> +
>  	if (!ll_front_merge_fn(req, bio, nr_segs))
>  		return BIO_MERGE_FAILED;
>  
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f02e486a02ae..aa49bebf1199 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -830,6 +830,9 @@ static void blk_complete_request(struct request *req)
>  		bio = next;
>  	} while (bio);
>  
> +	if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
> +		blk_zone_write_plug_complete_request(req);
> +
>  	/*
>  	 * Reset counters so that the request stacking driver
>  	 * can find how many bytes remain in the request
> @@ -943,6 +946,9 @@ bool blk_update_request(struct request *req, blk_status_t error,
>  	 * completely done
>  	 */
>  	if (!req->bio) {
> +		if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
> +			blk_zone_write_plug_complete_request(req);
> +
>  		/*
>  		 * Reset counters so that the request stacking driver
>  		 * can find how many bytes remain in the request
> @@ -2975,6 +2981,17 @@ void blk_mq_submit_bio(struct bio *bio)
>  	struct request *rq;
>  	blk_status_t ret;
>  
> +	/*
> +	 * A BIO that was released form a zone write plug has already been
> +	 * through the preparation in this function, already holds a reference
> +	 * on the queue usage counter, and is the only write BIO in-flight for
> +	 * the target zone. Go straight to allocating a request for it.
> +	 */
> +	if (bio_zone_write_plugging(bio)) {
> +		nr_segs = bio->__bi_nr_segments;
> +		goto new_request;
> +	}
> +
>  	bio = blk_queue_bounce(bio, q);
>  	bio_set_ioprio(bio);
>  
> @@ -3001,7 +3018,11 @@ void blk_mq_submit_bio(struct bio *bio)
>  	if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
>  		goto queue_exit;
>  
> +	if (blk_queue_is_zoned(q) && blk_zone_write_plug_bio(bio, nr_segs))
> +		goto queue_exit;
> +
>  	if (!rq) {
> +new_request:
>  		rq = blk_mq_get_new_requests(q, plug, bio, nr_segs);
>  		if (unlikely(!rq))
>  			goto queue_exit;
> @@ -3017,8 +3038,12 @@ void blk_mq_submit_bio(struct bio *bio)
>  
>  	ret = blk_crypto_rq_get_keyslot(rq);
>  	if (ret != BLK_STS_OK) {
> +		bool zwplugging = bio_zone_write_plugging(bio);
> +
>  		bio->bi_status = ret;
>  		bio_endio(bio);
> +		if (zwplugging)
> +			blk_zone_write_plug_complete_request(rq);
>  		blk_mq_free_request(rq);
>  		return;
>  	}
> @@ -3026,6 +3051,9 @@ void blk_mq_submit_bio(struct bio *bio)
>  	if (op_is_flush(bio->bi_opf) && blk_insert_flush(rq))
>  		return;
>  
> +	if (bio_zone_write_plugging(bio))
> +		blk_zone_write_plug_attempt_merge(rq);
> +
>  	if (plug) {
>  		blk_add_rq_to_plug(plug, rq);
>  		return;
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index d343e5756a9c..f6d4f511b664 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -7,11 +7,11 @@
>   *
>   * Copyright (c) 2016, Damien Le Moal
>   * Copyright (c) 2016, Western Digital
> + * Copyright (c) 2024, Western Digital Corporation or its affiliates.
>   */
>  
>  #include <linux/kernel.h>
>  #include <linux/module.h>
> -#include <linux/rbtree.h>
>  #include <linux/blkdev.h>
>  #include <linux/blk-mq.h>
>  #include <linux/mm.h>
> @@ -19,6 +19,7 @@
>  #include <linux/sched/mm.h>
>  
>  #include "blk.h"
> +#include "blk-mq-sched.h"
>  
>  #define ZONE_COND_NAME(name) [BLK_ZONE_COND_##name] = #name
>  static const char *const zone_cond_name[] = {
> @@ -33,6 +34,27 @@ static const char *const zone_cond_name[] = {
>  };
>  #undef ZONE_COND_NAME
>  
> +/*
> + * Per-zone write plug.
> + */
> +struct blk_zone_wplug {
> +	spinlock_t		lock;
> +	unsigned int		flags;
> +	struct bio_list		bio_list;
> +	struct work_struct	bio_work;
> +};
> +
> +/*
> + * Zone write plug flags bits:
> + *  - BLK_ZONE_WPLUG_CONV: Indicate that the zone is a conventional one. Writes
> + *    to these zones are never plugged.
> + *  - BLK_ZONE_WPLUG_PLUGGED: Indicate that the zone write plug is plugged,
> + *    that is, that write BIOs are being throttled due to a write BIO already
> + *    being executed or the zone write plug bio list is not empty.
> + */
> +#define BLK_ZONE_WPLUG_CONV	(1U << 0)
> +#define BLK_ZONE_WPLUG_PLUGGED	(1U << 1)

BLK_ZONE_WPLUG_PLUGGED == !bio_list_empty(&zwplug->bio_list), so it looks
like this flag isn't necessary.

> +
>  /**
>   * blk_zone_cond_str - Return string XXX in BLK_ZONE_COND_XXX.
>   * @zone_cond: BLK_ZONE_COND_XXX.
> @@ -429,12 +451,374 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
>  	return ret;
>  }
>  
> -void disk_free_zone_bitmaps(struct gendisk *disk)
> +#define blk_zone_wplug_lock(zwplug, flags) \
> +	spin_lock_irqsave(&zwplug->lock, flags)
> +
> +#define blk_zone_wplug_unlock(zwplug, flags) \
> +	spin_unlock_irqrestore(&zwplug->lock, flags)
> +
> +static inline void blk_zone_wplug_bio_io_error(struct bio *bio)
> +{
> +	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> +
> +	bio_clear_flag(bio, BIO_ZONE_WRITE_PLUGGING);
> +	bio_io_error(bio);
> +	blk_queue_exit(q);
> +}
> +
> +static int blk_zone_wplug_abort(struct gendisk *disk,
> +				struct blk_zone_wplug *zwplug)
> +{
> +	struct bio *bio;
> +	int nr_aborted = 0;
> +
> +	while ((bio = bio_list_pop(&zwplug->bio_list))) {
> +		blk_zone_wplug_bio_io_error(bio);
> +		nr_aborted++;
> +	}
> +
> +	return nr_aborted;
> +}
> +
> +/*
> + * Return the zone write plug for sector in sequential write required zone.
> + * Given that conventional zones have no write ordering constraints, NULL is
> + * returned for sectors in conventional zones, to indicate that zone write
> + * plugging is not needed.
> + */
> +static inline struct blk_zone_wplug *
> +disk_lookup_zone_wplug(struct gendisk *disk, sector_t sector)
> +{
> +	struct blk_zone_wplug *zwplug;
> +
> +	if (WARN_ON_ONCE(!disk->zone_wplugs))
> +		return NULL;
> +
> +	zwplug = &disk->zone_wplugs[disk_zone_no(disk, sector)];
> +	if (zwplug->flags & BLK_ZONE_WPLUG_CONV)
> +		return NULL;
> +	return zwplug;
> +}
> +
> +static inline struct blk_zone_wplug *bio_lookup_zone_wplug(struct bio *bio)
> +{
> +	return disk_lookup_zone_wplug(bio->bi_bdev->bd_disk,
> +				      bio->bi_iter.bi_sector);
> +}
> +
> +static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
> +					  struct bio *bio, unsigned int nr_segs)
> +{
> +	/*
> +	 * Keep a reference on the BIO request queue usage. This reference will
> +	 * be dropped either if the BIO is failed or after it is issued and
> +	 * completes.
> +	 */
> +	percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);

It is fragile to take nested references on the usage counter, and the same
goes for grabbing/releasing it from different contexts or even functions.
It would be much better to just let the block layer maintain it.
Hannes Reinecke Feb. 4, 2024, 12:14 p.m. UTC | #2
On 2/2/24 15:30, Damien Le Moal wrote:
> Zone write plugging implements a per-zone "plug" for write operations to
> tightly control the submission and execution order of writes to
> sequential write required zones of a zoned block device. Per-zone
> plugging of writes guarantees that at any time at most one write request
> per zone is in flight. This mechanism is intended to replace zone write
> locking which is controlled at the scheduler level and implemented only
> by mq-deadline.
> 
> Unlike zone write locking which operates on requests, zone write
> plugging operates on BIOs. A zone write plug is simply a BIO list that
> is atomically manipulated using a spinlock and a kblockd submission
> work. A write BIO to a zone is "plugged" to delay its execution if a
> write BIO for the same zone was already issued, that is, if a write
> request for the same zone is being executed. The next plugged BIO is
> unplugged and issued once the write request completes.
> 
> This mechanism allows to:
>   - Untangles zone write ordering from block IO schedulers. This allows
>     removing the restriction on using only mq-deadline for zoned block
>     devices. Any block IO scheduler, including "none" can be used.
>   - Zone write plugging operates on BIOs instead of requests. Plugged
>     BIOs waiting for execution thus do not hold scheduling tags and thus
>     are not preventing other BIOs to proceed (reads or writes to other
>     zones). Depending on the workload, this can significantly improve
>     the device use and performance.
>   - Both blk-mq (request) based zoned devices and BIO-based devices (e.g.
>     device mapper) can use zone write plugging. It is mandatory for the
>     former but optional for the latter: BIO-based driver can use zone
>     write plugging to implement write ordering guarantees, or the drivers
>     can implement their own if needed.
>   - The code is less invasive in the block layer and is mostly limited to
>     blk-zoned.c with some small changes in blk-mq.c, blk-merge.c and
>     bio.c.
> 
> Zone write plugging is implemented using struct blk_zone_wplug. This
> structurei includes a spinlock, a BIO list and a work structure to
> handle the submission of plugged BIOs.
> 
> Plugging of zone write BIOs is done using the function
> blk_zone_write_plug_bio() which returns false if a BIO execution does
> not need to be delayed and true otherwise. This function is called
> from blk_mq_submit_bio() after a BIO is split to avoid large BIOs
> spanning multiple zones which would cause mishandling of zone write
> plugging. This enables by default zone write plugging for any mq
> request-based block device. BIO-based device drivers can also use zone
> write plugging by expliclty calling blk_zone_write_plug_bio() in their
> ->submit_bio method. For such devices, the driver must ensure that a
> BIO passed to blk_zone_write_plug_bio() is already split and not
> straddling zone boundaries.
> 
> Only write and write zeroes BIOs are plugged. Zone write plugging does
> not introduce any significant overhead for other operations. A BIO that
> is being handled through zone write plugging is flagged using the new
> BIO flag BIO_ZONE_WRITE_PLUGGING. A request handling a BIO flagged with
> this new flag is flagged with the new RQF_ZONE_WRITE_PLUGGING flag.
> The completion processing of BIOs and requests flagged trigger
> respectively calls to the functions blk_zone_write_plug_bio_endio() and
> blk_zone_write_plug_complete_request(). The latter function is used to
> trigger submission of the next plugged BIO using the zone plug work.
> blk_zone_write_plug_bio_endio() does the same for BIO-based devices.
> This ensures that at any time, at most one request (blk-mq devices) or
> one BIO (BIO-based devices) are being executed for any zone. The
> handling of zone write plug using a per-zone plug spinlock maximizes
> parrallelism and device usage by allowing multiple zones to be writen
> simultaneously without lock contention.
> 
> Zone write plugging ignores flush BIOs without data. Hovever, any flush
> BIO that has data is always plugged so that the write part of the flush
> sequence is serialized with other regular writes.
> 
> Given that any BIO handled through zone write plugging will be the only
> BIO in flight for the target zone when it is executed, the unplugging
> and submission of a BIO will have no chance of successfully merging with
> plugged requests or requests in the scheduler. To overcome this
> potential performance loss, blk_mq_submit_bio() calls the function
> blk_zone_write_plug_attempt_merge() to try to merge other plugged BIOs
> with the one just unplugged. Successful merging is signaled using
> blk_zone_write_plug_bio_merged(), called from bio_attempt_back_merge().
> Furthermore, to avoid recalculating the number of segments of plugged
> BIOs to attempt merging, the number of segments of a plugged BIO is
> saved using the new struct bio field __bi_nr_segments. To avoid growing
> the size of struct bio, this field is added as a union with the
> bio_cookie field. This is safe to do as polling is always disabled for
> plugged BIOs.
> 
> When BIOs are plugged in a zone write plug, the device request queue
> usage counter is always incremented. This kept and reused when the
> plugged BIO is unplugged and submitted again using
> submit_bio_noacct_nocheck(). For this case, the unplugged BIO is already
> flagged with BIO_ZONE_WRITE_PLUGGING and blk_mq_submit_bio() proceeds
> directly to allocating a new request for the BIO, re-using the usage
> reference count taken when the BIO was plugged. This extra reference
> count is dropped in blk_zone_write_plug_attempt_merge() for any plugged
> BIO that is successfully merged. Given that BIO-based devices will not
> take this path, the extra reference is dropped when a plugged BIO is
> unplugged and submitted.
> 
> To match the new data structures used for zoned disks, the function
> disk_free_zone_bitmaps() is renamed to the more generic
> disk_free_zone_resources().
> 
> This commit contains contributions from Christoph Hellwig <hch@lst.de>.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   block/bio.c               |   7 +
>   block/blk-merge.c         |  11 +
>   block/blk-mq.c            |  28 +++
>   block/blk-zoned.c         | 408 +++++++++++++++++++++++++++++++++++++-
>   block/blk.h               |  32 ++-
>   block/genhd.c             |   2 +-
>   include/linux/blk-mq.h    |   2 +
>   include/linux/blk_types.h |   8 +-
>   include/linux/blkdev.h    |   8 +
>   9 files changed, 496 insertions(+), 10 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index b9642a41f286..c8b0f7e8c713 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1581,6 +1581,13 @@ void bio_endio(struct bio *bio)
>   	if (!bio_integrity_endio(bio))
>   		return;
>   
> +	/*
> +	 * For BIOs handled through a zone write plugs, signal the end of the
> +	 * BIO to the zone write plug to submit the next plugged BIO.
> +	 */
> +	if (bio_zone_write_plugging(bio))
> +		blk_zone_write_plug_bio_endio(bio);
> +
>   	rq_qos_done_bio(bio);
>   
>   	if (bio->bi_bdev && bio_flagged(bio, BIO_TRACE_COMPLETION)) {
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index a1ef61b03e31..2b5489cd9c65 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -377,6 +377,7 @@ struct bio *__bio_split_to_limits(struct bio *bio,
>   		blkcg_bio_issue_init(split);
>   		bio_chain(split, bio);
>   		trace_block_split(split, bio->bi_iter.bi_sector);
> +		WARN_ON_ONCE(bio_zone_write_plugging(bio));
>   		submit_bio_noacct(bio);
>   		return split;
>   	}
> @@ -980,6 +981,9 @@ enum bio_merge_status bio_attempt_back_merge(struct request *req,
>   
>   	blk_update_mixed_merge(req, bio, false);
>   
> +	if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
> +		blk_zone_write_plug_bio_merged(bio);
> +
>   	req->biotail->bi_next = bio;
>   	req->biotail = bio;
>   	req->__data_len += bio->bi_iter.bi_size;
> @@ -995,6 +999,13 @@ static enum bio_merge_status bio_attempt_front_merge(struct request *req,
>   {
>   	const blk_opf_t ff = bio_failfast(bio);
>   
> +	/*
> +	 * A front merge for zone writes can happen only if the user submitted
> +	 * writes out of order. Do not attempt this to let the write fail.
> +	 */
> +	if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
> +		return BIO_MERGE_FAILED;
> +
>   	if (!ll_front_merge_fn(req, bio, nr_segs))
>   		return BIO_MERGE_FAILED;
>   
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f02e486a02ae..aa49bebf1199 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -830,6 +830,9 @@ static void blk_complete_request(struct request *req)
>   		bio = next;
>   	} while (bio);
>   
> +	if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
> +		blk_zone_write_plug_complete_request(req);
> +
>   	/*
>   	 * Reset counters so that the request stacking driver
>   	 * can find how many bytes remain in the request
> @@ -943,6 +946,9 @@ bool blk_update_request(struct request *req, blk_status_t error,
>   	 * completely done
>   	 */
>   	if (!req->bio) {
> +		if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
> +			blk_zone_write_plug_complete_request(req);
> +
>   		/*
>   		 * Reset counters so that the request stacking driver
>   		 * can find how many bytes remain in the request
> @@ -2975,6 +2981,17 @@ void blk_mq_submit_bio(struct bio *bio)
>   	struct request *rq;
>   	blk_status_t ret;
>   
> +	/*
> +	 * A BIO that was released form a zone write plug has already been
> +	 * through the preparation in this function, already holds a reference
> +	 * on the queue usage counter, and is the only write BIO in-flight for
> +	 * the target zone. Go straight to allocating a request for it.
> +	 */
> +	if (bio_zone_write_plugging(bio)) {
> +		nr_segs = bio->__bi_nr_segments;
> +		goto new_request;
> +	}
> +
>   	bio = blk_queue_bounce(bio, q);
>   	bio_set_ioprio(bio);
>   
> @@ -3001,7 +3018,11 @@ void blk_mq_submit_bio(struct bio *bio)
>   	if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
>   		goto queue_exit;
>   
> +	if (blk_queue_is_zoned(q) && blk_zone_write_plug_bio(bio, nr_segs))
> +		goto queue_exit;
> +
>   	if (!rq) {
> +new_request:
>   		rq = blk_mq_get_new_requests(q, plug, bio, nr_segs);
>   		if (unlikely(!rq))
>   			goto queue_exit;
> @@ -3017,8 +3038,12 @@ void blk_mq_submit_bio(struct bio *bio)
>   
>   	ret = blk_crypto_rq_get_keyslot(rq);
>   	if (ret != BLK_STS_OK) {
> +		bool zwplugging = bio_zone_write_plugging(bio);
> +
>   		bio->bi_status = ret;
>   		bio_endio(bio);
> +		if (zwplugging)
> +			blk_zone_write_plug_complete_request(rq);
>   		blk_mq_free_request(rq);
>   		return;
>   	}
> @@ -3026,6 +3051,9 @@ void blk_mq_submit_bio(struct bio *bio)
>   	if (op_is_flush(bio->bi_opf) && blk_insert_flush(rq))
>   		return;
>   
> +	if (bio_zone_write_plugging(bio))
> +		blk_zone_write_plug_attempt_merge(rq);
> +
>   	if (plug) {
>   		blk_add_rq_to_plug(plug, rq);
>   		return;
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index d343e5756a9c..f6d4f511b664 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -7,11 +7,11 @@
>    *
>    * Copyright (c) 2016, Damien Le Moal
>    * Copyright (c) 2016, Western Digital
> + * Copyright (c) 2024, Western Digital Corporation or its affiliates.
>    */
>   
>   #include <linux/kernel.h>
>   #include <linux/module.h>
> -#include <linux/rbtree.h>
>   #include <linux/blkdev.h>
>   #include <linux/blk-mq.h>
>   #include <linux/mm.h>
> @@ -19,6 +19,7 @@
>   #include <linux/sched/mm.h>
>   
>   #include "blk.h"
> +#include "blk-mq-sched.h"
>   
>   #define ZONE_COND_NAME(name) [BLK_ZONE_COND_##name] = #name
>   static const char *const zone_cond_name[] = {
> @@ -33,6 +34,27 @@ static const char *const zone_cond_name[] = {
>   };
>   #undef ZONE_COND_NAME
>   
> +/*
> + * Per-zone write plug.
> + */
> +struct blk_zone_wplug {
> +	spinlock_t		lock;
> +	unsigned int		flags;
> +	struct bio_list		bio_list;
> +	struct work_struct	bio_work;
> +};
> +
> +/*
> + * Zone write plug flags bits:
> + *  - BLK_ZONE_WPLUG_CONV: Indicate that the zone is a conventional one. Writes
> + *    to these zones are never plugged.
> + *  - BLK_ZONE_WPLUG_PLUGGED: Indicate that the zone write plug is plugged,
> + *    that is, that write BIOs are being throttled due to a write BIO already
> + *    being executed or the zone write plug bio list is not empty.
> + */
> +#define BLK_ZONE_WPLUG_CONV	(1U << 0)
> +#define BLK_ZONE_WPLUG_PLUGGED	(1U << 1)
> +
>   /**
>    * blk_zone_cond_str - Return string XXX in BLK_ZONE_COND_XXX.
>    * @zone_cond: BLK_ZONE_COND_XXX.
> @@ -429,12 +451,374 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
>   	return ret;
>   }
>   
> -void disk_free_zone_bitmaps(struct gendisk *disk)
> +#define blk_zone_wplug_lock(zwplug, flags) \
> +	spin_lock_irqsave(&zwplug->lock, flags)
> +
> +#define blk_zone_wplug_unlock(zwplug, flags) \
> +	spin_unlock_irqrestore(&zwplug->lock, flags)
> +
> +static inline void blk_zone_wplug_bio_io_error(struct bio *bio)
> +{
> +	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> +
> +	bio_clear_flag(bio, BIO_ZONE_WRITE_PLUGGING);
> +	bio_io_error(bio);
> +	blk_queue_exit(q);
> +}
> +
> +static int blk_zone_wplug_abort(struct gendisk *disk,
> +				struct blk_zone_wplug *zwplug)
> +{
> +	struct bio *bio;
> +	int nr_aborted = 0;
> +
> +	while ((bio = bio_list_pop(&zwplug->bio_list))) {
> +		blk_zone_wplug_bio_io_error(bio);
> +		nr_aborted++;
> +	}
> +
> +	return nr_aborted;
> +}
> +
> +/*
> + * Return the zone write plug for sector in sequential write required zone.
> + * Given that conventional zones have no write ordering constraints, NULL is
> + * returned for sectors in conventional zones, to indicate that zone write
> + * plugging is not needed.
> + */
> +static inline struct blk_zone_wplug *
> +disk_lookup_zone_wplug(struct gendisk *disk, sector_t sector)
> +{
> +	struct blk_zone_wplug *zwplug;
> +
> +	if (WARN_ON_ONCE(!disk->zone_wplugs))
> +		return NULL;
> +
> +	zwplug = &disk->zone_wplugs[disk_zone_no(disk, sector)];
> +	if (zwplug->flags & BLK_ZONE_WPLUG_CONV)
> +		return NULL;
> +	return zwplug;
> +}
> +
> +static inline struct blk_zone_wplug *bio_lookup_zone_wplug(struct bio *bio)
> +{
> +	return disk_lookup_zone_wplug(bio->bi_bdev->bd_disk,
> +				      bio->bi_iter.bi_sector);
> +}
> +
> +static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
> +					  struct bio *bio, unsigned int nr_segs)
> +{
> +	/*
> +	 * Keep a reference on the BIO request queue usage. This reference will
> +	 * be dropped either if the BIO is failed or after it is issued and
> +	 * completes.
> +	 */
> +	percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
> +
As discussed, wouldn't it be sufficient to increase the q_usage_counter 
only for the plug itself, and not the bios?
The bios are already allocated, and I think we would want to use a
separate reference here, as the 'plug' has a different lifetime than the
bios which are added to the plug.

> +	/*
> +	 * The BIO is being plugged and thus will have to wait for the on-going
> +	 * write and for all other writes already plugged. So polling makes
> +	 * no sense.
> +	 */
> +	bio_clear_polled(bio);
> +
> +	/*
> +	 * Reuse the poll cookie field to store the number of segments when
> +	 * split to the hardware limits.
> +	 */
> +	bio->__bi_nr_segments = nr_segs;
> +
> +	/*
> +	 * We always receive BIOs after they are split and ready to be issued.
> +	 * The block layer passes the parts of a split BIO in order, and the
> +	 * user must also issue write sequentially. So simply add the new BIO
> +	 * at the tail of the list to preserve the sequential write order.
> +	 */
> +	bio_list_add(&zwplug->bio_list, bio);
> +}
> +
> +/*
> + * Called from bio_attempt_back_merge() when a BIO was merged with a request.
> + */
> +void blk_zone_write_plug_bio_merged(struct bio *bio)
> +{
> +	bio_set_flag(bio, BIO_ZONE_WRITE_PLUGGING);
> +}
> +
> +/*
> + * Attempt to merge plugged BIOs with a newly formed request of a BIO that went
> + * through zone write plugging (either a new BIO or one that was unplugged).
> + */
> +void blk_zone_write_plug_attempt_merge(struct request *req)
> +{
> +	struct blk_zone_wplug *zwplug = bio_lookup_zone_wplug(req->bio);
> +	sector_t req_back_sector = blk_rq_pos(req) + blk_rq_sectors(req);
> +	struct request_queue *q = req->q;
> +	unsigned long flags;
> +	struct bio *bio;
> +
> +	/*
> +	 * Completion of this request needs to be handled with
> +	 * blk_zone_write_complete_request().
> +	 */
> +	req->rq_flags |= RQF_ZONE_WRITE_PLUGGING;
> +
> +	if (blk_queue_nomerges(q))
> +		return;
> +
> +	/*
> +	 * Walk through the list of plugged BIOs to check if they can be merged
> +	 * into the back of the request.
> +	 */
> +	blk_zone_wplug_lock(zwplug, flags);
> +	while ((bio = bio_list_peek(&zwplug->bio_list))) {
> +		if (bio->bi_iter.bi_sector != req_back_sector ||
> +		    !blk_rq_merge_ok(req, bio))
> +			break;
> +
> +		WARN_ON_ONCE(bio_op(bio) != REQ_OP_WRITE_ZEROES &&
> +			     !bio->__bi_nr_segments);
> +
> +		bio_list_pop(&zwplug->bio_list);
> +		if (bio_attempt_back_merge(req, bio, bio->__bi_nr_segments) !=
> +		    BIO_MERGE_OK) {
> +			bio_list_add_head(&zwplug->bio_list, bio);
> +			break;
> +		}
> +
> +		/*
> +		 * Drop the extra reference on the queue usage we got when
> +		 * plugging the BIO.
> +		 */
> +		blk_queue_exit(q);
> +
> +		req_back_sector += bio_sectors(bio);
> +	}
> +	blk_zone_wplug_unlock(zwplug, flags);
> +}

And that's the other thing with which I'm slightly uncomfortable.
We're replicating parts of the generic merging code here.
It would be far better from a maintenance standpoint if we had only one 
place where we deal with bio merging.
But I do see the challenge, so this is more of a reminder than something
which needs to be fixed.

Cheers,

Hannes
Damien Le Moal Feb. 4, 2024, 11:57 p.m. UTC | #3
On 2/4/24 12:56, Ming Lei wrote:
> On Fri, Feb 02, 2024 at 04:30:44PM +0900, Damien Le Moal wrote:
>> +/*
>> + * Zone write plug flags bits:
>> + *  - BLK_ZONE_WPLUG_CONV: Indicate that the zone is a conventional one. Writes
>> + *    to these zones are never plugged.
>> + *  - BLK_ZONE_WPLUG_PLUGGED: Indicate that the zone write plug is plugged,
>> + *    that is, that write BIOs are being throttled due to a write BIO already
>> + *    being executed or the zone write plug bio list is not empty.
>> + */
>> +#define BLK_ZONE_WPLUG_CONV	(1U << 0)
>> +#define BLK_ZONE_WPLUG_PLUGGED	(1U << 1)
> 
> BLK_ZONE_WPLUG_PLUGGED == !bio_list_empty(&zwplug->bio_list), so looks
> this flag isn't necessary.

No, it is. As the description says, the flag not only indicates that there are
plugged BIOs, but it also indicates that there is a write for the zone
in-flight. And that can happen even with the BIO list being empty. E.g. for a
qd=1 workload of small BIOs, no BIO will ever be added to the BIO list, but the
zone still must be marked as "plugged" when a write BIO is issued for it.
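
In other words, the write path behaves roughly like this (an illustrative
sketch only, with a hypothetical helper name, not the actual code from this
series):

static bool blk_zone_wplug_handle_write(struct blk_zone_wplug *zwplug,
					struct bio *bio, unsigned int nr_segs)
{
	if (zwplug->flags & BLK_ZONE_WPLUG_PLUGGED) {
		/*
		 * A write for this zone is already in flight (even if the
		 * BIO list is empty): queue this BIO for later submission.
		 */
		blk_zone_wplug_add_bio(zwplug, bio, nr_segs);
		return true;
	}

	/* No write in flight: mark the zone plugged and issue this BIO now. */
	zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
	return false;
}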

>> +static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
>> +					  struct bio *bio, unsigned int nr_segs)
>> +{
>> +	/*
>> +	 * Keep a reference on the BIO request queue usage. This reference will
>> +	 * be dropped either if the BIO is failed or after it is issued and
>> +	 * completes.
>> +	 */
>> +	percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
> 
> It is fragile to get nested usage_counter, and same with grabbing/releasing it
> from different contexts or even functions, and it could be much better to just
> let block layer maintain it.
> 
> From patch 23's change:
> 
> +	 * Zoned block device information. Reads of this information must be
> +	 * protected with blk_queue_enter() / blk_queue_exit(). Modifying this
> 
> Anytime if there is in-flight bio, the block device is opened, so both gendisk and
> request_queue are live, so not sure if this .q_usage_counter protection
> is needed.

Hannes also commented about this. Let me revisit this.

>> +	/*
>> +	 * blk-mq devices will reuse the reference on the request queue usage
>> +	 * we took when the BIO was plugged, but the submission path for
>> +	 * BIO-based devices will not do that. So drop this reference here.
>> +	 */
>> +	if (bio->bi_bdev->bd_has_submit_bio)
>> +		blk_queue_exit(bio->bi_bdev->bd_disk->queue);
> 
> But I don't see where this reference is reused for blk-mq in this patch,
> care to point it out?

This patch modifies blk_mq_submit_bio() to add a "goto new_request" at the top
for any BIO flagged with BIO_ZONE_WRITE_PLUGGING. So when a plugged BIO is
unplugged and submitted again, the reference that was taken in
blk_zone_wplug_add_bio() is reused for the new request for that BIO.
Ming Lei Feb. 5, 2024, 2:19 a.m. UTC | #4
On Mon, Feb 05, 2024 at 08:57:00AM +0900, Damien Le Moal wrote:
> On 2/4/24 12:56, Ming Lei wrote:
> > On Fri, Feb 02, 2024 at 04:30:44PM +0900, Damien Le Moal wrote:
> >> +/*
> >> + * Zone write plug flags bits:
> >> + *  - BLK_ZONE_WPLUG_CONV: Indicate that the zone is a conventional one. Writes
> >> + *    to these zones are never plugged.
> >> + *  - BLK_ZONE_WPLUG_PLUGGED: Indicate that the zone write plug is plugged,
> >> + *    that is, that write BIOs are being throttled due to a write BIO already
> >> + *    being executed or the zone write plug bio list is not empty.
> >> + */
> >> +#define BLK_ZONE_WPLUG_CONV	(1U << 0)
> >> +#define BLK_ZONE_WPLUG_PLUGGED	(1U << 1)
> > 
> > BLK_ZONE_WPLUG_PLUGGED == !bio_list_empty(&zwplug->bio_list), so looks
> > this flag isn't necessary.
> 
> No, it is. As the description says, the flag not only indicates that there are
> plugged BIOs, but it also indicates that there is a write for the zone
> in-flight. And that can happen even with the BIO list being empty. E.g. for a
> qd=1 workload of small BIOs, no BIO will ever be added to the BIO list, but the
> zone still must be marked as "plugged" when a write BIO is issued for it.

OK.

> 
> >> +static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
> >> +					  struct bio *bio, unsigned int nr_segs)
> >> +{
> >> +	/*
> >> +	 * Keep a reference on the BIO request queue usage. This reference will
> >> +	 * be dropped either if the BIO is failed or after it is issued and
> >> +	 * completes.
> >> +	 */
> >> +	percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
> > 
> > It is fragile to get nested usage_counter, and same with grabbing/releasing it
> > from different contexts or even functions, and it could be much better to just
> > let block layer maintain it.
> > 
> > From patch 23's change:
> > 
> > +	 * Zoned block device information. Reads of this information must be
> > +	 * protected with blk_queue_enter() / blk_queue_exit(). Modifying this
> > 
> > Anytime if there is in-flight bio, the block device is opened, so both gendisk and
> > request_queue are live, so not sure if this .q_usage_counter protection
> > is needed.
> 
> Hannes also commented about this. Let me revisit this.

I think only queue re-configuration (blk_revalidate_zone) requires the
queue usage counter. Otherwise, bdev open()/close() should work just
fine.

> 
> >> +	/*
> >> +	 * blk-mq devices will reuse the reference on the request queue usage
> >> +	 * we took when the BIO was plugged, but the submission path for
> >> +	 * BIO-based devices will not do that. So drop this reference here.
> >> +	 */
> >> +	if (bio->bi_bdev->bd_has_submit_bio)
> >> +		blk_queue_exit(bio->bi_bdev->bd_disk->queue);
> > 
> > But I don't see where this reference is reused for blk-mq in this patch,
> > care to point it out?
> 
> This patch modifies blk_mq_submit_bio() to add a "goto new_request" at the top
> for any BIO flagged with BIO_FLAG_ZONE_WRITE_PLUGGING. So when a plugged BIO is
> unplugged and submitted again, the reference that was taken in
> blk_zone_wplug_add_bio() is reused for the new request for that BIO.

OK, this reference reuse may be worse, because a queue freeze can no longer
prevent new zoned write bios from being submitted, given that only
percpu_ref_get() is called for all zoned write bios.


Thanks,
Ming
Damien Le Moal Feb. 5, 2024, 2:41 a.m. UTC | #5
On 2/5/24 11:19, Ming Lei wrote:
>>>> +static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
>>>> +					  struct bio *bio, unsigned int nr_segs)
>>>> +{
>>>> +	/*
>>>> +	 * Keep a reference on the BIO request queue usage. This reference will
>>>> +	 * be dropped either if the BIO is failed or after it is issued and
>>>> +	 * completes.
>>>> +	 */
>>>> +	percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
>>>
>>> It is fragile to get nested usage_counter, and same with grabbing/releasing it
>>> from different contexts or even functions, and it could be much better to just
>>> let block layer maintain it.
>>>
>>> From patch 23's change:
>>>
>>> +	 * Zoned block device information. Reads of this information must be
>>> +	 * protected with blk_queue_enter() / blk_queue_exit(). Modifying this
>>>
>>> Anytime if there is in-flight bio, the block device is opened, so both gendisk and
>>> request_queue are live, so not sure if this .q_usage_counter protection
>>> is needed.
>>
>> Hannes also commented about this. Let me revisit this.
> 
> I think only queue re-configuration(blk_revalidate_zone) requires the
> queue usage counter. Otherwise, bdev open()/close() should work just
> fine.

I want to check the FS case though. It is not clear if mounting an FS that
supports zones (btrfs) also uses a bdev open?

>>>> +	/*
>>>> +	 * blk-mq devices will reuse the reference on the request queue usage
>>>> +	 * we took when the BIO was plugged, but the submission path for
>>>> +	 * BIO-based devices will not do that. So drop this reference here.
>>>> +	 */
>>>> +	if (bio->bi_bdev->bd_has_submit_bio)
>>>> +		blk_queue_exit(bio->bi_bdev->bd_disk->queue);
>>>
>>> But I don't see where this reference is reused for blk-mq in this patch,
>>> care to point it out?
>>
>> This patch modifies blk_mq_submit_bio() to add a "goto new_request" at the top
>> for any BIO flagged with BIO_FLAG_ZONE_WRITE_PLUGGING. So when a plugged BIO is
>> unplugged and submitted again, the reference that was taken in
>> blk_zone_wplug_add_bio() is reused for the new request for that BIO.
> 
> OK, this reference reuse may be worse, because queue freeze can't prevent new
> write zoned bio from being submitted any more given only percpu_ref_get() is
> called for all write zoned bios.

New BIOs (BIOs that have never been plugged yet) will go through the normal
blk_queue_enter() in blk_mq_submit_bio(), so they will be stopped there if
another context asked for a queue freeze. I do not think there is any issue with
how things are currently done (we tested that *a lot* with many different drives
and drive configs with DM etc). Reference counting as it is is OK, even though
it can most likely be simplified. I am looking at that now.
Ming Lei Feb. 5, 2024, 3:38 a.m. UTC | #6
On Mon, Feb 05, 2024 at 11:41:04AM +0900, Damien Le Moal wrote:
> On 2/5/24 11:19, Ming Lei wrote:
> >>>> +static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
> >>>> +					  struct bio *bio, unsigned int nr_segs)
> >>>> +{
> >>>> +	/*
> >>>> +	 * Keep a reference on the BIO request queue usage. This reference will
> >>>> +	 * be dropped either if the BIO is failed or after it is issued and
> >>>> +	 * completes.
> >>>> +	 */
> >>>> +	percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
> >>>
> >>> It is fragile to get nested usage_counter, and same with grabbing/releasing it
> >>> from different contexts or even functions, and it could be much better to just
> >>> let block layer maintain it.
> >>>
> >>> From patch 23's change:
> >>>
> >>> +	 * Zoned block device information. Reads of this information must be
> >>> +	 * protected with blk_queue_enter() / blk_queue_exit(). Modifying this
> >>>
> >>> Anytime if there is in-flight bio, the block device is opened, so both gendisk and
> >>> request_queue are live, so not sure if this .q_usage_counter protection
> >>> is needed.
> >>
> >> Hannes also commented about this. Let me revisit this.
> > 
> > I think only queue re-configuration(blk_revalidate_zone) requires the
> > queue usage counter. Otherwise, bdev open()/close() should work just
> > fine.
> 
> I want to check FS case though. No clear if mounting FS that supports zone
> (btrfs) also uses bdev open ?

btrfs '-O zoned' shouldn't be an exception:

mount -O zoned /dev/ublkb0 /mnt

  b'blkdev_get_whole'
  b'bdev_open_by_dev'
  b'bdev_open_by_path'
  b'btrfs_scan_one_device'
  b'btrfs_get_tree'
  b'vfs_get_tree'
  b'fc_mount'
  b'btrfs_get_tree'
  b'vfs_get_tree'
  b'path_mount'
  b'__x64_sys_mount'
  b'do_syscall_64'
  b'entry_SYSCALL_64_after_hwframe'
  b'[unknown]'
    1

> 
> >>>> +	/*
> >>>> +	 * blk-mq devices will reuse the reference on the request queue usage
> >>>> +	 * we took when the BIO was plugged, but the submission path for
> >>>> +	 * BIO-based devices will not do that. So drop this reference here.
> >>>> +	 */
> >>>> +	if (bio->bi_bdev->bd_has_submit_bio)
> >>>> +		blk_queue_exit(bio->bi_bdev->bd_disk->queue);
> >>>
> >>> But I don't see where this reference is reused for blk-mq in this patch,
> >>> care to point it out?
> >>
> >> This patch modifies blk_mq_submit_bio() to add a "goto new_request" at the top
> >> for any BIO flagged with BIO_FLAG_ZONE_WRITE_PLUGGING. So when a plugged BIO is
> >> unplugged and submitted again, the reference that was taken in
> >> blk_zone_wplug_add_bio() is reused for the new request for that BIO.
> > 
> > OK, this reference reuse may be worse, because queue freeze can't prevent new
> > write zoned bio from being submitted any more given only percpu_ref_get() is
> > called for all write zoned bios.
> 
> New BIOs (BIOS that have never been plugged yet) will go through the normal
> blk_queue_enter() in blk_mq_submit_bio(), so they will be stopped there if
> another context asked for a queue freeze. I do not think there is any issue with
> how things are currently done (we tested that *a lot* with many different drives
> and drive configs with DM etc). Reference counting as it is is OK, even though
> it most likely be simplified. I am looking at that now.

Indeed, new zoned write bios are still covered by blk-mq's queue reference, and
the trick is only played on previously plugged bios.

Thanks,
Ming
Christoph Hellwig Feb. 5, 2024, 5:11 a.m. UTC | #7
On Mon, Feb 05, 2024 at 11:41:04AM +0900, Damien Le Moal wrote:
> > I think only queue re-configuration(blk_revalidate_zone) requires the
> > queue usage counter. Otherwise, bdev open()/close() should work just
> > fine.
> 
> I want to check FS case though. No clear if mounting FS that supports zone
> (btrfs) also uses bdev open ?

Every file system opens the block device. But we don't just need the
block device to be open; we also need the block limits to not
change, and the only way to do that is to hold a q_usage_counter
reference.
Damien Le Moal Feb. 5, 2024, 5:37 a.m. UTC | #8
On 2/5/24 14:11, Christoph Hellwig wrote:
> On Mon, Feb 05, 2024 at 11:41:04AM +0900, Damien Le Moal wrote:
>>> I think only queue re-configuration(blk_revalidate_zone) requires the
>>> queue usage counter. Otherwise, bdev open()/close() should work just
>>> fine.
>>
>> I want to check FS case though. No clear if mounting FS that supports zone
>> (btrfs) also uses bdev open ?
> 
> Every file system opens the block device.  But we don't just need the
> block device to be open, but we also need the block limits to not
> change, and the only way to do that is to hold a q_usage_counter
> reference.

OK. So I think that Hannes' idea to get/put the queue usage counter reference
based on a zone BIO plug becoming non-empty (get ref) and becoming empty (put
ref) may be simpler then. And that would also work in the same way for blk-mq
and BIO-based drivers.
Christoph Hellwig Feb. 5, 2024, 5:50 a.m. UTC | #9
On Mon, Feb 05, 2024 at 02:37:41PM +0900, Damien Le Moal wrote:
> OK. So I think that Hannes'idea to get/put the queue usage counter reference
> based on a zone BIO plug becoming not empty (get ref) and becoming empty (put
> ref) may be simpler then. And that would also work in the same way for blk-mq
> and BIO based drivers.

Maybe I'm missing something, but I'm not sure how that would even work.

We need a q_usage_counter ref when doing all the submission checks
(limits, bounce, etc) early in blk_mq_submit_bio, and that one should be
taken using the normal bio_queue_enter path to do the right thing on
nowait submissions, when the queue is already frozen, etc.  What
is the benefit of not just keeping that reference vs releasing it
for all but the first bio, just so that we need to grab another
new reference at the actual submission time?
Damien Le Moal Feb. 5, 2024, 6:14 a.m. UTC | #10
On 2/5/24 14:50, Christoph Hellwig wrote:
> On Mon, Feb 05, 2024 at 02:37:41PM +0900, Damien Le Moal wrote:
>> OK. So I think that Hannes'idea to get/put the queue usage counter reference
>> based on a zone BIO plug becoming not empty (get ref) and becoming empty (put
>> ref) may be simpler then. And that would also work in the same way for blk-mq
>> and BIO based drivers.
> 
> Maybe I'm missing something, but I'm not sure how that would even work.
> 
> We need a q_usage_counter ref when doing all the submissions checks
> (limits, bounce, etc) early in blk_mq_sunmit_bio, and that one should be
> taken using the mormal bio_queue_enter patch to do the right thing on
> nowait submissions, when the queue is already frozen, etc.  What
> is the benefit of not just keeping that references vs releasing it
> for all but the first bio just so that we need to grab another
> new reference at the actual submission time?

I just tried to make the change, but the code does not become easier/cleaner at
all. In fact, on the contrary, it is very messy. I think I am going to keep
things as is regarding the ref counting.

There is one thing I need to check though: I re-ran the perf tests but this time
took the average of 10 runs to mitigate differences due to variance between runs
of the same test. And doing that, I do see a regression in performance for ZNS
4K qd=16 sequential writes (751 MB/s with rc2 vs 661 MB/s with ZWP). I need to
check if that is due to never using the cached request for plugged BIOs in
blk_mq_submit_bio(). If that is the case, we'll need to tweak the reference
dropping there for the cached request case.

I am also seeing a regression in performance with btrfs on SMR HDD (244 MB/s
with rc2 and block/for-next vs 233 MB/s with ZWP). I need to check that as well.
Ming Lei Feb. 5, 2024, 10:06 a.m. UTC | #11
On Mon, Feb 05, 2024 at 11:41:04AM +0900, Damien Le Moal wrote:
> On 2/5/24 11:19, Ming Lei wrote:
> >>>> +static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
> >>>> +					  struct bio *bio, unsigned int nr_segs)
> >>>> +{
> >>>> +	/*
> >>>> +	 * Keep a reference on the BIO request queue usage. This reference will
> >>>> +	 * be dropped either if the BIO is failed or after it is issued and
> >>>> +	 * completes.
> >>>> +	 */
> >>>> +	percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
> >>>
> >>> It is fragile to get nested usage_counter, and same with grabbing/releasing it
> >>> from different contexts or even functions, and it could be much better to just
> >>> let block layer maintain it.
> >>>
> >>> From patch 23's change:
> >>>
> >>> +	 * Zoned block device information. Reads of this information must be
> >>> +	 * protected with blk_queue_enter() / blk_queue_exit(). Modifying this
> >>>
> >>> Anytime if there is in-flight bio, the block device is opened, so both gendisk and
> >>> request_queue are live, so not sure if this .q_usage_counter protection
> >>> is needed.
> >>
> >> Hannes also commented about this. Let me revisit this.
> > 
> > I think only queue re-configuration(blk_revalidate_zone) requires the
> > queue usage counter. Otherwise, bdev open()/close() should work just
> > fine.
> 
> I want to check FS case though. No clear if mounting FS that supports zone
> (btrfs) also uses bdev open ?

I feel the following delta change might be cleaner and easily documented:

- one IO takes single reference for both bio based and blk-mq,
- no drop & re-grab
- only grab extra reference for bio based
- two code paths share same pattern

diff --git a/block/blk-core.c b/block/blk-core.c
index 9520ccab3050..118dd789beb5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -597,6 +597,10 @@ static void __submit_bio(struct bio *bio)
 
 	if (!bio->bi_bdev->bd_has_submit_bio) {
 		blk_mq_submit_bio(bio);
+	} else if (bio_zone_write_plugging(bio)) {
+		struct gendisk *disk = bio->bi_bdev->bd_disk;
+
+		disk->fops->submit_bio(bio);
 	} else if (likely(bio_queue_enter(bio) == 0)) {
 		struct gendisk *disk = bio->bi_bdev->bd_disk;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f0fc61a3ec81..fc6d792747dc 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3006,8 +3006,12 @@ void blk_mq_submit_bio(struct bio *bio)
 	if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
 		goto queue_exit;
 
+	/*
+	 * Grab one reference for plugged zoned write and it will be reused in
+	 * next real submission
+	 */
 	if (blk_queue_is_zoned(q) && blk_zone_write_plug_bio(bio, nr_segs))
-		goto queue_exit;
+		return;
 
 	if (!rq) {
 new_request:
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index f6d4f511b664..87abb3f7ef30 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -514,7 +514,8 @@ static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
 	 * be dropped either if the BIO is failed or after it is issued and
 	 * completes.
 	 */
-	percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
+	if (bio->bi_bdev->bd_has_submit_bio)
+		percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
 
 	/*
 	 * The BIO is being plugged and thus will have to wait for the on-going
@@ -760,15 +761,10 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
 
 	blk_zone_wplug_unlock(zwplug, flags);
 
-	/*
-	 * blk-mq devices will reuse the reference on the request queue usage
-	 * we took when the BIO was plugged, but the submission path for
-	 * BIO-based devices will not do that. So drop this reference here.
-	 */
+	submit_bio_noacct_nocheck(bio);
+
 	if (bio->bi_bdev->bd_has_submit_bio)
 		blk_queue_exit(bio->bi_bdev->bd_disk->queue);
-
-	submit_bio_noacct_nocheck(bio);
 }
 
 static struct blk_zone_wplug *blk_zone_alloc_write_plugs(unsigned int nr_zones)

Thanks,
Ming
Damien Le Moal Feb. 5, 2024, 12:20 p.m. UTC | #12
On 2/5/24 19:06, Ming Lei wrote:
> On Mon, Feb 05, 2024 at 11:41:04AM +0900, Damien Le Moal wrote:
>> On 2/5/24 11:19, Ming Lei wrote:
>>>>>> +static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
>>>>>> +					  struct bio *bio, unsigned int nr_segs)
>>>>>> +{
>>>>>> +	/*
>>>>>> +	 * Keep a reference on the BIO request queue usage. This reference will
>>>>>> +	 * be dropped either if the BIO is failed or after it is issued and
>>>>>> +	 * completes.
>>>>>> +	 */
>>>>>> +	percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
>>>>>
>>>>> It is fragile to get nested usage_counter, and same with grabbing/releasing it
>>>>> from different contexts or even functions, and it could be much better to just
>>>>> let block layer maintain it.
>>>>>
>>>>> From patch 23's change:
>>>>>
>>>>> +	 * Zoned block device information. Reads of this information must be
>>>>> +	 * protected with blk_queue_enter() / blk_queue_exit(). Modifying this
>>>>>
>>>>> Anytime if there is in-flight bio, the block device is opened, so both gendisk and
>>>>> request_queue are live, so not sure if this .q_usage_counter protection
>>>>> is needed.
>>>>
>>>> Hannes also commented about this. Let me revisit this.
>>>
>>> I think only queue re-configuration(blk_revalidate_zone) requires the
>>> queue usage counter. Otherwise, bdev open()/close() should work just
>>> fine.
>>
>> I want to check FS case though. No clear if mounting FS that supports zone
>> (btrfs) also uses bdev open ?
> 
> I feel the following delta change might be cleaner and easily documented:
> 
> - one IO takes single reference for both bio based and blk-mq,
> - no drop & re-grab
> - only grab extra reference for bio based
> - two code paths share same pattern
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 9520ccab3050..118dd789beb5 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -597,6 +597,10 @@ static void __submit_bio(struct bio *bio)
>  
>  	if (!bio->bi_bdev->bd_has_submit_bio) {
>  		blk_mq_submit_bio(bio);
> +	} else if (bio_zone_write_plugging(bio)) {
> +		struct gendisk *disk = bio->bi_bdev->bd_disk;
> +
> +		disk->fops->submit_bio(bio);
>  	} else if (likely(bio_queue_enter(bio) == 0)) {
>  		struct gendisk *disk = bio->bi_bdev->bd_disk;
>  
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f0fc61a3ec81..fc6d792747dc 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3006,8 +3006,12 @@ void blk_mq_submit_bio(struct bio *bio)
>  	if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
>  		goto queue_exit;
>  
> +	/*
> +	 * Grab one reference for plugged zoned write and it will be reused in
> +	 * next real submission
> +	 */
>  	if (blk_queue_is_zoned(q) && blk_zone_write_plug_bio(bio, nr_segs))
> -		goto queue_exit;
> +		return;
>  
>  	if (!rq) {
>  new_request:
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index f6d4f511b664..87abb3f7ef30 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -514,7 +514,8 @@ static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
>  	 * be dropped either if the BIO is failed or after it is issued and
>  	 * completes.
>  	 */
> -	percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
> +	if (bio->bi_bdev->bd_has_submit_bio)
> +		percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
>  
>  	/*
>  	 * The BIO is being plugged and thus will have to wait for the on-going
> @@ -760,15 +761,10 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
>  
>  	blk_zone_wplug_unlock(zwplug, flags);
>  
> -	/*
> -	 * blk-mq devices will reuse the reference on the request queue usage
> -	 * we took when the BIO was plugged, but the submission path for
> -	 * BIO-based devices will not do that. So drop this reference here.
> -	 */
> +	submit_bio_noacct_nocheck(bio);
> +
>  	if (bio->bi_bdev->bd_has_submit_bio)
>  		blk_queue_exit(bio->bi_bdev->bd_disk->queue);

Hmm... As-is, this is a potential use-after-free of the bio: once
submit_bio_noacct_nocheck() returns, the BIO may already have completed and been
freed, so bio->bi_bdev cannot be touched after the submission. But I get the
idea. This is indeed a little better. I will integrate this.
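
One way to integrate it without touching the BIO after it is submitted (just a
sketch, keeping the same bd_has_submit_bio test) would be to save what is needed
before the submission:

static void blk_zone_wplug_bio_work(struct work_struct *work)
{
	struct blk_zone_wplug *zwplug =
		container_of(work, struct blk_zone_wplug, bio_work);
	struct request_queue *q;
	bool bio_based;
	unsigned long flags;
	struct bio *bio;

	blk_zone_wplug_lock(zwplug, flags);

	bio = bio_list_pop(&zwplug->bio_list);
	if (!bio) {
		zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
		blk_zone_wplug_unlock(zwplug, flags);
		return;
	}

	blk_zone_wplug_unlock(zwplug, flags);

	/* The BIO may complete and be freed as soon as it is submitted. */
	q = bio->bi_bdev->bd_disk->queue;
	bio_based = bio->bi_bdev->bd_has_submit_bio;

	submit_bio_noacct_nocheck(bio);

	if (bio_based)
		blk_queue_exit(q);
}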

> -
> -	submit_bio_noacct_nocheck(bio);
>  }
>  
>  static struct blk_zone_wplug *blk_zone_alloc_write_plugs(unsigned int nr_zones)
> 
> Thanks,
> Ming
> 
>
Damien Le Moal Feb. 5, 2024, 12:43 p.m. UTC | #13
On 2/5/24 21:20, Damien Le Moal wrote:
> On 2/5/24 19:06, Ming Lei wrote:
>> On Mon, Feb 05, 2024 at 11:41:04AM +0900, Damien Le Moal wrote:
>>> On 2/5/24 11:19, Ming Lei wrote:
>>>>>>> +static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
>>>>>>> +					  struct bio *bio, unsigned int nr_segs)
>>>>>>> +{
>>>>>>> +	/*
>>>>>>> +	 * Keep a reference on the BIO request queue usage. This reference will
>>>>>>> +	 * be dropped either if the BIO is failed or after it is issued and
>>>>>>> +	 * completes.
>>>>>>> +	 */
>>>>>>> +	percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
>>>>>>
>>>>>> It is fragile to get nested usage_counter, and same with grabbing/releasing it
>>>>>> from different contexts or even functions, and it could be much better to just
>>>>>> let block layer maintain it.
>>>>>>
>>>>>> From patch 23's change:
>>>>>>
>>>>>> +	 * Zoned block device information. Reads of this information must be
>>>>>> +	 * protected with blk_queue_enter() / blk_queue_exit(). Modifying this
>>>>>>
>>>>>> Anytime if there is in-flight bio, the block device is opened, so both gendisk and
>>>>>> request_queue are live, so not sure if this .q_usage_counter protection
>>>>>> is needed.
>>>>>
>>>>> Hannes also commented about this. Let me revisit this.
>>>>
>>>> I think only queue re-configuration(blk_revalidate_zone) requires the
>>>> queue usage counter. Otherwise, bdev open()/close() should work just
>>>> fine.
>>>
>>> I want to check FS case though. No clear if mounting FS that supports zone
>>> (btrfs) also uses bdev open ?
>>
>> I feel the following delta change might be cleaner and easily documented:
>>
>> - one IO takes single reference for both bio based and blk-mq,
>> - no drop & re-grab
>> - only grab extra reference for bio based
>> - two code paths share same pattern
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 9520ccab3050..118dd789beb5 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -597,6 +597,10 @@ static void __submit_bio(struct bio *bio)
>>  
>>  	if (!bio->bi_bdev->bd_has_submit_bio) {
>>  		blk_mq_submit_bio(bio);
>> +	} else if (bio_zone_write_plugging(bio)) {
>> +		struct gendisk *disk = bio->bi_bdev->bd_disk;
>> +
>> +		disk->fops->submit_bio(bio);

Actually, no, that is not correct. This would not stop BIO submission if
blk_queue_freeze() was called by another context. So we cannot do that here
without calling blk_queue_enter()...

>>  	} else if (likely(bio_queue_enter(bio) == 0)) {
>>  		struct gendisk *disk = bio->bi_bdev->bd_disk;
>>  
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index f0fc61a3ec81..fc6d792747dc 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -3006,8 +3006,12 @@ void blk_mq_submit_bio(struct bio *bio)
>>  	if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
>>  		goto queue_exit;
>>  
>> +	/*
>> +	 * Grab one reference for plugged zoned write and it will be reused in
>> +	 * next real submission
>> +	 */
>>  	if (blk_queue_is_zoned(q) && blk_zone_write_plug_bio(bio, nr_segs))
>> -		goto queue_exit;
>> +		return;

...and this one is not correct because of the cached request: if there was a
cached request, blk_mq_submit_bio() did not call blk_queue_enter() because the
cached request already had a reference. But we cannot reuse that reference, as
the next BIO may be a read or a write to a zone that is not plugged; such a BIO
would use the cached request and so needs the usage counter reference. So we
would still need to grab an extra reference in that case.

So in the end, it feels a lot simpler to keep the reference counting as it was,
as it makes things a lot less messy in blk_mq_submit_bio(). I will, though, try
to improve the comments to make it clear how this works.
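
The reference flow that is kept then looks roughly like this (summarized from
the patch below, as a comment sketch rather than actual code):

/*
 * q_usage_counter references for a plugged zone write BIO:
 *
 *  - plug time, blk_zone_wplug_add_bio():
 *	an extra percpu_ref_get() is taken for the plugged BIO;
 *
 *  - the plugged BIO is merged into a request,
 *    blk_zone_write_plug_attempt_merge():
 *	blk_queue_exit() drops the extra reference, since the request
 *	already holds its own;
 *
 *  - the plugged BIO is failed, blk_zone_wplug_bio_io_error():
 *	blk_queue_exit() drops the extra reference;
 *
 *  - unplug for a blk-mq device, blk_zone_wplug_bio_work():
 *	the reference is reused by blk_mq_submit_bio() for the new
 *	request;
 *
 *  - unplug for a BIO-based device, blk_zone_wplug_bio_work():
 *	blk_queue_exit() drops the reference, since the ->submit_bio()
 *	path takes its own through bio_queue_enter().
 */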

>>  
>>  	if (!rq) {
>>  new_request:
>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>> index f6d4f511b664..87abb3f7ef30 100644
>> --- a/block/blk-zoned.c
>> +++ b/block/blk-zoned.c
>> @@ -514,7 +514,8 @@ static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
>>  	 * be dropped either if the BIO is failed or after it is issued and
>>  	 * completes.
>>  	 */
>> -	percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
>> +	if (bio->bi_bdev->bd_has_submit_bio)
>> +		percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
>>  
>>  	/*
>>  	 * The BIO is being plugged and thus will have to wait for the on-going
>> @@ -760,15 +761,10 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
>>  
>>  	blk_zone_wplug_unlock(zwplug, flags);
>>  
>> -	/*
>> -	 * blk-mq devices will reuse the reference on the request queue usage
>> -	 * we took when the BIO was plugged, but the submission path for
>> -	 * BIO-based devices will not do that. So drop this reference here.
>> -	 */
>> +	submit_bio_noacct_nocheck(bio);
>> +
>>  	if (bio->bi_bdev->bd_has_submit_bio)
>>  		blk_queue_exit(bio->bi_bdev->bd_disk->queue);
> 
> Hmm... As-is, this is a potential use-after-free of the bio. But I get the idea.
> This is indeed a little better. I will integrate this.
> 
>> -
>> -	submit_bio_noacct_nocheck(bio);
>>  }
>>  
>>  static struct blk_zone_wplug *blk_zone_alloc_write_plugs(unsigned int nr_zones)
>>
>> Thanks,
>> Ming
>>
>>
>
Bart Van Assche Feb. 5, 2024, 5:48 p.m. UTC | #14
On 2/1/24 23:30, Damien Le Moal wrote:
> The next plugged BIO is unplugged and issued once the write request completes.

So this patch series is orthogonal to my patch series that implements zoned
write pipelining?

> This mechanism allows to:
>   - Untangles zone write ordering from block IO schedulers. This allows

Untangles -> Untangle

> Zone write plugging is implemented using struct blk_zone_wplug. This
> structurei includes a spinlock, a BIO list and a work structure to

structurei -> structure

> This ensures that at any time, at most one request (blk-mq devices) or
> one BIO (BIO-based devices) are being executed for any zone. The
> handling of zone write plug using a per-zone plug spinlock maximizes
> parrallelism and device usage by allowing multiple zones to be writen

parrallelism -> parallelism

> simultaneously without lock contention.

This is not correct. Device usage is not maximized since zone write bios
are serialized. Pipelining zoned writes results in higher device
utilization.

> +	/*
> +	 * For BIOs handled through a zone write plugs, signal the end of the

plugs -> plug

> +#define blk_zone_wplug_lock(zwplug, flags) \
> +	spin_lock_irqsave(&zwplug->lock, flags)
> +
> +#define blk_zone_wplug_unlock(zwplug, flags) \
> +	spin_unlock_irqrestore(&zwplug->lock, flags)

Hmm ... these macros may make code harder to read rather than improve
readability of the code.

Thanks,

Bart.
Damien Le Moal Feb. 5, 2024, 11:48 p.m. UTC | #15
On 2/6/24 02:48, Bart Van Assche wrote:
> On 2/1/24 23:30, Damien Le Moal wrote:
>> The next plugged BIO is unplugged and issued once the write request completes.
> 
> So this patch series is orthogonal to my patch series that implements zoned
> write pipelining?

I don't know.

>> simultaneously without lock contention.
> 
> This is not correct. Device usage is not maximized since zone write bios
> are serialized. Pipelining zoned writes results in higher device
> utilization.

I meant to say that the locking scheme does not get in the way of maximizing
device utilization/parallelism. If you are only writing to a single zone, then
sure, it does not matter since the drive will be used at qd=1 for that case.

>> +#define blk_zone_wplug_lock(zwplug, flags) \
>> +	spin_lock_irqsave(&zwplug->lock, flags)
>> +
>> +#define blk_zone_wplug_unlock(zwplug, flags) \
>> +	spin_unlock_irqrestore(&zwplug->lock, flags)
> 
> Hmm ... these macros may make code harder to read rather than improve
> readability of the code.

I do not see how they make the code less readable. Are the macro names not clear
enough?
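
For reference, this is what the two forms under discussion look like at a call
site (the open-coded variant is simply the macro expanded):

	/* With the helpers from the patch: */
	blk_zone_wplug_lock(zwplug, flags);
	bio_list_add(&zwplug->bio_list, bio);
	blk_zone_wplug_unlock(zwplug, flags);

	/* Open-coded: */
	spin_lock_irqsave(&zwplug->lock, flags);
	bio_list_add(&zwplug->bio_list, bio);
	spin_unlock_irqrestore(&zwplug->lock, flags);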
diff mbox series

Patch

diff --git a/block/bio.c b/block/bio.c
index b9642a41f286..c8b0f7e8c713 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1581,6 +1581,13 @@  void bio_endio(struct bio *bio)
 	if (!bio_integrity_endio(bio))
 		return;
 
+	/*
+	 * For BIOs handled through a zone write plugs, signal the end of the
+	 * BIO to the zone write plug to submit the next plugged BIO.
+	 */
+	if (bio_zone_write_plugging(bio))
+		blk_zone_write_plug_bio_endio(bio);
+
 	rq_qos_done_bio(bio);
 
 	if (bio->bi_bdev && bio_flagged(bio, BIO_TRACE_COMPLETION)) {
diff --git a/block/blk-merge.c b/block/blk-merge.c
index a1ef61b03e31..2b5489cd9c65 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -377,6 +377,7 @@  struct bio *__bio_split_to_limits(struct bio *bio,
 		blkcg_bio_issue_init(split);
 		bio_chain(split, bio);
 		trace_block_split(split, bio->bi_iter.bi_sector);
+		WARN_ON_ONCE(bio_zone_write_plugging(bio));
 		submit_bio_noacct(bio);
 		return split;
 	}
@@ -980,6 +981,9 @@  enum bio_merge_status bio_attempt_back_merge(struct request *req,
 
 	blk_update_mixed_merge(req, bio, false);
 
+	if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
+		blk_zone_write_plug_bio_merged(bio);
+
 	req->biotail->bi_next = bio;
 	req->biotail = bio;
 	req->__data_len += bio->bi_iter.bi_size;
@@ -995,6 +999,13 @@  static enum bio_merge_status bio_attempt_front_merge(struct request *req,
 {
 	const blk_opf_t ff = bio_failfast(bio);
 
+	/*
+	 * A front merge for zone writes can happen only if the user submitted
+	 * writes out of order. Do not attempt this to let the write fail.
+	 */
+	if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
+		return BIO_MERGE_FAILED;
+
 	if (!ll_front_merge_fn(req, bio, nr_segs))
 		return BIO_MERGE_FAILED;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f02e486a02ae..aa49bebf1199 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -830,6 +830,9 @@  static void blk_complete_request(struct request *req)
 		bio = next;
 	} while (bio);
 
+	if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
+		blk_zone_write_plug_complete_request(req);
+
 	/*
 	 * Reset counters so that the request stacking driver
 	 * can find how many bytes remain in the request
@@ -943,6 +946,9 @@  bool blk_update_request(struct request *req, blk_status_t error,
 	 * completely done
 	 */
 	if (!req->bio) {
+		if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
+			blk_zone_write_plug_complete_request(req);
+
 		/*
 		 * Reset counters so that the request stacking driver
 		 * can find how many bytes remain in the request
@@ -2975,6 +2981,17 @@  void blk_mq_submit_bio(struct bio *bio)
 	struct request *rq;
 	blk_status_t ret;
 
+	/*
+	 * A BIO that was released from a zone write plug has already been
+	 * through the preparation in this function, already holds a reference
+	 * on the queue usage counter, and is the only write BIO in-flight for
+	 * the target zone. Go straight to allocating a request for it.
+	 */
+	if (bio_zone_write_plugging(bio)) {
+		nr_segs = bio->__bi_nr_segments;
+		goto new_request;
+	}
+
 	bio = blk_queue_bounce(bio, q);
 	bio_set_ioprio(bio);
 
@@ -3001,7 +3018,11 @@  void blk_mq_submit_bio(struct bio *bio)
 	if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
 		goto queue_exit;
 
+	if (blk_queue_is_zoned(q) && blk_zone_write_plug_bio(bio, nr_segs))
+		goto queue_exit;
+
 	if (!rq) {
+new_request:
 		rq = blk_mq_get_new_requests(q, plug, bio, nr_segs);
 		if (unlikely(!rq))
 			goto queue_exit;
@@ -3017,8 +3038,12 @@  void blk_mq_submit_bio(struct bio *bio)
 
 	ret = blk_crypto_rq_get_keyslot(rq);
 	if (ret != BLK_STS_OK) {
+		bool zwplugging = bio_zone_write_plugging(bio);
+
 		bio->bi_status = ret;
 		bio_endio(bio);
+		if (zwplugging)
+			blk_zone_write_plug_complete_request(rq);
 		blk_mq_free_request(rq);
 		return;
 	}
@@ -3026,6 +3051,9 @@  void blk_mq_submit_bio(struct bio *bio)
 	if (op_is_flush(bio->bi_opf) && blk_insert_flush(rq))
 		return;
 
+	if (bio_zone_write_plugging(bio))
+		blk_zone_write_plug_attempt_merge(rq);
+
 	if (plug) {
 		blk_add_rq_to_plug(plug, rq);
 		return;
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index d343e5756a9c..f6d4f511b664 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -7,11 +7,11 @@ 
  *
  * Copyright (c) 2016, Damien Le Moal
  * Copyright (c) 2016, Western Digital
+ * Copyright (c) 2024, Western Digital Corporation or its affiliates.
  */
 
 #include <linux/kernel.h>
 #include <linux/module.h>
-#include <linux/rbtree.h>
 #include <linux/blkdev.h>
 #include <linux/blk-mq.h>
 #include <linux/mm.h>
@@ -19,6 +19,7 @@ 
 #include <linux/sched/mm.h>
 
 #include "blk.h"
+#include "blk-mq-sched.h"
 
 #define ZONE_COND_NAME(name) [BLK_ZONE_COND_##name] = #name
 static const char *const zone_cond_name[] = {
@@ -33,6 +34,27 @@  static const char *const zone_cond_name[] = {
 };
 #undef ZONE_COND_NAME
 
+/*
+ * Per-zone write plug.
+ */
+struct blk_zone_wplug {
+	spinlock_t		lock;
+	unsigned int		flags;
+	struct bio_list		bio_list;
+	struct work_struct	bio_work;
+};
+
+/*
+ * Zone write plug flags bits:
+ *  - BLK_ZONE_WPLUG_CONV: Indicate that the zone is a conventional one. Writes
+ *    to these zones are never plugged.
+ *  - BLK_ZONE_WPLUG_PLUGGED: Indicate that the zone write plug is plugged,
+ *    that is, that write BIOs are being throttled due to a write BIO already
+ *    being executed or the zone write plug bio list is not empty.
+ */
+#define BLK_ZONE_WPLUG_CONV	(1U << 0)
+#define BLK_ZONE_WPLUG_PLUGGED	(1U << 1)
+
 /**
  * blk_zone_cond_str - Return string XXX in BLK_ZONE_COND_XXX.
  * @zone_cond: BLK_ZONE_COND_XXX.
@@ -429,12 +451,374 @@  int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
 	return ret;
 }
 
-void disk_free_zone_bitmaps(struct gendisk *disk)
+#define blk_zone_wplug_lock(zwplug, flags) \
+	spin_lock_irqsave(&zwplug->lock, flags)
+
+#define blk_zone_wplug_unlock(zwplug, flags) \
+	spin_unlock_irqrestore(&zwplug->lock, flags)
+
+static inline void blk_zone_wplug_bio_io_error(struct bio *bio)
+{
+	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
+
+	bio_clear_flag(bio, BIO_ZONE_WRITE_PLUGGING);
+	bio_io_error(bio);
+	blk_queue_exit(q);
+}
+
+static int blk_zone_wplug_abort(struct gendisk *disk,
+				struct blk_zone_wplug *zwplug)
+{
+	struct bio *bio;
+	int nr_aborted = 0;
+
+	while ((bio = bio_list_pop(&zwplug->bio_list))) {
+		blk_zone_wplug_bio_io_error(bio);
+		nr_aborted++;
+	}
+
+	return nr_aborted;
+}
+
+/*
+ * Return the zone write plug for sector in sequential write required zone.
+ * Given that conventional zones have no write ordering constraints, NULL is
+ * returned for sectors in conventional zones, to indicate that zone write
+ * plugging is not needed.
+ */
+static inline struct blk_zone_wplug *
+disk_lookup_zone_wplug(struct gendisk *disk, sector_t sector)
+{
+	struct blk_zone_wplug *zwplug;
+
+	if (WARN_ON_ONCE(!disk->zone_wplugs))
+		return NULL;
+
+	zwplug = &disk->zone_wplugs[disk_zone_no(disk, sector)];
+	if (zwplug->flags & BLK_ZONE_WPLUG_CONV)
+		return NULL;
+	return zwplug;
+}
+
+static inline struct blk_zone_wplug *bio_lookup_zone_wplug(struct bio *bio)
+{
+	return disk_lookup_zone_wplug(bio->bi_bdev->bd_disk,
+				      bio->bi_iter.bi_sector);
+}
+
+static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
+					  struct bio *bio, unsigned int nr_segs)
+{
+	/*
+	 * Keep a reference on the BIO request queue usage. This reference will
+	 * be dropped either if the BIO is failed or after it is issued and
+	 * completes.
+	 */
+	percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
+
+	/*
+	 * The BIO is being plugged and thus will have to wait for the on-going
+	 * write and for all other writes already plugged. So polling makes
+	 * no sense.
+	 */
+	bio_clear_polled(bio);
+
+	/*
+	 * Reuse the poll cookie field to store the number of segments when
+	 * split to the hardware limits.
+	 */
+	bio->__bi_nr_segments = nr_segs;
+
+	/*
+	 * We always receive BIOs after they are split and ready to be issued.
+	 * The block layer passes the parts of a split BIO in order, and the
+	 * user must also issue writes sequentially. So simply add the new BIO
+	 * at the tail of the list to preserve the sequential write order.
+	 */
+	bio_list_add(&zwplug->bio_list, bio);
+}
+
+/*
+ * Called from bio_attempt_back_merge() when a BIO was merged with a request.
+ */
+void blk_zone_write_plug_bio_merged(struct bio *bio)
+{
+	bio_set_flag(bio, BIO_ZONE_WRITE_PLUGGING);
+}
+
+/*
+ * Attempt to merge plugged BIOs with a newly formed request of a BIO that went
+ * through zone write plugging (either a new BIO or one that was unplugged).
+ */
+void blk_zone_write_plug_attempt_merge(struct request *req)
+{
+	struct blk_zone_wplug *zwplug = bio_lookup_zone_wplug(req->bio);
+	sector_t req_back_sector = blk_rq_pos(req) + blk_rq_sectors(req);
+	struct request_queue *q = req->q;
+	unsigned long flags;
+	struct bio *bio;
+
+	/*
+	 * Completion of this request needs to be handled with
+	 * blk_zone_write_complete_request().
+	 */
+	req->rq_flags |= RQF_ZONE_WRITE_PLUGGING;
+
+	if (blk_queue_nomerges(q))
+		return;
+
+	/*
+	 * Walk through the list of plugged BIOs to check if they can be merged
+	 * into the back of the request.
+	 */
+	blk_zone_wplug_lock(zwplug, flags);
+	while ((bio = bio_list_peek(&zwplug->bio_list))) {
+		if (bio->bi_iter.bi_sector != req_back_sector ||
+		    !blk_rq_merge_ok(req, bio))
+			break;
+
+		WARN_ON_ONCE(bio_op(bio) != REQ_OP_WRITE_ZEROES &&
+			     !bio->__bi_nr_segments);
+
+		bio_list_pop(&zwplug->bio_list);
+		if (bio_attempt_back_merge(req, bio, bio->__bi_nr_segments) !=
+		    BIO_MERGE_OK) {
+			bio_list_add_head(&zwplug->bio_list, bio);
+			break;
+		}
+
+		/*
+		 * Drop the extra reference on the queue usage we got when
+		 * plugging the BIO.
+		 */
+		blk_queue_exit(q);
+
+		req_back_sector += bio_sectors(bio);
+	}
+	blk_zone_wplug_unlock(zwplug, flags);
+}
+
+static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
+{
+	struct blk_zone_wplug *zwplug;
+	unsigned long flags;
+
+	/*
+	 * BIOs must be fully contained within a zone so that we use the correct
+	 * zone write plug for the entire BIO. For blk-mq devices, the block
+	 * layer should already have done any splitting required to ensure this
+	 * and this BIO should thus not be straddling zone boundaries. For
+	 * BIO-based devices, it is the responsibility of the driver to split
+	 * the bio before submitting it.
+	 */
+	if (WARN_ON_ONCE(bio_straddle_zones(bio))) {
+		bio_io_error(bio);
+		return true;
+	}
+
+	zwplug = bio_lookup_zone_wplug(bio);
+	if (!zwplug)
+		return false;
+
+	blk_zone_wplug_lock(zwplug, flags);
+
+	/* Indicate that this BIO is being handled using zone write plugging. */
+	bio_set_flag(bio, BIO_ZONE_WRITE_PLUGGING);
+
+	/*
+	 * If the zone is already plugged, add the BIO to the plug BIO list.
+	 * Otherwise, plug and let the BIO execute.
+	 */
+	if (zwplug->flags & BLK_ZONE_WPLUG_PLUGGED) {
+		blk_zone_wplug_add_bio(zwplug, bio, nr_segs);
+		blk_zone_wplug_unlock(zwplug, flags);
+		return true;
+	}
+
+	zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
+
+	blk_zone_wplug_unlock(zwplug, flags);
+
+	return false;
+}
+
+/**
+ * blk_zone_write_plug_bio - Handle a zone write BIO with zone write plugging
+ * @bio: The BIO being submitted
+ *
+ * Handle write and write zeroes operations using zone write plugging.
+ * Return true whenever @bio execution needs to be delayed through the zone
+ * write plug. Otherwise, return false to let the submission path process
+ * @bio normally.
+ */
+bool blk_zone_write_plug_bio(struct bio *bio, unsigned int nr_segs)
+{
+	if (!bio->bi_bdev->bd_disk->zone_wplugs)
+		return false;
+
+	/*
+	 * If the BIO already has the plugging flag set, then it was already
+	 * handled through this path and this is a submission from the zone
+	 * plug bio submit work.
+	 */
+	if (bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING))
+		return false;
+
+	/*
+	 * We do not need to do anything special for empty flush BIOs, e.g.
+	 * BIOs such as those issued by blkdev_issue_flush(). This is because it is
+	 * the responsibility of the user to first wait for the completion of
+	 * write operations for flush to have any effect on the persistence of
+	 * the written data.
+	 */
+	if (op_is_flush(bio->bi_opf) && !bio_sectors(bio))
+		return false;
+
+	/*
+	 * Regular writes and write zeroes need to be handled through the target
+	 * zone write plug. This includes writes with REQ_FUA | REQ_PREFLUSH
+	 * which may need to go through the flush machinery depending on the
+	 * target device capabilities. Plugging such writes is fine as the flush
+	 * machinery operates at the request level, below the plug, and
+	 * completion of the flush sequence will go through the regular BIO
+	 * completion, which will handle zone write plugging.
+	 */
+	switch (bio_op(bio)) {
+	case REQ_OP_WRITE:
+	case REQ_OP_WRITE_ZEROES:
+		return blk_zone_wplug_handle_write(bio, nr_segs);
+	default:
+		return false;
+	}
+
+	return false;
+}
+EXPORT_SYMBOL_GPL(blk_zone_write_plug_bio);
+
+static void blk_zone_write_plug_unplug_bio(struct blk_zone_wplug *zwplug)
+{
+	unsigned long flags;
+
+	blk_zone_wplug_lock(zwplug, flags);
+
+	/* Schedule submission of the next plugged BIO if we have one. */
+	if (!bio_list_empty(&zwplug->bio_list))
+		kblockd_schedule_work(&zwplug->bio_work);
+	else
+		zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
+
+	blk_zone_wplug_unlock(zwplug, flags);
+}
+
+void blk_zone_write_plug_bio_endio(struct bio *bio)
+{
+	/* Make sure we do not see this BIO again by clearing the plug flag. */
+	bio_clear_flag(bio, BIO_ZONE_WRITE_PLUGGING);
+
+	/*
+	 * For BIO-based devices, blk_zone_write_plug_complete_request()
+	 * is not called. So we need to schedule execution of the next
+	 * plugged BIO here.
+	 */
+	if (bio->bi_bdev->bd_has_submit_bio) {
+		struct blk_zone_wplug *zwplug = bio_lookup_zone_wplug(bio);
+
+		blk_zone_write_plug_unplug_bio(zwplug);
+	}
+}
+
+void blk_zone_write_plug_complete_request(struct request *req)
+{
+	struct gendisk *disk = req->q->disk;
+	struct blk_zone_wplug *zwplug =
+		disk_lookup_zone_wplug(disk, req->__sector);
+
+	req->rq_flags &= ~RQF_ZONE_WRITE_PLUGGING;
+
+	blk_zone_write_plug_unplug_bio(zwplug);
+}
+
+static void blk_zone_wplug_bio_work(struct work_struct *work)
+{
+	struct blk_zone_wplug *zwplug =
+		container_of(work, struct blk_zone_wplug, bio_work);
+	unsigned long flags;
+	struct bio *bio;
+
+	/*
+	 * Unplug and submit the next plugged BIO. If we do not have any, clear
+	 * the plugged flag.
+	 */
+	blk_zone_wplug_lock(zwplug, flags);
+
+	bio = bio_list_pop(&zwplug->bio_list);
+	if (!bio) {
+		zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
+		blk_zone_wplug_unlock(zwplug, flags);
+		return;
+	}
+
+	blk_zone_wplug_unlock(zwplug, flags);
+
+	/*
+	 * blk-mq devices will reuse the reference on the request queue usage
+	 * we took when the BIO was plugged, but the submission path for
+	 * BIO-based devices will not do that. So drop this reference here.
+	 */
+	if (bio->bi_bdev->bd_has_submit_bio)
+		blk_queue_exit(bio->bi_bdev->bd_disk->queue);
+
+	submit_bio_noacct_nocheck(bio);
+}
+
+static struct blk_zone_wplug *blk_zone_alloc_write_plugs(unsigned int nr_zones)
+{
+	struct blk_zone_wplug *zwplugs;
+	unsigned int i;
+
+	zwplugs = kvcalloc(nr_zones, sizeof(struct blk_zone_wplug), GFP_NOIO);
+	if (!zwplugs)
+		return NULL;
+
+	for (i = 0; i < nr_zones; i++) {
+		spin_lock_init(&zwplugs[i].lock);
+		bio_list_init(&zwplugs[i].bio_list);
+		INIT_WORK(&zwplugs[i].bio_work, blk_zone_wplug_bio_work);
+	}
+
+	return zwplugs;
+}
+
+static void blk_zone_free_write_plugs(struct gendisk *disk,
+				      struct blk_zone_wplug *zwplugs,
+				      unsigned int nr_zones)
+{
+	struct blk_zone_wplug *zwplug = zwplugs;
+	unsigned int i, n;
+
+	if (!zwplug)
+		return;
+
+	/* Make sure we do not leak any plugged BIO. */
+	for (i = 0; i < nr_zones; i++, zwplug++) {
+		n = blk_zone_wplug_abort(disk, zwplug);
+		if (n)
+			pr_warn_ratelimited("%s: zone %u, %u plugged BIOs aborted\n",
+					    disk->disk_name, i, n);
+	}
+
+	kvfree(zwplugs);
+}
+
+void disk_free_zone_resources(struct gendisk *disk)
 {
 	kfree(disk->conv_zones_bitmap);
 	disk->conv_zones_bitmap = NULL;
 	kfree(disk->seq_zones_wlock);
 	disk->seq_zones_wlock = NULL;
+
+	blk_zone_free_write_plugs(disk, disk->zone_wplugs, disk->nr_zones);
+	disk->zone_wplugs = NULL;
 }
 
 struct blk_revalidate_zone_args {
@@ -442,6 +826,7 @@  struct blk_revalidate_zone_args {
 	unsigned long	*conv_zones_bitmap;
 	unsigned long	*seq_zones_wlock;
 	unsigned int	nr_zones;
+	struct blk_zone_wplug *zone_wplugs;
 	sector_t	sector;
 };
 
@@ -496,6 +881,7 @@  static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 				return -ENOMEM;
 		}
 		set_bit(idx, args->conv_zones_bitmap);
+		args->zone_wplugs[idx].flags |= BLK_ZONE_WPLUG_CONV;
 		break;
 	case BLK_ZONE_TYPE_SEQWRITE_REQ:
 		if (!args->seq_zones_wlock) {
@@ -540,7 +926,7 @@  int blk_revalidate_disk_zones(struct gendisk *disk,
 	sector_t capacity = get_capacity(disk);
 	struct blk_revalidate_zone_args args = { };
 	unsigned int noio_flag;
-	int ret;
+	int ret = -ENOMEM;
 
 	if (WARN_ON_ONCE(!blk_queue_is_zoned(q)))
 		return -EIO;
@@ -570,9 +956,14 @@  int blk_revalidate_disk_zones(struct gendisk *disk,
 	 * Ensure that all memory allocations in this context are done as if
 	 * GFP_NOIO was specified.
 	 */
+	noio_flag = memalloc_noio_save();
+
 	args.disk = disk;
 	args.nr_zones = (capacity + zone_sectors - 1) >> ilog2(zone_sectors);
-	noio_flag = memalloc_noio_save();
+	args.zone_wplugs = blk_zone_alloc_write_plugs(args.nr_zones);
+	if (!args.zone_wplugs)
+		goto out_restore_noio;
+
 	ret = disk->fops->report_zones(disk, 0, UINT_MAX,
 				       blk_revalidate_zone_cb, &args);
 	if (!ret) {
@@ -601,17 +992,24 @@  int blk_revalidate_disk_zones(struct gendisk *disk,
 		disk->nr_zones = args.nr_zones;
 		swap(disk->seq_zones_wlock, args.seq_zones_wlock);
 		swap(disk->conv_zones_bitmap, args.conv_zones_bitmap);
+		swap(disk->zone_wplugs, args.zone_wplugs);
 		if (update_driver_data)
 			update_driver_data(disk);
 		ret = 0;
 	} else {
 		pr_warn("%s: failed to revalidate zones\n", disk->disk_name);
-		disk_free_zone_bitmaps(disk);
+		disk_free_zone_resources(disk);
 	}
 	blk_mq_unfreeze_queue(q);
 
 	kfree(args.seq_zones_wlock);
 	kfree(args.conv_zones_bitmap);
+	blk_zone_free_write_plugs(disk, args.zone_wplugs, args.nr_zones);
+
+	return ret;
+
+out_restore_noio:
+	memalloc_noio_restore(noio_flag);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(blk_revalidate_disk_zones);
diff --git a/block/blk.h b/block/blk.h
index 5180e103ed9c..d0ecd5a2002c 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -403,7 +403,13 @@  static inline struct bio *blk_queue_bounce(struct bio *bio,
 }
 
 #ifdef CONFIG_BLK_DEV_ZONED
-void disk_free_zone_bitmaps(struct gendisk *disk);
+void disk_free_zone_resources(struct gendisk *disk);
+static inline bool bio_zone_write_plugging(struct bio *bio)
+{
+	return bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING);
+}
+void blk_zone_write_plug_bio_merged(struct bio *bio);
+void blk_zone_write_plug_attempt_merge(struct request *rq);
 static inline void blk_zone_complete_request_bio(struct request *rq,
 						 struct bio *bio)
 {
@@ -411,22 +417,42 @@  static inline void blk_zone_complete_request_bio(struct request *rq,
 	 * For zone append requests, the request sector indicates the location
 	 * at which the BIO data was written. Return this value to the BIO
 	 * issuer through the BIO iter sector.
+	 * For plugged zone writes, we need the original BIO sector so
+	 * that blk_zone_write_plug_bio_endio() can lookup the zone write plug.
 	 */
-	if (req_op(rq) == REQ_OP_ZONE_APPEND)
+	if (req_op(rq) == REQ_OP_ZONE_APPEND || bio_zone_write_plugging(bio))
 		bio->bi_iter.bi_sector = rq->__sector;
 }
+void blk_zone_write_plug_bio_endio(struct bio *bio);
+void blk_zone_write_plug_complete_request(struct request *rq);
 int blkdev_report_zones_ioctl(struct block_device *bdev, unsigned int cmd,
 		unsigned long arg);
 int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
 		unsigned int cmd, unsigned long arg);
 #else /* CONFIG_BLK_DEV_ZONED */
-static inline void disk_free_zone_bitmaps(struct gendisk *disk)
+static inline void disk_free_zone_resources(struct gendisk *disk)
+{
+}
+static inline bool bio_zone_write_plugging(struct bio *bio)
+{
+	return false;
+}
+static inline void blk_zone_write_plug_bio_merged(struct bio *bio)
+{
+}
+static inline void blk_zone_write_plug_attempt_merge(struct request *rq)
 {
 }
 static inline void blk_zone_complete_request_bio(struct request *rq,
 						 struct bio *bio)
 {
 }
+static inline void blk_zone_write_plug_bio_endio(struct bio *bio)
+{
+}
+static inline void blk_zone_write_plug_complete_request(struct request *rq)
+{
+}
 static inline int blkdev_report_zones_ioctl(struct block_device *bdev,
 		unsigned int cmd, unsigned long arg)
 {
diff --git a/block/genhd.c b/block/genhd.c
index d74fb5b4ae68..fe45d4713b28 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1182,7 +1182,7 @@  static void disk_release(struct device *dev)
 
 	disk_release_events(disk);
 	kfree(disk->random);
-	disk_free_zone_bitmaps(disk);
+	disk_free_zone_resources(disk);
 	xa_destroy(&disk->part_tbl);
 
 	disk->queue->disk = NULL;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 7a8150a5f051..bc74f904b5a1 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -55,6 +55,8 @@  typedef __u32 __bitwise req_flags_t;
 #define RQF_SPECIAL_PAYLOAD	((__force req_flags_t)(1 << 18))
 /* The per-zone write lock is held for this request */
 #define RQF_ZONE_WRITE_LOCKED	((__force req_flags_t)(1 << 19))
+/* The request completion needs to be signaled to zone write plugging. */
+#define RQF_ZONE_WRITE_PLUGGING	((__force req_flags_t)(1 << 20))
 /* ->timeout has been called, don't expire again */
 #define RQF_TIMED_OUT		((__force req_flags_t)(1 << 21))
 #define RQF_RESV		((__force req_flags_t)(1 << 23))
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 1c07848dea7e..19839d303289 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -232,7 +232,12 @@  struct bio {
 
 	struct bvec_iter	bi_iter;
 
-	blk_qc_t		bi_cookie;
+	union {
+		/* for polled bios: */
+		blk_qc_t		bi_cookie;
+		/* for plugged zoned writes only: */
+		unsigned int		__bi_nr_segments;
+	};
 	bio_end_io_t		*bi_end_io;
 	void			*bi_private;
 #ifdef CONFIG_BLK_CGROUP
@@ -303,6 +308,7 @@  enum {
 	BIO_QOS_MERGED,		/* but went through rq_qos merge path */
 	BIO_REMAPPED,
 	BIO_ZONE_WRITE_LOCKED,	/* Owns a zoned device zone write lock */
+	BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */
 	BIO_FLAG_LAST
 };
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 0bb897f0501c..d58aaed6dc24 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -39,6 +39,7 @@  struct rq_qos;
 struct blk_queue_stats;
 struct blk_stat_callback;
 struct blk_crypto_profile;
+struct blk_zone_wplug;
 
 extern const struct device_type disk_type;
 extern const struct device_type part_type;
@@ -193,6 +194,7 @@  struct gendisk {
 	unsigned int		max_active_zones;
 	unsigned long		*conv_zones_bitmap;
 	unsigned long		*seq_zones_wlock;
+	struct blk_zone_wplug	*zone_wplugs;
 #endif /* CONFIG_BLK_DEV_ZONED */
 
 #if IS_ENABLED(CONFIG_CDROM)
@@ -658,6 +660,7 @@  static inline unsigned int bdev_max_active_zones(struct block_device *bdev)
 	return bdev->bd_disk->max_active_zones;
 }
 
+bool blk_zone_write_plug_bio(struct bio *bio, unsigned int nr_segs);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline unsigned int bdev_nr_zones(struct block_device *bdev)
 {
@@ -685,6 +688,11 @@  static inline unsigned int bdev_max_active_zones(struct block_device *bdev)
 {
 	return 0;
 }
+static inline bool blk_zone_write_plug_bio(struct bio *bio,
+					   unsigned int nr_segs)
+{
+	return false;
+}
 #endif /* CONFIG_BLK_DEV_ZONED */
 
 static inline unsigned int blk_queue_depth(struct request_queue *q)