[v6,0/5] Initial support for multi-actuator HDDs

Message ID 20210827075045.642269-1-damien.lemoal@wdc.com

Message

Damien Le Moal Aug. 27, 2021, 7:50 a.m. UTC
Single LUN multi-actuator hard disks are capable of seeking and
executing multiple commands in parallel. This capability is exposed to
the host using the Concurrent Positioning Ranges VPD page (SCSI) and
Log (ATA).
Each positioning range describes the contiguous set of LBAs that an
actuator serves.

This series adds support to the scsi disk driver to retrieve this
information and advertise it to user space through sysfs. libata is
also modified to handle ATA drives.

The first patch adds the block layer plumbing to expose the concurrent
sector ranges of the device through sysfs as a sub-directory of the
device sysfs queue directory. Patches 2 and 3 add support to sd and
libata. Finally, patch 4 documents the sysfs queue attribute changes.
Patch 5 fixes a typo in the documentation file (strictly speaking, not
related to this series).

This series does not attempt in any way to optimize accesses to
multi-actuator devices (e.g. block IO schedulers or filesystems). This
initial support only exposes the independent access ranges information
to user space through sysfs.
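
As a reference, a minimal user-space sketch of reading this information
is shown below. The queue/independent_access_ranges directory name
matches the attributes documented in patch 4, but the per-range
attribute file names ("sector" and "nr_sectors") and the /sys/block/sda
path are assumptions made purely for illustration:

/*
 * Sketch only, not part of the patches: list the independent access
 * ranges exposed under the device queue directory.
 */
#include <dirent.h>
#include <stdio.h>

int main(void)
{
	const char *base = "/sys/block/sda/queue/independent_access_ranges";
	DIR *dir = opendir(base);
	struct dirent *de;

	if (!dir) {
		perror("opendir");
		return 1;
	}

	while ((de = readdir(dir)) != NULL) {
		char path[512];
		unsigned long long sector = 0, nr_sectors = 0;
		FILE *f;

		/* Skip "." and ".."; each remaining entry is one range. */
		if (de->d_name[0] == '.')
			continue;

		snprintf(path, sizeof(path), "%s/%s/sector", base, de->d_name);
		f = fopen(path, "r");
		if (!f || fscanf(f, "%llu", &sector) != 1)
			sector = 0;
		if (f)
			fclose(f);

		snprintf(path, sizeof(path), "%s/%s/nr_sectors",
			 base, de->d_name);
		f = fopen(path, "r");
		if (!f || fscanf(f, "%llu", &nr_sectors) != 1)
			nr_sectors = 0;
		if (f)
			fclose(f);

		printf("range %s: start sector %llu, %llu sectors\n",
		       de->d_name, sector, nr_sectors);
	}

	closedir(dir);
	return 0;
}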

Changes from v5:
* Changed type names in patch 1:
  - struct blk_crange -> struct blk_independent_access_range
  - struct blk_cranges -> struct blk_independent_access_ranges
  All functions and variables are renamed accordingly, using shorter
  names related to the new type names, e.g.
  struct blk_independent_access_ranges -> iaranges or iars.
* Updated the commit messages of patches 1 to 4. The titles of patches
  1 and 4 are also new.
* Dropped Reviewed-by tags on the modified patches. Patches 3 and 5
  are unmodified.

Changes from v4:
* Fixed kdoc comment function name mismatch for disk_register_cranges()
  in patch 1

Changes from v3:
* Modified patch 1:
  - Prefix functions that take a struct gendisk as argument with
    "disk_". Modified patch 2 accordingly.
  - Added a functional release operation for struct blk_cranges kobj to
    ensure that this structure is freed only after all references to it
    are released, including kobject_del() execution for all crange sysfs
    entries.
* Added patch 5 to separate the typo fix from the crange documentation
  addition.
* Added reviewed-by tags

Changes from v2:
* Updated patch 1 to fix a compilation warning for a potential NULL
  pointer dereference of the cr argument of blk_queue_set_cranges().
  Warning reported by the kernel test robot <lkp@intel.com>.

Changes from v1:
* Moved libata-scsi hunk from patch 1 to patch 3 where it belongs
* Fixed an uninitialized variable in patch 2
  Reported-by: kernel test robot <lkp@intel.com>
  Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
* Changed patch 3 adding struct ata_cpr_log to contain both the number
  of concurrent ranges and the array of concurrent ranges.
* Added a note in the documentation (patch 4) about the unit used for
  the concurrent ranges attributes.

Damien Le Moal (5):
  block: Add independent access ranges support
  scsi: sd: add concurrent positioning ranges support
  libata: support concurrent positioning ranges log
  doc: document sysfs queue/independent_access_ranges attributes
  doc: Fix typo in request queue sysfs documentation

 Documentation/block/queue-sysfs.rst |  30 ++-
 block/Makefile                      |   2 +-
 block/blk-iaranges.c                | 322 ++++++++++++++++++++++++++++
 block/blk-sysfs.c                   |  26 ++-
 block/blk.h                         |   4 +
 drivers/ata/libata-core.c           |  52 +++++
 drivers/ata/libata-scsi.c           |  48 ++++-
 drivers/scsi/sd.c                   |  81 +++++++
 drivers/scsi/sd.h                   |   1 +
 include/linux/ata.h                 |   1 +
 include/linux/blkdev.h              |  34 +++
 include/linux/libata.h              |  15 ++
 12 files changed, 597 insertions(+), 19 deletions(-)
 create mode 100644 block/blk-iaranges.c

Comments

Christoph Hellwig Aug. 27, 2021, 4:41 p.m. UTC | #1
On Fri, Aug 27, 2021 at 02:28:58PM +0000, Tim Walker wrote:
> There is nothing in the spec that requires the ranges to be contiguous
> or non-overlapping.

Yikes, that is a pretty stupid standard.  Almost as bad as allowing
non-uniform sized non-power of two sized zones :)

> It's easy to imagine a HDD architecture that allows multiple heads to
> access the same sectors on the disk. It's also easy to imagine a
> workload scenario where parallel access to the same disk could be
> useful. (Think of a typical storage design that sequentially writes
> new user data gradually filling the disk, while simultaneously
> supporting random user reads over the written data.)

But for those drives you do not actually need this scheme at all.
Storage devices that support higher concurrency are bog standard with
SSDs and, if you want to go back further, storage arrays.  The only
interesting case is when these ranges are separate so that the access
can be carved up based on the boundary.  Now I don't want to give
people ideas with overlapping but not identical ranges, which would be
just horrible.
Damien Le Moal Aug. 29, 2021, 10:50 p.m. UTC | #2
On 2021/08/28 2:38, Phillip Susi wrote:
> 
> Tim Walker <tim.t.walker@seagate.com> writes:
> 
>> The IO Scheduler is a useful place to implement per-actuator load
>> management, but with the LBA-to-actuator mapping available to user
>> space (via sysfs) it could also be done at the user level. Or pretty
>> much anywhere else where we have knowledge and control of the various
>> streams.
> 
> I suppose there may be some things user space could do with the
> information, but mainly doesn't it have to be done in the IO scheduler?

Correct. If the user does not use a file system, then optimizations
will depend on the user application and the IO scheduler.

> As it stands now, it is going to try to avoid seeking between the two
> regions even though the drive can service a contiguous stream from both
> just fine, right?

Correct. But any IO scheduler optimization will kick in only if the
user is accessing the drive at a queue depth beyond the drive maximum
QD, 32 for SATA. If the drive is exercised at a QD lower than its
maximum, the scheduler does not hold on to requests (at least
mq-deadline does not, not sure about bfq). So even with only this patch
set (no optimizations at the kernel level), the user can still make
things work as expected, that is, get multiple streams of IOs to
execute in parallel.
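
To illustrate that last point, here is a rough user-space sketch (not
part of this series): two threads issue direct reads confined to two
different LBA ranges so that the drive can keep both actuators busy.
The device path, the RANGE1_START boundary and the IO sizes are made-up
placeholders; a real application would take the range boundaries from
the sysfs attributes added by patch 1.

#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define IO_SIZE		(1024 * 1024)
#define RANGE1_START	((off_t)1 << 40)	/* hypothetical 2nd range */

struct stream {
	int fd;
	off_t offset;
};

static void *reader(void *arg)
{
	struct stream *s = arg;
	void *buf;
	int i;

	if (posix_memalign(&buf, 4096, IO_SIZE))
		return NULL;

	/* A few large sequential reads confined to a single range. */
	for (i = 0; i < 16; i++)
		if (pread(s->fd, buf, IO_SIZE,
			  s->offset + (off_t)i * IO_SIZE) < 0)
			perror("pread");

	free(buf);
	return NULL;
}

int main(void)
{
	int fd = open("/dev/sda", O_RDONLY | O_DIRECT);
	struct stream s0 = { fd, 0 };		 /* first actuator range */
	struct stream s1 = { fd, RANGE1_START }; /* second actuator range */
	pthread_t t0, t1;

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* One thread per range keeps both actuators busy concurrently. */
	pthread_create(&t0, NULL, reader, &s0);
	pthread_create(&t1, NULL, reader, &s1);
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);

	close(fd);
	return 0;
}
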
Damien Le Moal Aug. 29, 2021, 10:55 p.m. UTC | #3
On 2021/08/28 1:43, Christoph Hellwig wrote:
> On Fri, Aug 27, 2021 at 02:28:58PM +0000, Tim Walker wrote:
>> There is nothing in the spec that requires the ranges to be contiguous
>> or non-overlapping.
> 
> Yikes, that is a pretty stupid standard.  Almost as bad as allowing
> non-uniform sized non-power of two sized zones :)
> 
>> It's easy to imagine a HDD architecture that allows multiple heads to
>> access the same sectors on the disk. It's also easy to imagine a
>> workload scenario where parallel access to the same disk could be
>> useful. (Think of a typical storage design that sequentially writes
>> new user data gradually filling the disk, while simultaneously
>> supporting random user reads over the written data.)
> 
> But for those drives you do not actually need this scheme at all.

Agree.

> Storage devices that support higher concurrency are bog standard with
> SSDs and, if you want to go back further, storage arrays.  The only
> interesting case is when these ranges are separate so that the access
> can be carved up based on the boundary.  Now I don't want to give
> people ideas with overlapping but not identical ranges, which would be
> just horrible.

Agree too. And looking at my patch again, the function
disk_check_iaranges() in patch 1 only checks that the overall sector
range of all access ranges spans from 0 to capacity - 1, but it does
not check for holes or overlaps. I need to change that and ignore any
disk that reports overlapping ranges or ranges with holes in the LBA
space. Holes would be horrible, and if ranges overlap, the drive can
optimize by itself. I will resend a v7 with corrections for that.
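
For what such a check could look like, here is a rough sketch (not the
actual v7 change). It assumes the ranges are already sorted by start
sector and that struct blk_independent_access_range carries sector and
nr_sectors fields, following the v6 naming:

/* Assumed layout, based on the v6 renaming. */
struct blk_independent_access_range {
	sector_t sector;	/* first sector of the range */
	sector_t nr_sectors;	/* number of sectors in the range */
};

static bool iaranges_cover_capacity(struct blk_independent_access_range *r,
				    unsigned int nr_ranges, sector_t capacity)
{
	sector_t expected_start = 0;
	unsigned int i;

	for (i = 0; i < nr_ranges; i++) {
		/* A mismatch here means either a hole or an overlap. */
		if (r[i].sector != expected_start)
			return false;
		expected_start = r[i].sector + r[i].nr_sectors;
	}

	/* The last range must end exactly at the device capacity. */
	return expected_start == capacity;
}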