
[v3,0/4] Initial support for multi-actuator HDDs

Message ID 20210726013806.84815-1-damien.lemoal@wdc.com
Series: Initial support for multi-actuator HDDs

Message

Damien Le Moal July 26, 2021, 1:38 a.m. UTC
Single LUN multi-actuator hard-disks are capable of seeking and executing
multiple commands in parallel. This capability is exposed to the host
using the Concurrent Positioning Ranges VPD page (SCSI) and Log (ATA).
Each positioning range describes the contiguous set of LBAs that an
actuator serves.

This series adds support to the scsi disk driver to retrieve this
information and advertise it to user space through sysfs. libata is also
modified to handle ATA drives.

The first patch adds the block layer plumbing to expose concurrent
sector ranges of the device through sysfs as a sub-directory of the
device sysfs queue directory. Patches 2 and 3 add support to sd and
libata. Finally, patch 4 documents the sysfs queue attribute changes.

This series does not attempt in any way to optimize accesses to
multi-actuator devices (e.g. block IO scheduler or filesystems). This
initial support only exposes the actuator information to user space
through sysfs.
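
As an illustration only, for a dual-actuator drive the resulting sysfs
layout could look like the following (the "cranges" directory name comes
from this series; the per-range attribute names shown here are assumed
for this example, see the patch 4 documentation for the actual names):

  /sys/block/sda/queue/cranges/0/sector
  /sys/block/sda/queue/cranges/0/nr_sectors
  /sys/block/sda/queue/cranges/1/sector
  /sys/block/sda/queue/cranges/1/nr_sectors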

Changes from v2:
* Update patch 1 to fix a compilation warning for a potential NULL
  pointer dereference of the cr argument of blk_queue_set_cranges().
  Warning reported by the kernel test robot <lkp@intel.com>.

Changes from v1:
* Moved libata-scsi hunk from patch 1 to patch 3 where it belongs
* Fixed uninitialized variable in patch 2
  Reported-by: kernel test robot <lkp@intel.com>
  Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
* Changed patch 3 adding struct ata_cpr_log to contain both the number
  of concurrent ranges and the array of concurrent ranges.
* Added a note in the documentation (patch 4) about the unit used for
  the concurrent ranges attributes.

Damien Le Moal (4):
  block: Add concurrent positioning ranges support
  scsi: sd: add concurrent positioning ranges support
  libata: support concurrent positioning ranges log
  doc: document sysfs queue/cranges attributes

 Documentation/block/queue-sysfs.rst |  30 ++-
 block/Makefile                      |   2 +-
 block/blk-cranges.c                 | 295 ++++++++++++++++++++++++++++
 block/blk-sysfs.c                   |  13 ++
 block/blk.h                         |   3 +
 drivers/ata/libata-core.c           |  52 +++++
 drivers/ata/libata-scsi.c           |  48 ++++-
 drivers/scsi/sd.c                   |  81 ++++++++
 drivers/scsi/sd.h                   |   1 +
 include/linux/ata.h                 |   1 +
 include/linux/blkdev.h              |  29 +++
 include/linux/libata.h              |  15 ++
 12 files changed, 559 insertions(+), 11 deletions(-)
 create mode 100644 block/blk-cranges.c

Comments

Damien Le Moal July 28, 2021, 10:59 p.m. UTC | #1
On 2021/07/26 10:38, Damien Le Moal wrote:
> Single LUN multi-actuator hard-disks are capable of seeking and executing
> multiple commands in parallel. This capability is exposed to the host
> using the Concurrent Positioning Ranges VPD page (SCSI) and Log (ATA).
> Each positioning range describes the contiguous set of LBAs that an
> actuator serves.
> 
> This series adds support to the scsi disk driver to retrieve this
> information and advertise it to user space through sysfs. libata is also
> modified to handle ATA drives.
> 
> The first patch adds the block layer plumbing to expose concurrent
> sector ranges of the device through sysfs as a sub-directory of the
> device sysfs queue directory. Patches 2 and 3 add support to sd and
> libata. Finally, patch 4 documents the sysfs queue attribute changes.
> 
> This series does not attempt in any way to optimize accesses to
> multi-actuator devices (e.g. block IO scheduler or filesystems). This
> initial support only exposes the actuator information to user space
> through sysfs.

Jens, Martin,

Any comment on this series ?

Damien Le Moal Aug. 6, 2021, 2:12 a.m. UTC | #2
On 2021/07/26 10:38, Damien Le Moal wrote:
> Single LUN multi-actuator hard-disks are capable of seeking and executing
> multiple commands in parallel. This capability is exposed to the host
> using the Concurrent Positioning Ranges VPD page (SCSI) and Log (ATA).
> Each positioning range describes the contiguous set of LBAs that an
> actuator serves.
> 
> This series adds support to the scsi disk driver to retrieve this
> information and advertise it to user space through sysfs. libata is also
> modified to handle ATA drives.
> 
> The first patch adds the block layer plumbing to expose concurrent
> sector ranges of the device through sysfs as a sub-directory of the
> device sysfs queue directory. Patches 2 and 3 add support to sd and
> libata. Finally, patch 4 documents the sysfs queue attribute changes.
> 
> This series does not attempt in any way to optimize accesses to
> multi-actuator devices (e.g. block IO scheduler or filesystems). This
> initial support only exposes the actuator information to user space
> through sysfs.

Jens, Martin,

re-ping... Any comment on this series ?

Martin K. Petersen Aug. 6, 2021, 3:41 a.m. UTC | #3
Damien,

> Single LUN multi-actuator hard-disks are capable of seeking and executing
> multiple commands in parallel. This capability is exposed to the host
> using the Concurrent Positioning Ranges VPD page (SCSI) and Log (ATA).
> Each positioning range describes the contiguous set of LBAs that an
> actuator serves.

I have to say that I prefer the multi-LUN model.

> The first patch adds the block layer plumbing to expose concurrent
> sector ranges of the device through sysfs as a sub-directory of the
> device sysfs queue directory.

So how do you envision this range reporting should work when putting
DM/MD on top of a multi-actuator disk?

And even without multi-actuator drives, how would you express concurrent
ranges on a DM/MD device sitting on top of several single-actuator
devices?

While I appreciate that it is easy to just export what the hardware
reports in sysfs, I also think we should consider how filesystems would
use that information. And how things would work outside of the simple
fs-on-top-of-multi-actuator-drive case.

-- 
Martin K. Petersen	Oracle Linux Engineering
Damien Le Moal Aug. 6, 2021, 4:05 a.m. UTC | #4
On 2021/08/06 12:42, Martin K. Petersen wrote:
> 
> Damien,
> 
>> Single LUN multi-actuator hard-disks are capable of seeking and executing
>> multiple commands in parallel. This capability is exposed to the host
>> using the Concurrent Positioning Ranges VPD page (SCSI) and Log (ATA).
>> Each positioning range describes the contiguous set of LBAs that an
>> actuator serves.
> 
> I have to say that I prefer the multi-LUN model.

It is certainly easier: nothing to do :)
SATA, as usual, makes things harder...

> 
>> The first patch adds the block layer plumbing to expose concurrent
>> sector ranges of the device through sysfs as a sub-directory of the
>> device sysfs queue directory.
> 
> So how do you envision this range reporting should work when putting
> DM/MD on top of a multi-actuator disk?

The ranges are attached to the device request queue. So the DM/MD target driver
can use that information from the underlying devices for whatever possible
optimization. For the logical device exposed by the target driver, the ranges
are not limits so they are not inherited. As is, right now, DM target devices
will not show any range information for the logical devices they create, even if
the underlying devices have multiple ranges.

The DM/MD target driver is free to set any range information pertinent to the
target. E.g. dm-linear could set the range information corresponding to sector
chunks from different devices used to build the dm-linear device.

> And even without multi-actuator drives, how would you express concurrent
> ranges on a DM/MD device sitting on top of several single-actuator
> devices?

Similar comment as above: it is up to the DM/MD target driver to decide if range
information can be useful. For dm-linear, there are obvious cases where it is.
Ex: 2 single actuator drives concatenated together can generate 2 ranges
similarly to a real split-actuator disk. Expressing the chunks of a dm-linear
setup as ranges may not always be possible though, that is, if we keep the
assumption that a range is independent from others in terms of command
execution. Ex: a dm-linear setup that shuffles a drive LBA mapping (high to low
and low to high) has no business showing sector ranges.
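
As a sketch (device names and sizes are made up), the dm-linear table
concatenating two 1 TB single-actuator drives would look something like:

  0          1953525168 linear /dev/sda 0
  1953525168 1953525168 linear /dev/sdb 0

and the resulting logical device could then report two ranges, sectors 0
to 1953525167 served by /dev/sda and sectors 1953525168 to 3907050335
served by /dev/sdb, if dm-linear were taught to set that information.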

> While I appreciate that it is easy to just export what the hardware
> reports in sysfs, I also think we should consider how filesystems would
> use that information. And how things would work outside of the simple
> fs-on-top-of-multi-actuator-drive case.

Without any change anywhere in existing code (kernel and applications using raw
disk accesses), things will just work as is. The multi/split actuator drive will
behave as a single actuator drive, even for commands spanning range boundaries.
Your guess on potential IOPS gains is as good as mine in this case. Performance
will totally depend on the workload but will not be worse than an equivalent
single actuator disk.

FS block allocators can definitely use the range information to distribute
writes among actuators. For reads, well, gains will depend on the workload,
obviously, but optimizations at the block IO scheduler level can improve things
too, especially if the drive is being used at a QD beyond its capability (that
is, requests are accumulated in the IO scheduler).

Similar write optimization can be achieved by applications using block device
files directly. This series is intended for this case for now. FS and block IO
scheduler optimization can be added later.
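
As a purely illustrative sketch of that kind of application-level write
distribution (the device name and range start sectors are made up here;
a real program would read them from the queue/cranges sysfs directory):

  #!/usr/bin/env python3
  # Hypothetical sketch: spread writes over the two actuators of a
  # dual-actuator drive by alternating between its two sector ranges.
  import os

  DEV = "/dev/sda"               # made-up device
  RANGE_START = [0, 1953525168]  # first sector of each range (made up)
  SECTOR_SIZE = 512

  def write_spread(buffers):
      fd = os.open(DEV, os.O_WRONLY)
      try:
          for i, buf in enumerate(buffers):
              rng = i % len(RANGE_START)   # alternate between actuators
              offset = (RANGE_START[rng] * SECTOR_SIZE
                        + (i // len(RANGE_START)) * len(buf))
              os.pwrite(fd, buf, offset)
      finally:
          os.close(fd)

E.g. write_spread([b"\0" * 4096] * 8) would place four 4 KB buffers at
the start of each of the two ranges.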
Hannes Reinecke Aug. 6, 2021, 8:35 a.m. UTC | #5
On 8/6/21 6:05 AM, Damien Le Moal wrote:
> On 2021/08/06 12:42, Martin K. Petersen wrote:
>>
>> Damien,
>>
>>> Single LUN multi-actuator hard-disks are capable of seeking and executing
>>> multiple commands in parallel. This capability is exposed to the host
>>> using the Concurrent Positioning Ranges VPD page (SCSI) and Log (ATA).
>>> Each positioning range describes the contiguous set of LBAs that an
>>> actuator serves.
>>
>> I have to say that I prefer the multi-LUN model.
> 
> It is certainly easier: nothing to do :)
> SATA, as usual, makes things harder...
> 
>>
>>> The first patch adds the block layer plumbing to expose concurrent
>>> sector ranges of the device through sysfs as a sub-directory of the
>>> device sysfs queue directory.
>>
>> So how do you envision this range reporting should work when putting
>> DM/MD on top of a multi-actuator disk?
> 
> The ranges are attached to the device request queue. So the DM/MD target driver
> can use that information from the underlying devices for whatever possible
> optimization. For the logical device exposed by the target driver, the ranges
> are not limits so they are not inherited. As is, right now, DM target devices
> will not show any range information for the logical devices they create, even if
> the underlying devices have multiple ranges.
> 
> The DM/MD target driver is free to set any range information pertinent to the
> target. E.g. dm-linear could set the range information corresponding to sector
> chunks from different devices used to build the dm-linear device.
> 
And indeed, that would be the easiest consumer.
One 'just' needs to have a simple script converting the sysfs ranges
into the corresponding dm-linear table definitions, and create one DM
device for each range.
That would simulate the multi-LUN approach.
Not sure if that would warrant a 'real' DM target, seeing that it's
fully scriptable.
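
A minimal sketch of such a script (the "cranges" directory name comes
from the series; the per-range attribute names "sector" and "nr_sectors"
are assumptions made for this example) could be:

  #!/usr/bin/env python3
  # Hypothetical sketch: print one dmsetup command per concurrent
  # positioning range, creating one dm-linear device per range.
  import sys
  from pathlib import Path

  def main(disk):
      base = Path(f"/sys/block/{disk}/queue/cranges")
      ranges = sorted(p for p in base.iterdir() if p.is_dir())
      for i, rng in enumerate(ranges):
          start = int((rng / "sector").read_text())       # assumed name
          length = int((rng / "nr_sectors").read_text())  # assumed name
          print(f'dmsetup create {disk}-act{i} '
                f'--table "0 {length} linear /dev/{disk} {start}"')

  if __name__ == "__main__":
      main(sys.argv[1] if len(sys.argv) > 1 else "sda")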

>> And even without multi-actuator drives, how would you express concurrent
>> ranges on a DM/MD device sitting on top of several single-actuator
>> devices?
> 
> Similar comment as above: it is up to the DM/MD target driver to decide if range
> information can be useful. For dm-linear, there are obvious cases where it is.
> Ex: 2 single actuator drives concatenated together can generate 2 ranges
> similarly to a real split-actuator disk. Expressing the chunks of a dm-linear
> setup as ranges may not always be possible though, that is, if we keep the
> assumption that a range is independent from others in terms of command
> execution. Ex: a dm-linear setup that shuffles a drive LBA mapping (high to low
> and low to high) has no business showing sector ranges.
> 
>> While I appreciate that it is easy to just export what the hardware
>> reports in sysfs, I also think we should consider how filesystems would
>> use that information. And how things would work outside of the simple
>> fs-on-top-of-multi-actuator-drive case.
> 
> Without any change anywhere in existing code (kernel and applications using raw
> disk accesses), things will just work as is. The multi/split actuator drive will
> behave as a single actuator drive, even for commands spanning range boundaries.
> Your guess on potential IOPS gains is as good as mine in this case. Performance
> will totally depend on the workload but will not be worse than an equivalent
> single actuator disk.
> 
> FS block allocators can definitely use the range information to distribute
> writes among actuators. For reads, well, gains will depend on the workload,
> obviously, but optimizations at the block IO scheduler level can improve things
> too, especially if the drive is being used at a QD beyond its capability (that
> is, requests are accumulated in the IO scheduler).
> 
> Similar write optimization can be achieved by applications using block device
> files directly. This series is intended for this case for now. FS and block IO
> scheduler optimization can be added later.
> 
> 
Rumours have it that Paolo Valente is working on adapting BFQ to utilize
the range information for better actuator utilisation.
And eventually one should modify filesystem utilities like xfs to adapt
the metadata layout to multi-actuator drives.

The _real_ fun starts once the HDD manufacturers start putting out
multi-actuator SMR drives :-)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer
Damien Le Moal Aug. 6, 2021, 8:52 a.m. UTC | #6
On 2021/08/06 17:36, Hannes Reinecke wrote:
> On 8/6/21 6:05 AM, Damien Le Moal wrote:
>> On 2021/08/06 12:42, Martin K. Petersen wrote:
>>>
>>> Damien,
>>>
>>>> Single LUN multi-actuator hard-disks are capable of seeking and executing
>>>> multiple commands in parallel. This capability is exposed to the host
>>>> using the Concurrent Positioning Ranges VPD page (SCSI) and Log (ATA).
>>>> Each positioning range describes the contiguous set of LBAs that an
>>>> actuator serves.
>>>
>>> I have to say that I prefer the multi-LUN model.
>>
>> It is certainly easier: nothing to do :)
>> SATA, as usual, makes things harder...
>>
>>>
>>>> The first patch adds the block layer plumbing to expose concurrent
>>>> sector ranges of the device through sysfs as a sub-directory of the
>>>> device sysfs queue directory.
>>>
>>> So how do you envision this range reporting should work when putting
>>> DM/MD on top of a multi-actuator disk?
>>
>> The ranges are attached to the device request queue. So the DM/MD target driver
>> can use that information from the underlying devices for whatever possible
>> optimization. For the logical device exposed by the target driver, the ranges
>> are not limits so they are not inherited. As is, right now, DM target devices
>> will not show any range information for the logical devices they create, even if
>> the underlying devices have multiple ranges.
>>
>> The DM/MD target driver is free to set any range information pertinent to the
>> target. E.g. dm-linear could set the range information corresponding to sector
>> chunks from different devices used to build the dm-linear device.
>>
> And indeed, that would be the easiest consumer.
> One 'just' needs to have a simple script converting the sysfs ranges
> into the corresponding dm-linear table definitions, and create one DM
> device for each range.
> That would simulate the multi-LUN approach.
> Not sure if that would warrant a 'real' DM target, seeing that it's
> fully scriptable.
> 
>>> And even without multi-actuator drives, how would you express concurrent
>>> ranges on a DM/MD device sitting on top of several single-actuator
>>> devices?
>>
>> Similar comment as above: it is up to the DM/MD target driver to decide if range
>> information can be useful. For dm-linear, there are obvious cases where it is.
>> Ex: 2 single actuator drives concatenated together can generate 2 ranges
>> similarly to a real split-actuator disk. Expressing the chunks of a dm-linear
>> setup as ranges may not always be possible though, that is, if we keep the
>> assumption that a range is independent from others in terms of command
>> execution. Ex: a dm-linear setup that shuffles a drive LBA mapping (high to low
>> and low to high) has no business showing sector ranges.
>>
>>> While I appreciate that it is easy to just export what the hardware
>>> reports in sysfs, I also think we should consider how filesystems would
>>> use that information. And how things would work outside of the simple
>>> fs-on-top-of-multi-actuator-drive case.
>>
>> Without any change anywhere in existing code (kernel and applications using raw
>> disk accesses), things will just work as is. The multi/split actuator drive will
>> behave as a single actuator drive, even for commands spanning range boundaries.
>> Your guess on potential IOPS gains is as good as mine in this case. Performance
>> will totally depend on the workload but will not be worse than an equivalent
>> single actuator disk.
>>
>> FS block allocators can definitely use the range information to distribute
>> writes among actuators. For reads, well, gains will depend on the workload,
>> obviously, but optimizations at the block IO scheduler level can improve things
>> too, especially if the drive is being used at a QD beyond its capability (that
>> is, requests are accumulated in the IO scheduler).
>>
>> Similar write optimization can be achieved by applications using block device
>> files directly. This series is intended for this case for now. FS and block IO
>> scheduler optimization can be added later.
>>
>>
> Rumours have it that Paolo Valente is working on adapting BFQ to utilize
> the range information for better actuator utilisation.

Paolo has a talk on this subject scheduled for SNIA SDC 2021.

https://storagedeveloper.org/events/sdc-2021/abstracts#hd-Walker

> And eventually one should modify filesystem utilities like xfs to adapt
> the metadata layout to multi-actuator drives.
> 
> The _real_ fun starts once the HDD manufacturers start putting out
> multi-actuator SMR drives :-)

Well, that does not change things that much in the end. The same constraints
remain, and the sector ranges will be aligned to zones. So no added difficulty.

> 
> Cheers,
> 
> Hannes
>