
[v3,01/18] block: introduce duration-limits priority class

Message ID 20230124190308.127318-2-niklas.cassel@wdc.com
State New
Series Add Command Duration Limits support

Commit Message

Niklas Cassel Jan. 24, 2023, 7:02 p.m. UTC
From: Damien Le Moal <damien.lemoal@opensource.wdc.com>

Introduce the IOPRIO_CLASS_DL priority class to indicate that IOs should
be executed using duration-limits targets. The duration target to apply
to a command is indicated using the priority level. Up to 8 levels are
supported, with level 0 indicating "no limit".

This priority class has effect only if the target device supports the
command duration limits feature and this feature is enabled by the user.

While it is recommended to not use an ioscheduler when using the
IOPRIO_CLASS_DL priority class, if using the BFQ or mq-deadline scheduler,
IOPRIO_CLASS_DL is mapped to IOPRIO_CLASS_RT.

The reason for this is twofold:
1) Each priority level for the IOPRIO_CLASS_DL priority class represents a
duration limit descriptor (DLD) inside the device. Users can configure
these limits themselves using passthrough commands, so from a block layer
perspective, Linux has no idea of how each DLD is actually configured.

By mapping a command to IOPRIO_CLASS_RT, the chance that a command exceeds
its duration limit (because it was held too long in the scheduler) is
decreased. It is still possible to use the IOPRIO_CLASS_DL priority class
for "low priority" IOs by configuring a large limit in the respective DLD.

2) On ATA drives, IOPRIO_CLASS_DL commands and NCQ priority commands
(IOPRIO_CLASS_RT) cannot be used together. A mix of CDL and high priority
commands cannot be sent to a device. By mapping IOPRIO_CLASS_DL to
IOPRIO_CLASS_RT, we ensure that a device will never receive a mix of these
two incompatible priority classes.
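
As a usage illustration (hypothetical, not part of this patch): with the
uapi additions below, a process would select, say, duration limit
descriptor 2 for its subsequent IOs roughly as follows:

  #include <sys/syscall.h>
  #include <unistd.h>
  #include <linux/ioprio.h>

  int main(void)
  {
          /* The level (here 2) selects the CDL descriptor; 0 means "no limit" */
          return syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                         IOPRIO_PRIO_VALUE(IOPRIO_CLASS_DL, 2));
  }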

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
---
 block/bfq-iosched.c         | 10 ++++++++++
 block/blk-ioprio.c          |  3 +++
 block/ioprio.c              |  3 ++-
 block/mq-deadline.c         |  1 +
 include/linux/ioprio.h      |  2 +-
 include/uapi/linux/ioprio.h |  7 +++++++
 6 files changed, 24 insertions(+), 2 deletions(-)

Comments

Bart Van Assche Jan. 24, 2023, 7:27 p.m. UTC | #1
On 1/24/23 11:02, Niklas Cassel wrote:
> Introduce the IOPRIO_CLASS_DL priority class to indicate that IOs should
> be executed using duration-limits targets. The duration target to apply
> to a command is indicated using the priority level. Up to 8 levels are
> supported, with level 0 indicating "no limit".
> 
> This priority class has effect only if the target device supports the
> command duration limits feature and this feature is enabled by the user.
> 
> While it is recommended to not use an ioscheduler when using the
> IOPRIO_CLASS_DL priority class, if using the BFQ or mq-deadline scheduler,
> IOPRIO_CLASS_DL is mapped to IOPRIO_CLASS_RT.
> 
> The reason for this is twofold:
> 1) Each priority level for the IOPRIO_CLASS_DL priority class represents a
> duration limit descriptor (DLD) inside the device. Users can configure
> these limits themselves using passthrough commands, so from a block layer
> perspective, Linux has no idea of how each DLD is actually configured.
> 
> By mapping a command to IOPRIO_CLASS_RT, the chance that a command exceeds
> its duration limit (because it was held too long in the scheduler) is
> decreased. It is still possible to use the IOPRIO_CLASS_DL priority class
> for "low priority" IOs by configuring a large limit in the respective DLD.
> 
> 2) On ATA drives, IOPRIO_CLASS_DL commands and NCQ priority commands
> (IOPRIO_CLASS_RT) cannot be used together. A mix of CDL and high priority
> commands cannot be sent to a device. By mapping IOPRIO_CLASS_DL to
> IOPRIO_CLASS_RT, we ensure that a device will never receive a mix of these
> two incompatible priority classes.

Implementing duration limit support using the I/O priority mechanism 
makes it impossible to configure the I/O priority for commands that have 
a duration limit. Shouldn't the duration limit be independent of the I/O 
priority? Am I perhaps missing something?

Thanks,

Bart.
Bart Van Assche Jan. 24, 2023, 8:36 p.m. UTC | #2
On 1/24/23 11:27, Bart Van Assche wrote:
> Implementing duration limit support using the I/O priority mechanism 
> makes it impossible to configure the I/O priority for commands that have 
> a duration limit. Shouldn't the duration limit be independent of the I/O 
> priority? Am I perhaps missing something?

(replying to my own e-mail)

In SAM-6 I found the following: "The device server may use the duration 
expiration time to determine the order of processing commands with
the SIMPLE task attribute within the task set. A difference in duration 
expiration time between commands may override other scheduling 
considerations (e.g., different times to access different logical block 
addresses or vendor specific scheduling considerations). Processing of a 
collection of commands with different command duration limit settings 
should cause a command with an earlier duration expiration time to 
complete with status sooner than a command with a later duration 
expiration time."

Do I understand correctly that it is optional for a SCSI device to 
interpret the command duration as a priority and that this is not mandatory?

Thanks,

Bart.
Damien Le Moal Jan. 24, 2023, 9:29 p.m. UTC | #3
On 1/25/23 04:27, Bart Van Assche wrote:
> On 1/24/23 11:02, Niklas Cassel wrote:
>> Introduce the IOPRIO_CLASS_DL priority class to indicate that IOs should
>> be executed using duration-limits targets. The duration target to apply
>> to a command is indicated using the priority level. Up to 8 levels are
>> supported, with level 0 indicating "no limit".
>>
>> This priority class has effect only if the target device supports the
>> command duration limits feature and this feature is enabled by the user.
>>
>> While it is recommended to not use an ioscheduler when using the
>> IOPRIO_CLASS_DL priority class, if using the BFQ or mq-deadline scheduler,
>> IOPRIO_CLASS_DL is mapped to IOPRIO_CLASS_RT.
>>
>> The reason for this is twofold:
>> 1) Each priority level for the IOPRIO_CLASS_DL priority class represents a
>> duration limit descriptor (DLD) inside the device. Users can configure
>> these limits themselves using passthrough commands, so from a block layer
>> perspective, Linux has no idea of how each DLD is actually configured.
>>
>> By mapping a command to IOPRIO_CLASS_RT, the chance that a command exceeds
>> its duration limit (because it was held too long in the scheduler) is
>> decreased. It is still possible to use the IOPRIO_CLASS_DL priority class
>> for "low priority" IOs by configuring a large limit in the respective DLD.
>>
>> 2) On ATA drives, IOPRIO_CLASS_DL commands and NCQ priority commands
>> (IOPRIO_CLASS_RT) cannot be used together. A mix of CDL and high priority
>> commands cannot be sent to a device. By mapping IOPRIO_CLASS_DL to
>> IOPRIO_CLASS_RT, we ensure that a device will never receive a mix of these
>> two incompatible priority classes.
> 
> Implementing duration limit support using the I/O priority mechanism 
> makes it impossible to configure the I/O priority for commands that have 
> a duration limit. Shouldn't the duration limit be independent of the I/O 
> priority? Am I perhaps missing something?

I/O priority at the device level does not exist with SAS, and with SATA
the ACS specification mandates that NCQ I/O priority and CDL cannot be
mixed together. So from the device point of view, I/O priority and
CDL are mutually exclusive. No issues.

Now, if you are talking about the host level I/O priority scheduling done
by mq-deadline and bfq, the CDL priority class maps to the RT class. They
are the same, as they should be. There is nothing more real-time than CDL in
my opinion :)

Furthermore, if we do not reuse the I/O priority interface, we will have
to add another field to BIOs & requests to propagate the cdl index from
user space down to the device LLD, almost exactly in the manner of I/O
priorities, including all the controls with merging etc. That would be a
lot of overhead to achieve the possibility of prioritized CDL commands.

CDL in and of itself allows the user to define "prioritized" commands by
defining CDLs on the drive that are sorted in increasing time limit order,
i.e. with low CDL index numbers having low limits, and higher priority
within the class (as CDL index == prio level). With that, schedulers can
still do the right thing as they do now, with the additional benefit that
they can even be improved to base their scheduling decisions on a known
time limit for the command execution. But such optimization is not
implemented by this series.
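
To make the "CDL index == prio level" plumbing concrete, here is a minimal
sketch of the decode side (hypothetical helper name; the series does the
equivalent in sd_cdl.c and libata):

  /* Recover the CDL descriptor index from a request's ioprio value */
  static unsigned int cdl_index_from_ioprio(unsigned short ioprio)
  {
          if (IOPRIO_PRIO_CLASS(ioprio) != IOPRIO_CLASS_DL)
                  return 0;                        /* 0 == "no limit" */
          return IOPRIO_PRIO_DATA(ioprio);         /* 1..7 == CDL descriptor */
  }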

> 
> Thanks,
> 
> Bart.
>
Damien Le Moal Jan. 24, 2023, 9:48 p.m. UTC | #4
On 1/25/23 05:36, Bart Van Assche wrote:
> On 1/24/23 11:27, Bart Van Assche wrote:
>> Implementing duration limit support using the I/O priority mechanism 
>> makes it impossible to configure the I/O priority for commands that have 
>> a duration limit. Shouldn't the duration limit be independent of the I/O 
>> priority? Am I perhaps missing something?
> 
> (replying to my own e-mail)
> 
> In SAM-6 I found the following: "The device server may use the duration 
> expiration time to determine the order of processing commands with
> the SIMPLE task attribute within the task set. A difference in duration 
> expiration time between commands may override other scheduling 
> considerations (e.g., different times to access different logical block 
> addresses or vendor specific scheduling considerations). Processing of a 
> collection of commands with different command duration limit settings 
> should cause a command with an earlier duration expiration time to 
> complete with status sooner than a command with a later duration 
> expiration time."
> 
> Do I understand correctly that it is optional for a SCSI device to 
> interpret the command duration as a priority and that this is not mandatory?

This describes the expected behavior from the drive in terms of command
execution ordering when CDL is used. The text is a little "soft" and sounds
as if this behavior is optional because CDL is a combination of time
limits AND a policy for each time limit. The policy of a CDL indicates
what the drive behavior should be if a command misses its time limit. In
short, there are 2 policies:
1) best-effort: the time limit is a hint of sorts. If the drive fails to
execute the command before the limit expires, the command is not aborted
and execution continues.
2) fast-fail: If the drive fails to execute the command before the time
limit expires, the command must be completed with an error immediately.

And CDL also has a field, settable by the user, that describes an allowed
performance degradation to achieve CDL scheduling in time. That is mostly
important for the best-effort case, to indicate how "serious" the user is
about the CDL limit "hint".

So, as you can see, I think the SAM-6 text is vague because of the many
possible variations in scheduling policies that need to be implemented by
a drive.
Bart Van Assche Jan. 24, 2023, 10:43 p.m. UTC | #5
On 1/24/23 13:29, Damien Le Moal wrote:
> I/O priority at the device level does not exist with SAS, and with SATA
> the ACS specification mandates that NCQ I/O priority and CDL cannot be
> mixed together. So from the device point of view, I/O priority and
> CDL are mutually exclusive. No issues.
> 
> Now, if you are talking about the host level I/O priority scheduling done
> by mq-deadline and bfq, the CDL priority class maps to the RT class. They
> are the same, as they should be. There is nothing more real-time than CDL in
> my opinion :)
> 
> Furthermore, if we do not reuse the I/O priority interface, we will have
> to add another field to BIOs & requests to propagate the cdl index from
> user space down to the device LLD, almost exactly in the manner of I/O
> priorities, including all the controls with merging etc. That would be a
> lot of overhead to achieve the possibility of prioritized CDL commands.
> 
> CDL in and of itself allows the user to define "prioritized" commands by
> defining CDLs on the drive that are sorted in increasing time limit order,
> i.e. with low CDL index numbers having low limits, and higher priority
> within the class (as CDL index == prio level). With that, schedulers can
> still do the right thing as they do now, with the additional benefit that
> they can even be improved to base their scheduling decisions on a known
> time limit for the command execution. But such optimization is not
> implemented by this series.

Hi Damien,

What if a device that supports I/O priorities (e.g. an NVMe device that 
supports configuring the SQ priority) and a device that supports command 
duration limits (e.g. a SATA hard disk) are combined via the device 
mapper into a single block device? Should I/O be submitted to the dm 
device with one of the existing I/O priority classes (not supported by 
SATA hard disks) or with I/O priority class IOPRIO_CLASS_DL (not 
supported by NVMe devices)?

Shouldn't the ATA core translate the existing I/O priority levels into a 
command duration limit instead of introducing a new I/O priority class 
that is only supported by ATA devices?

Thanks,

Bart.
Damien Le Moal Jan. 24, 2023, 10:59 p.m. UTC | #6
On 1/25/23 07:43, Bart Van Assche wrote:
> On 1/24/23 13:29, Damien Le Moal wrote:
>> I/O priority at the device level does not exist with SAS, and with SATA
>> the ACS specification mandates that NCQ I/O priority and CDL cannot be
>> mixed together. So from the device point of view, I/O priority and
>> CDL are mutually exclusive. No issues.
>>
>> Now, if you are talking about the host level I/O priority scheduling done
>> by mq-deadline and bfq, the CDL priority class maps to the RT class. They
>> are the same, as they should be. There is nothing more real-time than CDL in
>> my opinion :)
>>
>> Furthermore, if we do not reuse the I/O priority interface, we will have
>> to add another field to BIOs & requests to propagate the cdl index from
>> user space down to the device LLD, almost exactly in the manner of I/O
>> priorities, including all the controls with merging etc. That would be a
>> lot of overhead to achieve the possibility of prioritized CDL commands.
>>
>> CDL in and of itself allows the user to define "prioritized" commands by
>> defining CDLs on the drive that are sorted in increasing time limit order,
>> i.e. with low CDL index numbers having low limits, and higher priority
>> within the class (as CDL index == prio level). With that, schedulers can
>> still do the right thing as they do now, with the additional benefit that
>> they can even be improved to base their scheduling decisions on a known
>> time limit for the command execution. But such optimization is not
>> implemented by this series.
> 
> Hi Damien,
> 
> What if a device that supports I/O priorities (e.g. an NVMe device that 
> supports configuring the SQ priority) and a device that supports command 
> duration limits (e.g. a SATA hard disk) are combined via the device 
> mapper into a single block device? Should I/O be submitted to the dm 
> device with one of the existing I/O priority classes (not supported by 
> SATA hard disks) or with I/O priority class IOPRIO_CLASS_DL (not 
> supported by NVMe devices)?

That is not a use case we considered. My gut feeling is that this is
something the target driver should handle when processing a user IO.
Note that I was not aware that the Linux NVMe driver supported queue priorities...

> Shouldn't the ATA core translate the existing I/O priority levels into a 
> command duration limit instead of introducing a new I/O priority class 
> that is only supported by ATA devices?

There is only one priority class that ATA understands: RT (the level is
irrelevant and ignored). All RT class IOs are mapped to high priority NCQ
commands. All other classes map to normal priority (no priority bit set)
commands.
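
In code, that mapping amounts to something like the following sketch
(simplified from what libata does when NCQ priority is enabled; not the
exact upstream code):

  /* Set the NCQ high priority bit for RT class IOs only */
  if (IOPRIO_PRIO_CLASS(ioprio) == IOPRIO_CLASS_RT)
          tf->hob_nsect |= ATA_PRIO_HIGH << ATA_SHIFT_PRIO;
  /* all other classes: normal priority, no priority bit set */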

And sure, we could map the level of RT class IOs to a CDL index, as we do
for the CDL class, but what would be the point ? The user should use the
CDL class in that case.

Furthermore, there is one additional thing that we do not yet support but
will later: CDL descriptor 0 can be used to set a target time limit for
high priority NCQ commands. Without this new feature introduced with CDL,
the drive is free to schedule high priority NCQ commands as it wants, and
that is hard coded in FW. So you can end up with very aggressive scheduling
leading to significant overall IOPS drop and long tail latency for low
priority commands. See pages 11 and 20 of this presentation for an example:

https://www.snia.org/sites/default/files/SDC/2021/pdfs/SNIA-SDC21-LeMoal-Be-On-Time-command-duration-limits-Feature-Support-in%20Linux.pdf

For a drive that supports both CDL and NCQ priority, with CDL feature
turned off, CDL descriptor 0 defines the time limit hint for high priority
NCQ commands. Again, CDL and NCQ high priority are mutually exclusive.

So for clarity, I really would prefer separating CDL and RT classes as we
did. We could integrate CDL support reusing the RT class + level for CDL
index, but I think this may be very confusing for users, especially
considering that the CDLs on a drive can be defined in any order the user
wants, resulting in indexes/levels that do not have any particular
order, making it impossible for the host to correctly schedule commands.
Bart Van Assche Jan. 25, 2023, 12:05 a.m. UTC | #7
On 1/24/23 14:59, Damien Le Moal wrote:
> There is only one priority class that ATA understands: RT (the level is
> irrelevant and ignored). All RT class IOs are mapped to high priority NCQ
> commands. All other classes map to normal priority (no priority bit set)
> commands.
> 
> And sure, we could map the level of RT class IOs to a CDL index, as we do
> for the CDL class, but what would be the point ? The user should use the
> CDL class in that case.
> 
> Furthermore, there is one additional thing that we do not yet support but
> will later: CDL descriptor 0 can be used to set a target time limit for
> high priority NCQ commands. Without this new feature introduced with CDL,
> the drive is free to schedule high priority NCQ commands as it wants, and
> that is hard coded in FW. So you can end up with very aggressive scheduling 
> leading to significant overall IOPS drop and long tail latency for low 
> priority commands. See pages 11 and 20 of this presentation for an example:
> 
> https://www.snia.org/sites/default/files/SDC/2021/pdfs/SNIA-SDC21-LeMoal-Be-On-Time-command-duration-limits-Feature-Support-in%20Linux.pdf
> 
> For a drive that supports both CDL and NCQ priority, with CDL feature
> turned off, CDL descriptor 0 defines the time limit hint for high priority
> NCQ commands. Again, CDL and NCQ high priority are mutually exclusive.
> 
> So for clarity, I really would prefer separating CDL and RT classes as we
> did. We could integrate CDL support reusing the RT class + level for CDL
> index, but I think this may be very confusing for users, especially
> considering that the CDLs on a drive can be defined in any order the user 
> wants, resulting in indexes/levels that do not have any particular 
> order, making it impossible for the host to correctly schedule commands.

Hi Damien,

Thanks again for the detailed reply. Your replies are very informative 
and help me understand the context better.

However, I'm still less than enthusiastic about the introduction of the 
I/O priority class IOPRIO_CLASS_DL. To me, command duration limits (CDL) 
is a mechanism that is supported by one storage standard (SCSI) and for 
which it is not certain that it will be integrated into other storage 
standards (NVMe, ...). Isn't the purpose of the block layer to provide 
an interface that is independent of the specifics of a single storage 
standard? This is why I'm in favor of letting the ATA core translate one 
of the existing I/O priority classes into a CDL instead of introducing a 
new I/O priority class (IOPRIO_CLASS_DL) in the block layer.

Others may have a different opinion.

Thanks,

Bart.
Damien Le Moal Jan. 25, 2023, 1:19 a.m. UTC | #8
On 1/25/23 09:05, Bart Van Assche wrote:
> On 1/24/23 14:59, Damien Le Moal wrote:
>> There is only one priority class that ATA understands: RT (the level is
>> irrelevant and ignored). All RT class IOs are mapped to high priority NCQ
>> commands. All other classes map to normal priority (no priority bit set)
>> commands.
>>
>> And sure, we could map the level of RT class IOs to a CDL index, as we do
>> for the CDL class, but what would be the point ? The user should use the
>> CDL class in that case.
>>
>> Furthermore, there is one additional thing that we do not yet support but
>> will later: CDL descriptor 0 can be used to set a target time limit for
>> high priority NCQ commands. Without this new feature introduced with CDL,
>> the drive is free to schedule high priority NCQ commands as it wants, and
>> that is hard coded in FW. So you can end up with very aggressive scheduling
>> leading to significant overall IOPS drop and long tail latency for low
>> priority commands. See pages 11 and 20 of this presentation for an example:
>>
>> https://www.snia.org/sites/default/files/SDC/2021/pdfs/SNIA-SDC21-LeMoal-Be-On-Time-command-duration-limits-Feature-Support-in%20Linux.pdf
>>
>> For a drive that supports both CDL and NCQ priority, with CDL feature
>> turned off, CDL descriptor 0 defines the time limit hint for high priority
>> NCQ commands. Again, CDL and NCQ high priority are mutually exclusive.
>>
>> So for clarity, I really would prefer separating CDL and RT classes as we
>> did. We could integrate CDL support reusing the RT class + level for CDL
>> index, but I think this may be very confusing for users, especially
>> considering that the CDLs on a drive can be defined in any order the user
>> wants, resulting in indexes/levels that do not have any particular
>> order, making it impossible for the host to correctly schedule commands.
> 
> Hi Damien,
> 
> Thanks again for the detailed reply. Your replies are very informative 
> and help me understand the context better.
> 
> However, I'm still less than enthusiastic about the introduction of the 
> I/O priority class IOPRIO_CLASS_DL. To me, command duration limits (CDL) 
> is a mechanism that is supported by one storage standard (SCSI) and for 

And ATA (ACS) too. Not just SCSI. This is actually an improvement over IO
priority (command priority) that is supported only by ATA NCQ and does not
exist with SCSI/SBC.

> which it is not certain that it will be integrated into other storage 
> standards (NVMe, ...). Isn't the purpose of the block layer to provide 
> an interface that is independent of the specifics of a single storage 
> standard? This is why I'm in favor of letting the ATA core translate one 
> of the existing I/O priority classes into a CDL instead of introducing a 
> new I/O priority class (IOPRIO_CLASS_DL) in the block layer.

We discussed CDL with Hannes in the context of NVMe over fabrics. There
may be interesting extensions to consider for NVMe in that context (the
value for local PCI attached NVMe drives is more limited at best).

I would argue that IO priority is the same: that is not supported by all
device classes either, and for those that support it, the semantic is not
identical (ATA vs NVMe). Yet, we have the RT class that maps to high
priority for ATA, and nothing else as far as I know.

CDL at least covers SCSI *and* ATA, and as mentioned above, could be used
by NVMe-of host drivers to do fancy link selection for a multipath setup
based on the link speed for instance.

We could overload the RT class with a mapping to the CDL feature on scsi and
ata, but I think this is more confusing/messy than a separate class as we
implemented.

> 
> Others may have a different opinion.
> 
> Thanks,
> 
> Bart.
Christoph Hellwig Jan. 25, 2023, 6:33 a.m. UTC | #9
On Tue, Jan 24, 2023 at 02:43:24PM -0800, Bart Van Assche wrote:
> What if a device that supports I/O priorities (e.g. an NVMe device that 
> supports configuring the SQ priority)

NVMe does not have any I/O priorities, it only has a very limited
priority scheme for queue arbitration.
Bart Van Assche Jan. 25, 2023, 6:37 p.m. UTC | #10
On 1/24/23 17:19, Damien Le Moal wrote:
> On 1/25/23 09:05, Bart Van Assche wrote:
>> Thanks again for the detailed reply. Your replies are very informative
>> and help me understand the context better.
>>
>> However, I'm still less than enthusiastic about the introduction of the
>> I/O priority class IOPRIO_CLASS_DL. To me, command duration limits (CDL)
>> is a mechanism that is supported by one storage standard (SCSI) and for
> 
> And ATA (ACS) too. Not just SCSI. This is actually an improvement over IO
> priority (command priority) that is supported only by ATA NCQ and does not
> exist with SCSI/SBC.
> 
>> which it is not certain that it will be integrated into other storage
>> standards (NVMe, ...). Isn't the purpose of the block layer to provide
>> an interface that is independent of the specifics of a single storage
>> standard? This is why I'm in favor of letting the ATA core translate one
>> of the existing I/O priority classes into a CDL instead of introducing a
>> new I/O priority class (IOPRIO_CLASS_DL) in the block layer.
> 
> We discussed CDL with Hannes in the context of NVMe over fabrics. There
> may be interesting extensions to consider for NVMe in that context (the
> value for local PCI attached NVMe drives is more limited at best).
> 
> I would argue that IO priority is the same: that is not supported by all
> device classes either, and for those that support it, the semantic is not
> identical (ATA vs NVMe). Yet, we have the RT class that maps to high
> priority for ATA, and nothing else as far as I know.
> 
> CDL at least covers SCSI *and* ATA, and as mentioned above, could be used
> by NVMe-of host drivers to do fancy link selection for a multipath setup
> based on the link speed for instance.
> 
> We could overload the RT class with a mapping to the CDL feature on scsi and
> ata, but I think this is more confusing/messy than a separate class as we
> implemented.

Hi Damien,

The more I think about this, the more I'm convinced that it would be wrong
to introduce IOPRIO_CLASS_DL. Datacenters will have a mix of drives that
support CDL and drives that do not support CDL. It seems wrong to me to
make user space software responsible for figuring out whether or not the
drive supports CDL before it can be decided which I/O priority class should
be used. This is something the kernel should do instead of user space
software.

If we were to ask Android storage vendors to implement CDL, then IOPRIO_CLASS_DL
would cause trouble too. Android has had support for a considerable time for giving
the foreground application a higher I/O priority than background applications.
The cgroup settings for foreground and background applications come from the
task_profiles.json file (see also
https://android.googlesource.com/platform/system/core/+/master/libprocessgroup/profiles/task_profiles.json).
As one can see, all the settings in that file are independent of the features
of the storage device. Introducing IOPRIO_CLASS_DL in the kernel and using it
in task_profiles.json would imply that the storage device type has to be
determined before it can be decided whether or not IOPRIO_CLASS_DL can be used.
This seems wrong to me.

I downloaded the patch series in its entirety and applied it on a local kernel
branch. I verified which changes would be needed to replace IOPRIO_CLASS_DL
with IOPRIO_CLASS_RT. Can you help me with verifying the patch below?

Regarding the BFQ changes in the patch below, is an I/O scheduler useful at all
if CDL is used, since a comment in the BFQ driver says that the disk should do
the scheduling instead of BFQ if CDL is used?

Thanks,

Bart.


diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 7add9346c585..815b884d6c5a 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5545,14 +5545,6 @@ bfq_set_next_ioprio_data(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
  		bfqq->new_ioprio_class = IOPRIO_CLASS_IDLE;
  		bfqq->new_ioprio = 7;
  		break;
-	case IOPRIO_CLASS_DL:
-		/*
-		 * For the duration-limits class, we want the disk to do the
-		 * scheduling. So map all levels to the highest RT level.
-		 */
-		bfqq->new_ioprio = 0;
-		bfqq->new_ioprio_class = IOPRIO_CLASS_RT;
-		break;
  	}

  	if (bfqq->new_ioprio >= IOPRIO_NR_LEVELS) {
@@ -5681,8 +5673,6 @@ static struct bfq_queue **bfq_async_queue_prio(struct bfq_data *bfqd,
  		return &bfqg->async_bfqq[1][ioprio][act_idx];
  	case IOPRIO_CLASS_IDLE:
  		return &bfqg->async_idle_bfqq[act_idx];
-	case IOPRIO_CLASS_DL:
-		return &bfqg->async_bfqq[0][0][act_idx];
  	default:
  		return NULL;
  	}
diff --git a/block/blk-ioprio.c b/block/blk-ioprio.c
index dfb5c3f447f4..8bb6b8eba4ce 100644
--- a/block/blk-ioprio.c
+++ b/block/blk-ioprio.c
@@ -27,7 +27,6 @@
   * @POLICY_RESTRICT_TO_BE: modify IOPRIO_CLASS_NONE and IOPRIO_CLASS_RT into
   *		IOPRIO_CLASS_BE.
   * @POLICY_ALL_TO_IDLE: change the I/O priority class into IOPRIO_CLASS_IDLE.
- * @POLICY_ALL_TO_DL: change the I/O priority class into IOPRIO_CLASS_DL.
   *
   * See also <linux/ioprio.h>.
   */
@@ -36,7 +35,6 @@ enum prio_policy {
  	POLICY_NONE_TO_RT	= 1,
  	POLICY_RESTRICT_TO_BE	= 2,
  	POLICY_ALL_TO_IDLE	= 3,
-	POLICY_ALL_TO_DL	= 4,
  };

  static const char *policy_name[] = {
@@ -44,7 +42,6 @@ static const char *policy_name[] = {
  	[POLICY_NONE_TO_RT]	= "none-to-rt",
  	[POLICY_RESTRICT_TO_BE]	= "restrict-to-be",
  	[POLICY_ALL_TO_IDLE]	= "idle",
-	[POLICY_ALL_TO_DL]	= "duration-limits",
  };

  static struct blkcg_policy ioprio_policy;
diff --git a/block/ioprio.c b/block/ioprio.c
index 1b3a9da82597..32a456b45804 100644
--- a/block/ioprio.c
+++ b/block/ioprio.c
@@ -37,7 +37,6 @@ int ioprio_check_cap(int ioprio)

  	switch (class) {
  		case IOPRIO_CLASS_RT:
-		case IOPRIO_CLASS_DL:
  			/*
  			 * Originally this only checked for CAP_SYS_ADMIN,
  			 * which was implicitly allowed for pid 0 by security
@@ -48,7 +47,7 @@ int ioprio_check_cap(int ioprio)
  			if (!capable(CAP_SYS_ADMIN) && !capable(CAP_SYS_NICE))
  				return -EPERM;
  			fallthrough;
-			/* RT and DL have prio field too */
+			/* rt has prio field too */
  		case IOPRIO_CLASS_BE:
  			if (data >= IOPRIO_NR_LEVELS || data < 0)
  				return -EINVAL;
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 526d0ea4dbf9..f10c2a0d18d4 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -113,7 +113,6 @@ static const enum dd_prio ioprio_class_to_prio[] = {
  	[IOPRIO_CLASS_RT]	= DD_RT_PRIO,
  	[IOPRIO_CLASS_BE]	= DD_BE_PRIO,
  	[IOPRIO_CLASS_IDLE]	= DD_IDLE_PRIO,
-	[IOPRIO_CLASS_DL]	= DD_RT_PRIO,
  };

  static inline struct rb_root *
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index b4761c3c4b91..3065b632e6ae 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -673,7 +673,7 @@ static inline void ata_set_tf_cdl(struct ata_queued_cmd *qc, int ioprio)
  	struct ata_taskfile *tf = &qc->tf;
  	int cdl;

-	if (IOPRIO_PRIO_CLASS(ioprio) != IOPRIO_CLASS_DL)
+	if (IOPRIO_PRIO_CLASS(ioprio) != IOPRIO_CLASS_RT)
  		return;

  	cdl = IOPRIO_PRIO_DATA(ioprio) & 0x07;
diff --git a/drivers/scsi/sd_cdl.c b/drivers/scsi/sd_cdl.c
index 59d02dbb5ea1..c5286f5ddae4 100644
--- a/drivers/scsi/sd_cdl.c
+++ b/drivers/scsi/sd_cdl.c
@@ -880,10 +880,10 @@ int sd_cdl_dld(struct scsi_disk *sdkp, struct scsi_cmnd *scmd)
  	unsigned int dld;

  	/*
-	 * Use "no limit" if the request ioprio class is not IOPRIO_CLASS_DL
+	 * Use "no limit" if the request ioprio class is not IOPRIO_CLASS_RT
  	 * or if the user specified an invalid CDL descriptor index.
  	 */
-	if (IOPRIO_PRIO_CLASS(ioprio) != IOPRIO_CLASS_DL)
+	if (IOPRIO_PRIO_CLASS(ioprio) != IOPRIO_CLASS_RT)
  		return 0;

  	dld = IOPRIO_PRIO_DATA(ioprio);
diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
index 2f3fc2fbd668..7578d4f6a969 100644
--- a/include/linux/ioprio.h
+++ b/include/linux/ioprio.h
@@ -20,7 +20,7 @@ static inline bool ioprio_valid(unsigned short ioprio)
  {
  	unsigned short class = IOPRIO_PRIO_CLASS(ioprio);

-	return class > IOPRIO_CLASS_NONE && class <= IOPRIO_CLASS_DL;
+	return class > IOPRIO_CLASS_NONE && class <= IOPRIO_CLASS_IDLE;
  }

  /*
diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h
index 15908b9e9d8c..f70f2596a6bf 100644
--- a/include/uapi/linux/ioprio.h
+++ b/include/uapi/linux/ioprio.h
@@ -29,7 +29,6 @@ enum {
  	IOPRIO_CLASS_RT,
  	IOPRIO_CLASS_BE,
  	IOPRIO_CLASS_IDLE,
-	IOPRIO_CLASS_DL,
  };

  /*
@@ -38,12 +37,6 @@ enum {
  #define IOPRIO_NR_LEVELS	8
  #define IOPRIO_BE_NR		IOPRIO_NR_LEVELS

-/*
- * The Duration limits class allows 8 levels: level 0 for "no limit" and levels
- * 1 to 7, each corresponding to a read or write limit descriptor.
- */
-#define IOPRIO_DL_NR_LEVELS	8
-
  enum {
  	IOPRIO_WHO_PROCESS = 1,
  	IOPRIO_WHO_PGRP,
Niklas Cassel Jan. 25, 2023, 9:23 p.m. UTC | #11
On Wed, Jan 25, 2023 at 10:37:52AM -0800, Bart Van Assche wrote:

(snip)

> Hi Damien,
> 
> The more I think about this, the more I'm convinced that it would be wrong
> to introduce IOPRIO_CLASS_DL. Datacenters will have a mix of drives that
> support CDL and drives that do not support CDL. It seems wrong to me to
> make user space software responsible for figuring out whether or not the
> drive supports CDL before it can be decided which I/O priority class should
> be used. This is something the kernel should do instead of user space
> software.

Well, we can take e.g. NCQ priority as an example, as that is probably
the only device side I/O priority feature currently supported by the
kernel.

If you want to make use of NCQ priority, you need to first enable
/sys/block/sdX/device/ncq_prio_enable
and then submit I/O using IOPRIO_CLASS_RT, so I would argue the user
already needs to know that a device supports device side I/O priority,
if he wants to make use of it.
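
As a per-IO example of that second step, a hypothetical sketch using
liburing (ring, fd and buf assumed to be set up elsewhere):

  struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
  io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
  /* mark just this read as RT class, level 0 */
  sqe->ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 0);
  io_uring_submit(&ring);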


For CDL there are 7 different limits for reads and 7 different
limits for writes, these limits can be configured by the user.
So the users that want to get the most performance out of their drive
will most likely analyze their workloads, and set the limits depending
on what their workloads actually look like.

Bottom line is that heavy users of CDL will absolutely know how the CDL
limits are configured in user space, as they will pick the correct CDL
index (prio level) for the descriptor that they want to use for the
specific I/O that they are doing. An ioscheduler will most likely be
disabled.

(For CDL, the limit is from the time the command is submitted to the device,
so from the device's PoV, it does not really matter if a command is queued
for a long time in a scheduler or not, but from an application PoV, it does
not make sense to hold back a command for long if it e.g. has a short limit.)


If we were to reuse IOPRIO_CLASS_RT, then I guess the best option would be
to have something like:

$ cat /sys/block/sdX/device/rt_prio_backend
[none] ncq-prio cdl

Devices that do not support ncq-prio or cdl,
e.g. currently NVMe, would just have none
(i.e. RT simply means higher host side priority (if a scheduler is used)).

SCSI would then have none and cdl
(for SCSI devices supporting CDL.)

ATA would have none, ncq-prio and cdl.
(for ATA devices supporting CDL.)

That would theoretically avoid another ioprio class, but like I've just
explained, a user space application making use of CDL would for sure know
what the descriptors look like anyway, so I'm not sure if there is an actual
benefit of doing it this way over simply having an IOPRIO_CLASS_DL.

I guess the only benefit would be that we would avoid introducing another
I/O priority class (at the expense of additional complexity elsewhere).


Kind regards,
Niklas
Keith Busch Jan. 25, 2023, 11:11 p.m. UTC | #12
On Wed, Jan 25, 2023 at 10:19:45AM +0900, Damien Le Moal wrote:
> On 1/25/23 09:05, Bart Van Assche wrote:
> 
> > which it is not certain that it will be integrated into other storage 
> > standards (NVMe, ...). Isn't the purpose of the block layer to provide 
> > an interface that is independent of the specifics of a single storage 
> > standard? This is why I'm in favor of letting the ATA core translate one 
> > of the existing I/O priority classes into a CDL instead of introducing a 
> > new I/O priority class (IOPRIO_CLASS_DL) in the block layer.
> 
> We discussed CDL with Hannes in the context of NVMe over fabrics. Their
> may be interesting extensions to consider for NVMe in that context (the
> value for local PCI attached NVMe drive is more limited at best).

I wouldn't necessarily rule out CDL for PCI attached in some future TP. NVMe
does allow rotating media, and they'll want feature parity if CDL is considered
useful in other protocols.
Damien Le Moal Jan. 26, 2023, 12:08 a.m. UTC | #13
On 2023/01/26 8:11, Keith Busch wrote:
> On Wed, Jan 25, 2023 at 10:19:45AM +0900, Damien Le Moal wrote:
>> On 1/25/23 09:05, Bart Van Assche wrote:
>>
>>> which it is not certain that it will be integrated into other storage 
>>> standards (NVMe, ...). Isn't the purpose of the block layer to provide 
>>> an interface that is independent of the specifics of a single storage 
>>> standard? This is why I'm in favor of letting the ATA core translate one 
>>> of the existing I/O priority classes into a CDL instead of introducing a 
>>> new I/O priority class (IOPRIO_CLASS_DL) in the block layer.
>>
>> We discussed CDL with Hannes in the context of NVMe over fabrics. There
>> may be interesting extensions to consider for NVMe in that context (the
>> value for local PCI attached NVMe drives is more limited at best).
> 
> I wouldn't necessarily rule out CDL for PCI attached in some future TP. NVMe
> does allow rotating media, and they'll want feature parity if CDL is considered
> useful in other protocols.

True. If NVMe HDDs come to market, we'll definitely want a CDL feature too.
Damien Le Moal Jan. 26, 2023, 12:24 a.m. UTC | #14
On 2023/01/26 6:23, Niklas Cassel wrote:
> On Wed, Jan 25, 2023 at 10:37:52AM -0800, Bart Van Assche wrote:
> 
> (snip)
> 
>> Hi Damien,
>>
>> The more I think about this, the more I'm convinced that it would be wrong
>> to introduce IOPRIO_CLASS_DL. Datacenters will have a mix of drives that
>> support CDL and drives that do not support CDL. It seems wrong to me to
>> make user space software responsible for figuring out whether or not the
>> drive supports CDL before it can be decided which I/O priority class should
>> be used. This is something the kernel should do instead of user space
>> software.
> 
> Well, we can take e.g. NCQ priority as an example, as that is probably
> the only device side I/O priority feature currently supported by the
> kernel.
> 
> If you want to make use of NCQ priority, you need to first enable
> /sys/block/sdX/device/ncq_prio_enable
> and then submit I/O using IOPRIO_CLASS_RT, so I would argue the user
> already needs to know that a device supports device side I/O priority,
> if he wants to make use of it.

Yes, absolutely. In addition to this, the NCQ high priority feature is optional. The
host-level RT class scheduling works the same way regardless of a SATA drive
supporting NCQ high priority or not. If ncq_prio_enable is not enabled (or not
supported), the scheduler still works as before. If ncq_prio_enable is set for a
drive that supports NCQ high prio, then the user gets the additional benefit of
*also* having the drive prioritize the commands from high-priority user IOs.

> For CDL there are 7 different limits for reads and 7 different
> limits for writes, these limits can be configured by the user.
> So the users that want to get the most performance out of their drive
> will most likely analyze their workloads, and set the limits depending
> on what their workloads actually look like.
> 
> Bottom line is that heavy users of CDL will absolutely know how the CDL
> limits are configured in user space, as they will pick the correct CDL
> index (prio level) for the descriptor that they want to use for the
> specific I/O that they are doing. An ioscheduler will most likely be
> disabled.

Yes. And for cases where we still need an IO scheduler (e.g. SMR with
mq-deadline), we really cannot use the priority level (CDL index) as
meaningful information to make request scheduling decisions because I think it
is simply impossible to reliably define a "priority" order for the 7 read and
write descriptors. We cannot map a set of 14 descriptors with a very large
possible number of variations to a sorted array of priority-like levels.

> (For CDL, the limit is from the time the command is submitted to the device,
> so from the device's PoV, it does not really matter if a command is queued
> for a long time in a scheduler or not, but from an application PoV, it does
> not make sense to hold back a command for long if it e.g. has a short limit.)
> 
> 
> If we were to reuse IOPRIO_CLASS_RT, then I guess the best option would be
> to have something like:
> 
> $ cat /sys/block/sdX/device/rt_prio_backend
> [none] ncq-prio cdl

No need for this. We can keep the existing ncq_prio_enable and the proposed
duration_limits/enable sysfs attributes. The user cannot enable both at the same
time with our patches. So if the user enables ncq_prio_enable, then it will get
high priority NCQ command mapping for any level of the RT class. If
duration_limits/enable is set, then the user will get CDL scheduling of commands
on the drive.

But again, the difficulty with this overloading is that we *cannot* implement a
solid level-based scheduling in IO schedulers because ordering the CDLs in a
meaningful way is impossible. So BFQ handling of the RT class would likely not
result in the most ideal scheduling (that would depend heavily on how the CDL
descriptors are defined on the drive). Hence my reluctance to overload the RT
class for CDL.

> Devices that do not support ncq-prio or cdl,
> e.g. currently NVMe, would just have none
> (i.e. RT simply means higher host side priority (if a scheduler is used)).

Yes. Exactly.

> SCSI would then have none and cdl
> (for SCSI devices supporting CDL.)
> 
> ATA would have none, ncq-prio and cdl.
> (for ATA devices supporting CDL.)
> 
> That would theoretically avoid another ioprio class, but like I've just
> explained, a user space application making use of CDL would for sure know
> what the descriptors look like anyway, so I'm not sure if there is an actual
> benefit of doing it this way over simply having an IOPRIO_CLASS_DL.

Agree. And as explained above, I think that reusing the RT class creates more
problems than the apparent simplification is worth.

> I guess the only benefit would be that we would avoid introducing another
> I/O priority class (at the expense of additional complexity elsewhere).

Yes. And I think that the added complexity to correctly handle the overloaded RT
class is too much. RT class has been around for a long time for host-level IO
priority scheduling. Let's not break it in weird ways.

We certainly can work on improving handling of IOPRIO_CLASS_DL in IO schedulers.
But in my opinion, that can be done later, after this initial series introducing
CDL support is applied.
Christoph Hellwig Jan. 26, 2023, 5:26 a.m. UTC | #15
On Wed, Jan 25, 2023 at 04:11:41PM -0700, Keith Busch wrote:
> I wouldn't necessarily rule out CDL for PCI attached in some future TP. NVMe
> does allow rotating media, and they'll want feature parity if CDL is considered
> useful in other protocols.

NVMe has a TP for CDL that is technically active, although it doesn't
seem to be actively worked on right now.
Niklas Cassel Jan. 26, 2023, 1:53 p.m. UTC | #16
On Thu, Jan 26, 2023 at 09:24:12AM +0900, Damien Le Moal wrote:
> On 2023/01/26 6:23, Niklas Cassel wrote:
> > On Wed, Jan 25, 2023 at 10:37:52AM -0800, Bart Van Assche wrote:

(snip)

> > If we were to reuse IOPRIO_CLASS_RT, then I guess the best option would be
> > to have something like:
> > 
> > $ cat /sys/block/sdX/device/rt_prio_backend
> > [none] ncq-prio cdl
> 
> No need for this. We can keep the existing ncq_prio_enable and the proposed
> duration_limits/enable sysfs attributes. The user cannot enable both at the same
> time with our patches. So if the user enables ncq_prio_enable, then it will get
> high priority NCQ command mapping for any level of the RT class. If
> duration_limits/enable is set, then the user will get CDL scheduling of commands
> on the drive.
> 
> But again, the difficulty with this overloading is that we *cannot* implement a
> solid level-based scheduling in IO schedulers because ordering the CDLs in a
> meaningful way is impossible. So BFQ handling of the RT class would likely not
> result in the most ideal scheduling (that would depend heavily on how the CDL
> descriptors are defined on the drive). Hence my reluctance to overload the RT
> class for CDL.

Well, if CDL were to reuse IOPRIO_CLASS_RT, then the user would either have to
disable the IO scheduler, so that lower classdata levels wouldn't be prioritized
over higher classdata levels, or simply use an IO scheduler that does not care
about the classdata level, e.g. mq-deadline.
Bart Van Assche Jan. 26, 2023, 5:33 p.m. UTC | #17
On 1/26/23 05:53, Niklas Cassel wrote:
> On Thu, Jan 26, 2023 at 09:24:12AM +0900, Damien Le Moal wrote:
>> But again, the difficulty with this overloading is that we *cannot* implement a
>> solid level-based scheduling in IO schedulers because ordering the CDLs in a
>> meaningful way is impossible. So BFQ handling of the RT class would likely not
>> result in the most ideal scheduling (that would depend heavily on how the CDL
>> descriptors are defined on the drive). Hence my reluctance to overload the RT
>> class for CDL.
> 
> Well, if CDL were to reuse IOPRIO_CLASS_RT, then the user would either have to
> disable the IO scheduler, so that lower classdata levels wouldn't be prioritized
> over higher classdata levels, or simply use an IO scheduler that does not care
> about the classdata level, e.g. mq-deadline.

How about making the information about whether or not CDL has been 
enabled available to the scheduler such that the scheduler can include 
that information in its decisions?

> However, for CDL, things are not as simple as setting a single bit in the
> command, because of all the different descriptors, so we must let the classdata
> represent the device side priority level, and not the host side priority level
> (as we cannot have both, and I agree with you, it is very hard to define an order
> between the descriptors.. e.g. should a 20 ms policy 0xf descriptor be ranked
> higher or lower than a 20 ms policy 0xd descriptor?).

How about only supporting a subset of the standard such that it becomes 
easy to map CDLs to host side priority levels?

If users really need the ability to use all standardized CDL features 
and if there is no easy way to map CDL levels to an I/O priority, is the 
I/O priority mechanism really the best basis for a user space interface 
for CDLs?

Thanks,

Bart.
Damien Le Moal Jan. 27, 2023, 12:18 a.m. UTC | #18
On 1/27/23 02:33, Bart Van Assche wrote:
> On 1/26/23 05:53, Niklas Cassel wrote:
>> On Thu, Jan 26, 2023 at 09:24:12AM +0900, Damien Le Moal wrote:
>>> But again, the difficulty with this overloading is that we *cannot* implement a
>>> solid level-based scheduling in IO schedulers because ordering the CDLs in a
>>> meaningful way is impossible. So BFQ handling of the RT class would likely not
>>> result in the most ideal scheduling (that would depend heavily on how the CDL
>>> descriptors are defined on the drive). Hence my reluctance to overload the RT
>>> class for CDL.
>>
>> Well, if CDL were to reuse IOPRIO_CLASS_RT, then the user would either have to
>> disable the IO scheduler, so that lower classdata levels wouldn't be prioritized
>> over higher classdata levels, or simply use an IO scheduler that does not care
>> about the classdata level, e.g. mq-deadline.
> 
> How about making the information about whether or not CDL has been 
> enabled available to the scheduler such that the scheduler can include 
> that information in its decisions?

Sure, that is easy to do. But as I mentioned before, I think that is
something we can do after this initial support series.

>> However, for CDL, things are not as simple as setting a single bit in the
>> command, because of all the different descriptors, so we must let the classdata
>> represent the device side priority level, and not the host side priority level
>> (as we cannot have both, and I agree with you, it is very hard to define an order
>> between the descriptors.. e.g. should a 20 ms policy 0xf descriptor be ranked
>> higher or lower than a 20 ms policy 0xd descriptor?).
> 
> How about only supporting a subset of the standard such that it becomes 
> easy to map CDLs to host side priority levels?

I am opposed to this, for several reasons:

1) We are seeing different use cases from users that cover a wide range of
use of CDL descriptors with various definitions.

2) Passthrough commands can be used by a user to change a drive's CDL
descriptors without the kernel knowing about it, unless we spend our time
revalidating the CDL descriptor log page(s)...

3) The CDL standard as-is is actually very sensible and not overloaded with
stuff that is only useful in niche use cases. For each CDL descriptor, you
have:
 * The active time limit, which is a clean way to specify how much time
you allow a drive to deal with bad sectors (mostly read case). A typical
HDD will try very hard to recover data from a sector, always. As a result,
the HDD may spend up to several seconds reading a sector again and again
applying different signal processing techniques until it gets the sector
ECC checked to return valid data. That of course can hugely increase an IO
latency seen by the host. In applications such as erasure coded
distributed object stores, maximum latency for an object access can thus
be kept low using this limit without compromising the data since the
object can always be rebuilt from the erasure codes if one HDD is slow to
respond. This limit is also interesting for video streaming/playback to
avoid video buffer underflow (at the expense of maybe some block noise
depending on the codec).
 * The inactive time limit can be used to tell the drive how long it is
allowed to let a command stand in the drive internal queue before
processing. This is thus a parameter that allows a host to tune the drive
RPO optimization (rotational positioning optimization, e.g. HDD internal
command scheduling based on angular sector position on tracks with the
head's current position). This is a neat way to control max IOPS vs tail
latency since drives tend to privilege maximizing IOPS over lowering max
tail latency.
 * The duration guideline limit defines an overall time limit for a
command without distinguishing between active and inactive time. It is the
easiest to use (the easiest one to understand from a beginner user point
of view). This is, in fact, a neat way to define intelligent IO prioritization,
way better than RT class scheduling on the host or the use of ATA
NCQ high priority, as it provides more information to the drive about the
urgency of a particular command. That allows the drive to still perform
RPO to maximize IOPS without long tail latencies. Chaining such limit with
an active+inactive time limit descriptor using the "next limit" policy
(0x1 policy) can also finely define what the drive should do if the guideline
limit is exceeded (as the next descriptor can define what to do based on
the reason for the limit being exceeded: long internal queueing vs bad
sector long access time).

> If users really need the ability to use all standardized CDL features 
> and if there is no easy way to map CDL levels to an I/O priority, is the 
> I/O priority mechanism really the best basis for a user space interface 
> for CDLs?

As you can see above, yes, we need everything and should not attempt
restricting CDL use. The IO priority interface is a perfect fit for CDL in
the sense that all we need to pass along from user to device is one
number: the CDL index to use for a command. So creating a different
interface for this while the IO priority interface does exactly that
sounds silly to me.

One compromise we could make is: have the IO schedulers completely ignore
the CDL prio class for now, that is, have them assume that no IO prio
class/level was specified. Given that they are not tuned to handle CDL
well anyway, this is probably the best thing to do for now.

We still need to have the block layer prevent merging of requests with
different CDL descriptors though, which is another reason to reuse the IO
prio interface as the block layer already does this. Less code, which is
always a good thing.
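
The kind of check implied is roughly the following sketch (hypothetical
helper name; the actual merge path in the block layer differs):

  /*
   * Requests with different ioprio values, and thus possibly different
   * CDL descriptors, must not be merged.
   */
  static bool cdl_merge_ok(struct request *rq, struct bio *bio)
  {
          return rq->ioprio == bio_prio(bio);
  }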

> 
> Thanks,
> 
> Bart.
Damien Le Moal Jan. 27, 2023, 1:40 a.m. UTC | #19
On 1/27/23 09:18, Damien Le Moal wrote:
> On 1/27/23 02:33, Bart Van Assche wrote:
>> On 1/26/23 05:53, Niklas Cassel wrote:
>>> On Thu, Jan 26, 2023 at 09:24:12AM +0900, Damien Le Moal wrote:
>>>> But again, the difficulty with this overloading is that we *cannot* implement a
>>>> solid level-based scheduling in IO schedulers because ordering the CDLs in a
>>>> meaningful way is impossible. So BFQ handling of the RT class would likely not
>>>> result in the most ideal scheduling (that would depend heavily on how the CDL
>>>> descriptors are defined on the drive). Hence my reluctance to overload the RT
>>>> class for CDL.
>>>
>>> Well, if CDL were to reuse IOPRIO_CLASS_RT, then the user would either have to
>>> disable the IO scheduler, so that lower classdata levels wouldn't be prioritized
>>> over higher classdata levels, or simply use an IO scheduler that does not care
>>> about the classdata level, e.g. mq-deadline.
>>
>> How about making the information about whether or not CDL has been 
>> enabled available to the scheduler such that the scheduler can include 
>> that information in its decisions?
> 
> Sure, that is easy to do. But as I mentioned before, I think that is
> something we can do after this initial support series.
> 
>>> However, for CDL, things are not as simple as setting a single bit in the
>>> command, because of all the different descriptors, so we must let the classdata
>>> represent the device side priority level, and not the host side priority level
>>> (as we cannot have both, and I agree with you, it is very hard to define an order
>>> between the descriptors.. e.g. should a 20 ms policy 0xf descriptor be ranked
>>> higher or lower than a 20 ms policy 0xd descriptor?).
>>
>> How about only supporting a subset of the standard such that it becomes 
>> easy to map CDLs to host side priority levels?
> 
> I am opposed to this, for several reasons:
> 
> 1) We are seeing different use cases from users that cover a wide range of
> use of CDL descriptors with various definitions.
> 
> 2) Passthrough commands can be used by a user to change a drive's CDL
> descriptors without the kernel knowing about it, unless we spend our time
> revalidating the CDL descriptor log page(s)...
> 
> 3) The CDL standard as-is is actually very sensible and not overloaded with
> stuff that is only useful in niche use cases. For each CDL descriptor, you
> have:
>  * The active time limit, which is a clean way to specify how much time
> you allow a drive to deal with bad sectors (mostly read case). A typical
> HDD will try very hard to recover data from a sector, always. As a result,
> the HDD may spend up to several seconds reading a sector again and again
> applying different signal processing techniques until it gets the sector
> ECC checked to return valid data. That of course can hugely increase an IO
> latency seen by the host. In applications such as erasure coded
> distributed object stores, maximum latency for an object access can thus
> be kept low using this limit without compromising the data since the
> object can always be rebuilt from the erasure codes if one HDD is slow to
> respond. This limit is also interesting for video streaming/playback to
> avoid video buffer underflow (at the expense of maybe some block noise
> depending on the codec).
>  * The inactive time limit can be used to tell the drive how long it is
> allowed to let a command stand in the drive internal queue before
> processing. This is thus a parameter that allows a host to tune the drive
> RPO optimization (rotational positioning optimization, e.g. HDD internal
> command scheduling based on angular sector position on tracks with the
> head's current position). This is a neat way to control max IOPS vs tail
> latency since drives tend to privilege maximizing IOPS over lowering max
> tail latency.
>  * The duration guideline limit defines an overall time limit for a
> command without distinguishing between active and inactive time. It is the
> easiest to use (and the easiest to understand from a beginner's point
> of view). This is in fact a neat way to define intelligent IO prioritization,
> way better than RT class scheduling on the host or the use of ATA
> NCQ high priority, as it provides more information to the drive about the
> urgency of a particular command. That allows the drive to still perform
> RPO to maximize IOPS without long tail latencies. Chaining such limit with
> an active+inactive time limit descriptor using the "next limit" policy
> (0x1 policy) can also finely define what the drive should do if the guideline
> limit is exceeded (as the next descriptor can define what to do based on
> the reason for the limit being exceeded: long internal queueing vs bad
> sector long access time).

Note that all 3 limits can be used in a single CDL descriptor to precisely
define how a command should be processed by the device. That is why it is
nearly impossible to come up with a meaningful ordering of CDL descriptors
as an increasing set of priority levels.
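
To make this concrete, here is a purely illustrative C view of a single
descriptor carrying all three limits. The field names and the microsecond
unit are assumptions made for the example, not the on-media log page layout:

#include <stdint.h>

/*
 * Illustrative only: one CDL descriptor combining the three limits
 * discussed above. Each policy field encodes what the drive does when
 * the corresponding limit is exceeded (e.g. the 0x1 "next limit"
 * policy chains to another descriptor).
 */
struct cdl_desc_sketch {
	uint32_t max_active_time_us;	/* cap on media access time (error recovery) */
	uint8_t  active_time_policy;
	uint32_t max_inactive_time_us;	/* cap on internal queueing before processing */
	uint8_t  inactive_time_policy;
	uint32_t duration_guideline_us;	/* overall soft time target for the command */
	uint8_t  duration_policy;
};

Two such descriptors with, say, the same guideline but different policies
cannot be ranked against each other, which is exactly the ordering problem
described above.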

> 
>> If users really need the ability to use all standardized CDL features 
>> and if there is no easy way to map CDL levels to an I/O priority, is the 
>> I/O priority mechanism really the best basis for a user space interface 
>> for CDLs?
> 
> As you can see above, yes, we need everything and should not attempt to
> restrict CDL use. The IO priority interface is a perfect fit for CDL in
> the sense that all we need to pass along from user to device is one
> number: the CDL index to use for a command. So creating a different
> interface for this while the IO priority interface does exactly that
> sounds silly to me.
> 
> One compromise we could make is to have the IO schedulers completely ignore
> the CDL prio class for now, that is, have them assume that no IO prio
> class/level was specified. Given that they are not tuned to handle CDL
> well anyway, this is probably the best thing to do for now.
> 
> We still need to have the block layer prevent merging of requests with
> different CDL descriptors though, which is another reason to reuse the IO
> prio interface as the block layer already does this. Less code, which is
> always a good thing.
> 
>>
>> Thanks,
>>
>> Bart.
>
Hannes Reinecke Jan. 27, 2023, 12:43 p.m. UTC | #20
On 1/24/23 20:02, Niklas Cassel wrote:
> From: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> 
> Introduce the IOPRIO_CLASS_DL priority class to indicate that IOs should
> be executed using duration-limits targets. The duration target to apply
> to a command is indicated using the priority level. Up to 8 levels are
> supported, with level 0 indicating "no limit".
> 
> This priority class has effect only if the target device supports the
> command duration limits feature and this feature is enabled by the user.
> 
> While it is recommended to not use an ioscheduler when using the
> IOPRIO_CLASS_DL priority class, if using the BFQ or mq-deadline scheduler,
> IOPRIO_CLASS_DL is mapped to IOPRIO_CLASS_RT.
> 
> The reason for this is twofold:
> 1) Each priority level for the IOPRIO_CLASS_DL priority class represents a
> duration limit descriptor (DLD) inside the device. Users can configure
> these limits themselves using passthrough commands, so from a block layer
> perspective, Linux has no idea of how each DLD is actually configured.
> 
> By mapping a command to IOPRIO_CLASS_RT, the chance that a command exceeds
> its duration limit (because it was held too long in the scheduler) is
> decreased. It is still possible to use the IOPRIO_CLASS_DL priority class
> for "low priority" IOs by configuring a large limit in the respective DLD.
> 
> 2) On ATA drives, IOPRIO_CLASS_DL commands and NCQ priority commands
> (IOPRIO_CLASS_RT) cannot be used together. A mix of CDL and high priority
> commands cannot be sent to a device. By mapping IOPRIO_CLASS_DL to
> IOPRIO_CLASS_RT, we ensure that a device will never receive a mix of these
> two incompatible priority classes.
> 
> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> ---
>   block/bfq-iosched.c         | 10 ++++++++++
>   block/blk-ioprio.c          |  3 +++
>   block/ioprio.c              |  3 ++-
>   block/mq-deadline.c         |  1 +
>   include/linux/ioprio.h      |  2 +-
>   include/uapi/linux/ioprio.h |  7 +++++++
>   6 files changed, 24 insertions(+), 2 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
Bart Van Assche Jan. 27, 2023, 5:23 p.m. UTC | #21
On 1/26/23 17:40, Damien Le Moal wrote:
> On 1/27/23 09:18, Damien Le Moal wrote:
>> On 1/27/23 02:33, Bart Van Assche wrote:
>>> How about only supporting a subset of the standard such that it becomes
>>> easy to map CDLs to host side priority levels?
>>
>> I am opposed to this, for several reasons:
>>
>> 1) We are seeing different use cases from users that cover a wide range of
>> use of CDL descriptors with various definitions.
>>
>> 2) Passthrough commands can be used by a user to change a drive CDL
>> descriptors without the kernel knowing about it, unless we spend our time
>> revalidating the CDL descriptor log page(s)...
>> 3) The CDL standard as-is is actually very sensible and not overloaded with
>> stuff that is only useful in niche use cases. For each CDL descriptor, you
>> have:
>>   * The active time limit, which is a clean way to specify how much time
>> you allow a drive to deal with bad sectors (mostly read case). A typical
>> HDD will try very hard to recover data from a sector, always. As a result,
>> the HDD may spend up to several seconds reading a sector again and again
>> applying different signal processing techniques until it gets the sector
>> ECC checked to return valid data. That of course can hugely increase an IO
>> latency seen by the host. In applications such as erasure coded
>> distributed object stores, maximum latency for an object access can thus
>> be kept low using this limit without compromising the data since the
>> object can always be rebuilt from the erasure codes if one HDD is slow to
>> respond. This limit is also interesting for video streaming/playback to
>> avoid video buffer underflow (at the expense of maybe some block noise
>> depending on the codec).
>>   * The inactive time limit can be used to tell the drive how long it is
>> allowed to let a command stand in the drive internal queue before
>> processing. This is thus a parameter that allows a host to tune the drive
>> RPO optimization (rotational positioning optimization, e.g. HDD internal
>> command scheduling based on angular sector position on tracks with the
>> head's current position). This is a neat way to control max IOPS vs tail
>> latency since drives tend to favor maximizing IOPS over lowering max
>> tail latency.
>>   * The duration guideline limit defines an overall time limit for a
>> command without distinguishing between active and inactive time. It is the
>> easiest to use (and the easiest to understand from a beginner's point
>> of view). This is in fact a neat way to define intelligent IO prioritization,
>> way better than RT class scheduling on the host or the use of ATA
>> NCQ high priority, as it provides more information to the drive about the
>> urgency of a particular command. That allows the drive to still perform
>> RPO to maximize IOPS without long tail latencies. Chaining such limit with
>> an active+inactive time limit descriptor using the "next limit" policy
>> (0x1 policy) can also finely define what the drive should do if the guideline
>> limit is exceeded (as the next descriptor can define what to do based on
>> the reason for the limit being exceeded: long internal queueing vs bad
>> sector long access time).
> 
> Note that all 3 limits can be used in a single CDL descriptor to precisely
> define how a command should be processed by the device. That is why it is
> nearly impossible to come up with a meaningful ordering of CDL descriptors
> as an increasing set of priority levels.

A summary of my concerns is as follows:
* The current I/O priority levels (RT, BE, IDLE) apply to all block 
devices. IOPRIO_CLASS_DL is only supported by certain block devices 
(some but not all SCSI harddisks). This forces applications to check the 
capabilities of the storage device before it can be decided whether or 
not IOPRIO_CLASS_DL can be used. This is not something applications 
should do but something the kernel should do. Additionally, if multiple 
dm devices are stacked on top of the block device driver, like in 
Android, it becomes even more cumbersome to check whether or not the 
block device supports CDL.
* For the RT, BE and IDLE classes, it is well defined which priority 
number represents a high priority and which priority number represents a 
low priority. For CDL, only the drive knows the priority details. I 
think that application software should be able to select a DL priority 
without having to read the CDL configuration first.

I hope that I have made it clear that I think that the proposed user 
space API will be very painful to use for application developers.

Bart.
Damien Le Moal Jan. 28, 2023, 12:40 a.m. UTC | #22
On 1/28/23 02:23, Bart Van Assche wrote:
> A summary of my concerns is as follows:
> * The current I/O priority levels (RT, BE, IDLE) apply to all block 
> devices. IOPRIO_CLASS_DL is only supported by certain block devices 
> (some but not all SCSI harddisks). This forces applications to check the 
> capabilities of the storage device before it can be decided whether or 
> not IOPRIO_CLASS_DL can be used. This is not something applications 
> should do but something the kernel should do. Additionally, if multiple 
> dm devices are stacked on top of the block device driver, like in 
> Android, it becomes even more cumbersome to check whether or not the 
> block device supports CDL.

Yes, RT, BE and IDLE apply to all block devices. And so does CDL in the sense
that if a user specifies the CDL class for IOs to a device that does not support
CDL, then nothing special will happen. There will be no differentiation of the
IOs. That is *exactly* what happens when using RT, BE or IDLE with the none
scheduler (e.g. default nvme setup). And the same remark applies to the RT class
mapping to the ATA NCQ priority feature: the user needs to check the device to know
if that will happen, *and* also needs to turn on that feature for that mapping
to be effective.

The levels of the CDL priority class are also very well defined: they map to the
CDL descriptors defined on the drive, which the user can consult through
sysfs (no special tools needed), so they are easily discoverable.

As for DM devices, these have no scheduler. So any processing of a priority
class by a DM target driver is that driver's responsibility. Initially, all that
happens is the block layer passing on that information through the stack with
the BIOs. That's it. Real action may happen once the physical block device is
reached with the IO scheduler for that device, if one is set.

At that level, the none scheduler is of no concern: nothing will happen. Kyber also
ignores priorities. We are left with only bfq and mq-deadline. The latter only
cares about the priority class, ignoring levels. bfq does act on both class and
level.

IOPRIO_CLASS_DL is equal to 4, so strictly speaking, it is of lower priority than
the IDLE class if you want to consider it as part of that ordering. But we
defined it as a different class to allow *not* having to do that. IO schedulers
can be modified to ignore that priority class for now, mapping it to say the
default BE class for instance. Our current patch set maps the CDL class to the
RT class for the schedulers, as that made most sense given the time-sensitive
nature of CDL workloads. But we can change that to actually let the scheduler
decide if you want. There are no other changes in the block layer that have or
need special handling of the CDL class. All very clean in my opinion, no special
conditions for that feature. No additional "if" in the hot path, no overhead added.

> * For the RT, BE and IDLE classes, it is well defined which priority 
> number represents a high priority and which priority number represents a 
> low priority. For CDL, only the drive knows the priority details. I 
> think that application software should be able to select a DL priority 
> without having to read the CDL configuration first.

The levels of the CDL priority class are also very well defined: they map to the
CDL descriptors defined on the drive, which the user can consult through
sysfs (no special tools needed), so they are easily discoverable. And unless we restrict
how CDL descriptors can be defined, which I explained in my previous email is
not desirable at all, we cannot and should not try to order levels in some sort
of priority semantic. The CDL semantics do not directly define a priority level,
only time limits, which may or may not be ordered, depending on the limits
definitions.

As Niklas pointed out, this is not a "generic" feature that any random
application can magically use without modifications. The application must be
aware of what CDL is and how the descriptors are defined. And for 99.99% of the
use cases, the CDL descriptors will be defined in a way useful for that
application. There is no magic generic set of descriptors defined by default,
though a simple set of increasing time limits can be cleanly mapped to priority
levels, and a system administrator is free to set that up for the system drives
if that is what the running applications expect. CDL is a very flexible feature
that can cover a lot of use cases. Trying to shoehorn it into the legacy/classic
priority semantic framework would only restrict its usefulness.

> I hope that I have made it clear that I think that the proposed user
> space API will be very painful to use for application developers.

I completely disagree. Reusing the prio class/level API made it easy to allow
applications to use the feature. fio support for CDL requires exactly *one line*
change, to allow for the CDL class number 4. That's it. From there, one can use
the --cmdprio_class=4 and --cmdprio=idx options to exercise a drive. The value of
"idx" here of course depends on how the descriptors are set on the drive. But
back to the point above. This depends on the application goals and the
descriptors are set accordingly for that goal. There is no real discovery needed
by the application. The application expects a certain set of CDL limits for its
use case, and checking that this set is the one currently defined on the drive
is easy to do from an application with the sysfs interface we added.
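
As an illustration of how little is needed on the application side, here is a
minimal sketch of the per-process path (ioprio_set() has no glibc wrapper,
hence the raw syscall; the IOPRIO_* values below match this series, and the
descriptor index 2 is an arbitrary example that must match a limit configured
on the drive):

#include <sys/syscall.h>
#include <unistd.h>

#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_PRIO_VALUE(class, data)	(((class) << IOPRIO_CLASS_SHIFT) | (data))
#define IOPRIO_CLASS_DL		4	/* as added by this patch */
#define IOPRIO_WHO_PROCESS	1

int main(void)
{
	/* Tag all subsequent IOs of this process with CDL descriptor 2. */
	int ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_DL, 2);

	return syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0 /* self */, ioprio);
}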

Many users out there have deployed and are using applications taking advantage
of the ATA NCQ priority feature, using class RT for high priority IOs. The new CDL
class does not require many application changes to be enabled for next gen
drives that will have CDL.

> 
> Bart.
>
Bart Van Assche Jan. 28, 2023, 12:47 a.m. UTC | #23
On 1/27/23 16:40, Damien Le Moal wrote:
> On 1/28/23 02:23, Bart Van Assche wrote:
>> I hope that I have made it clear that I think that the proposed user
>> space API will be very painful to use for application developers.
> 
> I completely disagree. Reusing the prio class/level API made it easy to allow
> applications to use the feature. fio support for CDL requires exactly *one line*
> change, to allow for the CDL class number 4. That's it. From there, one can use
> the --cmdprio_class=4 and --cmdprio=idx options to exercise a drive. The value of
> "idx" here of course depends on how the descriptors are set on the drive. But
> back to the point above. This depends on the application goals and the
> descriptors are set accordingly for that goal. There is no real discovery needed
> by the application. The application expects a certain set of CDL limits for its
> use case, and checking that this set is the one currently defined on the drive
> is easy to do from an application with the sysfs interface we added.
> 
> Many users out there have deployed and are using applications taking advantage
> of the ATA NCQ priority feature, using class RT for high priority IOs. The new CDL
> class does not require many application changes to be enabled for next gen
> drives that will have CDL.
As I mentioned before, the new I/O priority class IOPRIO_CLASS_DL 
makes it impossible to use a single I/O priority class across devices 
that support CDL and devices that do not support CDL. I'm surprised that 
you keep denying that IOPRIO_CLASS_DL is a royal pain for users who have 
to support devices that support CDL and devices that do not support CDL.

Bart.
Damien Le Moal Jan. 28, 2023, 12:59 a.m. UTC | #24
On 1/28/23 09:47, Bart Van Assche wrote:
> On 1/27/23 16:40, Damien Le Moal wrote:
>> On 1/28/23 02:23, Bart Van Assche wrote:
>>> I hope that I have made it clear that I think that the proposed user
>>> space API will be very painful to use for application developers.
>>
>> I completely disagree. Reusing the prio class/level API made it easy to allow
>> applications to use the feature. fio support for CDL requires exactly *one line*
>> change, to allow for the CDL class number 4. That's it. From there, one can use
>> the --cmdprio_class=4 and --cmdprio=idx options to exercise a drive. The value of
>> "idx" here of course depends on how the descriptors are set on the drive. But
>> back to the point above. This depends on the application goals and the
>> descriptors are set accordingly for that goal. There is no real discovery needed
>> by the application. The application expects a certain set of CDL limits for its
>> use case, and checking that this set is the one currently defined on the drive
>> is easy to do from an application with the sysfs interface we added.
>>
>> Many users out there have deployed and are using applications taking advantage
>> of the ATA NCQ priority feature, using class RT for high priority IOs. The new CDL
>> class does not require many application changes to be enabled for next gen
>> drives that will have CDL.
> As I mentioned before, the new I/O priority class IOPRIO_CLASS_DL 
> makes it impossible to use a single I/O priority class across devices 
> that support CDL and devices that do not support CDL. I'm surprised that 
> you keep denying that IOPRIO_CLASS_DL is a royal pain for users who have 
> to support devices that support CDL and devices that do not support CDL.

I am not denying anything. I simply keep telling you that CDL is not a generic
feature for random applications to use, including those that already use
RT/BE/IDLE. It is for applications that know and expect it, and so have a setup
suited for CDL use down to the drive CDL descriptors. That includes DM setups.

Thinking about CDL in a generic setup for any random application to use is
nonsense. And even if that happens and a user not knowing about it still tries
it, then as mentioned, nothing bad will happen. Using CDL in a setup that does
not support it is a NOP. That would be the same as not using it.

> 
> Bart.
Martin K. Petersen Jan. 28, 2023, 8:25 p.m. UTC | #25
Damien,

Finally had a window where I could sit down and read this extremely long
thread.

> I am not denying anything. I simply keep telling you that CDL is not a
> generic feature for random applications to use, including those that
> already use RT/BE/IDLE. It is for applications that know and expect
> it, and so have a setup suited for CDL use down to the drive CDL
> descriptors. That includes DM setups.
>
> Thinking about CDL in a generic setup for any random application to
> use is nonsense. And even if that happens and a user not knowing about
> it still tries it, then as mentioned, nothing bad will happen. Using
> CDL in a setup that does not support it is a NOP. That would be the
> same as not using it.

My observations:

 - Wrt. ioprio as conduit, I personally really dislike the idea of
   conflating priority (relative performance wrt. other I/O) with CDL
   (which is a QoS concept). I would really prefer those things to be
   separate. However, I do think that the ioprio *interface* is a good
   fit. A tool like ionice seems like a reasonable approach to letting
   generic applications set their CDL.

   If bio space weren't at a premium, I'd say just keep things separate.
   But given the inherent incompatibility between kernel I/O scheduling
   and CDL, it's probably not worth the hassle to separate them. As much
   as it pains me to mix two concepts which should be completely
   orthogonal.

   I wish we could let applications specify both a priority and a CDL at
   the same time, though. Even if the kernel plumbing in the short term
   ends up using bi_ioprio as conduit. It's much harder to make changes
   in the application interface at a later date.

 - Wrt. "CDL is not a generic feature", I think you are underestimating
   how many applications actually want something like this. We have
   many.

   I don't think we should design for "special interest only, needs
   custom device tweaking to be usable". We have been down that path
   before (streams, etc.). And with poor results.

   I/O hints also tanked but at least we tried to pre-define performance
   classes that made sense in an abstract fashion. And programmed the
   mode page on discovered devices so that the classes were identical
   across all disks, regardless of whether they were SSDs or million
   dollar arrays. This allowed filesystems to communicate "this is
   metadata" regardless of the device the I/O was going to. Instead of
   "index 5 on this device" but "index 42 on the mirror".

   As such, I don't like the "just customize your settings with
   cdltools" approach. I'd much rather see us try to define a few QoS
   classes that make sense and apply to every app, and use those
   to define the application interface. And then have the kernel program
   those CDL classes into SCSI/ATA devices by default.

   Having the kernel provide an abstract interface for bio QoS and
   configuring a new disk with a sane handful of classes does not
   prevent $CLOUD_VENDOR from overriding what Linux configured. But at
   least we'd have a generic approach to block QoS in Linux. Similar to
   the existing I/O priority infrastructure which is also not tied to
   any particular hardware feature.

   A generic implementation also allows us to do fancy things in the
   hypervisor where we would like to be able to do QoS across multiple
   devices as well. Without having ATA or SCSI with CDL involved. Or
   whatever things might look like in NVMe.
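
To give the idea a concrete (and entirely hypothetical) shape, the abstract
classes might look something like this; the names and semantics are invented
for discussion, not taken from any posted patch:

/*
 * Hypothetical device-independent QoS classes that the kernel could
 * translate into CDL descriptors (or equivalents) on discovered devices.
 */
enum blk_qos_class {
	BLK_QOS_NONE = 0,	/* no duration expectation */
	BLK_QOS_BEST_EFFORT,	/* complete whenever convenient */
	BLK_QOS_LOW_LATENCY,	/* short overall duration guideline */
	BLK_QOS_BOUNDED_RETRY,	/* capped error-recovery (active) time */
};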
Damien Le Moal Jan. 29, 2023, 3:52 a.m. UTC | #26
On 1/29/23 05:25, Martin K. Petersen wrote:
> 
> Damien,
> 
> Finally had a window where I could sit down and read this extremely long
> thread.
> 
>> I am not denying anything. I simply keep telling you that CDL is not a
>> generic feature for random applications to use, including those that
>> already use RT/BE/IDLE. It is for applications that know and expect
>> it, and so have a setup suited for CDL use down to the drive CDL
>> descriptors. That includes DM setups.
>>
>> Thinking about CDL in a generic setup for any random application to
>> use is nonsense. And even if that happens and a user not knowing about
>> it still tries it, then as mentioned, nothing bad will happen. Using
>> CDL in a setup that does not support it is a NOP. That would be the
>> same as not using it.
> 
> My observations:
> 
>  - Wrt. ioprio as conduit, I personally really dislike the idea of
>    conflating priority (relative performance wrt. other I/O) with CDL
>    (which is a QoS concept). I would really prefer those things to be
>    separate. However, I do think that the ioprio *interface* is a good
>    fit. A tool like ionice seems like a reasonable approach to letting
>    generic applications set their CDL.

The definition of IOPRIO_CLASS_DL was more about reusing the ioprio *interface*
rather than having CDL support defined as a fully functional IO priority class.
As I argued in this thread, and I think you agree, the CDL semantics are more than
the simple priority class/level ordering.

>    If bio space weren't at a premium, I'd say just keep things separate.
>    But given the inherent incompatibility between kernel I/O scheduling
>    and CDL, it's probably not worth the hassle to separate them. As much
>    as it pains me to mix two concepts which should be completely
>    orthogonal.
> 
>    I wish we could let applications specify both a priority and a CDL at
>    the same time, though. Even if the kernel plumbing in the short term
>    ends up using bi_ioprio as conduit. It's much harder to make changes
>    in the application interface at a later date.

See below. There may be a solution about that.

>  - Wrt. "CDL is not a generic feature", I think you are underestimating
>    how many applications actually want something like this. We have
>    many.
> 
>    I don't think we should design for "special interest only, needs
>    custom device tweaking to be usable". We have been down that path
>    before (streams, etc.). And with poor results.

OK.

>    I/O hints also tanked but at least we tried to pre-define performance
>    classes that made sense in an abstract fashion. And programmed the
>    mode page on discovered devices so that the classes were identical
>    across all disks, regardless of whether they were SSDs or million
>    dollar arrays. This allowed filesystems to communicate "this is
>    metadata" regardless of the device the I/O was going to. Instead of
>    "index 5 on this device" but "index 42 on the mirror".
> 
>    As such, I don't like the "just customize your settings with
>    cdltools" approach. I'd much rather see us try to define a few QoS
>    classes that make sense that would apply to every app and use those
>    to define the application interface. And then have the kernel program
>    those CDL classes into SCSI/ATA devices by default.

Makes sense. Though I think it will be hard to define a set of QoS hints that
are useful for a wide range of applications, and even harder to convert the
defined hint classes to CDL descriptors. I fear that we may end up with the same
issues as IO hints/streams.

>    Having the kernel provide an abstract interface for bio QoS and
>    configuring a new disk with a sane handful of classes does not
>    prevent $CLOUD_VENDOR from overriding what Linux configured. But at
>    least we'd have a generic approach to block QoS in Linux. Similar to
>    the existing I/O priority infrastructure which is also not tied to
>    any particular hardware feature.

OK. See below about this.

>    A generic implementation also allows us to do fancy things in the
>    hypervisor where we would like to be able to do QoS across multiple
>    devices as well. Without having ATA or SCSI with CDL involved. Or
>    whatever things might look like in NVMe.

Fair point, especially given that virtio actually already forwards a guest
ioprio to the host through the virtio block command. Thinking of that particular
point together with what you said, I came up with the change shown below as a
replacement for this patch 1/18.

This changes the 13-bit ioprio data into a 3-bit QOS hint + 3 bits of IO prio
level. This is consistent with the IO prio interface since IO priority levels
have to be between 0 and 7 (otherwise, errors are returned). So in fact, the
upper 10 bits of the ioprio data are ignored and we can safely use 3 of these
bits for an IO hint.

This hint applies to all priority classes and levels, that is, for the CDL case,
we can enrich any priority with a hint that specifies the CDL index to use for
an IO.
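
For example, with the macros proposed in the patch below, composing and
decomposing such a value would look like this (a sketch; the BE class, level 2
and hint 3 are arbitrary):

/* Class BE, priority level 2, QOS hint (e.g. CDL index) 3. */
int ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE,
			       (3 << IOPRIO_QOS_HINT_SHIFT) | 2);

/*
 * IOPRIO_PRIO_CLASS(ioprio) == IOPRIO_CLASS_BE
 * IOPRIO_PRIO_LEVEL(ioprio) == 2
 * IOPRIO_QOS_HINT(ioprio)   == 3
 */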

This falls short of actually defining generic IO hints, but it has the
advantage of not breaking anything for current applications using IO priorities
and not requiring any change to existing IO schedulers, while still allowing CDL
indexes for IOs to be passed down to the scsi & ATA layers (which for now would
be the only layers in the kernel acting on the ioprio qos hints).

I think that this approach still allows us to enable CDL support, and on top of
it, go further and define generic QOS hints that IO schedulers can use and that
also potentially map to CDL for scsi & ata (similarly to the RT class IOs
mapping to the NCQ priority feature if the user enables that feature).

As mentioned above, I think that defining generic IO hint classes will be
difficult. But the change below is, I think, a good starting point that should
not prevent working on that.

Thoughts?

Bart,

Given that you did not like IOPRIO_CLASS_DL, what do you think of this
approach?


diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index ccf2204477a5..9b3c8fb806f1 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5378,11 +5378,11 @@ bfq_set_next_ioprio_data(struct bfq_queue *bfqq, struct
bfq_io_cq *bic)
 		bfqq->new_ioprio_class = task_nice_ioclass(tsk);
 		break;
 	case IOPRIO_CLASS_RT:
-		bfqq->new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
+		bfqq->new_ioprio = IOPRIO_PRIO_LEVEL(bic->ioprio);
 		bfqq->new_ioprio_class = IOPRIO_CLASS_RT;
 		break;
 	case IOPRIO_CLASS_BE:
-		bfqq->new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
+		bfqq->new_ioprio = IOPRIO_PRIO_LEVEL(bic->ioprio);
 		bfqq->new_ioprio_class = IOPRIO_CLASS_BE;
 		break;
 	case IOPRIO_CLASS_IDLE:
@@ -5671,7 +5671,7 @@ static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
 				       struct bfq_io_cq *bic,
 				       bool respawn)
 {
-	const int ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
+	const int ioprio = IOPRIO_PRIO_LEVEL(bic->ioprio);
 	const int ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
 	struct bfq_queue **async_bfqq = NULL;
 	struct bfq_queue *bfqq;
diff --git a/block/ioprio.c b/block/ioprio.c
index 32a456b45804..33f327a10811 100644
--- a/block/ioprio.c
+++ b/block/ioprio.c
@@ -33,7 +33,7 @@
 int ioprio_check_cap(int ioprio)
 {
 	int class = IOPRIO_PRIO_CLASS(ioprio);
-	int data = IOPRIO_PRIO_DATA(ioprio);
+	int level = IOPRIO_PRIO_LEVEL(ioprio);

 	switch (class) {
 		case IOPRIO_CLASS_RT:
@@ -49,13 +49,13 @@ int ioprio_check_cap(int ioprio)
 			fallthrough;
 			/* rt has prio field too */
 		case IOPRIO_CLASS_BE:
-			if (data >= IOPRIO_NR_LEVELS || data < 0)
+			if (level >= IOPRIO_NR_LEVELS || level < 0)
 				return -EINVAL;
 			break;
 		case IOPRIO_CLASS_IDLE:
 			break;
 		case IOPRIO_CLASS_NONE:
-			if (data)
+			if (level)
 				return -EINVAL;
 			break;
 		default:
diff --git a/fs/f2fs/sysfs.c b/fs/f2fs/sysfs.c
index 83a366f3ee80..eba23e6a7bf6 100644
--- a/fs/f2fs/sysfs.c
+++ b/fs/f2fs/sysfs.c
@@ -314,7 +314,7 @@ static ssize_t f2fs_sbi_show(struct f2fs_attr *a,
 		struct ckpt_req_control *cprc = &sbi->cprc_info;
 		int len = 0;
 		int class = IOPRIO_PRIO_CLASS(cprc->ckpt_thread_ioprio);
-		int data = IOPRIO_PRIO_DATA(cprc->ckpt_thread_ioprio);
+		int level = IOPRIO_PRIO_LEVEL(cprc->ckpt_thread_ioprio);

 		if (class == IOPRIO_CLASS_RT)
 			len += scnprintf(buf + len, PAGE_SIZE - len, "rt,");
@@ -323,7 +323,7 @@ static ssize_t f2fs_sbi_show(struct f2fs_attr *a,
 		else
 			return -EINVAL;

-		len += scnprintf(buf + len, PAGE_SIZE - len, "%d\n", data);
+		len += scnprintf(buf + len, PAGE_SIZE - len, "%d\n", level);
 		return len;
 	}

diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h
index f70f2596a6bf..1d90349a19c9 100644
--- a/include/uapi/linux/ioprio.h
+++ b/include/uapi/linux/ioprio.h
@@ -37,6 +37,18 @@ enum {
 #define IOPRIO_NR_LEVELS	8
 #define IOPRIO_BE_NR		IOPRIO_NR_LEVELS

+/*
+ * The 13-bits of ioprio data for each class provide up to 8 QOS hints and
+ * up to 8 priority levels.
+ */
+#define IOPRIO_PRIO_LEVEL_MASK	(IOPRIO_NR_LEVELS - 1)
+#define IOPRIO_QOS_HINT_SHIFT	10
+#define IOPRIO_NR_QOS_HINTS	8
+#define IOPRIO_QOS_HINT_MASK	(IOPRIO_NR_QOS_HINTS - 1)
+#define IOPRIO_PRIO_LEVEL(ioprio)	((ioprio) & IOPRIO_PRIO_LEVEL_MASK)
+#define IOPRIO_QOS_HINT(ioprio)	\
+	(((ioprio) >> IOPRIO_QOS_HINT_SHIFT) & IOPRIO_QOS_HINT_MASK)
+
 enum {
 	IOPRIO_WHO_PROCESS = 1,
 	IOPRIO_WHO_PGRP,
Damien Le Moal Jan. 30, 2023, 2:55 p.m. UTC | #27
On 1/30/23 22:44, Hannes Reinecke wrote:
> On 1/29/23 04:52, Damien Le Moal wrote:
>> On 1/29/23 05:25, Martin K. Petersen wrote:
> [ .. ]
>>>
>>>     As such, I don't like the "just customize your settings with
>>>     cdltools" approach. I'd much rather see us try to define a few QoS
>>>     classes that make sense and apply to every app, and use those
>>>     to define the application interface. And then have the kernel program
>>>     those CDL classes into SCSI/ATA devices by default.
>>
>> Makes sense. Though I think it will be hard to define a set of QoS hints that
>> are useful for a wide range of applications, and even harder to convert the
>> defined hint classes to CDL descriptors. I fear that we may end up with the same
>> issues as IO hints/streams.
>>
>>>     Having the kernel provide an abstract interface for bio QoS and
>>>     configuring a new disk with a sane handful of classes does not
>>>     prevent $CLOUD_VENDOR from overriding what Linux configured. But at
>>>     least we'd have a generic approach to block QoS in Linux. Similar to
>>>     the existing I/O priority infrastructure which is also not tied to
>>>     any particular hardware feature.
>>
>> OK. See below about this.
>>
>>>     A generic implementation also allows us to do fancy things in the
>>>     hypervisor where we would like to be able to do QoS across multiple
>>>     devices as well. Without having ATA or SCSI with CDL involved. Or
>>>     whatever things might look like in NVMe.
>>
>> Fair point, especially given that virtio actually already forwards a guest
>> ioprio to the host through the virtio block command. Thinking of that particular
>> point together with what you said, I came up with the change show below as a
>> replacement for this patch 1/18.
>>
>> This changes the 13-bit ioprio data into a 3-bit QOS hint + 3 bits of IO prio
>> level. This is consistent with the IO prio interface since IO priority levels
>> have to be between 0 and 7 (otherwise, errors are returned). So in fact, the
>> upper 10 bits of the ioprio data are ignored and we can safely use 3 of these
>> bits for an IO hint.
>>
>> This hint applies to all priority classes and levels, that is, for the CDL case,
>> we can enrich any priority with a hint that specifies the CDL index to use for
>> an IO.
>>
>> This falls short of actually defining generic IO hints, but it has the
>> advantage of not breaking anything for current applications using IO priorities
>> and not requiring any change to existing IO schedulers, while still allowing CDL
>> indexes for IOs to be passed down to the scsi & ATA layers (which for now would
>> be the only layers in the kernel acting on the ioprio qos hints).
>>
>> I think that this approach still allows us to enable CDL support, and on top of
>> it, go further and define generic QOS hints that IO schedulers can use and that
>> also potentially map to CDL for scsi & ata (similarly to the RT class IOs
>> mapping to the NCQ priority feature if the user enables that feature).
>>
>> As mentioned above, I think that defining generic IO hint classes will be
>> difficult. But the change below is, I think, a good starting point that should
>> not prevent working on that.
>>
>> Thoughts?
>>
> I like the idea.
> QoS is one of those recurring topics that always comes up sooner or later when 
> talking about storage networks, so having _some_ concept of QoS in the 
> Linux kernel (for storage) would be beneficial.
> 
> Maybe time for a topic at LSF?

Yes. I was hoping for a quicker resolution so that we can get the CDL
"mechanical" bits in, but without a nice API for it, we cannot :)
Trying to compile something with Niklas. So far, we are thinking of having
QOS flags + QOS data, the flags determining how (and if) the QOS data is used
and what it means.

Examples of things we could have (a quick sketch follows the list):
* IOPRIO_QOS_FAILFAST: do not retry the IO if it fails the first time
* IOPRIO_QOS_DURATION_LIMIT: the QOS data indicates the limit to use (a
number). That can be implemented in schedulers and can also map to CDL on
drives that support that feature.
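
A minimal sketch of what those could look like as flag definitions; the bit
positions are invented for discussion, only the two names above come from this
thread:

/* QOS flags (hypothetical encoding). */
#define IOPRIO_QOS_FAILFAST		(1U << 0)	/* do not retry failed IOs */
#define IOPRIO_QOS_DURATION_LIMIT	(1U << 1)	/* QOS data = limit index */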

That is the difficult part: what else? For now, considering only our target of
adding scsi & ata CDL support, the above is enough. But is that enough in
general for most users/apps?

> 
> Cheers,
> 
> Hannes
>
Bart Van Assche Jan. 30, 2023, 7:13 p.m. UTC | #28
On 1/28/23 12:25, Martin K. Petersen wrote:
>   - Wrt. ioprio as conduit, I personally really dislike the idea of
>     conflating priority (relative performance wrt. other I/O) with CDL
>     (which is a QoS concept). I would really prefer those things to be
>     separate. However, I do think that the ioprio *interface* is a good
>     fit. A tool like ionice seems like a reasonable approach to letting
>     generic applications set their CDL.

Hi Martin,

My understanding is that ionice uses the ioprio_set() system call and 
hence only affects foreground I/O but not page cache writeback. This is 
why I introduced the ioprio rq-qos policy (block/blk-ioprio.c). How 
about not adding CDL support in ioprio_set() and only supporting 
configuration of CDL via the v2 cgroup mechanism?
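
For instance, with the "duration-limits" policy name this patch adds to
block/blk-ioprio.c, a cgroup could be configured with something like the
following sketch (the cgroup path is an example):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Apply the CDL rq-qos policy to all IO issued for this cgroup,
	 * including page cache writeback done on its behalf. */
	int fd = open("/sys/fs/cgroup/mygrp/io.prio.class", O_WRONLY);

	if (fd < 0)
		return 1;
	if (write(fd, "duration-limits", strlen("duration-limits")) < 0)
		return 1;
	return close(fd);
}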

Thanks,

Bart.
Bart Van Assche Jan. 30, 2023, 7:24 p.m. UTC | #29
On 1/28/23 19:52, Damien Le Moal wrote:
> diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h
> index f70f2596a6bf..1d90349a19c9 100644
> +/*
> + * The 13-bits of ioprio data for each class provide up to 8 QOS hints and
> + * up to 8 priority levels.
> + */
> +#define IOPRIO_PRIO_LEVEL_MASK	(IOPRIO_NR_LEVELS - 1)
> +#define IOPRIO_QOS_HINT_SHIFT	10
> +#define IOPRIO_NR_QOS_HINTS	8
> +#define IOPRIO_QOS_HINT_MASK	(IOPRIO_NR_QOS_HINTS - 1)
> +#define IOPRIO_PRIO_LEVEL(ioprio)	((ioprio) & IOPRIO_PRIO_LEVEL_MASK)
> +#define IOPRIO_QOS_HINT(ioprio)	\
> +	(((ioprio) >> IOPRIO_QOS_HINT_SHIFT) & IOPRIO_QOS_HINT_MASK)
> +

Hi Damien,

How about the following approach?
* Do not add QoS support to the ioprio_set() system call since that
   system call only affects foreground I/O.
* Configure QoS via the v2 cgroup mechanism such that QoS policies are
   applied to both foreground and background I/O.

This approach makes it possible to use a different binary representation for I/O 
priorities and QoS in the bio.bi_ioprio field than what is supported by 
the ioprio_set() system call. This approach also allows to use names 
(strings) for QoS settings instead of numbers in the interface between 
user space and the kernel if that would be considered desirable.

Thanks,

Bart.
Bart Van Assche Jan. 30, 2023, 8:40 p.m. UTC | #30
On 1/28/23 19:52, Damien Le Moal wrote:
> +/*
> + * The 13-bits of ioprio data for each class provide up to 8 QOS hints and
> + * up to 8 priority levels.
> + */
> +#define IOPRIO_PRIO_LEVEL_MASK	(IOPRIO_NR_LEVELS - 1)
> +#define IOPRIO_QOS_HINT_SHIFT	10
> +#define IOPRIO_NR_QOS_HINTS	8
> +#define IOPRIO_QOS_HINT_MASK	(IOPRIO_NR_QOS_HINTS - 1)
> +#define IOPRIO_PRIO_LEVEL(ioprio)	((ioprio) & IOPRIO_PRIO_LEVEL_MASK)
> +#define IOPRIO_QOS_HINT(ioprio)	\
> +	(((ioprio) >> IOPRIO_QOS_HINT_SHIFT) & IOPRIO_QOS_HINT_MASK)

Does the QoS level really have to be encoded in bio.bi_ioprio? How about 
introducing a new field in the existing hole in struct bio? From the 
pahole output:

struct bio {
         struct bio *               bi_next;            /*     0     4 */
         struct block_device *      bi_bdev;            /*     4     4 */
         blk_opf_t                  bi_opf;             /*     8     4 */
         short unsigned int         bi_flags;           /*    12     2 */
         short unsigned int         bi_ioprio;          /*    14     2 */
         blk_status_t               bi_status;          /*    16     1 */

         /* XXX 3 bytes hole, try to pack */

         atomic_t                   __bi_remaining;     /*    20     4 */
         struct bvec_iter           bi_iter;            /*    24    20 */
         blk_qc_t                   bi_cookie;          /*    44     4 */
         bio_end_io_t *             bi_end_io;          /*    48     4 */
         void *                     bi_private;         /*    52     4 */
         struct bio_crypt_ctx *     bi_crypt_context;   /*    56     4 */
[ ... ]

Thanks,

Bart.
Martin K. Petersen Jan. 31, 2023, 2:49 a.m. UTC | #31
Damien,

> Makes sense. Though I think it will be hard to define a set of QoS
> hints that are useful for a wide range of applications, and even
> harder to convert the defined hint classes to CDL descriptors. I fear
> that we may end up with the same issues as IO hints/streams.

Hints mainly failed because non-Linux OSes had very different
expectations about how this was going to work. So that left device
vendors in a situation where they had to essentially support 3 different
approaches all implemented using the same protocol.

The challenge of being a general purpose OS is to come up with concepts
that are applicable in a variety of situations. Twiddling protocol
fields is the easy part.

I have a couple of experienced CDL users that I'd like to talk to and
try to get a better idea of what a suitable set of defaults might look
like.

> This hint applies to all priority classes and levels, that is, for the
> CDL case, we can enrich any priority with a hint that specifies the
> CDL index to use for an IO.

Yeah, I like that approach better.
Martin K. Petersen Jan. 31, 2023, 2:58 a.m. UTC | #32
Hi Bart!

> My understanding is that ionice uses the ioprio_set() system call and
> hence only affects foreground I/O but not page cache writeback. This
> is why I introduced the ioprio rq-qos policy (block/blk-ioprio.c). How
> about not adding CDL support in ioprio_set() and only supporting
> configuration of CDL via the v2 cgroup mechanism?

I suspect applications that care about CDL would probably not go through
the page cache. But I don't have a problem supporting cgroups at all.

Longer term, for the applications that I'm aware of that care about
this, we'd probably want to be able to specify things on a per-I/O basis
via io_uring, though.
Damien Le Moal Jan. 31, 2023, 3:03 a.m. UTC | #33
On 1/31/23 11:58, Martin K. Petersen wrote:
> 
> Hi Bart!
> 
>> My understanding is that ionice uses the ioprio_set() system call and
>> hence only affects foreground I/O but not page cache writeback. This
>> is why I introduced the ioprio rq-qos policy (block/blk-ioprio.c). How
>> about not adding CDL support in ioprio_set() and only supporting
>> configuration of CDL via the v2 cgroup mechanism?
> 
> I suspect applications that care about CDL would probably not go through
> the page cache. But I don't have a problem supporting cgroups at all.
> 
> Longer term, for the applications that I'm aware of that care about
> this, we'd probably want to be able to specify things on a per-I/O basis
> via io_uring, though.

Absolutely agree here. Similarly to the legacy ioprio, we need to be able to
specify per-IO (io_uring or libaio) and per-context (ioprio_set() or cgroups). I
see the per-IO and cgroups APIs as complementary.

Many of the use cases we are seeing for CDL are transitions from ATA NCQ prio,
which rely on the RT class set in the per-IO aio_reqprio field with
libaio/io_uring. I would like to have continuity with this to facilitate
application development. Having only cgroups as the API would rule out per-IO
use by async IO application engines.
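
As a sketch of that per-IO path with io_uring (using liburing; the IOPRIO_*
values match this series, and the descriptor index 1 is an arbitrary example):

#include <sys/types.h>
#include <liburing.h>

#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_PRIO_VALUE(class, data)	(((class) << IOPRIO_CLASS_SHIFT) | (data))
#define IOPRIO_CLASS_DL		4

static void queue_dl_read(struct io_uring *ring, int fd, void *buf,
			  unsigned int len, off_t off)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (!sqe)
		return;
	io_uring_prep_read(sqe, fd, buf, len, off);
	/* Tag just this read with CDL descriptor 1. */
	sqe->ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_DL, 1);
}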
Damien Le Moal Jan. 31, 2023, 3:10 a.m. UTC | #34
On 1/31/23 11:49, Martin K. Petersen wrote:
> 
> Damien,
> 
>> Makes sense. Though I think it will be hard to define a set of QoS
>> hints that are useful for a wide range of applications, and even
>> harder to convert the defined hint classes to CDL descriptors. I fear
>> that we may end up with the same issues as IO hints/streams.
> 
> Hints mainly failed because non-Linux OSes had very different
> expectations about how this was going to work. So that left device
> vendors in a situation where they had to essentially support 3 different
> approaches all implemented using the same protocol.
> 
> The challenge of being a general purpose OS is to come up with concepts
> that are applicable in a variety of situations. Twiddling protocol
> fields is the easy part.
> 
> I have a couple of experienced CDL users that I'd like to talk to and
> try to get a better idea of what a suitable set of defaults might look
> like.
> 
>> This hint applies to all priority classes and levels, that is, for the
>> CDL case, we can enrich any priority with a hint that specifies the
>> CDL index to use for an IO.
> 
> Yeah, I like that approach better.

Of note is that even though the IOPRIO_XXX macros in include/uapi/linux/ioprio.h
assume a 16-bit value for the priority class + data, of which only 6 bits are
usable (3 for the class, 3 for the level), all syscall and kernel internal
interfaces have ioprio defined as an int. So we have in fact 32 bits to play with.
We could keep the lower 16 bits for ioprio as they are, and use the upper 16 bits
for QOS hints. That is more room than the 10 bits between the prio class and level.

The only place that will need changing is struct bio since bi_ioprio is defined
as an unsigned short. To solve this, as Bart suggested, we could add another
unsigned short in the bio struct hole for the qos hints (bi_iohint or bi_ioqoshint).
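
A userspace mock of that layout (types approximate the kernel's on the 32-bit
build shown in the pahole output above; the new field's name is tentative):

#include <stdint.h>

/* Head of struct bio, mocked up to show that a 16-bit hint field fits
 * in the existing 3-byte hole after bi_status without growing the
 * structure: the size of this fragment stays 24 bytes on such a build. */
struct bio_head_mock {
	void		*bi_next;
	void		*bi_bdev;
	uint32_t	bi_opf;
	uint16_t	bi_flags;
	uint16_t	bi_ioprio;
	uint8_t		bi_status;
	uint8_t		pad;		/* 1 byte of the 3-byte hole remains */
	uint16_t	bi_ioqoshint;	/* new field, carved out of the hole */
	uint32_t	bi_remaining;
};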

But if we can define a sensible set of hints that covers at least CDL with the
10 free bits we have in the current ioprio, that would be even better I think
(fewer changes needed in the block layer).
Patch

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 815b884d6c5a..7add9346c585 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5545,6 +5545,14 @@  bfq_set_next_ioprio_data(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
 		bfqq->new_ioprio_class = IOPRIO_CLASS_IDLE;
 		bfqq->new_ioprio = 7;
 		break;
+	case IOPRIO_CLASS_DL:
+		/*
+		 * For the duration-limits class, we want the disk to do the
+		 * scheduling. So map all levels to the highest RT level.
+		 */
+		bfqq->new_ioprio = 0;
+		bfqq->new_ioprio_class = IOPRIO_CLASS_RT;
+		break;
 	}
 
 	if (bfqq->new_ioprio >= IOPRIO_NR_LEVELS) {
@@ -5673,6 +5681,8 @@  static struct bfq_queue **bfq_async_queue_prio(struct bfq_data *bfqd,
 		return &bfqg->async_bfqq[1][ioprio][act_idx];
 	case IOPRIO_CLASS_IDLE:
 		return &bfqg->async_idle_bfqq[act_idx];
+	case IOPRIO_CLASS_DL:
+		return &bfqg->async_bfqq[0][0][act_idx];
 	default:
 		return NULL;
 	}
diff --git a/block/blk-ioprio.c b/block/blk-ioprio.c
index 8bb6b8eba4ce..dfb5c3f447f4 100644
--- a/block/blk-ioprio.c
+++ b/block/blk-ioprio.c
@@ -27,6 +27,7 @@ 
  * @POLICY_RESTRICT_TO_BE: modify IOPRIO_CLASS_NONE and IOPRIO_CLASS_RT into
  *		IOPRIO_CLASS_BE.
  * @POLICY_ALL_TO_IDLE: change the I/O priority class into IOPRIO_CLASS_IDLE.
+ * @POLICY_ALL_TO_DL: change the I/O priority class into IOPRIO_CLASS_DL.
  *
  * See also <linux/ioprio.h>.
  */
@@ -35,6 +36,7 @@  enum prio_policy {
 	POLICY_NONE_TO_RT	= 1,
 	POLICY_RESTRICT_TO_BE	= 2,
 	POLICY_ALL_TO_IDLE	= 3,
+	POLICY_ALL_TO_DL	= 4,
 };
 
 static const char *policy_name[] = {
@@ -42,6 +44,7 @@  static const char *policy_name[] = {
 	[POLICY_NONE_TO_RT]	= "none-to-rt",
 	[POLICY_RESTRICT_TO_BE]	= "restrict-to-be",
 	[POLICY_ALL_TO_IDLE]	= "idle",
+	[POLICY_ALL_TO_DL]	= "duration-limits",
 };
 
 static struct blkcg_policy ioprio_policy;
diff --git a/block/ioprio.c b/block/ioprio.c
index 32a456b45804..1b3a9da82597 100644
--- a/block/ioprio.c
+++ b/block/ioprio.c
@@ -37,6 +37,7 @@  int ioprio_check_cap(int ioprio)
 
 	switch (class) {
 		case IOPRIO_CLASS_RT:
+		case IOPRIO_CLASS_DL:
 			/*
 			 * Originally this only checked for CAP_SYS_ADMIN,
 			 * which was implicitly allowed for pid 0 by security
@@ -47,7 +48,7 @@  int ioprio_check_cap(int ioprio)
 			if (!capable(CAP_SYS_ADMIN) && !capable(CAP_SYS_NICE))
 				return -EPERM;
 			fallthrough;
-			/* rt has prio field too */
+			/* RT and DL have prio field too */
 		case IOPRIO_CLASS_BE:
 			if (data >= IOPRIO_NR_LEVELS || data < 0)
 				return -EINVAL;
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index f10c2a0d18d4..526d0ea4dbf9 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -113,6 +113,7 @@  static const enum dd_prio ioprio_class_to_prio[] = {
 	[IOPRIO_CLASS_RT]	= DD_RT_PRIO,
 	[IOPRIO_CLASS_BE]	= DD_BE_PRIO,
 	[IOPRIO_CLASS_IDLE]	= DD_IDLE_PRIO,
+	[IOPRIO_CLASS_DL]	= DD_RT_PRIO,
 };
 
 static inline struct rb_root *
diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
index 7578d4f6a969..2f3fc2fbd668 100644
--- a/include/linux/ioprio.h
+++ b/include/linux/ioprio.h
@@ -20,7 +20,7 @@  static inline bool ioprio_valid(unsigned short ioprio)
 {
 	unsigned short class = IOPRIO_PRIO_CLASS(ioprio);
 
-	return class > IOPRIO_CLASS_NONE && class <= IOPRIO_CLASS_IDLE;
+	return class > IOPRIO_CLASS_NONE && class <= IOPRIO_CLASS_DL;
 }
 
 /*
diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h
index f70f2596a6bf..15908b9e9d8c 100644
--- a/include/uapi/linux/ioprio.h
+++ b/include/uapi/linux/ioprio.h
@@ -29,6 +29,7 @@  enum {
 	IOPRIO_CLASS_RT,
 	IOPRIO_CLASS_BE,
 	IOPRIO_CLASS_IDLE,
+	IOPRIO_CLASS_DL,
 };
 
 /*
@@ -37,6 +38,12 @@  enum {
 #define IOPRIO_NR_LEVELS	8
 #define IOPRIO_BE_NR		IOPRIO_NR_LEVELS
 
+/*
+ * The Duration limits class allows 8 levels: level 0 for "no limit" and levels
+ * 1 to 7, each corresponding to a read or write limit descriptor.
+ */
+#define IOPRIO_DL_NR_LEVELS	8
+
 enum {
 	IOPRIO_WHO_PROCESS = 1,
 	IOPRIO_WHO_PGRP,