diff mbox series

[v2,03/15] block: Support data lifetime in the I/O priority bitfield

Message ID 20231005194129.1882245-4-bvanassche@acm.org
State New
Headers show
Series Pass data temperature information to UFS devices | expand

Commit Message

Bart Van Assche Oct. 5, 2023, 7:40 p.m. UTC
The NVMe and SCSI standards define 64 different data lifetimes. Support
storing this information in the I/O priority bitfield.

The current allocation of the 16 bits in the I/O priority bitfield is as
follows:
* 15..13: I/O priority class
* 12..6: unused
* 5..3: I/O hint (CDL)
* 2..0: I/O priority level

This patch changes this into the following:
* 15..13: I/O priority class
* 12: unused
* 11..6: data lifetime
* 5..3: I/O hint (CDL)
* 2..0: I/O priority level

Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 include/uapi/linux/ioprio.h | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Kanchan Joshi Oct. 6, 2023, 6:42 a.m. UTC | #1
On Thu, Oct 05, 2023 at 12:40:49PM -0700, Bart Van Assche wrote:
>The NVMe and SCSI standards define 64 different data lifetimes. Support
>storing this information in the I/O priority bitfield.
>
>The current allocation of the 16 bits in the I/O priority bitfield is as
>follows:
>* 15..13: I/O priority class
>* 12..6: unused
>* 5..3: I/O hint (CDL)
>* 2..0: I/O priority level
>
>This patch changes this into the following:
>* 15..13: I/O priority class
>* 12: unused
>* 11..6: data lifetime
>* 5..3: I/O hint (CDL)
>* 2..0: I/O priority level
>
>Cc: Damien Le Moal <dlemoal@kernel.org>
>Cc: Niklas Cassel <niklas.cassel@wdc.com>
>Signed-off-by: Bart Van Assche <bvanassche@acm.org>
>---
> include/uapi/linux/ioprio.h | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
>diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h
>index bee2bdb0eedb..efe9bc450872 100644
>--- a/include/uapi/linux/ioprio.h
>+++ b/include/uapi/linux/ioprio.h
>@@ -71,7 +71,7 @@ enum {
>  * class and level.
>  */
> #define IOPRIO_HINT_SHIFT		IOPRIO_LEVEL_NR_BITS
>-#define IOPRIO_HINT_NR_BITS		10
>+#define IOPRIO_HINT_NR_BITS		3

Should the comment[*] also be modified to reflect this change?

[*]
/*
 * The 10 bits between the priority class and the priority level are used to
 * optionally define I/O hints for any combination of I/O priority class and
Damien Le Moal Oct. 6, 2023, 8:19 a.m. UTC | #2
On 10/6/23 04:40, Bart Van Assche wrote:
> The NVMe and SCSI standards define 64 different data lifetimes. Support
> storing this information in the I/O priority bitfield.
> 
> The current allocation of the 16 bits in the I/O priority bitfield is as
> follows:
> * 15..13: I/O priority class
> * 12..6: unused
> * 5..3: I/O hint (CDL)
> * 2..0: I/O priority level
> 
> This patch changes this into the following:
> * 15..13: I/O priority class
> * 12: unused
> * 11..6: data lifetime
> * 5..3: I/O hint (CDL)
> * 2..0: I/O priority level
> 
> Cc: Damien Le Moal <dlemoal@kernel.org>
> Cc: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  include/uapi/linux/ioprio.h | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h
> index bee2bdb0eedb..efe9bc450872 100644
> --- a/include/uapi/linux/ioprio.h
> +++ b/include/uapi/linux/ioprio.h
> @@ -71,7 +71,7 @@ enum {
>   * class and level.
>   */
>  #define IOPRIO_HINT_SHIFT		IOPRIO_LEVEL_NR_BITS
> -#define IOPRIO_HINT_NR_BITS		10
> +#define IOPRIO_HINT_NR_BITS		3
>  #define IOPRIO_NR_HINTS			(1 << IOPRIO_HINT_NR_BITS)
>  #define IOPRIO_HINT_MASK		(IOPRIO_NR_HINTS - 1)
>  #define IOPRIO_PRIO_HINT(ioprio)	\
> @@ -102,6 +102,12 @@ enum {
>  	IOPRIO_HINT_DEV_DURATION_LIMIT_7 = 7,
>  };
>  
> +#define IOPRIO_LIFETIME_SHIFT		(IOPRIO_HINT_SHIFT + IOPRIO_HINT_NR_BITS)
> +#define IOPRIO_LIFETIME_NR_BITS		6
> +#define IOPRIO_LIFETIME_MASK		((1u << IOPRIO_LIFETIME_NR_BITS) - 1)
> +#define IOPRIO_PRIO_LIFETIME(ioprio)					\
> +	((ioprio >> IOPRIO_LIFETIME_SHIFT) & IOPRIO_LIFETIME_MASK)
> +
>  #define IOPRIO_BAD_VALUE(val, max) ((val) < 0 || (val) >= (max))

I am really not a fan of this. This essentially limits prio hints to CDL, while
the initial intent was to define the hints as something generic that depend on
the device features. With your change, we will not be able to support new
features in the future.

Your change seem to assume that it makes sense to be able to combine CDL with
lifetime hints. But does it really ? CDL is of dubious value for solid state
media and as far as I know, UFS world has not expressed interest. Conversely,
data lifetime hints do not make much sense for spin rust media where CDL is
important. So I would say that the combination of CDL and lifetime hints is of
dubious value.

Given this, why not simply define the 64 possible lifetime values as plain hint
values (8 to 71, following 1 to 7 for CDL) ?

The other question here if you really want to keep the bit separation approach
is: do we really need up to 64 different lifetime hints ? While the scsi
standard allows that much, does this many different lifetime make sense in
practice ? Can we ever think of a usecase that needs more than say 8 different
liftimes (3 bits) ? If you limit the number of possible lifetime hints to 8,
then we can keep 4 bits unused in the hint field for future features.
Niklas Cassel Oct. 6, 2023, 9:53 a.m. UTC | #3
On Fri, Oct 06, 2023 at 05:19:52PM +0900, Damien Le Moal wrote:
> On 10/6/23 04:40, Bart Van Assche wrote:
> > The NVMe and SCSI standards define 64 different data lifetimes. Support
> > storing this information in the I/O priority bitfield.
> > 
> > The current allocation of the 16 bits in the I/O priority bitfield is as
> > follows:
> > * 15..13: I/O priority class
> > * 12..6: unused
> > * 5..3: I/O hint (CDL)
> > * 2..0: I/O priority level
> > 
> > This patch changes this into the following:
> > * 15..13: I/O priority class
> > * 12: unused
> > * 11..6: data lifetime
> > * 5..3: I/O hint (CDL)
> > * 2..0: I/O priority level
> > 
> > Cc: Damien Le Moal <dlemoal@kernel.org>
> > Cc: Niklas Cassel <niklas.cassel@wdc.com>
> > Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> > ---
> >  include/uapi/linux/ioprio.h | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h
> > index bee2bdb0eedb..efe9bc450872 100644
> > --- a/include/uapi/linux/ioprio.h
> > +++ b/include/uapi/linux/ioprio.h
> > @@ -71,7 +71,7 @@ enum {
> >   * class and level.
> >   */
> >  #define IOPRIO_HINT_SHIFT		IOPRIO_LEVEL_NR_BITS
> > -#define IOPRIO_HINT_NR_BITS		10
> > +#define IOPRIO_HINT_NR_BITS		3

Can we really redefine this?
This is defined to 10 in released kernels, e.g. kernel v6.5.

Perhaps a better option would be to keep this defined to 10,
and then add new macros that define "specific" IO prio hint "classes".

something like
IOPRIO_HINT_NR_BITS                10

IOPRIO_HINT_CDL
IOPRIO_HINT_CDL_NR_BITS
IOPRIO_HINT_LIFETIME
IOPRIO_HINT_LIFETIME_NR_BITS

lifetime is really a hint, so I think it makes sense for it to be part of
the "IOPRIO_HINT bits".


> >  #define IOPRIO_NR_HINTS			(1 << IOPRIO_HINT_NR_BITS)
> >  #define IOPRIO_HINT_MASK		(IOPRIO_NR_HINTS - 1)
> >  #define IOPRIO_PRIO_HINT(ioprio)	\
> > @@ -102,6 +102,12 @@ enum {
> >  	IOPRIO_HINT_DEV_DURATION_LIMIT_7 = 7,
> >  };
> >  
> > +#define IOPRIO_LIFETIME_SHIFT		(IOPRIO_HINT_SHIFT + IOPRIO_HINT_NR_BITS)
> > +#define IOPRIO_LIFETIME_NR_BITS		6
> > +#define IOPRIO_LIFETIME_MASK		((1u << IOPRIO_LIFETIME_NR_BITS) - 1)
> > +#define IOPRIO_PRIO_LIFETIME(ioprio)					\
> > +	((ioprio >> IOPRIO_LIFETIME_SHIFT) & IOPRIO_LIFETIME_MASK)
> > +
> >  #define IOPRIO_BAD_VALUE(val, max) ((val) < 0 || (val) >= (max))
> 
> I am really not a fan of this. This essentially limits prio hints to CDL, while
> the initial intent was to define the hints as something generic that depend on
> the device features. With your change, we will not be able to support new
> features in the future.
> 
> Your change seem to assume that it makes sense to be able to combine CDL with
> lifetime hints. But does it really ? CDL is of dubious value for solid state
> media and as far as I know, UFS world has not expressed interest. Conversely,
> data lifetime hints do not make much sense for spin rust media where CDL is
> important. So I would say that the combination of CDL and lifetime hints is of
> dubious value.
> 
> Given this, why not simply define the 64 possible lifetime values as plain hint
> values (8 to 71, following 1 to 7 for CDL) ?
> 
> The other question here if you really want to keep the bit separation approach
> is: do we really need up to 64 different lifetime hints ? While the scsi
> standard allows that much, does this many different lifetime make sense in
> practice ? Can we ever think of a usecase that needs more than say 8 different
> liftimes (3 bits) ? If you limit the number of possible lifetime hints to 8,
> then we can keep 4 bits unused in the hint field for future features.

I think the question is: do we ever want to allow ioprio hints to be combined
or not?

If the answer is no, then go ahead and add the life time hints as values
8 to 71.

If the answer is yes, then life time hints need to use unique bits, even if it
might not make sense for CDL and life times to be combined.


Kind regards,
Niklas
Bart Van Assche Oct. 6, 2023, 6:07 p.m. UTC | #4
On 10/6/23 01:19, Damien Le Moal wrote:
> Your change seem to assume that it makes sense to be able to combine CDL with
> lifetime hints. But does it really ? CDL is of dubious value for solid state
> media and as far as I know, UFS world has not expressed interest. Conversely,
> data lifetime hints do not make much sense for spin rust media where CDL is
> important. So I would say that the combination of CDL and lifetime hints is of
> dubious value.
> 
> Given this, why not simply define the 64 possible lifetime values as plain hint
> values (8 to 71, following 1 to 7 for CDL) ?
> 
> The other question here if you really want to keep the bit separation approach
> is: do we really need up to 64 different lifetime hints ? While the scsi
> standard allows that much, does this many different lifetime make sense in
> practice ? Can we ever think of a usecase that needs more than say 8 different
> liftimes (3 bits) ? If you limit the number of possible lifetime hints to 8,
> then we can keep 4 bits unused in the hint field for future features.

Hi Damien,

Not supporting CDL for solid state media and supporting eight different
lifetime values sounds good to me. Is this perhaps what you had in mind?

Thanks,

Bart.

--- a/include/uapi/linux/ioprio.h
+++ b/include/uapi/linux/ioprio.h
@@ -100,6 +100,14 @@ enum {
         IOPRIO_HINT_DEV_DURATION_LIMIT_5 = 5,
         IOPRIO_HINT_DEV_DURATION_LIMIT_6 = 6,
         IOPRIO_HINT_DEV_DURATION_LIMIT_7 = 7,
+       IOPRIO_HINT_DATA_LIFE_TIME_0 = 8,
+       IOPRIO_HINT_DATA_LIFE_TIME_1 = 9,
+       IOPRIO_HINT_DATA_LIFE_TIME_2 = 10,
+       IOPRIO_HINT_DATA_LIFE_TIME_3 = 11,
+       IOPRIO_HINT_DATA_LIFE_TIME_4 = 12,
+       IOPRIO_HINT_DATA_LIFE_TIME_5 = 13,
+       IOPRIO_HINT_DATA_LIFE_TIME_6 = 14,
+       IOPRIO_HINT_DATA_LIFE_TIME_7 = 15,
  };
Bart Van Assche Oct. 12, 2023, 6 p.m. UTC | #5
On 10/11/23 18:02, Damien Le Moal wrote:
> Some have stated interest in CDL in NVMe-oF context, which could
> imply that combining CDL and lifetime may be something useful to do
> in that space...

We are having this discussion because bi_ioprio is sixteen bits wide and
because we don't want to make struct bio larger. How about expanding the
bi_ioprio field from 16 to 32 bits and to use separate bits for CDL
information and data lifetimes?

This patch does not make struct bio bigger because it changes a three
byte hole with a one byte hole:

--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -268,8 +268,8 @@ struct bio {
                                                  * req_flags.
                                                  */
         unsigned short          bi_flags;       /* BIO_* below */
-       unsigned short          bi_ioprio;
         blk_status_t            bi_status;
+       u32                     bi_ioprio;
         atomic_t                __bi_remaining;

         struct bvec_iter        bi_iter;


 From the pahole output:

struct bio {
         struct bio *               bi_next;              /*     0     8 */
         struct block_device *      bi_bdev;              /*     8     8 */
         blk_opf_t                  bi_opf;               /*    16     4 */
         short unsigned int         bi_flags;             /*    20     2 */
         blk_status_t               bi_status;            /*    22     1 */

         /* XXX 1 byte hole, try to pack */

         u32                        bi_ioprio;            /*    24     4 */
         atomic_t                   __bi_remaining;       /*    28     4 */
         struct bvec_iter           bi_iter;              /*    32    20 */
         blk_qc_t                   bi_cookie;            /*    52     4 */
         bio_end_io_t *             bi_end_io;            /*    56     8 */
         /* --- cacheline 1 boundary (64 bytes) --- */
         void *                     bi_private;           /*    64     8 */
         struct bio_crypt_ctx *     bi_crypt_context;     /*    72     8 */
         union {
                 struct bio_integrity_payload * bi_integrity; /*    80 
   8 */
         };                                               /*    80     8 */
         short unsigned int         bi_vcnt;              /*    88     2 */
         short unsigned int         bi_max_vecs;          /*    90     2 */
         atomic_t                   __bi_cnt;             /*    92     4 */
         struct bio_vec *           bi_io_vec;            /*    96     8 */
         struct bio_set *           bi_pool;              /*   104     8 */
         struct bio_vec             bi_inline_vecs[];     /*   112     0 */

         /* size: 112, cachelines: 2, members: 19 */
         /* sum members: 111, holes: 1, sum holes: 1 */
         /* last cacheline: 48 bytes */
};

Thanks,

Bart.
Damien Le Moal Oct. 13, 2023, 1:08 a.m. UTC | #6
On 10/13/23 03:00, Bart Van Assche wrote:
> On 10/11/23 18:02, Damien Le Moal wrote:
>> Some have stated interest in CDL in NVMe-oF context, which could
>> imply that combining CDL and lifetime may be something useful to do
>> in that space...
> 
> We are having this discussion because bi_ioprio is sixteen bits wide and
> because we don't want to make struct bio larger. How about expanding the
> bi_ioprio field from 16 to 32 bits and to use separate bits for CDL
> information and data lifetimes?

I guess we could do that as well. User side aio_reqprio field of struct aiocb,
which is used by io_uring and libaio, is an int, so 32-bits also. Changing
bi_ioprio to match that should not cause regressions or break user space I
think. Kernel uapi ioprio.h will need some massaging though.

> This patch does not make struct bio bigger because it changes a three
> byte hole with a one byte hole:

Yeah, but if the kernel is compiled with struct randomization, that does not
really apply, doesn't it ?

Reading Niklas's reply to Kanchan, I was reminded that using ioprio hint for
the lifetime may have one drawback: that information will be propagated to the
device only for direct IOs, no ? For buffered IOs, the information will be
lost. The other potential disadvantage of the ioprio interface is that we
cannot define ioprio+hint per file (or per inode really), unlike the old
write_hint that you initially reintroduced. Are these points blockers for the
user API you were thinking of ? How do you envision the user specifying
lifetime ? Per file ? Or are you thinking of not relying on the user to specify
that but rather the FS (e.g. f2fs) deciding on its own ? If it is the latter, I
think ioprio+hint is fine (it is simple). But if it is the former, the ioprio
API may not be the best suited for the job at hand.
Niklas Cassel Oct. 13, 2023, 9:33 a.m. UTC | #7
On Fri, Oct 13, 2023 at 10:08:29AM +0900, Damien Le Moal wrote:
> On 10/13/23 03:00, Bart Van Assche wrote:
> > On 10/11/23 18:02, Damien Le Moal wrote:
> >> Some have stated interest in CDL in NVMe-oF context, which could
> >> imply that combining CDL and lifetime may be something useful to do
> >> in that space...
> > 
> > We are having this discussion because bi_ioprio is sixteen bits wide and
> > because we don't want to make struct bio larger. How about expanding the
> > bi_ioprio field from 16 to 32 bits and to use separate bits for CDL
> > information and data lifetimes?
> 
> I guess we could do that as well. User side aio_reqprio field of struct aiocb,
> which is used by io_uring and libaio, is an int, so 32-bits also. Changing
> bi_ioprio to match that should not cause regressions or break user space I
> think. Kernel uapi ioprio.h will need some massaging though.
> 
> > This patch does not make struct bio bigger because it changes a three
> > byte hole with a one byte hole:
> 
> Yeah, but if the kernel is compiled with struct randomization, that does not
> really apply, doesn't it ?
> 
> Reading Niklas's reply to Kanchan, I was reminded that using ioprio hint for
> the lifetime may have one drawback: that information will be propagated to the
> device only for direct IOs, no ? For buffered IOs, the information will be
> lost. The other potential disadvantage of the ioprio interface is that we
> cannot define ioprio+hint per file (or per inode really), unlike the old
> write_hint that you initially reintroduced. Are these points blockers for the
> user API you were thinking of ? How do you envision the user specifying
> lifetime ? Per file ? Or are you thinking of not relying on the user to specify
> that but rather the FS (e.g. f2fs) deciding on its own ? If it is the latter, I
> think ioprio+hint is fine (it is simple). But if it is the former, the ioprio
> API may not be the best suited for the job at hand.

Hello Damien,

If you look closer at this series, you will see that even V2 of this series
still uses fcntl F_SET_RW_HINT as the user facing API.

This series simply take the value from fcntl F_SET_RW_HINT
(inode->i_write_hint) and stores it in bio->ioprio.

So it is not really using the ioprio user API.

See the patch to e.g. buffered-io.c:

--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -6,6 +6,7 @@
 #include <linux/module.h>
 #include <linux/compiler.h>
 #include <linux/fs.h>
+#include <linux/fs-lifetime.h>
 #include <linux/iomap.h>
 #include <linux/pagemap.h>
 #include <linux/uio.h>
@@ -1660,6 +1661,7 @@ iomap_alloc_ioend(struct inode *inode, struct iomap_writepage_ctx *wpc,
 			       REQ_OP_WRITE | wbc_to_write_flags(wbc),
 			       GFP_NOFS, &iomap_ioend_bioset);
 	bio->bi_iter.bi_sector = sector;
+	bio_set_data_lifetime(bio, inode->i_write_hint);
 	wbc_init_bio(wbc, bio);
 
 	ioend = container_of(bio, struct iomap_ioend, io_inline_bio);




In commit c75e707fe1aa ("block: remove the per-bio/request write hint")
this line from fs/direct-io.c was removed:
-       bio->bi_write_hint = dio->iocb->ki_hint;

I'm not sure why this series does not readd a similar line to set the
lifetime (using bio_set_data_lifetime()) also for fs/direct-io.c.


I still don't understand what happens if one uses io_uring to write
to a file on a f2fs filesystem using buffered-io, with both
inode->i_write_hint set using fcntl F_SET_RW_HINT, and bits belonging
to life time hints set in the io_uring SQE (sqe->ioprio).

I'm guessing that the:
bio_set_data_lifetime(bio, inode->i_write_hint);
call above in buffered-io.c, will simply overwrite whatever value
that was already stored in bio->ioprio. (Because if I understand
correctly, bio->ioprio will initially be set to the value in the
io_uring SQE (sqe->ioprio).)


Kind regards,
Niklas
Bart Van Assche Oct. 13, 2023, 8:18 p.m. UTC | #8
On 10/12/23 18:08, Damien Le Moal wrote:
> On 10/13/23 03:00, Bart Van Assche wrote:
>> We are having this discussion because bi_ioprio is sixteen bits wide and
>> because we don't want to make struct bio larger. How about expanding the
>> bi_ioprio field from 16 to 32 bits and to use separate bits for CDL
>> information and data lifetimes?
> 
> I guess we could do that as well. User side aio_reqprio field of struct aiocb,
> which is used by io_uring and libaio, is an int, so 32-bits also. Changing
> bi_ioprio to match that should not cause regressions or break user space I
> think. Kernel uapi ioprio.h will need some massaging though.

Hmm ... are we perhaps looking at different kernel versions? This is
what I found:

$ git grep -nHE 'ioprio;|reqprio;' include/uapi/linux/{io_uring,aio_abi}.h
include/uapi/linux/aio_abi.h:89:	__s16	aio_reqprio;
include/uapi/linux/io_uring.h:33:	__u16	ioprio;		/* ioprio for the 
request */

The struct iocb used for asynchronous I/O has a size of 64 bytes and
does not have any holes. struct io_uring_sqe also has a size of 64 bytes
and does not have any holes either. The ioprio_set() and ioprio_get()
system calls use the data type int so these wouldn't need any changes to
increase the number of ioprio bits.

> Reading Niklas's reply to Kanchan, I was reminded that using ioprio hint for
> the lifetime may have one drawback: that information will be propagated to the
> device only for direct IOs, no ? For buffered IOs, the information will be
> lost. The other potential disadvantage of the ioprio interface is that we
> cannot define ioprio+hint per file (or per inode really), unlike the old
> write_hint that you initially reintroduced. Are these points blockers for the
> user API you were thinking of ? How do you envision the user specifying
> lifetime ? Per file ? Or are you thinking of not relying on the user to specify
> that but rather the FS (e.g. f2fs) deciding on its own ? If it is the latter, I
> think ioprio+hint is fine (it is simple). But if it is the former, the ioprio
> API may not be the best suited for the job at hand.

The way I see it is that the primary purpose of the bits in the
bi_ioprio member that are used for the data lifetime is to allow
filesystems to provide data lifetime information to block drivers.

Specifying data lifetime information for direct I/O is convenient when
writing test scripts that verify whether data lifetime supports works
correctly. There may be other use cases but this is not my primary
focus.

I think that applications that want to specify data lifetime information
should use fcntl(fd, F_SET_RW_HINT, ...). It is up to the filesystem to
make sure that this information ends up in the bi_ioprio field. The
block layer is responsible for passing the information in the bi_ioprio
member to block drivers. Filesystems can support multiple policies for
combining the i_write_hint and other information into a data lifetime.
See also the whint_mode restored by patch 05/15 in this series.

Thanks,

Bart.
Bart Van Assche Oct. 13, 2023, 9:20 p.m. UTC | #9
On 10/13/23 02:33, Niklas Cassel wrote:
> In commit c75e707fe1aa ("block: remove the per-bio/request write hint")
> this line from fs/direct-io.c was removed:
> -       bio->bi_write_hint = dio->iocb->ki_hint;
> 
> I'm not sure why this series does not readd a similar line to set the
> lifetime (using bio_set_data_lifetime()) also for fs/direct-io.c.

It depends on how we want the user to specify the data lifetime for
direct I/O. This assignment is not modified by this patch series and
copies the data lifetime information from the ioprio bitfield from user
space into the bio:

		bio->bi_ioprio = dio->iocb->ki_ioprio;

> I still don't understand what happens if one uses io_uring to write
> to a file on a f2fs filesystem using buffered-io, with both
> inode->i_write_hint set using fcntl F_SET_RW_HINT, and bits belonging
> to life time hints set in the io_uring SQE (sqe->ioprio).

Is the documentation of the whint_mode mount option in patch 5/15 of this
series sufficient to answer the above question?

Thanks,

Bart.
Damien Le Moal Oct. 15, 2023, 10:22 p.m. UTC | #10
On 10/14/23 05:18, Bart Van Assche wrote:
> On 10/12/23 18:08, Damien Le Moal wrote:
>> On 10/13/23 03:00, Bart Van Assche wrote:
>>> We are having this discussion because bi_ioprio is sixteen bits wide and
>>> because we don't want to make struct bio larger. How about expanding the
>>> bi_ioprio field from 16 to 32 bits and to use separate bits for CDL
>>> information and data lifetimes?
>>
>> I guess we could do that as well. User side aio_reqprio field of struct aiocb,
>> which is used by io_uring and libaio, is an int, so 32-bits also. Changing
>> bi_ioprio to match that should not cause regressions or break user space I
>> think. Kernel uapi ioprio.h will need some massaging though.
> 
> Hmm ... are we perhaps looking at different kernel versions? This is
> what I found:
> 
> $ git grep -nHE 'ioprio;|reqprio;' include/uapi/linux/{io_uring,aio_abi}.h
> include/uapi/linux/aio_abi.h:89:	__s16	aio_reqprio;
> include/uapi/linux/io_uring.h:33:	__u16	ioprio;		/* ioprio for the 
> request */

My bad. I looked at "man aio" but that is the posix AIO API, not Linux native.

> The struct iocb used for asynchronous I/O has a size of 64 bytes and
> does not have any holes. struct io_uring_sqe also has a size of 64 bytes
> and does not have any holes either. The ioprio_set() and ioprio_get()
> system calls use the data type int so these wouldn't need any changes to
> increase the number of ioprio bits.

Yes, but I think it would be better to keep the bio bi_ioprio field size synced
with the per AIO aio_reqprio/ioprio for libaio and io_uring, that is, 16-bits.

>> Reading Niklas's reply to Kanchan, I was reminded that using ioprio hint for
>> the lifetime may have one drawback: that information will be propagated to the
>> device only for direct IOs, no ? For buffered IOs, the information will be
>> lost. The other potential disadvantage of the ioprio interface is that we
>> cannot define ioprio+hint per file (or per inode really), unlike the old
>> write_hint that you initially reintroduced. Are these points blockers for the
>> user API you were thinking of ? How do you envision the user specifying
>> lifetime ? Per file ? Or are you thinking of not relying on the user to specify
>> that but rather the FS (e.g. f2fs) deciding on its own ? If it is the latter, I
>> think ioprio+hint is fine (it is simple). But if it is the former, the ioprio
>> API may not be the best suited for the job at hand.
> 
> The way I see it is that the primary purpose of the bits in the
> bi_ioprio member that are used for the data lifetime is to allow
> filesystems to provide data lifetime information to block drivers.
> 
> Specifying data lifetime information for direct I/O is convenient when
> writing test scripts that verify whether data lifetime supports works
> correctly. There may be other use cases but this is not my primary
> focus.
> 
> I think that applications that want to specify data lifetime information
> should use fcntl(fd, F_SET_RW_HINT, ...). It is up to the filesystem to
> make sure that this information ends up in the bi_ioprio field. The
> block layer is responsible for passing the information in the bi_ioprio
> member to block drivers. Filesystems can support multiple policies for
> combining the i_write_hint and other information into a data lifetime.
> See also the whint_mode restored by patch 05/15 in this series.

Explaining this in the cover letter of the series would be helpful for one to
understand your view of how the information is propagated from user to device.

I am not a fan of having a fcntl() call ending up modifying the ioprio of IOs
using hints, given that hints in themselves are already a user facing
information/API. This is confusing... What if we have a user issue direct IOs
with a lifetime value hint on a file that has a different lifetime set with
fcntl() ? And I am sure there are other corner cases like this.

Given that lifetime is per file (inode) and IO prio is per process or per I/O,
having different user APIs makes sense. The issue of not growing (if possible)
the bio and request structures remains. For bio, you identified a hole already,
so what about using another 16-bits field for lifetime ? Not sure for requests.
I thought also of a union with bi_ioprio, but that would prevent using lifetime
and IO priority together, which is not ideal.

> 
> Thanks,
> 
> Bart.
Christoph Hellwig Oct. 16, 2023, 6:17 a.m. UTC | #11
On Fri, Oct 06, 2023 at 05:19:52PM +0900, Damien Le Moal wrote:
> Your change seem to assume that it makes sense to be able to combine CDL with
> lifetime hints. But does it really ? 

Yes, it does.


> CDL is of dubious value for solid state
> media and as far as I know,

No, it's pretty useful and I'd bet my 2 cents that it will eventually
show up in relevant standards and devices.

Even if it wasn't making our user interfaces exclusive would be a
massive pain.

> The other question here if you really want to keep the bit separation approach
> is: do we really need up to 64 different lifetime hints ? While the scsi
> standard allows that much, does this many different lifetime make sense in
> practice ? Can we ever think of a usecase that needs more than say 8 different
> liftimes (3 bits) ? If you limit the number of possible lifetime hints to 8,
> then we can keep 4 bits unused in the hint field for future features.

Yes, I think this is the smoking gun.  We should be fine with a much
more limited number of lifetime hints, i.e. the user interface only
exposes 5 hints, and supporting more in the in-kernel interfaces seems
of rather doubtfuĊ€ use.
Niklas Cassel Oct. 16, 2023, 9:20 a.m. UTC | #12
On Fri, Oct 13, 2023 at 02:20:23PM -0700, Bart Van Assche wrote:
> On 10/13/23 02:33, Niklas Cassel wrote:
> > In commit c75e707fe1aa ("block: remove the per-bio/request write hint")
> > this line from fs/direct-io.c was removed:
> > -       bio->bi_write_hint = dio->iocb->ki_hint;
> > 
> > I'm not sure why this series does not readd a similar line to set the
> > lifetime (using bio_set_data_lifetime()) also for fs/direct-io.c.
> 
> It depends on how we want the user to specify the data lifetime for
> direct I/O. This assignment is not modified by this patch series and
> copies the data lifetime information from the ioprio bitfield from user
> space into the bio:
> 
> 		bio->bi_ioprio = dio->iocb->ki_ioprio;

Before per-bio/request write hints were removed, things looked like this:

io_uring.c:
req->rw.kiocb.ki_hint = ki_hint_validate(file_write_hint(req->file));

fs/fcntl.c:
static inline enum rw_hint file_write_hint(struct file *file)
{
	if (file->f_write_hint != WRITE_LIFE_NOT_SET)
		return file->f_write_hint;

	return file_inode(file)->i_write_hint;
}

direct-io.c:
bio->bi_write_hint = dio->iocb->ki_hint;

buffered-io.c:
bio->bi_write_hint = inode->i_write_hint;



After this series, things instead look like this:

direct-io.c:
bio->bi_ioprio = dio->iocb->ki_ioprio;

buffered-io.c:
bio_set_data_lifetime(bio, inode->i_write_hint);


So when you say:
"It depends on how we want the user to specify the data lifetime for
direct I/O.", do you mean that buffered I/O should use fcntl() to specify
data lifetime, but direct I/O should used Linux IO priority API to specify
the same?

Because, to me that seems to be how the series is currently working.
(I am sorry if I am missing something.)


Kind regards,
Niklas
Bart Van Assche Oct. 16, 2023, 4:31 p.m. UTC | #13
On 10/15/23 15:22, Damien Le Moal wrote:
> Given that lifetime is per file (inode) and IO prio is per process or per I/O,
> having different user APIs makes sense. The issue of not growing (if possible)
> the bio and request structures remains. For bio, you identified a hole already,
> so what about using another 16-bits field for lifetime ? Not sure for requests.
> I thought also of a union with bi_ioprio, but that would prevent using lifetime
> and IO priority together, which is not ideal.

There is a challenge: F_SET_RW_HINT / F_GET_RW_HINT are per inode and
hence submitting direct I/O with different lifetimes to a single file
by opening that file multiple times and by using F_SET_FILE_RW_HINT
won't be possible. fio supports this use case. See also fio commit
bd553af6c849 ("Update API for file write hints"). If nobody objects I
won't restore F_SET_FILE_RW_HINT support.

Furthermore, I plan to drop the bi_ioprio changes and introduce a new u8
struct bio member instead: bi_lifetime. I think 8 bits is enough since
the NVMe and SCSI data lifetime fields are six bits wide.

There is a two byte hole in struct request past the ioprio field. I'd
like to use that space for a new data lifetime member.

Thanks,

Bart.
Bart Van Assche Oct. 16, 2023, 4:32 p.m. UTC | #14
On 10/15/23 23:17, Christoph Hellwig wrote:
> On Fri, Oct 06, 2023 at 05:19:52PM +0900, Damien Le Moal wrote:
>> Your change seem to assume that it makes sense to be able to combine CDL with
>> lifetime hints. But does it really ?
> 
> Yes, it does.
> 
> 
>> CDL is of dubious value for solid state
>> media and as far as I know,
> 
> No, it's pretty useful and I'd bet my 2 cents that it will eventually
> show up in relevant standards and devices.
> 
> Even if it wasn't making our user interfaces exclusive would be a
> massive pain.
> 
>> The other question here if you really want to keep the bit separation approach
>> is: do we really need up to 64 different lifetime hints ? While the scsi
>> standard allows that much, does this many different lifetime make sense in
>> practice ? Can we ever think of a usecase that needs more than say 8 different
>> liftimes (3 bits) ? If you limit the number of possible lifetime hints to 8,
>> then we can keep 4 bits unused in the hint field for future features.
> 
> Yes, I think this is the smoking gun.  We should be fine with a much
> more limited number of lifetime hints, i.e. the user interface only
> exposes 5 hints, and supporting more in the in-kernel interfaces seems
> of rather doubtfuĊ€ use.

Thanks for the feedback Christoph. I will address this feedback in the
next version of this patch series.

Bart.
Bart Van Assche Oct. 16, 2023, 4:36 p.m. UTC | #15
On 10/16/23 02:20, Niklas Cassel wrote:
> So when you say:
> "It depends on how we want the user to specify the data lifetime for
> direct I/O.", do you mean that buffered I/O should use fcntl() to specify
> data lifetime, but direct I/O should used Linux IO priority API to specify
> the same?
> 
> Because, to me that seems to be how the series is currently working.
> (I am sorry if I am missing something.)

Let's drop support for passing the data lifetime in the I/O priority
field from user space as Damien proposed. The remaining question is
whether only to restore support for F_SET_RW_HINT (per inode) or also
for F_SET_FILE_RW_HINT (per file)?

Thanks,

Bart.
diff mbox series

Patch

diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h
index bee2bdb0eedb..efe9bc450872 100644
--- a/include/uapi/linux/ioprio.h
+++ b/include/uapi/linux/ioprio.h
@@ -71,7 +71,7 @@  enum {
  * class and level.
  */
 #define IOPRIO_HINT_SHIFT		IOPRIO_LEVEL_NR_BITS
-#define IOPRIO_HINT_NR_BITS		10
+#define IOPRIO_HINT_NR_BITS		3
 #define IOPRIO_NR_HINTS			(1 << IOPRIO_HINT_NR_BITS)
 #define IOPRIO_HINT_MASK		(IOPRIO_NR_HINTS - 1)
 #define IOPRIO_PRIO_HINT(ioprio)	\
@@ -102,6 +102,12 @@  enum {
 	IOPRIO_HINT_DEV_DURATION_LIMIT_7 = 7,
 };
 
+#define IOPRIO_LIFETIME_SHIFT		(IOPRIO_HINT_SHIFT + IOPRIO_HINT_NR_BITS)
+#define IOPRIO_LIFETIME_NR_BITS		6
+#define IOPRIO_LIFETIME_MASK		((1u << IOPRIO_LIFETIME_NR_BITS) - 1)
+#define IOPRIO_PRIO_LIFETIME(ioprio)					\
+	((ioprio >> IOPRIO_LIFETIME_SHIFT) & IOPRIO_LIFETIME_MASK)
+
 #define IOPRIO_BAD_VALUE(val, max) ((val) < 0 || (val) >= (max))
 
 /*