[v1,0/6] no-copy bvec

Message ID: cover.1607976425.git.asml.silence@gmail.com

Message

Pavel Begunkov Dec. 15, 2020, 12:20 a.m. UTC
Instead of creating a full copy of iter->bvec into bio in direct I/O,
the patchset makes use of the one provided. It changes semantics and
obliges users of asynchronous kiocb to track bvec lifetime, and [1/6]
converts the only place that doesn't.
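
The gist of it is along these lines (a rough sketch, not the series'
exact code; the flag and field assignments here are from memory):

/*
 * Sketch: point the bio at the iterator's bvec array instead of
 * copying it. The submitter must keep the array and its pages alive
 * until the I/O completes.
 */
static void bio_iov_bvec_set_sketch(struct bio *bio, struct iov_iter *iter)
{
	bio->bi_vcnt = iter->nr_segs;
	bio->bi_max_vecs = iter->nr_segs;
	bio->bi_io_vec = (struct bio_vec *)iter->bvec;
	bio->bi_iter.bi_size = iov_iter_count(iter);
	/* pages aren't pinned here, so don't drop refs on completion */
	bio_set_flag(bio, BIO_NO_PAGE_REF);
}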

bio_iov_iter_get_pages() still does iov_iter_advance(), which is not
great but necessary for the revert to work. It would be desirable to
have a fast version of iov_iter_advance(i, i->count), so we may want to
hack something up for that, e.g. by allowing the iterator to be left
inconsistent in some cases once i->count == 0; see the sketch below.
We could also add a separate bio pool without the inlined bvec. That is
very easy to do and shrinks bios from 3 to 2 cachelines.
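
For the fast advance, even something as blunt as this might do (purely
a sketch; only valid if nothing inspects the iterator's internals once
i->count == 0):

/* hypothetical fast path for iov_iter_advance(i, i->count) */
static inline void iov_iter_advance_to_end(struct iov_iter *i)
{
	/* deliberately leaves iov_offset/nr_segs stale; callers may
	 * only look at i->count after this */
	i->count = 0;
}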

Also, as suggested, it removes BIO_WORKINGSET from the direct paths:
blkdev, iomap and fs/direct-io. The last one matters less as more
filesystems are converted to iomap, but the handling there still looks
hacky. Maybe, as Johannes mentioned in another thread, moving it into
the writeback code (or some other option) would be better in the end;
that can be done afterwards.

since RFC:
- add target_core_file patch by Christoph
- make no-copy default behaviour, remove iter flag
- use iov_iter_advance() instead of hacks to make the revert work
- add bvec iter_advance() optimisation patch
- remove PSI annotations from direct IO (iomap, block and fs/direct)
- note in d/f/porting

Christoph Hellwig (1):
  target/file: allocate the bvec array as part of struct
    target_core_file_cmd

Pavel Begunkov (5):
  iov_iter: optimise bvec iov_iter_advance()
  bio: deduplicate adding a page into bio
  block/psi: remove PSI annotations from direct IO
  bio: add a helper calculating nr segments to alloc
  block/iomap: don't copy bvec for direct IO

 Documentation/filesystems/porting.rst |   9 +++
 block/bio.c                           | 103 ++++++++++++--------------
 drivers/target/target_core_file.c     |  20 ++---
 fs/block_dev.c                        |   7 +-
 fs/direct-io.c                        |   2 +
 fs/iomap/direct-io.c                  |   9 +--
 include/linux/bio.h                   |   9 +++
 lib/iov_iter.c                        |  19 +++++
 8 files changed, 102 insertions(+), 76 deletions(-)

Comments

Dave Chinner Dec. 15, 2020, 12:56 a.m. UTC | #1
On Tue, Dec 15, 2020 at 12:20:23AM +0000, Pavel Begunkov wrote:
> As reported, we must not do pressure stall information accounting for
> direct IO, because otherwise it tells that it's thrashing a page when
> actually doing IO on hot data.
> 
> Apparently, bio_iov_iter_get_pages() is used only by paths doing direct
> IO, so just make it avoid setting BIO_WORKINGSET, it also saves us CPU
> cycles on doing that. For fs/direct-io.c just clear the flag before
> submit_bio(), it's not of much concern performance-wise.
> 
> Reported-by: Christoph Hellwig <hch@infradead.org>
> Suggested-by: Christoph Hellwig <hch@infradead.org>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  block/bio.c    | 25 ++++++++++++++++---------
>  fs/direct-io.c |  2 ++
>  2 files changed, 18 insertions(+), 9 deletions(-)
.....
> @@ -1099,6 +1103,9 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
>   * fit into the bio, or are requested in @iter, whatever is smaller. If
>   * MM encounters an error pinning the requested pages, it stops. Error
>   * is returned only if 0 pages could be pinned.
> + *
> + * It also doesn't set BIO_WORKINGSET, so is intended for direct IO. If used
> + * otherwise the caller is responsible to do that to keep PSI happy.
>   */
>  int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  {
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index d53fa92a1ab6..914a7f600ecd 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -426,6 +426,8 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
>  	unsigned long flags;
>  
>  	bio->bi_private = dio;
> +	/* PSI is only for paging IO */
> +	bio_clear_flag(bio, BIO_WORKINGSET);

Why only do this for the old direct IO path? Why isn't this
necessary for the iomap DIO path?

Cheers,

Dave.
Dave Chinner Dec. 15, 2020, 1:09 a.m. UTC | #2
On Tue, Dec 15, 2020 at 12:20:25AM +0000, Pavel Begunkov wrote:
> The block layer spends quite a while in blkdev_direct_IO() to copy and
> initialise bio's bvec. However, if we've already got a bvec in the input
> iterator it might be reused in some cases, i.e. when new
> ITER_BVEC_FLAG_FIXED flag is set. Simple tests show considerable
> performance boost, and it also reduces memory footprint.
> 
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  Documentation/filesystems/porting.rst |  9 ++++
>  block/bio.c                           | 64 +++++++++++----------------
>  include/linux/bio.h                   |  3 ++
>  3 files changed, 38 insertions(+), 38 deletions(-)

This doesn't touch iomap code, so the title of the patch seems
wrong...

> +For bvec based itererators bio_iov_iter_get_pages() now doesn't copy bvecs but
> +uses the one provided. Anyone issuing kiocb-I/O should ensure that the bvec and
> +page references stay until I/O has completed, i.e. until ->ki_complete() has
> +been called or returned with non -EIOCBQUEUED code.

This is hard to follow. Perhaps:

bio_iov_iter_get_pages() uses the bvecs provided for bvec based
iterators rather than copying them. Hence anyone issuing kiocb based
IO needs to ensure the bvecs and pages stay referenced until the
submitted I/O is completed by a call to ->ki_complete() or returns
with an error other than -EIOCBQUEUED.
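
i.e. the submitter side needs to look something like this (purely
illustrative, made-up names):

struct my_aio {
	struct kiocb	iocb;
	struct bio_vec	*bvec;	/* must outlive the whole I/O */
};

static void my_aio_complete(struct kiocb *iocb, long res, long res2)
{
	struct my_aio *a = container_of(iocb, struct my_aio, iocb);

	/* only now is it safe to release the bvec array and its pages */
	kfree(a->bvec);
	kfree(a);
}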

> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 2a9f3f0bbe0a..337f4280b639 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -444,6 +444,9 @@ static inline void bio_wouldblock_error(struct bio *bio)
>  
>  static inline int bio_iov_vecs_to_alloc(struct iov_iter *iter, int max_segs)
>  {
> +	/* reuse iter->bvec */
> +	if (iov_iter_is_bvec(iter))
> +		return 0;
>  	return iov_iter_npages(iter, max_segs);

Ah, I'm a blind idiot... :/

Cheers,

Dave.
Pavel Begunkov Dec. 15, 2020, 11:14 a.m. UTC | #3
On 15/12/2020 01:41, Ming Lei wrote:
> On Tue, Dec 15, 2020 at 12:20:19AM +0000, Pavel Begunkov wrote:
>> Instead of creating a full copy of iter->bvec into bio in direct I/O,
>> the patchset makes use of the one provided. It changes semantics and
>> obliges users of asynchronous kiocb to track bvec lifetime, and [1/6]
>> converts the only place that doesn't.
> 
> Just think of one corner case: iov_iter(BVEC) may pass bvec table with zero
> length bvec, which may not be supported by block layer or driver, so
> this patchset has to address this case first.

The easiest for me would be to fall back to copying if there are zero
bvecs, e.g. when finding such during iov_iter_alignment(), but do we
know where zero bvecs can come from? As they're internals we may want
to forbid them if there is not too much hassle.
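
Detection would be trivial, e.g. (sketch):

/* hypothetical: check a bvec iterator for zero-length segments so the
 * submission path can fall back to copying */
static bool iov_iter_bvec_has_zero_len(const struct iov_iter *i)
{
	unsigned long n;

	for (n = 0; n < i->nr_segs; n++) {
		if (!i->bvec[n].bv_len)
			return true;
	}
	return false;
}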

> 
> Please see 7e24969022cb ("block: allow for_each_bvec to support zero len bvec").
Pavel Begunkov Dec. 15, 2020, 11:41 a.m. UTC | #4
On 15/12/2020 01:33, Dave Chinner wrote:
> On Tue, Dec 15, 2020 at 01:03:45AM +0000, Pavel Begunkov wrote:
>> On 15/12/2020 00:56, Dave Chinner wrote:
>>> On Tue, Dec 15, 2020 at 12:20:23AM +0000, Pavel Begunkov wrote:
>>>> As reported, we must not do pressure stall information accounting for
>>>> direct IO, because otherwise it tells that it's thrashing a page when
>>>> actually doing IO on hot data.
>>>>
>>>> Apparently, bio_iov_iter_get_pages() is used only by paths doing direct
>>>> IO, so just make it avoid setting BIO_WORKINGSET, it also saves us CPU
>>>> cycles on doing that. For fs/direct-io.c just clear the flag before
>>>> submit_bio(), it's not of much concern performance-wise.
>>>>
>>>> Reported-by: Christoph Hellwig <hch@infradead.org>
>>>> Suggested-by: Christoph Hellwig <hch@infradead.org>
>>>> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
>>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>>>> ---
>>>>  block/bio.c    | 25 ++++++++++++++++---------
>>>>  fs/direct-io.c |  2 ++
>>>>  2 files changed, 18 insertions(+), 9 deletions(-)
>>> .....
>>>> @@ -1099,6 +1103,9 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
>>>>   * fit into the bio, or are requested in @iter, whatever is smaller. If
>>>>   * MM encounters an error pinning the requested pages, it stops. Error
>>>>   * is returned only if 0 pages could be pinned.
>>>> + *
>>>> + * It also doesn't set BIO_WORKINGSET, so is intended for direct IO. If used
>>>> + * otherwise the caller is responsible to do that to keep PSI happy.
>>>>   */
>>>>  int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>>>>  {
>>>> diff --git a/fs/direct-io.c b/fs/direct-io.c
>>>> index d53fa92a1ab6..914a7f600ecd 100644
>>>> --- a/fs/direct-io.c
>>>> +++ b/fs/direct-io.c
>>>> @@ -426,6 +426,8 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
>>>>  	unsigned long flags;
>>>>  
>>>>  	bio->bi_private = dio;
>>>> +	/* PSI is only for paging IO */
>>>> +	bio_clear_flag(bio, BIO_WORKINGSET);
>>>
>>> Why only do this for the old direct IO path? Why isn't this
>>> necessary for the iomap DIO path?
>>
>> It's in the description. In short, block and iomap dio use
>> bio_iov_iter_get_pages(), which with this patch doesn't use
>> [__]bio_add_page() and so doesn't set the flag. 
> 
> That is not obvious to someone not intimately familiar with the
> patchset you are working on. You described -what- the code is doing,
> not -why- the flag needs to be cleared here.

It's missing the link between BIO_WORKINGSET and PSI, but otherwise it
describes both what it does and how. I'll reword it for the next
iteration.

> 
> "Direct IO does not operate on the current working set of pages
> managed by the kernel, so it should not be accounted as IO to the
> pressure stall tracking infrastructure. Only direct IO paths use
> bio_iov_iter_get_pages() to build bios, so to avoid PSI tracking of
> direct IO don't flag the bio with BIO_WORKINGSET in this function.
> 
> fs/direct-io.c uses <some other function> to build the bio we
> are going to submit and so still flags the bio with BIO_WORKINGSET.
> Rather than convert it to use bio_iov_iter_get_pages() to avoid
> flagging the bio, we simply clear the BIO_WORKINGSET flag before
> submitting the bio."
> 
> Cheers,
> 
> Dave.
>
Pavel Begunkov Dec. 15, 2020, 2:05 p.m. UTC | #5
On 15/12/2020 12:03, Ming Lei wrote:
> On Tue, Dec 15, 2020 at 11:14:20AM +0000, Pavel Begunkov wrote:
>> On 15/12/2020 01:41, Ming Lei wrote:
>>> On Tue, Dec 15, 2020 at 12:20:19AM +0000, Pavel Begunkov wrote:
>>>> Instead of creating a full copy of iter->bvec into bio in direct I/O,
>>>> the patchset makes use of the one provided. It changes semantics and
>>>> obliges users of asynchronous kiocb to track bvec lifetime, and [1/6]
>>>> converts the only place that doesn't.
>>>
>>> Just think of one corner case: iov_iter(BVEC) may pass bvec table with zero
>>> length bvec, which may not be supported by block layer or driver, so
>>> this patchset has to address this case first.
>>
>> The easiest for me would be to fall back to copying if there are zero
>> bvecs, e.g. when finding such during iov_iter_alignment(), but do we
>> know where zero bvecs can come from? As they're internals we may want
>> to forbid them if there is not too much hassle.
> 
> You may find clue from the following link:
> 
> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg2262077.html

Thanks for the link!

Al, you mentioned "Zero-length segments are not disallowed"; do you have
a strong opinion on that? Apart from the behaviour already diverging from
the block layer's and getting in the way of this series, disallowing them
would also let us remove some extra ifs, e.g. in iterate_bvec().
Christoph Hellwig Dec. 22, 2020, 2:11 p.m. UTC | #6
On Tue, Dec 15, 2020 at 02:05:35PM +0000, Pavel Begunkov wrote:
> > You may find clue from the following link:
> > 
> > https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg2262077.html
> 
> Thanks for the link!
> 
> Al, you mentioned "Zero-length segments are not disallowed"; do you have
> a strong opinion on that? Apart from the behaviour already diverging from
> the block layer's and getting in the way of this series, disallowing them
> would also let us remove some extra ifs, e.g. in iterate_bvec().

I'd prefer not to support zero-length ITER_BVEC and catching them
early, as the block layer can't deal with them either.  From a quick
look at iter_file_splice_write it should be pretty trivial to fix there,
although we'll need to audit other callers as well (even if I don't
expect them to submit this degenerate case).
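
Roughly, it amounts to skipping empty pipe buffers when gathering the
bvec array, e.g. with a helper like this (untested sketch, names made
up):

/* untested sketch: add a pipe buffer to the bvec array, skipping
 * empty buffers so no zero-length bvec is ever emitted */
static unsigned int splice_add_bvec(struct bio_vec *array, unsigned int n,
				    struct page *page, unsigned int len,
				    unsigned int offset)
{
	if (len) {
		array[n].bv_page = page;
		array[n].bv_len = len;
		array[n].bv_offset = offset;
		n++;
	}
	return n;
}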
Pavel Begunkov Dec. 23, 2020, 12:52 p.m. UTC | #7
On 22/12/2020 14:11, Christoph Hellwig wrote:
> On Tue, Dec 15, 2020 at 02:05:35PM +0000, Pavel Begunkov wrote:
>>> You may find clue from the following link:
>>>
>>> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg2262077.html
>>
>> Thanks for the link!
>>
>> Al, you mentioned "Zero-length segments are not disallowed"; do you have
>> a strong opinion on that? Apart from the behaviour already diverging from
>> the block layer's and getting in the way of this series, disallowing them
>> would also let us remove some extra ifs, e.g. in iterate_bvec().
> 
> I'd prefer not to support zero-length ITER_BVEC and catching them
> early, as the block layer can't deal with them either.  From a quick
> look at iter_file_splice_write it should be pretty trivial to fix there,
> although we'll need to audit other callers as well (even if I don't
> expect them to submit this degenerate case).

Can scatterlist have 0-len entries? Those are directly translated into
bvecs, e.g. in nvme/target/io-cmd-file.c and target/target_core_file.c.
I've audited most of the others by this moment, they're fine.
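
For reference, the translation in question is a plain 1:1 copy, so a
0-len sg entry becomes a 0-len bvec (simplified, from memory):

/* simplified from target_core_file.c: sg entries become bvecs 1:1 */
static size_t fd_sgl_to_bvec(struct bio_vec *bvec, struct scatterlist *sgl,
			     int sgl_nents)
{
	struct scatterlist *sg;
	size_t len = 0;
	int i;

	for_each_sg(sgl, sg, sgl_nents, i) {
		bvec[i].bv_page = sg_page(sg);
		bvec[i].bv_len = sg->length;
		bvec[i].bv_offset = sg->offset;
		len += sg->length;
	}
	return len;
}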

Thanks for the other nits, they will go into the next version.

-- 
Pavel Begunkov
Christoph Hellwig Dec. 23, 2020, 3:51 p.m. UTC | #8
On Wed, Dec 23, 2020 at 12:52:59PM +0000, Pavel Begunkov wrote:
> Can scatterlist have 0-len entries? Those are directly translated into
> bvecs, e.g. in nvme/target/io-cmd-file.c and target/target_core_file.c.
> I've audited most of the others by this moment, they're fine.

For block layer SGLs we should never see them, and for nvme neither.
I think the same is true for the SCSI target code, but please double
check.
James Bottomley Dec. 23, 2020, 4:04 p.m. UTC | #9
On Wed, 2020-12-23 at 15:51 +0000, Christoph Hellwig wrote:
> On Wed, Dec 23, 2020 at 12:52:59PM +0000, Pavel Begunkov wrote:
> > Can scatterlist have 0-len entries? Those are directly translated
> > into bvecs, e.g. in nvme/target/io-cmd-file.c and
> > target/target_core_file.c. I've audited most of the others by this
> > moment, they're fine.
> 
> For block layer SGLs we should never see them, and for nvme neither.
> I think the same is true for the SCSI target code, but please double
> check.

Right, no-one ever wants to see a 0-len scatter list entry.  The reason
is that every driver uses the sgl to program the device DMA engine in
the way NVME does.  A 0-length sgl would be a dangerous corner case:
some DMA engines would ignore it and others would go haywire, so if we
ever let a 0-length list down into the driver, they'd have to
understand the corner case behaviour of their DMA engine and filter it
accordingly, which is why we disallow them in the upper levels, since
they're effectively nops anyway.

James
Douglas Gilbert Dec. 23, 2020, 8:23 p.m. UTC | #10
On 2020-12-23 11:04 a.m., James Bottomley wrote:
> On Wed, 2020-12-23 at 15:51 +0000, Christoph Hellwig wrote:
>> On Wed, Dec 23, 2020 at 12:52:59PM +0000, Pavel Begunkov wrote:
>>> Can scatterlist have 0-len entries? Those are directly translated
>>> into bvecs, e.g. in nvme/target/io-cmd-file.c and
>>> target/target_core_file.c. I've audited most of the others by this
>>> moment, they're fine.
>>
>> For block layer SGLs we should never see them, and for nvme neither.
>> I think the same is true for the SCSI target code, but please double
>> check.
> 
> Right, no-one ever wants to see a 0-len scatter list entry.  The reason
> is that every driver uses the sgl to program the device DMA engine in
> the way NVME does.  A 0-length sgl would be a dangerous corner case:
> some DMA engines would ignore it and others would go haywire, so if we
> ever let a 0-length list down into the driver, they'd have to
> understand the corner case behaviour of their DMA engine and filter it
> accordingly, which is why we disallow them in the upper levels, since
> they're effectively nops anyway.

When using scatter gather lists at the far end (i.e. on the storage device)
the T10 examples (WRITE SCATTERED and POPULATE TOKEN in SBC-4) explicitly
allow the "number of logical blocks" in their sgl_s to be zero and state
that it is _not_ to be considered an error.

Doug Gilbert
Pavel Begunkov Dec. 23, 2020, 8:32 p.m. UTC | #11
On 23/12/2020 20:23, Douglas Gilbert wrote:
> On 2020-12-23 11:04 a.m., James Bottomley wrote:
>> On Wed, 2020-12-23 at 15:51 +0000, Christoph Hellwig wrote:
>>> On Wed, Dec 23, 2020 at 12:52:59PM +0000, Pavel Begunkov wrote:
>>>> Can scatterlist have 0-len entries? Those are directly translated
>>>> into bvecs, e.g. in nvme/target/io-cmd-file.c and
>>>> target/target_core_file.c. I've audited most of the others by this
>>>> moment, they're fine.
>>>
>>> For block layer SGLs we should never see them, and for nvme neither.
>>> I think the same is true for the SCSI target code, but please double
>>> check.
>>
>> Right, no-one ever wants to see a 0-len scatter list entry.  The reason
>> is that every driver uses the sgl to program the device DMA engine in
>> the way NVME does.  A 0-length sgl would be a dangerous corner case:
>> some DMA engines would ignore it and others would go haywire, so if we
>> ever let a 0-length list down into the driver, they'd have to
>> understand the corner case behaviour of their DMA engine and filter it
>> accordingly, which is why we disallow them in the upper levels, since
>> they're effectively nops anyway.
> 
> When using scatter gather lists at the far end (i.e. on the storage device)
> the T10 examples (WRITE SCATTERED and POPULATE TOKEN in SBC-4) explicitly
> allow the "number of logical blocks" in their sgl_s to be zero and state
> that it is _not_ to be considered an error.

It's fine for my case unless it leaks them out of the device driver to
the net/block layer/etc. Does it?

-- 
Pavel Begunkov
Christoph Hellwig Dec. 24, 2020, 6:41 a.m. UTC | #12
On Wed, Dec 23, 2020 at 08:32:45PM +0000, Pavel Begunkov wrote:
> On 23/12/2020 20:23, Douglas Gilbert wrote:
> > On 2020-12-23 11:04 a.m., James Bottomley wrote:
> >> On Wed, 2020-12-23 at 15:51 +0000, Christoph Hellwig wrote:
> >>> On Wed, Dec 23, 2020 at 12:52:59PM +0000, Pavel Begunkov wrote:
> >>>> Can scatterlist have 0-len entries? Those are directly translated
> >>>> into bvecs, e.g. in nvme/target/io-cmd-file.c and
> >>>> target/target_core_file.c. I've audited most of the others by this
> >>>> moment, they're fine.
> >>>
> >>> For block layer SGLs we should never see them, and for nvme neither.
> >>> I think the same is true for the SCSI target code, but please double
> >>> check.
> >>
> >> Right, no-one ever wants to see a 0-len scatter list entry.  The reason
> >> is that every driver uses the sgl to program the device DMA engine in
> >> the way NVME does.  A 0-length sgl would be a dangerous corner case:
> >> some DMA engines would ignore it and others would go haywire, so if we
> >> ever let a 0-length list down into the driver, they'd have to
> >> understand the corner case behaviour of their DMA engine and filter it
> >> accordingly, which is why we disallow them in the upper levels, since
> >> they're effectively nops anyway.
> >
> > When using scatter gather lists at the far end (i.e. on the storage device)
> > the T10 examples (WRITE SCATTERED and POPULATE TOKEN in SBC-4) explicitly
> > allow the "number of logical blocks" in their sgl_s to be zero and state
> > that it is _not_ to be considered an error.
> 
> It's fine for my case unless it leaks them out of the device driver to
> the net/block layer/etc. Does it?

None of the SCSI commands mentioned above are supported by Linux, never
mind mapped to struct scatterlist.
Douglas Gilbert Dec. 24, 2020, 4:45 p.m. UTC | #13
On 2020-12-24 1:41 a.m., Christoph Hellwig wrote:
> On Wed, Dec 23, 2020 at 08:32:45PM +0000, Pavel Begunkov wrote:
>> On 23/12/2020 20:23, Douglas Gilbert wrote:
>>> On 2020-12-23 11:04 a.m., James Bottomley wrote:
>>>> On Wed, 2020-12-23 at 15:51 +0000, Christoph Hellwig wrote:
>>>>> On Wed, Dec 23, 2020 at 12:52:59PM +0000, Pavel Begunkov wrote:
>>>>>> Can scatterlist have 0-len entries? Those are directly translated
>>>>>> into bvecs, e.g. in nvme/target/io-cmd-file.c and
>>>>>> target/target_core_file.c. I've audited most of the others by this
>>>>>> moment, they're fine.
>>>>>
>>>>> For block layer SGLs we should never see them, and for nvme neither.
>>>>> I think the same is true for the SCSI target code, but please double
>>>>> check.
>>>>
>>>> Right, no-one ever wants to see a 0-len scatter list entry.  The reason
>>>> is that every driver uses the sgl to program the device DMA engine in
>>>> the way NVME does.  A 0-length sgl would be a dangerous corner case:
>>>> some DMA engines would ignore it and others would go haywire, so if we
>>>> ever let a 0-length list down into the driver, they'd have to
>>>> understand the corner case behaviour of their DMA engine and filter it
>>>> accordingly, which is why we disallow them in the upper levels, since
>>>> they're effectively nops anyway.
>>>
>>> When using scatter gather lists at the far end (i.e. on the storage device)
>>> the T10 examples (WRITE SCATTERED and POPULATE TOKEN in SBC-4) explicitly
>>> allow the "number of logical blocks" in their sgl_s to be zero and state
>>> that it is _not_ to be considered an error.
>>
>> It's fine for my case unless it leaks them out of the device driver to
>> the net/block layer/etc. Does it?
> 
> None of the SCSI commands mentioned above are supported by Linux, never
> mind mapped to struct scatterlist.

The POPULATE TOKEN / WRITE USING TOKEN pair can be viewed as a subset
of EXTENDED COPY (SPC-4) which also supports "range descriptors". It is
not clear if target_core_xcopy.c supports these range descriptors but
if it did, it would be trying to map them to struct scatterlist objects.

That said, it would be easy to skip the "number of logical blocks" == 0
case when translating range descriptors to sgl_s.
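
E.g. (rough sketch; the descriptor type is made up and a flat,
pre-initialised scatterlist table is assumed):

struct lba_range {		/* illustrative range descriptor */
	struct page *page;
	unsigned int offset;
	unsigned int nr_blocks;
};

/* translate range descriptors to sgl entries, skipping nop ranges */
static int ranges_to_sgl(struct scatterlist *sgl, const struct lba_range *r,
			 int nr, unsigned int lb_shift)
{
	int i, used = 0;

	for (i = 0; i < nr; i++) {
		if (!r[i].nr_blocks)
			continue;	/* zero blocks is a nop per SBC-4 */
		sg_set_page(&sgl[used++], r[i].page,
			    r[i].nr_blocks << lb_shift, r[i].offset);
	}
	return used;
}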

In my ddpt utility (a dd clone) I have generalized skip= and seek= to
optionally take sgl_s. If the last element in one of those sgl_s is
LBAn,0 then it is interpreted as "until the end of that device", which
is further restricted if the other sgl has a "hard" length or count=
is given. The point is that a length of 0 can have meaning, a benefit
lost with NVMe's 0-based counts.

Doug Gilbert
James Bottomley Dec. 24, 2020, 5:30 p.m. UTC | #14
On Wed, 2020-12-23 at 15:23 -0500, Douglas Gilbert wrote:
> On 2020-12-23 11:04 a.m., James Bottomley wrote:
> > On Wed, 2020-12-23 at 15:51 +0000, Christoph Hellwig wrote:
> > > On Wed, Dec 23, 2020 at 12:52:59PM +0000, Pavel Begunkov wrote:
> > > > Can scatterlist have 0-len entries? Those are directly
> > > > translated into bvecs, e.g. in nvme/target/io-cmd-file.c and
> > > > target/target_core_file.c. I've audited most of the others by
> > > > this moment, they're fine.
> > > 
> > > For block layer SGLs we should never see them, and for nvme
> > > neither. I think the same is true for the SCSI target code, but
> > > please double check.
> > 
> > Right, no-one ever wants to see a 0-len scatter list entry.  The
> > reason is that every driver uses the sgl to program the device DMA
> > engine in the way NVME does.  A 0-length sgl would be a dangerous
> > corner case: some DMA engines would ignore it and others would go
> > haywire, so if we ever let a 0-length list down into the driver,
> > they'd have to understand the corner case behaviour of their DMA
> > engine and filter it accordingly, which is why we disallow them in
> > the upper levels, since they're effectively nops anyway.
> 
> When using scatter gather lists at the far end (i.e. on the storage
> device) the T10 examples (WRITE SCATTERED and POPULATE TOKEN in SBC-
> 4) explicitly allow the "number of logical blocks" in their sgl_s to
> be zero and state that it is _not_ to be considered an error.

But that's pretty irrelevant.  The scatterlists that block has been
constructing to drive DMA engines pre-date SCSI's addition of SGLs by
decades (all SCSI commands before the object commands use a linear
buffer which is implemented in the HBA engine as a scatterlist but not
described by the SCSI standard as one).

So the answer to the question of whether the block layer should emit
zero-length sgl elements is "no", because they can confuse some DMA
engines.

If there's the more theoretical question of whether the target driver,
in adding commands it doesn't yet support, should inject zero-length
SGL elements into block because SCSI allows it, the answer is still
"no", because we don't want block to have SGLs that may confuse other
DMA engines.  There are lots of daft corner cases in the SCSI standard
we don't implement, and a nop SGL element seems one of the more
hare-brained because it adds no useful feature and merely causes
compatibility issues.

James