[v2] lib/scatterlist: add simple page iterator

Message ID	1360608604-3520-1-git-send-email-imre.deak@intel.com
State	New
Headers	show Return-Path: <patch+caf_=linaro-patchwork=canonical.com@linaro.org> Received-SPF: neutral (google.com: 91.189.95.16 is neither permitted nor denied by best guess record for domain of linaro-mm-sig-bounces@lists.linaro.org) client-ip=91.189.95.16; From: Imre Deak <imre.deak@intel.com> To: linux-kernel@vger.kernel.org Date: Mon, 11 Feb 2013 20:50:04 +0200 Message-Id: <1360608604-3520-1-git-send-email-imre.deak@intel.com> In-Reply-To: <1353590706-1366-5-git-send-email-imre.deak@intel.com> References: <1353590706-1366-5-git-send-email-imre.deak@intel.com> Cc: Tejun Heo <tj@kernel.org>, Andrew Morton <akpm@linux-foundation.org>, linaro-mm-sig@lists.linaro.org, Maxim Levitsky <maximlevitsky@gmail.com> Subject: [Linaro-mm-sig] [PATCH v2] lib/scatterlist: add simple page iterator Precedence: list MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: linaro-mm-sig-bounces@lists.linaro.org Errors-To: linaro-mm-sig-bounces@lists.linaro.org

Imre Deak Feb. 11, 2013, 6:50 p.m. UTC

Add an iterator to walk through a scatter list a page at a time starting
at a specific page offset. As opposed to the mapping iterator this is
meant to be small, performing well even in simple loops like collecting
all pages on the scatterlist into an array or setting up an iommu table
based on the pages' DMA address.

v2:
- In each iteration sg_pgoffset pointed incorrectly at the next page not
  the current one.

Signed-off-by: Imre Deak <imre.deak@intel.com>
---
 include/linux/scatterlist.h |   50 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

Andrew Morton Feb. 11, 2013, 8:54 p.m. UTC | #1

On Mon, 11 Feb 2013 20:50:04 +0200
Imre Deak <imre.deak@intel.com> wrote:

> Add an iterator to walk through a scatter list a page at a time starting
> at a specific page offset. As opposed to the mapping iterator this is

What is "the mapping iterator"?

> meant to be small, performing well even in simple loops like collecting
> all pages on the scatterlist into an array or setting up an iommu table
> based on the pages' DMA address.

Where will this new macro be used?  What is driving this effort?

> v2:
> - In each iteration sg_pgoffset pointed incorrectly at the next page not
>   the current one.
> 
> Signed-off-by: Imre Deak <imre.deak@intel.com>
> ---
>  include/linux/scatterlist.h |   50 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 50 insertions(+)
> 
> diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
> index 4bd6c06..72578b5 100644
> --- a/include/linux/scatterlist.h
> +++ b/include/linux/scatterlist.h
> @@ -231,6 +231,56 @@ size_t sg_copy_to_buffer(struct scatterlist *sgl, unsigned int nents,
>   */
>  #define SG_MAX_SINGLE_ALLOC		(PAGE_SIZE / sizeof(struct scatterlist))
>  
> +struct sg_page_iter {
> +	struct scatterlist *sg;
> +	int sg_pgoffset;
> +	struct page *page;
> +};

Some documentation wouldn't hurt.   What it's used for, why it exists.

> +static inline int
> +sg_page_cnt(struct scatterlist *sg)

unneeded newline here.

A more typical name would be "sg_page_count".  Stripping words of their
vowels makes the symbols harder to remember.

> +{
> +	BUG_ON(sg->offset || sg->length & ~PAGE_MASK);
> +
> +	return sg->length >> PAGE_SHIFT;
> +}
> +
> +static inline struct page *
> +sg_page_iter_get_page(struct sg_page_iter *iter)
> +{
> +	while (iter->sg && iter->sg_pgoffset >= sg_page_cnt(iter->sg)) {
> +		iter->sg_pgoffset -= sg_page_cnt(iter->sg);
> +		iter->sg = sg_next(iter->sg);
> +	}
> +
> +	return iter->sg ? nth_page(sg_page(iter->sg), iter->sg_pgoffset) : NULL;
> +}
> +
> +static inline void
> +sg_page_iter_next(struct sg_page_iter *iter)
> +{
> +	iter->sg_pgoffset++;
> +	iter->page = sg_page_iter_get_page(iter);
> +}
> +
> +static inline void
> +sg_page_iter_start(struct sg_page_iter *iter, struct scatterlist *sglist,
> +		   unsigned long pgoffset)
> +{
> +	iter->sg = sglist;
> +	iter->sg_pgoffset = pgoffset;
> +	iter->page = sg_page_iter_get_page(iter);
> +}

All the above are undocumented also.  I guess that's acceptable if they
are only ever to be used by for_each_sg_page().  Although if that's the
case then perhaps the identifiers should be a bit more obscure-looking.
Usually we prefix them with "__" to say "this is in internal thing".

> +/*
> + * Simple sg page iterator, starting off at the given page offset. Each entry
> + * on the sglist must start at offset 0 and can contain only full pages.
> + * iter->page will point to the current page, iter->sg_pgoffset to the page
> + * offset within the sg holding that page.
> + */
> +#define for_each_sg_page(sglist, iter, pgoffset)				\
> +	for (sg_page_iter_start((iter), (sglist), (pgoffset));		\
> +	     (iter)->page; sg_page_iter_next(iter))

Because all the helper functions are inlined, this will expand to a
quite large amount of code.  And large code can be slow code due to
I-cache eviction.

I don't know *how* big this thing will be because the patch didn't
include a caller and I can't be bothered writing my own.  (And the lack
of any caller means that the code will not be tested).

So, exactly how big is this thing, and how do we know it's better this
way than if we were to uninline some/all of the helpers?

Imre Deak Feb. 12, 2013, 5:07 p.m. UTC | #2

On Mon, 2013-02-11 at 12:54 -0800, Andrew Morton wrote:
> On Mon, 11 Feb 2013 20:50:04 +0200
> Imre Deak <imre.deak@intel.com> wrote:
> 
> > Add an iterator to walk through a scatter list a page at a time starting
> > at a specific page offset. As opposed to the mapping iterator this is
> 
> What is "the mapping iterator"?

It's the one implemented by sg_miter_{start,stop} in scatterlist.c. It
also iterates through a scatterlist a page at a time, but it also kmaps
these pages. Since in our use case we don't need to map the pages we
needed a solution without this overhead.

> > meant to be small, performing well even in simple loops like collecting
> > all pages on the scatterlist into an array or setting up an iommu table
> > based on the pages' DMA address.
> 
> Where will this new macro be used?  What is driving this effort?

At the moment the only user of the macro would be the i915 driver, see
[1] for the patches that takes it into use. In the patchset the macro
was added as a DRM specific macro, but since it might be useful in the
future for other drivers too (anything using dma-buf) I'd like to add it
to a more generic place.

> > v2:
> > - In each iteration sg_pgoffset pointed incorrectly at the next page not
> >   the current one.
> > 
> > Signed-off-by: Imre Deak <imre.deak@intel.com>
> > ---
> >  include/linux/scatterlist.h |   50 +++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 50 insertions(+)
> > 
> > diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
> > index 4bd6c06..72578b5 100644
> > --- a/include/linux/scatterlist.h
> > +++ b/include/linux/scatterlist.h
> > @@ -231,6 +231,56 @@ size_t sg_copy_to_buffer(struct scatterlist *sgl, unsigned int nents,
> >   */
> >  #define SG_MAX_SINGLE_ALLOC		(PAGE_SIZE / sizeof(struct scatterlist))
> >  
> > +struct sg_page_iter {
> > +	struct scatterlist *sg;
> > +	int sg_pgoffset;
> > +	struct page *page;
> > +};
> 
> Some documentation wouldn't hurt.   What it's used for, why it exists.

Ok, will add it.

> 
> > +static inline int
> > +sg_page_cnt(struct scatterlist *sg)
> 
> unneeded newline here.
> 
> A more typical name would be "sg_page_count".  Stripping words of their
> vowels makes the symbols harder to remember.

Ok, will fix this.

> > +{
> > +	BUG_ON(sg->offset || sg->length & ~PAGE_MASK);
> > +
> > +	return sg->length >> PAGE_SHIFT;
> > +}
> > +
> > +static inline struct page *
> > +sg_page_iter_get_page(struct sg_page_iter *iter)
> > +{
> > +	while (iter->sg && iter->sg_pgoffset >= sg_page_cnt(iter->sg)) {
> > +		iter->sg_pgoffset -= sg_page_cnt(iter->sg);
> > +		iter->sg = sg_next(iter->sg);
> > +	}
> > +
> > +	return iter->sg ? nth_page(sg_page(iter->sg), iter->sg_pgoffset) : NULL;
> > +}
> > +
> > +static inline void
> > +sg_page_iter_next(struct sg_page_iter *iter)
> > +{
> > +	iter->sg_pgoffset++;
> > +	iter->page = sg_page_iter_get_page(iter);
> > +}
> > +
> > +static inline void
> > +sg_page_iter_start(struct sg_page_iter *iter, struct scatterlist *sglist,
> > +		   unsigned long pgoffset)
> > +{
> > +	iter->sg = sglist;
> > +	iter->sg_pgoffset = pgoffset;
> > +	iter->page = sg_page_iter_get_page(iter);
> > +}
> 
> All the above are undocumented also.  I guess that's acceptable if they
> are only ever to be used by for_each_sg_page().  Although if that's the
> case then perhaps the identifiers should be a bit more obscure-looking.
> Usually we prefix them with "__" to say "this is in internal thing".

Yes, they are meant to be used only internally, so I'll add the __
prefix.

> > +/*
> > + * Simple sg page iterator, starting off at the given page offset. Each entry
> > + * on the sglist must start at offset 0 and can contain only full pages.
> > + * iter->page will point to the current page, iter->sg_pgoffset to the page
> > + * offset within the sg holding that page.
> > + */
> > +#define for_each_sg_page(sglist, iter, pgoffset)				\
> > +	for (sg_page_iter_start((iter), (sglist), (pgoffset));		\
> > +	     (iter)->page; sg_page_iter_next(iter))
> 
> Because all the helper functions are inlined, this will expand to a
> quite large amount of code.  And large code can be slow code due to
> I-cache eviction.
> 
> I don't know *how* big this thing will be because the patch didn't
> include a caller and I can't be bothered writing my own.  (And the lack
> of any caller means that the code will not be tested).
> 
> So, exactly how big is this thing, and how do we know it's better this
> way than if we were to uninline some/all of the helpers?

I admit I only hoped compiler optimization would keep the inlined parts
at a minimum, but now I actually checked (on Intel CPU). I applied the
patchset from [1] and uninlined sg_page_iter_start as it's not
significant for speed:

 size drivers/gpu/drm/i915/i915.ko
 514855	  15996	    272	 531123	  81ab3	drivers/gpu/drm/i915/i915.ko

Then uninlined all helpers:
 size drivers/gpu/drm/i915/i915.ko
 513447	  15996	    272	 529715	  81533	drivers/gpu/drm/i915/i915.ko

Since there are 8 invocations of the macro, the overhead for a single
invocation is about (531123 - 529715) / 8 = 191 bytes.

For speed, I benchmarked a simple loop which was basically:

page = vmalloc(sizeof(*page) * 1000, GFP_KERNEL);
for_each_sg_page(sglist, iter, 0)
	*page++ = iter.page;

where each entry on the sglist contained 16 consecutive pages. This
takes ~10% more time for the uninlined version to run. This is a rather
artificial test and I couldn't come up with something more real-life
using only the i915 driver's ioctl interface that would show a
significant change in speed.

So at least for now I'm ok with just uninlining all the helpers.

Thanks for the review,
Imre

[1]
http://lists.freedesktop.org/archives/intel-gfx/2013-February/024589.html

Tejun Heo Feb. 12, 2013, 5:13 p.m. UTC | #3

Hello,

On Tue, Feb 12, 2013 at 07:07:20PM +0200, Imre Deak wrote:
> It's the one implemented by sg_miter_{start,stop} in scatterlist.c. It
> also iterates through a scatterlist a page at a time, but it also kmaps
> these pages. Since in our use case we don't need to map the pages we
> needed a solution without this overhead.

I'm not against having non-mapping iterator but please consider that
kmaps are no-ops on many configurations.  It matters only for archs w/
high memory.

> where each entry on the sglist contained 16 consecutive pages. This
> takes ~10% more time for the uninlined version to run. This is a rather
> artificial test and I couldn't come up with something more real-life
> using only the i915 driver's ioctl interface that would show a
> significant change in speed.
> 
> So at least for now I'm ok with just uninlining all the helpers.

Can we reimplement mapping iters using the new ones?

Thanks.

Andrew Morton Feb. 12, 2013, 9:28 p.m. UTC | #4

On Tue, 12 Feb 2013 19:07:20 +0200
Imre Deak <imre.deak@intel.com> wrote:

> > So, exactly how big is this thing, and how do we know it's better this
> > way than if we were to uninline some/all of the helpers?
> 
> I admit I only hoped compiler optimization would keep the inlined parts
> at a minimum, but now I actually checked (on Intel CPU). I applied the
> patchset from [1] and uninlined sg_page_iter_start as it's not
> significant for speed:
> 
>  size drivers/gpu/drm/i915/i915.ko
>  514855	  15996	    272	 531123	  81ab3	drivers/gpu/drm/i915/i915.ko
> 
> Then uninlined all helpers:
>  size drivers/gpu/drm/i915/i915.ko
>  513447	  15996	    272	 529715	  81533	drivers/gpu/drm/i915/i915.ko
> 
> Since there are 8 invocations of the macro, the overhead for a single
> invocation is about (531123 - 529715) / 8 = 191 bytes.
> 
> For speed, I benchmarked a simple loop which was basically:
> 
> page = vmalloc(sizeof(*page) * 1000, GFP_KERNEL);
> for_each_sg_page(sglist, iter, 0)
> 	*page++ = iter.page;
> 
> where each entry on the sglist contained 16 consecutive pages. This
> takes ~10% more time for the uninlined version to run. This is a rather
> artificial test and I couldn't come up with something more real-life
> using only the i915 driver's ioctl interface that would show a
> significant change in speed.

10% for the function call overhead sounds reasonable.  Of course, that
test is biased in one direction.  A test which was biased in the other
direction would exercise all eight of the macro's callsites and would
investigate the performance impact of a 1kbyte increase in L1 cache
utilisation.

And I must say, it would need to be a pretty damn carefully crafted
test case to be able to trigger enough cache thrashing to cause a 10%
hit.

> So at least for now I'm ok with just uninlining all the helpers.

OK.

Imre Deak Feb. 13, 2013, 2:51 p.m. UTC | #5

On Tue, 2013-02-12 at 09:13 -0800, Tejun Heo wrote:
> Hello,
> 
> On Tue, Feb 12, 2013 at 07:07:20PM +0200, Imre Deak wrote:
> > It's the one implemented by sg_miter_{start,stop} in scatterlist.c. It
> > also iterates through a scatterlist a page at a time, but it also kmaps
> > these pages. Since in our use case we don't need to map the pages we
> > needed a solution without this overhead.
> 
> I'm not against having non-mapping iterator but please consider that
> kmaps are no-ops on many configurations.  It matters only for archs w/
> high memory.

Ok, I haven't thought about that. But in any case we care about those
archs too and would like to avoid the mapping there as well.

> > where each entry on the sglist contained 16 consecutive pages. This
> > takes ~10% more time for the uninlined version to run. This is a rather
> > artificial test and I couldn't come up with something more real-life
> > using only the i915 driver's ioctl interface that would show a
> > significant change in speed.
> > 
> > So at least for now I'm ok with just uninlining all the helpers.
> 
> Can we reimplement mapping iters using the new ones?

Yes I think it's a good idea, I will follow up with a new patchset
addressing this and Andrew's comments.

Thanks,
Imre

[v2] lib/scatterlist: add simple page iterator

Commit Message

Comments

Patch