diff mbox

sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

Message ID 520894D5.7060207@linaro.org
State Superseded
Headers show

Commit Message

Will Newton Aug. 12, 2013, 7:55 a.m. UTC
A small change to the entry to the aligned copy loop improves
performance slightly on A9 and A15 cores for certain copies.

ports/ChangeLog.arm:

2013-08-07  Will Newton  <will.newton@linaro.org>

	* sysdeps/arm/armv7/multiarch/memcpy_impl.S: Tighten check
	on entry to aligned copy loop for improved performance.
---
 ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Will Newton Aug. 27, 2013, 7:46 a.m. UTC | #1
On 12 August 2013 08:55, Will Newton <will.newton@linaro.org> wrote:
>
> A small change to the entry to the aligned copy loop improves
> performance slightly on A9 and A15 cores for certain copies.
>
> ports/ChangeLog.arm:
>
> 2013-08-07  Will Newton  <will.newton@linaro.org>
>
>         * sysdeps/arm/armv7/multiarch/memcpy_impl.S: Tighten check
>         on entry to aligned copy loop for improved performance.
> ---
>  ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Ping?
Joseph Myers Aug. 29, 2013, 11:58 p.m. UTC | #2
On Mon, 12 Aug 2013, Will Newton wrote:

> A small change to the entry to the aligned copy loop improves
> performance slightly on A9 and A15 cores for certain copies.

Could you clarify what you mean by "certain copies"?

In particular, have you verified that for all three choices in this code 
(NEON, VFP or neither), the code for unaligned copies is at least as fast 
in this case (common 32-bit alignment, but not common 64-bit alignment) as 
the code that would previously have been used in those cases?

There are various comments regarding alignment, whether stating "LDRD/STRD 
support unaligned word accesses" or referring to the mutual alignment that 
applies for particular code.  Does this patch make any of them out of 
date?  (If code can now only be reached with common 64-bit alignment, but 
in fact requires only 32-bit alignment, the comment should probably state 
both those things explicitly.)
Will Newton Aug. 30, 2013, 2:56 p.m. UTC | #3
On 30 August 2013 00:58, Joseph S. Myers <joseph@codesourcery.com> wrote:

Hi Joseph,

>> A small change to the entry to the aligned copy loop improves
>> performance slightly on A9 and A15 cores for certain copies.
>
> Could you clarify what you mean by "certain copies"?

Large copies (> 16kB) where the buffers are 4-byte aligned but not
8-byte aligned. I'll respin the patch with an improved description.

> In particular, have you verified that for all three choices in this code
> (NEON, VFP or neither), the code for unaligned copies is at least as fast
> in this case (common 32-bit alignment, but not common 64-bit alignment) as
> the code that would previously have been used in those cases?

Yes, the performance is very similar but slightly better in the NEON
case and approximately unchanged in the others.

> There are various comments regarding alignment, whether stating "LDRD/STRD
> support unaligned word accesses" or referring to the mutual alignment that
> applies for particular code.  Does this patch make any of them out of
> date?  (If code can now only be reached with common 64-bit alignment, but
> in fact requires only 32-bit alignment, the comment should probably state
> both those things explicitly.)

I've reviewed the comments and they all look ok as far as I can tell.
Joseph Myers Aug. 30, 2013, 3:18 p.m. UTC | #4
On Fri, 30 Aug 2013, Will Newton wrote:

> > There are various comments regarding alignment, whether stating "LDRD/STRD
> > support unaligned word accesses" or referring to the mutual alignment that
> > applies for particular code.  Does this patch make any of them out of
> > date?  (If code can now only be reached with common 64-bit alignment, but
> > in fact requires only 32-bit alignment, the comment should probably state
> > both those things explicitly.)
> 
> I've reviewed the comments and they all look ok as far as I can tell.

Are you sure?  For example, where it says "SRC and DST have the same 
mutual 32-bit alignment", is it not in fact the case after the patch that 
they now have the same mutual 64-bit alignment, even if this code doesn't 
currently rely on that?  I think the comments in each place should be 
explicit about both things - what preconditions the code relies on, and 
what possibly stronger preconditions are in fact true given the other code 
in this file.  Saying the mutual alignment is 32-bit when the reader can 
see from the code not far above that it's 64-bit just seems liable to 
confuse the reader, even if the comment is still formally true.  
Similarly for the requirements on unaligned word accesses - after this 
patch, which uses of ldrd/strd do require that?
Carlos O'Donell Aug. 30, 2013, 5:14 p.m. UTC | #5
On 08/27/2013 03:46 AM, Will Newton wrote:
> On 12 August 2013 08:55, Will Newton <will.newton@linaro.org> wrote:
>>
>> A small change to the entry to the aligned copy loop improves
>> performance slightly on A9 and A15 cores for certain copies.
>>
>> ports/ChangeLog.arm:
>>
>> 2013-08-07  Will Newton  <will.newton@linaro.org>
>>
>>         * sysdeps/arm/armv7/multiarch/memcpy_impl.S: Tighten check
>>         on entry to aligned copy loop for improved performance.
>> ---
>>  ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> Ping?

How did you test the performance?

glibc has a performance microbenchmark, did you use that?

Cheers,
Carlos.
Will Newton Aug. 30, 2013, 6:46 p.m. UTC | #6
On 30 August 2013 16:18, Joseph S. Myers <joseph@codesourcery.com> wrote:
> On Fri, 30 Aug 2013, Will Newton wrote:
>
>> > There are various comments regarding alignment, whether stating "LDRD/STRD
>> > support unaligned word accesses" or referring to the mutual alignment that
>> > applies for particular code.  Does this patch make any of them out of
>> > date?  (If code can now only be reached with common 64-bit alignment, but
>> > in fact requires only 32-bit alignment, the comment should probably state
>> > both those things explicitly.)
>>
>> I've reviewed the comments and they all look ok as far as I can tell.
>
> Are you sure?  For example, where it says "SRC and DST have the same
> mutual 32-bit alignment", is it not in fact the case after the patch that
> they now have the same mutual 64-bit alignment, even if this code doesn't
> currently rely on that?  I think the comments in each place should be
> explicit about both things - what preconditions the code relies on, and
> what possibly stronger preconditions are in fact true given the other code
> in this file.  Saying the mutual alignment is 32-bit when the reader can
> see from the code not far above that it's 64-bit just seems liable to
> confuse the reader, even if the comment is still formally true.
> Similarly for the requirements on unaligned word accesses - after this
> patch, which uses of ldrd/strd do require that?

Yes, you're right, that needs more thought. I'll have a look at it next week.
Will Newton Aug. 30, 2013, 6:48 p.m. UTC | #7
On 30 August 2013 18:14, Carlos O'Donell <carlos@redhat.com> wrote:

Hi Carlos,

>>> A small change to the entry to the aligned copy loop improves
>>> performance slightly on A9 and A15 cores for certain copies.
>>>
>>> ports/ChangeLog.arm:
>>>
>>> 2013-08-07  Will Newton  <will.newton@linaro.org>
>>>
>>>         * sysdeps/arm/armv7/multiarch/memcpy_impl.S: Tighten check
>>>         on entry to aligned copy loop for improved performance.
>>> ---
>>>  ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> Ping?
>
> How did you test the performance?
>
> glibc has a performance microbenchmark, did you use that?

No, I used the cortex-strings package developed by Linaro for
benchmarking various string functions against one another[1].

I haven't checked the glibc benchmarks but I'll look into that. It's
quite a specific case that shows the problem so it may not be obvious
which one is better however.

[1] https://launchpad.net/cortex-strings
Carlos O'Donell Aug. 30, 2013, 7:26 p.m. UTC | #8
On 08/30/2013 02:48 PM, Will Newton wrote:
> On 30 August 2013 18:14, Carlos O'Donell <carlos@redhat.com> wrote:
> 
> Hi Carlos,
> 
>>>> A small change to the entry to the aligned copy loop improves
>>>> performance slightly on A9 and A15 cores for certain copies.
>>>>
>>>> ports/ChangeLog.arm:
>>>>
>>>> 2013-08-07  Will Newton  <will.newton@linaro.org>
>>>>
>>>>         * sysdeps/arm/armv7/multiarch/memcpy_impl.S: Tighten check
>>>>         on entry to aligned copy loop for improved performance.
>>>> ---
>>>>  ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S | 4 ++--
>>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> Ping?
>>
>> How did you test the performance?
>>
>> glibc has a performance microbenchmark, did you use that?
> 
> No, I used the cortex-strings package developed by Linaro for
> benchmarking various string functions against one another[1].
> 
> I haven't checked the glibc benchmarks but I'll look into that. It's
> quite a specific case that shows the problem so it may not be obvious
> which one is better however.

If it's not obvious how is someone supposed to review this patch? :-)

> [1] https://launchpad.net/cortex-strings

There are 2 benchmarks. One appears to be dhrystone 2.1, which isn't a string 
test in and of itself which should not be used for benchmarking or changing
string functions. The other is called "multi" and appears to run some functions
in a loop and take the time. 

e.g.
http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/view/head:/benchmarks/multi/harness.c

I would not call `multi' exhaustive, and while neither is the glibc performance
benchmark tests the glibc tests have received review from the glibc community
and are our preferred way of demonstrating performance gains when posting
performance patches.

I would really really like to see you post the results of running your new
implementation with this benchmark and show the numbers that claim this is
faster. Is that possible?

Cheers,
Carlos.
Will Newton Sept. 2, 2013, 1:58 p.m. UTC | #9
On 30 August 2013 20:26, Carlos O'Donell <carlos@redhat.com> wrote:
> On 08/30/2013 02:48 PM, Will Newton wrote:
>> On 30 August 2013 18:14, Carlos O'Donell <carlos@redhat.com> wrote:
>>
>> Hi Carlos,
>>
>>>>> A small change to the entry to the aligned copy loop improves
>>>>> performance slightly on A9 and A15 cores for certain copies.
>>>>>
>>>>> ports/ChangeLog.arm:
>>>>>
>>>>> 2013-08-07  Will Newton  <will.newton@linaro.org>
>>>>>
>>>>>         * sysdeps/arm/armv7/multiarch/memcpy_impl.S: Tighten check
>>>>>         on entry to aligned copy loop for improved performance.
>>>>> ---
>>>>>  ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S | 4 ++--
>>>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> Ping?
>>>
>>> How did you test the performance?
>>>
>>> glibc has a performance microbenchmark, did you use that?
>>
>> No, I used the cortex-strings package developed by Linaro for
>> benchmarking various string functions against one another[1].
>>
>> I haven't checked the glibc benchmarks but I'll look into that. It's
>> quite a specific case that shows the problem so it may not be obvious
>> which one is better however.
>
> If it's not obvious how is someone supposed to review this patch? :-)

With difficulty. ;-)

Joseph has raised some good points about the comments and I'll go back
through the code and make sure everything is correct in that regard.
The change was actually made to the copy of the code in cortex-strings
some time ago but I delayed pushing the patch due to the 2.18 release
so I have to refresh my memory somewhat.

Ideally we would have an agreed upon benchmark with which everyone
could analyse the performance of the code on their systems, however
that does not seem to exist as far as I can tell.

>> [1] https://launchpad.net/cortex-strings
>
> There are 2 benchmarks. One appears to be dhrystone 2.1, which isn't a string
> test in and of itself which should not be used for benchmarking or changing
> string functions. The other is called "multi" and appears to run some functions
> in a loop and take the time.

The dhrystone stuff is historical and should probably be removed.

> e.g.
> http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/view/head:/benchmarks/multi/harness.c
>
> I would not call `multi' exhaustive, and while neither is the glibc performance
> benchmark tests the glibc tests have received review from the glibc community
> and are our preferred way of demonstrating performance gains when posting
> performance patches.

The key advantage of the cortex-strings framework is that it allows
graphing the results of benchmarks. Often changes to string function
performance can only really be analysed graphically as otherwise you
end up with a huge soup of numbers, some going up, some going down and
it is very hard to separate the signal from the noise.

The glibc benchmarks also have some other weaknesses that should
really be addressed, hopefully I'll have some time to write patches
for some of this work.

> I would really really like to see you post the results of running your new
> implementation with this benchmark and show the numbers that claim this is
> faster. Is that possible?

The attached image shows a graph of memcpy performance with one 8 byte
aligned buffer and one 4 byte aligned buffer as buffer size increases.
Ignore the bionic lines, "this" is the modified code and "old" is the
old code. As you can see it is quite a marginal improvement but for a
certain class of copy it does improve performance.

---
Will Newton
Toolchain Working Group, Linaro
Will Newton Sept. 2, 2013, 2:18 p.m. UTC | #10
On 30 August 2013 20:26, Carlos O'Donell <carlos@redhat.com> wrote:
> On 08/30/2013 02:48 PM, Will Newton wrote:
>> On 30 August 2013 18:14, Carlos O'Donell <carlos@redhat.com> wrote:
>>
>> Hi Carlos,
>>
>>>>> A small change to the entry to the aligned copy loop improves
>>>>> performance slightly on A9 and A15 cores for certain copies.
>>>>>
>>>>> ports/ChangeLog.arm:
>>>>>
>>>>> 2013-08-07  Will Newton  <will.newton@linaro.org>
>>>>>
>>>>>         * sysdeps/arm/armv7/multiarch/memcpy_impl.S: Tighten check
>>>>>         on entry to aligned copy loop for improved performance.
>>>>> ---
>>>>>  ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S | 4 ++--
>>>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> Ping?
>>>
>>> How did you test the performance?
>>>
>>> glibc has a performance microbenchmark, did you use that?
>>
>> No, I used the cortex-strings package developed by Linaro for
>> benchmarking various string functions against one another[1].
>>
>> I haven't checked the glibc benchmarks but I'll look into that. It's
>> quite a specific case that shows the problem so it may not be obvious
>> which one is better however.
>
> If it's not obvious how is someone supposed to review this patch? :-)
>
>> [1] https://launchpad.net/cortex-strings
>
> There are 2 benchmarks. One appears to be dhrystone 2.1, which isn't a string
> test in and of itself which should not be used for benchmarking or changing
> string functions. The other is called "multi" and appears to run some functions
> in a loop and take the time.
>
> e.g.
> http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/view/head:/benchmarks/multi/harness.c
>
> I would not call `multi' exhaustive, and while neither is the glibc performance
> benchmark tests the glibc tests have received review from the glibc community
> and are our preferred way of demonstrating performance gains when posting
> performance patches.
>
> I would really really like to see you post the results of running your new
> implementation with this benchmark and show the numbers that claim this is
> faster. Is that possible?

The mailing list server does not seem to accept image attachments so I
have uploaded the performance graph here:

http://people.linaro.org/~will.newton/glibc_memcpy/sizes-memcpy-08-04-2.5.png
Siddhesh Poyarekar Sept. 2, 2013, 2:20 p.m. UTC | #11
Adding libc-alpha since my comments on this are more suitable there -
maybe we should just get rid of the ports mailing list altogether.

On Mon, Sep 02, 2013 at 02:58:23PM +0100, Will Newton wrote:
> The key advantage of the cortex-strings framework is that it allows
> graphing the results of benchmarks. Often changes to string function
> performance can only really be analysed graphically as otherwise you
> end up with a huge soup of numbers, some going up, some going down and
> it is very hard to separate the signal from the noise.

We already depend on gd-devel to draw graphs for memusagestat.  Maybe
we could leverage that in the microbenchmark as well and come up with
html reports rather than just a bench.out?  Graphs in general are a
lot more readable and I agree that they would be the next logical
improvement to the benchmark framework.

> The glibc benchmarks also have some other weaknesses that should
> really be addressed, hopefully I'll have some time to write patches
> for some of this work.

I know Ondrej had proposed a few improvements as well.  I'd like to
see those reposted so that we can look at it and if possible, have
them merged in.

Siddhesh
Ondřej Bílka Sept. 2, 2013, 7:57 p.m. UTC | #12
On Mon, Sep 02, 2013 at 02:58:23PM +0100, Will Newton wrote:
> On 30 August 2013 20:26, Carlos O'Donell <carlos@redhat.com> wrote:
> > On 08/30/2013 02:48 PM, Will Newton wrote:
> >> On 30 August 2013 18:14, Carlos O'Donell <carlos@redhat.com> wrote:
> >>>> Ping?
> >>>
> >>> How did you test the performance?
> >>>
> >>> glibc has a performance microbenchmark, did you use that?
> >>
> >> No, I used the cortex-strings package developed by Linaro for
> >> benchmarking various string functions against one another[1].
> >>
> >> I haven't checked the glibc benchmarks but I'll look into that. It's
> >> quite a specific case that shows the problem so it may not be obvious
> >> which one is better however.
> >
> > If it's not obvious how is someone supposed to review this patch? :-)
> 
> With difficulty. ;-)
> 
> Joseph has raised some good points about the comments and I'll go back
> through the code and make sure everything is correct in that regard.
> The change was actually made to the copy of the code in cortex-strings
> some time ago but I delayed pushing the patch due to the 2.18 release
> so I have to refresh my memory somewhat.
> 
> Ideally we would have an agreed upon benchmark with which everyone
> could analyse the performance of the code on their systems, however
> that does not seem to exist as far as I can tell.
>
Well, for measuring performance about only way that everybody will agree
with is compile implementations as old.so and new.so and then use

LD_PRELOAD=old.so time cmd
LD_PRELOAD=new.so time cmd

in loop until you calculate that there is statistically significant
difference (provided that commands you use are representative enough).

For any other somebody will argue that its opposite because you
forgotten to take some factor into account.

Even when you change LD_PRELOAD=old.so implementation that to
accurately measure time spend in function it need not be enough.

You could have implementation that will be 5 cycles faster on that
benchmark but slower in reality because

1) Its code is 1000 bytes bigger than alternative. Gains in function
itself will be eaten by instruction cache misses outside function. 
Or
2) Function agressively prefetches data (say loop that prefetches 
lines 512 bytes after current buffer position). This makes benchmark
performance better but cache will be littered by data after buffer end
buffers, real performance suffers.
Or
3) For malloc saving metadata at same cache line as start of allocated
memory could make benchmark look bad due to cache misses. But it will
improve performance as user will write there and metadata write serves
as prefetch.
or
...


> > e.g.
> > http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/view/head:/benchmarks/multi/harness.c
> >
> > I would not call `multi' exhaustive, and while neither is the glibc performance
> > benchmark tests the glibc tests have received review from the glibc community
> > and are our preferred way of demonstrating performance gains when posting
> > performance patches.
> 
> The key advantage of the cortex-strings framework is that it allows
> graphing the results of benchmarks. Often changes to string function
> performance can only really be analysed graphically as otherwise you
> end up with a huge soup of numbers, some going up, some going down and
> it is very hard to separate the signal from the noise.
> 
Like following? http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/memcpy_profile_loop/results_rand/result.html

On real workloads of memcpy it is still bit hard to see what is going on:
http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/memcpy_profile_loop/results_gcc/result.html


> The glibc benchmarks also have some other weaknesses that should
> really be addressed, hopefully I'll have some time to write patches
> for some of this work.
> 
How will you fix measuring in tight loop with same arguments and only 32 times?
Carlos O'Donell Sept. 3, 2013, 4:13 p.m. UTC | #13
On 09/02/2013 10:18 AM, Will Newton wrote:
> The mailing list server does not seem to accept image attachments so I
> have uploaded the performance graph here:
> 
> http://people.linaro.org/~will.newton/glibc_memcpy/sizes-memcpy-08-04-2.5.png

Excellent graph.

Cheers,
Carlos.
Carlos O'Donell Sept. 3, 2013, 4:18 p.m. UTC | #14
On 09/02/2013 09:58 AM, Will Newton wrote:
>> If it's not obvious how is someone supposed to review this patch? :-)
> 
> With difficulty. ;-)

Thank you for acknowledging that.

> Joseph has raised some good points about the comments and I'll go back
> through the code and make sure everything is correct in that regard.
> The change was actually made to the copy of the code in cortex-strings
> some time ago but I delayed pushing the patch due to the 2.18 release
> so I have to refresh my memory somewhat.
> 
> Ideally we would have an agreed upon benchmark with which everyone
> could analyse the performance of the code on their systems, however
> that does not seem to exist as far as I can tell.

We have one, it's the glibc microbenchmark, and we want to expand it,
otherwise when ACME comes with their patch for ARM and breaks performance
for targets that Linaro cares about I have no way to reject the patch
objectively :-)

> The key advantage of the cortex-strings framework is that it allows
> graphing the results of benchmarks. Often changes to string function
> performance can only really be analysed graphically as otherwise you
> end up with a huge soup of numbers, some going up, some going down and
> it is very hard to separate the signal from the noise.

I disagree strongly. You *must* come up with a measurable answer and
looking at a graph is never a solution I'm going to accept.

You need to statistically analyze the numbers, assign weights to ranges,
and come up with some kind of number that evaluates the results based
on *some* formula. That is the only way we are going to keep moving
performance forward (against some kind of criteria).

> The glibc benchmarks also have some other weaknesses that should
> really be addressed, hopefully I'll have some time to write patches
> for some of this work.

Thank you very much.

Cheers,
Carlos.
Ondřej Bílka Sept. 3, 2013, 5:37 p.m. UTC | #15
On Tue, Sep 03, 2013 at 12:18:24PM -0400, Carlos O'Donell wrote:
> On 09/02/2013 09:58 AM, Will Newton wrote:
> >> If it's not obvious how is someone supposed to review this patch? :-)
> > 
> > With difficulty. ;-)
> 
> Thank you for acknowledging that.
> 
> > Joseph has raised some good points about the comments and I'll go back
> > through the code and make sure everything is correct in that regard.
> > The change was actually made to the copy of the code in cortex-strings
> > some time ago but I delayed pushing the patch due to the 2.18 release
> > so I have to refresh my memory somewhat.
> > 
> > Ideally we would have an agreed upon benchmark with which everyone
> > could analyse the performance of the code on their systems, however
> > that does not seem to exist as far as I can tell.
> 
> We have one, it's the glibc microbenchmark, and we want to expand it,
> otherwise when ACME comes with their patch for ARM and breaks performance
> for targets that Linaro cares about I have no way to reject the patch
> objectively :-)
>
Carlos, you are asking for impossible. When you publish benchmark people
will try to maximize benchmark number. After certain point this becomes
possible only by employing shady accounting: Move part of time to place
wehre it will not be measured by benchmark (for example by having
function that is 4kb large, on benchmarks it will fit into instruction
cache but that does not happen in reality). 

Taking care of common factors that can cause that is about ten times
more complex than whole system benchmarking, analysis will be quite
difficult as you will get twenty numbers and you will need to decide
which ones could made real impact and which wont.
 
> > The key advantage of the cortex-strings framework is that it allows
> > graphing the results of benchmarks. Often changes to string function
> > performance can only really be analysed graphically as otherwise you
> > end up with a huge soup of numbers, some going up, some going down and
> > it is very hard to separate the signal from the noise.
> 
> I disagree strongly. You *must* come up with a measurable answer and
> looking at a graph is never a solution I'm going to accept.
> 
You can have that opinion.
Looking at performance graphs is most powerful technique how to
understand performance. I got most of my improvements from analyzing
these.

> You need to statistically analyze the numbers, assign weights to ranges,
> and come up with some kind of number that evaluates the results based
> on *some* formula. That is the only way we are going to keep moving
> performance forward (against some kind of criteria).
> 
These accurate assigning weigths is best done by taking program running
it and measuring time. Without taking this into account weigths will not
tell much, as you will likely just optimize cold code at expense of hot
code.

> > The glibc benchmarks also have some other weaknesses that should
> > really be addressed, hopefully I'll have some time to write patches
> > for some of this work.
> 
> Thank you very much.
> 
> Cheers,
> Carlos.
Carlos O'Donell Sept. 3, 2013, 5:52 p.m. UTC | #16
On 09/03/2013 01:37 PM, Ondřej Bílka wrote:
>> We have one, it's the glibc microbenchmark, and we want to expand it,
>> otherwise when ACME comes with their patch for ARM and breaks performance
>> for targets that Linaro cares about I have no way to reject the patch
>> objectively :-)
>>
> Carlos, you are asking for impossible. When you publish benchmark people
> will try to maximize benchmark number. After certain point this becomes
> possible only by employing shady accounting: Move part of time to place
> wehre it will not be measured by benchmark (for example by having
> function that is 4kb large, on benchmarks it will fit into instruction
> cache but that does not happen in reality). 

What is it that I'm asking that is impossible?

> Taking care of common factors that can cause that is about ten times
> more complex than whole system benchmarking, analysis will be quite
> difficult as you will get twenty numbers and you will need to decide
> which ones could made real impact and which wont.

Sorry, could you clarify this a bit more, exactly what is ten times
more complex?

If we have N tests and they produce N numbers, for a given target,
for a given device, for a given workload, there is a set of importance
weights on N that should give you some kind of relevance.

We should be able to come up with some kind of framework from which
we can clearly say "this patch is better than this other patch", even
if not automated, it should be possible to reason from the results,
and that reasoning recorded as a discussion on this list.

>>> The key advantage of the cortex-strings framework is that it allows
>>> graphing the results of benchmarks. Often changes to string function
>>> performance can only really be analysed graphically as otherwise you
>>> end up with a huge soup of numbers, some going up, some going down and
>>> it is very hard to separate the signal from the noise.
>>
>> I disagree strongly. You *must* come up with a measurable answer and
>> looking at a graph is never a solution I'm going to accept.
>>
> You can have that opinion.
> Looking at performance graphs is most powerful technique how to
> understand performance. I got most of my improvements from analyzing
> these.

That is a different use for the graphs. I do not disagree that graphing
is a *powerful* way to display information and using that information to
produce a new routine is useful. What I disagree with is using such graphs
to argue qualitatively that your patch is better than the existing
implementation.

There is always a quantitative way to say X is better than Y, but it
requires breaking down your expectations and documenting them e.g.
should be faster with X alignment on sizes from N bytes to M bytes, and
then ranking based on those criteria.

>> You need to statistically analyze the numbers, assign weights to ranges,
>> and come up with some kind of number that evaluates the results based
>> on *some* formula. That is the only way we are going to keep moving
>> performance forward (against some kind of criteria).
>>
> These accurate assigning weigths is best done by taking program running
> it and measuring time. Without taking this into account weigths will not
> tell much, as you will likely just optimize cold code at expense of hot
> code.

I don't disagree with you here.

Cheers,
Carlos.
Ondřej Bílka Sept. 3, 2013, 6:57 p.m. UTC | #17
On Tue, Sep 03, 2013 at 01:52:34PM -0400, Carlos O'Donell wrote:
> On 09/03/2013 01:37 PM, Ondřej Bílka wrote:
> >> We have one, it's the glibc microbenchmark, and we want to expand it,
> >> otherwise when ACME comes with their patch for ARM and breaks performance
> >> for targets that Linaro cares about I have no way to reject the patch
> >> objectively :-)
> >>
> > Carlos, you are asking for impossible. When you publish benchmark people
> > will try to maximize benchmark number. After certain point this becomes
> > possible only by employing shady accounting: Move part of time to place
> > wehre it will not be measured by benchmark (for example by having
> > function that is 4kb large, on benchmarks it will fit into instruction
> > cache but that does not happen in reality). 
> 
> What is it that I'm asking that is impossible?
>
Having static set of benchmarks that can say if implementation is
improvement. 

We are shooting to moving target, architectures change and as what we write
will code that will come to users with considerable delay and factors we 
used for decision will change in meantime.

Once implementation reach certain quality question what is better
becomes dependent on program used. Until we could decide from profile
feedback we will lose some percents by having to use single
implementation.
 
> > Taking care of common factors that can cause that is about ten times
> > more complex than whole system benchmarking, analysis will be quite
> > difficult as you will get twenty numbers and you will need to decide
> > which ones could made real impact and which wont.
> 
> Sorry, could you clarify this a bit more, exactly what is ten times
> more complex?
>
Having benchmark suite that will catch all relevant factors that can
affect performance. Some are hard to qualify for them we need to know
how average program stresses resources.

Take instruction cache usage, a function will occupy cache lines and we
can accurately measure probability and cost of cache misses inside
function. What is hard to estimate is how this will affect rest of
program. For this we would need to know average probability that cache
line will be referenced in future.
 
> If we have N tests and they produce N numbers, for a given target,
> for a given device, for a given workload, there is a set of importance
> weights on N that should give you some kind of relevance.
> 
You are jumping to case when we will have these weights. Problematic
part is getting those.

> We should be able to come up with some kind of framework from which
> we can clearly say "this patch is better than this other patch", even
> if not automated, it should be possible to reason from the results,
> and that reasoning recorded as a discussion on this list.
> 
What is possible is to say that some patch is significantly worse based
on some criteria. There is lot of gray area where decision is unclear.

> >>> The key advantage of the cortex-strings framework is that it allows
> >>> graphing the results of benchmarks. Often changes to string function
> >>> performance can only really be analysed graphically as otherwise you
> >>> end up with a huge soup of numbers, some going up, some going down and
> >>> it is very hard to separate the signal from the noise.
> >>
> >> I disagree strongly. You *must* come up with a measurable answer and
> >> looking at a graph is never a solution I'm going to accept.
> >>
> > You can have that opinion.
> > Looking at performance graphs is most powerful technique how to
> > understand performance. I got most of my improvements from analyzing
> > these.
> 
> That is a different use for the graphs. I do not disagree that graphing
> is a *powerful* way to display information and using that information to
> produce a new routine is useful. What I disagree with is using such graphs
> to argue qualitatively that your patch is better than the existing
> implementation.
> 
> There is always a quantitative way to say X is better than Y, but it
> requires breaking down your expectations and documenting them e.g.
> should be faster with X alignment on sizes from N bytes to M bytes, and
> then ranking based on those criteria.
> 
> >> You need to statistically analyze the numbers, assign weights to ranges,
> >> and come up with some kind of number that evaluates the results based
> >> on *some* formula. That is the only way we are going to keep moving
> >> performance forward (against some kind of criteria).
> >>
> > These accurate assigning weigths is best done by taking program running
> > it and measuring time. Without taking this into account weigths will not
> > tell much, as you will likely just optimize cold code at expense of hot
> > code.
> 
> I don't disagree with you here.
> 
> Cheers,
> Carlos.
Carlos O'Donell Sept. 3, 2013, 7:15 p.m. UTC | #18
On 09/03/2013 02:57 PM, Ondřej Bílka wrote:
> On Tue, Sep 03, 2013 at 01:52:34PM -0400, Carlos O'Donell wrote:
>> On 09/03/2013 01:37 PM, Ondřej Bílka wrote:
>>>> We have one, it's the glibc microbenchmark, and we want to expand it,
>>>> otherwise when ACME comes with their patch for ARM and breaks performance
>>>> for targets that Linaro cares about I have no way to reject the patch
>>>> objectively :-)
>>>>
>>> Carlos, you are asking for impossible. When you publish benchmark people
>>> will try to maximize benchmark number. After certain point this becomes
>>> possible only by employing shady accounting: Move part of time to place
>>> wehre it will not be measured by benchmark (for example by having
>>> function that is 4kb large, on benchmarks it will fit into instruction
>>> cache but that does not happen in reality). 
>>
>> What is it that I'm asking that is impossible?
>>
> Having static set of benchmarks that can say if implementation is
> improvement. 

I agree that a static set of benchmarks will not provide us with an
accurate answer for "Is this patch objectively better for performance?"

I don't see that that makes it impossible. The static set of benchmarks
at a given point in time (since we should be adding new benchmarks) may
have some error with respect to "fastest" for a given ISA/device/workload
(which I will shorten as `workload' from now on). Therefore I would not 
say it's impossible.

It's probably impossible for the error between the estimator and reality
being close to zero though. That's just expected.

> We are shooting to moving target, architectures change and as what we write
> will code that will come to users with considerable delay and factors we 
> used for decision will change in meantime.

What's wrong with a moving target?

I have spoken with CPU designers and I've been told that they do whole 
system profiling in order assist in making instruction to microcode
decoding decisions. Therefore what we select as optimal sequences are 
also fed forward into *new* architecture designs.

> Once implementation reach certain quality question what is better
> becomes dependent on program used. Until we could decide from profile
> feedback we will lose some percents by having to use single
> implementation.

I agree. The eventual goal of the project is to have some kind of
whole system benchmarking that allows users to feed in their profiles
and allow us as developers to see what users are doing with our library.

Just like CPU designers feed in a whole distribution of applications
and look at the probability of instruction selection and tweak instruction
to microcode mappings.

I am willing to accept a certain error in the process as long as I know
we are headed in the right direction. If we all disagree about the
direction we are going in then we should talk about it.

I see:

microbenchmarks -> whole system benchmarks -> profile driven optimizations

With the latter driving the set of tunnables we expose in the library,
and possibly even the functions selected by the IFUNC resolvers at
program startup.
  
>>> Taking care of common factors that can cause that is about ten times
>>> more complex than whole system benchmarking, analysis will be quite
>>> difficult as you will get twenty numbers and you will need to decide
>>> which ones could made real impact and which wont.
>>
>> Sorry, could you clarify this a bit more, exactly what is ten times
>> more complex?
>>
> Having benchmark suite that will catch all relevant factors that can
> affect performance. Some are hard to qualify for them we need to know
> how average program stresses resources.

I agree.

I would be happy to accept a patch that does:
* Shows the benchmark numbers.
* Explains relevant factors not caught by the benchmark that affect
  performance, what they are, and why the patch should go in.

My goal is to increase the quality of the written rationales for
performance related submissions.

> Take instruction cache usage, a function will occupy cache lines and we
> can accurately measure probability and cost of cache misses inside
> function. What is hard to estimate is how this will affect rest of
> program. For this we would need to know average probability that cache
> line will be referenced in future.

Good example.

>> If we have N tests and they produce N numbers, for a given target,
>> for a given device, for a given workload, there is a set of importance
>> weights on N that should give you some kind of relevance.
>>
> You are jumping to case when we will have these weights. Problematic
> part is getting those.

I agree.

It's hard to know the weights without having an intuitive understanding
of the applications you're running on your system and what's relevant
for their performance.

Perhaps my example of a weighted average is too abstract to use today.

>> We should be able to come up with some kind of framework from which
>> we can clearly say "this patch is better than this other patch", even
>> if not automated, it should be possible to reason from the results,
>> and that reasoning recorded as a discussion on this list.
>>
> What is possible is to say that some patch is significantly worse based
> on some criteria. There is lot of gray area where decision is unclear.

:-)

Cheers,
Carlos.
Ryan S. Arnold Sept. 3, 2013, 7:31 p.m. UTC | #19
On Tue, Sep 3, 2013 at 11:18 AM, Carlos O'Donell <carlos@redhat.com> wrote:
> We have one, it's the glibc microbenchmark, and we want to expand it,
> otherwise when ACME comes with their patch for ARM and breaks performance
> for targets that Linaro cares about I have no way to reject the patch
> objectively :-)

Can you be objective in analyzing performance when two different
people have differing opinions on what performance preconditions
should be coded against?

There are some cases that are obvious.. we know that from pipeline
analysis that certain instruction sequences can hinder performance.
That is objective and can be measured by a benchmark, but saying that
a particular change penalizes X sized copies but helps Y sized copies
when there are no published performance preconditions isn't.  It's a
difference in opinion of what's important.

PowerPC has had the luxury of not having their performance
pre-conditions contested.  PowerPC string performance is optimized
based upon customer data-set analysis.  So PowerPC's preconditions are
pretty concrete...  Optimize for aligned data in excess of 128-bytes
(I believe).

> You need to statistically analyze the numbers, assign weights to ranges,
> and come up with some kind of number that evaluates the results based
> on *some* formula. That is the only way we are going to keep moving
> performance forward (against some kind of criteria).

This sounds like establishing preconditions (what types of data will
be optimized for).

Unless technology evolves that you can statistically analyze data in
real time and adjust the implementation based on what you find (an
implementation with a different set of preconditions) to account for
this you're going to end up with a lot of in-fighting over
performance.

I've run into situations where I recommended that a customer code
their own string function implementation because they continually
encountered unaligned-data when copying-by-value in C++ functions and
PowerPC's string function implementations penalized unaligned copies
in preference for aligned copies.

Ryan
Ryan S. Arnold Sept. 3, 2013, 7:34 p.m. UTC | #20
On Tue, Sep 3, 2013 at 12:37 PM, Ondřej Bílka <neleai@seznam.cz> wrote:
>> I disagree strongly. You *must* come up with a measurable answer and
>> looking at a graph is never a solution I'm going to accept.
>>
> You can have that opinion.
> Looking at performance graphs is most powerful technique how to
> understand performance. I got most of my improvements from analyzing
> these.

Are there any open source pipeline analysis tools?  I've found the one
I've used (proprietary) to be a pretty good indicator of general
instruction selection optimization/correctness.

Graphs are useful if everyone's agreed which data-sizes and alignment
restrictions should be optimized for a particular platform.

Ryan
Carlos O'Donell Sept. 3, 2013, 7:54 p.m. UTC | #21
On 09/03/2013 03:31 PM, Ryan S. Arnold wrote:
> On Tue, Sep 3, 2013 at 11:18 AM, Carlos O'Donell <carlos@redhat.com> wrote:
>> We have one, it's the glibc microbenchmark, and we want to expand it,
>> otherwise when ACME comes with their patch for ARM and breaks performance
>> for targets that Linaro cares about I have no way to reject the patch
>> objectively :-)
> 
> Can you be objective in analyzing performance when two different
> people have differing opinions on what performance preconditions
> should be coded against?

No.

> There are some cases that are obvious.. we know that from pipeline
> analysis that certain instruction sequences can hinder performance.
> That is objective and can be measured by a benchmark, but saying that
> a particular change penalizes X sized copies but helps Y sized copies
> when there are no published performance preconditions isn't.  It's a
> difference in opinion of what's important.

I agree. The project needs to adopt some set of performance preconditions
and document them and defend them.

If we can't defend these positions then we will never be able to evaluate
any performance patches. We will see-saw between several implementations
over the years.

The current set of performance preconditions are baked into the experience
of the core developers reviewing patches. I want the experts out of the
loop.

> PowerPC has had the luxury of not having their performance
> pre-conditions contested.  PowerPC string performance is optimized
> based upon customer data-set analysis.  So PowerPC's preconditions are
> pretty concrete...  Optimize for aligned data in excess of 128-bytes
> (I believe).

We should be documenting this somewhere, preferably in a Power-specific
test that looks at just this kind of issue.

Documenting this statically is the first, in my opinion, stepping stone
to having something like dynamic feedback.

>> You need to statistically analyze the numbers, assign weights to ranges,
>> and come up with some kind of number that evaluates the results based
>> on *some* formula. That is the only way we are going to keep moving
>> performance forward (against some kind of criteria).
> 
> This sounds like establishing preconditions (what types of data will
> be optimized for).

I agree. We need it.

> Unless technology evolves that you can statistically analyze data in
> real time and adjust the implementation based on what you find (an
> implementation with a different set of preconditions) to account for
> this you're going to end up with a lot of in-fighting over
> performance.

Why do you assume we'll have a lot of in-fighting over performance?

At present we've split the performance intensive (or so we believe)
routines on a per-machine basis. The arguments are then going to be
had only on a per-machine basis, and even then for each hardware
variant can have an IFUNC resolver select the right routine at
runtime.

Then we come upon the tunables that should allow some dynamic adjustment
of an algorithm based on realtime data.

> I've run into situations where I recommended that a customer code
> their own string function implementation because they continually
> encountered unaligned-data when copying-by-value in C++ functions and
> PowerPC's string function implementations penalized unaligned copies
> in preference for aligned copies.

Provide both in glibc and expose a tunable?

Cheers,
Carlos.
Ryan S. Arnold Sept. 3, 2013, 8:56 p.m. UTC | #22
On Tue, Sep 3, 2013 at 2:54 PM, Carlos O'Donell <carlos@redhat.com> wrote:
> The current set of performance preconditions are baked into the experience
> of the core developers reviewing patches. I want the experts out of the
> loop.
>

This is the clutch.

Developers working for the CPU manufacturers are privy to a lot of
unpublished timing, penalty/hazard information, as well as proprietary
pipeline analysis tools.

Will "J. Random Hacker" working for MegaCorp tell you that the reason
he's chosen a particular instruction sequence is because the system
he's working on has a tiny branch cache (the size of which might be
unpublished)?

>> PowerPC has had the luxury of not having their performance
>> pre-conditions contested.  PowerPC string performance is optimized
>> based upon customer data-set analysis.  So PowerPC's preconditions are
>> pretty concrete...  Optimize for aligned data in excess of 128-bytes
>> (I believe).
>
> We should be documenting this somewhere, preferably in a Power-specific
> test that looks at just this kind of issue.

I might be mistaken, but I think you'll find these preconditions
explicitly defined in the string function implementation source files
for PowerPC.

> Documenting this statically is the first, in my opinion, stepping stone
> to having something like dynamic feedback.

Absolutely!

>> Unless technology evolves that you can statistically analyze data in
>> real time and adjust the implementation based on what you find (an
>> implementation with a different set of preconditions) to account for
>> this you're going to end up with a lot of in-fighting over
>> performance.
>
> Why do you assume we'll have a lot of in-fighting over performance?

I'm projecting here.  If someone proposed to adjust the PowerPC
optimized string functions to their own preconditions and it
drastically changed the performance of existing customers, or future
customers you'd see me panic.

> At present we've split the performance intensive (or so we believe)
> routines on a per-machine basis. The arguments are then going to be
> had only on a per-machine basis, and even then for each hardware
> variant can have an IFUNC resolver select the right routine at
> runtime.

Right, selecting the right variant with IFUNC has certainly helped
platforms that didn't use optimized libraries.  This is the low
hanging fruit.  So now our concern is the proliferation of micro-tuned
variants and a lack of qualified eyes to objectively review the
patches.

> Then we come upon the tunables that should allow some dynamic adjustment
> of an algorithm based on realtime data.

Yes, you can do this with tunables if the developer knows something
about the data (more about that later).

>> I've run into situations where I recommended that a customer code
>> their own string function implementation because they continually
>> encountered unaligned-data when copying-by-value in C++ functions and
>> PowerPC's string function implementations penalized unaligned copies
>> in preference for aligned copies.
>
> Provide both in glibc and expose a tunable?

So do we (the glibc community) no longer consider the proliferation of
tunables to be a mortal sin?  Or was that only with regard to
configuration options?  Regardless, it still burdens the Linux
distributions and developers who have to provide QA.

If tunables are available, then trial-and-error would help where a
user doesn't know the particulars of his application's data usage.

Using tunables is potentially problematic as well.  Often testing a
condition in highly optimized code is enough to obviate the
performance benefit you're attempting to provide. Checking for feature
availability might consume enough cycles to make it senseless to use
the facility itself.  I believe this is what happened in the early
days trying to use VMX in string routines.

Additionally, while dynamically linked applications won't suffer from
using IFUNC resolved functions (because of mandatory PLT usage), glibc
internal usage of IFUNC resolved functions very likely will if/when
forced to go through the PLT, especially on systems like PowerPC where
indirect branching is more expensive than direct branching.  When
Adhemerval's PowerPC IFUNC patches go in I'll probably argue for
keeping a 'generic' optimized version for internal libc usage.  We'll
see how it all works together.

So using tunables alone isn't necessarily a win unless it's coupled
with IFUNC.  But using IFUNC also isn't a guaranteed win in all cases.

For external usage, Using IFUNC in combination with a tunable should
be beneficial.  For instance, on systems that don't have a concrete
cacheline size (e.g., the A2 processor), at process initialization we
query the system cacheline size, populate a static with the size, and
then the string routines will query that size at runtime.  It'd be
nice to do that query at initialization and then pre-select an
implementation based on cacheline size so we don't have to test for
the cacheline size each time through the string function.

This of course increases the cost of maintaining the string routines
by having myriad of combinations.

These are all the trade-offs we weigh.

Ryan
Ondřej Bílka Sept. 3, 2013, 10:27 p.m. UTC | #23
On Tue, Sep 03, 2013 at 02:31:14PM -0500, Ryan S. Arnold wrote:
> On Tue, Sep 3, 2013 at 11:18 AM, Carlos O'Donell <carlos@redhat.com> wrote:
> > We have one, it's the glibc microbenchmark, and we want to expand it,
> > otherwise when ACME comes with their patch for ARM and breaks performance
> > for targets that Linaro cares about I have no way to reject the patch
> > objectively :-)
> 
> Can you be objective in analyzing performance when two different
> people have differing opinions on what performance preconditions
> should be coded against?
>
I fear more situations like when google and facebook will send their
implementation which saves each milion dollars in energy bill over others
implementation.

> > You need to statistically analyze the numbers, assign weights to ranges,
> > and come up with some kind of number that evaluates the results based
> > on *some* formula. That is the only way we are going to keep moving
> > performance forward (against some kind of criteria).
> 
> This sounds like establishing preconditions (what types of data will
> be optimized for).
> 
> Unless technology evolves that you can statistically analyze data in
> real time and adjust the implementation based on what you find (an
> implementation with a different set of preconditions) to account for
> this you're going to end up with a lot of in-fighting over
> performance.
>
Technology is there at least for x64, its political problem.

You do not need real time analysis most of time. When user accepts
possibility performance could be worse by constant factor then profiling
becomes easier. We could rely on data from previous runs of program
instead.

First step would be to replace ifunc selection by first running
benchmarks on processor and then creating hash table with numbers of
fastest implementation and selection will be done by this lookup.

My profiler could(modulo externalities) for each program measure
 which implementation is fastest on given processor.

Then AFAIR appending data to elf is legal so we could append hash table
with program specific numbers to binary.

A ifunc resolver will first look to end of binary if there is ifunc table.
This could be implemented that worst that can happen for false positive
is selecting slow implementation. If there is entry for current processor it 
will use it, otherwise it will look to table at end of libc.so and
repeat this step and if it was not found it will use default implementation.

This is doable, problems is motivate somebody to do profiling. My
approach tries to it possible for programmers (write testing, run it at
different machines and assemble), distributions (find enough users that
will agree to have sometimes turned profiling on and sending results.)
or end users who could turn profiling on to learn their patterns.
Ondřej Bílka Sept. 3, 2013, 11:29 p.m. UTC | #24
On Tue, Sep 03, 2013 at 03:56:29PM -0500, Ryan S. Arnold wrote:
> On Tue, Sep 3, 2013 at 2:54 PM, Carlos O'Donell <carlos@redhat.com> wrote:
> > The current set of performance preconditions are baked into the experience
> > of the core developers reviewing patches. I want the experts out of the
> > loop.
> >
> 
> This is the clutch.
> 
> Developers working for the CPU manufacturers are privy to a lot of
> unpublished timing, penalty/hazard information, as well as proprietary
> pipeline analysis tools.
> 
One problem of those on x64 is how much are these informations complete.
At least based on reading AMD optimization manuals around half of advices
is inaccurate so I need to check if they hold anyway.

> > At present we've split the performance intensive (or so we believe)
> > routines on a per-machine basis. The arguments are then going to be
> > had only on a per-machine basis, and even then for each hardware
> > variant can have an IFUNC resolver select the right routine at
> > runtime.
> 
> Right, selecting the right variant with IFUNC has certainly helped
> platforms that didn't use optimized libraries.  This is the low
> hanging fruit.  So now our concern is the proliferation of micro-tuned
> variants and a lack of qualified eyes to objectively review the
> patches.
>
Not there yet, at least for AMD, where its often using ifunc to select
slowest implementation. When you look at x64 benchmarks relation between
asymptoticaly best and selected implementation is quite random.

> > Then we come upon the tunables that should allow some dynamic adjustment
> > of an algorithm based on realtime data.
> 
> Yes, you can do this with tunables if the developer knows something
> about the data (more about that later).
> 
> >> I've run into situations where I recommended that a customer code
> >> their own string function implementation because they continually
> >> encountered unaligned-data when copying-by-value in C++ functions and
> >> PowerPC's string function implementations penalized unaligned copies
> >> in preference for aligned copies.
> >
> > Provide both in glibc and expose a tunable?
> 
> So do we (the glibc community) no longer consider the proliferation of
> tunables to be a mortal sin?  Or was that only with regard to
> configuration options?  Regardless, it still burdens the Linux
> distributions and developers who have to provide QA.
> 
> If tunables are available, then trial-and-error would help where a
> user doesn't know the particulars of his application's data usage.
>
Here auto-tuning as described in other mail would give better result.

Unless there data usage would not fit any of our variants in which case
a custom routine would be needed.
 
> Using tunables is potentially problematic as well.  Often testing a
> condition in highly optimized code is enough to obviate the
> performance benefit you're attempting to provide. Checking for feature
> availability might consume enough cycles to make it senseless to use
> the facility itself.  I believe this is what happened in the early
> days trying to use VMX in string routines.
> 
> Additionally, while dynamically linked applications won't suffer from
> using IFUNC resolved functions (because of mandatory PLT usage), glibc
> internal usage of IFUNC resolved functions very likely will if/when
> forced to go through the PLT, especially on systems like PowerPC where
> indirect branching is more expensive than direct branching.  When
> Adhemerval's PowerPC IFUNC patches go in I'll probably argue for
> keeping a 'generic' optimized version for internal libc usage.  We'll
> see how it all works together.
>
Those that are concerned about getting each bit of performance would 
recompile everything with --march=target anyway. It would be nice to
select internal version based on --march.
 
> So using tunables alone isn't necessarily a win unless it's coupled
> with IFUNC.  But using IFUNC also isn't a guaranteed win in all cases.
> 
> For external usage, Using IFUNC in combination with a tunable should
> be beneficial.  For instance, on systems that don't have a concrete
> cacheline size (e.g., the A2 processor), at process initialization we
> query the system cacheline size, populate a static with the size, and
> then the string routines will query that size at runtime.  It'd be
> nice to do that query at initialization and then pre-select an
> implementation based on cacheline size so we don't have to test for
> the cacheline size each time through the string function.
> 
> This of course increases the cost of maintaining the string routines
> by having myriad of combinations.
> 
> These are all the trade-offs we weigh.
> 
> Ryan
Carlos O'Donell Sept. 3, 2013, 11:31 p.m. UTC | #25
On 09/03/2013 04:56 PM, Ryan S. Arnold wrote:
> On Tue, Sep 3, 2013 at 2:54 PM, Carlos O'Donell <carlos@redhat.com> wrote:
>> The current set of performance preconditions are baked into the experience
>> of the core developers reviewing patches. I want the experts out of the
>> loop.
>>
> 
> This is the clutch.
> 
> Developers working for the CPU manufacturers are privy to a lot of
> unpublished timing, penalty/hazard information, as well as proprietary
> pipeline analysis tools.
> 
> Will "J. Random Hacker" working for MegaCorp tell you that the reason
> he's chosen a particular instruction sequence is because the system
> he's working on has a tiny branch cache (the size of which might be
> unpublished)?

That's an interesting point. I've seen similar things happen in *.md
file generation in gcc and had forgotten about it entirely. However,
in all such instances the developer can say "I am bound by confidentiality
not to reveal the reasons why I made my choices." We can document that.

The opposite point is also true in that we might actually tune an
implementation based on real-world data and have no idea why it behaves
optimally from an architectural perspective. In which case we have to
document "This implementation was tuned using data set X."

How are these two situations any different really?

I would accept patches in both cases, but I would like to see that
MegaCorp's patches don't change anything for the consumer level CPUs
we routinely employ.

>>> PowerPC has had the luxury of not having their performance
>>> pre-conditions contested.  PowerPC string performance is optimized
>>> based upon customer data-set analysis.  So PowerPC's preconditions are
>>> pretty concrete...  Optimize for aligned data in excess of 128-bytes
>>> (I believe).
>>
>> We should be documenting this somewhere, preferably in a Power-specific
>> test that looks at just this kind of issue.
> 
> I might be mistaken, but I think you'll find these preconditions
> explicitly defined in the string function implementation source files
> for PowerPC.

Excellent. We should probably have some more central location for this
information in a developer's guide or internals guide.

>> Documenting this statically is the first, in my opinion, stepping stone
>> to having something like dynamic feedback.
> 
> Absolutely!

Glad we agree.

>>> Unless technology evolves that you can statistically analyze data in
>>> real time and adjust the implementation based on what you find (an
>>> implementation with a different set of preconditions) to account for
>>> this you're going to end up with a lot of in-fighting over
>>> performance.
>>
>> Why do you assume we'll have a lot of in-fighting over performance?
> 
> I'm projecting here.  If someone proposed to adjust the PowerPC
> optimized string functions to their own preconditions and it
> drastically changed the performance of existing customers, or future
> customers you'd see me panic.

At least with a microbenchmark you'd know the situation was about to
go sideways immediately after running the benchmark. Otherwise it might
be quite a while before you do profiling before a big tools release.

This is exactly the situation we are in right now for x86 and x86-64,
and will likely see with ARM/AArch64 and the plethora of vendor
implementations.

>> At present we've split the performance intensive (or so we believe)
>> routines on a per-machine basis. The arguments are then going to be
>> had only on a per-machine basis, and even then for each hardware
>> variant can have an IFUNC resolver select the right routine at
>> runtime.
> 
> Right, selecting the right variant with IFUNC has certainly helped
> platforms that didn't use optimized libraries.  This is the low
> hanging fruit.  So now our concern is the proliferation of micro-tuned
> variants and a lack of qualified eyes to objectively review the
> patches.

Yes, that's a concern.

>> Then we come upon the tunables that should allow some dynamic adjustment
>> of an algorithm based on realtime data.
> 
> Yes, you can do this with tunables if the developer knows something
> about the data (more about that later).

We can't always assume ignorance, but I agree that there are problems
here.

>>> I've run into situations where I recommended that a customer code
>>> their own string function implementation because they continually
>>> encountered unaligned-data when copying-by-value in C++ functions and
>>> PowerPC's string function implementations penalized unaligned copies
>>> in preference for aligned copies.
>>
>> Provide both in glibc and expose a tunable?
> 
> So do we (the glibc community) no longer consider the proliferation of
> tunables to be a mortal sin?  Or was that only with regard to
> configuration options?  Regardless, it still burdens the Linux
> distributions and developers who have to provide QA.

We need to have a broader conversation about this very issue.

Pragmatically configuration and tunables are similar problems.

I think tunables are a mortal sin and would rather see algorithms
that can select the best implementation for the user automatically.
However, we need tunables to bootstrap that. I would hope that as
tunables appear they disappear into smart algorithms that do a better
job of tunning internals.

> If tunables are available, then trial-and-error would help where a
> user doesn't know the particulars of his application's data usage.

It would.

> Using tunables is potentially problematic as well.  Often testing a
> condition in highly optimized code is enough to obviate the
> performance benefit you're attempting to provide. Checking for feature
> availability might consume enough cycles to make it senseless to use
> the facility itself.  I believe this is what happened in the early
> days trying to use VMX in string routines.

Agreed. You have to make it configurable then, and that's costly from
a complexity perspective, and makes testing all build configurations
harder and harder.

> Additionally, while dynamically linked applications won't suffer from
> using IFUNC resolved functions (because of mandatory PLT usage), glibc
> internal usage of IFUNC resolved functions very likely will if/when
> forced to go through the PLT, especially on systems like PowerPC where
> indirect branching is more expensive than direct branching.  When
> Adhemerval's PowerPC IFUNC patches go in I'll probably argue for
> keeping a 'generic' optimized version for internal libc usage.  We'll
> see how it all works together.

Good point.

> So using tunables alone isn't necessarily a win unless it's coupled
> with IFUNC.  But using IFUNC also isn't a guaranteed win in all cases.

Unless we try to come up wtih something to make it faster on Power?

> For external usage, Using IFUNC in combination with a tunable should
> be beneficial.  For instance, on systems that don't have a concrete
> cacheline size (e.g., the A2 processor), at process initialization we
> query the system cacheline size, populate a static with the size, and
> then the string routines will query that size at runtime.  It'd be
> nice to do that query at initialization and then pre-select an
> implementation based on cacheline size so we don't have to test for
> the cacheline size each time through the string function.

Yes.

> This of course increases the cost of maintaining the string routines
> by having myriad of combinations.

Which we already have and need to test via the IFUNC testing functionality
that HJ added.

> These are all the trade-offs we weigh.

... and more.

I see your concerns, and raise you a couple more.

If we leave the situation as is we will have a continued and difficult
time accepting performance patches for functional units of the library.

We need to drive objectivity into the evaluation of the patches. It won't
be completely objective at first, but it better move in that direction. 

Cheers,
Carlos.
Siddhesh Poyarekar Sept. 4, 2013, 7:30 a.m. UTC | #26
On Tue, Sep 03, 2013 at 03:15:25PM -0400, Carlos O'Donell wrote:
> I agree. The eventual goal of the project is to have some kind of
> whole system benchmarking that allows users to feed in their profiles
> and allow us as developers to see what users are doing with our library.
> 
> Just like CPU designers feed in a whole distribution of applications
> and look at the probability of instruction selection and tweak instruction
> to microcode mappings.
> 
> I am willing to accept a certain error in the process as long as I know
> we are headed in the right direction. If we all disagree about the
> direction we are going in then we should talk about it.
> 
> I see:
> 
> microbenchmarks -> whole system benchmarks -> profile driven optimizations

I've mentioned this before - microbenchmarks are not a way to whole
system benchmarks in that they don't replace system benchmarks.  We
need to work on both in parallel because both have different goals.

A microbenchmark would have parameters such as alignment, size and
cache pressure to determine how an implementation scales.  These are
generic numbers (i.e. they're not tied to specific high level
workloads) that a developer can use to design their programs.

Whole system benchmarks however work at a different level.  They would
give an average case number that describes how a specific recipe
impacts performance of a set of programs.  An administrator would use
these to tweak the system for the workload.

> I would be happy to accept a patch that does:
> * Shows the benchmark numbers.
> * Explains relevant factors not caught by the benchmark that affect
>   performance, what they are, and why the patch should go in.
> 
> My goal is to increase the quality of the written rationales for
> performance related submissions.

Agreed.  In fact, this should go in as a large comment in the
implementation itself.  Someone had mentioned in the past (was it
Torvald?) that every assembly implementation we write should be as
verbose in comments as it can possibly be so that there is no
ambiguity about the rationale for selection of specific instruction
sequences over others.

> >> If we have N tests and they produce N numbers, for a given target,
> >> for a given device, for a given workload, there is a set of importance
> >> weights on N that should give you some kind of relevance.
> >>
> > You are jumping to case when we will have these weights. Problematic
> > part is getting those.
> 
> I agree.
> 
> It's hard to know the weights without having an intuitive understanding
> of the applications you're running on your system and what's relevant
> for their performance.

1. Assume aligned input.  Nothing should take (any noticeable)
   performance away from align copies/moves
2. Scale with size
3. Provide acceptable performance for unaligned sizes without
   penalizing the aligned case
4. Measure the effect of dcache pressure on function performance
5. Measure effect of icache pressure on function performance.

Depending on the actual cost of cache misses on different processors,
the icache/dcache miss cost would either have higher or lower weight
but for 1-3, I'd go in that order of priorities with little concern
for unaligned cases.

Siddhesh
Ondřej Bílka Sept. 4, 2013, 11:03 a.m. UTC | #27
On Wed, Sep 04, 2013 at 01:00:09PM +0530, Siddhesh Poyarekar wrote:
> On Tue, Sep 03, 2013 at 03:15:25PM -0400, Carlos O'Donell wrote:
> > I agree. The eventual goal of the project is to have some kind of
> > whole system benchmarking that allows users to feed in their profiles
> > and allow us as developers to see what users are doing with our library.
> > 
> > Just like CPU designers feed in a whole distribution of applications
> > and look at the probability of instruction selection and tweak instruction
> > to microcode mappings.
> > 
> > I am willing to accept a certain error in the process as long as I know
> > we are headed in the right direction. If we all disagree about the
> > direction we are going in then we should talk about it.
> > 
> > I see:
> > 
> > microbenchmarks -> whole system benchmarks -> profile driven optimizations
> 
> I've mentioned this before - microbenchmarks are not a way to whole
> system benchmarks in that they don't replace system benchmarks.  We
> need to work on both in parallel because both have different goals.
> 
> A microbenchmark would have parameters such as alignment, size and
> cache pressure to determine how an implementation scales.  These are
> generic numbers (i.e. they're not tied to specific high level
> workloads) that a developer can use to design their programs.
> 
> Whole system benchmarks however work at a different level.  They would
> give an average case number that describes how a specific recipe
> impacts performance of a set of programs.  An administrator would use
> these to tweak the system for the workload.
> 
> > I would be happy to accept a patch that does:
> > * Shows the benchmark numbers.
> > * Explains relevant factors not caught by the benchmark that affect
> >   performance, what they are, and why the patch should go in.
> > 
> > My goal is to increase the quality of the written rationales for
> > performance related submissions.
> 
> Agreed.  In fact, this should go in as a large comment in the
> implementation itself.  Someone had mentioned in the past (was it
> Torvald?) that every assembly implementation we write should be as
> verbose in comments as it can possibly be so that there is no
> ambiguity about the rationale for selection of specific instruction
> sequences over others.
> 
> > >> If we have N tests and they produce N numbers, for a given target,
> > >> for a given device, for a given workload, there is a set of importance
> > >> weights on N that should give you some kind of relevance.
> > >>
> > > You are jumping to case when we will have these weights. Problematic
> > > part is getting those.
> > 
> > I agree.
> > 
> > It's hard to know the weights without having an intuitive understanding
> > of the applications you're running on your system and what's relevant
> > for their performance.
> 
> 1. Assume aligned input.  Nothing should take (any noticeable)
>    performance away from align copies/moves
Not very useful as this is extremely dependant on function measured. For
functions like strcmp and strlen alignments are mostly random so aligned
case does not say much. On opposite end of spectrum is memset which is
almost always 8 byte aligned and unaligned performance does not make lot
of sense.

> 2. Scale with size
Not very important for several reasons. One is that big sizes are cold
(just look in oprofile output that loops are less frequent than header.)

Second reason is that if we look at caller large sizes are unlikely
bottleneck.

One type of usage is find delimiter like:

i=strlen(n);
for (i=0;i<n;i++)
  something();

Here for n large a strlen contribution is likely small.

Or second one is skipping parts of string. Consider following

while(p=strchr(p+1,'a')){
  something();
}

If p was 1000 byte buffer then best case is when a is not there and we
do one 1000 byte strchr. Worst case is when string consist entirely of
a's and we need to call 1 byte strchr 1000 times.

> 3. Provide acceptable performance for unaligned sizes without
>    penalizing the aligned case

This is quite important case. It should be measured correctly, what is
important is that alignment varies. This can be slower than when you
pick fixed alignment and alignment varies in reality.

> 4. Measure the effect of dcache pressure on function performance
> 5. Measure effect of icache pressure on function performance.
> 
Here you really need to base weigths on function usage patterns. 
A bigger code size is acceptable for functions that are called more
often. You need to see distribution of how are calls clustered to get
full picture. A strcmp is least sensitive to icache concerns, as when it
is called its mostly 100 times over in tight loop so size is not big issue.
If same number of call is uniformnly spread through program we need
stricter criteria.

> Depending on the actual cost of cache misses on different processors,
> the icache/dcache miss cost would either have higher or lower weight
> but for 1-3, I'd go in that order of priorities with little concern
> for unaligned cases.
> 
> Siddhesh
Siddhesh Poyarekar Sept. 4, 2013, 11:45 a.m. UTC | #28
On Wed, Sep 04, 2013 at 01:03:33PM +0200, Ondřej Bílka wrote:
> > 1. Assume aligned input.  Nothing should take (any noticeable)
> >    performance away from align copies/moves
> Not very useful as this is extremely dependant on function measured. For
> functions like strcmp and strlen alignments are mostly random so aligned
> case does not say much. On opposite end of spectrum is memset which is
> almost always 8 byte aligned and unaligned performance does not make lot
> of sense.

Agreed.  So for functions like memset/memcpy/memmove we heavily favour
aligned inputs.  For strlen/strchr/memchr we strive for acceptable
average case performance, i.e. less variance in performance.

> > 2. Scale with size
> Not very important for several reasons. One is that big sizes are cold
> (just look in oprofile output that loops are less frequent than header.)
> 
> Second reason is that if we look at caller large sizes are unlikely
> bottleneck.

I did not imply that we optimize for larger sizes - I meant that as a
general principle, the algorithm should scale reasonably for larger
sizes.  A quadratic algorithm is bad even if it gives acceptable
performance for smaller sizes.  I would consider that a pretty
important trait to monitor in the benchmark even if we won't really
get such implementations in practice.

> > 3. Provide acceptable performance for unaligned sizes without
> >    penalizing the aligned case
> 
> This is quite important case. It should be measured correctly, what is
> important is that alignment varies. This can be slower than when you
> pick fixed alignment and alignment varies in reality.

I agree that we need to measure unaligned cases correctly.

> > 4. Measure the effect of dcache pressure on function performance
> > 5. Measure effect of icache pressure on function performance.
> > 
> Here you really need to base weigths on function usage patterns. 
> A bigger code size is acceptable for functions that are called more
> often. You need to see distribution of how are calls clustered to get
> full picture. A strcmp is least sensitive to icache concerns, as when it
> is called its mostly 100 times over in tight loop so size is not big issue.
> If same number of call is uniformnly spread through program we need
> stricter criteria.

That's not necessarily true.  It may be true for specific applications
but I don't think an strcmp is always called in a tight loop.  Do you
have a qualitative argument to prove that statement or is it just
based on dry runs?

Siddhesh
Carlos O'Donell Sept. 4, 2013, 3:30 p.m. UTC | #29
On 09/04/2013 03:30 AM, Siddhesh Poyarekar wrote:
> On Tue, Sep 03, 2013 at 03:15:25PM -0400, Carlos O'Donell wrote:
>> I agree. The eventual goal of the project is to have some kind of
>> whole system benchmarking that allows users to feed in their profiles
>> and allow us as developers to see what users are doing with our library.
>>
>> Just like CPU designers feed in a whole distribution of applications
>> and look at the probability of instruction selection and tweak instruction
>> to microcode mappings.
>>
>> I am willing to accept a certain error in the process as long as I know
>> we are headed in the right direction. If we all disagree about the
>> direction we are going in then we should talk about it.
>>
>> I see:
>>
>> microbenchmarks -> whole system benchmarks -> profile driven optimizations
> 
> I've mentioned this before - microbenchmarks are not a way to whole
> system benchmarks in that they don't replace system benchmarks.  We
> need to work on both in parallel because both have different goals.

Sorry, my ASCII graph was misleading, and I agree with you.

What I meant to say is that we need to walk before we run.
Creating a microbenchmark framework will give us the tools
we need to move on to whole system benchmarks and eventually
to profile driven optimizations. We will continue to do all
3 of these activities once we reach the end of that logical
progression.

Cheers,
Carlos.
Ryan S. Arnold Sept. 4, 2013, 5:35 p.m. UTC | #30
On Wed, Sep 4, 2013 at 2:30 AM, Siddhesh Poyarekar <siddhesh@redhat.com> wrote:
> 1. Assume aligned input.  Nothing should take (any noticeable)
>    performance away from align copies/moves
> 2. Scale with size

In my experience scaling with data-size isn't really possible beyond a
certain point.  We pick a target range of sizes to optimize for based
upon customer feedback and we try to use pre-fetching in that range as
efficiently as possible.  But I get your point.  We don't want any
particular size to be severely penalized.

Each architecture and specific platform needs to know/decide what the
optimal range is and document it.  Even for Power we have different
expectations on server hardware like POWER7, vs. embedded hardware
like ppc 476.

> 3. Provide acceptable performance for unaligned sizes without
>    penalizing the aligned case

There are cases where the user can't control the alignment of the data
being fed into string functions, and we shouldn't penalize them for
these situations if possible, but in reality if a string routine shows
up hot in a profile this is a likely culprit and there's not much that
can be done once the unaligned case is made as stream-lined as
possible.

Simply testing for alignment (not presuming aligned data) itself slows
down the processing of aligned-data, but that's an unavoidable
reality.  I've chatted with some compiler folks about the possibility
of branching directly to aligned case labels in string routines if the
compiler is able to detect aligned data.. but was informed that this
suggestion might get me burned at the stake.

As previously discussed, we might be able to use tunables in the
future to mitigate this.  But of course, this would be 'use at your
own risk'.

> 4. Measure the effect of dcache pressure on function performance
> 5. Measure effect of icache pressure on function performance.
>
> Depending on the actual cost of cache misses on different processors,
> the icache/dcache miss cost would either have higher or lower weight
> but for 1-3, I'd go in that order of priorities with little concern
> for unaligned cases.

I know that icache and dcache miss penalty/costs are known for most
architectures but not whether they're "published".  I suppose we can,
at least, encourage developers for the CPU manufacturers to indicate
in the documentation of preconditions which is more expensive,
relative to the other if they're unable to indicate the exact costs of
these misses.

Some further thoughts (just to get this stuff documented):

Some performance regressions I'm familiar with (on Power), which CAN
be measured with a baseline micro-benchmark regardless of use-case:

1. Hazard/Penalties - I'm thinking things like load-hit-store in the
tail of a loop, e.g., label: load value from a register, do work,
store to same register, branch to loop.  Take a stall when the value
at the top of the loop isn't ready to load.

2. Dispatch grouping - Some instructions need to be first-in-group,
etc.  Grouping is also based on instruction alignment.  At least on
Power I believe some instructions benefit from specific alignment.

3. Instruction Grouping - Depending on topology of the pipeline,
specific groupings of instructions of might incur pipeline stalls due
to unavailability of the load/store unit (for instance).

4. Facility usage costs - Sometimes using certain facilities for
certain sizes of data are more costly than not using the facility.
For instance, I believe that using the DFPU on Power requires that the
floating-point pipeline be flushed, so BFP and DFP really shouldn't be
used together.  I believe there is a powerpc32 string function which
uses FPRs because they're 64-bits wide even on ppc32.  But we measured
the cost/benefit ratio of using this vs. not.

On Power, micro benchmarks are run in-house with these (and many
other) factors in mind.

Ryan
Ryan S. Arnold Sept. 4, 2013, 5:37 p.m. UTC | #31
On Wed, Sep 4, 2013 at 6:03 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Wed, Sep 04, 2013 at 01:00:09PM +0530, Siddhesh Poyarekar wrote:
>> 2. Scale with size
> Not very important for several reasons. One is that big sizes are cold
> (just look in oprofile output that loops are less frequent than header.)
>
> Second reason is that if we look at caller large sizes are unlikely
> bottleneck.

From my experience, extremely large data sizes are not very common.
Optimizing for those gets diminishing returns.  I believe that at very
large sizes the pressure is all on the hardware anyway.  Prefetching
large amounts of data in a loop takes a fixed amount of time and given
a large enough amount of data, the overhead introduced by most other
factors is negligible.

>> 4. Measure the effect of dcache pressure on function performance
>> 5. Measure effect of icache pressure on function performance.
>>
> Here you really need to base weigths on function usage patterns.
> A bigger code size is acceptable for functions that are called more
> often. You need to see distribution of how are calls clustered to get
> full picture. A strcmp is least sensitive to icache concerns, as when it
> is called its mostly 100 times over in tight loop so size is not big issue.
> If same number of call is uniformnly spread through program we need
> stricter criteria.

Icache pressure is probably one of the more difficult things to
measure with a benchmark.  I suppose it'd be easier with a pipeline
analyzer.

Can you explain how usage pattern analysis might reveal icache pressure?

I'm not sure how useful 'usage pattern' are when considering dcache
pressure.  On Power we have data-cache prefetch instructions and since
we know that dcache pressure is a reality, we will prefetch if our
data sizes are large enough to out-weigh the overhead of prefetching,
e.g., when the data size exceeds the cacheline size.

Ryan
Ondřej Bílka Sept. 5, 2013, 8:04 a.m. UTC | #32
On Wed, Sep 04, 2013 at 12:37:33PM -0500, Ryan S. Arnold wrote:
> On Wed, Sep 4, 2013 at 6:03 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> > On Wed, Sep 04, 2013 at 01:00:09PM +0530, Siddhesh Poyarekar wrote:
> >> 4. Measure the effect of dcache pressure on function performance
> >> 5. Measure effect of icache pressure on function performance.
> >>
> > Here you really need to base weigths on function usage patterns.
> > A bigger code size is acceptable for functions that are called more
> > often. You need to see distribution of how are calls clustered to get
> > full picture. A strcmp is least sensitive to icache concerns, as when it
> > is called its mostly 100 times over in tight loop so size is not big issue.
> > If same number of call is uniformnly spread through program we need
> > stricter criteria.
> 
> Icache pressure is probably one of the more difficult things to
> measure with a benchmark.  I suppose it'd be easier with a pipeline
> analyzer.
> 
> Can you explain how usage pattern analysis might reveal icache pressure?
>
With profiler its simple, I profiled firefox a while, results are here:
http://kam.mff.cuni.cz/~ondra/benchmark_string/strcmp_profile_firefox/result.html

Now when you look to 'Delays between calls' graph you will see peak
which is likely caused by strcmp being called in loop.

From graph about 2/3 of calls happen in less than 128 cycles since last
one. As there is limited number of cache lines that you can access in
128 cycles per call impact is smaller.

> I'm not sure how useful 'usage pattern' are when considering dcache
> pressure.  On Power we have data-cache prefetch instructions and since
> we know that dcache pressure is a reality, we will prefetch if our
> data sizes are large enough to out-weigh the overhead of prefetching,
> e.g., when the data size exceeds the cacheline size.
> 
Very useful as overhead of prefetching is determined that this quantity. 
You can have two applications that often call memset with size 16000.

First one uses memset to refresh one static array which is entirely in
L1 cache and prefetching is harmful.

Second one does random access of 1GB of memory and prefetching would
help.

Swithching to prefetching when you exceed cache size has advantage of
certainty that is will help.
Real treshold is lower as it is unlikely that large array got as
argument is only thing that occupies cache.
Ondřej Bílka Sept. 5, 2013, 11:06 a.m. UTC | #33
On Wed, Sep 04, 2013 at 12:35:46PM -0500, Ryan S. Arnold wrote:
> On Wed, Sep 4, 2013 at 2:30 AM, Siddhesh Poyarekar <siddhesh@redhat.com> wrote:
> > 3. Provide acceptable performance for unaligned sizes without
> >    penalizing the aligned case
> 
> There are cases where the user can't control the alignment of the data
> being fed into string functions, and we shouldn't penalize them for
> these situations if possible, but in reality if a string routine shows
> up hot in a profile this is a likely culprit and there's not much that
> can be done once the unaligned case is made as stream-lined as
> possible.
> 
> Simply testing for alignment (not presuming aligned data) itself slows
> down the processing of aligned-data, but that's an unavoidable
> reality.

How expensive are unaligned loads on powerpc?  On x64 a penalty for
using them is smaller than alternatives(increased branch
misprediction...)

>  I've chatted with some compiler folks about the possibility
> of branching directly to aligned case labels in string routines if the
> compiler is able to detect aligned data.. but was informed that this
> suggestion might get me burned at the stake.
> 
You would need to improve gcc detection of alignments first. Now gcc
misses most of opportunities, even in following code gcc issues
retundant alignment checks:

#include <stdint.h>
char *foo(long *x){
 if (((uintptr_t)x)%16)
  return x+4;
 else {
  __builtin_memset(x,0,512);
  return x;
 }
}

If gcc guys fix that then we do not have to ask them anything. We could
just change headers to recognize aligned case like

#define strchr(x,c) ({ char *__x=x;\
  if (__builtin_constant_p(((uintptr_t)__x)%16) && !((uintptr_t)__x)%16)\
    strchr_aligned(__x,c);\
  else\
    strchr(__x,c);\
})

> > 4. Measure the effect of dcache pressure on function performance
> > 5. Measure effect of icache pressure on function performance.
> >
> > Depending on the actual cost of cache misses on different processors,
> > the icache/dcache miss cost would either have higher or lower weight
> > but for 1-3, I'd go in that order of priorities with little concern
> > for unaligned cases.
> 
> I know that icache and dcache miss penalty/costs are known for most
> architectures but not whether they're "published".  I suppose we can,
> at least, encourage developers for the CPU manufacturers to indicate
> in the documentation of preconditions which is more expensive,
> relative to the other if they're unable to indicate the exact costs of
> these misses.
>
These cost are relatively difficult to describe, take strlen on main
memory as example.
http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strlen_profile/results_rand_nocache/result.html

Here we see hardware prefetcher in action. A time goes linearly with
size until 512 bytes and remains constant until 4096 bytes(switch to
block view) where it starts increasing at slower rate.
 
For core2 shape is similar except that plateau starts at 256 bytes and
ends at 1024 bytes.
http://kam.mff.cuni.cz/~ondra/benchmark_string/core2/strlen_profile/results_rand_nocache/result.html

AMD processors are different, phenomII performance is line, and for fx10
there is even area where time decreases with size. 
http://kam.mff.cuni.cz/~ondra/benchmark_string/phenomII/strlen_profile/results_rand_nocache/result.html 
http://kam.mff.cuni.cz/~ondra/benchmark_string/fx10/strlen_profile/results_rand_nocache/result.html
Joseph Myers Sept. 5, 2013, 11:54 a.m. UTC | #34
On Wed, 4 Sep 2013, Ryan S. Arnold wrote:

> Simply testing for alignment (not presuming aligned data) itself slows
> down the processing of aligned-data, but that's an unavoidable
> reality.  I've chatted with some compiler folks about the possibility
> of branching directly to aligned case labels in string routines if the
> compiler is able to detect aligned data.. but was informed that this
> suggestion might get me burned at the stake.

See the ARM EABI __aeabi_mem* functions.  At present the glibc versions 
just wrap or alias the generic ones, so don't take advantage of the extra 
alignment information (and the constraints on register clobbers also mean 
plain ARM memcpy gets used for them rather than the NEON version).  And 
GCC doesn't detect alignment and generate calls to those functions (which 
would be a pessimization anyway without glibc implementing these 
functions more optimally).  But the principle makes sense, subject to not 
pessimizing real use cases by expanding libc code size unduly (different 
entry points to functions may be better than duplicating large amounts of 
code for each alignment case).
Ondřej Bílka Sept. 7, 2013, 11:54 a.m. UTC | #35
On Tue, Sep 03, 2013 at 02:34:54PM -0500, Ryan S. Arnold wrote:
> On Tue, Sep 3, 2013 at 12:37 PM, Ondřej Bílka <neleai@seznam.cz> wrote:
> >> I disagree strongly. You *must* come up with a measurable answer and
> >> looking at a graph is never a solution I'm going to accept.
> >>
> > You can have that opinion.
> > Looking at performance graphs is most powerful technique how to
> > understand performance. I got most of my improvements from analyzing
> > these.
> 
> Are there any open source pipeline analysis tools?  I've found the one
> I've used (proprietary) to be a pretty good indicator of general
> instruction selection optimization/correctness.
> 
After bit of googling I found http://marss86.org/~marss86/index.php/Home
Is anyone familiar with it?

This reminded me to work on project in my backlog. Writing tool that can
legaly shuffle assembler instructions. 
Then I run simulated annaling to get optimal sheduling for given processor 
and benchmark. 
With dryrun data this will be close to real performance on small size so
I could get some percents there.
diff mbox

Patch

diff --git a/ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S b/ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S
index 3decad6..6e84173 100644
--- a/ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S
+++ b/ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S
@@ -369,8 +369,8 @@  ENTRY(memcpy)
 	cfi_adjust_cfa_offset (FRAME_SIZE)
 	cfi_rel_offset (tmp2, 0)
 	cfi_remember_state
-	and	tmp2, src, #3
-	and	tmp1, dst, #3
+	and	tmp2, src, #7
+	and	tmp1, dst, #7
 	cmp	tmp1, tmp2
 	bne	.Lcpy_notaligned