Message ID: 520894D5.7060207@linaro.org
State: Superseded
On 12 August 2013 08:55, Will Newton <will.newton@linaro.org> wrote:
>
> A small change to the entry to the aligned copy loop improves
> performance slightly on A9 and A15 cores for certain copies.
>
> ports/ChangeLog.arm:
>
> 2013-08-07  Will Newton  <will.newton@linaro.org>
>
>       * sysdeps/arm/armv7/multiarch/memcpy_impl.S: Tighten check
>       on entry to aligned copy loop for improved performance.
> ---
>  ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Ping?
On Mon, 12 Aug 2013, Will Newton wrote:

> A small change to the entry to the aligned copy loop improves
> performance slightly on A9 and A15 cores for certain copies.

Could you clarify what you mean by "certain copies"?  In particular, have
you verified that for all three choices in this code (NEON, VFP or
neither), the code for unaligned copies is at least as fast in this case
(common 32-bit alignment, but not common 64-bit alignment) as the code
that would previously have been used in those cases?

There are various comments regarding alignment, whether stating "LDRD/STRD
support unaligned word accesses" or referring to the mutual alignment that
applies for particular code.  Does this patch make any of them out of
date?  (If code can now only be reached with common 64-bit alignment, but
in fact requires only 32-bit alignment, the comment should probably state
both those things explicitly.)
On 30 August 2013 00:58, Joseph S. Myers <joseph@codesourcery.com> wrote:

Hi Joseph,

>> A small change to the entry to the aligned copy loop improves
>> performance slightly on A9 and A15 cores for certain copies.
>
> Could you clarify what you mean by "certain copies"?

Large copies (> 16kB) where the buffers are 4-byte aligned but not
8-byte aligned. I'll respin the patch with an improved description.

> In particular, have you verified that for all three choices in this code
> (NEON, VFP or neither), the code for unaligned copies is at least as fast
> in this case (common 32-bit alignment, but not common 64-bit alignment) as
> the code that would previously have been used in those cases?

Yes, the performance is very similar but slightly better in the NEON
case and approximately unchanged in the others.

> There are various comments regarding alignment, whether stating "LDRD/STRD
> support unaligned word accesses" or referring to the mutual alignment that
> applies for particular code. Does this patch make any of them out of
> date? (If code can now only be reached with common 64-bit alignment, but
> in fact requires only 32-bit alignment, the comment should probably state
> both those things explicitly.)

I've reviewed the comments and they all look ok as far as I can tell.
On Fri, 30 Aug 2013, Will Newton wrote:

> > There are various comments regarding alignment, whether stating "LDRD/STRD
> > support unaligned word accesses" or referring to the mutual alignment that
> > applies for particular code. Does this patch make any of them out of
> > date? (If code can now only be reached with common 64-bit alignment, but
> > in fact requires only 32-bit alignment, the comment should probably state
> > both those things explicitly.)
>
> I've reviewed the comments and they all look ok as far as I can tell.

Are you sure?  For example, where it says "SRC and DST have the same
mutual 32-bit alignment", is it not in fact the case after the patch that
they now have the same mutual 64-bit alignment, even if this code doesn't
currently rely on that?  I think the comments in each place should be
explicit about both things - what preconditions the code relies on, and
what possibly stronger preconditions are in fact true given the other code
in this file.  Saying the mutual alignment is 32-bit when the reader can
see from the code not far above that it's 64-bit just seems liable to
confuse the reader, even if the comment is still formally true.
Similarly for the requirements on unaligned word accesses - after this
patch, which uses of ldrd/strd do require that?
On 08/27/2013 03:46 AM, Will Newton wrote:
> On 12 August 2013 08:55, Will Newton <will.newton@linaro.org> wrote:
>>
>> A small change to the entry to the aligned copy loop improves
>> performance slightly on A9 and A15 cores for certain copies.
>>
>> ports/ChangeLog.arm:
>>
>> 2013-08-07  Will Newton  <will.newton@linaro.org>
>>
>>       * sysdeps/arm/armv7/multiarch/memcpy_impl.S: Tighten check
>>       on entry to aligned copy loop for improved performance.
>> ---
>>  ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> Ping?

How did you test the performance?

glibc has a performance microbenchmark, did you use that?

Cheers,
Carlos.
On 30 August 2013 16:18, Joseph S. Myers <joseph@codesourcery.com> wrote:
> On Fri, 30 Aug 2013, Will Newton wrote:
>
>> > There are various comments regarding alignment, whether stating "LDRD/STRD
>> > support unaligned word accesses" or referring to the mutual alignment that
>> > applies for particular code. Does this patch make any of them out of
>> > date? (If code can now only be reached with common 64-bit alignment, but
>> > in fact requires only 32-bit alignment, the comment should probably state
>> > both those things explicitly.)
>>
>> I've reviewed the comments and they all look ok as far as I can tell.
>
> Are you sure? For example, where it says "SRC and DST have the same
> mutual 32-bit alignment", is it not in fact the case after the patch that
> they now have the same mutual 64-bit alignment, even if this code doesn't
> currently rely on that? I think the comments in each place should be
> explicit about both things - what preconditions the code relies on, and
> what possibly stronger preconditions are in fact true given the other code
> in this file. Saying the mutual alignment is 32-bit when the reader can
> see from the code not far above that it's 64-bit just seems liable to
> confuse the reader, even if the comment is still formally true.
> Similarly for the requirements on unaligned word accesses - after this
> patch, which uses of ldrd/strd do require that?

Yes, you're right, that needs more thought. I'll have a look at it
next week.
On 30 August 2013 18:14, Carlos O'Donell <carlos@redhat.com> wrote:

Hi Carlos,

>>> A small change to the entry to the aligned copy loop improves
>>> performance slightly on A9 and A15 cores for certain copies.
>>>
>>> ports/ChangeLog.arm:
>>>
>>> 2013-08-07  Will Newton  <will.newton@linaro.org>
>>>
>>>       * sysdeps/arm/armv7/multiarch/memcpy_impl.S: Tighten check
>>>       on entry to aligned copy loop for improved performance.
>>> ---
>>>  ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> Ping?
>
> How did you test the performance?
>
> glibc has a performance microbenchmark, did you use that?

No, I used the cortex-strings package developed by Linaro for
benchmarking various string functions against one another[1].

I haven't checked the glibc benchmarks but I'll look into that. It's
quite a specific case that shows the problem so it may not be obvious
which one is better however.

[1] https://launchpad.net/cortex-strings
On 08/30/2013 02:48 PM, Will Newton wrote:
> On 30 August 2013 18:14, Carlos O'Donell <carlos@redhat.com> wrote:
>
> Hi Carlos,
>
>>>> A small change to the entry to the aligned copy loop improves
>>>> performance slightly on A9 and A15 cores for certain copies.
>>>>
>>>> ports/ChangeLog.arm:
>>>>
>>>> 2013-08-07  Will Newton  <will.newton@linaro.org>
>>>>
>>>>       * sysdeps/arm/armv7/multiarch/memcpy_impl.S: Tighten check
>>>>       on entry to aligned copy loop for improved performance.
>>>> ---
>>>>  ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S | 4 ++--
>>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> Ping?
>>
>> How did you test the performance?
>>
>> glibc has a performance microbenchmark, did you use that?
>
> No, I used the cortex-strings package developed by Linaro for
> benchmarking various string functions against one another[1].
>
> I haven't checked the glibc benchmarks but I'll look into that. It's
> quite a specific case that shows the problem so it may not be obvious
> which one is better however.

If it's not obvious how is someone supposed to review this patch? :-)

> [1] https://launchpad.net/cortex-strings

There are 2 benchmarks. One appears to be dhrystone 2.1, which isn't a
string test in and of itself and should not be used for benchmarking or
changing string functions. The other is called "multi" and appears to
run some functions in a loop and take the time, e.g.

http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/view/head:/benchmarks/multi/harness.c

I would not call `multi' exhaustive, and while neither it nor the glibc
performance benchmark is exhaustive, the glibc tests have received
review from the glibc community and are our preferred way of
demonstrating performance gains when posting performance patches.

I would really really like to see you post the results of running your
new implementation with this benchmark and show the numbers that claim
this is faster. Is that possible?

Cheers,
Carlos.
On 30 August 2013 20:26, Carlos O'Donell <carlos@redhat.com> wrote:
> On 08/30/2013 02:48 PM, Will Newton wrote:
>> On 30 August 2013 18:14, Carlos O'Donell <carlos@redhat.com> wrote:
>>
>> Hi Carlos,
>>
>>>>> A small change to the entry to the aligned copy loop improves
>>>>> performance slightly on A9 and A15 cores for certain copies.
>>>>>
>>>>> ports/ChangeLog.arm:
>>>>>
>>>>> 2013-08-07  Will Newton  <will.newton@linaro.org>
>>>>>
>>>>>       * sysdeps/arm/armv7/multiarch/memcpy_impl.S: Tighten check
>>>>>       on entry to aligned copy loop for improved performance.
>>>>> ---
>>>>>  ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S | 4 ++--
>>>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> Ping?
>>>
>>> How did you test the performance?
>>>
>>> glibc has a performance microbenchmark, did you use that?
>>
>> No, I used the cortex-strings package developed by Linaro for
>> benchmarking various string functions against one another[1].
>>
>> I haven't checked the glibc benchmarks but I'll look into that. It's
>> quite a specific case that shows the problem so it may not be obvious
>> which one is better however.
>
> If it's not obvious how is someone supposed to review this patch? :-)

With difficulty. ;-)

Joseph has raised some good points about the comments and I'll go back
through the code and make sure everything is correct in that regard.
The change was actually made to the copy of the code in cortex-strings
some time ago but I delayed pushing the patch due to the 2.18 release
so I have to refresh my memory somewhat.

Ideally we would have an agreed upon benchmark with which everyone
could analyse the performance of the code on their systems, however
that does not seem to exist as far as I can tell.

>> [1] https://launchpad.net/cortex-strings
>
> There are 2 benchmarks. One appears to be dhrystone 2.1, which isn't a string
> test in and of itself which should not be used for benchmarking or changing
> string functions. The other is called "multi" and appears to run some
> functions in a loop and take the time.

The dhrystone stuff is historical and should probably be removed.

> e.g.
> http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/view/head:/benchmarks/multi/harness.c
>
> I would not call `multi' exhaustive, and while neither is the glibc performance
> benchmark tests the glibc tests have received review from the glibc community
> and are our preferred way of demonstrating performance gains when posting
> performance patches.

The key advantage of the cortex-strings framework is that it allows
graphing the results of benchmarks. Often changes to string function
performance can only really be analysed graphically as otherwise you
end up with a huge soup of numbers, some going up, some going down and
it is very hard to separate the signal from the noise.

The glibc benchmarks also have some other weaknesses that should
really be addressed, hopefully I'll have some time to write patches
for some of this work.

> I would really really like to see you post the results of running your new
> implementation with this benchmark and show the numbers that claim this is
> faster. Is that possible?

The attached image shows a graph of memcpy performance with one 8 byte
aligned buffer and one 4 byte aligned buffer as buffer size increases.
Ignore the bionic lines, "this" is the modified code and "old" is the
old code. As you can see it is quite a marginal improvement but for a
certain class of copy it does improve performance.

--
Will Newton
Toolchain Working Group, Linaro
On 30 August 2013 20:26, Carlos O'Donell <carlos@redhat.com> wrote:
> On 08/30/2013 02:48 PM, Will Newton wrote:
>> On 30 August 2013 18:14, Carlos O'Donell <carlos@redhat.com> wrote:
>>
>> Hi Carlos,
>>
>>>>> A small change to the entry to the aligned copy loop improves
>>>>> performance slightly on A9 and A15 cores for certain copies.
>>>>>
>>>>> ports/ChangeLog.arm:
>>>>>
>>>>> 2013-08-07  Will Newton  <will.newton@linaro.org>
>>>>>
>>>>>       * sysdeps/arm/armv7/multiarch/memcpy_impl.S: Tighten check
>>>>>       on entry to aligned copy loop for improved performance.
>>>>> ---
>>>>>  ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S | 4 ++--
>>>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> Ping?
>>>
>>> How did you test the performance?
>>>
>>> glibc has a performance microbenchmark, did you use that?
>>
>> No, I used the cortex-strings package developed by Linaro for
>> benchmarking various string functions against one another[1].
>>
>> I haven't checked the glibc benchmarks but I'll look into that. It's
>> quite a specific case that shows the problem so it may not be obvious
>> which one is better however.
>
> If it's not obvious how is someone supposed to review this patch? :-)
>
>> [1] https://launchpad.net/cortex-strings
>
> There are 2 benchmarks. One appears to be dhrystone 2.1, which isn't a string
> test in and of itself which should not be used for benchmarking or changing
> string functions. The other is called "multi" and appears to run some
> functions in a loop and take the time.
>
> e.g.
> http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/view/head:/benchmarks/multi/harness.c
>
> I would not call `multi' exhaustive, and while neither is the glibc performance
> benchmark tests the glibc tests have received review from the glibc community
> and are our preferred way of demonstrating performance gains when posting
> performance patches.
>
> I would really really like to see you post the results of running your new
> implementation with this benchmark and show the numbers that claim this is
> faster. Is that possible?

The mailing list server does not seem to accept image attachments so I
have uploaded the performance graph here:

http://people.linaro.org/~will.newton/glibc_memcpy/sizes-memcpy-08-04-2.5.png
Adding libc-alpha since my comments on this are more suitable there -
maybe we should just get rid of the ports mailing list altogether.

On Mon, Sep 02, 2013 at 02:58:23PM +0100, Will Newton wrote:
> The key advantage of the cortex-strings framework is that it allows
> graphing the results of benchmarks. Often changes to string function
> performance can only really be analysed graphically as otherwise you
> end up with a huge soup of numbers, some going up, some going down and
> it is very hard to separate the signal from the noise.

We already depend on gd-devel to draw graphs for memusagestat. Maybe we
could leverage that in the microbenchmark as well and come up with html
reports rather than just a bench.out? Graphs in general are a lot more
readable and I agree that they would be the next logical improvement to
the benchmark framework.

> The glibc benchmarks also have some other weaknesses that should
> really be addressed, hopefully I'll have some time to write patches
> for some of this work.

I know Ondrej had proposed a few improvements as well. I'd like to see
those reposted so that we can look at them and, if possible, have them
merged in.

Siddhesh
On Mon, Sep 02, 2013 at 02:58:23PM +0100, Will Newton wrote:
> On 30 August 2013 20:26, Carlos O'Donell <carlos@redhat.com> wrote:
> > On 08/30/2013 02:48 PM, Will Newton wrote:
> >> On 30 August 2013 18:14, Carlos O'Donell <carlos@redhat.com> wrote:
> >>>> Ping?
> >>>
> >>> How did you test the performance?
> >>>
> >>> glibc has a performance microbenchmark, did you use that?
> >>
> >> No, I used the cortex-strings package developed by Linaro for
> >> benchmarking various string functions against one another[1].
> >>
> >> I haven't checked the glibc benchmarks but I'll look into that. It's
> >> quite a specific case that shows the problem so it may not be obvious
> >> which one is better however.
> >
> > If it's not obvious how is someone supposed to review this patch? :-)
>
> With difficulty. ;-)
>
> Joseph has raised some good points about the comments and I'll go back
> through the code and make sure everything is correct in that regard.
> The change was actually made to the copy of the code in cortex-strings
> some time ago but I delayed pushing the patch due to the 2.18 release
> so I have to refresh my memory somewhat.
>
> Ideally we would have an agreed upon benchmark with which everyone
> could analyse the performance of the code on their systems, however
> that does not seem to exist as far as I can tell.

Well, for measuring performance about the only way that everybody will
agree with is to compile the implementations as old.so and new.so and
then use

  LD_PRELOAD=old.so time cmd
  LD_PRELOAD=new.so time cmd

in a loop until you can show there is a statistically significant
difference (provided that the commands you use are representative
enough). For any other method somebody will argue the opposite, because
you forgot to take some factor into account.

Even then, accurately measuring the time spent in the function need not
be enough. You could have an implementation that is 5 cycles faster on
that benchmark but slower in reality because:

1) Its code is 1000 bytes bigger than the alternative. Gains in the
function itself will be eaten by instruction cache misses outside the
function. Or

2) The function aggressively prefetches data (say a loop that prefetches
lines 512 bytes after the current buffer position). This makes benchmark
performance better, but the cache will be littered with data from after
the end of the buffer, and real performance suffers. Or

3) For malloc, saving metadata on the same cache line as the start of
the allocated memory could make a benchmark look bad due to cache
misses. But it will improve real performance, as the user will write
there and the metadata write serves as a prefetch. Or ...

> > e.g.
> > http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/view/head:/benchmarks/multi/harness.c
> >
> > I would not call `multi' exhaustive, and while neither is the glibc performance
> > benchmark tests the glibc tests have received review from the glibc community
> > and are our preferred way of demonstrating performance gains when posting
> > performance patches.
>
> The key advantage of the cortex-strings framework is that it allows
> graphing the results of benchmarks. Often changes to string function
> performance can only really be analysed graphically as otherwise you
> end up with a huge soup of numbers, some going up, some going down and
> it is very hard to separate the signal from the noise.

Like the following?

http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/memcpy_profile_loop/results_rand/result.html

On real workloads of memcpy it is still a bit hard to see what is
going on:

http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/memcpy_profile_loop/results_gcc/result.html

> The glibc benchmarks also have some other weaknesses that should
> really be addressed, hopefully I'll have some time to write patches
> for some of this work.

How will you fix measuring in a tight loop with the same arguments and
only 32 iterations?
On 09/02/2013 10:18 AM, Will Newton wrote:
> The mailing list server does not seem to accept image attachments so I
> have uploaded the performance graph here:
>
> http://people.linaro.org/~will.newton/glibc_memcpy/sizes-memcpy-08-04-2.5.png

Excellent graph.

Cheers,
Carlos.
On 09/02/2013 09:58 AM, Will Newton wrote:
>> If it's not obvious how is someone supposed to review this patch? :-)
>
> With difficulty. ;-)

Thank you for acknowledging that.

> Joseph has raised some good points about the comments and I'll go back
> through the code and make sure everything is correct in that regard.
> The change was actually made to the copy of the code in cortex-strings
> some time ago but I delayed pushing the patch due to the 2.18 release
> so I have to refresh my memory somewhat.
>
> Ideally we would have an agreed upon benchmark with which everyone
> could analyse the performance of the code on their systems, however
> that does not seem to exist as far as I can tell.

We have one, it's the glibc microbenchmark, and we want to expand it,
otherwise when ACME comes with their patch for ARM and breaks performance
for targets that Linaro cares about I have no way to reject the patch
objectively :-)

> The key advantage of the cortex-strings framework is that it allows
> graphing the results of benchmarks. Often changes to string function
> performance can only really be analysed graphically as otherwise you
> end up with a huge soup of numbers, some going up, some going down and
> it is very hard to separate the signal from the noise.

I disagree strongly. You *must* come up with a measurable answer and
looking at a graph is never a solution I'm going to accept.

You need to statistically analyze the numbers, assign weights to ranges,
and come up with some kind of number that evaluates the results based
on *some* formula. That is the only way we are going to keep moving
performance forward (against some kind of criteria).

> The glibc benchmarks also have some other weaknesses that should
> really be addressed, hopefully I'll have some time to write patches
> for some of this work.

Thank you very much.

Cheers,
Carlos.
On Tue, Sep 03, 2013 at 12:18:24PM -0400, Carlos O'Donell wrote:
> On 09/02/2013 09:58 AM, Will Newton wrote:
> >> If it's not obvious how is someone supposed to review this patch? :-)
> >
> > With difficulty. ;-)
>
> Thank you for acknowledging that.
>
> > Joseph has raised some good points about the comments and I'll go back
> > through the code and make sure everything is correct in that regard.
> > The change was actually made to the copy of the code in cortex-strings
> > some time ago but I delayed pushing the patch due to the 2.18 release
> > so I have to refresh my memory somewhat.
> >
> > Ideally we would have an agreed upon benchmark with which everyone
> > could analyse the performance of the code on their systems, however
> > that does not seem to exist as far as I can tell.
>
> We have one, it's the glibc microbenchmark, and we want to expand it,
> otherwise when ACME comes with their patch for ARM and breaks performance
> for targets that Linaro cares about I have no way to reject the patch
> objectively :-)

Carlos, you are asking for the impossible. When you publish a benchmark,
people will try to maximize the benchmark number. After a certain point
this becomes possible only by employing shady accounting: moving part of
the time to a place where it will not be measured by the benchmark (for
example by having a function that is 4kB large; on benchmarks it will
fit into the instruction cache, but that does not happen in reality).

Taking care of the common factors that can cause that is about ten times
more complex than whole-system benchmarking, and the analysis will be
quite difficult, as you will get twenty numbers and you will need to
decide which ones could have a real impact and which won't.

> > The key advantage of the cortex-strings framework is that it allows
> > graphing the results of benchmarks. Often changes to string function
> > performance can only really be analysed graphically as otherwise you
> > end up with a huge soup of numbers, some going up, some going down and
> > it is very hard to separate the signal from the noise.
>
> I disagree strongly. You *must* come up with a measurable answer and
> looking at a graph is never a solution I'm going to accept.

You can have that opinion. Looking at performance graphs is the most
powerful technique for understanding performance. I got most of my
improvements from analyzing these.

> You need to statistically analyze the numbers, assign weights to ranges,
> and come up with some kind of number that evaluates the results based
> on *some* formula. That is the only way we are going to keep moving
> performance forward (against some kind of criteria).

Accurately assigning these weights is best done by taking a program,
running it and measuring the time. Without taking this into account the
weights will not tell you much, as you will likely just optimize cold
code at the expense of hot code.
On 09/03/2013 01:37 PM, Ondřej Bílka wrote:
>> We have one, it's the glibc microbenchmark, and we want to expand it,
>> otherwise when ACME comes with their patch for ARM and breaks performance
>> for targets that Linaro cares about I have no way to reject the patch
>> objectively :-)
>
> Carlos, you are asking for impossible. When you publish benchmark people
> will try to maximize benchmark number. After certain point this becomes
> possible only by employing shady accounting: Move part of time to place
> wehre it will not be measured by benchmark (for example by having
> function that is 4kb large, on benchmarks it will fit into instruction
> cache but that does not happen in reality).

What is it that I'm asking that is impossible?

> Taking care of common factors that can cause that is about ten times
> more complex than whole system benchmarking, analysis will be quite
> difficult as you will get twenty numbers and you will need to decide
> which ones could made real impact and which wont.

Sorry, could you clarify this a bit more, exactly what is ten times
more complex?

If we have N tests and they produce N numbers, for a given target,
for a given device, for a given workload, there is a set of importance
weights on N that should give you some kind of relevance.

We should be able to come up with some kind of framework from which
we can clearly say "this patch is better than this other patch", even
if not automated, it should be possible to reason from the results,
and that reasoning recorded as a discussion on this list.

>>> The key advantage of the cortex-strings framework is that it allows
>>> graphing the results of benchmarks. Often changes to string function
>>> performance can only really be analysed graphically as otherwise you
>>> end up with a huge soup of numbers, some going up, some going down and
>>> it is very hard to separate the signal from the noise.
>>
>> I disagree strongly. You *must* come up with a measurable answer and
>> looking at a graph is never a solution I'm going to accept.
>>
> You can have that opinion.
> Looking at performance graphs is most powerful technique how to
> understand performance. I got most of my improvements from analyzing
> these.

That is a different use for the graphs. I do not disagree that graphing
is a *powerful* way to display information and using that information to
produce a new routine is useful. What I disagree with is using such graphs
to argue qualitatively that your patch is better than the existing
implementation.

There is always a quantitative way to say X is better than Y, but it
requires breaking down your expectations and documenting them e.g.
should be faster with X alignment on sizes from N bytes to M bytes, and
then ranking based on those criteria.

>> You need to statistically analyze the numbers, assign weights to ranges,
>> and come up with some kind of number that evaluates the results based
>> on *some* formula. That is the only way we are going to keep moving
>> performance forward (against some kind of criteria).
>>
> These accurate assigning weigths is best done by taking program running
> it and measuring time. Without taking this into account weigths will not
> tell much, as you will likely just optimize cold code at expense of hot
> code.

I don't disagree with you here.

Cheers,
Carlos.
On Tue, Sep 03, 2013 at 01:52:34PM -0400, Carlos O'Donell wrote: > On 09/03/2013 01:37 PM, Ondřej Bílka wrote: > >> We have one, it's the glibc microbenchmark, and we want to expand it, > >> otherwise when ACME comes with their patch for ARM and breaks performance > >> for targets that Linaro cares about I have no way to reject the patch > >> objectively :-) > >> > > Carlos, you are asking for impossible. When you publish benchmark people > > will try to maximize benchmark number. After certain point this becomes > > possible only by employing shady accounting: Move part of time to place > > wehre it will not be measured by benchmark (for example by having > > function that is 4kb large, on benchmarks it will fit into instruction > > cache but that does not happen in reality). > > What is it that I'm asking that is impossible? > Having static set of benchmarks that can say if implementation is improvement. We are shooting to moving target, architectures change and as what we write will code that will come to users with considerable delay and factors we used for decision will change in meantime. Once implementation reach certain quality question what is better becomes dependent on program used. Until we could decide from profile feedback we will lose some percents by having to use single implementation. > > Taking care of common factors that can cause that is about ten times > > more complex than whole system benchmarking, analysis will be quite > > difficult as you will get twenty numbers and you will need to decide > > which ones could made real impact and which wont. > > Sorry, could you clarify this a bit more, exactly what is ten times > more complex? > Having benchmark suite that will catch all relevant factors that can affect performance. Some are hard to qualify for them we need to know how average program stresses resources. 
Take instruction cache usage, a function will occupy cache lines and we can accurately measure probability and cost of cache misses inside function. What is hard to estimate is how this will affect rest of program. For this we would need to know average probability that cache line will be referenced in future. > If we have N tests and they produce N numbers, for a given target, > for a given device, for a given workload, there is a set of importance > weights on N that should give you some kind of relevance. > You are jumping to case when we will have these weights. Problematic part is getting those. > We should be able to come up with some kind of framework from which > we can clearly say "this patch is better than this other patch", even > if not automated, it should be possible to reason from the results, > and that reasoning recorded as a discussion on this list. > What is possible is to say that some patch is significantly worse based on some criteria. There is lot of gray area where decision is unclear. > >>> The key advantage of the cortex-strings framework is that it allows > >>> graphing the results of benchmarks. Often changes to string function > >>> performance can only really be analysed graphically as otherwise you > >>> end up with a huge soup of numbers, some going up, some going down and > >>> it is very hard to separate the signal from the noise. > >> > >> I disagree strongly. You *must* come up with a measurable answer and > >> looking at a graph is never a solution I'm going to accept. > >> > > You can have that opinion. > > Looking at performance graphs is most powerful technique how to > > understand performance. I got most of my improvements from analyzing > > these. > > That is a different use for the graphs. I do not disagree that graphing > is a *powerful* way to display information and using that information to > produce a new routine is useful. 
> What I disagree with is using such graphs
> to argue qualitatively that your patch is better than the existing
> implementation.
>
> There is always a quantitative way to say X is better than Y, but it
> requires breaking down your expectations and documenting them e.g.
> should be faster with X alignment on sizes from N bytes to M bytes, and
> then ranking based on those criteria.
>
> >> You need to statistically analyze the numbers, assign weights to ranges,
> >> and come up with some kind of number that evaluates the results based
> >> on *some* formula. That is the only way we are going to keep moving
> >> performance forward (against some kind of criteria).
> >>
> > This accurate assigning of weights is best done by taking a program,
> > running it, and measuring time. Without taking this into account the
> > weights will not tell you much, as you will likely just optimize cold
> > code at the expense of hot code.
>
> I don't disagree with you here.
>
> Cheers,
> Carlos.
On 09/03/2013 02:57 PM, Ondřej Bílka wrote:
> On Tue, Sep 03, 2013 at 01:52:34PM -0400, Carlos O'Donell wrote:
>> On 09/03/2013 01:37 PM, Ondřej Bílka wrote:
>>>> We have one, it's the glibc microbenchmark, and we want to expand it,
>>>> otherwise when ACME comes with their patch for ARM and breaks performance
>>>> for targets that Linaro cares about I have no way to reject the patch
>>>> objectively :-)
>>>>
>>> Carlos, you are asking for impossible. When you publish benchmark people
>>> will try to maximize benchmark number. After certain point this becomes
>>> possible only by employing shady accounting: Move part of time to place
>>> where it will not be measured by benchmark (for example by having
>>> function that is 4kb large, on benchmarks it will fit into instruction
>>> cache but that does not happen in reality).
>>
>> What is it that I'm asking that is impossible?
>>
> Having a static set of benchmarks that can say whether an implementation
> is an improvement.

I agree that a static set of benchmarks will not provide us with an
accurate answer for "Is this patch objectively better for performance?"
I don't see that that makes it impossible.

The static set of benchmarks at a given point in time (since we should
be adding new benchmarks) may have some error with respect to "fastest"
for a given ISA/device/workload (which I will shorten as `workload' from
now on). Therefore I would not say it's impossible. It's probably
impossible for the error between the estimator and reality to be close
to zero though. That's just expected.

> We are shooting at a moving target: architectures change, and the code
> we write will reach users with considerable delay, so the factors we
> used for the decision will have changed in the meantime.

What's wrong with a moving target?

I have spoken with CPU designers and I've been told that they do whole
system profiling in order to assist in making instruction to microcode
decoding decisions.
Therefore what we select as optimal sequences are also fed forward into
*new* architecture designs.

> Once an implementation reaches a certain quality, the question of which
> is better becomes dependent on the program used. Until we can decide
> from profile feedback we will lose some percent by having to use a
> single implementation.

I agree.

The eventual goal of the project is to have some kind of whole system
benchmarking that allows users to feed in their profiles and allows us
as developers to see what users are doing with our library.

Just like CPU designers feed in a whole distribution of applications
and look at the probability of instruction selection and tweak
instruction to microcode mappings.

I am willing to accept a certain error in the process as long as I know
we are headed in the right direction. If we all disagree about the
direction we are going in then we should talk about it.

I see:

microbenchmarks -> whole system benchmarks -> profile driven optimizations

With the latter driving the set of tunables we expose in the library,
and possibly even the functions selected by the IFUNC resolvers at
program startup.

>>> Taking care of common factors that can cause that is about ten times
>>> more complex than whole system benchmarking, analysis will be quite
>>> difficult as you will get twenty numbers and you will need to decide
>>> which ones could make a real impact and which won't.
>>
>> Sorry, could you clarify this a bit more, exactly what is ten times
>> more complex?
>>
> Having a benchmark suite that will catch all the relevant factors that
> can affect performance. Some are hard to quantify; for those we need to
> know how an average program stresses resources.

I agree.

I would be happy to accept a patch that does:
* Shows the benchmark numbers.
* Explains relevant factors not caught by the benchmark that affect
  performance, what they are, and why the patch should go in.

My goal is to increase the quality of the written rationales for
performance related submissions.
> Take instruction cache usage: a function will occupy cache lines, and
> we can accurately measure the probability and cost of cache misses
> inside the function. What is hard to estimate is how this will affect
> the rest of the program. For this we would need to know the average
> probability that a cache line will be referenced in the future.

Good example.

>> If we have N tests and they produce N numbers, for a given target,
>> for a given device, for a given workload, there is a set of importance
>> weights on N that should give you some kind of relevance.
>>
> You are jumping to the case where we already have these weights. The
> problematic part is getting them.

I agree. It's hard to know the weights without having an intuitive
understanding of the applications you're running on your system and
what's relevant for their performance. Perhaps my example of a weighted
average is too abstract to use today.

>> We should be able to come up with some kind of framework from which
>> we can clearly say "this patch is better than this other patch", even
>> if not automated, it should be possible to reason from the results,
>> and that reasoning recorded as a discussion on this list.
>>
> What is possible is to say that some patch is significantly worse based
> on some criteria. There is a lot of gray area where the decision is
> unclear.

:-)

Cheers,
Carlos.
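Carlos's weighted-average idea above can be made concrete in a few lines. The following Python sketch is purely illustrative (the benchmark names, weights, and timings are invented, not glibc's), but it shows how per-benchmark numbers plus importance weights yield a single comparable score:

```python
# Illustrative only: invented benchmark names, weights, and timings.
def weighted_score(timings, weights):
    """Combine per-benchmark timings (ns, lower is better) into one
    score using importance weights that sum to 1.0."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(timings[name] * w for name, w in weights.items())

# Importance weights for a hypothetical workload profile.
weights = {
    "memcpy-8b-aligned":    0.5,   # small aligned copies dominate this profile
    "memcpy-4kb-aligned":   0.3,
    "memcpy-4kb-unaligned": 0.2,
}

current = {"memcpy-8b-aligned": 10.0,
           "memcpy-4kb-aligned": 400.0,
           "memcpy-4kb-unaligned": 900.0}
patched = {"memcpy-8b-aligned": 11.0,
           "memcpy-4kb-aligned": 380.0,
           "memcpy-4kb-unaligned": 620.0}

print(weighted_score(current, weights))   # ~305.0
print(weighted_score(patched, weights))   # ~243.5
# The patch wins on this profile despite losing on small copies;
# a different weight vector could reverse the verdict.
```

Which is exactly the "preconditions" problem discussed elsewhere in the thread: the score is only as objective as the agreed weights.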
On Tue, Sep 3, 2013 at 11:18 AM, Carlos O'Donell <carlos@redhat.com> wrote:
> We have one, it's the glibc microbenchmark, and we want to expand it,
> otherwise when ACME comes with their patch for ARM and breaks performance
> for targets that Linaro cares about I have no way to reject the patch
> objectively :-)

Can you be objective in analyzing performance when two different
people have differing opinions on what performance preconditions
should be coded against?

There are some cases that are obvious.. we know that from pipeline
analysis that certain instruction sequences can hinder performance.
That is objective and can be measured by a benchmark, but saying that
a particular change penalizes X sized copies but helps Y sized copies
when there are no published performance preconditions isn't. It's a
difference in opinion of what's important.

PowerPC has had the luxury of not having their performance
pre-conditions contested. PowerPC string performance is optimized
based upon customer data-set analysis. So PowerPC's preconditions are
pretty concrete... Optimize for aligned data in excess of 128-bytes
(I believe).

> You need to statistically analyze the numbers, assign weights to ranges,
> and come up with some kind of number that evaluates the results based
> on *some* formula. That is the only way we are going to keep moving
> performance forward (against some kind of criteria).

This sounds like establishing preconditions (what types of data will
be optimized for).

Unless technology evolves that you can statistically analyze data in
real time and adjust the implementation based on what you find (an
implementation with a different set of preconditions) to account for
this you're going to end up with a lot of in-fighting over
performance.
I've run into situations where I recommended that a customer code
their own string function implementation because they continually
encountered unaligned-data when copying-by-value in C++ functions and
PowerPC's string function implementations penalized unaligned copies
in preference for aligned copies.

Ryan
On Tue, Sep 3, 2013 at 12:37 PM, Ondřej Bílka <neleai@seznam.cz> wrote:
>> I disagree strongly. You *must* come up with a measurable answer and
>> looking at a graph is never a solution I'm going to accept.
>>
> You can have that opinion.
> Looking at performance graphs is most powerful technique how to
> understand performance. I got most of my improvements from analyzing
> these.

Are there any open source pipeline analysis tools? I've found the one
I've used (proprietary) to be a pretty good indicator of general
instruction selection optimization/correctness.

Graphs are useful if everyone's agreed which data-sizes and alignment
restrictions should be optimized for a particular platform.

Ryan
On 09/03/2013 03:31 PM, Ryan S. Arnold wrote:
> On Tue, Sep 3, 2013 at 11:18 AM, Carlos O'Donell <carlos@redhat.com> wrote:
>> We have one, it's the glibc microbenchmark, and we want to expand it,
>> otherwise when ACME comes with their patch for ARM and breaks performance
>> for targets that Linaro cares about I have no way to reject the patch
>> objectively :-)
>
> Can you be objective in analyzing performance when two different
> people have differing opinions on what performance preconditions
> should be coded against?

No.

> There are some cases that are obvious.. we know that from pipeline
> analysis that certain instruction sequences can hinder performance.
> That is objective and can be measured by a benchmark, but saying that
> a particular change penalizes X sized copies but helps Y sized copies
> when there are no published performance preconditions isn't. It's a
> difference in opinion of what's important.

I agree.

The project needs to adopt some set of performance preconditions and
document them and defend them. If we can't defend these positions then
we will never be able to evaluate any performance patches. We will
see-saw between several implementations over the years.

The current set of performance preconditions are baked into the
experience of the core developers reviewing patches. I want the experts
out of the loop.

> PowerPC has had the luxury of not having their performance
> pre-conditions contested. PowerPC string performance is optimized
> based upon customer data-set analysis. So PowerPC's preconditions are
> pretty concrete... Optimize for aligned data in excess of 128-bytes
> (I believe).

We should be documenting this somewhere, preferably in a Power-specific
test that looks at just this kind of issue.

Documenting this statically is the first, in my opinion, stepping stone
to having something like dynamic feedback.
>> You need to statistically analyze the numbers, assign weights to ranges,
>> and come up with some kind of number that evaluates the results based
>> on *some* formula. That is the only way we are going to keep moving
>> performance forward (against some kind of criteria).
>
> This sounds like establishing preconditions (what types of data will
> be optimized for).

I agree. We need it.

> Unless technology evolves that you can statistically analyze data in
> real time and adjust the implementation based on what you find (an
> implementation with a different set of preconditions) to account for
> this you're going to end up with a lot of in-fighting over
> performance.

Why do you assume we'll have a lot of in-fighting over performance?

At present we've split the performance intensive (or so we believe)
routines on a per-machine basis. The arguments are then going to be
had only on a per-machine basis, and even then each hardware variant
can have an IFUNC resolver select the right routine at runtime.

Then we come upon the tunables that should allow some dynamic adjustment
of an algorithm based on realtime data.

> I've run into situations where I recommended that a customer code
> their own string function implementation because they continually
> encountered unaligned-data when copying-by-value in C++ functions and
> PowerPC's string function implementations penalized unaligned copies
> in preference for aligned copies.

Provide both in glibc and expose a tunable?

Cheers,
Carlos.
On Tue, Sep 3, 2013 at 2:54 PM, Carlos O'Donell <carlos@redhat.com> wrote:
> The current set of performance preconditions are baked into the experience
> of the core developers reviewing patches. I want the experts out of the
> loop.
>

This is the clutch.

Developers working for the CPU manufacturers are privy to a lot of
unpublished timing, penalty/hazard information, as well as proprietary
pipeline analysis tools.

Will "J. Random Hacker" working for MegaCorp tell you that the reason
he's chosen a particular instruction sequence is because the system
he's working on has a tiny branch cache (the size of which might be
unpublished)?

>> PowerPC has had the luxury of not having their performance
>> pre-conditions contested. PowerPC string performance is optimized
>> based upon customer data-set analysis. So PowerPC's preconditions are
>> pretty concrete... Optimize for aligned data in excess of 128-bytes
>> (I believe).
>
> We should be documenting this somewhere, preferably in a Power-specific
> test that looks at just this kind of issue.

I might be mistaken, but I think you'll find these preconditions
explicitly defined in the string function implementation source files
for PowerPC.

> Documenting this statically is the first, in my opinion, stepping stone
> to having something like dynamic feedback.

Absolutely!

>> Unless technology evolves that you can statistically analyze data in
>> real time and adjust the implementation based on what you find (an
>> implementation with a different set of preconditions) to account for
>> this you're going to end up with a lot of in-fighting over
>> performance.
>
> Why do you assume we'll have a lot of in-fighting over performance?

I'm projecting here. If someone proposed to adjust the PowerPC
optimized string functions to their own preconditions and it
drastically changed the performance of existing customers, or future
customers, you'd see me panic.
> At present we've split the performance intensive (or so we believe)
> routines on a per-machine basis. The arguments are then going to be
> had only on a per-machine basis, and even then each hardware variant
> can have an IFUNC resolver select the right routine at runtime.

Right, selecting the right variant with IFUNC has certainly helped
platforms that didn't use optimized libraries. This is the low
hanging fruit. So now our concern is the proliferation of micro-tuned
variants and a lack of qualified eyes to objectively review the
patches.

> Then we come upon the tunables that should allow some dynamic adjustment
> of an algorithm based on realtime data.

Yes, you can do this with tunables if the developer knows something
about the data (more about that later).

>> I've run into situations where I recommended that a customer code
>> their own string function implementation because they continually
>> encountered unaligned-data when copying-by-value in C++ functions and
>> PowerPC's string function implementations penalized unaligned copies
>> in preference for aligned copies.
>
> Provide both in glibc and expose a tunable?

So do we (the glibc community) no longer consider the proliferation of
tunables to be a mortal sin? Or was that only with regard to
configuration options? Regardless, it still burdens the Linux
distributions and developers who have to provide QA.

If tunables are available, then trial-and-error would help where a
user doesn't know the particulars of his application's data usage.

Using tunables is potentially problematic as well. Often testing a
condition in highly optimized code is enough to obviate the
performance benefit you're attempting to provide. Checking for feature
availability might consume enough cycles to make it senseless to use
the facility itself. I believe this is what happened in the early
days trying to use VMX in string routines.
Additionally, while dynamically linked applications won't suffer from
using IFUNC resolved functions (because of mandatory PLT usage), glibc
internal usage of IFUNC resolved functions very likely will if/when
forced to go through the PLT, especially on systems like PowerPC where
indirect branching is more expensive than direct branching. When
Adhemerval's PowerPC IFUNC patches go in I'll probably argue for
keeping a 'generic' optimized version for internal libc usage. We'll
see how it all works together.

So using tunables alone isn't necessarily a win unless it's coupled
with IFUNC. But using IFUNC also isn't a guaranteed win in all cases.

For external usage, using IFUNC in combination with a tunable should
be beneficial. For instance, on systems that don't have a concrete
cacheline size (e.g., the A2 processor), at process initialization we
query the system cacheline size, populate a static with the size, and
then the string routines will query that size at runtime. It'd be
nice to do that query at initialization and then pre-select an
implementation based on cacheline size so we don't have to test for
the cacheline size each time through the string function.

This of course increases the cost of maintaining the string routines
by having a myriad of combinations.

These are all the trade-offs we weigh.

Ryan
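Ryan's point about querying the cacheline size once at initialization, rather than testing it on every call, can be sketched in a few lines. This Python sketch is a loose analogy only: the function names and the 128-byte threshold are invented, and in glibc this would be C code driven by the aux vector and IFUNC:

```python
# Loose analogy of IFUNC-style one-time selection; names are invented.
def copy_small_lines(src):
    """Stand-in for a variant tuned for 64-byte cache lines."""
    return bytes(src)

def copy_large_lines(src):
    """Stand-in for a variant tuned for 128-byte cache lines."""
    return bytes(src)

def query_cacheline_size():
    # In glibc this would come from the aux vector or sysconf at
    # process initialization; here we simply assume 128 bytes.
    return 128

# Selection happens once, at "startup"; every later call is a plain
# call with no per-call test of the cacheline size.
memcopy = copy_large_lines if query_cacheline_size() >= 128 else copy_small_lines

assert memcopy(b"abc") == b"abc"
```

The design point is that the hot path pays nothing for the flexibility: the conditional runs once, not on every copy.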
On Tue, Sep 03, 2013 at 02:31:14PM -0500, Ryan S. Arnold wrote:
> On Tue, Sep 3, 2013 at 11:18 AM, Carlos O'Donell <carlos@redhat.com> wrote:
> > We have one, it's the glibc microbenchmark, and we want to expand it,
> > otherwise when ACME comes with their patch for ARM and breaks performance
> > for targets that Linaro cares about I have no way to reject the patch
> > objectively :-)
>
> Can you be objective in analyzing performance when two different
> people have differing opinions on what performance preconditions
> should be coded against?
>
I fear more situations like when Google and Facebook each send their
implementation, which saves them millions of dollars in energy bills,
over the other's implementation.

> > You need to statistically analyze the numbers, assign weights to ranges,
> > and come up with some kind of number that evaluates the results based
> > on *some* formula. That is the only way we are going to keep moving
> > performance forward (against some kind of criteria).
>
> This sounds like establishing preconditions (what types of data will
> be optimized for).
>
> Unless technology evolves that you can statistically analyze data in
> real time and adjust the implementation based on what you find (an
> implementation with a different set of preconditions) to account for
> this you're going to end up with a lot of in-fighting over
> performance.
>
The technology is there, at least for x64; it's a political problem.
You do not need real-time analysis most of the time. When the user
accepts the possibility that performance could be worse by a constant
factor, profiling becomes easier: we can rely on data from previous
runs of the program instead.

The first step would be to replace ifunc selection by first running
benchmarks on the processor, then creating a hash table with the
numbers of the fastest implementations; selection would be done by this
lookup. My profiler could (modulo externalities) measure, for each
program, which implementation is fastest on a given processor.
Then, AFAIR, appending data to an ELF binary is legal, so we could
append a hash table with program-specific numbers to the binary. An
ifunc resolver would first look at the end of the binary to see if
there is an ifunc table. This could be implemented so that the worst
that can happen on a false positive is selecting a slow implementation.
If there is an entry for the current processor it will use it;
otherwise it will look at the table at the end of libc.so and repeat
this step, and if nothing is found there it will use the default
implementation.

This is doable; the problem is motivating somebody to do the profiling.
My approach tries to make it possible for programmers (write tests, run
them on different machines and assemble the results), distributions
(find enough users who agree to have profiling turned on sometimes and
to send the results), or end users who could turn profiling on to learn
their patterns.
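The two-level lookup described above (a per-binary table consulted first, then a libc-wide table, then a safe default) can be sketched as follows. Everything here is hypothetical: the CPU names, table contents, and implementation names are invented for illustration.

```python
# Hypothetical tables and names, invented for illustration.
IMPLEMENTATIONS = {
    "memcpy_default": lambda s: bytes(s),
    "memcpy_neon":    lambda s: bytes(s),
    "memcpy_vfp":     lambda s: bytes(s),
}

# Table appended to this binary from its own profiling runs.
binary_table = {"cortex-a15": "memcpy_neon"}
# Generic per-CPU table shipped at the end of libc.so.
libc_table = {"cortex-a15": "memcpy_vfp", "cortex-a9": "memcpy_vfp"}

def resolve(cpu):
    """Resolver: binary table first, then libc table, then default.
    The worst a stale or wrong entry can do is pick a slower variant."""
    name = binary_table.get(cpu) or libc_table.get(cpu) or "memcpy_default"
    return IMPLEMENTATIONS[name]

assert resolve("cortex-a15") is IMPLEMENTATIONS["memcpy_neon"]   # binary wins
assert resolve("cortex-a9") is IMPLEMENTATIONS["memcpy_vfp"]     # libc fallback
assert resolve("unknown-cpu") is IMPLEMENTATIONS["memcpy_default"]
```

The fallback chain is what makes the scheme safe: an unknown processor degrades to the default implementation rather than failing.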
On Tue, Sep 03, 2013 at 03:56:29PM -0500, Ryan S. Arnold wrote:
> On Tue, Sep 3, 2013 at 2:54 PM, Carlos O'Donell <carlos@redhat.com> wrote:
> > The current set of performance preconditions are baked into the experience
> > of the core developers reviewing patches. I want the experts out of the
> > loop.
> >
>
> This is the clutch.
>
> Developers working for the CPU manufacturers are privy to a lot of
> unpublished timing, penalty/hazard information, as well as proprietary
> pipeline analysis tools.
>
One problem with that on x64 is how complete the information is. At
least based on reading the AMD optimization manuals, around half of the
advice is inaccurate, so I need to check whether it holds anyway.

> > At present we've split the performance intensive (or so we believe)
> > routines on a per-machine basis. The arguments are then going to be
> > had only on a per-machine basis, and even then each hardware variant
> > can have an IFUNC resolver select the right routine at runtime.
>
> Right, selecting the right variant with IFUNC has certainly helped
> platforms that didn't use optimized libraries. This is the low
> hanging fruit. So now our concern is the proliferation of micro-tuned
> variants and a lack of qualified eyes to objectively review the
> patches.
>
We are not there yet, at least for AMD, where ifunc often selects the
slowest implementation. When you look at x64 benchmarks, the relation
between the asymptotically best and the selected implementation is
quite random.

> > Then we come upon the tunables that should allow some dynamic adjustment
> > of an algorithm based on realtime data.
>
> Yes, you can do this with tunables if the developer knows something
> about the data (more about that later).
>
> >> I've run into situations where I recommended that a customer code
> >> their own string function implementation because they continually
> >> encountered unaligned-data when copying-by-value in C++ functions and
> >> PowerPC's string function implementations penalized unaligned copies
> >> in preference for aligned copies.
> >
> > Provide both in glibc and expose a tunable?
>
> So do we (the glibc community) no longer consider the proliferation of
> tunables to be a mortal sin? Or was that only with regard to
> configuration options? Regardless, it still burdens the Linux
> distributions and developers who have to provide QA.
>
> If tunables are available, then trial-and-error would help where a
> user doesn't know the particulars of his application's data usage.
>
Here auto-tuning as described in the other mail would give a better
result, unless their data usage does not fit any of our variants, in
which case a custom routine would be needed.

> Using tunables is potentially problematic as well. Often testing a
> condition in highly optimized code is enough to obviate the
> performance benefit you're attempting to provide. Checking for feature
> availability might consume enough cycles to make it senseless to use
> the facility itself. I believe this is what happened in the early
> days trying to use VMX in string routines.
>
> Additionally, while dynamically linked applications won't suffer from
> using IFUNC resolved functions (because of mandatory PLT usage), glibc
> internal usage of IFUNC resolved functions very likely will if/when
> forced to go through the PLT, especially on systems like PowerPC where
> indirect branching is more expensive than direct branching. When
> Adhemerval's PowerPC IFUNC patches go in I'll probably argue for
> keeping a 'generic' optimized version for internal libc usage. We'll
> see how it all works together.
>
Those who are concerned about getting every bit of performance would
recompile everything with --march=target anyway.
It would be nice to select the internal version based on --march.

> So using tunables alone isn't necessarily a win unless it's coupled
> with IFUNC. But using IFUNC also isn't a guaranteed win in all cases.
>
> For external usage, Using IFUNC in combination with a tunable should
> be beneficial. For instance, on systems that don't have a concrete
> cacheline size (e.g., the A2 processor), at process initialization we
> query the system cacheline size, populate a static with the size, and
> then the string routines will query that size at runtime. It'd be
> nice to do that query at initialization and then pre-select an
> implementation based on cacheline size so we don't have to test for
> the cacheline size each time through the string function.
>
> This of course increases the cost of maintaining the string routines
> by having myriad of combinations.
>
> These are all the trade-offs we weigh.
>
> Ryan
On 09/03/2013 04:56 PM, Ryan S. Arnold wrote:
> On Tue, Sep 3, 2013 at 2:54 PM, Carlos O'Donell <carlos@redhat.com> wrote:
>> The current set of performance preconditions are baked into the experience
>> of the core developers reviewing patches. I want the experts out of the
>> loop.
>>
>
> This is the clutch.
>
> Developers working for the CPU manufacturers are privy to a lot of
> unpublished timing, penalty/hazard information, as well as proprietary
> pipeline analysis tools.
>
> Will "J. Random Hacker" working for MegaCorp tell you that the reason
> he's chosen a particular instruction sequence is because the system
> he's working on has a tiny branch cache (the size of which might be
> unpublished)?

That's an interesting point. I've seen similar things happen in *.md
file generation in gcc and had forgotten about it entirely.

However, in all such instances the developer can say "I am bound by
confidentiality not to reveal the reasons why I made my choices." We
can document that.

The opposite point is also true in that we might actually tune an
implementation based on real-world data and have no idea why it behaves
optimally from an architectural perspective. In which case we have to
document "This implementation was tuned using data set X."

How are these two situations any different really?

I would accept patches in both cases, but I would like to see that
MegaCorp's patches don't change anything for the consumer level CPUs we
routinely employ.

>>> PowerPC has had the luxury of not having their performance
>>> pre-conditions contested. PowerPC string performance is optimized
>>> based upon customer data-set analysis. So PowerPC's preconditions are
>>> pretty concrete... Optimize for aligned data in excess of 128-bytes
>>> (I believe).
>>
>> We should be documenting this somewhere, preferably in a Power-specific
>> test that looks at just this kind of issue.
>
> I might be mistaken, but I think you'll find these preconditions
> explicitly defined in the string function implementation source files
> for PowerPC.

Excellent. We should probably have some more central location for this
information in a developer's guide or internals guide.

>> Documenting this statically is the first, in my opinion, stepping stone
>> to having something like dynamic feedback.
>
> Absolutely!

Glad we agree.

>>> Unless technology evolves that you can statistically analyze data in
>>> real time and adjust the implementation based on what you find (an
>>> implementation with a different set of preconditions) to account for
>>> this you're going to end up with a lot of in-fighting over
>>> performance.
>>
>> Why do you assume we'll have a lot of in-fighting over performance?
>
> I'm projecting here. If someone proposed to adjust the PowerPC
> optimized string functions to their own preconditions and it
> drastically changed the performance of existing customers, or future
> customers you'd see me panic.

At least with a microbenchmark you'd know the situation was about to go
sideways immediately after running the benchmark. Otherwise it might be
quite a while before you do profiling before a big tools release. This
is exactly the situation we are in right now for x86 and x86-64, and
will likely see with ARM/AArch64 and the plethora of vendor
implementations.

>> At present we've split the performance intensive (or so we believe)
>> routines on a per-machine basis. The arguments are then going to be
>> had only on a per-machine basis, and even then each hardware variant
>> can have an IFUNC resolver select the right routine at runtime.
>
> Right, selecting the right variant with IFUNC has certainly helped
> platforms that didn't use optimized libraries. This is the low
> hanging fruit. So now our concern is the proliferation of micro-tuned
> variants and a lack of qualified eyes to objectively review the
> patches.

Yes, that's a concern.
>> Then we come upon the tunables that should allow some dynamic adjustment
>> of an algorithm based on realtime data.
>
> Yes, you can do this with tunables if the developer knows something
> about the data (more about that later).

We can't always assume ignorance, but I agree that there are problems
here.

>>> I've run into situations where I recommended that a customer code
>>> their own string function implementation because they continually
>>> encountered unaligned-data when copying-by-value in C++ functions and
>>> PowerPC's string function implementations penalized unaligned copies
>>> in preference for aligned copies.
>>
>> Provide both in glibc and expose a tunable?
>
> So do we (the glibc community) no longer consider the proliferation of
> tunables to be a mortal sin? Or was that only with regard to
> configuration options? Regardless, it still burdens the Linux
> distributions and developers who have to provide QA.

We need to have a broader conversation about this very issue.
Pragmatically, configuration and tunables are similar problems.

I think tunables are a mortal sin and would rather see algorithms that
can select the best implementation for the user automatically. However,
we need tunables to bootstrap that. I would hope that as tunables
appear they disappear into smart algorithms that do a better job of
tuning internals.

> If tunables are available, then trial-and-error would help where a
> user doesn't know the particulars of his application's data usage.

It would.

> Using tunables is potentially problematic as well. Often testing a
> condition in highly optimized code is enough to obviate the
> performance benefit you're attempting to provide. Checking for feature
> availability might consume enough cycles to make it senseless to use
> the facility itself. I believe this is what happened in the early
> days trying to use VMX in string routines.

Agreed.
You have to make it configurable then, and that's costly from a
complexity perspective, and makes testing all build configurations
harder and harder.

> Additionally, while dynamically linked applications won't suffer from
> using IFUNC resolved functions (because of mandatory PLT usage), glibc
> internal usage of IFUNC resolved functions very likely will if/when
> forced to go through the PLT, especially on systems like PowerPC where
> indirect branching is more expensive than direct branching. When
> Adhemerval's PowerPC IFUNC patches go in I'll probably argue for
> keeping a 'generic' optimized version for internal libc usage. We'll
> see how it all works together.

Good point.

> So using tunables alone isn't necessarily a win unless it's coupled
> with IFUNC. But using IFUNC also isn't a guaranteed win in all cases.

Unless we try to come up with something to make it faster on Power?

> For external usage, Using IFUNC in combination with a tunable should
> be beneficial. For instance, on systems that don't have a concrete
> cacheline size (e.g., the A2 processor), at process initialization we
> query the system cacheline size, populate a static with the size, and
> then the string routines will query that size at runtime. It'd be
> nice to do that query at initialization and then pre-select an
> implementation based on cacheline size so we don't have to test for
> the cacheline size each time through the string function.

Yes.

> This of course increases the cost of maintaining the string routines
> by having myriad of combinations.

Which we already have and need to test via the IFUNC testing
functionality that HJ added.

> These are all the trade-offs we weigh.

... and more. I see your concerns, and raise you a couple more.

If we leave the situation as is we will have a continued and difficult
time accepting performance patches for functional units of the library.
We need to drive objectivity into the evaluation of the patches.
It won't be completely objective at first, but it better move in that direction. Cheers, Carlos.
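The initialization-time selection discussed above ("do the query at
initialization and pre-select an implementation based on cacheline
size") can be sketched outside of glibc's actual IFUNC machinery with a
GCC constructor that installs a function pointer once, so the hot path
pays no per-call check. This is only an illustrative sketch: the
function names are invented, both variants here just defer to memcpy,
and `_SC_LEVEL1_DCACHE_LINESIZE` is a glibc-specific sysconf extension
that may report 0 or -1 when the kernel doesn't export the geometry.

```c
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical implementations tuned for different cacheline sizes.
   Both defer to memcpy here; real versions would differ in their
   prefetch distance and unrolling.  */
static void *copy_line32 (void *d, const void *s, size_t n)
{ return memcpy (d, s, n); }
static void *copy_line128 (void *d, const void *s, size_t n)
{ return memcpy (d, s, n); }

/* Selected once at startup; callers never test the cacheline size.  */
static void *(*copy_impl) (void *, const void *, size_t) = copy_line32;

__attribute__ ((constructor))
static void select_copy_impl (void)
{
  /* May return 0 or -1 if the cacheline size is unknown; in that case
     we keep the conservative default.  */
  long line = sysconf (_SC_LEVEL1_DCACHE_LINESIZE);
  if (line >= 128)
    copy_impl = copy_line128;
}

void *my_copy (void *d, const void *s, size_t n)
{ return copy_impl (d, s, n); }
```

glibc itself resolves this through IFUNC at relocation time rather than
a constructor, but the cost structure is the same: one query at
startup, zero tests per call.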
On Tue, Sep 03, 2013 at 03:15:25PM -0400, Carlos O'Donell wrote:
> I agree. The eventual goal of the project is to have some kind of
> whole system benchmarking that allows users to feed in their profiles
> and allow us as developers to see what users are doing with our library.
>
> Just like CPU designers feed in a whole distribution of applications
> and look at the probability of instruction selection and tweak instruction
> to microcode mappings.
>
> I am willing to accept a certain error in the process as long as I know
> we are headed in the right direction. If we all disagree about the
> direction we are going in then we should talk about it.
>
> I see:
>
> microbenchmarks -> whole system benchmarks -> profile driven optimizations

I've mentioned this before - microbenchmarks are not a path to whole
system benchmarks, in the sense that they don't replace them. We need
to work on both in parallel because they have different goals.

A microbenchmark would have parameters such as alignment, size and
cache pressure to determine how an implementation scales. These are
generic numbers (i.e. they're not tied to specific high level
workloads) that a developer can use to design their programs.

Whole system benchmarks however work at a different level. They would
give an average case number that describes how a specific recipe
impacts performance of a set of programs. An administrator would use
these to tweak the system for the workload.

> I would be happy to accept a patch that does:
> * Shows the benchmark numbers.
> * Explains relevant factors not caught by the benchmark that affect
> performance, what they are, and why the patch should go in.
>
> My goal is to increase the quality of the written rationales for
> performance related submissions.

Agreed. In fact, this should go in as a large comment in the
implementation itself. Someone had mentioned in the past (was it
Torvald?)
that every assembly implementation we write should be as verbose in
comments as it can possibly be so that there is no ambiguity about the
rationale for selection of specific instruction sequences over others.

> >> If we have N tests and they produce N numbers, for a given target,
> >> for a given device, for a given workload, there is a set of importance
> >> weights on N that should give you some kind of relevance.
> >>
> > You are jumping to case when we will have these weights. Problematic
> > part is getting those.
>
> I agree.
>
> It's hard to know the weights without having an intuitive understanding
> of the applications you're running on your system and what's relevant
> for their performance.

1. Assume aligned input. Nothing should take (any noticeable)
   performance away from aligned copies/moves.
2. Scale with size.
3. Provide acceptable performance for unaligned sizes without
   penalizing the aligned case.
4. Measure the effect of dcache pressure on function performance.
5. Measure the effect of icache pressure on function performance.

Depending on the actual cost of cache misses on different processors,
the icache/dcache miss cost would either have higher or lower weight,
but for 1-3 I'd go in that order of priorities with little concern for
unaligned cases.

Siddhesh
On Wed, Sep 04, 2013 at 01:00:09PM +0530, Siddhesh Poyarekar wrote: > On Tue, Sep 03, 2013 at 03:15:25PM -0400, Carlos O'Donell wrote: > > I agree. The eventual goal of the project is to have some kind of > > whole system benchmarking that allows users to feed in their profiles > > and allow us as developers to see what users are doing with our library. > > > > Just like CPU designers feed in a whole distribution of applications > > and look at the probability of instruction selection and tweak instruction > > to microcode mappings. > > > > I am willing to accept a certain error in the process as long as I know > > we are headed in the right direction. If we all disagree about the > > direction we are going in then we should talk about it. > > > > I see: > > > > microbenchmarks -> whole system benchmarks -> profile driven optimizations > > I've mentioned this before - microbenchmarks are not a way to whole > system benchmarks in that they don't replace system benchmarks. We > need to work on both in parallel because both have different goals. > > A microbenchmark would have parameters such as alignment, size and > cache pressure to determine how an implementation scales. These are > generic numbers (i.e. they're not tied to specific high level > workloads) that a developer can use to design their programs. > > Whole system benchmarks however work at a different level. They would > give an average case number that describes how a specific recipe > impacts performance of a set of programs. An administrator would use > these to tweak the system for the workload. > > > I would be happy to accept a patch that does: > > * Shows the benchmark numbers. > > * Explains relevant factors not caught by the benchmark that affect > > performance, what they are, and why the patch should go in. > > > > My goal is to increase the quality of the written rationales for > > performance related submissions. > > Agreed. 
> In fact, this should go in as a large comment in the implementation
> itself. Someone had mentioned in the past (was it Torvald?) that
> every assembly implementation we write should be as verbose in
> comments as it can possibly be so that there is no ambiguity about
> the rationale for selection of specific instruction sequences over
> others.
>
> > >> If we have N tests and they produce N numbers, for a given target,
> > >> for a given device, for a given workload, there is a set of importance
> > >> weights on N that should give you some kind of relevance.
> > >>
> > > You are jumping to case when we will have these weights. Problematic
> > > part is getting those.
> >
> > I agree.
> >
> > It's hard to know the weights without having an intuitive understanding
> > of the applications you're running on your system and what's relevant
> > for their performance.
>
> 1. Assume aligned input. Nothing should take (any noticeable)
> performance away from aligned copies/moves

Not very useful, as this is extremely dependent on the function
measured. For functions like strcmp and strlen alignments are mostly
random, so the aligned case does not say much. On the opposite end of
the spectrum is memset, which is almost always 8-byte aligned, so
unaligned performance does not make a lot of sense.

> 2. Scale with size

Not very important for several reasons. One is that big sizes are cold
(just look at oprofile output: loops are less frequent than headers).

The second reason is that if we look at the callers, large sizes are
unlikely to be the bottleneck. One type of usage is finding a
delimiter, like:

  n = strlen (s);
  for (i = 0; i < n; i++)
    something ();

Here, for large n, the strlen contribution is likely small. A second
one is skipping parts of a string. Consider the following:

  while ((p = strchr (p + 1, 'a')))
    something ();

If p is a 1000-byte buffer, the best case is when 'a' is not there and
we do one 1000-byte strchr. The worst case is when the string consists
entirely of 'a's and we need to call a 1-byte strchr 1000 times.

> 3.
> Provide acceptable performance for unaligned sizes without
> penalizing the aligned case

This is quite an important case. It should be measured correctly: what
matters is that the alignment varies. A function can be slower when
the alignment varies in reality than when you benchmark with a fixed
alignment.

> 4. Measure the effect of dcache pressure on function performance
> 5. Measure effect of icache pressure on function performance.
>

Here you really need to base the weights on function usage patterns.
A bigger code size is acceptable for functions that are called more
often. You need to see the distribution of how calls are clustered to
get the full picture. A strcmp is least sensitive to icache concerns,
as when it is called it is mostly called 100 times over in a tight
loop, so size is not a big issue. If the same number of calls were
spread uniformly through the program, we would need stricter criteria.

> Depending on the actual cost of cache misses on different processors,
> the icache/dcache miss cost would either have higher or lower weight
> but for 1-3, I'd go in that order of priorities with little concern
> for unaligned cases.
>
> Siddhesh
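The fixed-versus-varying alignment distinction can be made concrete
with a tiny driver. This is an invented illustration, not glibc's
benchmark code: with `vary = 0` every call sees the same misalignment
(the branch predictor learns the alignment path, flattering the
result), while with `vary = 1` the misalignment cycles per call, which
is closer to what strcmp/strlen see in practice.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical driver: sums strlen over many calls.  The buffer holds
   254 'x' bytes followed by a NUL at index 255, so strlen(buf + off)
   is 255 - off.  */
size_t drive_strlen (int vary, int calls)
{
  static char buf[256] __attribute__ ((aligned (64)));
  memset (buf, 'x', sizeof buf - 1);
  buf[sizeof buf - 1] = '\0';

  size_t total = 0;
  for (int i = 0; i < calls; i++)
    {
      /* Fixed misalignment of 1, or a misalignment cycling 0..15.  */
      size_t off = vary ? (size_t) (i % 16) : 1;
      total += strlen (buf + off);
    }
  return total;
}
```

A real measurement would time both modes; only the varying mode shows
the branch-misprediction cost that a fixed-alignment loop hides.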
On Wed, Sep 04, 2013 at 01:03:33PM +0200, Ondřej Bílka wrote:
> > 1. Assume aligned input. Nothing should take (any noticeable)
> > performance away from aligned copies/moves
> Not very useful, as this is extremely dependent on the function
> measured. For functions like strcmp and strlen alignments are mostly
> random, so the aligned case does not say much. On the opposite end of
> the spectrum is memset, which is almost always 8-byte aligned, so
> unaligned performance does not make a lot of sense.

Agreed. So for functions like memset/memcpy/memmove we heavily favour
aligned inputs. For strlen/strchr/memchr we strive for acceptable
average case performance, i.e. less variance in performance.

> > 2. Scale with size
> Not very important for several reasons. One is that big sizes are cold
> (just look at oprofile output: loops are less frequent than headers).
>
> The second reason is that if we look at the callers, large sizes are
> unlikely to be the bottleneck.

I did not imply that we optimize for larger sizes - I meant that as a
general principle, the algorithm should scale reasonably for larger
sizes. A quadratic algorithm is bad even if it gives acceptable
performance for smaller sizes. I would consider that a pretty important
trait to monitor in the benchmark even if we won't really get such
implementations in practice.

> > 3. Provide acceptable performance for unaligned sizes without
> > penalizing the aligned case
>
> This is quite an important case. It should be measured correctly: what
> matters is that the alignment varies. A function can be slower when
> the alignment varies in reality than when you benchmark with a fixed
> alignment.

I agree that we need to measure unaligned cases correctly.

> > 4. Measure the effect of dcache pressure on function performance
> > 5. Measure effect of icache pressure on function performance.
>
> Here you really need to base the weights on function usage patterns.
> A bigger code size is acceptable for functions that are called more
> often.
> You need to see the distribution of how calls are clustered to get the
> full picture. A strcmp is least sensitive to icache concerns, as when
> it is called it is mostly 100 times over in a tight loop, so size is
> not a big issue. If the same number of calls is uniformly spread
> through the program we need stricter criteria.

That's not necessarily true. It may be true for specific applications,
but I don't think a strcmp is always called in a tight loop. Do you
have a qualitative argument to prove that statement or is it just based
on dry runs?

Siddhesh
On 09/04/2013 03:30 AM, Siddhesh Poyarekar wrote: > On Tue, Sep 03, 2013 at 03:15:25PM -0400, Carlos O'Donell wrote: >> I agree. The eventual goal of the project is to have some kind of >> whole system benchmarking that allows users to feed in their profiles >> and allow us as developers to see what users are doing with our library. >> >> Just like CPU designers feed in a whole distribution of applications >> and look at the probability of instruction selection and tweak instruction >> to microcode mappings. >> >> I am willing to accept a certain error in the process as long as I know >> we are headed in the right direction. If we all disagree about the >> direction we are going in then we should talk about it. >> >> I see: >> >> microbenchmarks -> whole system benchmarks -> profile driven optimizations > > I've mentioned this before - microbenchmarks are not a way to whole > system benchmarks in that they don't replace system benchmarks. We > need to work on both in parallel because both have different goals. Sorry, my ASCII graph was misleading, and I agree with you. What I meant to say is that we need to walk before we run. Creating a microbenchmark framework will give us the tools we need to move on to whole system benchmarks and eventually to profile driven optimizations. We will continue to do all 3 of these activities once we reach the end of that logical progression. Cheers, Carlos.
On Wed, Sep 4, 2013 at 2:30 AM, Siddhesh Poyarekar <siddhesh@redhat.com> wrote:
> 1. Assume aligned input. Nothing should take (any noticeable)
> performance away from aligned copies/moves
> 2. Scale with size

In my experience scaling with data size isn't really possible beyond a
certain point. We pick a target range of sizes to optimize for based
upon customer feedback, and we try to use prefetching in that range as
efficiently as possible.

But I get your point. We don't want any particular size to be severely
penalized. Each architecture and specific platform needs to know/decide
what the optimal range is and document it. Even for Power we have
different expectations for server hardware like POWER7 vs. embedded
hardware like the ppc 476.

> 3. Provide acceptable performance for unaligned sizes without
> penalizing the aligned case

There are cases where the user can't control the alignment of the data
being fed into string functions, and we shouldn't penalize them for
these situations if possible. In reality, though, if a string routine
shows up hot in a profile this is a likely culprit, and there's not
much that can be done once the unaligned case is made as streamlined
as possible.

Simply testing for alignment (not presuming aligned data) itself slows
down the processing of aligned data, but that's an unavoidable reality.
I've chatted with some compiler folks about the possibility of
branching directly to aligned case labels in string routines if the
compiler is able to detect aligned data... but was informed that this
suggestion might get me burned at the stake.

As previously discussed, we might be able to use tunables in the future
to mitigate this. But of course, this would be 'use at your own risk'.

> 4. Measure the effect of dcache pressure on function performance
> 5. Measure effect of icache pressure on function performance.
>
> Depending on the actual cost of cache misses on different processors,
> the icache/dcache miss cost would either have higher or lower weight
> but for 1-3, I'd go in that order of priorities with little concern
> for unaligned cases.

I know that icache and dcache miss penalties/costs are known for most
architectures but not whether they're "published". I suppose we can, at
least, encourage developers for the CPU manufacturers to indicate in
the documentation of preconditions which is more expensive, relative to
the other, if they're unable to indicate the exact costs of these
misses.

Some further thoughts (just to get this stuff documented):

Some performance regressions I'm familiar with (on Power), which CAN be
measured with a baseline micro-benchmark regardless of use-case:

1. Hazards/Penalties - I'm thinking of things like a load-hit-store in
the tail of a loop, e.g., load a value from a memory location, do work,
store back to the same location, and branch to the top of the loop;
take a stall when the just-stored value isn't ready to load again at
the top of the next iteration.

2. Dispatch grouping - Some instructions need to be first-in-group,
etc. Grouping is also based on instruction alignment. At least on Power
I believe some instructions benefit from specific alignment.

3. Instruction grouping - Depending on the topology of the pipeline,
specific groupings of instructions might incur pipeline stalls due to
unavailability of the load/store unit (for instance).

4. Facility usage costs - Sometimes using certain facilities for
certain sizes of data is more costly than not using the facility. For
instance, I believe that using the DFPU on Power requires that the
floating-point pipeline be flushed, so BFP and DFP really shouldn't be
used together. I believe there is a powerpc32 string function which
uses FPRs because they're 64-bits wide even on ppc32. But we measured
the cost/benefit ratio of using this vs. not.

On Power, micro-benchmarks are run in-house with these (and many other)
factors in mind.

Ryan
On Wed, Sep 4, 2013 at 6:03 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Wed, Sep 04, 2013 at 01:00:09PM +0530, Siddhesh Poyarekar wrote:
>> 2. Scale with size
> Not very important for several reasons. One is that big sizes are cold
> (just look at oprofile output: loops are less frequent than headers).
>
> The second reason is that if we look at the callers, large sizes are
> unlikely to be the bottleneck.

From my experience, extremely large data sizes are not very common.
Optimizing for those gets diminishing returns. I believe that at very
large sizes the pressure is all on the hardware anyway. Prefetching
large amounts of data in a loop takes a fixed amount of time and, given
a large enough amount of data, the overhead introduced by most other
factors is negligible.

>> 4. Measure the effect of dcache pressure on function performance
>> 5. Measure effect of icache pressure on function performance.
>>
> Here you really need to base the weights on function usage patterns.
> A bigger code size is acceptable for functions that are called more
> often. You need to see the distribution of how calls are clustered to
> get the full picture. A strcmp is least sensitive to icache concerns,
> as when it is called it is mostly 100 times over in a tight loop so
> size is not a big issue. If the same number of calls is uniformly
> spread through the program we need stricter criteria.

Icache pressure is probably one of the more difficult things to measure
with a benchmark. I suppose it'd be easier with a pipeline analyzer.

Can you explain how usage pattern analysis might reveal icache
pressure?

I'm not sure how useful 'usage patterns' are when considering dcache
pressure. On Power we have data-cache prefetch instructions and, since
we know that dcache pressure is a reality, we will prefetch if our data
sizes are large enough to outweigh the overhead of prefetching, e.g.,
when the data size exceeds the cacheline size.

Ryan
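The size-gated prefetch Ryan describes can be sketched in portable C
with GCC's `__builtin_prefetch` (on Power this would compile down to,
or be replaced by, dcbt). The threshold constant and function name are
invented for illustration; a real implementation would hoist the size
test out of the loop and use a tuned prefetch distance.

```c
#include <stddef.h>

/* Assumed cacheline size; real code would use a per-CPU value.  */
#define CACHELINE 128

/* Hypothetical word-at-a-time copy that prefetches ahead only when the
   copy is large enough for the prefetch overhead to pay off.
   Prefetching a few words past the end of the source is harmless:
   prefetch hints never fault.  */
void copy_words (unsigned long *dst, const unsigned long *src,
                 size_t nwords)
{
  size_t bytes = nwords * sizeof *src;
  for (size_t i = 0; i < nwords; i++)
    {
      if (bytes > CACHELINE)
        /* Read prefetch (rw = 0), no temporal locality (locality 0):
           stream the data, don't displace the rest of the cache.  */
        __builtin_prefetch (src + i + CACHELINE / sizeof *src, 0, 0);
      dst[i] = src[i];
    }
}
```

For copies at or below a cacheline the branch always falls through, so
the small-copy path pays only the (predictable) size test.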
On Wed, Sep 04, 2013 at 12:37:33PM -0500, Ryan S. Arnold wrote:
> On Wed, Sep 4, 2013 at 6:03 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> > On Wed, Sep 04, 2013 at 01:00:09PM +0530, Siddhesh Poyarekar wrote:
> >> 4. Measure the effect of dcache pressure on function performance
> >> 5. Measure effect of icache pressure on function performance.
> >>
> > Here you really need to base the weights on function usage patterns.
> > A bigger code size is acceptable for functions that are called more
> > often. You need to see the distribution of how calls are clustered
> > to get the full picture. A strcmp is least sensitive to icache
> > concerns, as when it is called it is mostly 100 times over in a
> > tight loop so size is not a big issue. If the same number of calls
> > is uniformly spread through the program we need stricter criteria.
>
> Icache pressure is probably one of the more difficult things to
> measure with a benchmark. I suppose it'd be easier with a pipeline
> analyzer.
>
> Can you explain how usage pattern analysis might reveal icache pressure?
>

With a profiler it's simple. I profiled firefox for a while; the
results are here:

http://kam.mff.cuni.cz/~ondra/benchmark_string/strcmp_profile_firefox/result.html

Now when you look at the 'Delays between calls' graph you will see a
peak which is likely caused by strcmp being called in a loop. From the
graph, about 2/3 of calls happen less than 128 cycles after the last
one. As there is a limited number of cache lines that you can access in
128 cycles, the per-call icache impact is smaller.

> I'm not sure how useful 'usage patterns' are when considering dcache
> pressure. On Power we have data-cache prefetch instructions and since
> we know that dcache pressure is a reality, we will prefetch if our
> data sizes are large enough to outweigh the overhead of prefetching,
> e.g., when the data size exceeds the cacheline size.
>

Very useful, as the overhead of prefetching is determined by this
quantity. You can have two applications that often call memset with
size 16000.
The first one uses memset to refresh one static array which fits
entirely in the L1 cache, so prefetching is harmful. The second one
does random accesses over 1GB of memory, where prefetching would help.

Switching to prefetching when you exceed the cache size has the
advantage of certainty that it will help. The real threshold is lower,
as it is unlikely that the large array passed as an argument is the
only thing occupying the cache.
On Wed, Sep 04, 2013 at 12:35:46PM -0500, Ryan S. Arnold wrote:
> On Wed, Sep 4, 2013 at 2:30 AM, Siddhesh Poyarekar <siddhesh@redhat.com> wrote:
> > 3. Provide acceptable performance for unaligned sizes without
> > penalizing the aligned case
>
> There are cases where the user can't control the alignment of the data
> being fed into string functions, and we shouldn't penalize them for
> these situations if possible, but in reality if a string routine shows
> up hot in a profile this is a likely culprit and there's not much that
> can be done once the unaligned case is made as streamlined as
> possible.
>
> Simply testing for alignment (not presuming aligned data) itself slows
> down the processing of aligned data, but that's an unavoidable
> reality.

How expensive are unaligned loads on powerpc? On x64 the penalty for
using them is smaller than the alternatives (increased branch
misprediction...)

> I've chatted with some compiler folks about the possibility
> of branching directly to aligned case labels in string routines if the
> compiler is able to detect aligned data... but was informed that this
> suggestion might get me burned at the stake.
>

You would need to improve gcc's detection of alignments first.
Currently gcc misses most opportunities; even in the following code gcc
issues redundant alignment checks:

#include <stdint.h>
long *foo (long *x)
{
  if ((uintptr_t) x % 16)
    return x + 4;
  else
    {
      __builtin_memset (x, 0, 512);
      return x;
    }
}

If the gcc guys fix that then we do not have to ask them for anything.
We could just change the headers to recognize the aligned case, like:

#define strchr(x,c) \
  ({ char *__x = (x);                                  \
     (__builtin_constant_p ((uintptr_t) __x % 16)      \
      && (uintptr_t) __x % 16 == 0)                    \
       ? strchr_aligned (__x, (c))                     \
       : strchr (__x, (c));                            \
  })

> > 4. Measure the effect of dcache pressure on function performance
> > 5. Measure effect of icache pressure on function performance.
> >
> > Depending on the actual cost of cache misses on different processors,
> > the icache/dcache miss cost would either have higher or lower weight
> > but for 1-3, I'd go in that order of priorities with little concern
> > for unaligned cases.
>
> I know that icache and dcache miss penalties/costs are known for most
> architectures but not whether they're "published". I suppose we can,
> at least, encourage developers for the CPU manufacturers to indicate
> in the documentation of preconditions which is more expensive,
> relative to the other if they're unable to indicate the exact costs of
> these misses.
>

These costs are relatively difficult to describe; take strlen on main
memory as an example.

http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strlen_profile/results_rand_nocache/result.html

Here we see the hardware prefetcher in action. Time grows linearly with
size until 512 bytes, remains constant until 4096 bytes (switch to the
block view), and then starts increasing again at a slower rate.

For core2 the shape is similar, except that the plateau starts at 256
bytes and ends at 1024 bytes.

http://kam.mff.cuni.cz/~ondra/benchmark_string/core2/strlen_profile/results_rand_nocache/result.html

AMD processors are different: the phenomII curve is a straight line,
and for fx10 there is even an area where time decreases with size.

http://kam.mff.cuni.cz/~ondra/benchmark_string/phenomII/strlen_profile/results_rand_nocache/result.html
http://kam.mff.cuni.cz/~ondra/benchmark_string/fx10/strlen_profile/results_rand_nocache/result.html
On Wed, 4 Sep 2013, Ryan S. Arnold wrote: > Simply testing for alignment (not presuming aligned data) itself slows > down the processing of aligned-data, but that's an unavoidable > reality. I've chatted with some compiler folks about the possibility > of branching directly to aligned case labels in string routines if the > compiler is able to detect aligned data.. but was informed that this > suggestion might get me burned at the stake. See the ARM EABI __aeabi_mem* functions. At present the glibc versions just wrap or alias the generic ones, so don't take advantage of the extra alignment information (and the constraints on register clobbers also mean plain ARM memcpy gets used for them rather than the NEON version). And GCC doesn't detect alignment and generate calls to those functions (which would be a pessimization anyway without glibc implementing these functions more optimally). But the principle makes sense, subject to not pessimizing real use cases by expanding libc code size unduly (different entry points to functions may be better than duplicating large amounts of code for each alignment case).
On Tue, Sep 03, 2013 at 02:34:54PM -0500, Ryan S. Arnold wrote:
> On Tue, Sep 3, 2013 at 12:37 PM, Ondřej Bílka <neleai@seznam.cz> wrote:
> >> I disagree strongly. You *must* come up with a measurable answer and
> >> looking at a graph is never a solution I'm going to accept.
> >>
> > You can have that opinion.
> > Looking at performance graphs is the most powerful technique for
> > understanding performance. I got most of my improvements from
> > analyzing these.
>
> Are there any open source pipeline analysis tools? I've found the one
> I've used (proprietary) to be a pretty good indicator of general
> instruction selection optimization/correctness.
>

After a bit of googling I found

http://marss86.org/~marss86/index.php/Home

Is anyone familiar with it?

This reminded me of a project in my backlog: writing a tool that can
legally shuffle assembler instructions. Then I would run simulated
annealing to get an optimal scheduling for a given processor and
benchmark. With dry-run data this would be close to real performance at
small sizes, so I could gain a few percent there.
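The annealing idea can be sketched with a toy model. Everything here is
invented for illustration: "instructions" are small integers, the cost
function pretends that scheduling a producer directly before its
consumer causes a stall, and the search only swaps adjacent slots. A
real tool would need dependence analysis to know which reorderings are
legal, and a per-CPU pipeline model (or measured timings) as the cost.

```c
#include <stdlib.h>

#define N 8  /* toy "instruction" count */

/* Toy cost model: instruction i feeds instruction i + 1; placing a
   producer immediately before its consumer costs one stall.  */
int cost (const int order[N])
{
  int c = 0;
  for (int i = 0; i + 1 < N; i++)
    if (order[i + 1] == order[i] + 1)
      c++;
  return c;
}

/* Minimal simulated annealing over orderings: propose an adjacent
   swap, always accept improvements, accept regressions with a
   probability that decays as the "temperature" falls.  Returns the
   best cost seen; `order` is left in the last accepted state.  */
int anneal (int order[N], int steps)
{
  int cur = cost (order), best = cur;
  for (int t = 0; t < steps; t++)
    {
      int i = rand () % (N - 1);
      int tmp = order[i]; order[i] = order[i + 1]; order[i + 1] = tmp;
      int c = cost (order);
      int temp = steps - t;  /* crude linear cooling schedule */
      if (c <= cur || rand () % steps < temp / 4)
        {
          cur = c;  /* accept the move */
          if (c < best)
            best = c;
        }
      else
        {
          /* Reject: undo the swap.  */
          tmp = order[i]; order[i] = order[i + 1]; order[i + 1] = tmp;
        }
    }
  return best;
}
```

Starting from the identity order (cost 7 under this toy model, since
every pair stalls), a few thousand steps drive the cost down; swapping
the toy cost for dry-run-derived timings gives the tool sketched above.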
diff --git a/ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S b/ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S index 3decad6..6e84173 100644 --- a/ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S +++ b/ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S @@ -369,8 +369,8 @@ ENTRY(memcpy) cfi_adjust_cfa_offset (FRAME_SIZE) cfi_rel_offset (tmp2, 0) cfi_remember_state - and tmp2, src, #3 - and tmp1, dst, #3 + and tmp2, src, #7 + and tmp1, dst, #7 cmp tmp1, tmp2 bne .Lcpy_notaligned