benchtests/Makefile: Run the string benchmarks four times by default.

Message ID 52274838.7010902@linaro.org
State Rejected

Commit Message

Will Newton Sept. 4, 2013, 2:48 p.m.
The string benchmarks can be affected by physical page placement, so
running them multiple times is required to account for this. Also
back up the results of the previous run, as is done for the other
benchmarks.

ChangeLog:

2013-09-03   Will Newton  <will.newton@linaro.org>

	* benchtests/Makefile (BENCH_RUNS): New variable.
	(bench-set): Run tests BENCH_RUNS times and backup old run
	results.
---
 benchtests/Makefile | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)
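
With the patch applied, the run count can be overridden on the make command line (e.g. `make bench BENCH_RUNS=8`). The backup-and-rerun logic can also be exercised as a standalone sketch — here with a placeholder `run_benchmark` function standing in for the real `$(run-bench)` invocation and `bench-memcpy` as a hypothetical benchmark name:

```shell
#!/bin/sh
# Sketch of the bench-set loop: back up old results, then run N times.
# "run_benchmark" is a stand-in for the real $(run-bench) invocation.
BENCH_RUNS=4
run=bench-memcpy
run_benchmark() { echo "fake benchmark output"; }

# Back up results from a previous invocation, as the patch does.
for old in ${run}.*.out; do
  [ -f "$old" ] && mv "$old" "${old}.old"
done

# Run the benchmark BENCH_RUNS times, one output file per run.
for count in $(seq 1 $BENCH_RUNS); do
  echo "Running ${run} (${count})"
  run_benchmark > "${run}.${count}.out"
done
```

After one invocation this leaves `bench-memcpy.1.out` through `bench-memcpy.4.out`; a second invocation first renames each of those to `.out.old`.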

Comments

Ondřej Bílka Sept. 4, 2013, 4:17 p.m. | #1
On Wed, Sep 04, 2013 at 03:48:24PM +0100, Will Newton wrote:
> 
> The string benchmarks can be affected by physical page placement, so
> running them multiple times is required to account for this. Also
> backup the results of the previous run like is done for the other
> benchmarks.
>
You do not need to do this. We should instead randomize the addresses
used, which handles this problem.
 
> ChangeLog:
> 
> 2013-09-03   Will Newton  <will.newton@linaro.org>
> 
> 	* benchtests/Makefile (BENCH_RUNS): New variable.
> 	(bench-set): Run tests BENCH_RUNS times and backup old run
> 	results.
> ---
>  benchtests/Makefile | 16 ++++++++++++++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/benchtests/Makefile b/benchtests/Makefile
> index 6037e5c..d72c98f 100644
> --- a/benchtests/Makefile
> +++ b/benchtests/Makefile
> @@ -118,6 +118,11 @@ ifndef BENCH_DURATION
>  BENCH_DURATION := 10
>  endif
> 
> +# The default number of runs of a benchmark: 4.
> +ifndef BENCH_RUNS
> +BENCH_RUNS := 4
> +endif
> +
>  CPPFLAGS-nonlib += -DDURATION=$(BENCH_DURATION)
> 
>  # Use clock_gettime to measure performance of functions.  The default is to use
> @@ -146,8 +151,15 @@ bench: bench-set bench-func
> 
>  bench-set: $(binaries-benchset)
>  	for run in $^; do \
> -	  echo "Running $${run}"; \
> -	  $(run-bench) > $${run}.out; \
> +	  for old in $${run}.*.out; do \
> +	    if [ -f $$old ]; then \
> +	      mv $$old $${old}.old; \
> +	    fi; \
> +	  done; \
> +	  for count in $$(seq 1 $(BENCH_RUNS)); do \
> +	    echo "Running $${run} ($${count})"; \
> +	    $(run-bench) > $${run}.$${count}.out; \
> +	  done; \
>  	done
> 
>  bench-func: $(binaries-bench)
> -- 
> 1.8.1.4
Will Newton Sept. 4, 2013, 4:20 p.m. | #2
On 4 September 2013 17:17, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Wed, Sep 04, 2013 at 03:48:24PM +0100, Will Newton wrote:
>>
>> The string benchmarks can be affected by physical page placement, so
>> running them multiple times is required to account for this. Also
>> backup the results of the previous run like is done for the other
>> benchmarks.
>>
> You do not need to do this. We should instead randomize addresses used
> which handles this problem.

That seems like it would be considerably more complicated. Do you have
a reason why your approach is better?
Ondřej Bílka Sept. 4, 2013, 4:52 p.m. | #3
On Wed, Sep 04, 2013 at 05:20:23PM +0100, Will Newton wrote:
> On 4 September 2013 17:17, Ondřej Bílka <neleai@seznam.cz> wrote:
> > On Wed, Sep 04, 2013 at 03:48:24PM +0100, Will Newton wrote:
> >>
> >> The string benchmarks can be affected by physical page placement, so
> >> running them multiple times is required to account for this. Also
> >> backup the results of the previous run like is done for the other
> >> benchmarks.
> >>
> > You do not need to do this. We should instead randomize addresses used
> > which handles this problem.
> 
> That seems like it would be considerably more complicated. Do you have
> a reason why your approach is better?
> 
It does not matter if it is complicated; it is required to get
reasonable results.
Ryan S. Arnold Sept. 4, 2013, 6:22 p.m. | #4
On Wed, Sep 4, 2013 at 11:52 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Wed, Sep 04, 2013 at 05:20:23PM +0100, Will Newton wrote:
>> On 4 September 2013 17:17, Ondřej Bílka <neleai@seznam.cz> wrote:
>> > On Wed, Sep 04, 2013 at 03:48:24PM +0100, Will Newton wrote:
>> >>
>> >> The string benchmarks can be affected by physical page placement, so
>> >> running them multiple times is required to account for this. Also
>> >> backup the results of the previous run like is done for the other
>> >> benchmarks.
>> >>
>> > You do not need to do this. We should instead randomize addresses used
>> > which handles this problem.

What is the objective of randomizing the addresses? What are we trying
to accomplish by different page placement?

>> That seems like it would be considerably more complicated. Do you have
>> a reason why your approach is better?
>>
> It does not matter if it is complicated, it is required to get
> reasonable results.

Your proposed solution (in a later email) is in fact not
overly complicated, but for the record, there is a very good reason to
avoid complexity in a project like glibc, which is already extremely
complex.  Some poor soul will need to maintain all of the complicated
solutions long into the future.

Ryan
Will Newton Sept. 5, 2013, 7:51 a.m. | #5
On 4 September 2013 17:52, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Wed, Sep 04, 2013 at 05:20:23PM +0100, Will Newton wrote:
>> On 4 September 2013 17:17, Ondřej Bílka <neleai@seznam.cz> wrote:
>> > On Wed, Sep 04, 2013 at 03:48:24PM +0100, Will Newton wrote:
>> >>
>> >> The string benchmarks can be affected by physical page placement, so
>> >> running them multiple times is required to account for this. Also
>> >> backup the results of the previous run like is done for the other
>> >> benchmarks.
>> >>
>> > You do not need to do this. We should instead randomize addresses used
>> > which handles this problem.
>>
>> That seems like it would be considerably more complicated. Do you have
>> a reason why your approach is better?
>>
> It does not matter if it is complicated, it is required to get
> reasonable results.

How do you define "reasonable results"?

The intention of my patch - which I may have not made completely clear
in the commit message - is to improve test stability. What I mean by
this is that with a physically indexed cache the physical pages
allocated to the test can have a significant effect on the performance
at large (e.g. cache size / ways and above) buffer sizes and this will
cause variation when running the same test multiple times. My aim is
to average out these differences as it is hard to control for them
without understanding the details of the cache subsystem of the system
you are running on.

Your test appears to be addressing concerns of test validity by
running a wider range of buffer alignments, which is an important but
separate concern IMO.
Siddhesh Poyarekar Sept. 5, 2013, 10:31 a.m. | #6
On Wed, Sep 04, 2013 at 03:48:24PM +0100, Will Newton wrote:
> 
> The string benchmarks can be affected by physical page placement, so
> running them multiple times is required to account for this. Also
> backup the results of the previous run like is done for the other
> benchmarks.

I wonder if this would make a lot of difference.  On an otherwise idle
system (which is essentially a pre-condition for benchmark runs), I
would expect more or less similar page placement (i.e. reachable via
comparable indices) for consecutive executions of a program.  You
would probably need to have something spawned in parallel to actually
make a difference.

Siddhesh
Will Newton Sept. 5, 2013, 10:36 a.m. | #7
On 5 September 2013 11:31, Siddhesh Poyarekar <siddhesh@redhat.com> wrote:
> On Wed, Sep 04, 2013 at 03:48:24PM +0100, Will Newton wrote:
>>
>> The string benchmarks can be affected by physical page placement, so
>> running them multiple times is required to account for this. Also
>> backup the results of the previous run like is done for the other
>> benchmarks.
>
> I wonder if this would make a lot of difference.  On an otherwise idle
> system (which is essentially a pre-condition for benchmark runs), I
> would expect more or less similar page placement (i.e. reachable via
> comparable indices) for consecutive executions of a program.  You
> would probably need to have something spawned in parallel to actually
> make a difference.

There is no guarantee the OS will give you a different set of physical
pages but it does seem to have the desired effect in practice.
Siddhesh Poyarekar Sept. 5, 2013, 11:05 a.m. | #8
On Thu, Sep 05, 2013 at 11:36:00AM +0100, Will Newton wrote:
> > I wonder if this would make a lot of difference.  On an otherwise idle
> > system (which is essentially a pre-condition for benchmark runs), I
> > would expect more or less similar page placement (i.e. reachable via
> > comparable indices) for consecutive executions of a program.  You
> > would probably need to have something spawned in parallel to actually
> > make a difference.
> 
> There is no guarantee the OS will give you a different set of physical
> pages but it does seem to have the desired effect in practice.

OK, multiple runs are harmless, so I don't mind this change if it
smooths out any inconsistencies in practice.  The patch also looks OK
to me, but I'd like a consolidated output instead of the four output
files, maybe as a separate change.  I'd like to see if anyone else has
objections to this, so please wait for another ack before you commit
it.

Siddhesh
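
A consolidated output of the kind suggested here could be produced from the per-run files with awk — a sketch under the assumption of a hypothetical two-column `name value` output format (the real benchmark output differs), averaging each measurement across runs:

```shell
#!/bin/sh
# Sketch: merge per-run .out files into one consolidated file by
# averaging each line's value.  Assumes a hypothetical "name value"
# two-column format, not the real benchmark output layout.
printf 'memcpy_len_64 100\nmemcpy_len_128 210\n' > demo.1.out
printf 'memcpy_len_64 120\nmemcpy_len_128 190\n' > demo.2.out

# Average the second column per key across all runs.
awk '{ sum[$1] += $2; n[$1]++ }
     END { for (k in sum) printf "%s %g\n", k, sum[k] / n[k] }' \
    demo.*.out | sort > consolidated.out

cat consolidated.out
```

With the two sample files above, the consolidated file contains the per-key means (110 and 200).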
Ondřej Bílka Sept. 5, 2013, 3:03 p.m. | #9
On Thu, Sep 05, 2013 at 08:51:53AM +0100, Will Newton wrote:
> The intention of my patch - which I may have not made completely clear
> in the commit message - is to improve test stability. What I mean by
> this is that with a physically indexed cache the physical pages
> allocated to the test can have a significant effect on the performance
> at large (e.g. cache size / ways and above) buffer sizes and this will
> cause variation when running the same test multiple times. My aim is
> to average out these differences as it is hard to control for them
> without understanding the details of the cache subsystem of the system
> you are running on.
>
This can be explained just by having more data. Simply multiplying
the iteration count by four would then do the same job.

Please run your patch ten times and calculate variance. Compare that to
variance when iteration count is increased 4 times and show if there is
improvement. 


 
> Your test appears to be addressing concerns of test validity by
> running a wider range of buffer alignments, which is an important but
> separate concern IMO.
> 
No, your patch will pick the src pointer from 4 different physical pages
(one allocated in each run) and calculate the average performance.

Mine will pick src pointers in 2000000/4096 = 488 different pages and
calculate the average.

As this makes your patch redundant, it would only add unnecessary
complexity to glibc. There is a very good reason to avoid complexity in
a project like glibc, which is already extremely complex: some poor soul
will need to maintain all of the complicated solutions long into the
future.
Will Newton Sept. 5, 2013, 3:18 p.m. | #10
On 5 September 2013 16:03, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Thu, Sep 05, 2013 at 08:51:53AM +0100, Will Newton wrote:
>> The intention of my patch - which I may have not made completely clear
>> in the commit message - is to improve test stability. What I mean by
>> this is that with a physically indexed cache the physical pages
>> allocated to the test can have a significant effect on the performance
>> at large (e.g. cache size / ways and above) buffer sizes and this will
>> cause variation when running the same test multiple times. My aim is
>> to average out these differences as it is hard to control for them
>> without understanding the details of the cache subsystem of the system
>> you are running on.
>>
> This can be explained just by having more data. Simply multiplying
> the iteration count by four would then do the same job.

No, it wouldn't. That would just mean four times as much data
resulting in a reduced variance but the same systematic error.

>> Your test appears to be addressing concerns of test validity by
>> running a wider range of buffer alignments, which is an important but
>> separate concern IMO.
>>
> No, your patch will pick src pointer at 4 different physical pages (one
> allocated in each run) and calculate average performance.
>
> Mine will pick src pointers in 2000000/4096 = 488 different pages and
> calculate average.

Yes, this would work too. But it has a number of flaws:

1. It does not allow one to analyze the performance of the code across
alignments; everything gets averaged together.
2. It has no mechanism for showing variance, whereas with multiple runs of
the same test the variance of the means can at least be seen.
3. It only works for one test (memcpy).
Ondřej Bílka Sept. 5, 2013, 4:04 p.m. | #11
On Thu, Sep 05, 2013 at 04:18:18PM +0100, Will Newton wrote:
> On 5 September 2013 16:03, Ondřej Bílka <neleai@seznam.cz> wrote:
> > On Thu, Sep 05, 2013 at 08:51:53AM +0100, Will Newton wrote:
> >> The intention of my patch - which I may have not made completely clear
> >> in the commit message - is to improve test stability. What I mean by
> >> this is that with a physically indexed cache the physical pages
> >> allocated to the test can have a significant effect on the performance
> >> at large (e.g. cache size / ways and above) buffer sizes and this will
> >> cause variation when running the same test multiple times. My aim is
> >> to average out these differences as it is hard to control for them
> >> without understanding the details of the cache subsystem of the system
> >> you are running on.
> >>
> > This can be explained just by having more data. Simply multiplying
> > the iteration count by four would then do the same job.
> 
> No, it wouldn't. That would just mean four times as much data
> resulting in a reduced variance but the same systematic error.
> 
That is your claim. I am now asking you a second time to prove it.

As I wrote in my previous mail, in the same place:

Please run your patch ten times and calculate variance. Compare that to
variance when iteration count is increased 4 times and show if there is
improvement.



> >> Your test appears to be addressing concerns of test validity by
> >> running a wider range of buffer alignments, which is an important but
> >> separate concern IMO.
> >>
> > No, your patch will pick src pointer at 4 different physical pages (one
> > allocated in each run) and calculate average performance.
> >
> > Mine will pick src pointers in 2000000/4096 = 488 different pages and
> > calculate average.
> 
> Yes, this would work too. But it has a number of flaws:
> 
> 1. It does not allow one to analyze the performance of the code across
> alignments, everything gets averaged together.

You cannot analyse performance across alignments now, as the benchmarks
do not print the necessary data.

For such analysis to make sense you would need much more data. This
would make the benchmarks as they stand slower, so it would need to be
optional. If you wanted the analysis done in a separate program you
would need to print much more data. As this would make the .out files
harder to read for humans, a better way would be to print the results
in a separate section with its own format.
If you did the analysis inside the benchmark, you would implement your
own logic, making this issue moot as well.

> 2. It has no mechanism for showing variance, whereas with multiple runs of
> the same test the variance of the means can at least be seen.

There is a pretty good mechanism for showing variance, and it is called
calculating variance. However, adding variance calculation is a separate
issue.


> 3. It only works for one test (memcpy).
>
It is a first step. Randomization is needed for all string functions,
and it is better to start with a concrete example.

 
> -- 
> Will Newton
> Toolchain Working Group, Linaro
Will Newton Sept. 5, 2013, 5:06 p.m. | #12
On 5 September 2013 17:04, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Thu, Sep 05, 2013 at 04:18:18PM +0100, Will Newton wrote:
>> On 5 September 2013 16:03, Ondřej Bílka <neleai@seznam.cz> wrote:
>> > On Thu, Sep 05, 2013 at 08:51:53AM +0100, Will Newton wrote:
>> >> The intention of my patch - which I may have not made completely clear
>> >> in the commit message - is to improve test stability. What I mean by
>> >> this is that with a physically indexed cache the physical pages
>> >> allocated to the test can have a significant effect on the performance
>> >> at large (e.g. cache size / ways and above) buffer sizes and this will
>> >> cause variation when running the same test multiple times. My aim is
>> >> to average out these differences as it is hard to control for them
>> >> without understanding the details of the cache subsystem of the system
>> >> you are running on.
>> >>
>> > This can be explained just by having more data. Simply multiplying
>> > the iteration count by four would then do the same job.
>>
>> No, it wouldn't. That would just mean four times as much data
>> resulting in a reduced variance but the same systematic error.
>>
> That is you claim. I am now asking you second time to prove it.
>
> As I write in previous mail in same place:
>
> Please run your patch ten times and calculate variance. Compare that to
> variance when iteration count is increased 4 times and show if there is
> improvement.

The benchmarks do not currently have any measure of variance, so it's
not possible to do this with the benchmarks as they stand. I have seen
this effect with other benchmarks, however.

>> >> Your test appears to be addressing concerns of test validity by
>> >> running a wider range of buffer alignments, which is an important but
>> >> separate concern IMO.
>> >>
>> > No, your patch will pick src pointer at 4 different physical pages (one
>> > allocated in each run) and calculate average performance.
>> >
>> > Mine will pick src pointers in 2000000/4096 = 488 different pages and
>> > calculate average.
>>
>> Yes, this would work too. But it has a number of flaws:
>>
>> 1. It does not allow one to analyze the performance of the code across
>> alignments; everything gets averaged together.
>
> You cannot analyse performance across alignments now as benchmarks do
> not print necessary data.

It currently prints the alignments of the buffers; that is all that is
required. The alignments chosen are a rather poor selection, though, I
would agree.

>
>> 2. It has no mechanism for showing variance, whereas with multiple runs of
>> the same test the variance of the means can at least be seen.
>
> There is a pretty good merchanism of showing variance and it is called
> calculating variance. However adding variance calculation is separate
> issue.

I think you misunderstand me. The benchmarks as they stand do not
output any measure of variance. Multiple runs are a quick and easy way
to get a measure of variance without modifying the benchmarks or their
output.
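
The cross-run spread described here can be computed directly from the per-run results — a sketch assuming each run's result has already been reduced to a single number per file (the values below are made up):

```shell
#!/bin/sh
# Sketch: compute mean and (population) variance of one measurement
# across four runs.  The numbers stand in for per-run mean timings.
printf '105\n' > var.1.out
printf '98\n'  > var.2.out
printf '110\n' > var.3.out
printf '103\n' > var.4.out

# awk reads all files; NR counts records across them all.
awk '{ x[NR] = $1; sum += $1 }
     END {
       mean = sum / NR
       for (i = 1; i <= NR; i++) ss += (x[i] - mean) ^ 2
       printf "mean=%g variance=%g\n", mean, ss / NR
     }' var.*.out
```

For these four values the script prints `mean=104 variance=18.5`.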

>> 3. It only works for one test (memcpy).
>>
> It is first step. A randomization is needed for all string functions and
> it is better to start on concrete example.

I agree completely, let's start by finding the best way to fix the
benchmarks, but once we have consensus I think it would be best to fix
all the benchmarks rather than leave some unfixed.
Carlos O'Donell Sept. 5, 2013, 11:06 p.m. | #13
On 09/05/2013 01:06 PM, Will Newton wrote:
> I agree completely, let's start by finding the best way to fix the
> benchmarks, but once we have consensus I think it would be best to fix
> all the benchmarks rather than leave some unfixed.

No, please fix things as soon as you have them fixed. There is no need
to artificially fix all of them for the sake of perfection :-)

Cheers,
Carlos.
Carlos O'Donell Sept. 5, 2013, 11:11 p.m. | #14
On 09/04/2013 10:48 AM, Will Newton wrote:
> 
> The string benchmarks can be affected by physical page placement, so
> running them multiple times is required to account for this. Also
> backup the results of the previous run like is done for the other
> benchmarks.

Siddhesh already commented on what I was going to say which was
that you have no guarantee that what you want is going to actually
happen. The OS might give you back exactly the same physical pages
on the subsequent run (though unlikely).

I agree on many of Ondrej's points. Driving engineering rigour into
these benchmarks is important.
 
> ChangeLog:
> 
> 2013-09-03   Will Newton  <will.newton@linaro.org>
> 
> 	* benchtests/Makefile (BENCH_RUNS): New variable.
> 	(bench-set): Run tests BENCH_RUNS times and backup old run
> 	results.
> ---
>  benchtests/Makefile | 16 ++++++++++++++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/benchtests/Makefile b/benchtests/Makefile
> index 6037e5c..d72c98f 100644
> --- a/benchtests/Makefile
> +++ b/benchtests/Makefile
> @@ -118,6 +118,11 @@ ifndef BENCH_DURATION
>  BENCH_DURATION := 10
>  endif
> 
> +# The default number of runs of a benchmark: 4.

This needs a lengthy comment on the purpose of BENCH_RUNS.
Why do we have it? What do we use it for? Why is 4 a good default?

> +ifndef BENCH_RUNS
> +BENCH_RUNS := 4
> +endif
> +
>  CPPFLAGS-nonlib += -DDURATION=$(BENCH_DURATION)
> 
>  # Use clock_gettime to measure performance of functions.  The default is to use
> @@ -146,8 +151,15 @@ bench: bench-set bench-func
> 
>  bench-set: $(binaries-benchset)
>  	for run in $^; do \
> -	  echo "Running $${run}"; \
> -	  $(run-bench) > $${run}.out; \
> +	  for old in $${run}.*.out; do \
> +	    if [ -f $$old ]; then \
> +	      mv $$old $${old}.old; \
> +	    fi; \
> +	  done; \
> +	  for count in $$(seq 1 $(BENCH_RUNS)); do \
> +	    echo "Running $${run} ($${count})"; \
> +	    $(run-bench) > $${run}.$${count}.out; \
> +	  done; \
>  	done
> 
>  bench-func: $(binaries-bench)
> 

Given that extra runs have no adverse effects I'm OK with this patch in principle.

Cheers,
Carlos.
Siddhesh Poyarekar Sept. 6, 2013, 5:15 a.m. | #15
On Thu, Sep 05, 2013 at 07:06:53PM -0400, Carlos O'Donell wrote:
> On 09/05/2013 01:06 PM, Will Newton wrote:
> > I agree completely, let's start by finding the best way to fix the
> > benchmarks, but once we have consensus I think it would be best to fix
> > all the benchmarks rather than leave some unfixed.
> 
> No, please fix things as soon as you have them fixed. There is no need
> to artificially fix all of them for the sake of perfection :-)

The context here is that Ondrej posted a patch to allocate a larger
buffer just for memcpy when it should have been done in the common
header.

Siddhesh

Patch

diff --git a/benchtests/Makefile b/benchtests/Makefile
index 6037e5c..d72c98f 100644
--- a/benchtests/Makefile
+++ b/benchtests/Makefile
@@ -118,6 +118,11 @@  ifndef BENCH_DURATION
 BENCH_DURATION := 10
 endif

+# The default number of runs of a benchmark: 4.
+ifndef BENCH_RUNS
+BENCH_RUNS := 4
+endif
+
 CPPFLAGS-nonlib += -DDURATION=$(BENCH_DURATION)

 # Use clock_gettime to measure performance of functions.  The default is to use
@@ -146,8 +151,15 @@  bench: bench-set bench-func

 bench-set: $(binaries-benchset)
 	for run in $^; do \
-	  echo "Running $${run}"; \
-	  $(run-bench) > $${run}.out; \
+	  for old in $${run}.*.out; do \
+	    if [ -f $$old ]; then \
+	      mv $$old $${old}.old; \
+	    fi; \
+	  done; \
+	  for count in $$(seq 1 $(BENCH_RUNS)); do \
+	    echo "Running $${run} ($${count})"; \
+	    $(run-bench) > $${run}.$${count}.out; \
+	  done; \
 	done

 bench-func: $(binaries-bench)