From a9e76ff1bc8b69e23a50d8cdc9556270cf7b269d Mon Sep 17 00:00:00 2001
From: marxin <mliska@suse.cz>
Date: Fri, 21 Oct 2016 13:26:19 +0200
Subject: [PATCH] Introduce loop store motion for UPDATE_COVERAGE_COUNTER
gcc/testsuite/ChangeLog:
2016-10-24 Martin Liska <mliska@suse.cz>
* gcc.dg/tree-ssa/ssa-lim-11.c: Update scanned pattern.
gcc/ChangeLog:
2016-10-24 Martin Liska <mliska@suse.cz>
* Makefile.in: Add new file profile-expand.c.
* internal-fn.c (expand_UPDATE_COVERAGE_COUNTER): New IFN.
* internal-fn.def: Likewise.
* passes.def: Add new pass profile_expand.
* profile-expand.c: New file.
* tree-pass.h (make_pass_profile_expand): Declare a new
function.
* tree-profile.c (gimple_gen_edge_profiler): Generate the new
internal fn.
* tree-ssa-loop-im.c (loop_suitable_for_sm): Move to header
file.
(move_coverage_counter_update): New function.
(process_sm_for_coverage_counter): Likewise.
(pass_lim::execute): Call invariant motion for
UPDATE_COVERAGE_COUNTER internal functions.
* tree-ssa-loop.h: Move the function here from
tree-ssa-loop-im.c.
* value-prof.h: Declare expand_coverage_counter_ifns.
---
gcc/Makefile.in | 1 +
gcc/internal-fn.c | 6 ++
gcc/internal-fn.def | 2 +
gcc/passes.def | 1 +
gcc/profile-expand.c | 143 ++++++++++++++++++++++++++
gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-11.c | 3 +-
gcc/tree-pass.h | 1 +
gcc/tree-profile.c | 37 +------
gcc/tree-ssa-loop-im.c | 157 +++++++++++++++++++++++++----
gcc/tree-ssa-loop.h | 18 ++++
gcc/value-prof.h | 1 +
11 files changed, 315 insertions(+), 55 deletions(-)
create mode 100644 gcc/profile-expand.c
@@ -1404,6 +1404,7 @@ OBJS = \
print-rtl-function.o \
print-tree.o \
profile.o \
+ profile-expand.o \
real.o \
realmpfr.o \
recog.o \
@@ -254,6 +254,12 @@ expand_FALLTHROUGH (internal_fn, gcall *call)
"invalid use of attribute %<fallthrough%>");
}
+static void
+expand_UPDATE_COVERAGE_COUNTER (internal_fn, gcall *)
+{
+ gcc_unreachable ();
+}
+
/* Helper function for expand_addsub_overflow. Return 1
if ARG interpreted as signed in its precision is known to be always
positive or 2 if ARG is known to be always negative, or 3 if ARG may
@@ -198,6 +198,8 @@ DEF_INTERNAL_FN (ATOMIC_COMPARE_EXCHANGE, ECF_LEAF | ECF_NOTHROW, NULL)
/* To implement [[fallthrough]]. */
DEF_INTERNAL_FN (FALLTHROUGH, ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (UPDATE_COVERAGE_COUNTER, ECF_LEAF | ECF_NOTHROW, NULL)
+
#undef DEF_INTERNAL_INT_FN
#undef DEF_INTERNAL_FLT_FN
#undef DEF_INTERNAL_OPTAB_FN
@@ -386,6 +386,7 @@ along with GCC; see the file COPYING3. If not see
NEXT_PASS (pass_asan_O0);
NEXT_PASS (pass_tsan_O0);
NEXT_PASS (pass_sanopt);
+ NEXT_PASS (pass_profile_expand);
NEXT_PASS (pass_cleanup_eh);
NEXT_PASS (pass_lower_resx);
NEXT_PASS (pass_nrv);
new file mode 100644
@@ -0,0 +1,143 @@
+/* Profile expand pass.
+ Copyright (C) 2003-2016 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it
+under the terms of the GNU General Public License as published by the
+Free Software Foundation; either version 3, or (at your option) any
+later version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT
+ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3. If not see
+<http://www.gnu.org/licenses/>. */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "memmodel.h"
+#include "backend.h"
+#include "target.h"
+#include "tree.h"
+#include "gimple.h"
+#include "cfghooks.h"
+#include "tree-pass.h"
+#include "ssa.h"
+#include "coverage.h"
+#include "varasm.h"
+#include "tree-nested.h"
+#include "gimplify.h"
+#include "gimple-iterator.h"
+#include "gimplify-me.h"
+#include "tree-cfg.h"
+#include "tree-into-ssa.h"
+#include "value-prof.h"
+#include "profile.h"
+#include "tree-cfgcleanup.h"
+#include "params.h"
+
+void
+expand_coverage_counter_ifns (void)
+{
+ basic_block bb;
+ tree f = builtin_decl_explicit (LONG_LONG_TYPE_SIZE > 32
+ ? BUILT_IN_ATOMIC_FETCH_ADD_8:
+ BUILT_IN_ATOMIC_FETCH_ADD_4);
+
+ FOR_EACH_BB_FN (bb, cfun)
+ {
+ gimple_stmt_iterator gsi;
+ for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+ {
+ gimple *stmt = gsi_stmt (gsi);
+ if (gimple_call_internal_p (stmt, IFN_UPDATE_COVERAGE_COUNTER))
+ {
+ tree addr = gimple_call_arg (stmt, 0);
+ tree value = gimple_call_arg (stmt, 1);
+ if (flag_profile_update == PROFILE_UPDATE_ATOMIC)
+ {
+ gcall *stmt
+ = gimple_build_call (f, 3, addr, value,
+ build_int_cst (integer_type_node,
+ MEMMODEL_RELAXED));
+ gsi_replace (&gsi, stmt, true);
+ }
+ else
+ {
+ gcc_assert (TREE_CODE (addr) == ADDR_EXPR);
+ tree ref = TREE_OPERAND (addr, 0);
+ tree gcov_type_tmp_var
+ = make_temp_ssa_name (get_gcov_type (), NULL,
+ "PROF_edge_counter");
+ gassign *stmt1 = gimple_build_assign (gcov_type_tmp_var, ref);
+ gcov_type_tmp_var
+ = make_temp_ssa_name (get_gcov_type (), NULL,
+ "PROF_edge_counter");
+ gassign *stmt2
+ = gimple_build_assign (gcov_type_tmp_var, PLUS_EXPR,
+ gimple_assign_lhs (stmt1), value);
+ gassign *stmt3
+ = gimple_build_assign (unshare_expr (ref),
+ gimple_assign_lhs (stmt2));
+ gsi_insert_seq_before (&gsi, stmt1, GSI_SAME_STMT);
+ gsi_insert_seq_before (&gsi, stmt2, GSI_SAME_STMT);
+ gsi_replace (&gsi, stmt3, GSI_SAME_STMT);
+ }
+ }
+ }
+ }
+}
+
+/* Profile expand pass. */
+
+namespace {
+
+const pass_data pass_data_profile_expand =
+{
+ GIMPLE_PASS, /* type */
+ "profile_expand", /* name */
+ OPTGROUP_LOOP, /* optinfo_flags */
+ TV_LIM, /* tv_id */
+ PROP_cfg, /* properties_required */
+ 0, /* properties_provided */
+ 0, /* properties_destroyed */
+ 0, /* todo_flags_start */
+ TODO_update_ssa, /* todo_flags_finish */
+};
+
+class pass_profile_expand : public gimple_opt_pass
+{
+public:
+ pass_profile_expand (gcc::context *ctxt)
+ : gimple_opt_pass (pass_data_profile_expand, ctxt)
+ {}
+
+ /* opt_pass methods: */
+ opt_pass * clone () { return new pass_profile_expand (m_ctxt); }
+ virtual bool gate (function *) { return true; }
+ virtual unsigned int execute (function *);
+
+}; // class pass_profile_expand
+
+unsigned int
+pass_profile_expand::execute (function *)
+{
+ expand_coverage_counter_ifns ();
+
+ return 0;
+}
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_profile_expand (gcc::context *ctxt)
+{
+ return new pass_profile_expand (ctxt);
+}
+
+
@@ -22,4 +22,5 @@ void access_buf(struct thread_param* p)
}
}
-/* { dg-final { scan-tree-dump-times "Executing store motion of __gcov0.access_buf\\\[\[01\]\\\] from loop 1" 2 "lim2" } } */
+/* { dg-final { scan-tree-dump-times "Executing store motion of &__gcov0.access_buf\\\[\[01\]\\\] from loop 1" 1 "lim2" } } */
+/* { dg-final { scan-tree-dump-times "Executing store motion of &__gcov0.access_buf\\\[\[01\]\\\] from loop 2" 1 "lim2" } } */
@@ -365,6 +365,7 @@ extern gimple_opt_pass *make_pass_fix_loops (gcc::context *ctxt);
extern gimple_opt_pass *make_pass_tree_loop (gcc::context *ctxt);
extern gimple_opt_pass *make_pass_tree_no_loop (gcc::context *ctxt);
extern gimple_opt_pass *make_pass_tree_loop_init (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_profile_expand (gcc::context *ctxt);
extern gimple_opt_pass *make_pass_lim (gcc::context *ctxt);
extern gimple_opt_pass *make_pass_tree_unswitch (gcc::context *ctxt);
extern gimple_opt_pass *make_pass_predcom (gcc::context *ctxt);
@@ -252,40 +252,13 @@ gimple_init_edge_profiler (void)
void
gimple_gen_edge_profiler (int edgeno, edge e)
{
- tree one;
+ tree one = build_int_cst (gcov_type_node, 1);
- one = build_int_cst (gcov_type_node, 1);
-
- if (flag_profile_update == PROFILE_UPDATE_ATOMIC)
- {
- /* __atomic_fetch_add (&counter, 1, MEMMODEL_RELAXED); */
- tree addr = tree_coverage_counter_addr (GCOV_COUNTER_ARCS, edgeno);
- tree f = builtin_decl_explicit (LONG_LONG_TYPE_SIZE > 32
- ? BUILT_IN_ATOMIC_FETCH_ADD_8:
- BUILT_IN_ATOMIC_FETCH_ADD_4);
- gcall *stmt = gimple_build_call (f, 3, addr, one,
- build_int_cst (integer_type_node,
- MEMMODEL_RELAXED));
- gsi_insert_on_edge (e, stmt);
- }
- else
- {
- tree ref = tree_coverage_counter_ref (GCOV_COUNTER_ARCS, edgeno);
- tree gcov_type_tmp_var = make_temp_ssa_name (gcov_type_node,
- NULL, "PROF_edge_counter");
- gassign *stmt1 = gimple_build_assign (gcov_type_tmp_var, ref);
- gcov_type_tmp_var = make_temp_ssa_name (gcov_type_node,
- NULL, "PROF_edge_counter");
- gassign *stmt2 = gimple_build_assign (gcov_type_tmp_var, PLUS_EXPR,
- gimple_assign_lhs (stmt1), one);
- gassign *stmt3 = gimple_build_assign (unshare_expr (ref),
- gimple_assign_lhs (stmt2));
- gsi_insert_on_edge (e, stmt1);
- gsi_insert_on_edge (e, stmt2);
- gsi_insert_on_edge (e, stmt3);
- }
+ tree addr = tree_coverage_counter_addr (GCOV_COUNTER_ARCS, edgeno);
+ gcall *stmt = gimple_build_call_internal (IFN_UPDATE_COVERAGE_COUNTER,
+ 2, addr, one);
+ gsi_insert_on_edge (e, stmt);
}
-
/* Emits code to get VALUE to instrument at GSI, and returns the
variable containing the value. */
@@ -44,6 +44,7 @@ along with GCC; see the file COPYING3. If not see
#include "trans-mem.h"
#include "gimple-fold.h"
#include "tree-scalar-evolution.h"
+#include "coverage.h"
/* TODO: Support for predicated code motion. I.e.
@@ -2281,24 +2282,6 @@ find_refs_for_sm (struct loop *loop, bitmap sm_executed, bitmap refs_to_sm)
}
}
-/* Checks whether LOOP (with exits stored in EXITS array) is suitable
- for a store motion optimization (i.e. whether we can insert statement
- on its exits). */
-
-static bool
-loop_suitable_for_sm (struct loop *loop ATTRIBUTE_UNUSED,
- vec<edge> exits)
-{
- unsigned i;
- edge ex;
-
- FOR_EACH_VEC_ELT (exits, i, ex)
- if (ex->flags & (EDGE_ABNORMAL | EDGE_EH))
- return false;
-
- return true;
-}
-
/* Try to perform store motion for all memory references modified inside
LOOP. SM_EXECUTED is the bitmap of the memory references for that
store motion was executed in one of the outer loops. */
@@ -2556,6 +2539,132 @@ tree_ssa_lim (void)
return todo;
}
+/* Move a coverage counter update internal function, pointed to by the GSI
+   iterator, out of loop LOOP.  */
+
+static void
+move_coverage_counter_update (gimple_stmt_iterator *gsi, struct loop *loop)
+{
+ gimple *call = gsi_stmt (*gsi);
+ tree type = get_gcov_type ();
+
+ vec<edge> exits = get_loop_exit_edges (loop);
+ if (!loop_suitable_for_sm (loop, exits))
+ {
+ exits.release ();
+ return;
+ }
+
+  /* Verify that the source of each exit edge post-dominates the CALL's BB. */
+ for (unsigned i = 0; i < exits.length (); i++)
+ {
+ edge exit = exits[i];
+ if (!dominated_by_p (CDI_POST_DOMINATORS, call->bb, exit->src))
+ {
+ exits.release ();
+ return;
+ }
+ }
+
+ if (exits.is_empty ())
+ return;
+
+ edge preheader = loop_preheader_edge (loop);
+ if (!single_succ_p (preheader->src)
+ || preheader->dest != call->bb)
+ return;
+
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ {
+ fprintf (dump_file, "Executing store motion of ");
+ print_generic_expr (dump_file, gimple_call_arg (call, 0), 0);
+ fprintf (dump_file, " from loop %d\n", loop->num);
+ }
+
+ tree preheader_var = make_temp_ssa_name (type, NULL, "PROF_edge_counter");
+ gimple *stmt = gimple_build_assign (preheader_var, build_int_cst (type, 0));
+ gimple_stmt_iterator it = gsi_last_bb (preheader->src);
+ gsi_insert_after (&it, stmt, GSI_NEW_STMT);
+
+ tree loop_var1 = make_temp_ssa_name (type, NULL, "PROF_edge_counter");
+ tree loop_var2 = make_temp_ssa_name (type, NULL, "PROF_edge_counter");
+
+ gphi *phi = create_phi_node (loop_var1, call->bb);
+ add_phi_arg (phi, preheader_var, preheader, UNKNOWN_LOCATION);
+
+ stmt = gimple_build_assign (loop_var2, PLUS_EXPR, loop_var1,
+ build_int_cst (type, 1));
+ gsi_insert_before (gsi, stmt, GSI_SAME_STMT);
+
+ edge e;
+ edge_iterator ei;
+ FOR_EACH_EDGE (e, ei, call->bb->preds)
+ if (e != preheader)
+ add_phi_arg (phi, loop_var2, e, UNKNOWN_LOCATION);
+
+ tree updated_value = make_temp_ssa_name (type, NULL, "PROF_edge_counter");
+
+ for (unsigned i = 0; i < exits.length (); i++)
+ {
+ edge exit = exits[i];
+ if (!dominated_by_p (CDI_DOMINATORS, exit->dest, exit->src))
+ {
+ basic_block new_bb = split_edge (exit);
+ set_immediate_dominator (CDI_DOMINATORS, new_bb, exit->src);
+ e = single_pred_edge (new_bb);
+ }
+ else
+ e = exit;
+
+ basic_block bb = e->dest;
+ phi = create_phi_node (updated_value, e->dest);
+ add_phi_arg (phi, loop_var2, e, UNKNOWN_LOCATION);
+
+ tree addr = unshare_expr (gimple_call_arg (call, 0));
+ call = gimple_build_call_internal (IFN_UPDATE_COVERAGE_COUNTER,
+ 2, addr, gimple_phi_result (phi));
+
+ it = gsi_start_bb (bb);
+ gsi_insert_before (&it, call, GSI_NEW_STMT);
+ }
+
+ exits.release ();
+ gsi_remove (gsi, true);
+}
+
+/* Process store motion for coverage counter update internal function. */
+
+static void
+process_sm_for_coverage_counter (void)
+{
+ bool has_dom_info = dom_info_available_p (CDI_POST_DOMINATORS);
+ if (!has_dom_info)
+ calculate_dominance_info (CDI_POST_DOMINATORS);
+
+ struct loop *loop;
+ FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
+ {
+ basic_block *body = get_loop_body (loop);
+ for (unsigned i = 0; i < loop->num_nodes; i++)
+ {
+ gimple_stmt_iterator gsi;
+ for (gsi = gsi_start_bb (body[i]); !gsi_end_p (gsi);)
+ {
+ gimple *stmt = gsi_stmt (gsi);
+ if (gimple_call_internal_p (stmt, IFN_UPDATE_COVERAGE_COUNTER)
+ && integer_onep (gimple_call_arg (stmt, 1)))
+ move_coverage_counter_update (&gsi, loop);
+
+ if (!gsi_end_p (gsi))
+ gsi_next (&gsi);
+ }
+ }
+ }
+
+ if (!has_dom_info)
+ free_dominance_info (CDI_POST_DOMINATORS);
+}
+
/* Loop invariant motion pass. */
namespace {
@@ -2592,11 +2701,15 @@ pass_lim::execute (function *fun)
{
bool in_loop_pipeline = scev_initialized_p ();
if (!in_loop_pipeline)
- loop_optimizer_init (LOOPS_NORMAL | LOOPS_HAVE_RECORDED_EXITS);
+ loop_optimizer_init (LOOPS_NORMAL | LOOPS_HAVE_RECORDED_EXITS
+ | LOOPS_HAVE_PREHEADERS);
- if (number_of_loops (fun) <= 1)
- return 0;
- unsigned int todo = tree_ssa_lim ();
+ unsigned int todo = 0;
+ if (number_of_loops (fun) > 1)
+ todo = tree_ssa_lim ();
+
+ process_sm_for_coverage_counter ();
+ todo |= TODO_update_ssa;
if (!in_loop_pipeline)
loop_optimizer_finalize ();
@@ -79,4 +79,22 @@ loop_containing_stmt (gimple *stmt)
return bb->loop_father;
}
+/* Checks whether LOOP (with exits stored in EXITS array) is suitable
+ for a store motion optimization (i.e. whether we can insert statement
+ on its exits). */
+
+static inline bool
+loop_suitable_for_sm (struct loop *loop ATTRIBUTE_UNUSED,
+ vec<edge> exits)
+{
+ unsigned i;
+ edge ex;
+
+ FOR_EACH_VEC_ELT (exits, i, ex)
+ if (ex->flags & (EDGE_ABNORMAL | EDGE_EH))
+ return false;
+
+ return true;
+}
+
#endif /* GCC_TREE_SSA_LOOP_H */
@@ -98,6 +98,7 @@ bool check_ic_target (gcall *, struct cgraph_node *);
/* In tree-profile.c. */
extern void gimple_init_edge_profiler (void);
extern void gimple_gen_edge_profiler (int, edge);
+extern void expand_coverage_counter_ifns (void);
extern void gimple_gen_interval_profiler (histogram_value, unsigned, unsigned);
extern void gimple_gen_pow2_profiler (histogram_value, unsigned, unsigned);
extern void gimple_gen_one_value_profiler (histogram_value, unsigned, unsigned);
--
2.10.1