
[RFC] Speed-up -fprofile-update=atomic

Message ID 0dc8a029-696a-39b0-96cc-ab44488607eb@suse.cz
State New

Commit Message

Martin Liška Oct. 24, 2016, 12:09 p.m. UTC
On 10/17/2016 02:03 PM, Richard Biener wrote:
> On Mon, Oct 17, 2016 at 1:46 PM, Martin Liška <mliska@suse.cz> wrote:

>> On 10/13/2016 11:43 AM, Richard Biener wrote:

>>> On Wed, Oct 12, 2016 at 3:52 PM, Martin Liška <mliska@suse.cz> wrote:

>>>> On 10/04/2016 11:45 AM, Richard Biener wrote:

>>>>> On Thu, Sep 15, 2016 at 12:00 PM, Martin Liška <mliska@suse.cz> wrote:

>>>>>> On 09/07/2016 02:09 PM, Richard Biener wrote:

>>>>>>> On Wed, Sep 7, 2016 at 1:37 PM, Martin Liška <mliska@suse.cz> wrote:

>>>>>>>> On 08/18/2016 06:06 PM, Richard Biener wrote:

>>>>>>>>> On August 18, 2016 5:54:49 PM GMT+02:00, Jakub Jelinek <jakub@redhat.com> wrote:

>>>>>>>>>> On Thu, Aug 18, 2016 at 08:51:31AM -0700, Andi Kleen wrote:

>>>>>>>>>>>> I'd prefer to make updates atomic in multi-threaded applications.

>>>>>>>>>>>> The best proxy we have for that is -pthread.

>>>>>>>>>>>>

>>>>>>>>>>>> Is it slower, most definitely, but odds are we're giving folks

>>>>>>>>>>>> garbage data otherwise, which in many ways is even worse.

>>>>>>>>>>>

>>>>>>>>>>> It will likely be catastrophically slower in some cases.

>>>>>>>>>>>

>>>>>>>>>>> Catastrophically as in too slow to be usable.

>>>>>>>>>>>

>>>>>>>>>>> An atomic instruction is a lot more expensive than a single

>>>>>>>>>> increment. Also

>>>>>>>>>>> they sometimes are really slow depending on the state of the machine.

>>>>>>>>>>

>>>>>>>>>> Can't we just have thread-local copies of all the counters (perhaps

>>>>>>>>>> using

>>>>>>>>>> __thread pointer as base) and just atomically merge at thread

>>>>>>>>>> termination?

>>>>>>>>>

>>>>>>>>> I suggested that as well but of course it'll have its own class of issues (short lived threads, so we need to somehow re-use counters from terminated threads, large number of threads and thus using too much memory for the counters)

>>>>>>>>>

>>>>>>>>> Richard.

>>>>>>>>

>>>>>>>> Hello.

>>>>>>>>

>>>>>>>> I've put the approach on my TODO list; let's see whether it is doable in a reasonable amount of time.

>>>>>>>>

>>>>>>>> I've just finished some measurements to illustrate the slow-down of the -fprofile-update=atomic approach.

>>>>>>>> All numbers are: no profile, -fprofile-generate, -fprofile-generate -fprofile-update=atomic

>>>>>>>> c-ray benchmark (utilizing 8 threads, -O3): 1.7, 15.5, 38.1s

>>>>>>>> unrar (utilizing 8 threads, -O3): 3.6, 11.6, 38s

>>>>>>>> tramp3d (1 thread, -O3): 18.0, 46.6, 168s

>>>>>>>>

>>>>>>>> So the slow-down is roughly 300% compared to -fprofile-generate. I don't have much experience with default option

>>>>>>>> selection, but these numbers can probably help.

>>>>>>>>

>>>>>>>> Thoughts?

>>>>>>>

>>>>>>> Look at the generated code for an instrumented simple loop and see that for

>>>>>>> the non-atomic updates we happily apply store-motion to the counter update

>>>>>>> and thus we only get one counter update per loop exit rather than one per

>>>>>>> loop iteration.  Now see what happens for the atomic case (I suspect you

>>>>>>> get one per iteration).

>>>>>>>

>>>>>>> I'll bet this accounts for most of the slowdown.

>>>>>>>

>>>>>>> Back in time ICC which had atomic counter updates (but using function

>>>>>>> calls - ugh!) had a > 1000% overhead with FDO for tramp3d (they also

>>>>>>> didn't have early inlining -- removing abstraction helps reducing the number

>>>>>>> of counters significantly).

>>>>>>>

>>>>>>> Richard.

>>>>>>

>>>>>> Hi.

>>>>>>

>>>>>> During Cauldron I discussed with Richi approaches to speed up ARCS

>>>>>> profile counter updates. My first attempt is to utilize TLS storage, where

>>>>>> every function accumulates its arc counters. These are eventually added

>>>>>> (using atomic operations) to the global ones at the very end of the function.

>>>>>> Currently I rely on target support for TLS; it's questionable whether

>>>>>> we want such a requirement for -fprofile-update=atomic, or whether to add a new option value

>>>>>> like -fprofile-update=atomic-tls?

>>>>>>

>>>>>> Running tramp3d with the patch, compared to the previous numbers, it takes 88s to finish.

>>>>>> That is roughly half the time of the current implementation.

>>>>>>

>>>>>> Thoughts?

>>>>>

>>>>> Hmm, I thought I suggested that you can simply use automatic storage

>>>>> (which effectively

>>>>> is TLS...) for regions that are not forked or abnormally left (which

>>>>> means SESE regions

>>>>> that have no calls that eventually terminate or throw externally).

>>>>>

>>>>> So why did you end up with TLS?

>>>>

>>>> Hi.

>>>>

>>>> Using TLS does not make sense, stupid mistake ;)

>>>>

>>>> By using SESE regions, do you mean the infrastructure that is utilized

>>>> by Graphite machinery?

>>>

>>> No, just as "single-entry single-exit region" which means placing of

>>> initializations of the internal counters to zero and the updates of the

>>> actual counters is "obvious".

>>>

>>> Note that this "optimization" isn't one if the SESE region does not contain

>>> cycle(s).  Unless there is a way to do an atomic update of a bunch of

>>> counters faster than doing them separately.  This optimization will also

>>> increase register pressure (or force the internal counters to the stack).

>>> Thus selecting which counters to "optimize" and which ones to leave in place

>>> might be necessary.

>>

>> Ok, I must admit the selection of which counters to optimize is crucial. The current implementation

>> (atomic increments at places where a BB is reached) has the advantage that it does not increase

>> register pressure and it does not update a global arc counter when the BB is not visited.

>> On the other hand, having local counters which are updated at function exit (my current implementation)

>> possibly updates counters for BBs that are never reached, and it creates a huge memory locking spot if multiple threads

>> call a function very often. This is a perf report of the c-ray benchmark:

>>

>>        │             if((d = SQ(b) - 4.0 * a * c) < 0.0) return 0;

>>   0.01 │3c0:   xor    %ecx,%ecx

>>   0.00 │3c2:   mov    0x110(%rsp),%rdx

>>   0.00 │       lock   add    %rdx,__gcov0.ray_sphere

>>   7.96 │       mov    0x118(%rsp),%rdx

>>        │       lock   add    %rdx,__gcov0.ray_sphere+0x8

>>  11.39 │       mov    0x120(%rsp),%rdx

>>        │       lock   add    %rdx,__gcov0.ray_sphere+0x10

>>  11.09 │       mov    0x128(%rsp),%rdx

>>   0.00 │       lock   add    %rdx,__gcov0.ray_sphere+0x18

>>  11.27 │       mov    0x130(%rsp),%rdx

>>        │       lock   add    %rdx,__gcov0.ray_sphere+0x20

>>  11.02 │       mov    0x138(%rsp),%rdx

>>        │       lock   add    %rdx,__gcov0.ray_sphere+0x28

>>  11.46 │       mov    0x140(%rsp),%rdx

>>   0.00 │       lock   add    %rdx,__gcov0.ray_sphere+0x30

>>  11.84 │       mov    0x148(%rsp),%rdx

>>   0.00 │       lock   add    %rdx,__gcov0.ray_sphere+0x38

>>  11.57 │       mov    0x150(%rsp),%rdx

>>        │       lock   add    %rdx,__gcov0.ray_sphere+0x40

>>   6.86 │       mov    0x158(%rsp),%rdx

>>        │       lock   add    %rdx,__gcov0.ray_sphere+0x48

>>

>> My current approach does an atomic increment when maybe_hot_bb_p returns false,

>> and local counters are used otherwise. The question is how to find a better place

>> where to initialize and store the local counter values?

> 

> Given the main reason for using local counters is loops you can use

> loop information

> and put counter initializations in the loop preheader and stores on

> the loop exit edges

> (basically what loop store motion does w/o atomic updates).  Store motion knows

> to ignore any aliasing with counters and thus is _very_ aggressive

> with doing this

> (including all loop nests but of course not across possible (abnormal)

> function exits).

> 

> While profile instrumentation is quite late it is of course still

> before IPA inlining.

> Thus loop store motion w/o atomic updates possibly still sees more opportunities

> than this.  I wonder if we might want to leave the optimization to the regular

> optimizers by only lowering the actual counter kind late and start with some

> IFN_UPDATE_COVERAGE_COUNTER which passes could handle (and merge).

> 

> Anyway, I'd go for the loop trick and simply handle counter motion

> from innermost

> loops only.  The final update place would be the common post-dominator of all

> exits (if you want to handle multiple exits).  As said you need to watch out for

> function termination that is not reflected by the CFG (external throws, exits in

> callees, etc.).

> 

> Richard.


Hello Richard.

I've just finished my patch, where I came up with a new internal fn (UPDATE_COVERAGE_COUNTER).
The function is generated by the profile pass and is handled by the lim pass. I originally tried to
support the internal fn directly in the lim machinery, but doing a specific loop motion turned out to be easier to work with.
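
To illustrate the intent, here is a hand-written C sketch of the transformation (not actual compiler output; the counter name is made up): the motion turns a per-iteration atomic update into a cheap local increment plus a single atomic update on the loop exit.

    #include <stdint.h>

    int64_t __counter;  /* stand-in for a __gcov0.* arc counter */

    /* Before the motion: one atomic RMW per iteration.  */
    void count_per_iteration (int n)
    {
      for (int i = 0; i < n; i++)
        __atomic_fetch_add (&__counter, 1, __ATOMIC_RELAXED);
    }

    /* After the motion: a plain local counter inside the loop and a
       single atomic update when the loop is left.  */
    void count_on_exit (int n)
    {
      int64_t local = 0;
      for (int i = 0; i < n; i++)
        local++;
      __atomic_fetch_add (&__counter, local, __ATOMIC_RELAXED);
    }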

With the patch applied, tramp3d runs 1.8x slower with -fprofile-update=atomic compared to -fprofile-update=single.

The patch bootstraps on ppc64le-redhat-linux and survives regression tests.
Thoughts?

Thanks,
Martin
> 

>> Ideas welcomed.

>> Thanks,

>> Martin

>>

>>>

>>> Richard.

>>>

>>>> Thanks,

>>>> Martin

>>>>

>>>>>

>>>>> Richard.

>>>>>

>>>>>> Martin

>>>>>>

>>>>>>>

>>>>>>>> Martin

>>>>>>>>

>>>>>>>>>

>>>>>>>>>>      Jakub

>>>>>>>>>

>>>>>>>>>

>>>>>>>>

>>>>>>

>>>>

>>

Comments

Martin Liška Oct. 25, 2016, 2:31 p.m. UTC | #1
On 10/24/2016 03:51 PM, Richard Biener wrote:

> It's quite ad-hoc :/  The IFN will also be a memory optimization

> barrier unless you add special support

> for it in the alias oracle - so the performance measurement needs to

> be taken with a grain of salt

> (same is true for all atomics of course... - I have some local patches

> to improve things here).


Good, so please ping me with the patches you have and I'll integrate them.
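
To make the point above concrete, here is a small hand-written C sketch (made-up globals, not compiler output) of why a barrier-like counter update costs more than just the locked add: the read of 'limit' is loop-invariant, but as long as the builtin (or the IFN) is treated as possibly clobbering any memory, it cannot be kept in a register across iterations.

    #include <stdint.h>

    int64_t __counter;  /* stand-in for an arc counter */
    int limit;          /* unrelated, loop-invariant global */

    int count_until_limit (int i)
    {
      int iters = 0;
      while (i < limit)   /* 'limit' is re-loaded on each iteration ... */
        {
          /* ... because this update is a barrier for the optimizers.  */
          __atomic_fetch_add (&__counter, 1, __ATOMIC_RELAXED);
          i++;
          iters++;
        }
      return iters;
    }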

> 

> The way you implement process_sm_for_coverage_counter is more like a

> final value replacement.

> You could maybe use create_iv for the loop counter or even wind up

> computing the final value

> (number of iterations) only after the loop, avoiding the IV completely

> (eventually SCEV cprop

> saves you here afterwards).


Or maybe we can simply pass loop->niter as the argument of the UPDATE_COVERAGE_COUNTER
function?
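
As a hand-written C sketch of that idea (made-up names, assuming the trip count is known when the loop is left), the counter would need no in-loop bookkeeping at all:

    #include <stdint.h>

    int64_t __counter;  /* stand-in for an arc counter */

    static void do_work (int i) { (void) i; /* placeholder for the loop body */ }

    void count_with_niter (int n)
    {
      for (int i = 0; i < n; i++)
        do_work (i);    /* no counter code inside the loop */

      if (n > 0)        /* bump the counter once by the trip count */
        __atomic_fetch_add (&__counter, n, __ATOMIC_RELAXED);
    }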

Martin

> 

> Richard.
Richard Biener Oct. 26, 2016, 9:28 a.m. UTC | #2
On Tue, Oct 25, 2016 at 4:31 PM, Martin Liška <mliska@suse.cz> wrote:
> On 10/24/2016 03:51 PM, Richard Biener wrote:

>

>> It's quite ad-hoc :/  The IFN will also be a memory optimization

>> barrier unless you add special support

>> for it in the alias oracle - so the performance measurement needs to

>> be taken with a grain of salt

>> (same is true for all atomics of course... - I have some local patches

>> to improve things here).

>

> Good, so please ping me with the patches you have and I'll integrate them.

>

>>

>> The way you implement process_sm_for_coverage_counter is more like a

>> final value replacement.

>> You could maybe use create_iv for the loop counter or even wind up

>> computing the final value

>> (number of iterations) only after the loop, avoiding the IV completely

>> (eventually SCEV cprop

>> saves you here afterwards).

>

> Or maybe we can simply pass loop->niter as the argument of the UPDATE_COVERAGE_COUNTER

> function?


Yes, that's what I said.

Richard.

> Martin

>

>>

>> Richard.

>
Richard Biener Oct. 26, 2016, 9:31 a.m. UTC | #3
On Wed, Oct 26, 2016 at 11:28 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Tue, Oct 25, 2016 at 4:31 PM, Martin Liška <mliska@suse.cz> wrote:

>> On 10/24/2016 03:51 PM, Richard Biener wrote:

>>

>>> It's quite ad-hoc :/  The IFN will also be a memory optimization

>>> barrier unless you add special support

>>> for it in the alias oracle - so the performance measurement needs to

>>> be taken with a grain of salt

>>> (same is true for all atomics of course... - I have some local patches

>>> to improve things here).

>>

>> Good, so please ping me with the patches you have and I'll integrate them.


This is what I have in my tree (apparently only points-to changes; I suppose general
alias changes will be controversial as the builtins would lose their
"compiler memory barrier" behavior).

Richard.

>>>

>>> The way you implement process_sm_for_coverage_counter is more like a

>>> final value replacement.

>>> You could maybe use create_iv for the loop counter or even wind up

>>> computing the final value

>>> (number of iterations) only after the loop, avoiding the IV completely

>>> (eventually SCEV cprop

>>> saves you here afterwards).

>>

>> Or maybe we can simply pass loop->niter as the argument of the UPDATE_COVERAGE_COUNTER

>> function?

>

> Yes, that's what I said.

>

> Richard.

>

>> Martin

>>

>>>

>>> Richard.

>>

Patch

From a9e76ff1bc8b69e23a50d8cdc9556270cf7b269d Mon Sep 17 00:00:00 2001
From: marxin <mliska@suse.cz>
Date: Fri, 21 Oct 2016 13:26:19 +0200
Subject: [PATCH] Introduce loop store motion for UPDATE_COVERAGE_COUNTER

gcc/testsuite/ChangeLog:

2016-10-24  Martin Liska  <mliska@suse.cz>

	* gcc.dg/tree-ssa/ssa-lim-11.c: Update scanned pattern.

gcc/ChangeLog:

2016-10-24  Martin Liska  <mliska@suse.cz>

	* Makefile.in: Add new file profile-expand.c.
	* internal-fn.c (expand_UPDATE_COVERAGE_COUNTER): New IFN.
	* internal-fn.def: Likewise.
	* passes.def: Add new pass profile_expand.
	* profile-expand.c: New file.
	* tree-pass.h (make_pass_profile_expand): Declare a new
	function.
	* tree-profile.c (gimple_gen_edge_profiler): Generate the new
	internal fn.
	* tree-ssa-loop-im.c (loop_suitable_for_sm): Move to header
	file.
	(move_coverage_counter_update): New function.
	(process_sm_for_coverage_counter): Likewise.
	(pass_lim::execute): Call invariant motion for
	UPDATE_COVERAGE_COUNTER internal functions.
	* tree-ssa-loop.h: Move the function here from
	tree-ssa-loop-im.c.
	* value-prof.h: Declare expand_coverage_counter_ifns.
---
 gcc/Makefile.in                            |   1 +
 gcc/internal-fn.c                          |   6 ++
 gcc/internal-fn.def                        |   2 +
 gcc/passes.def                             |   1 +
 gcc/profile-expand.c                       | 143 ++++++++++++++++++++++++++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-11.c |   3 +-
 gcc/tree-pass.h                            |   1 +
 gcc/tree-profile.c                         |  37 +------
 gcc/tree-ssa-loop-im.c                     | 157 +++++++++++++++++++++++++----
 gcc/tree-ssa-loop.h                        |  18 ++++
 gcc/value-prof.h                           |   1 +
 11 files changed, 315 insertions(+), 55 deletions(-)
 create mode 100644 gcc/profile-expand.c

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index c512cd7..9bdb406 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1404,6 +1404,7 @@  OBJS = \
 	print-rtl-function.o \
 	print-tree.o \
 	profile.o \
+	profile-expand.o \
 	real.o \
 	realmpfr.o \
 	recog.o \
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index 0b32d5f..557d373 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -254,6 +254,12 @@  expand_FALLTHROUGH (internal_fn, gcall *call)
 	    "invalid use of attribute %<fallthrough%>");
 }
 
+static void
+expand_UPDATE_COVERAGE_COUNTER (internal_fn, gcall *)
+{
+  gcc_unreachable ();
+}
+
 /* Helper function for expand_addsub_overflow.  Return 1
    if ARG interpreted as signed in its precision is known to be always
    positive or 2 if ARG is known to be always negative, or 3 if ARG may
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index d4fbdb2..348fc2f 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -198,6 +198,8 @@  DEF_INTERNAL_FN (ATOMIC_COMPARE_EXCHANGE, ECF_LEAF | ECF_NOTHROW, NULL)
 /* To implement [[fallthrough]].  */
 DEF_INTERNAL_FN (FALLTHROUGH, ECF_LEAF | ECF_NOTHROW, NULL)
 
+DEF_INTERNAL_FN (UPDATE_COVERAGE_COUNTER, ECF_LEAF | ECF_NOTHROW, NULL)
+
 #undef DEF_INTERNAL_INT_FN
 #undef DEF_INTERNAL_FLT_FN
 #undef DEF_INTERNAL_OPTAB_FN
diff --git a/gcc/passes.def b/gcc/passes.def
index 1375254..4a22860 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -386,6 +386,7 @@  along with GCC; see the file COPYING3.  If not see
   NEXT_PASS (pass_asan_O0);
   NEXT_PASS (pass_tsan_O0);
   NEXT_PASS (pass_sanopt);
+  NEXT_PASS (pass_profile_expand);
   NEXT_PASS (pass_cleanup_eh);
   NEXT_PASS (pass_lower_resx);
   NEXT_PASS (pass_nrv);
diff --git a/gcc/profile-expand.c b/gcc/profile-expand.c
new file mode 100644
index 0000000..317fe1f
--- /dev/null
+++ b/gcc/profile-expand.c
@@ -0,0 +1,143 @@ 
+/* Profile expand pass.
+   Copyright (C) 2003-2016 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it
+under the terms of the GNU General Public License as published by the
+Free Software Foundation; either version 3, or (at your option) any
+later version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT
+ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "memmodel.h"
+#include "backend.h"
+#include "target.h"
+#include "tree.h"
+#include "gimple.h"
+#include "cfghooks.h"
+#include "tree-pass.h"
+#include "ssa.h"
+#include "coverage.h"
+#include "varasm.h"
+#include "tree-nested.h"
+#include "gimplify.h"
+#include "gimple-iterator.h"
+#include "gimplify-me.h"
+#include "tree-cfg.h"
+#include "tree-into-ssa.h"
+#include "value-prof.h"
+#include "profile.h"
+#include "tree-cfgcleanup.h"
+#include "params.h"
+
+void
+expand_coverage_counter_ifns (void)
+{
+  basic_block bb;
+  tree f = builtin_decl_explicit (LONG_LONG_TYPE_SIZE > 32
+				  ? BUILT_IN_ATOMIC_FETCH_ADD_8:
+				  BUILT_IN_ATOMIC_FETCH_ADD_4);
+
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      gimple_stmt_iterator gsi;
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+	  if (gimple_call_internal_p (stmt, IFN_UPDATE_COVERAGE_COUNTER))
+	    {
+	      tree addr = gimple_call_arg (stmt, 0);
+	      tree value = gimple_call_arg (stmt, 1);
+	      if (flag_profile_update == PROFILE_UPDATE_ATOMIC)
+		{
+		  gcall *stmt
+		    = gimple_build_call (f, 3, addr, value,
+					 build_int_cst (integer_type_node,
+							MEMMODEL_RELAXED));
+		  gsi_replace (&gsi, stmt, true);
+		}
+	      else
+		{
+		  gcc_assert (TREE_CODE (addr) == ADDR_EXPR);
+		  tree ref = TREE_OPERAND (addr, 0);
+		  tree gcov_type_tmp_var
+		    = make_temp_ssa_name (get_gcov_type (), NULL,
+					  "PROF_edge_counter");
+		  gassign *stmt1 = gimple_build_assign (gcov_type_tmp_var, ref);
+		  gcov_type_tmp_var
+		    = make_temp_ssa_name (get_gcov_type (), NULL,
+					  "PROF_edge_counter");
+		  gassign *stmt2
+		    = gimple_build_assign (gcov_type_tmp_var, PLUS_EXPR,
+					   gimple_assign_lhs (stmt1), value);
+		  gassign *stmt3
+		    = gimple_build_assign (unshare_expr (ref),
+					   gimple_assign_lhs (stmt2));
+		  gsi_insert_seq_before (&gsi, stmt1, GSI_SAME_STMT);
+		  gsi_insert_seq_before (&gsi, stmt2, GSI_SAME_STMT);
+		  gsi_replace (&gsi, stmt3, GSI_SAME_STMT);
+		}
+	    }
+	}
+    }
+}
+
+/* Profile expand pass.  */
+
+namespace {
+
+const pass_data pass_data_profile_expand =
+{
+  GIMPLE_PASS, /* type */
+  "profile_expand", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_LIM, /* tv_id */
+  PROP_cfg, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_update_ssa, /* todo_flags_finish */
+};
+
+class pass_profile_expand : public gimple_opt_pass
+{
+public:
+  pass_profile_expand (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_profile_expand, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  opt_pass * clone () { return new pass_profile_expand (m_ctxt); }
+  virtual bool gate (function *) { return true; }
+  virtual unsigned int execute (function *);
+
+}; // class pass_profile_expand
+
+unsigned int
+pass_profile_expand::execute (function *)
+{
+  expand_coverage_counter_ifns ();
+
+  return 0;
+}
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_profile_expand (gcc::context *ctxt)
+{
+  return new pass_profile_expand (ctxt);
+}
+
+
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-11.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-11.c
index e4c11aa..4c14e24 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-11.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-11.c
@@ -22,4 +22,5 @@  void access_buf(struct thread_param* p)
     }
 }
 
-/* { dg-final { scan-tree-dump-times "Executing store motion of __gcov0.access_buf\\\[\[01\]\\\] from loop 1" 2 "lim2" } } */
+/* { dg-final { scan-tree-dump-times "Executing store motion of &__gcov0.access_buf\\\[\[01\]\\\] from loop 1" 1 "lim2" } } */
+/* { dg-final { scan-tree-dump-times "Executing store motion of &__gcov0.access_buf\\\[\[01\]\\\] from loop 2" 1 "lim2" } } */
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 5903fde..ac919f8 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -365,6 +365,7 @@  extern gimple_opt_pass *make_pass_fix_loops (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_loop (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_no_loop (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_loop_init (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_profile_expand (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_lim (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_unswitch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_predcom (gcc::context *ctxt);
diff --git a/gcc/tree-profile.c b/gcc/tree-profile.c
index abeee92..288d38c 100644
--- a/gcc/tree-profile.c
+++ b/gcc/tree-profile.c
@@ -252,40 +252,13 @@  gimple_init_edge_profiler (void)
 void
 gimple_gen_edge_profiler (int edgeno, edge e)
 {
-  tree one;
+  tree one = build_int_cst (gcov_type_node, 1);
 
-  one = build_int_cst (gcov_type_node, 1);
-
-  if (flag_profile_update == PROFILE_UPDATE_ATOMIC)
-    {
-      /* __atomic_fetch_add (&counter, 1, MEMMODEL_RELAXED); */
-      tree addr = tree_coverage_counter_addr (GCOV_COUNTER_ARCS, edgeno);
-      tree f = builtin_decl_explicit (LONG_LONG_TYPE_SIZE > 32
-				      ? BUILT_IN_ATOMIC_FETCH_ADD_8:
-				      BUILT_IN_ATOMIC_FETCH_ADD_4);
-      gcall *stmt = gimple_build_call (f, 3, addr, one,
-				       build_int_cst (integer_type_node,
-						      MEMMODEL_RELAXED));
-      gsi_insert_on_edge (e, stmt);
-    }
-  else
-    {
-      tree ref = tree_coverage_counter_ref (GCOV_COUNTER_ARCS, edgeno);
-      tree gcov_type_tmp_var = make_temp_ssa_name (gcov_type_node,
-						   NULL, "PROF_edge_counter");
-      gassign *stmt1 = gimple_build_assign (gcov_type_tmp_var, ref);
-      gcov_type_tmp_var = make_temp_ssa_name (gcov_type_node,
-					      NULL, "PROF_edge_counter");
-      gassign *stmt2 = gimple_build_assign (gcov_type_tmp_var, PLUS_EXPR,
-					    gimple_assign_lhs (stmt1), one);
-      gassign *stmt3 = gimple_build_assign (unshare_expr (ref),
-					    gimple_assign_lhs (stmt2));
-      gsi_insert_on_edge (e, stmt1);
-      gsi_insert_on_edge (e, stmt2);
-      gsi_insert_on_edge (e, stmt3);
-    }
+  tree addr = tree_coverage_counter_addr (GCOV_COUNTER_ARCS, edgeno);
+  gcall *stmt = gimple_build_call_internal (IFN_UPDATE_COVERAGE_COUNTER,
+					    2, addr, one);
+  gsi_insert_on_edge (e, stmt);
 }
-
 /* Emits code to get VALUE to instrument at GSI, and returns the
    variable containing the value.  */
 
diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index 463db04..0e04250 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -44,6 +44,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "trans-mem.h"
 #include "gimple-fold.h"
 #include "tree-scalar-evolution.h"
+#include "coverage.h"
 
 /* TODO:  Support for predicated code motion.  I.e.
 
@@ -2281,24 +2282,6 @@  find_refs_for_sm (struct loop *loop, bitmap sm_executed, bitmap refs_to_sm)
     }
 }
 
-/* Checks whether LOOP (with exits stored in EXITS array) is suitable
-   for a store motion optimization (i.e. whether we can insert statement
-   on its exits).  */
-
-static bool
-loop_suitable_for_sm (struct loop *loop ATTRIBUTE_UNUSED,
-		      vec<edge> exits)
-{
-  unsigned i;
-  edge ex;
-
-  FOR_EACH_VEC_ELT (exits, i, ex)
-    if (ex->flags & (EDGE_ABNORMAL | EDGE_EH))
-      return false;
-
-  return true;
-}
-
 /* Try to perform store motion for all memory references modified inside
    LOOP.  SM_EXECUTED is the bitmap of the memory references for that
    store motion was executed in one of the outer loops.  */
@@ -2556,6 +2539,132 @@  tree_ssa_lim (void)
   return todo;
 }
 
+/* Move coverage counter update internal function, pointed by GSI iterator,
+   out of a loop LOOP.  */
+
+static void
+move_coverage_counter_update (gimple_stmt_iterator *gsi, struct loop *loop)
+{
+  gimple *call = gsi_stmt (*gsi);
+  tree type = get_gcov_type ();
+
+  vec<edge> exits = get_loop_exit_edges (loop);
+  if (!loop_suitable_for_sm (loop, exits))
+    {
+      exits.release ();
+      return;
+    }
+
+  /* Verify that BB of the CALL statement post-dominates all exits.  */
+  for (unsigned i = 0; i < exits.length (); i++)
+    {
+      edge exit = exits[i];
+      if (!dominated_by_p (CDI_POST_DOMINATORS, call->bb, exit->src))
+	{
+	  exits.release ();
+	  return;
+	}
+    }
+
+  if (exits.is_empty ())
+    return;
+
+  edge preheader = loop_preheader_edge (loop);
+  if (!single_succ_p (preheader->src)
+      || preheader->dest != call->bb)
+    return;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "Executing store motion of ");
+      print_generic_expr (dump_file, gimple_call_arg (call, 0), 0);
+      fprintf (dump_file, " from loop %d\n", loop->num);
+    }
+
+  tree preheader_var = make_temp_ssa_name (type, NULL, "PROF_edge_counter");
+  gimple *stmt = gimple_build_assign (preheader_var, build_int_cst (type, 0));
+  gimple_stmt_iterator it = gsi_last_bb (preheader->src);
+  gsi_insert_after (&it, stmt, GSI_NEW_STMT);
+
+  tree loop_var1 = make_temp_ssa_name (type, NULL, "PROF_edge_counter");
+  tree loop_var2 = make_temp_ssa_name (type, NULL, "PROF_edge_counter");
+
+  gphi *phi = create_phi_node (loop_var1, call->bb);
+  add_phi_arg (phi, preheader_var, preheader, UNKNOWN_LOCATION);
+
+  stmt = gimple_build_assign (loop_var2, PLUS_EXPR, loop_var1,
+			      build_int_cst (type, 1));
+  gsi_insert_before (gsi, stmt, GSI_SAME_STMT);
+
+  edge e;
+  edge_iterator ei;
+  FOR_EACH_EDGE (e, ei, call->bb->preds)
+    if (e != preheader)
+      add_phi_arg (phi, loop_var2, e, UNKNOWN_LOCATION);
+
+  tree updated_value = make_temp_ssa_name (type, NULL, "PROF_edge_counter");
+
+  for (unsigned i = 0; i < exits.length (); i++)
+    {
+      edge exit = exits[i];
+      if (!dominated_by_p (CDI_DOMINATORS, exit->dest, exit->src))
+	{
+	  basic_block new_bb = split_edge (exit);
+	  set_immediate_dominator (CDI_DOMINATORS, new_bb, exit->src);
+	  e = single_pred_edge (new_bb);
+	}
+      else
+	e = exit;
+
+      basic_block bb = e->dest;
+      phi = create_phi_node (updated_value, e->dest);
+      add_phi_arg (phi, loop_var2, e, UNKNOWN_LOCATION);
+
+      tree addr = unshare_expr (gimple_call_arg (call, 0));
+      call = gimple_build_call_internal (IFN_UPDATE_COVERAGE_COUNTER,
+					 2, addr, gimple_phi_result (phi));
+
+      it = gsi_start_bb (bb);
+      gsi_insert_before (&it, call, GSI_NEW_STMT);
+    }
+
+  exits.release ();
+  gsi_remove (gsi, true);
+}
+
+/* Process store motion for coverage counter update internal function.  */
+
+static void
+process_sm_for_coverage_counter (void)
+{
+  bool has_dom_info = dom_info_available_p (CDI_POST_DOMINATORS);
+  if (!has_dom_info)
+    calculate_dominance_info (CDI_POST_DOMINATORS);
+
+  struct loop *loop;
+  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
+    {
+      basic_block *body = get_loop_body (loop);
+      for (unsigned i = 0; i < loop->num_nodes; i++)
+	{
+	  gimple_stmt_iterator gsi;
+	  for (gsi = gsi_start_bb (body[i]); !gsi_end_p (gsi);)
+	    {
+	      gimple *stmt = gsi_stmt (gsi);
+	      if (gimple_call_internal_p (stmt, IFN_UPDATE_COVERAGE_COUNTER)
+		  && integer_onep (gimple_call_arg (stmt, 1)))
+		move_coverage_counter_update (&gsi, loop);
+
+	      if (!gsi_end_p (gsi))
+		gsi_next (&gsi);
+	    }
+	}
+    }
+
+  if (!has_dom_info)
+    free_dominance_info (CDI_POST_DOMINATORS);
+}
+
 /* Loop invariant motion pass.  */
 
 namespace {
@@ -2592,11 +2701,15 @@  pass_lim::execute (function *fun)
 {
   bool in_loop_pipeline = scev_initialized_p ();
   if (!in_loop_pipeline)
-    loop_optimizer_init (LOOPS_NORMAL | LOOPS_HAVE_RECORDED_EXITS);
+    loop_optimizer_init (LOOPS_NORMAL | LOOPS_HAVE_RECORDED_EXITS
+			 | LOOPS_HAVE_PREHEADERS);
 
-  if (number_of_loops (fun) <= 1)
-    return 0;
-  unsigned int todo = tree_ssa_lim ();
+  unsigned int todo = 0;
+  if (number_of_loops (fun) > 1)
+    todo = tree_ssa_lim ();
+
+  process_sm_for_coverage_counter ();
+  todo |= TODO_update_ssa;
 
   if (!in_loop_pipeline)
     loop_optimizer_finalize ();
diff --git a/gcc/tree-ssa-loop.h b/gcc/tree-ssa-loop.h
index b2f37ab..88bcad7 100644
--- a/gcc/tree-ssa-loop.h
+++ b/gcc/tree-ssa-loop.h
@@ -79,4 +79,22 @@  loop_containing_stmt (gimple *stmt)
   return bb->loop_father;
 }
 
+/* Checks whether LOOP (with exits stored in EXITS array) is suitable
+   for a store motion optimization (i.e. whether we can insert statement
+   on its exits).  */
+
+static inline bool
+loop_suitable_for_sm (struct loop *loop ATTRIBUTE_UNUSED,
+		      vec<edge> exits)
+{
+  unsigned i;
+  edge ex;
+
+  FOR_EACH_VEC_ELT (exits, i, ex)
+    if (ex->flags & (EDGE_ABNORMAL | EDGE_EH))
+      return false;
+
+  return true;
+}
+
 #endif /* GCC_TREE_SSA_LOOP_H */
diff --git a/gcc/value-prof.h b/gcc/value-prof.h
index 07e2b3b..cb3ae1d 100644
--- a/gcc/value-prof.h
+++ b/gcc/value-prof.h
@@ -98,6 +98,7 @@  bool check_ic_target (gcall *, struct cgraph_node *);
 /* In tree-profile.c.  */
 extern void gimple_init_edge_profiler (void);
 extern void gimple_gen_edge_profiler (int, edge);
+extern void expand_coverage_counter_ifns (void);
 extern void gimple_gen_interval_profiler (histogram_value, unsigned, unsigned);
 extern void gimple_gen_pow2_profiler (histogram_value, unsigned, unsigned);
 extern void gimple_gen_one_value_profiler (histogram_value, unsigned, unsigned);
-- 
2.10.1