[RFC,AARCH64] Machine reorg pass for aarch64/Falkor to handle prefetcher tag collision

Message ID CAELXzTP8rYTGBEk=nA6gD7_FFB-K2Rkd2gSKr5L8-k_DA+PYWw@mail.gmail.com
State New
Series
  • [RFC,AARCH64] Machine reorg pass for aarch64/Falkor to handle prefetcher tag collision

Commit Message

Kugan Vivekanandarajah Feb. 12, 2018, 11:58 p.m.
Implements a machine reorg pass for aarch64/Falkor to handle
prefetcher tag collisions. This is strictly not part of the loop
unroller, but for Falkor, unrolling can make the hardware prefetcher
perform badly if there are too many tag collisions, based on the
discussion in https://gcc.gnu.org/ml/gcc/2017-10/msg00178.html.

gcc/ChangeLog:

2018-02-12  Kugan Vivekanandarajah  <kuganv@linaro.org>

    * config/aarch64/aarch64.c (iv_p): New.
    (strided_load_p): Likewise.
    (make_tag): Likewise.
    (get_load_info): Likewise.
    (aarch64_reorg): Likewise.
    (TARGET_MACHINE_DEPENDENT_REORG): Implement new target hook.

Comments

Kyrill Tkachov Feb. 13, 2018, 9:47 a.m. | #1
Hi Kugan,

On 12/02/18 23:58, Kugan Vivekanandarajah wrote:
> Implements a machine reorg pass for aarch64/Falkor to handle
> prefetcher tag collisions. This is strictly not part of the loop
> unroller, but for Falkor, unrolling can make the hardware prefetcher
> perform badly if there are too many tag collisions, based on the
> discussion in https://gcc.gnu.org/ml/gcc/2017-10/msg00178.html.

Could you expand a bit more on what transformation exactly this pass does?
From my understanding, loads that use the same base register and offset
and that have the same destination register are considered part of the
same stream by the hardware prefetcher, so for example:
ldr x0, [x1, 16] (load1)
... (set x1 to something else)
ldr x0, [x1, 16] (load2)

will cause the prefetcher to think that both loads are part of the same stream,
so this pass tries to rewrite the sequence into:
ldr x0, [x1, 16]
... (set x1 to something else)
mov tmp, x1
ldr x0, [tmp, 16]

Where the tag/signature is the combination of destination x0, base x1 and offset 16.
Is this a fair description?

I've got some comments on the patch itself.

> gcc/ChangeLog:
>
> 2018-02-12  Kugan Vivekanandarajah  <kuganv@linaro.org>
>
>     * config/aarch64/aarch64.c (iv_p): New.
>     (strided_load_p): Likewise.
>     (make_tag): Likewise.
>     (get_load_info): Likewise.
>     (aarch64_reorg): Likewise.
>     (TARGET_MACHINE_DEPENDENT_REORG): Implement new target hook.

New functions need function comments describing the arguments at least.
Functions like make_tag and get_load_info can get tricky to maintain without
some documentation on what they are supposed to accept and return.

I think the pass should be enabled at certain optimisation levels, say -O2?
I don't think it would be desirable at -Os since it creates extra moves that
increase code size.

That being said, I would recommend you implement this as an aarch64-specific pass,
in a similar way to cortex-a57-fma-steering.c. That way you can register it in
aarch64-passes.def and have flexibility as to when exactly the pass gets to run
(i.e. you wouldn't be limited by when machine_reorg gets run).
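For reference, registering a target-specific pass follows the same pattern as
the FMA steering pass; the fragments below are a sketch of that pattern (the
pass and function names mirror what the revised patch later in this thread
settles on, not fixed API requirements):

```c
/* In aarch64-passes.def: pick exactly where in the pipeline the new
   pass runs, here after the first fast RTL DCE instance.  */
INSERT_PASS_AFTER (pass_fast_rtl_dce, 1, pass_tag_collision_avoidance);

/* In aarch64-protos.h: declare the constructor that the pass
   registration above refers to.  */
rtl_opt_pass *make_pass_tag_collision_avoidance (gcc::context *ctxt);
```

The gating then lives in the pass's own gate function instead of a CPU check
scattered through aarch64.c.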

Also, I suggest you don't use the "if (aarch64_tune != falkor) return;" way of
gating this pass. Do it in a similar way to the FMA steering pass, that is,
define a new flag in aarch64-tuning-flags.def and use it in the tune_flags
field of the falkor tuning struct.

Hope this helps,
Kyrill
Kugan Vivekanandarajah Feb. 13, 2018, 10:47 p.m. | #2
Hi Kyrill,

On 13 February 2018 at 20:47, Kyrill Tkachov
<kyrylo.tkachov@foss.arm.com> wrote:
> Hi Kugan,
>
> On 12/02/18 23:58, Kugan Vivekanandarajah wrote:
>> Implements a machine reorg pass for aarch64/Falkor to handle
>> prefetcher tag collisions. This is strictly not part of the loop
>> unroller, but for Falkor, unrolling can make the hardware prefetcher
>> perform badly if there are too many tag collisions, based on the
>> discussion in https://gcc.gnu.org/ml/gcc/2017-10/msg00178.html.
>
> Could you expand a bit more on what transformation exactly this pass does?

This is similar to what LLVM does in https://reviews.llvm.org/D35366.

The Falkor hardware prefetcher works well when the signatures of the
prefetches (or tags, as computed in the patch, similar to LLVM) are
different for different memory streams. If different memory streams
have the same signature, it can result in bad performance. This machine
reorg pass tries to change the signature of memory loads by replacing
the base register with a free register.

> From my understanding the loads that use the same base
> register and offset and have the same destination register
> are considered part of the same stream by the hardware prefetcher, so
> for example:
> ldr x0, [x1, 16] (load1)
> ... (set x1 to something else)
> ldr x0, [x1, 16] (load2)
>
> will cause the prefetcher to think that both loads are part of the
> same stream, so this pass tries to rewrite the sequence into:
> ldr x0, [x1, 16]
> ... (set x1 to something else)
> mov tmp, x1
> ldr x0, [tmp, 16]
>
> Where the tag/signature is the combination of destination x0, base x1
> and offset 16.
> Is this a fair description?

This is precisely what is happening.

> I've got some comments on the patch itself.
>
>> gcc/ChangeLog:
>>
>> 2018-02-12  Kugan Vivekanandarajah  <kuganv@linaro.org>
>>
>>     * config/aarch64/aarch64.c (iv_p): New.
>>     (strided_load_p): Likewise.
>>     (make_tag): Likewise.
>>     (get_load_info): Likewise.
>>     (aarch64_reorg): Likewise.
>>     (TARGET_MACHINE_DEPENDENT_REORG): Implement new target hook.
>
> New functions need function comments describing the arguments at least.
> Functions like make_tag and get_load_info can get tricky to maintain
> without some documentation on what they are supposed to accept and
> return.

I will add the comments.

> I think the pass should be enabled at certain optimisation levels, say
> -O2? I don't think it would be desirable at -Os since it creates extra
> moves that increase code size.

Ok, I will change this.

> That being said, I would recommend you implement this as an
> aarch64-specific pass, in a similar way to cortex-a57-fma-steering.c.
> That way you can register it in aarch64-passes.def and have flexibility
> as to when exactly the pass gets to run (i.e. you wouldn't be limited
> by when machine_reorg gets run).
>
> Also, I suggest you don't use the "if (aarch64_tune != falkor) return;"
> way of gating this pass. Do it in a similar way to the FMA steering
> pass, that is, define a new flag in aarch64-tuning-flags.def and use it
> in the tune_flags field of the falkor tuning struct.

Ok, I will revise the patch.


Thanks,
Kugan

> Hope this helps,
> Kyrill
Kugan Vivekanandarajah Feb. 15, 2018, 8:49 p.m. | #3
Hi,

On 14 February 2018 at 09:47, Kugan Vivekanandarajah
<kugan.vivekanandarajah@linaro.org> wrote:
> Hi Kyrill,
>
> On 13 February 2018 at 20:47, Kyrill Tkachov
> <kyrylo.tkachov@foss.arm.com> wrote:
>> Hi Kugan,
>>
>> On 12/02/18 23:58, Kugan Vivekanandarajah wrote:
>>> Implements a machine reorg pass for aarch64/Falkor to handle
>>> prefetcher tag collisions. This is strictly not part of the loop
>>> unroller, but for Falkor, unrolling can make the hardware prefetcher
>>> perform badly if there are too many tag collisions, based on the
>>> discussion in https://gcc.gnu.org/ml/gcc/2017-10/msg00178.html.
>>
>> Could you expand a bit more on what transformation exactly this pass does?
>
> This is similar to what LLVM does in https://reviews.llvm.org/D35366.
>
> The Falkor hardware prefetcher works well when the signatures of the
> prefetches (or tags, as computed in the patch, similar to LLVM) are
> different for different memory streams. If different memory streams
> have the same signature, it can result in bad performance. This machine
> reorg pass tries to change the signature of memory loads by replacing
> the base register with a free register.
>
>> From my understanding the loads that use the same base
>> register and offset and have the same destination register
>> are considered part of the same stream by the hardware prefetcher, so
>> for example:
>> ldr x0, [x1, 16] (load1)
>> ... (set x1 to something else)
>> ldr x0, [x1, 16] (load2)
>>
>> will cause the prefetcher to think that both loads are part of the
>> same stream, so this pass tries to rewrite the sequence into:
>> ldr x0, [x1, 16]
>> ... (set x1 to something else)
>> mov tmp, x1
>> ldr x0, [tmp, 16]
>>
>> Where the tag/signature is the combination of destination x0, base x1
>> and offset 16.
>> Is this a fair description?
>
> This is precisely what is happening.
>
>> I've got some comments on the patch itself.
>>
>>> gcc/ChangeLog:
>>>
>>> 2018-02-12  Kugan Vivekanandarajah  <kuganv@linaro.org>
>>>
>>>     * config/aarch64/aarch64.c (iv_p): New.
>>>     (strided_load_p): Likewise.
>>>     (make_tag): Likewise.
>>>     (get_load_info): Likewise.
>>>     (aarch64_reorg): Likewise.
>>>     (TARGET_MACHINE_DEPENDENT_REORG): Implement new target hook.
>>
>> New functions need function comments describing the arguments at least.
>> Functions like make_tag and get_load_info can get tricky to maintain
>> without some documentation on what they are supposed to accept and
>> return.
>
> I will add the comments.
>
>> I think the pass should be enabled at certain optimisation levels, say
>> -O2? I don't think it would be desirable at -Os since it creates extra
>> moves that increase code size.
>
> Ok, I will change this.
>
>> That being said, I would recommend you implement this as an
>> aarch64-specific pass, in a similar way to cortex-a57-fma-steering.c.
>> That way you can register it in aarch64-passes.def and have flexibility
>> as to when exactly the pass gets to run (i.e. you wouldn't be limited
>> by when machine_reorg gets run).
>>
>> Also, I suggest you don't use the "if (aarch64_tune != falkor) return;"
>> way of gating this pass. Do it in a similar way to the FMA steering
>> pass, that is, define a new flag in aarch64-tuning-flags.def and use it
>> in the tune_flags field of the falkor tuning struct.
>
> Ok, I will revise the patch.

Here is the revised patch.

Thanks,
Kugan

gcc/ChangeLog:

2018-02-15  Kugan Vivekanandarajah  <kuganv@linaro.org>

    * config.gcc: Add falkor-tag-collision-avoidance.o to extra_objs
    for aarch64-*-*.
    * config/aarch64/aarch64-protos.h
    (make_pass_tag_collision_avoidance): Declare.
    * config/aarch64/aarch64-passes.def: Insert tag collision
    avoidance pass.
    * config/aarch64/aarch64-tuning-flags.def
    (AARCH64_EXTRA_TUNE_AVOID_PREFETCH_TAG_COLLISION): Define.
    * config/aarch64/aarch64.c (qdf24xx_tunings): Add
    AARCH64_EXTRA_TUNE_AVOID_PREFETCH_TAG_COLLISION.
    * config/aarch64/falkor-tag-collision-avoidance.c: New file.
    * config/aarch64/t-aarch64: Add falkor-tag-collision-avoidance.o.


> Thanks,
> Kugan
>
>> Hope this helps,
>> Kyrill
diff --git a/gcc/config.gcc b/gcc/config.gcc
index eca156a..c3f3e1a 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -303,7 +303,7 @@ aarch64*-*-*)
 	extra_headers="arm_fp16.h arm_neon.h arm_acle.h"
 	c_target_objs="aarch64-c.o"
 	cxx_target_objs="aarch64-c.o"
-	extra_objs="aarch64-builtins.o aarch-common.o cortex-a57-fma-steering.o"
+	extra_objs="aarch64-builtins.o aarch-common.o cortex-a57-fma-steering.o falkor-tag-collision-avoidance.o"
 	target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.c"
 	target_has_targetm_common=yes
 	;;
diff --git a/gcc/config/aarch64/aarch64-passes.def b/gcc/config/aarch64/aarch64-passes.def
index 87747b4..d4b6a43 100644
--- a/gcc/config/aarch64/aarch64-passes.def
+++ b/gcc/config/aarch64/aarch64-passes.def
@@ -19,3 +19,4 @@
    <http://www.gnu.org/licenses/>.  */
 
 INSERT_PASS_AFTER (pass_regrename, 1, pass_fma_steering);
+INSERT_PASS_AFTER (pass_fast_rtl_dce, 1, pass_tag_collision_avoidance);
diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 2d705d2..d8f6964 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -544,6 +544,7 @@ std::string aarch64_get_extension_string_for_isa_flags (unsigned long,
 							unsigned long);
 
 rtl_opt_pass *make_pass_fma_steering (gcc::context *ctxt);
+rtl_opt_pass *make_pass_tag_collision_avoidance (gcc::context *ctxt);
 
 poly_uint64 aarch64_regmode_natural_size (machine_mode);
 
diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index ea9ead2..c0dd178 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -41,4 +41,6 @@ AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw", SLOW_UNALIGNED_LDPW)
    are not considered cheap.  */
 AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)
 
+AARCH64_EXTRA_TUNING_OPTION ("avoid_prefetch_tag_collision", AVOID_PREFETCH_TAG_COLLISION)
+
 #undef AARCH64_EXTRA_TUNING_OPTION
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 2e70f3a..b075325 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -875,7 +875,7 @@ static const struct tune_params qdf24xx_tunings =
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE),		/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_AVOID_PREFETCH_TAG_COLLISION), /* tune_flags.  */
   &qdf24xx_prefetch_tune
 };
 
diff --git a/gcc/config/aarch64/falkor-tag-collision-avoidance.c b/gcc/config/aarch64/falkor-tag-collision-avoidance.c
index e69de29..1fe320f 100644
--- a/gcc/config/aarch64/falkor-tag-collision-avoidance.c
+++ b/gcc/config/aarch64/falkor-tag-collision-avoidance.c
@@ -0,0 +1,468 @@
+/* Tag Collision Avoidance pass for Falkor.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   GCC is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#define IN_TARGET_CODE 1
+
+#include "config.h"
+#define INCLUDE_LIST
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "target.h"
+#include "rtl.h"
+#include "tree.h"
+#include "tree-pass.h"
+#include "aarch64-protos.h"
+#include "hash-map.h"
+#include "cfgloop.h"
+#include "cfgrtl.h"
+#include "rtl-iter.h"
+#include "df.h"
+#include "memmodel.h"
+#include "optabs.h"
+#include "regs.h"
+#include "recog.h"
+
+/*
+   The Falkor hardware prefetcher works well when the signatures (or
+   tags, as computed in this pass) of different memory streams differ.
+   Identical signatures can result in bad performance.  This pass tries
+   to change the signature of memory loads by replacing the base
+   register with a free register.
+
+   The signature (TAG) is based on the destination, base and offset.
+   Loads with the same signature are considered part of the same stream
+   by the hardware prefetcher, for example:
+   ldr x0, [x1, 16] (load stream 1)
+   x1 is reused for a different stream
+   ldr x0, [x1, 16] (load stream 2)
+
+   will cause the prefetcher to think that both loads are part of the
+   same stream, so this pass tries to rewrite the sequence into:
+   ldr x0, [x1, 16]
+   mov tmp, x1
+   ldr x0, [tmp, 16]
+   such that the signatures are different.  */
+
+
+/* Return true if REG is an IV in LOOP, false otherwise.
+   This is an approximate check and does not rely on the functionality
+   provided by the likes of biv (), as the loop form might not be
+   suitable for such analysis.  */
+
+static bool
+iv_p (rtx reg, struct loop *loop)
+{
+  df_ref adef;
+  unsigned regno = REGNO (reg);
+  bool def_in_loop = false;
+  bool def_out_loop = false;
+
+  if (GET_MODE_CLASS (GET_MODE (reg)) != MODE_INT)
+    return false;
+
+  for (adef = DF_REG_DEF_CHAIN (regno); adef; adef = DF_REF_NEXT_REG (adef))
+    {
+      if (!DF_REF_INSN_INFO (adef)
+	  || !NONDEBUG_INSN_P (DF_REF_INSN (adef)))
+	continue;
+
+      basic_block bb = DF_REF_BB (adef);
+      if (dominated_by_p (CDI_DOMINATORS, bb, loop->header)
+	  && bb->loop_father == loop)
+	{
+	  rtx_insn *insn = DF_REF_INSN (adef);
+	  recog_memoized (insn);
+	  rtx pat = PATTERN (insn);
+	  if (GET_CODE (pat) != SET)
+	    continue;
+	  rtx x = SET_SRC (pat);
+	  if (GET_CODE (x) == ZERO_EXTRACT
+	      || GET_CODE (x) == ZERO_EXTEND
+	      || GET_CODE (x) == SIGN_EXTEND)
+	    x = XEXP (x, 0);
+	  if (MEM_P (x))
+	    continue;
+	  if (GET_CODE (x) == POST_INC
+	      || GET_CODE (x) == POST_DEC
+	      || GET_CODE (x) == PRE_INC
+	      || GET_CODE (x) == PRE_DEC)
+	    def_in_loop = true;
+	  else if (BINARY_P (x))
+	    def_in_loop = true;
+	}
+      if (dominated_by_p (CDI_DOMINATORS, loop->header, bb))
+	def_out_loop = true;
+      if (def_in_loop && def_out_loop)
+	return true;
+    }
+  return false;
+}
+
+/* Return true if X is a strided load in the LOOP, false otherwise.
+   If it is a strided load, set the BASE and OFFSET.  Also, if this is
+   a pre/post increment load, set PRE_POST to true.  */
+
+static bool
+strided_load_p (rtx x,
+		struct loop *loop,
+		bool *pre_post,
+		rtx *base,
+		rtx *offset)
+{
+  /* If the loaded value is extended, get the source of the extension.  */
+  if (GET_CODE (x) == ZERO_EXTRACT
+      || GET_CODE (x) == ZERO_EXTEND
+      || GET_CODE (x) == SIGN_EXTEND)
+    x = XEXP (x, 0);
+
+  /* If it is not a MEM, it is not loaded from memory.  */
+  if (!MEM_P (x))
+    return false;
+
+  /* Get the address expression of the MEM.  */
+  x = XEXP (x, 0);
+
+  /* If it is a post/pre increment, get the src.  */
+  if (GET_CODE (x) == POST_INC
+      || GET_CODE (x) == POST_DEC
+      || GET_CODE (x) == PRE_INC
+      || GET_CODE (x) == PRE_DEC)
+    {
+      x = XEXP (x, 0);
+      *pre_post = true;
+    }
+
+  /* Get the base and offset depending on the type of the address.  */
+  if (REG_P (x)
+      || UNARY_P (x))
+    {
+      if (!REG_P (x))
+	x = XEXP (x, 0);
+      if (REG_P (x)
+	  && iv_p (x, loop))
+	{
+	  *base = x;
+	  return true;
+	}
+    }
+  else if (BINARY_P (x))
+    {
+      rtx reg1, reg2;
+      reg1 = XEXP (x, 0);
+
+      if (REG_P (reg1)
+	  && REGNO (reg1) == SP_REGNUM)
+	return false;
+      reg2 = XEXP (x, 1);
+
+      if (REG_P (reg1)
+	  && iv_p (reg1, loop))
+	{
+
+	  *base = reg1;
+	  *offset = reg2;
+	  return true;
+	}
+
+      if (REG_P (reg1)
+	  && REG_P (reg2)
+	  && iv_p (reg2, loop))
+	{
+	  *base = reg1;
+	  *offset = reg2;
+	  return true;
+	}
+    }
+  return false;
+}
+
+/* Compute the TAG (or signature) based on BASE, DEST and
+   OFFSET of the load.  */
+
+static unsigned
+make_tag (unsigned dest, unsigned base, unsigned offset)
+{
+  return (dest & 0xf)
+    | ((base & 0xf) << 4)
+    | ((offset & 0x3f) << 8);
+}
+
+
+/* Return true if INSN is a strided load in LOOP.
+   If it is a strided load, set the DEST, BASE and OFFSET.
+   Also, if this is a pre/post increment load, set PRE_POST
+   to true.  */
+
+static bool
+get_load_info (rtx_insn *insn,
+	       struct loop *loop,
+	       bool *pre_post,
+	       rtx *base,
+	       rtx *dest,
+	       rtx *offset)
+{
+  subrtx_var_iterator::array_type array;
+  if (!INSN_P (insn) || recog_memoized (insn) < 0)
+    return false;
+  rtx pat = PATTERN (insn);
+  switch (GET_CODE (pat))
+    {
+    case PARALLEL:
+	{
+	  for (int j = 0; j < XVECLEN (pat, 0); ++j)
+	    {
+	      rtx ex = XVECEXP (pat, 0, j);
+	      FOR_EACH_SUBRTX_VAR (iter, array, ex, NONCONST)
+		{
+		  const_rtx x = *iter;
+		  if (GET_CODE (x) == SET
+		      && strided_load_p (SET_SRC (x), loop, pre_post,
+					 base, offset))
+		    {
+		      *dest = SET_DEST (x);
+		      return true;
+		    }
+		}
+	    }
+	}
+      break;
+
+    case SET:
+      FOR_EACH_SUBRTX_VAR (iter, array, SET_SRC (pat), NONCONST)
+	{
+	  rtx x = *iter;
+	  if (strided_load_p (x, loop, pre_post,
+			      base, offset))
+	    {
+	      *dest = SET_DEST (pat);
+	      return true;
+	    }
+	}
+
+    default:
+      break;
+    }
+  return false;
+}
+
+/* Tag collision avoidance pass for Falkor.  */
+
+void
+execute_tag_collision_avoidance ()
+{
+  basic_block *body, bb;
+  struct loop *loop;
+  rtx_insn *insn;
+
+  compute_bb_for_insn ();
+  /* Compute live regs.  */
+  df_compute_regs_ever_live (true);
+  df_analyze ();
+
+  /* Find the loops.  */
+  loop_optimizer_init (AVOID_CFG_MODIFICATIONS);
+  calculate_dominance_info (CDI_DOMINATORS);
+  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
+    {
+      hash_map <rtx, auto_vec<rtx_insn *> > tag_map (512);
+      body = get_loop_body (loop);
+      auto_vec <rtx> tags;
+
+      /* Record all the memory tags.  */
+      for (unsigned i = 0; i < loop->num_nodes; i++)
+	{
+	  bb = body[i];
+	  FOR_BB_INSNS (bb, insn)
+	    {
+	      unsigned tag;
+	      rtx base = NULL_RTX;
+	      rtx dest = NULL_RTX;
+	      rtx offset = NULL_RTX;
+	      bool pre_or_post = false;
+
+	      if (!INSN_P (insn)
+		  || DEBUG_INSN_P (insn))
+		continue;
+
+	      if (get_load_info (insn, loop, &pre_or_post,
+				 &base, &dest, &offset)
+		  && REG_P (dest))
+		{
+		  int int_offset = 0;
+		  if (offset && REG_P (offset))
+		    int_offset = (1 << 5) | REGNO (offset);
+		  else if (offset && CONST_INT_P (offset))
+		    {
+		      int_offset = INTVAL (offset);
+		      int_offset /= GET_MODE_SIZE (GET_MODE (dest)).to_constant ();
+		      if (!pre_or_post)
+			int_offset >>= 2;
+		    }
+		  tag = make_tag (REGNO (dest), REGNO (base), int_offset);
+		  rtx t = GEN_INT (tag);
+		  if (!tag_map.get (t))
+		    tags.safe_push (t);
+		  tag_map.get_or_insert (t).safe_push (insn);
+		}
+	    }
+	}
+
+      for (unsigned i = 0; i < tags.length (); ++i)
+	{
+	  rtx t = tags[i];
+	  auto_vec<rtx_insn *> *v = tag_map.get (t);
+
+	  for (int j = v->length () - 1; j > 0; --j)
+	    {
+	      /* Get the insns whose tags collide.  */
+	      rtx_insn *insn = (*v)[j];
+	      rtx pat;
+	      bool changed = false;
+	      int int_offset = 0;
+	      rtx base = NULL_RTX;
+	      rtx dest = NULL_RTX;
+	      rtx offset = NULL_RTX;
+	      bool pre_or_post = false;
+
+	      if (!get_load_info (insn, loop, &pre_or_post,
+				  &base, &dest, &offset))
+		gcc_assert (false);
+
+	      if (offset && REG_P (offset))
+		int_offset = (1 << 5) | REGNO (offset);
+	      else if (offset && CONST_INT_P (offset))
+		{
+		  int_offset = INTVAL (offset);
+		  int_offset /= GET_MODE_SIZE (GET_MODE (dest)).to_constant ();
+		  if (!pre_or_post)
+		    int_offset >>= 2;
+		}
+
+	      /* Go over temporary registers and find a free register, if
+		 available.  */
+	      for (int k = R9_REGNUM; !changed && (k <= R15_REGNUM); k++)
+		if (!df_hard_reg_used_p (k))
+		  {
+		    unsigned tag;
+		    rtx t;
+
+		    tag = make_tag (REGNO (dest), k, int_offset);
+		    t = GEN_INT (tag);
+		    /* Check to see if the new tag also collides with an
+		       existing load.  */
+		    if (tag_map.get (t))
+		      continue;
+
+		    machine_mode mode = GET_MODE (base);
+		    rtx new_reg = gen_rtx_REG (mode, k);
+		    t = GEN_INT (make_tag (REGNO (dest), REGNO (new_reg),
+					   int_offset));
+		    vec <rtx_insn *> *v2 = tag_map.get (t);
+		    if (v2 && (v2->length () > 0))
+		      continue;
+
+		    /* Change the insn: dest = load (base, offset)
+		       into tmp = base; dest = load (tmp, offset).  */
+		    extract_insn (insn);
+		    for (int l = 0;
+			 (!changed) && (l < recog_data.n_operands); l++)
+		      {
+			subrtx_ptr_iterator::array_type array;
+			rtx *op = recog_data.operand_loc[l];
+
+			if (recog_data.operand_type[l] == OP_OUT)
+			  continue;
+
+			FOR_EACH_SUBRTX_PTR (iter, array, op, NONCONST)
+			  {
+			    rtx *loc = *iter;
+			    rtx x = *loc;
+
+			    if (!changed && (base == x))
+			      {
+				pat = gen_rtx_SET (new_reg, base);
+				if (validate_change (insn, loc, new_reg, false))
+				  {
+				    emit_insn_before (pat, insn);
+				    if (pre_or_post)
+				      {
+					rtx pat2 = gen_rtx_SET (base, new_reg);
+					emit_insn_after (pat2, insn);
+				      }
+				  }
+				v->pop ();
+				tag_map.get_or_insert (t).safe_push (insn);
+				changed = true;
+				break;
+			      }
+			  }
+		      }
+		  }
+	    }
+	}
+    }
+
+  loop_optimizer_finalize ();
+}
+
+
+const pass_data pass_data_tag_collision_avoidance =
+{
+  RTL_PASS, /* type */
+  "tag_collision_avoidance", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_NONE, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_tag_collision_avoidance : public rtl_opt_pass
+{
+public:
+  pass_tag_collision_avoidance (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_tag_collision_avoidance, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+    {
+      return (aarch64_tune_params.extra_tuning_flags
+	      & AARCH64_EXTRA_TUNE_AVOID_PREFETCH_TAG_COLLISION)
+	      && optimize >= 2;
+    }
+
+  virtual unsigned int execute (function *)
+    {
+      execute_tag_collision_avoidance ();
+      return 0;
+    }
+
+}; // class pass_tag_collision_avoidance
+
+/* Create a new pass instance.  */
+
+rtl_opt_pass *
+make_pass_tag_collision_avoidance (gcc::context *ctxt)
+{
+  return new pass_tag_collision_avoidance (ctxt);
+}
diff --git a/gcc/config/aarch64/t-aarch64 b/gcc/config/aarch64/t-aarch64
index 0be1f0d..f185b40 100644
--- a/gcc/config/aarch64/t-aarch64
+++ b/gcc/config/aarch64/t-aarch64
@@ -67,6 +67,15 @@ cortex-a57-fma-steering.o: $(srcdir)/config/aarch64/cortex-a57-fma-steering.c \
 	$(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
 		$(srcdir)/config/aarch64/cortex-a57-fma-steering.c
 
+falkor-tag-collision-avoidance.o: $(srcdir)/config/aarch64/falkor-tag-collision-avoidance.c \
+    $(CONFIG_H) $(SYSTEM_H) $(TM_H) $(REGS_H) insn-config.h $(RTL_BASE_H) \
+    dominance.h cfg.h cfganal.h $(BASIC_BLOCK_H) $(INSN_ATTR_H) $(RECOG_H) \
+    output.h hash-map.h $(DF_H) $(OBSTACK_H) $(TARGET_H) $(RTL_H) \
+    $(CONTEXT_H) $(TREE_PASS_H) regrename.h \
+    $(srcdir)/config/aarch64/aarch64-protos.h
+	$(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
+		$(srcdir)/config/aarch64/falkor-tag-collision-avoidance.c
+
 comma=,
 MULTILIB_OPTIONS    = $(subst $(comma),/, $(patsubst %, mabi=%, $(subst $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG))))
 MULTILIB_DIRNAMES   = $(subst $(comma), ,$(TM_MULTILIB_CONFIG))

Patch

From 0cd4f5acb2117c739ba81bb4b8b71af499107812 Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah <kugan.vivekanandarajah@linaro.org>
Date: Mon, 12 Feb 2018 10:44:53 +1100
Subject: [PATCH 4/4] reorg-for-tag-collision

Change-Id: Ic6e42d54268c9112ec1c25de577ca92c1808eeff
---
 gcc/config/aarch64/aarch64.c | 353 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 353 insertions(+)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 1ce2a0c..48e7c54 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -71,6 +71,7 @@ 
 #include "selftest.h"
 #include "selftest-rtl.h"
 #include "rtx-vector-builder.h"
+#include "cfgrtl.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
@@ -17203,6 +17204,355 @@  aarch64_select_early_remat_modes (sbitmap modes)
     }
 }
 
+static bool
+iv_p (rtx reg, struct loop *loop)
+{
+  df_ref adef;
+  unsigned regno = REGNO (reg);
+  bool def_in_loop = false;
+  bool def_out_loop = false;
+
+  if (GET_MODE_CLASS (GET_MODE (reg)) != MODE_INT)
+    return false;
+
+  for (adef = DF_REG_DEF_CHAIN (regno); adef; adef = DF_REF_NEXT_REG (adef))
+    {
+      if (!DF_REF_INSN_INFO (adef)
+	  || !NONDEBUG_INSN_P (DF_REF_INSN (adef)))
+	continue;
+
+      basic_block bb = DF_REF_BB (adef);
+      if (dominated_by_p (CDI_DOMINATORS, bb, loop->header)
+	  && bb->loop_father == loop)
+	{
+	  rtx_insn *insn = DF_REF_INSN (adef);
+	  recog_memoized (insn);
+	  rtx pat = PATTERN (insn);
+	  if (GET_CODE (pat) != SET)
+	    continue;
+	  rtx x = SET_SRC (pat);
+	  if (GET_CODE (x) == ZERO_EXTRACT
+	      || GET_CODE (x) == ZERO_EXTEND
+	      || GET_CODE (x) == SIGN_EXTEND)
+	    x = XEXP (x, 0);
+	  if (MEM_P (x))
+	    continue;
+	  if (GET_CODE (x) == POST_INC
+	      || GET_CODE (x) == POST_DEC
+	      || GET_CODE (x) == PRE_INC
+	      || GET_CODE (x) == PRE_DEC)
+	    def_in_loop = true;
+	  else if (BINARY_P (x))
+	    def_in_loop = true;
+	}
+      if (dominated_by_p (CDI_DOMINATORS, loop->header, bb))
+	def_out_loop = true;
+      if (def_in_loop && def_out_loop)
+	return true;
+    }
+  return false;
+}
+
+/* Return true if X is a strided load.  */
+
+static bool
+strided_load_p (rtx x,
+		struct loop *loop,
+		bool *pre_post,
+		rtx *base,
+		rtx *offset)
+{
+  /* If the loaded value is extended, get the source of the extension.  */
+  if (GET_CODE (x) == ZERO_EXTRACT
+      || GET_CODE (x) == ZERO_EXTEND
+      || GET_CODE (x) == SIGN_EXTEND)
+    x = XEXP (x, 0);
+
+  /* If it is not a MEM, it is not loaded from memory.  */
+  if (!MEM_P (x))
+    return false;
+
+  /* Get the address expression of the MEM.  */
+  x = XEXP (x, 0);
+
+  /* If it is a post/pre increment, get the src.  */
+  if (GET_CODE (x) == POST_INC
+      || GET_CODE (x) == POST_DEC
+      || GET_CODE (x) == PRE_INC
+      || GET_CODE (x) == PRE_DEC)
+    {
+      x = XEXP (x, 0);
+      *pre_post = true;
+    }
+
+  /* Get the base and offset depending on the type of the address.  */
+  if (REG_P (x)
+      || UNARY_P (x))
+    {
+      if (!REG_P (x))
+	x = XEXP (x, 0);
+      if (REG_P (x)
+	  && iv_p (x, loop))
+	{
+	  *base = x;
+	  return true;
+	}
+    }
+  else if (BINARY_P (x))
+    {
+      rtx reg1, reg2;
+      reg1 = XEXP (x, 0);
+
+      if (REG_P (reg1)
+	  && REGNO (reg1) == SP_REGNUM)
+	return false;
+      reg2 = XEXP (x, 1);
+
+      if (REG_P (reg1)
+	  && iv_p (reg1, loop))
+	{
+
+	  *base = reg1;
+	  *offset = reg2;
+	  return true;
+	}
+
+      if (REG_P (reg1)
+	  && REG_P (reg2)
+	  && iv_p (reg2, loop))
+	{
+	  *base = reg1;
+	  *offset = reg2;
+	  return true;
+	}
+    }
+  return false;
+}
+
+static unsigned
+make_tag (unsigned dest, unsigned base, unsigned offset)
+{
+  return (dest & 0xf)
+    | ((base & 0xf) << 4)
+    | ((offset & 0x3f) << 8);
+}
+
+
+/* Return true if INSN is a strided load.  */
+
+static bool
+get_load_info (rtx_insn *insn,
+	       struct loop *loop,
+	       bool *pre_post,
+	       rtx *base,
+	       rtx *dest,
+	       rtx *offset)
+{
+  subrtx_var_iterator::array_type array;
+  if (!INSN_P (insn) || recog_memoized (insn) < 0)
+    return false;
+  rtx pat = PATTERN (insn);
+  switch (GET_CODE (pat))
+    {
+    case PARALLEL:
+	{
+	  for (int j = 0; j < XVECLEN (pat, 0); ++j)
+	    {
+	      rtx ex = XVECEXP (pat, 0, j);
+	      FOR_EACH_SUBRTX_VAR (iter, array, ex, NONCONST)
+		{
+		  const_rtx x = *iter;
+		  if (GET_CODE (x) == SET
+		      && strided_load_p (SET_SRC (x), loop, pre_post,
+					 base, offset))
+		    {
+		      *dest = SET_DEST (x);
+		      return true;
+		    }
+		}
+	    }
+	}
+      break;
+
+    case SET:
+      FOR_EACH_SUBRTX_VAR (iter, array, SET_SRC (pat), NONCONST)
+	{
+	  rtx x = *iter;
+	  if (strided_load_p (x, loop, pre_post,
+			      base, offset))
+	    {
+	      *dest = SET_DEST (pat);
+	      return true;
+	    }
+	}
+
+    default:
+      break;
+    }
+  return false;
+}
+
+/* Implement TARGET_MACHINE_DEPENDENT_REORG.  For Falkor, find strided
+   loads in each loop whose hardware prefetcher tags collide and rewrite
+   one load of each colliding pair to use a free temporary register as
+   its base, so that the loads are tracked as separate prefetch
+   streams.  */
+
+static void
+aarch64_reorg (void)
+{
+  basic_block *body, bb;
+  struct loop *loop;
+  rtx_insn *insn;
+
+  if (aarch64_tune != falkor)
+    return;
+
+  compute_bb_for_insn ();
+  /* Compute live regs.  */
+  df_compute_regs_ever_live (true);
+  df_analyze ();
+
+  /* Find the loops.  */
+  loop_optimizer_init (AVOID_CFG_MODIFICATIONS);
+  calculate_dominance_info (CDI_DOMINATORS);
+  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
+    {
+      hash_map <rtx, auto_vec<rtx_insn *> > tag_map (512);
+      body = get_loop_body (loop);
+      auto_vec <rtx> tags;
+
+      /* Record all the memory tags.  */
+      for (unsigned i = 0; i < loop->num_nodes; i++)
+	{
+	  bb = body[i];
+	  FOR_BB_INSNS (bb, insn)
+	    {
+	      unsigned tag;
+	      rtx base = NULL_RTX;
+	      rtx dest = NULL_RTX;
+	      rtx offset = NULL_RTX;
+	      bool pre_or_post = false;
+
+	      if (!INSN_P (insn)
+		  || DEBUG_INSN_P (insn))
+		continue;
+
+	      if (get_load_info (insn, loop, &pre_or_post,
+				 &base, &dest, &offset))
+		{
+		  int int_offset = 0;
+		  if (offset && REG_P (offset))
+		    int_offset = (1 << 5) | REGNO (offset);
+		  else if (offset && CONST_INT_P (offset))
+		    {
+		      int_offset = INTVAL (offset);
+		      int_offset /= GET_MODE_SIZE (GET_MODE (dest)).to_constant ();
+		      if (!pre_or_post)
+			int_offset >>= 2;
+		    }
+		  tag = make_tag (REGNO (dest), REGNO (base), int_offset);
+		  rtx t = GEN_INT (tag);
+		  if (!tag_map.get (t))
+		    tags.safe_push (t);
+		  tag_map.get_or_insert (t).safe_push (insn);
+		}
+	    }
+	}
+
+      for (unsigned i = 0; i < tags.length (); ++i)
+	{
+	  rtx t = tags[i];
+	  auto_vec<rtx_insn *> *v = tag_map.get (t);
+
+	  for (int j = v->length () - 1; j > 0; --j)
+	    {
+	      /* Process the insns whose tags collide.  */
+	      rtx_insn *insn = (*v)[j];
+	      rtx pat;
+	      bool changed = false;
+	      int int_offset = 0;
+	      rtx base = NULL_RTX;
+	      rtx dest = NULL_RTX;
+	      rtx offset = NULL_RTX;
+	      bool pre_or_post = false;
+
+	      if (!get_load_info (insn, loop, &pre_or_post,
+				  &base, &dest, &offset))
+		gcc_unreachable ();
+
+	      if (offset && REG_P (offset))
+		int_offset = (1 << 5) | REGNO (offset);
+	      else if (offset && CONST_INT_P (offset))
+		{
+		  int_offset = INTVAL (offset);
+		  int_offset /= GET_MODE_SIZE (GET_MODE (dest)).to_constant ();
+		  if (!pre_or_post)
+		    int_offset >>= 2;
+		}
+
+	      /* Go over temporary registers and find a free register, if
+		 available.  */
+	      for (int k = R9_REGNUM; !changed && (k <= R15_REGNUM); k++)
+		if (!df_hard_reg_used_p (k))
+		  {
+		    /* Check to see if the new tag also collides with an
+		       existing load.  */
+		    unsigned tag = make_tag (REGNO (dest), k, int_offset);
+		    rtx t = GEN_INT (tag);
+		    if (tag_map.get (t))
+		      continue;
+
+		    machine_mode mode = GET_MODE (base);
+		    rtx new_reg = gen_rtx_REG (mode, k);
+
+		    /* Change the insn: dest = load (base, offset)
+		       into tmp = base; dest = load (tmp, offset).  */
+		    extract_insn (insn);
+		    for (int l = 0;
+			 (!changed) && (l < recog_data.n_operands); l++)
+		      {
+			subrtx_ptr_iterator::array_type array;
+			rtx *op = recog_data.operand_loc[l];
+
+			if (recog_data.operand_type[l] == OP_OUT)
+			  continue;
+
+			FOR_EACH_SUBRTX_PTR (iter, array, op, NONCONST)
+			  {
+			    rtx *loc = *iter;
+			    rtx x = *loc;
+
+			    if (!changed && (base == x))
+			      {
+				pat = gen_rtx_SET (new_reg, base);
+				if (validate_change (insn, loc, new_reg, false))
+				  {
+				    emit_insn_before (pat, insn);
+				    if (pre_or_post)
+				      {
+					rtx pat2 = gen_rtx_SET (base, new_reg);
+					emit_insn_after (pat2, insn);
+				      }
+				  }
+				v->pop ();
+				tag_map.get_or_insert (t).safe_push (insn);
+				changed = true;
+				break;
+			      }
+			  }
+		      }
+		  }
+	    }
+	}
+    }
+
+  loop_optimizer_finalize ();
+  df_finish_pass (true);
+}
+
 /* Target-specific selftests.  */
 
 #if CHECKING_P
@@ -17675,6 +18025,9 @@  aarch64_libgcc_floating_mode_supported_p
 #undef TARGET_HW_MAX_MEM_READ_STREAMS
 #define TARGET_HW_MAX_MEM_READ_STREAMS aarch64_hw_max_mem_read_streams
 
+#undef TARGET_MACHINE_DEPENDENT_REORG
+#define TARGET_MACHINE_DEPENDENT_REORG aarch64_reorg
+
 #if CHECKING_P
 #undef TARGET_RUN_TARGET_SELFTESTS
 #define TARGET_RUN_TARGET_SELFTESTS selftest::aarch64_run_selftests
-- 
2.7.4